Long future frame prediction using optical flow‐informed deep neural networks for enhancement of robotic teleoperation in high latency environments

High latency in teleoperation has a significant negative impact on operator performance. While deep learning has revolutionized many domains recently, it has not previously been applied to teleoperation enhancement. We propose a novel approach to predict video frames deep into the future using neural networks informed by synthetically generated optical flow information. This can be employed in teleoperated robotic systems that rely on video feeds for operator situational awareness. We have used the image‐to‐image translation technique as a basis for the prediction of future frames. The Pix2Pix conditional generative adversarial network (cGAN) has been selected as a base network. Optical flow components reflecting real‐time control inputs are added to the standard RGB channels of the input image. We have experimented with three data sets of 20,000 input images each that were generated using our custom‐designed teleoperation simulator with a 500‐ms delay added between the input and target frames. Structural Similarity Index Measures (SSIMs) of 0.60 and Multi‐SSIMs of 0.68 were achieved when training the cGAN with three‐channel RGB image data. With the five‐channel input data (incorporating optical flow) these values improved to 0.67 and 0.74, respectively. Applying Fleiss' κ gave a score of 0.40 for three‐channel RGB data, and 0.55 for five‐channel optical flow‐added data. We are confident the predicted synthetic frames are of sufficient quality and reliability to be presented to teleoperators as a video feed that will enhance teleoperation. To the best of our knowledge, we are the first to attempt to reduce the impacts of latency through future frame prediction using deep neural networks.


| INTRODUCTION
As the effective utilization of fully autonomous mobile robots in industrial and domestic arenas still requires significant research and development (Appelqvist et al., 2007), teleoperation will play a major role for years to come. Even for fully autonomous robotic systems, monitoring and intervention will still be required by human supervisors in some situations. We are currently witnessing an increased usage of teleoperated robots, more specifically ground robotic vehicles, in areas such as search and rescue, environmental observation, surveillance, space exploration, mining, and transportation (Moniruzzaman et al., 2021). Therefore, enhancing teleoperation by addressing the underlying issues that make teleoperation difficult for operators is one of the major research areas in the robotic research community.
Ground vehicle teleoperation can be negatively impacted by low-level operator perception of the environment, poor human-robot interfaces, the overwhelming amount of information from multimodal teleoperation systems, underlying control system issues, and, finally, communication channel difficulties such as latency or intermittency (Moniruzzaman et al., 2021). For long-distance teleoperation, the most significant drawbacks are induced by communication latency.
Lag or latency as low as 10-20 ms can be noticed by teleoperators, and more than 225 ms has been shown to increase operator action time by 64% and error rate by 214% (MacKenzie & Ware, 1993). Our recent study (Moniruzzaman et al., 2022) demonstrates that ground vehicle teleoperation at speeds above 10 kph becomes almost impossible if the latency exceeds 1 s. When teleoperating ground vehicles, latencies as low as 170 ms can significantly degrade vehicle control, primarily due to the effects of oversteering and subsequent delayed compensation (Frank et al., 1988).
Considering the impacts of latency on teleoperation performance, a significant research effort has been devoted to the design of predictive interface-based teleoperation systems (Deng & Jagersand, 2003; Dybvik et al., 2020; Ha et al., 2018; Matheson et al., 2013; Ricks et al., 2004; C. Wang et al., 2016; Wilde et al., 2020; Witus et al., 2011; Y. Xiong et al., 2006). These interfaces have attempted to compensate for latency through first-order prediction, mostly offering predicted state variables for the period of delay (Artstein, 1982; Manitius & Olbrot, 1979). However, recent improvements in digital image collection and processing (Johri et al., 2021; Little et al., 2020), high-performance computation capabilities (Sanzharov et al., 2020), and machine learning and deep learning techniques (Al-Garadi et al., 2020; Khan et al., 2020) have opened up doors for further cutting-edge innovation. We believe that applying deep neural networks to develop and implement a predictive feedback-based interface will compensate for the adverse impacts of high latencies and enhance long-distance teleoperation.
For most of the ground vehicle teleoperation systems, a constant two-dimensional (2D) video feed has become one of the most popular and preferred modes of providing perception about the remote environment to an operator. Therefore, predicted future video frames have great potential to aid teleoperators by compensating for latency. However, the prediction of future video frames is a new and developing field of research. The task of predicting future video frames is exceptionally challenging due to the level of uncertainty and change in dynamic motions of objects in a scene.
In the video feed from a teleoperated ground vehicle with reasonable ground speed, the uncertainty and dynamicity become even greater.
To compensate for latency during long-distance ground vehicle teleoperation, long future prediction is a must. While deep learning-based future frame generation techniques could be utilized, a huge amount of data is required to train such models to achieve robust outcomes.
Therefore, significant research into long future prediction and integration of the prediction model to a teleoperation predictive interface is necessary.
In this paper, we describe a system that has the capability to address the data shortage issue for AI-based teleoperation enhancement research with low-cost equipment and easily implementable and safe techniques. We have developed a teleoperation simulation model that is capable of simulating teleoperation with controllable latency without the need to perform teleoperation tasks using a real robotic vehicle. This model is capable of creating and saving video data sets of ground vehicle teleoperation along with the control input time series for AI and deep learning-based model training. The model developed for and used in this research is an extension of the model discussed in our previous work (Moniruzzaman et al., 2022). In a novel approach to long future prediction through a generative adversarial network (GAN)-based deep neural network, we have used image-to-image translation techniques along with optical flow as supporting information for the network training and inference.
Optical flow generation is used to reflect the control input time-series signals. In our model, this is generated by integrating an Unreal Engine (UE) driving simulation into our previously proposed teleoperation simulator (Moniruzzaman et al., 2022). For long future prediction, we have used the Pix2Pix deep network discussed in Isola et al. (2017). To evaluate the performance of the model and enhancement approach, and to fine-tune it, we have used the pixel matching-based image quality metric, the peak signal-to-noise ratio (PSNR; Johnson, 2006), and the structural difference matching-based metric, the Structural Similarity Index Measure (SSIM; Z. Wang et al., 2004). To measure the reliability of the conditional generative adversarial network (cGAN) model, based on interrater agreement between the evaluation metrics, and the complementarity of optical flow, we have analyzed means, medians, standard deviations, skewness, kurtosis, p values, and Fleiss' κ values.
To the best of our knowledge, no other approach has proposed the integration of image-to-image translation and optical flow information together to construct long future predicted frames to design a predictive video interface for ground vehicle robotic teleoperation.
The main contributions of the paper are summarized below.
1. Presenting a universal cGAN image-to-image translation method for deep neural network-based future frame prediction as a technique to enhance teleoperability for long-distance and high latency teleoperation.
2. Using extracted optical flow information (both from naturally delayed and synthetically generated frames) as an additional aid to the cGAN network to generate higher-quality synthetic future frames.
3. Designing a UE-based cosimulation Simulink model and creating three new virtual ground vehicle teleoperation data sets for training, and testing purposes. This model can be used for artificial teleoperation data set generation.
4. An in-depth discussion of the outcomes of the study and the implications for future teleoperation enhancement research.
The rest of the paper is structured as follows. Section 2 offers background on current state-of-the-art predictive feedback-based ground vehicle teleoperation techniques, optical flow generation methods, and future frame prediction techniques. Section 3 discusses the research methodology, with detailed information about the UE-based simulator, data collection with the simulator, optical flow extraction, long future prediction with the deep neural network, and the performance analysis method. Section 4 discusses the results and findings of the experiments. Section 5 concludes the paper and provides recommendations for future research directions.

| BACKGROUND
Ground vehicle teleoperation systems relying on a video feed from the remote platform can be affected by many factors including camera viewpoint, inadequate operator depth perception, orientation, speed, and motion of the vehicles, and quality of the transmitted video feed (J. Y. Chen et al., 2007). However, the most significant challenges that operators face for long-distance direct ground vehicle teleoperation are the control problems that arise due to lag or latency. The robotic research community has investigated designing predictive displays to compensate for the latency in the communication loop. This section of the paper will highlight conventional predictive display-based enhancement techniques along with the current state-of-the-art in future frame prediction and optical flow extraction techniques.

| Predictive display-based enhancement
Conventional predictive teleoperation enhancement has been investigated for both 2D and 3D predictive displays. Different implementations for both of these predictive display-based approaches are summarized in Table 1 (adapted from Moniruzzaman et al., 2022) and discussed below.

| Two-dimensional predictive display-based approaches
Increasing perceptual awareness regarding remote environments is an effective way to enhance control over teleoperated ground robotic vehicles. However, communication delay between the robotic platform and the teleoperator is unavoidable for longer-distance links and degrades performance significantly. The literature suggests that higher levels of jitter and longer delays (both constant and time-varying) can make teleoperation effectively impossible (Richard, 2003) in challenging environments. Predicting the evolution of the state variables of the robot for the period of delay is one prospective way to reduce the impact of teleoperation lag (Artstein, 1982; Manitius & Olbrot, 1979). Therefore, in the 1990s, the robotic research community started developing predictive displays to counter the delay in the control loop. The concept of the predictive display helped earlier approaches, such as Buzan and Sheridan (1989) and Hirzinger et al. (1989), to develop teleoperation systems that provide the operator with information about the response of the system before it actually happens, and hence avoid possible collisions. Iconography has also been used as a means to implement robotic state prediction for more recent teleoperation user interfaces, such as Augmented Reality (AR) and Virtual Reality (VR; Witus et al., 2011). However, these early approaches do not provide an intuitive control interface, as they were mostly based on non-video feedback, and are not suitable for higher-speed operation.

Table 1. Summary of predictive feedback-based teleoperation enhancement techniques (adapted from Moniruzzaman et al., 2022); columns: author and year, technique, robot type.
Predicting robot states or pose from 2D video feeds from the remote platform has also been used to enhance teleoperation affected by communication delay. The propagation stage prediction technique has been used by Ha et al. (2018).

| Three-dimensional predictive display-based approaches

For teleoperation enhancement, Deng and Jagersand (2003) and Ricks et al. (2004) incorporated 3D visual feedback with next-step prediction into their interfaces. A 3D ecological user interface was described by Ricks et al. (2004) that represents the remote environment using sonar sensor feedback. Similar to a simple Kalman filter, the authors implemented first-order prediction on this user interface display. Their interface provides a 3D representation of the immediate last command, calculating the time since the command was sent instead of the time it takes to process on the remote platform. A similar technique was described by Deng and Jagersand (2003).
All of the above-mentioned predictive display-based approaches represent the visual impact of operator commands in real-time while teleoperating, to reduce the effect of latency.
However, for high latency long-distance teleoperation tasks, these first-order prediction techniques are not effective. Further, state-of-the-art neural network and artificial intelligence (AI)-based techniques are not easily transferable to, or implementable within, these conventional predictive display frameworks.

| Image-to-image translation
Image-to-image translation is the task of generating an output image corresponding to a given input image. According to Isola et al. (2017), image-to-image translation is the act of translating one possible representation of an image or scene into another. Image-to-image translation is simply the prediction of pixels from pixels. Although this is a relatively new field of research, a large number of computer vision and image processing problems can be described as image translation problems.
Image-to-image translation tasks have frequently been described as regression or per-pixel classification (Iizuka et al., 2016; Larsson et al., 2016; Long et al., 2015; Xie & Tu, 2015). The computer vision research community has tried to address the image-to-image translation or prediction task by using convolutional neural networks (CNNs; Pathak et al., 2016; R. Zhang et al., 2016). These CNNs use the Euclidean distance between the predicted and the ground-truth pixels as a loss function. However, Euclidean distance tends to generate blurry results (Pathak et al., 2016; R. Zhang et al., 2016). Therefore, the research community has been looking for a higher-level solution that is capable of generating outputs that are indistinguishable from the target and, in the process, of learning an appropriate loss function. GANs (Denton et al., 2015; Gauthier, 2014) have been used to generate images from semantic layouts and text descriptions (Karacan et al., 2016; Reed et al., 2016), predict future frames (Mathieu et al., 2015), generate synthetic product photographs (Yoo et al., 2016), and perform simple but versatile image-to-image translation (Isola et al., 2017).

| Optical flow
Estimating optical flow from consecutive images is one of the oldest, yet still one of the most active, research areas in computer vision (Fortun et al., 2015). As optical flow offers valuable information about objects' spatial arrangements in a scene and their rate of change, it has a wide range of applications across different domains. In computer vision, optical flow has been widely used in object tracking (Xiao & Jae Lee, 2016; Yin et al., 2016), action recognition (Simonyan & Zisserman, 2014), and segmentation tasks (Tsai et al., 2016). Video compression (Jakubowski & Pastuszak, 2013), indexing and retrieval (Hu et al., 2011; Su et al., 2007), image interpolation (K. Chen & Lorenz, 2011; Makansi et al., 2017), and super-resolution tasks have used optical flow as a determinant feature. Motion flow analysis of fluids and satellite imagery has important applications in fluid mechanics, oceanography, meteorology, and aerodynamics (Corpetti et al., 2002; Héas et al., 2007; Heitz et al., 2010; T. Liu & Shen, 2008).
The autonomous robotic navigation and driverless car research domains have also recently found optical flow to be a useful technique due to its applicability to obstacle detection and avoidance (Chao et al., 2014; Giachetti et al., 1998; Sun et al., 2006).
There have been a small number of future video prediction approaches that utilized multiple sequential past frames along with optical flow to assist with the prediction of future frames (Liang et al., 2017; Wei et al., 2018). However, to the best of our knowledge, optical flow has not previously been used for image-to-image translation-based future frame prediction. For our use case, the discussion of optical flow is relevant because we wanted to incorporate the real-time operator control input as a determinant factor for predicting a future frame from a past frame. The challenge is that the time-variant control signal is difficult to incorporate directly with 2D image data to train a deep neural network. However, the change in motion of objects in a scene during teleoperation is dictated by the operator control.
Therefore, the optical flow of frames can be considered a reflection of the operator control. Moreover, optical flow extracted from a visual simulation of any form that represents the real-time control signals of the operator should provide the future prediction network with valuable information about the anticipated changes in objects' motion and location in the frame.

Gibson (1950) proposed the concept of optical flow, and an initial approach to computing the motion of pixels was proposed by Poggio and Reichardt (1976). According to the physiological description of the visual perception of the world, the displacement of the intensity pattern of a scene refers to optical flow (Fortun et al., 2015).
Therefore, between two successive image frames, the 2D displacement field describing the apparent motion of brightness patterns is called optical flow (Horn & Schunck, 1981). On the basis of this definition, the intensity of moving pixels remains constant in motion, as the optical flow is caused by the relative motion of the object and the observer (Verri & Poggio, 1989). Using this assumption of projected scene flow, the true motion of objects inside a scene is also considered as optical flow (Barron et al., 1994; Tu et al., 2019). Several well-established families of optical flow extraction methods exist, as outlined below.
• The Lucas-Kanade method (Lucas & Kanade, 1981) is a popular gradient-based or differential method that uses local energy optimization. It uses spatiotemporal derivatives of image intensity to compute optical flow. Although the Lucas-Kanade method is very robust to image noise, it cannot generate dense optical flow (Barron et al., 1994).
• The Horn-Schunck method is also a very popular gradient-based method. This method uses global energy minimization for performance optimization (Horn & Schunck, 1981). It can be used for dense optical flow generation; however, it is prone to noise (Barron et al., 1994).
• The Farnebäck method is a widely used dense optical flow extraction method from consecutive image frames. This method uses a polynomial expansion transform to approximate each neighborhood of the consecutive frames by quadratic polynomials.
• Optical flow extraction was carried out from inertial measurements for small autonomous UAVs using a block-matching algorithm (Kendoul et al., 2009). The block-matching algorithm attempts to minimize the sum of the absolute difference or the sum of the squared differences.
• An image interpolation-based optical flow extraction method has been used for the navigation of small UAVs that do not possess high computation facilities (Srinivasan, 1994;Zufferey & Floreano, 2004). This method does not require image velocity calculations or any specific feature tracking to extract optical flow.
To get a complete reflection of the teleoperator's control input in the form of optical flow from sequential video frames of a robotic teleoperation session, it is necessary to know the motion of individual pixels in the video frame. Sparse optical flow extraction techniques only compute the motion vector for a specific set of objects. Therefore, dense optical flow extraction techniques, as sketched below, are the focus of this research.
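To make the dense flow extraction concrete, the following is a minimal sketch using OpenCV's implementation of the Farnebäck method to compute the per-pixel displacement field between two consecutive frames and convert it to the magnitude and angle components used later in this paper. The file names and parameter values are illustrative only, not the exact settings of our pipeline.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (illustrative file names).
prev = cv2.cvtColor(cv2.imread("frame_0001.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_0002.png"), cv2.COLOR_BGR2GRAY)

# Dense Farneback flow: one 2D displacement vector (dx, dy) per pixel.
# Positional arguments: pyr_scale, levels, winsize, iterations,
# poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Convert the per-pixel (dx, dy) vectors to magnitude and angle maps.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

# A common visualization: angle -> hue, magnitude -> value.
hsv = np.zeros((*prev.shape, 3), dtype=np.uint8)
hsv[..., 0] = angle * 180 / np.pi / 2
hsv[..., 1] = 255
hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)
cv2.imwrite("flow_visualization.png", cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
```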

| METHODOLOGY
Our recently published comprehensive survey of state-of-the-art teleoperation enhancement techniques (Moniruzzaman et al., 2021) supports the idea that deep learning, more specifically the use of deep neural networks to generate long future predicted frames, would be a practical way of compensating for the latency in the communication loop. However, deep learning techniques are data-hungry, requiring a large and diverse data set to achieve high accuracy and robustness.
Conventional robot-in-the-loop teleoperation systems are not ideal for large-scale image data collection, considering the cost, time, and associated safety concerns of operating in diverse and challenging remote environments. For a real robot-in-the-loop teleoperation system, both the cost and risk factors would be even more significant.
Therefore, we have designed a system that can simulate high latency teleoperation. Our previous paper (Moniruzzaman et al., 2022) proposes a conventional video transformation technique to aid teleoperators in high latency teleoperation conditions. In this manuscript we aim to show that neural network-predicted and generated future frames can achieve similar or even better results considering the pixel and structural integrity-based comparison of predicted future frames with the ground truth. An overall flow diagram of the proposed approach is illustrated in Figure 1. A detailed description has also been provided in this section as follows.

| Updated simulator design
The simulator used for data collection for this paper is an extension of the work described in Moniruzzaman et al. (2022). In this updated simulator, a UE module has been incorporated to produce optical flow reflecting the real-time control inputs. Figure 2 shows a schematic diagram of the simulator, and the major components are briefly described below.
Figure 1. Flow diagram of the proposed methodology. A detailed explanation of the functionality of each of the blocks is provided in Sections 3.1-3.5.

| Simulation platform
We have selected "Forza Horizon 4," a commercially available racing game engine as a matrix to create a teleoperation platform. We have used this platform as the remote robotic environment. Microsoft Studios developed and released the game engine in October 2018. It offers several places of Great Britain such as Ambleside, Cotswolds, Edinburgh, and the Lake District in a fictionalized form to provide a realistic driving experience. Both off-road and on-road facilities have been incorporated to offer a comprehensive driving experience.
More importantly, Forza Horizon 4 has coupled reasonably realistic ground vehicle physics along with a near-photorealistic visual representation of the environment. Therefore, it is an exemplary candidate to be used as a virtual environment and simulate a teleoperation task.

| Control input equipment
We have used "Logitech G29 Driving Force Racing Wheel" that has a brake and acceleration pedals ( Figure 2)  3.1.4 | Latency controller and delayed feed display Any robotic teleoperation system is unavoidably affected by some degree of lag or delay. The degree and impact of this lag or latency are more pronounced for long-distance teleoperation scenarios.
Therefore, the ability to induce and control delay is a major aspect of any teleoperation simulator. An operator should feel the impacts of these induced delays when the simulation is running. The simulator also needs to be capable of compiling, displaying, and saving control inputs, such as accelerator and brake pedal inputs and steering wheel rotation. Finally, it should be able to receive, display, process, and save the visual feed for future analytical purposes.

Figure 2. System diagram of the teleoperation simulator. The components and functionality of each block are described in Sections 3.1.1-3.1.6. The dotted box represents the updated Unreal Engine (UE) cosimulation module that is elaborated in Figure 3.
We have developed a Simulink model that can mimic "operator in the loop" robotic teleoperation by offering all of the above-mentioned capabilities. The UE cosimulation module, illustrated in Figure 3, is built around the "Ground Following," "Transformation Display," and "Image Display" Simulink blocks. Other blocks used to design this cosimulation module include the "Simulation 3D Actor Transform Get" and "Simulation 3D Camera Get" blocks.
The above-described model has been developed using conventional off-the-shelf equipment, a commercially available game engine, easy-to-integrate Simulink blocks, a UE-based cosimulation platform, and some custom video and control signal transformation algorithms. For this data set, the latency has also been set to 0.5 s.

| Data collection and preprocessing
Figure 4. Sample ground truth (GT), delayed (DL), and Unreal Engine (UE) simulation images that correspond to each other, from three of our data sets.

| Optical flow extraction
In this work, we have aimed to apply a deep neural network to the task of predicting and generating frames around 0.5 s into the future. We have hypothesized that if, as an additional aid to the neural network along with the three RGB channels of the input data set, we provide a converted form of the real-time control signals that is not affected by the control loop delay, we will be able to produce predicted frames that reflect the impacts of the changing control inputs and will thus be more accurate.

Optical flow extraction is based on the brightness constancy assumption, which states that the intensity of a point in the scene does not change between two consecutive frames:

$$I(x, y, t) = I(x + dx, y + dy, t + dt). \tag{1}$$

Applying a Taylor series approximation to the right-hand side, removing common terms, and dividing by $dt$, we can find

$$I_x u + I_y v + I_t = 0.$$

Here, $I_x$, $I_y$, and $I_t$ are the partial derivatives of the image intensity with respect to $x$, $y$, and $t$, and $u = dx/dt$ and $v = dy/dt$ are the horizontal and vertical components of the optical flow.

Figure 5. An example of two consecutive frames and the corresponding magnitude and angle components of the optical flow between them.

| Future frame prediction through image-to-image translation

Conventional future frame prediction models receive multiple input frames and try to predict one or more subsequent frames. Such prediction networks themselves require enough time to receive multiple frames, however. By the time the subsequent frames have been generated, these predicted frames will already have become past frames in the time spectrum. Moreover, these future prediction models are not designed for a constant influx of frames.
For a real-time ground vehicle teleoperation scenario, the video feed is delayed but offers a constant influx of incoming frames. We hypothesized that the information necessary for future frame synthesis is largely contained in the prior frames, and thus, if the influx of delayed frames can be translated into frames that reasonably closely match the original present-time frames, that would solve the latency problem. For our use case, and based on the experimental setup we have designed, the goal of the future prediction is to generate a new frame that is close to the ground truth (the real-time, nondelayed driving simulator feed frame), taking the delayed frame as an input. Therefore, instead of a conventional future video prediction scenario, the problem can be described as an image-to-image translation problem.
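As a simplified illustration of this framing (our actual data sets are built by recording synchronized ground-truth, delayed, and UE windows, as described later in the paper), the sketch below pairs each frame with the frame a fixed latency later, so the earlier frame plays the role of the delayed input and the later frame the ground-truth target. The 30 fps extraction rate and helper name are illustrative assumptions.

```python
# Simplified sketch: form (delayed_input, ground_truth_target) pairs offset
# by the simulated communication latency. Not our exact data-collection
# pipeline; frame rate is an assumed value for the example.

LATENCY_S = 0.5   # simulated delay between input and target frames
FPS = 30          # assumed frame extraction rate
OFFSET = round(LATENCY_S * FPS)  # 15 frames at 30 fps

def make_translation_pairs(frame_paths):
    """Return (delayed_input, ground_truth_target) path pairs."""
    return [(frame_paths[i], frame_paths[i + OFFSET])
            for i in range(len(frame_paths) - OFFSET)]
```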

| Network architecture selection
As conventional future frame prediction networks require multiple sequential past frames as input, adding these networks to a real-time robotic teleoperation system would require an intermediate buffer in which frames from the robot end accumulate until the required number of frames has been received to generate the desired future frame or frames. This buffer would add more latency to the system. Therefore, a network is ideally needed that is capable of receiving single frames and translating them to frames with an acceptable level of similarity to frames deep into the future, in a FIFO fashion. We also require a network that generates future frames fast enough to create a video feed smooth enough to provide operators with an acceptable visual perception of the remote environment. Finally, we require a network that is robust enough to accept 2D image data with varying scene conditions, because in robotic teleoperation in a challenging environment, scenes will have a high degree of uncertainty.
Considering the above-mentioned requirements, instead of a conventional future frame prediction network, image-to-image translation is more suited to the task of future frame generation for the purpose of latency compensation. However, finding a suitable network for our use case requires careful consideration. In general, all of the current image-to-image translation networks can be categorized as either unsupervised or supervised translation networks (Alotaibi, 2020). Unsupervised translation networks, such as CycleGAN, were not considered suitable for our use case. The remaining supervised image-to-image translation networks can be divided into two more categories: directional translation and bidirectional translation networks (Alotaibi, 2020). Bidirectional networks, such as BicycleGAN (Zhu, Zhang, et al., 2017), were also ruled out. Pix2Pix was therefore selected as the only suitable network for this application.
In recent times, ensembling has become a popular approach for improving the results of discriminative CNNs, particularly for image classification (Neena & Geetha, 2018), object detection (X. Wang & Gupta, 2015), and medical image segmentation tasks (Altaf et al., 2021; Kavur, Gezer, et al., 2020; Kavur, Kuncheva, et al., 2020; Menze et al., 2014). Ensembling of GANs has also been experimented with for imbalanced image classification (Ermaliuc et al., 2021; Huang et al., 2020). Although ensembling can be a relatively easy way to increase the classification accuracy or the quality of results generated by a model, it is computationally expensive and requires more time for both training and generating outcomes. Ensembling was investigated as a potential method for improving the outcomes of this study, but was rejected for two reasons. First, ensembling requires two or more suitable networks, but based on our literature survey, we were not able to identify any other viable candidate options for our use case. Additionally, the goal of our research is to enhance teleoperability by predicting future frames in real-time, and any additional computation time that adds further delay to the communication loop runs counter to this goal.
Considering the above factors, for our future frame prediction task, we have limited our experimentation to the Pix2Pix network, the cGAN described in Isola et al. (2017).

| Loss functions
The future frame generation network learns with two loss functions: the generator loss function and the discriminator loss function. Conventional GANs learn losses that are data-specific. However, conditional GANs, like the one used in this work, learn a loss that penalizes, and adjusts the learning based on, possible output structures that differ from the target image. For our network, the primary generator loss is a sigmoid cross-entropy loss between the generated image and an array of ones, and is called $gan\_loss$. To make the generated image structurally similar to the target image, Isola et al. (2017) used an L1 loss along with the $gan\_loss$. The L1 loss is the mean absolute error between the target image and the generator output. The total generator loss is defined as

$$total\_generator\_loss = gan\_loss + (\lambda \times L1\_loss).$$

Here, $\lambda$ is a constant. Isola et al. (2017) found that the network performs better with a $\lambda$ value of 100.
The discriminator part of the network is provided with the target image and the generator output image. The discriminator loss also consists of two elements: the $real\_loss$ and the $generated\_loss$. Here, the $real\_loss$ is the sigmoid cross-entropy of the target image and an array of ones. The $generated\_loss$ is the sigmoid cross-entropy of the generator output and an array of zeros; the array is made up of zeros because the generator output is a synthesized (fake) image. The total discriminator loss is then the sum of these:

$$total\_discriminator\_loss = real\_loss + generated\_loss. \tag{6}$$

Figure 6 shows the stages of the training process for the future frame prediction network. Once the input image is passed through the generator, a synthesized image is produced; the discriminator then compares the target and the synthesized image and learns by attempting to minimize the discriminator loss, while the generator learns based on the $gan\_loss$ and L1 loss.
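As an illustration of these two losses, the following is a minimal sketch in TensorFlow of the sigmoid cross-entropy and λ-weighted L1 terms described above. The function and variable names are ours, and details such as the optimizers and training loop are omitted.

```python
import tensorflow as tf

LAMBDA = 100  # weighting constant for the L1 term (Isola et al., 2017)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def generator_loss(disc_generated_output, gen_output, target):
    # gan_loss: cross-entropy between the discriminator's verdict on the
    # generated image and an array of ones ("real").
    gan_loss = bce(tf.ones_like(disc_generated_output), disc_generated_output)
    # L1 loss: mean absolute error between target and generated image.
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))
    total_generator_loss = gan_loss + (LAMBDA * l1_loss)
    return total_generator_loss, gan_loss, l1_loss

def discriminator_loss(disc_real_output, disc_generated_output):
    # real_loss: cross-entropy between the verdict on the target image and ones.
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)
    # generated_loss: cross-entropy between the verdict on the synthesized
    # (fake) image and zeros.
    generated_loss = bce(tf.zeros_like(disc_generated_output),
                         disc_generated_output)
    return real_loss + generated_loss
```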

Figure 6. The training process and the generator and discriminator output after passing through one input and one target image.

| Accommodating optical flow
In the context of teleoperation, which is the focus of this work, we aim to show that, along with the delayed frames, additional information, such as optical flow that represents the real-time control input signal, can improve the outcome. However, the network described in Isola et al. (2017) can only take RGB images as input. To accommodate the additional optical flow information, we have made changes to the data loader function as well as the input layer of the core network. The procedures for optical flow extraction and data set preparation have been discussed in Sections 3.2 and 3.3. The optical flow of two consecutive frames has magnitude and angle components with the same dimensions as the input frames. We have stacked these angle and magnitude components as the respective fourth and fifth channels alongside the R, G, and B channels of the input image. As a result, the input data have been converted to five-channel data, as illustrated in Figure 7. The input layers of the generator have also been modified so that these two additional channels can be accepted by the network.
To increase the robustness of the network with this type of data, some preprocessing is carried out just before feeding the data to the network.
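The following is a minimal sketch of the two modifications just described: stacking the precomputed flow magnitude and angle maps as the fourth and fifth channels of the input tensor, and widening the first layer of a Pix2Pix-style U-Net generator to accept five channels. The layer configuration and names shown are illustrative rather than our exact network definition.

```python
import numpy as np
import tensorflow as tf

def make_five_channel_input(rgb, flow_magnitude, flow_angle):
    # rgb: H x W x 3 in [0, 1]; flow components: H x W, normalized to [0, 1].
    # Channels 4 and 5 carry the optical flow magnitude and angle.
    return np.dstack([rgb, flow_magnitude, flow_angle]).astype(np.float32)

def generator_input_stack(height=256, width=256, in_channels=5):
    # First downsampling block of a Pix2Pix-style U-Net generator, with the
    # input shape widened from (H, W, 3) to (H, W, 5).
    inputs = tf.keras.Input(shape=(height, width, in_channels))
    x = tf.keras.layers.Conv2D(64, 4, strides=2, padding="same",
                               use_bias=False)(inputs)
    x = tf.keras.layers.LeakyReLU()(x)
    return tf.keras.Model(inputs=inputs, outputs=x)
```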

| Training parameters
The preprocessed data are fed to the cGAN network for training.
We have used one of two desktop computers for all the training sessions for our three data sets. One of the desktops has a 48-GB NVIDIA RTX A6000 graphics processing unit (GPU) and the other has a 24-GB NVIDIA GeForce TITAN RTX GPU. Each training data set comprises a total of 20,000 images. As the GPUs in both of the available training desktops have sufficient memory, we have set the training buffer size to 20,000, meaning that, for each step of training, the whole data set is shuffled. We have kept the batch size at one (1), as this produces better results for the U-Net described in Isola et al. (2017). For all of our data sets, we have trained the network for 100 epochs, which corresponds to 2,000,000 training steps in total.

| Performance analysis

To evaluate the predicted frames against the ground truth, we have used PSNR, SSIM, and multiscale SSIM (MS-SSIM). The SSIM between two images x and y is defined as (Z. Wang et al., 2004)

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}.$$

Here, $\mu_x$ is the average of x, $\mu_y$ is the average of y, $\sigma_x^2$ is the variance of x, $\sigma_y^2$ is the variance of y, $\sigma_{xy}$ is the covariance of x and y, and $c_1$ and $c_2$ are variables to stabilize the division. The variance and covariance parts of the SSIM algorithm account for the structural change between images.
When comparing multiple different image frames with a ground-truth frame, the higher the SSIM, the closer the frame is to the ground truth. To compare structural differences, we have used the SSIM to compare the frames generated from the different data sets, with and without the use of optical flow as an input to the network. The SSIM comparisons are shown in Figure A7 in Appendix A.
On the basis of this, we have also used MS-SSIM to cross-compare our research outcomes. SSIM and MS-SSIM share the same core algorithm; however, MS-SSIM conducts its operation over multiple scales using a process of multiple subsampling stages.
In the MS-SSIM process, the images are downsampled by a factor of two and passed through a low-pass filter at each stage. We have compared the same synthetic frames that were fed to the SSIM with the ground-truth video frames using MS-SSIM. Figure A8 in Appendix A shows the comparison for all three data sets.
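For reference, the sketch below shows how the three metrics discussed in this section can be computed for a single predicted frame using TensorFlow's built-in image operations; our exact evaluation scripts may differ in normalization and batching details.

```python
import tensorflow as tf

def evaluate_frame(predicted, ground_truth, max_val=255.0):
    """Return (PSNR, SSIM, MS-SSIM) for one predicted/ground-truth frame pair.

    predicted, ground_truth: H x W x 3 arrays with pixel values in [0, max_val].
    MS-SSIM requires frames large enough for its five downsampling stages
    (roughly 176 px or more per side).
    """
    predicted = tf.convert_to_tensor(predicted, dtype=tf.float32)
    ground_truth = tf.convert_to_tensor(ground_truth, dtype=tf.float32)
    psnr = tf.image.psnr(predicted, ground_truth, max_val=max_val)
    ssim = tf.image.ssim(predicted, ground_truth, max_val=max_val)
    ms_ssim = tf.image.ssim_multiscale(predicted, ground_truth, max_val=max_val)
    return float(psnr), float(ssim), float(ms_ssim)
```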

| RESULTS AND DISCUSSION
The primary motivation for this research is to determine whether artificial neural network-generated predicted future frames have the ability to enhance operator control in a high latency teleoperation scenario. If the predicted future frames are found to have reasonably close structural and pixel similarity to the ground-truth future image frames, this can be considered preliminary proof of the concept. Therefore, the motivation for and feasibility of this new direction of research can be established. We have used a cGAN as our future frame generator for translation of the delayed frame to a future frame. This generic cGAN has proven successful for style, map, or discrete scene translation. However, for our use case, the translated frame needs to overcome a long delay by predicting the changes in pixels and structures over multiple consecutive video frames, from the point of view of a vehicle in motion with unpredictable control inputs. Due to the significant additional challenges this imposes, we wanted to find out whether additional time-series information, such as optical flow from the delayed frames or the transformed control signal input converted to a form of optical flow through a UE-based subsystem, can improve the accuracy and similarity of the generated frames to the ground truth. To make our findings more robust and reliable, we generated three data sets: Forza_GT + UE + Std_DL, Forza_GT + UE + VT_DL, and Forza_GT + UE + Forza_DL. The details of these data sets are discussed in Section 3.2.

Once all training sessions were completed, the cGAN networks were used to generate the synthesized future frames, using the delayed frames as three-channel RGB input from the test data set. As discussed in Section 3.2, the test data sets contain 2000 image frames of ground truth, delayed, and UE simulation. The corresponding magnitude and angle components of optical flow for the ground truth, delayed, and UE frames are also present in the test data as additional input for nine out of the 12 trained models. The remaining three models were not trained with optical flow and do not require optical flow to generate predicted future frames. All three test data sets are created from random videos of teleoperation sessions we performed to create our data bank. For a particular data set (out of three), both the training and test sets are collected in a way where the delayed frames have the same characteristics (i.e., Std_DL, VT_DL, or Forza_DL). The predicted and synthetically generated frames from our trained models were then compared with the ground-truth frames of the test sets based on the pixel and structural similarity indexes described in Section 3.5. Table 2 provides the mean PSNR, SSIM, and Multi-SSIM values for all of the predicted synthetic frames from the 12 trained cGAN models. Figures A6-A8 in Appendix A show the PSNR, SSIM, and multiscale SSIM values for the generated future frames from the cGAN models trained with all of the different test input data and optical flow combinations described above, for comparison.
To further evaluate the outcomes of our research and offer better insight into the complementarity between the pixel and image structural analysis, we have also provided a statistical analysis in Section 4.2.

| Statistical analysis
The image analysis results in Table 2 show that, for all three test data sets, both with and without optical flow, the cGAN-generated synthetic future frames have sufficient pixel and structural soundness to provide a continuous and meaningful video feed for teleoperation.
However, looking only at the mean values of the image quality analysis metrics, the impacts of the training parameters, especially the impact of optical flow on the predicted frames, cannot be adequately determined. Moreover, the usability and robustness of the cGAN deep network can be better affirmed with further statistical analysis of the cumulative test sets. Table 3 presents the results from this statistical analysis. Unlike Table 2, Table 3 reports statistics computed over the cumulative outputs of the three test sets rather than per data set. From Table 4, we can see that for predicted frames generated without the aid of any optical flow, the Fleiss' κ is 0.40 (for a 95% confidence interval, the lower and upper bounds are 0.39 and 0.41).
According to the κ value classifications of Bland and Altman (1999) and Landis and Koch (1977), this corresponds to fair agreement, whereas the higher κ of 0.55 obtained with the optical flow-aided models corresponds to moderate agreement.
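As an illustration of this agreement analysis, the sketch below computes Fleiss' κ using statsmodels, treating each generated frame as a subject and each metric (PSNR, SSIM, MS-SSIM) as a rater that assigns the frame to a quality category. The quartile-based binning shown is an illustrative assumption, not necessarily the exact scheme used in the paper.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def metric_to_category(values, n_categories=4):
    # Bin one metric's per-frame scores into quality categories (0 = worst),
    # using its quartiles as bin edges (illustrative binning scheme).
    edges = np.quantile(values, np.linspace(0, 1, n_categories + 1)[1:-1])
    return np.digitize(values, edges)

def fleiss_kappa_from_metrics(psnr, ssim, ms_ssim):
    # Rows: frames (subjects); columns: metrics (raters) with category codes.
    ratings = np.column_stack([metric_to_category(np.asarray(m))
                               for m in (psnr, ssim, ms_ssim)])
    table, _ = aggregate_raters(ratings)  # subjects x categories count table
    return fleiss_kappa(table, method='fleiss')
```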

| Discussion
From Table 2, it is evident that the performance of the cGAN model is influenced by the data set, even though all three data sets have been collected using the same simulator described in Moniruzzaman et al. (2022). For the Forza_GT + UE + Std_DL data set, the input frames are the delayed video feed that has been passed through the Simulink delay system, which results in a drop in image quality and frame rate, simulating a real-life teleoperation scenario. As a result, the cGAN has to predict future frames from a lower-quality image feed. In this data set, some frames are repeated or corrupted and thus do not resemble either the input or output data. To keep the input, output, and optical flow frames synchronized, we have recorded all three windows (ground-truth game screen, delayed screen, and UE simulation screen) using OBS, with a recording frame rate of 30 fps.
However, the delayed window has an underlying frame rate of 14-18 fps, limited by the Simulink update rate, and the UE simulator also has a similar underlying frame rate. This means that when these recorded videos are converted to individual frames, the transition between the ground-truth frames is smooth; however, a repeated frame is added every few frames for both the delayed and UE optical flow data. This is the case for the Forza_GT + UE + VT_DL data set as well. For the Forza_GT + UE + Forza_DL data set, both the delayed input and the real-time ground truth come from the recorded HD Forza game screen, with the same frame rate and image resolution.
Therefore, the cGAN trained with this data set has generated better synthetic future frames than Forza_GT + UE + Std_DL. For a standard teleoperation test condition, we can consider the PSNR, SSIM, and Multi-SSIM values of the Forza_GT + UE + Forza_DL data as typical reference values, because there will be no video recording and resulting frame rate disparity issue in a real-life teleoperation system.
As described above, the Forza_GT + UE + VT_DL data set suffers from the same frame rate disparity-related noise issue as Forza_GT + UE + Std_DL. However, Table 2 shows that the generated future frames have the highest level of similarity with the ground-truth frames for this data set, measured both by the pixel values and the structural similarity indexes. For the Forza_GT + UE + VT_DL data set, the ground truth is from the HD Forza real-time game screen, similar to the other data sets. However, the input-delayed frames are passed through a video transformation process, discussed in Moniruzzaman et al. (2022), to enhance teleoperation. The better results achieved using these transformed frames to generate the synthetic future frames support the claim of our previous paper (Moniruzzaman et al., 2022) that simple video transformation can be used to enhance the teleoperation operator experience.
Additionally, a deep neural network-based future frame generation model on top of the video transformation system has the potential to enhance high latency teleoperation even further. The concept of optical flow and its uses in diverse domains is more than half a century old. Using optical flow for autonomous robotic navigation is also common. However, using optical flow as an aid to predict the future to address communication latency is a new concept and, to the best knowledge of the authors, has never been described in the literature. In addition to the visual representation in Figures A3-A5, the results presented in Table 2 and the graph comparisons in Figures A6-A8 indicate that the inclusion of optical flow information alongside the base three-channel data has positively impacted the generated future frames. Statistical analysis of the three test sets' cumulative outputs presented in Tables 3 and 4 has provided insight into the complementarity of optical flow for future frame prediction by the cGAN network. The boxplots presented in Figure 8 indicate that frames predicted with the help of optical flow have clear quality improvements as measured by all three evaluation metrics. Additionally, for future predictions without optical flow, the number of outliers is higher. The κ values presented in Table 4 show that the models trained with additional optical flow information generated frames that were more reliable, with better interrater agreement.

Table 4. Fleiss' κ analysis for the predicted future frames from the no-optical flow, delayed optical flow, and real-time optical flow assisted cGAN models.
Figure 8. Boxplot representations of the (a) PSNR, (b) SSIM, and (c) Multi-SSIM values for a total of 6000 predicted synthetic frames (across all three data sets) generated by the cGAN models trained without any optical flow, with DL optical flow, with GT optical flow, and with UE optical flow. cGAN, conditional generative adversarial network; DL, delayed; GT, ground truth; PSNR, peak signal-to-noise ratio; SSIM, Structural Similarity Index Measure; UE, Unreal Engine.
Optical flow offers additional information about the relative motion of the individual objects in a scene. Therefore, the models trained with the five-channel input data that included some form of optical flow have generated future frames that are more structurally similar to the ground truth. From the table and figures, the impact of optical flow from the UE simulation appears subtle.
However, optical flow from the UE is the one that will theoretically provide the cGAN network with a latency-free reflection of the control input changes, and so is probably the most important for a practical enhancement system. In a live teleoperation context, only the control signal-derived optical flow will provide a nondelayed prediction of what the motion of the objects in the scene should be, and this is what the synthesized frames will need to match. The optical flow offers information about the relative motion of the individual objects in a scene rather than the individual pixel's motion.
Our UE cosimulation offers the change of motion of a gridded scene.
Although it does not match the scene of the real teleoperation environment, the fact that results from the UE-generated optical flow are comparable to those from the delayed feed-generated optical flow, and better than the no-optical flow-based predicted frames, is significant. Moreover, our results from the real-time optical flow-based input outperform all of the other three input configurations. This suggests that optical flow that reflects the real-time control input from the operator will be an effective means of synthetically generating results equivalent to the ground truth. Nonetheless, for a live teleoperation session affected by latency, optical flow from future frames (ground-truth frames) will not be available in a real system context. Therefore, the experimentation conducted in this paper has opened up further research avenues regarding developing methods to generate optical flow that reflects the nondelayed control signals as well as the scene configuration of the remote environment of a teleoperation session. Other associated future research can focus on ways to more optimally incorporate optical flow into the U-Net structure. Further investigation is also necessary to determine whether there is a more appropriate layer of the network into which to feed the optical flow information, instead of the input layer.

Figure 9. Comparison of (a) PSNR, (b) SSIM, and (c) Multi-SSIM values between transformed delayed frames tested by human operators in Moniruzzaman et al. (2022) (left) and frames predicted and synthetically generated by the cGAN using the Forza_GT + UE + VT_DL data set (right). cGAN, conditional generative adversarial network; DL, delayed; GT, ground truth; PSNR, peak signal-to-noise ratio; SSIM, Structural Similarity Index Measure; UE, Unreal Engine.
Irrespective of the above points, the most important question to ask is whether the predicted and synthetically generated future frames are of sufficient quality to provide as a video feed to a human operator to effectively enhance their teleoperation experience in high latency settings. In Moniruzzaman et al. (2022), we tested the teleoperation simulator with human operators, with and without video transformation of the delayed feed. Figure 9 shows the PSNR, SSIM, and Multi-SSIM graphs from Moniruzzaman et al. (2022) alongside the corresponding graphs for the cGAN-based synthetically generated frames. This indicates that the synthetic predicted future frames generated by the cGAN network should be of sufficient quality to be fed as video streams to human operators, although this will need to be verified with further human operator trials.
A point that should be noted is that the three data sets collected for this paper are highly complex, with multiple simultaneous changes in vehicle speed, direction, and path to emulate highly challenging remote teleoperation situations. We are confident that, for more standard teleoperation scenarios, where the changes of speed and vehicle direction are not so frequent and unpredictable, the generated future frames would have even higher levels of similarity with the ground truth. Additionally, if the additional noise introduced due to the variations in frame rate between the ground-truth and the delayed feeds were eliminated, further improvements in the achievable levels of similarity are anticipated. Further refinements to the simulation platform can be explored to achieve this.

| CONCLUSION
This paper has described a novel approach to mitigating the impacts of latency and enhancing teleoperation, specifically for robotic ground vehicles. In our previous work (Moniruzzaman et al., 2021), we proposed that future frame prediction could be a new direction of research for enhancing teleoperation, especially for challenging environments with high latency communication channels. This paper has offered further support for this position. Future video frame prediction is itself a relatively new research domain. To the best of the authors' knowledge, the concept of future frame prediction has never previously been applied to latency mitigation or any other related tasks in the field of robotics. Moreover, this paper has presented a novel approach to long future frame prediction, using the image-to-image translation technique to predict and generate frames far into the future (approximately 0.5 s). For the base image translation task we have used a conditional GAN with a U-Net architecture, called Pix2Pix, described in Isola et al. (2017). Besides the conventional three-channel input frames, we have experimented with five-channel input data where the last two channels are the magnitude and angle components of optical flow, calculated from two consecutive frames using the Gunnar Farnebäck method. In this work, we have experimented with the effectiveness of optical flow extracted from the delayed frame sequence, from the ground-truth frame sequence, and from a completely separate but cosimulated UE simulation reflecting real-time control input changes. The outcome of our experiments shows that future prediction using the delayed frames transformed through the video transformation algorithm, aided by optical flow, produced the best results. The models trained with optical flow generate a more reliable outcome, with fewer outliers and less variation in the quality of the predicted frames. On the basis of the image and statistical analysis, the authors of this paper are confident that these pixel and structural similarity index values are high enough to feed these synthetically generated future frames to a human operator in a real teleoperation scenario.
Substantial future research opportunities exist towards achieving a highly accurate and structurally detailed synthetic future frames generation model that can be integrated into a high latency teleoperation control and communication loop. Our future research will focus on improving the similarity index and perceived quality of the predicted frames and the ability to predict even further into the future to allow compensation for higher latencies. Investigations into alternative deep network architectures for frame generation and identification of appropriate layers to tunnel additional information, such as optical flow, or other cues will also be explored. Alternative methods for the prediction of future optical flow can also be investigated. Finally, the integration of the neural network model into a complete real-time teleoperation system will require extensive further work. Once this integration has been realized, we intend to more comprehensively assess the effectiveness of the deep neural network-predicted future frames-based teleoperation enhancement technique, and its impacts on human operator performance through extensive user testing.

ACKNOWLEDGMENTS
This research has been funded by the ECU-DSTG Industry Ph.D. scholarship. The authors would also like to thank Mr. Hassan Mahmood for a number of helpful discussions regarding the research and ideas. Open access publishing facilitated by Edith Cowan University, as part of the Wiley-Edith Cowan University agreement via the Council of Australian University Librarians.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Figure A6. PSNR value-based comparison of the predicted synthetic future frames generated by the cGAN against the ground truth for (a) Forza_GT + UE + Forza_DL, (b) Forza_GT + UE + VT_DL, and (c) Forza_GT + UE + Std_DL test sets. cGAN, conditional generative adversarial network; DL, delayed; GT, ground truth; PSNR, peak signal-to-noise ratio; UE, Unreal Engine.