So, this is an mirror of a post from John Carmack. Recently I learned that his articles on #AltDevBlog are no longer acessible. So, in order to archive them, I am re-posting them here. These articles are definitely good reads and worth to be preserved.
Virtual reality (VR) is one of the most demanding human-in-the-loop applications from a latency standpoint. The latency between the physical movement of a user’s head and updated photons from a head mounted display reaching their eyes is one of the most critical factors in providing a high quality experience.
Human sensory systems can detect very small relative delays in parts of the visual or, especially, audio fields, but when absolute delays are below approximately 20 milliseconds they are generally imperceptible. Interactive 3D systems today typically have latencies that are several times that figure, but alternate configurations of the same hardware components can allow that target to be reached.
A discussion of the sources of latency throughout a system follows, along with techniques for reducing the latency in the processing done on the host system.
Updating the imagery in a head mounted display (HMD) based on a head tracking sensor is a subtly different challenge than most human / computer interactions. With a conventional mouse or game controller, the user is consciously manipulating an interface to complete a task, while the goal of virtual reality is to have the experience accepted at an unconscious level.
Users can adapt to control systems with a significant amount of latency and still perform challenging tasks or enjoy a game; many thousands of people enjoyed playing early network games, even with 400+ milliseconds of latency between pressing a key and seeing a response on screen.
If large amounts of latency are present in the VR system, users may still be able to perform tasks, but it will be by the much less rewarding means of using their head as a controller, rather than accepting that their head is naturally moving around in a stable virtual world. Perceiving latency in the response to head motion is also one of the primary causes of simulator sickness. Other technical factors that affect the quality of a VR experience, like head tracking accuracy and precision, may interact with the perception of latency, or, like display resolution and color depth, be largely orthogonal to it.
A total system latency of 50 milliseconds will feel responsive, but still subtly lagging. One of the easiest ways to see the effects of latency in a head mounted display is to roll your head side to side along the view vector while looking at a clear vertical edge. Latency will show up as an apparent tilting of the vertical line with the head motion; the view feels “dragged along” with the head motion. When the latency is low enough, the virtual world convincingly feels like you are simply rotating your view of a stable world.
Extrapolation of sensor data can be used to mitigate some system latency, but even with a sophisticated model of the motion of the human head, there will be artifacts as movements are initiated and changed. It is always better to not have a problem than to mitigate it, so true latency reduction should be aggressively pursued, leaving extrapolation to smooth out sensor jitter issues and perform only a small amount of prediction.
It is not usually possible to introspectively measure the complete system latency of a VR system, because the sensors and display devices external to the host processor make significant contributions to the total latency. An effective technique is to record high speed video that simultaneously captures the initiating physical motion and the eventual display update. The system latency can then be determined by single stepping the video and counting the number of video frames between the two events.
In most cases there will be a significant jitter in the resulting timings due to aliasing between sensor rates, display rates, and camera rates, but conventional applications tend to display total latencies in the dozens of 240 fps video frames.
On an unloaded Windows 7 system with the compositing Aero desktop interface disabled, a gaming mouse dragging a window displayed on a 180 hz CRT monitor can show a response on screen in the same 240 fps video frame that the mouse was seen to first move, demonstrating an end to end latency below four milliseconds. Many systems need to cooperate for this to happen: The mouse updates 500 times a second, with no filtering or buffering. The operating system immediately processes the update, and immediately performs GPU accelerated rendering directly to the framebuffer without any page flipping or buffering. The display accepts the video signal with no buffering or processing, and the screen phosphors begin emitting new photons within microseconds.
In a typical VR system, many things go far less optimally, sometimes resulting in end to end latencies of over 100 milliseconds.
Detecting a physical action can be as simple as a watching a circuit close for a button press, or as complex as analyzing a live video feed to infer position and orientation.
In the old days, executing an IO port input instruction could directly trigger an analog to digital conversion on an ISA bus adapter card, giving a latency on the order of a microsecond and no sampling jitter issues. Today, sensors are systems unto themselves, and may have internal pipelines and queues that need to be traversed before the information is even put on the USB serial bus to be transmitted to the host.
Analog sensors have an inherent tension between random noise and sensor bandwidth, and some combination of analog and digital filtering is usually done on a signal before returning it. Sometimes this filtering is excessive, which can contribute significant latency and remove subtle motions completely.
Communication bandwidth delay on older serial ports or wireless links can be significant in some cases. If the sensor messages occupy the full bandwidth of a communication channel, latency equal to the repeat time of the sensor is added simply for transferring the message. Video data streams can stress even modern wired links, which may encourage the use of data compression, which usually adds another full frame of latency if not explicitly implemented in a pipelined manner.
Filtering and communication are constant delays, but the discretely packetized nature of most sensor updates introduces a variable latency, or “jitter” as the sensor data is used for a video frame rate that differs from the sensor frame rate. This latency ranges from close to zero if the sensor packet arrived just before it was queried, up to the repeat time for sensor messages. Most USB HID devices update at 125 samples per second, giving a jitter of up to 8 milliseconds, but it is possible to receive 1000 updates a second from some USB hardware. The operating system may impose an additional random delay of up to a couple milliseconds between the arrival of a message and a user mode application getting the chance to process it, even on an unloaded system.
On old CRT displays, the voltage coming out of the video card directly modulated the voltage of the electron gun, which caused the screen phosphors to begin emitting photons a few microseconds after a pixel was read from the frame buffer memory.
Early LCDs were notorious for “ghosting” during scrolling or animation, still showing traces of old images many tens of milliseconds after the image was changed, but significant progress has been made in the last two decades. The transition times for LCD pixels vary based on the start and end values being transitioned between, but a good panel today will have a switching time around ten milliseconds, and optimized displays for active 3D and gaming can have switching times less than half that.
Modern displays are also expected to perform a wide variety of processing on the incoming signal before they change the actual display elements. A typical Full HD display today will accept 720p or interlaced composite signals and convert them to the 1920×1080 physical pixels. 24 fps movie footage will be converted to 60 fps refresh rates. Stereoscopic input may be converted from side-by-side, top-down, or other formats to frame sequential for active displays, or interlaced for passive displays. Content protection may be applied. Many consumer oriented displays have started applying motion interpolation and other sophisticated algorithms that require multiple frames of buffering.
Some of these processing tasks could be handled by only buffering a single scan line, but some of them fundamentally need one or more full frames of buffering, and display vendors have tended to implement the general case without optimizing for the cases that could be done with low or no delay. Some consumer displays wind up buffering three or more frames internally, resulting in 50 milliseconds of latency even when the input data could have been fed directly into the display matrix.
Some less common display technologies have speed advantages over LCD panels; OLED pixels can have switching times well under a millisecond, and laser displays are as instantaneous as CRTs.
A subtle latency point is that most displays present an image incrementally as it is scanned out from the computer, which has the effect that the bottom of the screen changes 16 milliseconds later than the top of the screen on a 60 fps display. This is rarely a problem on a static display, but on a head mounted display it can cause the world to appear to shear left and right, or “waggle” as the head is rotated, because the source image was generated for an instant in time, but different parts are presented at different times. This effect is usually masked by switching times on LCD HMDs, but it is obvious with fast OLED HMDs.
The classic processing model for a game or VR application is:
Read user input -> run simulation -> issue rendering commands -> graphics drawing -> wait for vsync -> scanout
I = Input sampling and dependent calculation
S = simulation / game execution
R = rendering engine
G = GPU drawing time
V = video scanout time
All latencies are based on a frame time of roughly 16 milliseconds, a progressively scanned display, and zero sensor and pixel latency.
If the performance demands of the application are well below what the system can provide, a straightforward implementation with no parallel overlap will usually provide fairly good latency values. However, if running synchronized to the video refresh, the minimum latency will still be 16 ms even if the system is infinitely fast. This rate feels good for most eye-hand tasks, but it is still a perceptible lag that can be felt in a head mounted display, or in the responsiveness of a mouse cursor.
Ample performance, vsync: ISRG------------|VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds
Running without vsync on a very fast system will deliver better latency, but only over a fraction of the screen, and with visible tear lines. The impact of the tear lines are related to the disparity between the two frames that are being torn between, and the amount of time that the tear lines are visible. Tear lines look worse on a continuously illuminated LCD than on a CRT or laser projector, and worse on a 60 fps display than a 120 fps display. Somewhat counteracting that, slow switching LCD panels blur the impact of the tear line relative to the faster displays.
If enough frames were rendered such that each scan line had a unique image, the effect would be of a “rolling shutter”, rather than visible tear lines, and the image would feel continuous. Unfortunately, even rendering 1000 frames a second, giving approximately 15 bands on screen separated by tear lines, is still quite objectionable on fast switching displays, and few scenes are capable of being rendered at that rate, let alone 60x higher for a true rolling shutter on a 1080P display.
Ample performance, unsynchronized: ISRG VVVVV ..... latency 5 – 8 milliseconds at ~200 frames per second
In most cases, performance is a constant point of concern, and a parallel pipelined architecture is adopted to allow multiple processors to work in parallel instead of sequentially. Large command buffers on GPUs can buffer an entire frame of drawing commands, which allows them to overlap the work on the CPU, which generally gives a significant frame rate boost at the expense of added latency.
CPU:ISSSSSRRRRRR----| GPU: |GGGGGGGGGGG----| VID: | |VVVVVVVVVVVVVVVV| .................................. latency 32 – 48 milliseconds
When the CPU load for the simulation and rendering no longer fit in a single frame, multiple CPU cores can be used in parallel to produce more frames. It is possible to reduce frame execution time without increasing latency in some cases, but the natural split of simulation and rendering has often been used to allow effective pipeline parallel operation. Work queue approaches buffered for maximum overlap can cause an additional frame of latency if they are on the critical user responsiveness path.
CPU1:ISSSSSSSS-------| CPU2: |RRRRRRRRR-------| GPU : | |GGGGGGGGGG------| VID : | | |VVVVVVVVVVVVVVVV| .................................................... latency 48 – 64 milliseconds
Even if an application is running at a perfectly smooth 60 fps, it can still have host latencies of over 50 milliseconds, and an application targeting 30 fps could have twice that. Sensor and display latencies can add significant additional amounts on top of that, so the goal of 20 milliseconds motion-to-photons latency is challenging to achieve.
Latency Reduction Strategies
Prevent GPU buffering
The drive to win frame rate benchmark wars has led driver writers to aggressively buffer drawing commands, and there have even been cases where drivers ignored explicit calls to glFinish() in the name of improved “performance”. Today’s fence primitives do appear to be reliably observed for drawing primitives, but the semantics of buffer swaps are still worryingly imprecise. A recommended sequence of commands to synchronize with the vertical retrace and idle the GPU is:
While this should always prevent excessive command buffering on any conformant driver, it could conceivably fail to provide an accurate vertical sync timing point if the driver was transparently implementing triple buffering.
To minimize the performance impact of synchronizing with the GPU, it is important to have sufficient work ready to send to the GPU immediately after the synchronization is performed. The details of exactly when the GPU can begin executing commands are platform specific, but execution can be explicitly kicked off with glFlush() or equivalent calls. If the code issuing drawing commands does not proceed fast enough, the GPU may complete all the work and go idle with a “pipeline bubble”. Because the CPU time to issue a drawing command may have little relation to the GPU time required to draw it, these pipeline bubbles may cause the GPU to take noticeably longer to draw the frame than if it were completely buffered. Ordering the drawing so that larger and slower operations happen first will provide a cushion, as will pushing as much preparatory work as possible before the synchronization point.
Run GPU with minimal buffering: CPU1:ISSSSSSSS-------| CPU2: |RRRRRRRRR-------| GPU : |-GGGGGGGGGG-----| VID : | |VVVVVVVVVVVVVVVV| ................................... latency 32 – 48 milliseconds
Tile based renderers, as are found in most mobile devices, inherently require a full scene of command buffering before they can generate their first tile of pixels, so synchronizing before issuing any commands will destroy far more overlap. In a modern rendering engine there may be multiple scene renders for each frame to handle shadows, reflections, and other effects, but increased latency is still a fundamental drawback of the technology.
High end, multiple GPU systems today are usually configured for AFR, or Alternate Frame Rendering, where each GPU is allowed to take twice as long to render a single frame, but the overall frame rate is maintained because there are two GPUs producing frames
Alternate Frame Rendering dual GPU: CPU1:IOSSSSSSS-------|IOSSSSSSS-------| CPU2: |RRRRRRRRR-------|RRRRRRRRR-------| GPU1: | GGGGGGGGGGGGGGGGGGGGGGGG--------| GPU2: | | GGGGGGGGGGGGGGGGGGGGGGG---------| VID : | | |VVVVVVVVVVVVVVVV| .................................................... latency 48 – 64 milliseconds
Similarly to the case with CPU workloads, it is possible to have two or more GPUs cooperate on a single frame in a way that delivers more work in a constant amount of time, but it increases complexity and generally delivers a lower total speedup.
An attractive direction for stereoscopic rendering is to have each GPU on a dual GPU system render one eye, which would deliver maximum performance and minimum latency, at the expense of requiring the application to maintain buffers across two independent rendering contexts.
The downside to preventing GPU buffering is that throughput performance may drop, resulting in more dropped frames under heavily loaded conditions.
Late frame scheduling
Much of the work in the simulation task does not depend directly on the user input, or would be insensitive to a frame of latency in it. If the user processing is done last, and the input is sampled just before it is needed, rather than stored off at the beginning of the frame, the total latency can be reduced.
It is very difficult to predict the time required for the general simulation work on the entire world, but the work just for the player’s view response to the sensor input can be made essentially deterministic. If this is split off from the main simulation task and delayed until shortly before the end of the frame, it can remove nearly a full frame of latency.
Late frame scheduling: CPU1:SSSSSSSSS------I| CPU2: |RRRRRRRRR-------| GPU : |-GGGGGGGGGG-----| VID : | |VVVVVVVVVVVVVVVV| .................... latency 18 – 34 milliseconds
Adjusting the view is the most latency sensitive task; actions resulting from other user commands, like animating a weapon or interacting with other objects in the world, are generally insensitive to an additional frame of latency, and can be handled in the general simulation task the following frame.
The drawback to late frame scheduling is that it introduces a tight scheduling requirement that usually requires busy waiting to meet, wasting power. If your frame rate is determined by the video retrace rather than an arbitrary time slice, assistance from the graphics driver in accurately determining the current scanout position is helpful.
An alternate way of accomplishing a similar, or slightly greater latency reduction Is to allow the rendering code to modify the parameters delivered to it by the game code, based on a newer sampling of user input.
At the simplest level, the user input can be used to calculate a delta from the previous sampling to the current one, which can be used to modify the view matrix that the game submitted to the rendering code.
Delta processing in this way is minimally intrusive, but there will often be situations where the user input should not affect the rendering, such as cinematic cut scenes or when the player has died. It can be argued that a game designed from scratch for virtual reality should avoid those situations, because a non-responsive view in a HMD is disorienting and unpleasant, but conventional game design has many such cases.
A binary flag could be provided to disable the bypass calculation, but it is useful to generalize such that the game provides an object or function with embedded state that produces rendering parameters from sensor input data instead of having the game provide the view parameters themselves. In addition to handling the trivial case of ignoring sensor input, the generator function can incorporate additional information such as a head/neck positioning model that modified position based on orientation, or lists of other models to be positioned relative to the updated view.
If the game and rendering code are running in parallel, it is important that the parameter generation function does not reference any game state to avoid race conditions.
View bypass: CPU1:ISSSSSSSSS------| CPU2: |IRRRRRRRRR------| GPU : |--GGGGGGGGGG----| VID : | |VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds
The input is only sampled once per frame, but it is simultaneously used by both the simulation task and the rendering task. Some input processing work is now duplicated by the simulation task and the render task, but it is generally minimal.
The latency for parameters produced by the generator function is now reduced, but other interactions with the world, like muzzle flashes and physics responses, remain at the same latency as the standard model.
A modified form of view bypass could allow tile based GPUs to achieve similar view latencies to non-tiled GPUs, or allow non-tiled GPUs to achieve 100% utilization without pipeline bubbles by the following steps:
Inhibit the execution of GPU commands, forcing them to be buffered. OpenGL has only the deprecated display list functionality to approximate this, but a control extension could be formulated.
All calculations that depend on the view matrix must reference it independently from a buffer object, rather than from inline parameters or as a composite model-view-projection (MVP) matrix.
After all commands have been issued and the next frame has started, sample the user input, run it through the parameter generator, and put the resulting view matrix into the buffer object for referencing by the draw commands.
Kick off the draw command execution.
Tiler optimized view bypass: CPU1:ISSSSSSSSS------| CPU2: |IRRRRRRRRRR-----|I GPU : | |-GGGGGGGGGG-----| VID : | | |VVVVVVVVVVVVVVVV| .................. latency 16 – 32 milliseconds
Any view frustum culling that was performed to avoid drawing some models may be invalid if the new view matrix has changed substantially enough from what was used during the rendering task. This can be mitigated at some performance cost by using a larger frustum field of view for culling, and hardware clip planes based on the culling frustum limits can be used to guarantee a clean edge if necessary. Occlusion errors from culling, where a bright object is seen that should have been occluded by an object that was incorrectly culled, are very distracting, but a temporary clean encroaching of black at a screen edge during rapid rotation is almost unnoticeable.
If you had perfect knowledge of how long the rendering of a frame would take, some additional amount of latency could be saved by late frame scheduling the entire rendering task, but this is not practical due to the wide variability in frame rendering times.
Late frame input sampled view bypass: CPU1:ISSSSSSSSS------| CPU2: |----IRRRRRRRRR--| GPU : |------GGGGGGGGGG| VID : | |VVVVVVVVVVVVVVVV| .............. latency 12 – 28 milliseconds
However, a post processing task on the rendered image can be counted on to complete in a fairly predictable amount of time, and can be late scheduled more easily. Any pixel on the screen, along with the associated depth buffer value, can be converted back to a world space position, which can be re-transformed to a different screen space pixel location for a modified set of view parameters.
After drawing a frame with the best information at your disposal, possibly with bypassed view parameters, instead of displaying it directly, fetch the latest user input, generate updated view parameters, and calculate a transformation that warps the rendered image into a position that approximates where it would be with the updated parameters. Using that transform, warp the rendered image into an updated form on screen that reflects the new input. If there are two dimensional overlays present on the screen that need to remain fixed, they must be drawn or composited in after the warp operation, to prevent them from incorrectly moving as the view parameters change.
Late frame scheduled time warp: CPU1:ISSSSSSSSS------| CPU2: |RRRRRRRRRR----IR| GPU : |-GGGGGGGGGG----G| VID : | |VVVVVVVVVVVVVVVV| .... latency 2 – 18 milliseconds
If the difference between the view parameters at the time of the scene rendering and the time of the final warp is only a change in direction, the warped image can be almost exactly correct within the limits of the image filtering. Effects that are calculated relative to the screen, like depth based fog (versus distance based fog) and billboard sprites will be slightly different, but not in a manner that is objectionable.
If the warp involves translation as well as direction changes, geometric silhouette edges begin to introduce artifacts where internal parallax would have revealed surfaces not visible in the original rendering. A scene with no silhouette edges, like the inside of a box, can be warped significant amounts and display only changes in texture density, but translation warping realistic scenes will result in smears or gaps along edges. In many cases these are difficult to notice, and they always disappear when motion stops, but first person view hands and weapons are a prominent case. This can be mitigated by limiting the amount of translation warp, compressing or making constant the depth range of the scene being warped to limit the dynamic separation, or rendering the disconnected near field objects as a separate plane, to be composited in after the warp.
If an image is being warped to a destination with the same field of view, most warps will leave some corners or edges of the new image undefined, because none of the source pixels are warped to their locations. This can be mitigated by rendering a larger field of view than the destination requires; but simply leaving unrendered pixels black is surprisingly unobtrusive, especially in a wide field of view HMD.
A forward warp, where source pixels are deposited in their new positions, offers the best accuracy for arbitrary transformations. At the limit, the frame buffer and depth buffer could be treated as a height field, but millions of half pixel sized triangles would have a severe performance cost. Using a grid of triangles at some fraction of the depth buffer resolution can bring the cost down to a very low level, and the trivial case of treating the rendered image as a single quad avoids all silhouette artifacts at the expense of incorrect pixel positions under translation.
Reverse warping, where the pixel in the source rendering is estimated based on the position in the warped image, can be more convenient because it is implemented completely in a fragment shader. It can produce identical results for simple direction changes, but additional artifacts near geometric boundaries are introduced if per-pixel depth information is considered, unless considerable effort is expended to search a neighborhood for the best source pixel.
If desired, it is straightforward to incorporate motion blur in a reverse mapping by taking several samples along the line from the pixel being warped to the transformed position in the source image.
Reverse mapping also allows the possibility of modifying the warp through the video scanout. The view parameters can be predicted ahead in time to when the scanout will read the bottom row of pixels, which can be used to generate a second warp matrix. The warp to be applied can be interpolated between the two of them based on the pixel row being processed. This can correct for the “waggle” effect on a progressively scanned head mounted display, where the 16 millisecond difference in time between the display showing the top line and bottom line results in a perceived shearing of the world under rapid rotation on fast switching displays.
Continuously updated time warping
If the necessary feedback and scheduling mechanisms are available, instead of predicting what the warp transformation should be at the bottom of the frame and warping the entire screen at once, the warp to screen can be done incrementally while continuously updating the warp matrix as new input arrives.
Continuous time warp: CPU1:ISSSSSSSSS------| CPU2: |RRRRRRRRRRR-----| GPU : |-GGGGGGGGGGGG---| WARP: | W| W W W W W W W W| VID : | |VVVVVVVVVVVVVVVV| ... latency 2 – 3 milliseconds for 500hz sensor updates
The ideal interface for doing this would be some form of “scanout shader” that would be called “just in time” for the video display. Several video game systems like the Atari 2600, Jaguar, and Nintendo DS have had buffers ranging from half a scan line to several scan lines that were filled up in this manner.
Without new hardware support, it is still possible to incrementally perform the warping directly to the front buffer being scanned for video, and not perform a swap buffers operation at all.
A CPU core could be dedicated to the task of warping scan lines at roughly the speed they are consumed by the video output, updating the time warp matrix each scan line to blend in the most recently arrived sensor information.
GPUs can perform the time warping operation much more efficiently than a conventional CPU can, but the GPU will be busy drawing the next frame during video scanout, and GPU drawing operations cannot currently be scheduled with high precision due to the difficulty of task switching the deep pipelines and extensive context state. However, modern GPUs are beginning to allow compute tasks to run in parallel with graphics operations, which may allow a fraction of a GPU to be dedicated to performing the warp operations as a shared parameter buffer is updated by the CPU.
View bypass and time warping are complementary techniques that can be applied independently or together. Time warping can warp from a source image at an arbitrary view time / location to any other one, but artifacts from internal parallax and screen edge clamping are reduced by using the most recent source image possible, which view bypass rendering helps provide.
Actions that require simulation state changes, like flipping a switch or firing a weapon, still need to go through the full pipeline for 32 – 48 milliseconds of latency based on what scan line the result winds up displaying on the screen, and translational information may not be completely faithfully represented below the 16 – 32 milliseconds of the view bypass rendering, but the critical head orientation feedback can be provided in 2 – 18 milliseconds on a 60 hz display. In conjunction with low latency sensors and displays, this will generally be perceived as immediate. Continuous time warping opens up the possibility of latencies below 3 milliseconds, which may cross largely unexplored thresholds in human / computer interactivity.
Conventional computer interfaces are generally not as latency demanding as virtual reality, but sensitive users can tell the difference in mouse response down to the same 20 milliseconds or so, making it worthwhile to apply these techniques even in applications without a VR focus.
A particularly interesting application is in “cloud gaming”, where a simple client appliance or application forwards control information to a remote server, which streams back real time video of the game. This offers significant convenience benefits for users, but the inherent network and compression latencies makes it a lower quality experience for action oriented titles. View bypass and time warping can both be performed on the server, regaining a substantial fraction of the latency imposed by the network. If the cloud gaming client was made more sophisticated, time warping could be performed locally, which could theoretically reduce the latency to the same levels as local applications, but it would probably be prudent to restrict the total amount of time warping to perhaps 30 or 40 milliseconds to limit the distance from the source images.
Zenimax for allowing me to publish this openly.
Hillcrest Labs for inertial sensors and experimental firmware.
Emagin for access to OLED displays.
Oculus for a prototype Rift HMD.
Nvidia for an experimental driver with access to the current scan line number.