134 PERFORMANCE AND SCALABILITY CHAPTER 6

of writing boasts VGA true color displays, powered by dedicated GPUs and 600MHz multicore ARM11 processors with vector floating-point units. Currently only the expensive smart phones have dedicated graphics processors, but the situation is changing rapidly with ever-cheaper GPU designs entering the feature phone market.

Programming standards such as OpenGL ES attempt to unify the variety of devices by providing a common interface for accessing the underlying graphics architecture: they act as hardware abstraction layers. This is important, as now the set of available graphics features is reasonably constant from the programmer’s point of view. Apart from the API and feature set these standards unify a third important factor: the underlying rendering model. Both OpenGL ES and M3G build on the shoulders of desktop OpenGL by adopting its rendering paradigms as well as its well-specified and documented pipeline.

So, even though a programmer can assume to have more or less the same feature set on a low-end and a high-end device, use the same APIs to program both, and have some expectations about the rendering quality, one thing cannot be guaranteed: performance.

6.1 SCALABILITY

When building a scalable 3D application two major factors need to be taken into account. First of all, the application should have maximum graphics performance; no major bottlenecks or loss of performance should exist. This is extremely important as the lowest-end mobile phones being targeted have very limited capabilities. The second thing to consider is identifying all aspects of the rendering process that can be scaled.
Scaling in this context means that once an application runs adequately on the lowest-end device being targeted, the application can be made more interesting on devices that have better rendering performance by adding geometric detail, using higher-quality textures, more complex special effects, better screen resolution, more accurate physics, more complex game logic, and so forth. In other words, you should always scale applications upward by adding eye candy, because the opposite, that is, downscaling a complex application, is much more difficult to accomplish.

3D content is reasonably easy to scale using either automated or manually controlled offline tools. For example, most modeling packages support automatic generation of low-polygon-count models. This allows exporting the same scene using different triangle budgets. Methods such as texture-based illumination, detail textures, and bump mapping make it possible to use fewer triangles to express complex shapes; these were covered earlier in Section 3.4.3. Texture maps are highly scalable, and creating smaller textures is a trivial operation supported by all image-editing programs. The use of compressed texture formats [BAC96, Fen03, SAM05] reduces the memory requirements even further. Figure 6.1 illustrates how few triangles are needed for creating a compelling 3D game.

Figure 6.1: Low-polygon models from a golf game by Digital Chocolate.

6.1.1 SPECIAL EFFECTS

Most game applications contain highly scalable visual elements that do not have any impact on the game play. For example, bullet holes on walls, skid marks left by a race car, and drifting clouds in the sky are eye candy that could be reduced or dropped altogether without altering the fundamentals of the game. Whether a special effect is a game play element depends on the context.
As an example, fog is often used to mask the popping rendering artifacts caused by geometric level-of-detail optimizations and culling of distant objects. It is also a visual effect that makes scenes moodier and more atmospheric. On the other hand, fog may make enemies more difficult to spot in a shooter game; removing the fog would clearly affect the game play. Ensuring that the game play is not disturbed is especially important in multiplayer games as players should not need to suffer from unfair disadvantages due to scaling of special effects.

If you want to expose performance controls to the user, special effects are one of the prime candidates for this. Most users can understand the difference between rendering bullet holes and not rendering them, whereas having to make a choice between bilinear and trilinear filtering is not for the uninitiated.

One family of effects that can be made fully scalable are particle systems such as explosions, water effects, flying leaves, or fire, as shown in Figure 6.2. The number of particles, the complexity of the particle simulation, and the associated visuals can all be scaled based on the graphics capabilities of the device. Furthermore, one can allocate a shared budget for all particle systems: this ensures that the load on the graphics system is controlled dynamically, and that the maximum load can be bounded. A similar approach is often used for sounds, e.g., during an intense firefight the more subtle sound effects are skipped, as they would get drowned by the gunshots anyway.

Figure 6.2: Particle effects can be used to simulate natural phenomena, such as fire, that are not easily represented as polygonal surfaces. (Image copyright © AMD.)

6.1.2 TUNING DOWN THE DETAILS

Other scalable elements include noncritical detail objects and background elements. In many 3D environments the most distant elements are rendered using 2D backdrops instead of true 3D objects.
In this technique faraway objects are collapsed into a single panoramic sky cube at the expense of losing parallax effects between and within those objects. Similarly, multi-pass detail textures can be omitted on low-end devices.

The method selected for rendering shadows is another aspect that can be scaled. On a high-performance device it may be visually pleasing to use stencil shadows [Cro77, EK02] for some or all of the game objects. This is a costly approach, and less photorealistic methods, such as rendering shaded blobs under the main characters, should be utilized on less capable systems. Again, one should be careful to make sure that shadows are truly just a visual detail, as in some games they can affect the game play.

6.2 PERFORMANCE OPTIMIZATION

The most important thing to do when attempting to optimize the performance of an application is profiling. Modern graphics processors are complex devices, and the interaction between them and other hardware and software components of the system is not trivial. This makes predicting the impact of program optimizations difficult. The only effective way to find out how changes in the program code affect application performance is to measure it.

The tips and tricks provided in this chapter are good rules of thumb but by no means gospel. Following these rules is likely to increase overall rendering performance on most devices, but the task of identifying device-specific bottlenecks is always left to the application programmer. Problems in performance particular to a phone model often arise from system integration issues rather than deficiencies in the rendering hardware. This means that the profiling code must be run on the actual target device; it is not sufficient just to obtain similar hardware.

Publicly available benchmark programs such as those from FutureMark¹ or JBenchmark² are useful for assessing approximate graphics processing performance of a device.
However, they may not pinpoint individual bottlenecks that may ruin the performance of a particular application.

Performance problems of a 3D graphics application can be classified into three groups: pixel pipeline, vertex pipeline, and application bottlenecks. These groups can then be further partitioned into different pipeline stages. The overall pipeline runs only as fast as its slowest stage, which forms a bottleneck. However, regardless of the source of the bottleneck, the strategy for dealing with one is straightforward (see Figure 6.3). First, you should locate the bottleneck. Then, you should try to eliminate it and move to the next one.

Locating bottlenecks for a single rendering task is simple. You should go through each pipeline stage and reduce its workload. If the performance changes significantly, you have found the bottleneck. Otherwise, you should move to the next pipeline stage. However, it is good to understand that the bottleneck often changes within a single frame that contains multiple different primitives. For example, if the application first renders a group of lines and afterward a group of lit and shaded triangles, we can expect the bottleneck to change.

In the following we study the main pipeline groups in more detail.

6.2.1 PIXEL PIPELINE

Whether an application’s performance is bound by the pixel pipeline can be found out by changing the rendering resolution; this is most easily done by scaling the viewport. If the performance scales directly with the screen resolution, the bottleneck is in the pixel pipeline. After this, further testing is needed for identifying the exact pipeline stage (Figure 6.4).

To determine if memory bandwidth is the limiting factor, you should try using smaller pixel formats for the different buffers and textures, or disable texturing altogether. If a performance difference is observed, you are likely to be bandwidth-bound.
Other factors contributing to the memory bandwidth include blending operations and depth buffering. Try disabling these features to see if there is a difference. Another culprit for slow fragment processing may be the texture filtering used. Test the application with nonfiltered textures to find out if the performance increases.

¹ www.futuremark.com
² www.jbenchmark.com

Figure 6.3: Determining whether the bottleneck is in application processing, buffer swapping, geometry processing, or fragment processing.

Figure 6.4: Finding the performance bottleneck in fill rate limited rendering.

To summarize: in order to speed up an application where the pixel pipeline is the bottleneck, you have to either use a smaller screen resolution, render fewer objects, use simpler data formats, utilize smaller texture maps, or perform less complex fragment and texture processing. Many of these optimizations are covered in more detail later in this chapter.
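
The top-level search for a bottleneck can be written down as a small decision procedure. The C sketch below is one reading of the three tests named in Figure 6.3: eliminate all draw calls; then only clear, draw one small triangle, and swap; then render the full scene into an 8 × 8 pixel viewport. The function and type names, and the 10% threshold for deciding that a test ran "faster", are illustrative assumptions rather than anything defined by OpenGL ES. The frame times are assumed to have been measured by the application itself on the actual target device.

```c
#include <assert.h>

/* Where the frame time is being spent (categories follow Figure 6.3). */
typedef enum {
    LIMITED_BY_APPLICATION,
    LIMITED_BY_BUFFER_SWAP,
    LIMITED_BY_PIXEL_PROCESSING,
    LIMITED_BY_GEOMETRY_PROCESSING
} Bottleneck;

/* Assumption: a test counts as "faster" only if it cuts the frame time by
 * more than 10%; smaller changes are treated as measurement noise. */
#define FASTER_RATIO 0.90

int is_faster(double baseline_ms, double test_ms)
{
    return test_ms < baseline_ms * FASTER_RATIO;
}

/*
 * baseline_ms        frame time of the unmodified application
 * no_draw_ms         frame time with all draw calls eliminated
 * minimal_draw_ms    frame time when only clearing, drawing one small
 *                    triangle, and swapping
 * small_viewport_ms  frame time of the full scene in an 8 x 8 viewport
 */
Bottleneck classify_bottleneck(double baseline_ms, double no_draw_ms,
                               double minimal_draw_ms,
                               double small_viewport_ms)
{
    assert(baseline_ms > 0.0);

    /* Removing the draw calls did not help: the time goes elsewhere. */
    if (!is_faster(baseline_ms, no_draw_ms))
        return LIMITED_BY_APPLICATION;

    /* Graphics-limited, but a near-empty frame is just as slow. */
    if (!is_faster(baseline_ms, minimal_draw_ms))
        return LIMITED_BY_BUFFER_SWAP;

    /* Rendering-limited: shrinking the viewport removes most pixel work. */
    if (is_faster(baseline_ms, small_viewport_ms))
        return LIMITED_BY_PIXEL_PROCESSING;

    return LIMITED_BY_GEOMETRY_PROCESSING;
}
```

For instance, a 40 ms frame that drops to 10 ms without draw calls and to 15 ms in the small viewport would be classified as limited by pixel processing. Because the bottleneck can change within a frame, such a classification is best applied to one rendering task at a time.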
6.2.2 VERTEX PIPELINE

Bottlenecks in the vertex pipeline can be found by making two tests (Figure 6.5). First, you should try rendering only every other triangle but keeping the vertex arrays used intact. Second, you should try to reduce the complexity of the transformation and lighting pipeline. If both of these changes show performance improvements, the application is bound by vertex processing. If only the reduced triangle count shows a difference, we have a submission bottleneck, i.e., we are bound by how fast the vertex and primitive data can be transferred from the application.

When analyzing the vertex pipeline, you should always scale the viewport to make the rendering resolution small in order to keep the cost of pixel processing to a minimum. A good size for the current mobile phone display resolutions would be 8 × 8 pixels or so. A resolution smaller than this might cause too many triangles to become subpixel-sized; optimized drivers would cull them and skip their vertex processing, complicating the analysis.

Figure 6.5: Finding the performance bottleneck in geometry-limited rendering.

Submission bottlenecks can be addressed by using smaller data formats, by organizing the vertices and primitives in a more cache-friendly manner, by storing the data on the server rather than in the client, and of course by using simplified meshes that have fewer triangles. On the other hand, if vertex processing is the cause for the slowdown, the remedy is to reduce complexity in the transformation and lighting pipeline. This is best done by using fewer and simpler light sources, or avoiding dynamic lighting altogether.
Also, disabling fog, providing prenormalized vertex normals, and avoiding the use of texture matrices and floating-point vertex data formats are likely to reduce the geometry workload.

6.2.3 APPLICATION CODE

Finally, it may be that the bottleneck is not in the rendering part at all. Instead, the application code itself may be slow. To determine if this is the case, you should turn off all application logic, i.e., just execute the code that performs the per-frame rendering. If significant performance differences can be observed, you have an application bottleneck. Alternatively, you could just comment out all rendering calls, e.g., glDrawElements in OpenGL ES. If the frame rate does not change much, the application is not rendering-bound.

A more fine-grained analysis is needed for pinpointing the slow parts in an application. The best tool for this analysis is a profiler that shows how much time is spent in each function or line of code. Unfortunately, hardware profilers for real mobile phones are both very expensive and difficult to obtain. This means that applications need to be either executed on other similar hardware (Lauterbach boards³ are commonly used) or compiled and executed on a desktop computer where software-based profilers are readily available. When profiling an application on anything except the real target device, the data you get is only indicative. However, it may give you valuable insights into where time is potentially spent in the application and into the complexities of the algorithms used, and it may even reveal some otherwise hard-to-find bugs.

As floating-point code tends to be emulated on many embedded devices, slowdowns are often caused by innocent-looking routines that perform math processing for physics simulation or game logic. Rewriting these sections using integer arithmetic may yield significant gains in performance. Appendix A provides an introduction to fixed-point programming.
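
To give a flavor of what such a rewrite involves, here is a minimal sketch of 16.16 fixed-point arithmetic in C, the same bit layout that OpenGL ES uses for its GLfixed type. The type and function names are illustrative and are not taken from Appendix A. The sketch assumes that a 64-bit intermediate type is available, and it does not guard against results that fall outside the 16.16 range.

```c
#include <assert.h>
#include <stdint.h>

/* 16.16 fixed point: upper 16 bits integer part, lower 16 bits fraction. */
typedef int32_t fixed_t;

#define FIXED_ONE 65536 /* 1.0 in 16.16 format */

/* Convert an integer to fixed point; the value must fit in 16 bits. */
fixed_t fixed_from_int(int32_t i)
{
    assert(i >= -32768 && i <= 32767);
    return (fixed_t)(i * FIXED_ONE);
}

/* Multiply two fixed-point values. The raw product carries 32 fraction
 * bits, so it is computed in 64 bits and scaled back down. Division by
 * FIXED_ONE (rather than a right shift) keeps the result well-defined
 * for negative operands; it truncates toward zero. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) / FIXED_ONE);
}

/* Divide two fixed-point values: pre-scale the dividend so that the
 * quotient keeps its 16 fraction bits. */
fixed_t fixed_div(fixed_t a, fixed_t b)
{
    assert(b != 0);
    return (fixed_t)(((int64_t)a * FIXED_ONE) / b);
}

/* Example use: one Euler integration step, position += velocity * dt,
 * written without any floating-point operations. */
fixed_t integrate_position(fixed_t position, fixed_t velocity, fixed_t dt)
{
    return position + fixed_mul(velocity, dt);
}
```

With these helpers, a velocity of 2.0 units per second integrated over half a second, integrate_position(0, fixed_from_int(2), FIXED_ONE / 2), moves the position by exactly 1.0, i.e., FIXED_ONE. Whether such a rewrite pays off on a given device is, as always, something to confirm by measurement.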
Java programs have their own performance-related pitfalls. These are covered in more detail in Appendix B.

³ www.lauterbach.com

6.2.4 PROFILING OPENGL ES APPLICATIONS

Before optimizing your code you should always clean it up. This means that you should first fix all graphics-related errors, i.e., make sure no OpenGL ES errors are raised. Then you should take a look at the OpenGL ES call logs generated by your application. You will need a separate tool for this: we will introduce one below. From the logs you will get the list of OpenGL ES API calls made by your application. You should verify that they are what you expect, and remove any redundant ones. At this stage you should trap typical programming mistakes such as clearing the buffers multiple times, or enabling unnecessary rendering states.

One potentially useful commercial tool for profiling your application is gDEBugger ES from Graphic Remedy.⁴ It is an OpenGL ES debugger and profiler that traces application activity on top of the OpenGL ES APIs to provide the application behavior information you need to find bugs and to optimize application performance (see Figure 6.6). gDEBugger ES essentially transforms the debugging task of graphics applications from a “black box” into a “white box” model; it lets you peer inside the OpenGL ES usage to see how individual commands affect the graphics pipeline implementation. The profiler enables viewing context state variables (Figure 6.7), texture data and properties, performance counters, and OpenGL ES function call history. It allows adding breakpoints on OpenGL ES commands, forcing the application’s raster mode and render target, and breaking on OpenGL ES errors.

Another useful tool for profiling the application code is the Carbide IDE from Nokia for S60 and UIQ Symbian devices. With commercial versions of Carbide you can do on-target debugging, performance profiling, and power consumption analysis.
See Figure 6.8 for an example view of the performance investigator.

Figure 6.6: gDEBugger ES is a tool for debugging and profiling the OpenGL ES graphics driver.

⁴ www.gremedy.com

Figure 6.7: gDEBugger ES showing the state variables of the OpenGL ES context.

6.2.5 CHECKLISTS

This section provides checklists for reviewing a graphics application for high performance, quality, portability, and lower power usage. Tables 6.1–6.4 contain questions that should be asked in a review, and the “correct” answers to those questions. The applicability of each issue is characterized as ALL, MOST, or SOME to indicate whether the question applies to practically all implementations and platforms, or just some of them. For example, on some platforms enabling perspective correction does not reduce performance while on others you will have to pay a performance penalty. Note that even though we are using OpenGL ES and EGL terminology and function names in the tables, most of the issues also apply to M3G.

Figure 6.8: Carbide showing one of the performance analysis views. (Image copyright © Nokia.)

Table 6.1 contains a list of basic questions to go through for a quick performance analysis. The list is by no means exhaustive, but it contains the most common pitfalls that cause performance issues.

A checklist of features affecting rendering quality can be found in Table 6.2. Questions in the table highlight quality settings that improve quality but do not have any negative performance impact on typical graphics hardware. However, the impact on software implementations may be severe.

In a similar fashion, Table 6.3 provides checks for efficient power usage, and finally, Table 6.4 covers programming practices and features that may cause portability problems.