3D Graphics with OpenGL ES and M3G, Part 17

144 PERFORMANCE AND SCALABILITY CHAPTER 6

Table 6.1: Performance checklist.

Check item                                                   | OK answer | Applicability
Do you use full-screen window surfaces?                      | Yes       | ALL
Do you use glReadPixels?                                     | No        | ALL
Do you use eglCopyBuffers?                                   | No        | MOST
Do you use glCopyTex(Sub)Image2D?                            | No        | MOST
Do you change texture data of existing textures?             | No        | ALL
Do you load textures during the rendering pass?              | No        | MOST
Do you use render-to-texture results during the same frame?  | No        | SOME
Do you clear the whole depth buffer at the start of a frame? | Yes       | SOME
Do you use mipmapping?                                       | Yes       | ALL
Do you use vertex buffer objects?                            | Yes       | ALL
Do you use texture compression?                              | Yes       | SOME
Is any unnecessary state enabled?                            | No        | ALL
Do you use auto mipmap generation or change filter modes?    | No        | SOME
Do you use perspective correction?                           | No        | SOME (SW)
Do you use bilinear or trilinear filtering?                  | No        | SOME (SW)
Do you use floating-point vertex data?                       | No        | SOME

Table 6.2: Quality checklist.

Check item                                | OK answer | Applicability
Do you use multisampling?                 | Yes       | MOST (HW)
Do you use LINEAR_MIPMAP_NEAREST?         | Yes       | MOST (HW)
Do you have enough depth buffer bits?     | Yes       | ALL
Do you have enough color buffer bits?     | Yes       | ALL
Have you enabled perspective correction?  | Yes       | ALL

Table 6.3: Power usage checklist.

Check item                                                  | OK answer | Applicability
Do you terminate EGL when the application is idling?        | Yes       | MOST (HW)
Do you track the focus and halt rendering if focus is lost? | Yes       | ALL
Do you limit your frame rate?                               | Yes       | ALL

Table 6.4: Portability checklist.

Check item                                                   | OK answer | Applicability
Do you use writable static data?                             | No        | SOME (OS)
Do you handle display layout changes?                        | Yes       | SOME (OS)
Do you depend on pixmap surface support?                     | No        | SOME
Do you use EGL from another thread than main?                | No        | SOME
Do you specify the surface type when asking for a config?    | Yes       | MOST
Do you require an exact number of samples for multisampling? | No        | SOME

6.3 CHANGING AND QUERYING THE STATE

Modern rendering pipelines are one-way streets: data keeps flowing in, it gets buffered, number-crunching occurs, and eventually some pixels come out. State changes and dynamic state queries are operations that disturb this flow. In the worst case a client-server round trip is required. For example, if the application wants to read back the contents of the color buffer, the application (the "client") has to stall until the graphics hardware (the "server") has processed all of the buffered primitives, and the buffers in modern hardware, especially on tile-based devices, can be very long. An example of an extreme state change is modifying the contents of a texture map mid-frame, as this may lead to internal duplication of the image data by the underlying driver.

While some state changes are unavoidable in any realistic application, you should steer clear of dynamic state queries if possible. Applications should shadow the relevant state in their own code rather than query it from the graphics driver; e.g., the application should know whether a particular light source is enabled or not. Dynamic queries should only be utilized when keeping an up-to-date copy of the graphics driver's state is cumbersome, for example when combining application code with third-party middleware libraries that communicate directly with the underlying OpenGL ES or M3G layers. If for some reason dynamic state queries are absolutely needed, they should all be executed together once per frame, so that only a single pipeline stall is generated.

Smaller state changes, such as operations that alter the transformation and lighting pipeline or the fragment processing, affect the performance in various ways. Changing state that is typically set only during initialization, such as the size of the viewport or scissor rectangle, may cause a pipeline flush and may therefore be costly.
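As a sketch of this state-shadowing idea, the fragment below keeps a client-side copy of the enable bits and touches the driver only when a value actually changes. The `driver_enable` and `driver_disable` functions are hypothetical stand-ins for glEnable/glDisable, instrumented with a call counter purely for illustration:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for glEnable()/glDisable(); in a real
 * application these would be the OpenGL ES calls themselves. */
static int driver_calls = 0;
static void driver_enable(int cap)  { (void)cap; ++driver_calls; }
static void driver_disable(int cap) { (void)cap; ++driver_calls; }

#define CAP_COUNT 8
static bool shadow[CAP_COUNT];   /* client-side copy of the enable bits */

/* Set a capability, calling into the driver only on an actual change. */
void set_state(int cap, bool on)
{
    if (shadow[cap] == on)
        return;                  /* redundant change: no driver call */
    shadow[cap] = on;
    if (on) driver_enable(cap); else driver_disable(cap);
}

/* The query never reaches the driver, so it cannot cause a stall. */
bool get_state(int cap) { return shadow[cap]; }
```

With shadowing in place, glIsEnabled-style round trips disappear, and redundant enables are filtered out on the client side as a bonus.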
State changes and under-the-hood synchronization may also happen when an application uses different APIs to access the same graphics resources. For example, you may be tempted to mix 2D and 3D functionality provided by different APIs. This is more than likely to be extremely slow, as the entire 3D pipeline may have to be completely flushed before the 2D operations can take place, and vice versa. The implementations of the graphics libraries may well come from different vendors, and their interaction can therefore be nonoptimal. This is a significant problem in the Java world, as the whole philosophy of Java programming is to be able to mix and match different libraries.

6.3.1 OPTIMIZING STATE CHANGES

The rule of thumb for all state changes is to minimize the number of stalls they create. This means that changes should be grouped and executed together. An easy way to do this is to group related state changes into "shaders" (we use the term here to indicate a collection of distinct pieces of the rendering state, corresponding roughly to the Appearance class of M3G), and to organize the rendering so that all objects sharing a shader are rendered together.

It is a good idea to expose this shader-based approach in the artists' modeling tools as well. If one lets the artists tweak attributes that can create state changes, the end result is likely to be a scene where each object has slightly different materials and fragment pipelines, and the application needs to perform a large number of state changes to render the objects. It is therefore better to just let the artists pick shaders from a predefined list. Also, it is important to be aware that the more complex a shader is, the slower it is likely to be.
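A minimal sketch of this grouping, assuming each object carries an application-defined shader identifier: sorting the render queue by that identifier makes all objects sharing a shader render back to back, so each distinct shader is bound only once per frame.

```c
#include <stdlib.h>

typedef struct { int shader_id; int mesh_id; } RenderItem;

static int by_shader(const void *a, const void *b)
{
    const RenderItem *x = a, *y = b;
    return (x->shader_id > y->shader_id) - (x->shader_id < y->shader_id);
}

/* Sort the queue so objects sharing a shader are adjacent, then count
 * how many shader binds a single pass over the queue would issue. */
int sort_and_count_binds(RenderItem *items, int n)
{
    int binds = 0, current = -1, i;
    qsort(items, (size_t)n, sizeof items[0], by_shader);
    for (i = 0; i < n; ++i) {
        if (items[i].shader_id != current) {
            current = items[i].shader_id;
            ++binds;             /* a state change happens here */
        }
    }
    return binds;
}
```

For the five-item queue in the test below, rendering in submission order would change shaders five times; after sorting, only twice.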
Even though graphics hardware may perform some operations "for free" due to its highly parallel nature, in a software implementation everything has an associated cost: enabling texture mapping is going to take dozens of CPU cycles for every pixel rendered, bilinear filtering of textures is considerably more expensive than point sampling, and using blending or fog will definitely slow down a software renderer. For this reason, it is crucial that the application disables all operations that are not going to have an impact on the final rendered image. As an example, it is typical that applications draw overlay images after the 3D scene has been rendered. People often forget to disable the fog operation when drawing the overlays, as the fog usually does not affect objects placed at the near clipping plane. However, the underlying rendering engine does not know this, and has to perform the expensive fog computations for every pixel rendered. Disabling the fog for the overlays in this case may have a significant performance impact.

In general, simplifying shaders is more important for software implementations of the rendering pipeline, whereas keeping the number of state changes low is more important for GPUs.

6.4 MODEL DATA

The way the vertex and triangle data of the 3D models is organized has a significant impact on the rendering performance. Although the internal caching rules vary from one rendering pipeline implementation to another, straightforward rules of thumb for the presentation of data exist: keep vertex and triangle data short and simple, and make as few rendering calls as possible.

In addition to the layout and format of the vertex and triangle data used, where the data is stored plays an important role. If it is stored in the client's memory, the application has more flexibility to modify the data dynamically.
However, since the data is then transferred from the client to the server during every render call, the server loses its opportunity to optimize and analyze the data. On the other hand, when the mesh data is stored by the server, it is possible to perform even expensive analysis of the data, as the cost is amortized over multiple rendering operations. In general, one should always use such server-stored buffer objects whenever the rendering API provides them. OpenGL ES supports buffer objects from version 1.1 onward, and M3G implementations may use them in a completely transparent fashion.

6.4.1 VERTEX DATA

Optimization of model data is an offline process that is best performed in the exporting pipeline of a modeling tool. The most important optimization is vertex welding, that is, finding shared vertices and removing all but one of them. In a finely tessellated grid each vertex is shared by six triangles. This means an effective vertices-per-triangle ratio of 0.5. For many real-life meshes, ratios between 0.6 and 1.0 are obtained. This is a major improvement over the naive approach of using three individual vertices for each triangle, i.e., a ratio of 3.0. The fastest and easiest way to implement welding is to utilize a hash table where vertices are hashed based on their attributes, i.e., position, normal, texture coordinates, and color.

Any reasonably complex 3D scene will use large amounts of memory for storing its vertex data. To reduce the consumption, one should always try to use the smallest data formats possible, i.e., bytes and shorts instead of integers. Because quantization of floating-point vertex coordinates into a smaller fixed-point representation may introduce artifacts and gaps between objects, controlling the quantization should be made explicit in the modeling and exporting pipeline.
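The welding step described above can be sketched as follows. For clarity this version compares vertices with a linear search rather than a hash table; a real exporter would hash the attribute bits (position, normal, texture coordinates, color), but the output is identical.

```c
#include <string.h>

/* One vertex: position, normal, and texture coordinates. */
typedef struct { float p[3], n[3], uv[2]; } Vertex;

/* Reference welding: emit each distinct vertex once and build an index
 * buffer referring to the deduplicated array. Returns the number of
 * unique vertices. O(n^2) here; a hash table makes it near O(n). */
int weld(const Vertex *in, int n, Vertex *out, unsigned short *indices)
{
    int unique = 0, i, j;
    for (i = 0; i < n; ++i) {
        for (j = 0; j < unique; ++j)
            if (memcmp(&in[i], &out[j], sizeof(Vertex)) == 0)
                break;            /* already emitted: reuse its index */
        if (j == unique)
            out[unique++] = in[i];
        indices[i] = (unsigned short)j;
    }
    return unique;
}
```

A quad expressed as two triangles (six vertices) welds down to four vertices plus six indices, matching the vertices-per-triangle argument above.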
All interconnecting "scene" geometry could be represented with a higher accuracy (16-bit coordinates), and all smaller and moving objects could be expressed with lower accuracy (8-bit coordinates). For vertex positions this quantization is typically done by scanning the axis-aligned bounding box of an object, re-scaling the bounding [min, max] range of each axis into [−1, +1], and converting the resulting values into signed fixed-point values. Vertex normals usually survive quantization into 8 bits per component rather well, whereas texture coordinates often require 16 bits per component. In general, one should always prefer integer formats over floating-point ones, as they are likely to be processed faster by the transformation and lighting pipeline.

Favoring small formats has another advantage: when vertex data needs to be copied over to the rendering hardware, less memory bandwidth is needed to transfer smaller data elements. This improves the performance of applications running on top of both hardware and software renderers. Also, in order to increase cache coherency, one should interleave vertex data if possible. This means that all data of a single vertex is stored together in memory, followed by all of the data of the next vertex, and so forth.

6.4.2 TRIANGLE DATA

An important offline optimization is ordering the triangle data in a coherent way so that subsequent triangles share as many vertices as possible. Since we cannot know the exact rules of the vertex caching algorithm used by the graphics driver, we need to come up with a generally good ordering. This can be achieved by sorting the triangles so that they refer to vertices that have been encountered recently. Once the triangles have been sorted in a coherent fashion, the vertex indices are remapped and the vertex arrays are re-indexed to match the order of referral. In other words, the first triangle should have the indices 0, 1, and 2.
Assuming the second triangle shares an edge with the first one, it will introduce one new vertex, which in this scheme gets the index 3. The subsequent triangles then refer to these vertices and introduce new vertices 4, 5, 6, and so forth.

The triangle index array can be expressed in several different formats: triangle lists, strips, and fans. Strips and fans have the advantage that they use fewer indices per triangle than triangle lists. However, you need to watch out that you do not create too many rendering calls. You can "stitch" two disjoint strips together by replicating the last vertex of the first strip and the first vertex of the second strip, which creates two degenerate triangles in the middle. In general, using indexed rendering allows you to take full advantage of vertex caching, and you should sort the triangles as described above. Whether triangle lists or strips perform better depends on the implementation, and you should measure your platform to find out the winner.

6.5 TRANSFORMATION PIPELINE

Because many embedded devices lack floating-point units, the transformation pipeline can easily become the bottleneck as matrix manipulation operations need to be performed using emulated floating-point operations. For this reason it is important to minimize the number of times the matrix stack is modified. Also, expressing all object vertex data in fixed point rather than floating point can produce savings, as a much simpler transformation pipeline can then be utilized.

6.5.1 OBJECT HIERARCHIES

When an artist models a 3D scene she typically expresses the world as a complex hierarchy of nodes. Objects are not just collections of triangles. Instead, they have internal structure, and often consist of multiple subobjects, each with its own materials, transformation matrices, and other attributes.
This flexible approach makes a lot of sense when modeling a world, but it is not an optimal presentation for the rendering pipeline, as unnecessary matrix processing is likely to happen. A better approach is to create a small piece of code that is executed when the data is exported from the modeling tool. This code should find objects in the same hierarchy sharing the same transformation matrices and shaders, and combine them together. The code should also "flatten" static transformation hierarchies, i.e., premultiply hierarchical transformations together. Also, if the scene contains a large number of replicated static objects, such as low-polygon-count trees forming a forest or the kinds of props shown in Figure 6.9, it makes sense to combine the objects into a single larger one by transforming all of the objects into the same coordinate space.

6.5.2 RENDERING ORDER

The rendering order of objects has implications for the rendering performance. In general, objects should be rendered in an approximate front-to-back order. The reason for this is that the z-buffering algorithm used for hidden surface removal can quickly discard covered fragments. If the occluding objects are rasterized first, many of the hidden fragments require less processing. Modern GPUs often perform the depth buffering in a hierarchical fashion, discarding hidden blocks of 4 × 4 or 8 × 8 pixels at a time. The best practical way to exploit this early culling is to sort the objects of a scene in a coarse fashion. Tile-based rendering architectures, such as MBX from Imagination Technologies and Mali from ARM, buffer the scene geometry before the rasterization stage and are thus able to perform the hidden surface removal efficiently regardless of the object ordering. However, other GPU architectures can benefit greatly if the objects are in a rough front-to-back order.
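A coarse front-to-back ordering can be produced with a plain per-object sort; a sketch, assuming the application has already computed an eye-space depth value for each visible object:

```c
#include <stdlib.h>

typedef struct { float depth; int id; } DrawItem;  /* depth: eye-space distance */

static int front_to_back(const void *a, const void *b)
{
    float da = ((const DrawItem *)a)->depth;
    float db = ((const DrawItem *)b)->depth;
    return (da > db) - (da < db);
}

/* Sort nearest first, so early depth rejection can discard fragments
 * of the objects drawn later behind them. */
void sort_front_to_back(DrawItem *items, int n)
{
    qsort(items, (size_t)n, sizeof items[0], front_to_back);
}
```

Per-object center distance is sufficient here; exact ordering does not matter since the depth buffer resolves the remaining overlaps.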
Depth ordering is not the only important sorting criterion: the state changes should be kept to a minimum as well. This suggests that one should first group objects based on their materials and shaders, and then render the groups in depth order.

Figure 6.9: Low-polygon in-game objects. (Images copyright © Digital Chocolate.)

Figure 6.10: Occlusion culling applied to a complex urban environment consisting of thousands of buildings. Left: view frustum intersecting a city as seen from a third-person view. Right: wireframe images of the camera's view without (top) and with (bottom) occlusion culling. Here culling reduces the number of objects rendered by a factor of one hundred. (Image copyright © NVIDIA.)

6.5.3 CULLING

Conservative culling strategies are ones that reduce the number of rendered objects without introducing any artifacts. Frustum culling is used to remove objects falling outside the view frustum, and occlusion culling to discard objects hidden completely by others. Frustum culling is best performed using conservatively computed bounding volumes for objects. This can be further optimized by organizing the scene graph into a bounding volume hierarchy and performing the culling using the hierarchy. Frustum culling is a trivial optimization to implement, and should be used by any rendering application; practically all scene graph engines support this, including all real-world M3G implementations. Occlusion culling algorithms, on the other hand, are complex, and often difficult to implement (see Figure 6.10). Of the various different algorithms, two are particularly suited for handheld 3D applications: precomputed Potentially Visible Sets (PVSs) and portal rendering. Both have modest run-time CPU requirements [Air90, LG95].

When an application simply has too much geometry to render, aggressive culling strategies need to be employed.
There are several different options for choosing which objects are not rendered. Commonly used methods include distance-based culling, where faraway objects are discarded, and detail culling, where objects having small screen footprints after projection are removed. Distance-based culling creates annoying popping artifacts, which are often reduced either by bringing the far clipping plane closer, by using fog effects to mask the transition, or by using distance-based alpha blending to fade faraway objects into full transparency. The popping can also be reduced by level-of-detail rendering, i.e., by switching to simplified versions of an object as its screen area shrinks.

6.6 LIGHTING

The fixed-functionality lighting pipeline of OpenGL ES and M3G is fairly limited in its capabilities, and it inherits the basic problems inherent in the original OpenGL lighting model. The fundamental problem is that it is vertex-based, and thus fine tessellation of meshes is required to reduce the artifacts caused by sparse lighting sampling. Also, the lighting model used in the mobile APIs is somewhat simplified; some important aspects, such as properly modeled specular illumination, have been omitted. Driver implementations of the lighting pipeline are notoriously poor, and often very slow except for a few hand-optimized fast paths. In practice a good bet is that a single directional light will be properly accelerated, and more complex illumination has a good chance of hitting slower code paths. In any case the cost will increase at least linearly with the number of lights, and the more complex lighting features you use, the slower your application runs.

When the vertex lighting pipeline is utilized, you should always attempt to simplify its workload. For example, prenormalizing vertex normals is likely to speed up the lighting computations.
In a similar fashion, you should avoid using truly homogeneous vertex positions, i.e., those that have w components other than zero or one, as these require a more complex lighting pipeline. Specular illumination computations of any kind are rather expensive, so disabling them may increase the performance. The same advice applies to distance attenuation: disabling it is likely to result in performance gains. However, if attenuating light sources are used, a potential optimization is completely disabling faraway lights that contribute little or nothing to the illumination of an object. This can be done using trivial bounding sphere overlap tests between the objects and the light sources.

6.6.1 PRECOMPUTED ILLUMINATION

The quality problems of the limited OpenGL lighting model will disappear once programmable shaders are supported, though even then you will pay the execution-time penalty of complex lighting models and of multiple light sources. However, with the fixed-functionality pipelines of OpenGL ES 1.x and M3G 1.x one should primarily utilize texture-based and precomputed illumination, and try to minimize the application's reliance on the vertex-based lighting pipeline.

For static lighting, precomputed vertex-based illumination is a cheap and good option. The lighting is computed only once as a part of the modeling phase, and the vertex illumination is exported along with the mesh. This may also reduce the memory consumption of the meshes, as vertex normals do not need to be exported if dynamic lighting is omitted. OpenGL ES supports a concept called color material tracking, which allows changing a material's diffuse or ambient component separately for each vertex of a mesh. This allows combining precomputed illumination with dynamic vertex-based lighting.
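The bounding-sphere overlap test mentioned above needs no square roots. In this sketch the light's radius is an application-chosen range beyond which its attenuated contribution is considered negligible (the threshold is an assumption, not something the APIs define):

```c
typedef struct { float x, y, z, r; } Sphere;

/* Squared-distance test between an object's bounding sphere and a
 * light's sphere of influence; lights failing the test can simply be
 * disabled while rendering that object. */
int light_touches_object(const Sphere *light, const Sphere *object)
{
    float dx = light->x - object->x;
    float dy = light->y - object->y;
    float dz = light->z - object->z;
    float rsum = light->r + object->r;
    return dx*dx + dy*dy + dz*dz <= rsum*rsum;
}
```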
6.7 TEXTURES

Texturing plays an especially important role in mobile graphics, as it makes it possible to push lighting computations from the vertex pipeline to the fragment pipeline. This reduces the pressure to tessellate geometry. Also, it is more likely that the fragment pipeline is accelerated; several commonly deployed hardware accelerators, such as MBX Lite, perform the entire transformation and lighting pipeline on the CPU but have fast pixel-processing hardware.

Software and hardware implementations of texture mapping have rather different performance characteristics. A software implementation will take a serious performance hit whenever linear blending between mipmap levels or texels is used. Also, disabling perspective-correct texture interpolation may result in considerable speed-ups when a software rasterizer is used. Mipmapping, on the other hand, is almost always a good idea, as it makes texture caching more efficient for both software and hardware implementations.

It should be kept in mind that modifying texture data almost always has a significant negative performance impact. Because rendering pipelines are generally deeply buffered, there are two things that a driver may do when a texture is modified by the application. Either the entire pipeline is flushed, which means that the client and the server cannot execute in parallel, or the texture image and associated mipmap levels need to be duplicated. In either case, the performance is degraded. The latter case also temporarily increases the driver's memory usage.

Multi-texturing should always be preferred over multi-pass rendering. There are several good reasons for this. Z-fighting artifacts can be avoided this way, as the textures are combined before the color buffer write is performed. Also, the number of render state changes is reduced, and an expensive alpha blending pass is avoided altogether. Finally, the number of draw calls is reduced by half.
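The memory cost of mipmapping is also modest: a full pyramid adds roughly one third over the base level. A sketch for budgeting texture memory offline:

```c
/* Total texel count of a full mipmap pyramid for a w x h base level;
 * the overhead over the base level alone approaches 1/3. */
int mipmap_chain_texels(int w, int h)
{
    int total = 0;
    for (;;) {
        total += w * h;
        if (w == 1 && h == 1)
            break;                /* reached the 1x1 top level */
        if (w > 1) w /= 2;
        if (h > 1) h /= 2;
    }
    return total;
}
```

For a 256 × 256 base level the full chain holds 87381 texels, i.e., about 1.33 times the 65536 texels of the base level alone.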
6.7.1 TEXTURE STORAGE

Both OpenGL ES and M3G abstract out completely how the driver caches textures internally. However, the application still has some control over the data layout, and this may have a huge impact on performance. Deciding the correct sizes for texture maps, and combining smaller maps used together into a single larger texture, can be significant optimizations. The "correct size" is the one where the texture map looks good under typical viewing conditions; in other words, one where the ratio between the texture's texels and the screen's pixels approaches 1.0. Using a larger texture map is a waste of memory. A smaller one just deteriorates the quality.

The idea of combining multiple textures into a single texture map is an important one, and is often used when rendering fonts, animations, or light maps. Such texture atlases are also commonly used for storing the different texture maps used by a complex object (see Figure 6.11). This technique allows switching between texture maps without actually performing a state change; only the texture coordinates of the object need to vary. Long strings of text or complex objects using multiple textures can thus be rendered using a single rendering call.

Texture image data is probably the most significant consumer of memory in a graphics-intensive application. As the memory capacity of a mobile device is still often rather limited, it is important to pay attention to the texture formats and layouts used. Both OpenGL ES and M3G provide support for compressed texture formats, although only via palettes and vendor-specific extensions. Nevertheless, compressed formats should be utilized whenever possible. Only in cases where the artifacts generated by the compression are visually disturbing, or when the texture is often modified, should noncompressed formats be used. Even then, 16-bit texture formats should be favored over 32-bit ones.
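Converting 32-bit source art to a 16-bit format is a simple offline step; a sketch packing an 8-bit-per-channel color into the RGB565 layout by keeping the most significant bits of each channel:

```c
/* Pack 8-bit-per-channel RGB into 16-bit RGB565: the top 5 bits of
 * red and blue and the top 6 bits of green survive. */
unsigned short rgb888_to_rgb565(unsigned char r, unsigned char g, unsigned char b)
{
    return (unsigned short)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

The same idea applies to RGBA4444 or RGBA5551 when an alpha channel is needed; dithering during the conversion can hide most of the banding.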
Also, one should take advantage of the intensity-only and alpha-only formats in cases where the texture data is monochrome. In addition to saving valuable RAM, the use of compressed textures reduces the internal memory bandwidth, which in turn is likely to improve the rendering performance.

Figure 6.11: An example of automatically packing textures into a texture atlas (refer to Section 6.7.1). Image courtesy of Bruno Levy. (See the color plate.)
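When sub-images are packed into an atlas as in Figure 6.11, the exporter must remap each mesh's texture coordinates into atlas space; a sketch, keeping in mind that sub-images inside an atlas cannot use REPEAT wrapping:

```c
/* Remap a sub-image's [0,1] texture coordinates into atlas space.
 * (x, y, w, h) is the sub-image placement in pixels inside an atlas
 * of atlas_w x atlas_h pixels; uv holds `count` (u,v) pairs. */
void remap_to_atlas(float *uv, int count,
                    int x, int y, int w, int h,
                    int atlas_w, int atlas_h)
{
    int i;
    for (i = 0; i < count; ++i) {
        uv[2*i + 0] = ((float)x + uv[2*i + 0] * (float)w) / (float)atlas_w;
        uv[2*i + 1] = ((float)y + uv[2*i + 1] * (float)h) / (float)atlas_h;
    }
}
```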
