22 Performance

22.010 What do I need to know about performance?

First, read chapters 11 through 14 of the book OpenGL on Silicon Graphics Systems. Although some of the information is SGI machine specific, most of the information applies to OpenGL programming on any platform. It's invaluable reading for the performance-minded OpenGL programmer.

Consider a performance tuning analogy: A database application spends 5 percent of its time looking up records and 95 percent of its time transmitting data over a network. The database developer decides to tune the performance. He sits down and looks at the code for looking up records and sees that with a few simple changes he can reduce the time it’ll take to look up records by more than 50 percent. He makes the changes, compiles the database, and runs it. To his dismay, there's little or no noticeable performance increase!

What happened? The developer didn't identify the bottleneck before he began tuning. The most important thing you can do when attempting to boost your OpenGL program’s performance is to identify where the bottleneck is.

Graphics applications can be bound in several places. Generally speaking, bottlenecks fall into three broad categories: CPU limited, geometry limited, and fill limited.

CPU limited is a general term. Specifically, it means performance is limited by the speed of the CPU. Your application may also be bus limited, in which case bus bandwidth prevents better performance. Cache size and the amount of RAM can also play a role in performance. For a truly CPU-limited application, performance will increase with a faster CPU. Another way to increase performance is to reduce your application’s demand on CPU resources.

A geometry limited application is bound by how fast the computer or graphics hardware can perform vertex computations, such as transformation, clipping, lighting, culling, vertex fog, and other OpenGL operations performed on a per vertex basis. For many very low-end graphics devices, this processing is performed in the CPU. In this case, the line between CPU limited and geometry limited becomes fuzzy. In general, CPU limited implies that the bottleneck is CPU processing unrelated to graphics.

In a fill-limited application, the rate you can render is limited by how fast your graphics hardware can fill pixels. To go faster, you'll need to find a way to either fill fewer pixels, or simplify how pixels are filled, so they can be filled at a faster rate.

It’s usually quite simple to discern whether your application is fill limited. Shrink the window size, and see if rendering speeds up. If it does, you're fill limited.

If you're not fill limited, then you're either CPU limited or geometry limited. One way to test for a CPU limitation is to change your code, so it repeatedly renders a static, precalculated scene. If the performance is significantly faster, you're dealing with a CPU limitation. The part of your code that calculates the scene or does other application-specific processing is causing your performance hit. You need to focus on tuning this part of your code.

If it's not fill limited and not CPU limited, congratulations! It's geometry limited. The per vertex features you’ve enabled or the sheer volume of vertices you're rendering is causing your performance hit. You need to reduce the geometry processing either by reducing the number of vertices or by reducing the calculations OpenGL must perform to process each vertex.
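The elimination process above can be sketched as a small decision function. This is only an illustration: the function name, the enum labels, and the threshold that decides what counts as "significantly faster" are assumptions, and the three frame times are measurements you would have to collect yourself.

```c
#include <assert.h>

/* Hypothetical labels for the three broad bottleneck categories. */
typedef enum { FILL_LIMITED, CPU_LIMITED, GEOMETRY_LIMITED } Bottleneck;

/*
 * base_ms:      frame time of the unmodified application
 * small_win_ms: frame time after shrinking the window
 * static_ms:    frame time when repeatedly rendering a precalculated scene
 * threshold:    fractional speedup treated as significant (an assumption),
 *               e.g. 0.15 means "at least 15 percent faster"
 */
Bottleneck classify_bottleneck(double base_ms, double small_win_ms,
                               double static_ms, double threshold)
{
    assert(base_ms > 0.0 && threshold > 0.0 && threshold < 1.0);
    double cutoff = base_ms * (1.0 - threshold);

    if (small_win_ms < cutoff)
        return FILL_LIMITED;     /* fewer pixels made rendering faster */
    if (static_ms < cutoff)
        return CPU_LIMITED;      /* removing scene calculation made it faster */
    return GEOMETRY_LIMITED;     /* neither test helped */
}
```

The order of the tests matters: the fill test comes first because it is the cheapest experiment to run.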

22.020 How can I measure my application's performance?

You usually do this by getting the system time, doing some rendering, and getting the system time again. The difference between the two time measurements tells you how long it took to render. You can do other quick calculations to determine frames per second, triangles per second, and vertices per second.
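As a minimal sketch of those quick calculations, the helper below turns two time readings and some per-frame counts into rates. The struct and function names are hypothetical, and the time values are assumed to be in seconds from whatever timer your platform provides. When you take the second reading in a real program, consider calling glFinish() first so that queued rendering commands have actually completed.

```c
#include <assert.h>

/* Hypothetical container for the computed rates. */
typedef struct {
    double frames_per_sec;
    double triangles_per_sec;
    double vertices_per_sec;
} RenderRates;

/* start_sec and end_sec are the two system time readings, in seconds. */
RenderRates compute_rates(double start_sec, double end_sec, long frames,
                          long triangles_per_frame, long vertices_per_frame)
{
    double elapsed = end_sec - start_sec;
    assert(elapsed > 0.0 && frames > 0);

    RenderRates r;
    r.frames_per_sec    = (double)frames / elapsed;
    r.triangles_per_sec = (double)frames * (double)triangles_per_frame / elapsed;
    r.vertices_per_sec  = (double)frames * (double)vertices_per_frame / elapsed;
    return r;
}
```

Timing many frames rather than one reduces the effect of timer resolution on the result.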

Calculating pixels per second is a little tougher. The easiest way to calculate it is to write a small benchmark program that renders primitives of a known pixel size.
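Once such a benchmark has rendered a known number of primitives, the arithmetic is simple. The sketch below assumes every primitive covers the same number of pixels and that none are clipped or rejected before being filled; the function name is hypothetical.

```c
#include <assert.h>

/*
 * primitives:           how many primitives the benchmark rendered
 * pixels_per_primitive: known pixel size of each primitive
 * elapsed_sec:          measured rendering time in seconds
 */
double pixels_per_second(long primitives, long pixels_per_primitive,
                         double elapsed_sec)
{
    assert(elapsed_sec > 0.0 && primitives >= 0 && pixels_per_primitive >= 0);
    return (double)primitives * (double)pixels_per_primitive / elapsed_sec;
}
```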

Some benchmark software is free to download. GLUT 3.7 comes with a benchmark called progs/bucciarelli/gltest that measures OpenGL rendering performance. You can also visit the Standard Performance Evaluation Corporation, which has many benchmarks you can download and the latest performance results from several OpenGL hardware vendors.

22.030 Which primitive type is the fastest?

GL_TRIANGLE_STRIP is generally recognized as the fastest OpenGL primitive type. Be aware that the primitive type might not make a difference unless you're geometry limited.
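One commonly cited reason is vertex count: a strip of n triangles needs n+2 vertices, while n independent triangles (GL_TRIANGLES) need 3n. The helper functions below are hypothetical and only illustrate that arithmetic; they say nothing about how a particular implementation performs.

```c
#include <assert.h>

/* Vertices sent for n triangles drawn as independent GL_TRIANGLES. */
long independent_triangle_vertices(long n)
{
    assert(n >= 0);
    return 3 * n;
}

/* Vertices sent for n triangles drawn as one GL_TRIANGLE_STRIP. */
long triangle_strip_vertices(long n)
{
    assert(n >= 0);
    return n > 0 ? n + 2 : 0;
}
```

For 100 triangles that is 300 vertices versus 102, which only matters if per vertex work is your bottleneck.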

22.040 What's the cost of redundant calls?

While some OpenGL implementations make redundant calls as cheap as possible, making redundant calls is generally considered bad practice. Certainly you shouldn't count on redundant calls being cheap. Good application developers avoid them when possible.
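One way an application might avoid redundant calls is to remember the last value it set and skip the call when nothing changes. The sketch below is an assumption-laden illustration, not part of OpenGL: StateCache, its functions, and the small integer capability indices are all hypothetical. In a real program, the callback would wrap glEnable() and glDisable().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED_CAPS 16

/* Callback that performs the real state change; may be NULL in tests. */
typedef void (*SetCapFn)(int cap, bool enabled);

typedef struct {
    SetCapFn set_cap;
    bool known[MAX_TRACKED_CAPS];    /* has this capability been set yet? */
    bool enabled[MAX_TRACKED_CAPS];  /* last value forwarded */
    long calls_forwarded;            /* number of non-redundant calls */
} StateCache;

void state_cache_init(StateCache *c, SetCapFn fn)
{
    assert(c != NULL);
    c->set_cap = fn;
    c->calls_forwarded = 0;
    for (int i = 0; i < MAX_TRACKED_CAPS; i++) {
        c->known[i] = false;
        c->enabled[i] = false;
    }
}

/* Forwards the change only if it differs from the cached value. */
void state_cache_set(StateCache *c, int cap, bool enabled)
{
    assert(c != NULL && cap >= 0 && cap < MAX_TRACKED_CAPS);
    if (c->known[cap] && c->enabled[cap] == enabled)
        return;                      /* redundant: skip the call */
    if (c->set_cap != NULL)
        c->set_cap(cap, enabled);
    c->known[cap] = true;
    c->enabled[cap] = enabled;
    c->calls_forwarded++;
}
```

A cache like this is only valid if every state change in the application goes through it.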

22.050 I have (n) lights on, and when I turn on (n+1), performance suddenly drops dramatically. What happened?

Your graphics device supports (n) lights in hardware, but because you turned on more lights than it supports, you were kicked off the hardware and are now rendering in software. Other than using fewer lights, the only solution to this problem is to buy better hardware.

22.060 I'm using (n) different texture maps, and when I start using (n+1) instead, performance drops drastically. What happened?

Your graphics device has a limited amount of dedicated texture map memory. Your (n) textures fit well in the texture memory, but there wasn't room left for any more texture maps. When you started using (n+1) textures, suddenly the device couldn't store all the textures it needed for a frame, and it had to swap them in from the computer’s system memory. The additional bus bandwidth required to download these textures in each frame killed your performance.

You might consider using smaller texture maps at the expense of image quality.
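To judge how much a smaller texture saves, you can roughly estimate its storage. The sketch below assumes uncompressed texels, no padding, and a mipmap chain that halves each dimension down to 1x1. Actual drivers may store textures differently, so treat the result as an estimate only. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Estimated bytes for one 2D texture.
 * bytes_per_texel: e.g. 4 for 8-bit RGBA (an assumption about storage)
 * mipmapped:       nonzero to include every mipmap level down to 1x1
 */
unsigned long texture_bytes(unsigned long width, unsigned long height,
                            unsigned long bytes_per_texel, int mipmapped)
{
    assert(width > 0 && height > 0 && bytes_per_texel > 0);
    unsigned long total = 0;
    for (;;) {
        total += width * height * bytes_per_texel;
        if (!mipmapped || (width == 1 && height == 1))
            break;
        if (width > 1)  width /= 2;   /* next, smaller mipmap level */
        if (height > 1) height /= 2;
    }
    return total;
}
```

Under these assumptions, halving both dimensions of a texture cuts its estimated storage to a quarter. Summing the estimates for all textures used in a frame gives a figure to compare against the device's texture memory.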

22.070 Why are glDrawPixels() and glReadPixels() so slow?

While performance of the OpenGL 2D path (as it's called) is acceptable on many higher-end UNIX workstation-class devices, some implementations (especially low-end, inexpensive consumer-level graphics cards) have never had good 2D path performance. One can only expect that corners were cut on these devices or in the device driver to bring their cost down and decrease their time to market. When this was written (early 2000), if you purchased a graphics device for under $500, chances were the OpenGL 2D path performance would be unacceptably slow.

If your graphics system should have decent performance but doesn’t, there are some steps you can take to boost the performance.

First, all glPixelTransfer() state should be set to its default values. Also, all glPixelStore() state should be set to its default values, with the exception of GL_PACK_ALIGNMENT and GL_UNPACK_ALIGNMENT (whichever is relevant), which should be set to 8. Your data pointer will need to be correspondingly double-word aligned.
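With an alignment of 8, each row of pixel data is expected to start on an 8-byte boundary, so rows may need padding. The following sketch computes the padded row size and checks a pointer's alignment. It assumes byte-sized components (such as GL_UNSIGNED_BYTE data), and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per row after padding to the given pack/unpack alignment. */
size_t aligned_row_bytes(size_t width, size_t bytes_per_pixel,
                         size_t alignment)
{
    /* 1, 2, 4, and 8 are the alignment values OpenGL accepts. */
    assert(alignment == 1 || alignment == 2 ||
           alignment == 4 || alignment == 8);
    size_t raw = width * bytes_per_pixel;
    return (raw + alignment - 1) / alignment * alignment;
}

/* Returns 1 if the data pointer sits on an alignment-byte boundary. */
int pointer_is_aligned(const void *p, size_t alignment)
{
    assert(alignment > 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

For example, a row of 5 RGBA pixels occupies 20 bytes, which pads to 24 bytes at an alignment of 8.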

Second, examine the parameters to glDrawPixels() or glReadPixels(). Do they correspond to the framebuffer layout? Think about how the framebuffer is configured for your application. For example, if you know you're rendering into a 24-bit framebuffer with eight bits of destination alpha, your format parameter should be GL_RGBA, and your type parameter should be GL_UNSIGNED_BYTE. If your format and type parameters don't correspond to the framebuffer configuration, it's likely you'll suffer a performance hit due to the per pixel processing that's required to translate your data between your parameter specification and the framebuffer format.

Finally, make sure you don't have unrealistic expectations. Know your system bus and memory bandwidth limitations.

22.080 Is it faster to use absolute coordinates or to use relative coordinates?

By using absolute (or “world”) coordinates, your application doesn't have to change the ModelView matrix as often. By using relative (or “object”) coordinates, you can cut down on data storage of redundant primitives or geometry.

A good analogy is an architectural software package that models a hotel. The hotel model has hundreds of thousands of rooms, most of which are identical. Certain features are identical in each room, and maybe each room has the same lamp or the same light switch or doorknob. The application might choose to keep only one doorknob model and change the ModelView matrix as needed to render the doorknob for each hotel room door. The advantage of this method is that data storage is minimized. The disadvantage is that many calls are made to change the ModelView matrix, which can reduce performance. Alternatively, the application could choose to keep a separate copy of the doorknob in memory for each door, each with its own set of absolute coordinates. These doorknobs could all be rendered with no change to the ModelView matrix. The advantage is the possibility of increased performance due to fewer matrix changes. The disadvantage is additional memory overhead. If memory overhead gets out of hand, paging can become an issue, which certainly will be a performance hit.

There is no clear answer to this question. It's model- and application-specific. You'll need to benchmark to determine which method is best for your model or application.
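If you want to benchmark the absolute-coordinate approach, you need a world-space copy of each object. A minimal sketch of that step follows, assuming an affine transform stored as 16 floats in OpenGL's column-major order. Vec3 and bake_to_world are hypothetical names. Normals are not handled here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/*
 * Transforms object-space vertices into world-space copies.
 * m is a 4x4 matrix in column-major order, assumed affine
 * (bottom row 0 0 0 1), so no perspective divide is performed.
 */
void bake_to_world(const float m[16], const Vec3 *object,
                   Vec3 *world, size_t count)
{
    assert(m != NULL && object != NULL && world != NULL);
    for (size_t i = 0; i < count; i++) {
        float x = object[i].x, y = object[i].y, z = object[i].z;
        world[i].x = m[0] * x + m[4] * y + m[8]  * z + m[12];
        world[i].y = m[1] * x + m[5] * y + m[9]  * z + m[13];
        world[i].z = m[2] * x + m[6] * y + m[10] * z + m[14];
    }
}
```

Each baked copy costs count * sizeof(Vec3) bytes, which is the memory overhead you trade against fewer ModelView matrix changes.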

22.090 Are display lists or vertex arrays faster?

Which is faster varies from system to system.

If your application isn't geometry limited, you might not see a performance difference at all between display lists, vertex arrays, or even immediate mode.