Tuning OpenGL performance for geometry throughput

This has probably been asked over and over but I couldn't find anything useful so here it goes again...

In my application I need to render a fairly large mesh (a couple of million triangles or more) and I'm having some problems getting decent frame rates out of it. The CPU is pretty much idling so I'm definitely GPU-bound. Changing the resolution doesn't affect performance, so it's not fragment- or raster-bound.

The mesh is dynamic (but locally static) so I cannot store the whole thing on the video card and render it with one call. For application-specific reasons the data is stored as an octree with voxels in the leaves, which means I get frustum culling basically for free. The vertex data consists of coordinates, normals and colors; no textures or shaders are used.

My first approach was to just render everything from memory using one big STREAM_DRAW VBO, which turned out to be too slow. My initial thought was that I was perhaps overtaxing the bus (pushing ~150 MiB per frame), so I implemented a caching scheme that keeps geometry recently used to render the object in static VBOs on the graphics card, with each VBO holding anywhere from a few hundred KiB to a couple of MiB of data (storing more per VBO gives more cache thrashing, so there's a trade-off here). The picture below is an example of what the data looks like, where everything colored red is drawn from cached VBOs.

[Image: example of the mesh, with geometry drawn from cached VBOs colored red (source: sourceforge.net)]

As the numbers below show, I don't see a spectacular increase in performance when using the cache. For a fully static mesh of about 1 million triangles I get the following frame rates:

Without caching: 1.95 Hz
Caching using vertex arrays: 2.0 Hz (>75% of the mesh is cached)
Caching using STATIC_DRAW VBOs: 2.4 Hz
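
To put those frame rates in perspective, they can be converted to triangle throughput by multiplying by the triangle count. This is only a rough sketch: the function name is made up for illustration, and the mesh size is "about 1 million triangles", so the result is approximate.

```c
#include <assert.h>

/* Hypothetical helper: triangles submitted per second is simply
 * triangles per frame times frames per second. */
double triangles_per_second(double triangles_per_frame, double frame_rate_hz)
{
    assert(triangles_per_frame >= 0.0 && frame_rate_hz >= 0.0);
    return triangles_per_frame * frame_rate_hz;
}
```

With roughly 1,000,000 triangles at 2.4 Hz this comes to about 2.4 million triangles per second for the best case above.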

So my question is: how do I speed this up? Specifically:

What's the recommended vertex format to get decent performance? I use interleaved storage with positions and normals as GL_FLOAT and GL_UNSIGNED_BYTE for colors, with one padding byte to get 4-byte alignment (28 bytes/vertex total).
Would using the same normal buffer for all my boxes help? (All boxes are axis-aligned, so I can allocate a normal buffer the size of the largest cache entry and use it for them all.)
How do I know which part of the pipeline is the bottleneck? I don't have a spectacular video card (Intel GM965 with open source Linux drivers) so it's possible that I hit its limit. How much throughput can I expect from typical hardware (2-3 year old integrated graphics, modern integrated graphics, modern discrete graphics)?
Any other tips on how you would tackle this, pitfalls, etc.
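
For reference, the 28-byte interleaved layout described in the first question can be sketched as a C struct. The struct and constant names are invented for illustration, and the sketch assumes a 4-byte float, which the static assertion checks at compile time.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */

/* Hypothetical name; mirrors the interleaved layout described above. */
typedef struct Vertex {
    float         position[3]; /* GL_FLOAT x 3, bytes 0..11          */
    float         normal[3];   /* GL_FLOAT x 3, bytes 12..23         */
    unsigned char color[3];    /* GL_UNSIGNED_BYTE x 3, bytes 24..26 */
    unsigned char pad;         /* one padding byte, byte 27          */
} Vertex;

/* Fails to compile if the platform lays the struct out differently. */
static_assert(sizeof(Vertex) == 28, "expected 28 bytes per vertex");

/* Stride and byte offsets as they would be passed to
 * glVertexPointer, glNormalPointer and glColorPointer. */
enum {
    VERTEX_STRIDE   = sizeof(Vertex),
    POSITION_OFFSET = offsetof(Vertex, position),
    NORMAL_OFFSET   = offsetof(Vertex, normal),
    COLOR_OFFSET    = offsetof(Vertex, color)
};
```

With a VBO bound, the offsets are passed as the pointer argument and VERTEX_STRIDE as the stride, so all three attributes are read from one interleaved buffer.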

I'm not interested in answers suggesting LOD (I already tested this), vendor-specific tips or using OpenGL features from anything later than 1.5.

You're probably not going to like this response....

I've found your problem: Intel GM965 with open source Linux drivers.

While my current job does not hit your volume of data, we've rendered several million vertices in VBOs, and Intel graphics hardware and drivers have proven useless for that. Get yourself an NVidia card (and get over having to use the binary driver; it just works) and you'll be all set. It doesn't even have to be current generation, though a top-end Quadro (if work is paying) or a top-end GTX 400 series card (if you're paying or just trying to save a few bucks at work) should do just fine with the latest drivers. You could also try to find a machine with this hardware to test on if upgrading your own machine is not an option.

I would use a performance profiler first (like gDEBugger), so you can figure out if you are vertex, fragment or bus limited, etc. It's hard to guess what optimizations to perform in such a particular case (Intel + open source drivers).

Did you also try VA mode? Are you using glDrawElements? glDrawArrays? Is the data vertex-cache friendly (pre and post transform)?

I don't know about your "mesh", but it seems to consist entirely of cubes. If that is possible for you, compile a single unit cube into a display list and render a scaled copy of that display list for each box. That often gives a 10x speedup, since the bus is not flooded with vertex data and video memory is not exhausted.

Of course that depends on your ability to change the data. It might not apply if the real data doesn't look like the picture.
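
As a rough sketch of the unit-cube idea, the geometry can be defined once and each axis-aligned box reduced to a translation and a scale. All names here are hypothetical, and the code only shows the data and the mapping, not the OpenGL calls. Note that 8 shared corners are enough for positions, but per-face normals would need 24 vertices (4 per face).

```c
#include <assert.h>

/* Corners of the unit cube; index = x + 2*y + 4*z. */
const float unit_cube_corners[8][3] = {
    {0, 0, 0}, {1, 0, 0}, {0, 1, 0}, {1, 1, 0},
    {0, 0, 1}, {1, 0, 1}, {0, 1, 1}, {1, 1, 1}
};

/* 12 triangles (2 per face), counter-clockwise when seen from outside. */
const unsigned short unit_cube_indices[36] = {
    0, 2, 3,  0, 3, 1,   /* -z */
    4, 5, 7,  4, 7, 6,   /* +z */
    0, 1, 5,  0, 5, 4,   /* -y */
    2, 7, 3,  2, 6, 7,   /* +y */
    0, 4, 6,  0, 6, 2,   /* -x */
    1, 3, 7,  1, 7, 5    /* +x */
};

/* Hypothetical description of one axis-aligned box (e.g. a voxel). */
typedef struct Box {
    float min[3];   /* translation: lowest corner of the box */
    float size[3];  /* scale: edge length along each axis    */
} Box;

/* World-space position of one corner: min + unit_corner * size. */
void box_corner(const Box *box, int corner, float out[3])
{
    assert(corner >= 0 && corner < 8);
    for (int axis = 0; axis < 3; ++axis)
        out[axis] = box->min[axis]
                  + unit_cube_corners[corner][axis] * box->size[axis];
}
```

With a display list, the same mapping would be applied per box with glTranslatef and glScalef before glCallList, so only six floats per box cross the bus instead of the full vertex data.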