The Littlest CPU Rasterizer

Okay, so I’ve been saying for a long time that I’m going to blog about some of the stuff I’ve developed for my personal project, but I never seem to get around to it.

Finally I thought I’d just start off with a cute little trick: here’s how I’m doing “image-based” occlusion for my sky light. A little while ago, while I was thinking about ambient lighting, I realized I actually had a simple enough problem that I could render very small cubemaps, on the CPU, in vast numbers.

Here’s the basic underlying idea: A 16-by-16 black-and-white image is 256 bits, or 32 bytes. That’s exactly two SSE registers, or one AVX register.

That is, you can store tiny cubemaps whole in your CPU’s SIMD registers!
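To make that concrete, here’s a minimal sketch of the idea – plain `uint64_t` words stand in for the SSE registers, and `TinyImage`/`orImages` are names invented for the sketch:

```cpp
#include <cstdint>

// A 16x16 1-bit image: 16 rows of 16 bits = 256 bits = 32 bytes.
// The real code keeps these 32 bytes in two __m128i registers
// (or one AVX register); four 64-bit words stand in here.
struct TinyImage {
    uint64_t w[4];
};

// Compositing one cube's prerendered silhouette into the scene
// is nothing more than a bitwise OR.
inline TinyImage orImages(const TinyImage& a, const TinyImage& b) {
    TinyImage r;
    for (int i = 0; i < 4; ++i) r.w[i] = a.w[i] | b.w[i];
    return r;
}
```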

Next, I have a very limited problem domain. I’m working with a cube-world, so that helps. I’m only going to render from the cube vertices, so I have a very limited number of relative offsets between my camera and other cubes – few enough that I can actually exhaustively enumerate them.

So I wrote a little bit of code to render out 4096 (that’s 16x16x16) black-and-white images of cubes, and toss them into a header file. Here’s what that looks like:

static const U16 kCubeRenders[] = {
    
.... twenty thousand lines or so ....

    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000011100000000,
    0b0000011111000000,
    0b0000011111000000,
    0b0000001111000000,
    0b0000001111000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,

    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000011100000,
    0b0000000011111000,
    0b0000000011111000,
    0b0000000011111000,
    0b0000000001111000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,

    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000011100,
    0b0000000000011111,
    0b0000000000011111,
    0b0000000000011111,
    0b0000000000000111,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,
    0b0000000000000000,

.... and on, and on, and on ....
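The real generator renders a little perspective view of a cube at each of the 4096 offsets. As a toy stand-in, here’s a sketch that fills a plain axis-aligned rectangle of bits and prints it in the same `0b` row format – `rectRow`, `emitImage`, and the bit-15-is-leftmost convention are all assumptions for the sketch, not my actual tool:

```cpp
#include <cstdint>
#include <cstdio>

// One row of a toy "cube render": fills columns [x0, x1) on rows
// [y0, y1). Bit 15 is the leftmost pixel, matching the literals above.
uint16_t rectRow(int y, int x0, int x1, int y0, int y1) {
    if (y < y0 || y >= y1) return 0;
    uint16_t row = 0;
    for (int x = x0; x < x1; ++x) row |= uint16_t(0x8000u >> x);
    return row;
}

// Print all 16 rows in the 0b format used in the header file.
void emitImage(int x0, int x1, int y0, int y1) {
    for (int y = 0; y < 16; ++y) {
        uint16_t row = rectRow(y, x0, x1, y0, y1);
        printf("    0b");
        for (int b = 15; b >= 0; --b) putchar(((row >> b) & 1) ? '1' : '0');
        printf(",\n");
    }
}
```

This emits a solid rectangle rather than the slightly perspective-skewed squares in the real table, but the output format is the same.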

To render an image, I just zero two registers, then loop over the world searching for any filled cubes, and logical OR my 32-byte image with the prerendered images of any cubes I find. Essentially nothing but this:

// this is a renderer
for(I32 zz=zStart; zz<zEnd; ++zz)
{
    for(I32 yy=yStart; yy<yEnd; ++yy)
    {
        for(I32 xx=xStart; xx<xEnd; ++xx)
        {
            U8 terr = pBlock->Get(xx, yy, zz);
            if(terr != kTerrainTypeEmpty)
            {        
                U32 ind = (xx-xStart + (yy-yStart)*16 + (zz-zStart)*16*16) * 2;
        
                bits0 = _mm_or_si128(bits0, pRenders[ind]);
                bits1 = _mm_or_si128(bits1, pRenders[ind+1]);
            }
        }
    }
}

(That’s a bit of a simplification, since I actually render three cubemap faces at once. I also have special cases for cubes in the 0,0,0 position, that is, actually touching the camera. Cubes in that spot block the entire octant, so you can skip the whole render.)
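In portable scalar form – with plain `uint64_t` words standing in for the two SSE registers, and the `grid`/`renders` names invented for the sketch – the loop plus that origin early-out looks something like:

```cpp
#include <cstdint>

struct TinyImage { uint64_t w[4]; };  // 256-bit image, as before

// Sketch of one octant render. grid is the 16x16x16 neighborhood of
// the camera vertex, and renders is the prerendered table from the
// header file; both names are assumptions for the sketch.
TinyImage renderOctant(const uint8_t grid[16][16][16], uint8_t kEmpty,
                       const TinyImage renders[16 * 16 * 16]) {
    // A filled cube touching the camera blocks the whole octant,
    // so the render can be skipped entirely.
    if (grid[0][0][0] != kEmpty)
        return TinyImage{{~0ull, ~0ull, ~0ull, ~0ull}};

    TinyImage img = {{0, 0, 0, 0}};
    for (int z = 0; z < 16; ++z)
        for (int y = 0; y < 16; ++y)
            for (int x = 0; x < 16; ++x)
                if (grid[z][y][x] != kEmpty) {
                    const TinyImage& r = renders[x + y * 16 + z * 16 * 16];
                    for (int i = 0; i < 4; ++i) img.w[i] |= r.w[i];
                }
    return img;
}
```

The SSE version just swaps the word ORs for `_mm_or_si128`.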

So now I have a 1-bit cube map; all that remains is to convert it to a spherical harmonic. I’m using first-order harmonics – just 4 floats – because I think it looks fine and I’m cheap. For this I have 32-bit float cubemaps of the spherical harmonic bases, and I convolve them with the image using code along these lines:

// four copies of the first 32 bits of the image
const __m128i splat = _mm_shuffle_epi32(render[0],  _MM_SHUFFLE(0,0,0,0));  

// mask off everything from word 0 except bit 0, word 1 gets only bit 1, etc...
const __m128i mask = _mm_set_epi32( 1<<3,  1<<2,  1<<1,  1<<0);  // set_epi32 is high-word-first
const __m128i sel = _mm_and_si128(splat, mask);  

// set word 0 to 0xffffffff if image bit 0 was 0
const __m128i nmask = _mm_cmpeq_epi32(sel, _mm_set1_epi32(0));  

// mask the first four pixels of the SPH basis
const __m128 maskedImage = _mm_and_ps( sphBasis[0], _mm_castsi128_ps(nmask) ); 

// finally add the masked pixels to a running sum
sum = _mm_add_ps( sum, maskedImage );

This conversion is actually the slowest part of the whole technique – it takes more time than rendering the cubemap! – so if I’m lucky, this post will nerd-snipe somebody who actually knows anything about SSE and get me a faster routine. (The compiler does seem to be converting the constants into vector loads.)
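For anyone taking that bait, here’s the scalar reference version of what that SSE code computes: for every lit pixel, add that pixel’s four basis values (one per first-order SH coefficient) into the running sums. The basis layout – 256 pixels by 4 floats per face – is an assumption for the sketch:

```cpp
#include <cstdint>

// Scalar reference for the bit -> SH convolution. Bit x of rows[y]
// is pixel y*16+x, matching the bit-to-word pairing in the SSE
// snippet above.
void convolveFace(const uint16_t rows[16], const float basis[256][4],
                  float sum[4]) {
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            if (rows[y] & (1u << x))
                for (int k = 0; k < 4; ++k)
                    sum[k] += basis[y * 16 + x][k];
}
```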

It’s still fast enough, though, that this step isn’t a significant part of my terrain generation. My pokey 2.53 GHz Core 2 Macbook can render around 100 of these spherical harmonics per core per millisecond. That’s much too slow to run every frame for everything, but running it once on each new block of terrain isn’t a concern.

Once I have a harmonic at each vertex, I linearly interpolate them to get a field of harmonics for the entire world.
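The interpolation itself is ordinary trilinear blending of the 4 coefficients; a sketch, with the corner ordering an assumption (bit 0 = x, bit 1 = y, bit 2 = z):

```cpp
// Trilinear interpolation of per-vertex harmonics: c[8][4] holds the
// 4 SH coefficients at the 8 corners of a cell, and (fx, fy, fz) is
// the fractional position inside the cell.
void lerpSH(const float c[8][4], float fx, float fy, float fz,
            float out[4]) {
    for (int k = 0; k < 4; ++k) {
        float v = 0.0f;
        for (int i = 0; i < 8; ++i) {
            float w = (i & 1 ? fx : 1.0f - fx)
                    * (i & 2 ? fy : 1.0f - fy)
                    * (i & 4 ? fz : 1.0f - fz);
            v += w * c[i][k];
        }
        out[k] = v;
    }
}
```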

So, results! Here’s the underlying voxel field I’m generating cubemaps of:

A boring voxel image.

And here’s the detailed geometry, with sky occlusion only:

I've actually never seen this before! Never had a reason to look at it.

The ambient contribution I’m actually using (occlusion, convolved with the surface normal, convolved with the sky dome):

With normals

The full lighting (above plus the sun and a little bit of nonphysical nonsense):

Lighting component

And final render, in all its Perlin-noise-overuse glory:

Final render

For this sort of simple terrain, it seems to work very nicely! There are some artifacts where the lumpy terrain goes “too deep” inside its corresponding voxel and you get inaccurate deep black, but it’s still not unconvincing to my eye.

Caves, obviously, are hard. I only render out to a distance of 16 cubes, so light is going to leak through any object larger than 16 voxels – I haven’t decided what to do about that. Presumably you could create a “mip-mapping” scheme – convert the world into “megacubes” and do a second render – if you figured out what you were going to do with partially filled areas. (My instinct says you can probably fish something out of your ear that’ll work OK…)
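One guess at what that megacube step might look like: collapse each 2×2×2 block of voxels into a single cube, counting a megacube as solid when at least `threshold` of its 8 children are filled. The threshold – and what to do with partially filled blocks generally – is exactly the open question above:

```cpp
#include <cstdint>

// Hypothetical "mip" step: 16^3 voxels down to 8^3 megacubes.
void downsample(const uint8_t src[16][16][16], uint8_t dst[8][8][8],
                uint8_t kEmpty, int threshold) {
    for (int z = 0; z < 8; ++z)
        for (int y = 0; y < 8; ++y)
            for (int x = 0; x < 8; ++x) {
                int filled = 0;
                for (int i = 0; i < 8; ++i)
                    if (src[z * 2 + ((i >> 2) & 1)]
                           [y * 2 + ((i >> 1) & 1)]
                           [x * 2 + (i & 1)] != kEmpty)
                        ++filled;
                dst[z][y][x] = (filled >= threshold) ? 1 : 0;
            }
}
```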

Similarly, if you decided what to do about partially filled blocks, you could presumably use a technique like this on a world voxelized from polygons rather than on a world polygonized from voxels.

And of course the day is probably coming when you just render a zillion images on the GPU, or cone trace everything, or whatever you do on monster machines — but I have a soft spot for specialized software renderers and I think this one is especially adorable.

Anyone else ever done anything like this?
