Texture-space Decals

I wanted to write up an interesting method of projected decal rendering I’ve been working on recently.

[Image: decal_many]

The basic idea is (relatively) straightforward. To place a decal onto a mesh:
– Render the decal into a texture, projected onto the mesh’s UVs.
– Render the mesh using the decal texture as an overlay, mask, etc.

Naturally, the details get a bit more complex.

(I implemented this in Unity; unlike the rest of this blog, code samples will be using C# and Cg.)

Texture-space rendering

Rendering into texture space is not a new technique, but it is rather uncommon. (The canonical runtime usage is subsurface scattering: render diffuse lighting into a texture, blur it, and sample the result in the main shading pass.)

It is simpler than it may seem: we base the vertex shader output on the mesh UV coordinates, rather than positions:

struct appdata
{
    float4 vertex : POSITION;
    float2 uv : TEXCOORD0;
};

struct v2f
{
    float4 vertex : SV_POSITION;
};

v2f vert (appdata v)
{
    v2f o;
    // Remap UV from [0,1] to [-1,1] clip space (inverting Y along the way.)
    o.vertex = float4(v.uv.x * 2.0 - 1.0, 1.0 - v.uv.y * 2.0, 0.0, 1.0);
    return o;
}

(The Y coordinate of UVs is inverted here; this may vary depending on your renderer.)

In order to project the decal into the texture, we add a matrix uniform for the projection and a texture sampler for the decal texture itself. The vertex shader then projects the vertex position using this matrix, which gives us a UV coordinate for sampling the decal texture.

(Yes, input UVs become output positions and input positions become output UVs here. It’s a bit confusing.)

The fragment shader takes this projected UV coordinate and performs a range check. If every component is within the range [0, 1], then we sample the decal texture. Otherwise, the fragment is outside the projection and the shader discards it.

struct appdata
{
    float4 vertex : POSITION;
    float2 uv : TEXCOORD0;
};

struct v2f
{
    float3 uvw0 : TEXCOORD0;
    float4 vertex : SV_POSITION;
};

sampler2D _MainTex;
float4 _MainTex_ST;
float4 _Color;

float4x4 _DecalProjection0;

v2f vert (appdata v)
{
    v2f o;
    // Remap UV from [0,1] to [-1,1] clip space (inverting Y along the way.)
    o.vertex = float4(v.uv.x * 2.0 - 1.0, 1.0 - v.uv.y * 2.0, 0.0, 1.0);

    float4 world_vert = mul(unity_ObjectToWorld, v.vertex);
    // We assume the decal projection is orthographic, and omit the divide by w here.
    o.uvw0 = mul(_DecalProjection0, world_vert).xyz * 0.5 + 0.5;
    return o;
}

bool CheckBox(float3 uvw)
{
    return !(uvw.x < 0.0 || uvw.y < 0.0 || uvw.z < 0.0 || uvw.x > 1.0 || uvw.y > 1.0 || uvw.z > 1.0);
}

fixed4 frag (v2f i) : SV_Target
{
    bool in0 = CheckBox(i.uvw0);

    // Don't touch this pixel if we're outside the decal bounds.
    if (!in0) discard;

    // In this instance, _MainTex contains an alpha mask for the decal.
    return tex2D(_MainTex, i.uvw0.xy).a * _Color;
}

(While this shader uses the world-space position, using object-space positions will give better results in practice, as it will maintain precision for objects far away from the world origin.)

On the CPU side, we drive the decal rendering from script:

public void SplatDecal(Material decal_material, Matrix4x4 decal_projection)
{
    int pass_idx = decal_material.FindPass("DECALSPLATPASS");
    if (pass_idx == -1)
    {
        Debug.LogFormat("Shader for decal splat material '{0}' does not have a pass for DecalSplatPass.", decal_material.name);
        return;
    }

    RenderTexture active_rt = RenderTexture.active;
    m_DecalTexture.MarkRestoreExpected();
    Graphics.SetRenderTarget(m_DecalTexture);
    decal_material.SetMatrix("_DecalProjection0", decal_projection);
    if (!decal_material.SetPass(pass_idx))
    {
         Debug.Log("Decal splat material SetPass failed.");
         return;
    }
    Graphics.DrawMeshNow(m_Mesh, transform.position, transform.rotation);
    Graphics.SetRenderTarget(active_rt);
}

Producing the decal projection matrix is an interesting problem in and of itself, but I’m not going to get into detail here. In general, we want an orthographic projection matrix corresponding to the oriented bounding box of the decal in world space. (Again, in a production setting, applying decals in object space is likely to produce better results.)
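
The shader above expects _DecalProjection0 to map a world-space point inside the decal's oriented box to [-1, 1] on every axis. As a rough illustration of one way to build such a matrix from the box's center, orthonormal axes, and half-extents (written here in self-contained C++ rather than the post's C#; in Unity the equivalent can be assembled from Matrix4x4.TRS, Matrix4x4.Scale, and .inverse):

// Throwaway math helpers for the sketch; use your engine's types in practice.
struct Vector3 { float x, y, z; };

static float Dot(Vector3 a, Vector3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Builds a row-major world -> decal-box matrix: row i is axis_i / half_extent_i
// with translation -dot(axis_i, center) / half_extent_i, so a point p maps to
// dot(p - center, axis_i) / half_extent_i on each axis ([-1, 1] inside the box).
// Depending on your matrix conventions you may need a transpose before uploading.
void MakeDecalProjection(Vector3 center,
                         Vector3 axis_x, Vector3 axis_y, Vector3 axis_z,
                         Vector3 half_extents,
                         float out_rows[4][4])
{
    Vector3 axes[3]    = { axis_x, axis_y, axis_z };
    float   extents[3] = { half_extents.x, half_extents.y, half_extents.z };
    for (int i = 0; i < 3; ++i) {
        out_rows[i][0] = axes[i].x / extents[i];
        out_rows[i][1] = axes[i].y / extents[i];
        out_rows[i][2] = axes[i].z / extents[i];
        out_rows[i][3] = -Dot(axes[i], center) / extents[i];
    }
    out_rows[3][0] = 0.0f; out_rows[3][1] = 0.0f;
    out_rows[3][2] = 0.0f; out_rows[3][3] = 1.0f;
}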

Visual Artifacts

The primary visual artifact resulting from this method is visible UV seams:
[Image: decal_seam]

The reason this occurs is that our decal rendering and final mesh rendering disagree slightly on what texels are affected by the UVs of a given triangle. This is due to the GPU’s rasterization fill rules: fragments along the edge of a triangle are deemed in or out of the triangle based on a consistent rule so that when two triangles share an edge the fragment is only shaded once. While absolutely the correct thing to do for general rendering, it bites us here.

Conservative rasterization extensions should fix this issue. However, Unity does not expose this extension to users yet, and in any case it is not widely available.

The fix I implemented is an image pass that grows the boundaries of UV islands by one texel.

First, we render a single-channel mask texture, using the same texture-space projection as above, and output a constant 1. The mask now contains 1 on every texel touched by UVs, and 0 everywhere else. This mask is rendered only once, at initialization. (Really, we could even bake this offline.)

Next, after rendering a decal, we run a pass over the entire texture, creating a second decal texture. This pass samples the mask texture at each texel: if the sample is 1, the texel is passed through unchanged into the output. However, if the sample is 0, we sample the neighboring texels. If any of those are 1, the corresponding color samples are averaged together and output.

The resulting fragment shader looks like this:

sampler2D _MainTex;
float4 _MainTex_TexelSize;
sampler2D _MaskTex;

fixed4 frag (v2f i) : SV_Target
{
    float mask_p11 = tex2D(_MaskTex, i.uv).r;
    if (mask_p11 > 0.5)
    {
        // This pixel is inside the UV borders, pass through the original color
        return tex2D(_MainTex, i.uv);
    }
    else
    {
        // This pixel is outside the UV border.  Check for neighbors inside the border and output their average color.
        //
        //   0 1 2
        // 0   -
        // 1 - X -
        // 2   -
        //
        float2 uv_p01 = i.uv + float2(-1, 0) * _MainTex_TexelSize.xy;
        float2 uv_p21 = i.uv + float2(1, 0) * _MainTex_TexelSize.xy;

        float2 uv_p10 = i.uv + float2(0, -1) * _MainTex_TexelSize.xy;
        float2 uv_p12 = i.uv + float2(0, 1) * _MainTex_TexelSize.xy;

        float mask_p01 = tex2D(_MaskTex, uv_p01).r;
        float mask_p21 = tex2D(_MaskTex, uv_p21).r;
        float mask_p10 = tex2D(_MaskTex, uv_p10).r;
        float mask_p12 = tex2D(_MaskTex, uv_p12).r;

        fixed4 col = fixed4(0, 0, 0, 0);
        float total_weight = 0.0;
        if (mask_p01 > 0.5) {
            col += tex2D(_MainTex, uv_p01);
            total_weight += 1.0;
        }
        if (mask_p21 > 0.5) {
            col += tex2D(_MainTex, uv_p21);
            total_weight += 1.0;
        }
        if (mask_p10 > 0.5) {
            col += tex2D(_MainTex, uv_p10);
            total_weight += 1.0;
        }
        if (mask_p12 > 0.5) {
            col += tex2D(_MainTex, uv_p12);
            total_weight += 1.0;
        }

        if (total_weight > 0.0) {
            return col / total_weight;
        }
        else {
            return col;
        }
    }
}

Note the mask comparisons are “greater than 0.5”, not “equal to 1”. Depending on the layout and packing of UVs, you may be able to get away with storing the mask texture in half-resolution and using bilinear filtering to reconstruct the edge. This produces artifacts where UV islands are within a few texels of each other (as they merge on the lower-resolution rendering).

Seams are filled:
[Image: decal_seam_fix]

On the CPU side, this is a straightforward image pass:

public void DoImageSpacePasses()
{
    int pass_idx = DecalImageEffectMaterial.FindPass("DECALUVBORDEREXPAND");
    if (pass_idx == -1)
    {
        Debug.Log("Could not find decal UV border expansion shader pass.");
        return;
    }

    // We replace the existing decal texture after the blit.  Copying the result back into
    // the original texture takes time and causes flickering where the texture is used.
    RenderTexture new_rt = new RenderTexture(m_DecalTexture);

    DecalImageEffectMaterial.SetTexture("_MaskTex", m_UVMaskTexture);
    Graphics.Blit(m_DecalTexture, new_rt, DecalImageEffectMaterial, pass_idx);

    m_DecalTexture.Release();
    m_DecalTexture = new_rt;
    m_Material.SetTexture("_DecalTex", m_DecalTexture);
}

Of course, in doing this, we create a new render target each time we run this pass. Therefore, it’s not a great idea to run the image pass after every decal is rendered. Instead, render a batch of decals and run the image pass afterwards.

Advantages

Okay, so what does all of this *get* us, in comparison to existing decal techniques?

– The computation cost of a decal is low. There is a fixed per-frame overhead in sampling the decal texture in the main shader, and each decal itself is a one-time cost, paid when it is rendered into the texture. (The UV fixup can be amortized across a batch of decals, making it even cheaper.)

Contrast deferred decals, which require extra draw calls every frame. Permanent decals are therefore very expensive over time. (Forward+ decals encounter similar issues.)

The price paid, of course, is memory: the decal texture takes memory, the mask texture takes memory. The full decal texture must be kept around, even for distant objects, because we must render into the top mip at all times. (Otherwise, we’re stuck needing to upsample the rendered decal when the object gets closer.)

– Support for skinned meshes.

Deferred decals do not work for skinned meshes. While you can in theory attach them to bones and move the decal around with animation, the decal will not deform with the mesh, and so will appear to slide across the surface.

This is not a problem for the texture-space approach; the decal texture will stretch and squash around joints just like any other texture.

– Avoids UV distortion compared to simply “stamping” decals into the texture. By projecting the decal using the mesh vertex positions, we maintain size and shape, regardless of the topology of the UVs underneath. (As well as wrapping around UV seams.) This means texture-space decals still work well on organic shapes, where UVs may be stretched, skewed, etc. Do note this comes at the cost of inconsistent texel density over a decal in these cases; we can’t have everything.

Tradeoffs

This technique does come with some important caveats.

– No UV overlap.

Overlapping UVs will cause a decal in one area to show up elsewhere. (And since the overlap is almost always along a UV island boundary, one of these will have a sharp edge.)

– Sensitive to UV distortion.

While the texture projection can compensate for minor distortion in UVs, severe stretching will result in wildly varying texel density across the decal.

Both of these combine to make texture-space decals generally inadvisable for environments. (As environment art leans heavily on both UV overlap and stretching across low-detail regions.)

– Unique texture per object.

Each object using this decal method needs its own texture to store the decal layer. These can be combined into a texture atlas to reduce the impact on render batching, but the memory cost is still significant.

Conclusion

This is not a generally-applicable decal method. Deferred / Forward+ decal rendering still solves the general case much better.

However, this *is* a useful tool for decals on a relatively small number of meshes, in cases where general methods aren’t optimal: skinned meshes, permanent decals, large numbers of decals. As an example, a recent trend in FPS games is decal overlays for first-person weapon meshes (e.g., blood splatter on a recently-used melee weapon). Rather than pre-author them, this technique would allow those overlays to be generated dynamically.


hikari – Entity Component System II

I’d intended to use this second post to dive into turning some existing systems in hikari into components. However, after a few attempts at writing it up, I think converting an existing system makes for a really terrible way to explain component systems. So instead I’m going to go over a few common decisions that were showing up a lot and discuss them directly. We’ll get back to the actual implementation after this.

Also, before I go any further, I wanted to clarify a point from last time. ECS is an abstract design pattern. There are many ways of implementing it, just like there are a variety of ways to implement abstract data structures. (e.g., a list could be an array, a linked list, a linked list of array chunks, etc.) There are tradeoffs for every implementation decision, so I don’t claim I’m implementing things in The Right Way. That said, this is a record of my implementation, not a broad study of ECS in general, so I may not mention all of the tradeoffs.

Components and Systems

Continuing the theme of “things are not what they seem”, components don’t necessarily exist in ECS. Which is to say, if you have a Foo component, you may or may not actually have a FooComponent struct anywhere.

The reason for this is simple: it is exceedingly rare to have *only one* of a thing. The structure we store things in determines their memory layout. Memory layout decisions for a group of things should be made with respect to the entire group and how it is used. Therefore, these decisions are made when implementing the system for a particular component.

For example, here are a few ways we could lay out a TransformSystem that holds parent- and world-relative spatial transforms as well as a scene graph structure describing the relationship between instances.

struct TransformSystem {
    // Option:  Array-of-structs
    struct TransformComponent {
        Vector3    local_position;
        Vector3    local_scale;
        Quaternion local_rotation;
        Vector3    world_position;
        Vector3    world_scale;
        Quaternion world_rotation;

        u32 parent;
        u32 next_sibling;
        u32 prev_sibling;
        u32 first_child;
    };
    u32 count;
    TransformComponent * transforms;

    // Option:  Struct-of-arrays
    u32 count;
    Vector3 *    local_position;
    Vector3 *    local_scale;
    Quaternion * local_rotation;
    Vector3 *    world_position;
    Vector3 *    world_scale;
    Quaternion * world_rotation;
    u32 * parent;
    u32 * next_sibling;
    u32 * prev_sibling;
    u32 * first_child;

    // Option:  Array-of-structs, use-case split
    struct Transform {
        Vector3    position;
        Vector3    scale;
        Quaternion rotation;
    };

    u32 count;
    Transform * parent_from_local;
    Transform * world_from_local;
    u32 * parent;
    u32 * next_sibling;
    u32 * prev_sibling;
    u32 * first_child;
};

A lot of material about memory layout pushes “struct of arrays for everything!”, and that’s a good default strategy, but it isn’t necessarily the best way to go every time. What gets used together? What doesn’t? When you optimize for cache performance in the most common code path, how many cache misses do you cause in the *second* most common code path? In the example, struct-of-arrays makes a lot of sense when we’re burning through the data linearly (as we do when updating the world transforms), but means three separate cache misses to look at a specific transform on its own. Which is more important is going to depend on factors that aren’t obvious and may change over time (how many other systems poke at individual transforms, say.) Hence why it’s nice to have this detail localized to the system itself — we can change this at will as long as no one else depends on our exact data layout.

The freedom to choose our data layout per system goes beyond struct-of-arrays, however. Need to keep the data sorted? Go ahead. Need to store it in a spatial lookup structure? Sure. Want to keep components in their own heap-allocated structs? That works too.

The listing above is missing one important field that all components will need: the entity it is attached to. Systems interact directly with each other. If we know our host entity, we can use that to query other systems for components attached to that same entity. (We also need that reference to check if our host has been destroyed. Remember: we’re not storing a list of attached components anywhere!)
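
Concretely, the struct-of-arrays option above would just gain one more parallel array; a minimal sketch:

struct TransformSystem {
    u32 count;
    Entity * entities;   // entities[i] owns instance i; used for cross-system
                         // queries and for checking whether the owner is still alive.

    // ... per-instance transform and scene-graph arrays, as in the listing above.
};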

Component Lookup

One of the key things we need to be able to do is, given an entity, find the component instance(s) attached to it. Once again, this is something we solve separately per-system. For many systems, we’re going to want to move instances around in our internal data structures, to keep them tightly packed or enforce ordering requirements. Pointers and indices alone won’t be sufficient in this case, because the instance moves around in memory.

We need a lookup structure between Entity handles and instances. How we choose to do this is a tradeoff between extra memory overhead and runtime.

On the low-memory side, we could just search the array of Entity handles. Linear search is O(n) in the number of active instances.

Option<Instance>
GetInstance(Entity e) {
    Instance result = Instance::Null();
    for (u32 i = 0; i < this->count; ++i) {
        if (e == this->entities[i]) {
            result = Instance(i);
            break;
        }
    }
    // NOTE: Option<T> stores an instance of T and whether that value is
    // valid.  Accessing an invalid value from Option<T> will assert.
    // I use this to force handling of the "not found" case by the caller.
    return Option<Instance>(result, !result.IsNull());
}

We can improve this further by letting the caller pass in a previous instance handle — before searching, check if that instance and entity combination is still valid:

Option<Instance>
GetInstance(Entity e, Instance inst=Instance::Null()) {
    if (!inst.IsNull() && inst.index < count) {
         if (e == this->entities[inst.index]) {
            // Past instance is valid
            return Option<Instance>(inst, true);
        }
    }
    // No longer valid, need to search again:
    ...
}

If the system doesn’t store a huge number of instances, and instances don’t move around very often, this is a perfectly reasonable approach. No memory overhead (we need that entity list anyway), and constant time in the common case.

Of course, we’re not going to be so lucky with some systems. For example, we can reasonably expect that the *majority* of entities have a Transform — anything that exists spatially in the world needs one. Transforms are also used by a lot of different systems, so we need to perform the lookup often. Finally, for the sake of its own updates, the transform system has ordering requirements that mean instances move around in its internal structures pretty often.

We can solve this one with a linear lookup table indexed by the entity index:

Option<Instance>
GetInstance(Entity e) {
    Instance result = Instance::Null();
    if (e.index < this->lookup_table.size()) {
        result = this->lookup_table[e.index];
        // Check for stale entity:
        if (this->entities[result.index] != e) {
            result = Instance::Null();
        }
    }
    return Option<Instance>(result, !result.IsNull());
}

Unlike the lookup table in the entity manager I discussed last time, there’s no strict requirement on this lookup table not moving around. Therefore, we can use std::vector (or equivalent), sized to hold the highest entity index we’ve ever been attached to. GetInstance can immediately reject entities beyond that point. This gives us O(n) memory overhead in the number of entities that exist in the world.

Somewhere between these two on the memory/runtime tradeoff is a hashtable mapping Entity handles to Instance handles. This requires less memory for component types that aren’t very common (no need to store null Instances for every entity), while having better asymptotic runtime than searching the Entity array. (Search may still win in practice for small enough systems.)
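
A sketch of that middle option; the packed 64-bit key and the Packed() helper are assumptions for illustration, not part of hikari:

#include <unordered_map>

struct FooSystem {
    // Key is the Entity packed into a single 64-bit value (index | generation).
    std::unordered_map<u64, Instance> lookup;

    Option<Instance>
    GetInstance(Entity e) {
        auto it = lookup.find(e.Packed());
        if (it != lookup.end()) {
            return Option<Instance>(it->second, true);
        }
        return Option<Instance>(Instance::Null(), false);
    }

    // As with the linear table, whenever an instance moves inside the system's
    // internal arrays, its map entry must be updated to point at the new index.
};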

Multiple Instances

The above lookup functions only really work if each Entity can have only one of a particular component. This makes sense in some cases (what would it even *mean* to have multiple Transforms?) In many other cases, though, we’d really like to be able to use multiple instances per entity. For example, an explosion effect might have several different particle systems attached (smoke, rubble, sparks, etc.)

However, in this case, how do we reference a particular instance of a component from some other system? The host entity’s handle is no longer a unique identifier for the instance.

I don’t really have an answer I’m confident in, but my working theory is the system that allows multiple instances should handle this internally. Specifically, by expanding its data structures.

First, we add a unique id field to the Instance struct. This isn’t another index, just a monotonically increasing counter of instances created. (I am very explicitly *not* using the Nth instance attached to an entity, as that interacts badly with adding / removing instances.) Next, we add two arrays to our instance data: one pointing to the next instance assigned to the same entity, the other containing the instance id. This gives us a linked list of instances for a given entity — by putting this in its own array, we keep it out of the way during most operations, but increase the chance of hitting cache when we *are* iterating the list.

GetInstance() gets a bit more complex:

Option<Instance>
GetInstance(Entity e, Instance inst) {
    if (inst.IsNull()) {
        // Caller is expected to have a non-null instance already.  (As returned
        // by AddInstance(), etc.)  Otherwise we don't know the correct id.
        return Option<Instance>(inst, false);
    }

    if (inst.index < count) {
         if (this->entities[inst.index] == e && this->instance_ids[inst.index] == inst.id) {
            // Given instance reference is still valid.
            return Option<Instance>(inst, true);
        }
    }

    // Find the entity's first instance, using one of the methods above:
    Instance current = FindFirstInstance(e);
    while (current.id != inst.id && !current.IsNull()) {
        u32 next_index = this->next_instance_index[current.index];
        u32 next_id = this->instance_ids[next_index];
        current.id = next_id;
        current.index = next_index;
    }

    // Result is valid if the id matches the target.
    return Option<Instance>(current, current.id == inst.id);
}
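
The other half of this scheme is keeping the per-entity chain up to date when instances are added. A rough sketch of what an AddInstance might look like; NULL_INDEX, next_unique_id, SetFirstInstance(), and the two-argument Instance constructor are illustrative names, and removal / array growth are ignored:

Instance
AddInstance(Entity e) {
    u32 index = count++;
    u32 id = next_unique_id++;                     // monotonically increasing, never reused

    this->entities[index] = e;
    this->instance_ids[index] = id;
    this->next_instance_index[index] = NULL_INDEX; // new instance ends the chain

    Instance first = FindFirstInstance(e);
    if (first.IsNull()) {
        // First instance for this entity; record it in the entity->instance lookup.
        SetFirstInstance(e, Instance(index, id));
    }
    else {
        // Walk to the end of the entity's chain and append.
        u32 tail = first.index;
        while (this->next_instance_index[tail] != NULL_INDEX) {
            tail = this->next_instance_index[tail];
        }
        this->next_instance_index[tail] = index;
    }

    return Instance(index, id);
}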

I’m not a huge fan of this. More state to keep in sync, and potentially a lot of effectively useless overhead (i.e., if *most* entities with a FooComponent only have one, the next_instance_index array is going to be full of nulls.) On the other hand, we only pay a performance cost when we actually *use* multiple instances per entity.

¯\_(ツ)_/¯

Generalization

So, we have a variety of approaches to solve instance storage and lookup on a per-system basis, but do we really have to do this every time? Many systems are going to want to use the same approach.

I don’t really know. However, I suspect we can get away with sharing a lot of the lookup code, by storing the lookup data in its own struct, templated on the system’s Instance type. Actual component data management is unlikely to be easy to share, but we can write helper code for things like “allocate a bunch of arrays from a single data block”.
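
As a rough sketch of what that shared lookup piece might look like (illustrative only, not hikari's actual code):

#include <vector>

// A shared entity->instance lookup, templated on the system's Instance type.
// Systems still perform their own stale-entity checks against their Entity array.
template <typename InstanceType>
struct InstanceLookup {
    std::vector<InstanceType> table;   // indexed by entity index

    void Set(Entity e, InstanceType inst) {
        if (e.index >= table.size()) {
            table.resize(e.index + 1, InstanceType::Null());
        }
        table[e.index] = inst;
    }

    InstanceType Get(Entity e) const {
        if (e.index < table.size()) {
            return table[e.index];
        }
        return InstanceType::Null();
    }
};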

Okay, onward to actual code.  Next time.

hikari – Entity Component System I

Happy New Year, everyone.

So far, almost all of the work on hikari has been focused on rendering. In order to continue working on my terrain project, I need to flesh out more of the underlying engine systems. (To be perfectly clear: creating the impetus to work on these systems is part of why I chose this project to work on.) As a starting point, I want to set up an Entity Component System architecture for managing the simulation.

Why? Well, let’s look at what I have right now:

State of the World

The simulation state of the world is unceremoniously dumped in the World struct, which contains lists of meshes, particle systems, and other persistent data (i.e., sun direction.) Also worth noting, several important things are *missing* from this structure — the camera setup, for instance. This is largely an artifact of being hacked together with very little care while testing the renderer itself.

During rendering, the Renderer gathers a list of meshes by calling World::GetRenderMeshes(), which in turn does a lot of (ultimately unnecessary) work to splice together a list of RenderMesh structs from various sources. Terrain meshes and meshes loaded from OBJ files are stored by their various systems. This is less of a performance problem than it may seem (it only happens once a frame), but it’s still silly and ugly and bug prone.

Furthermore, there’s very little structure to the world state at the moment. Each RenderMesh contains a 4×4 world_from_local transform matrix and that is the only transform information known for any mesh. This works, but is less than ideal.

Entity Component System

My proposed solution to this problem is using an Entity Component System architecture to manage all of the simulation state. ECS is a pattern where each Entity in the world is composed of a collection of Components, each of which is processed by some System.

This is the point where most explanations of ECS seem to end. But I’ve explained the concept, not the implementation, and that’s important here. In a standard object-oriented model, you might implement ECS by having each entity store a list of references to components, that live in lists managed by the component’s system. This is certainly possible — both Invaders! and a second (abandoned) project on that engine used this method for entities. However, this led to nasty macros, code generation tools, and generally being a pain to work with. (A dynamic array *per entity*, eek!)

So. I’m not going to do that this time.

A key realization is that we basically never need a list of all components belonging to an entity (at runtime, at least.) So why store it? Systems which need to interact with each other can do so because they have the same entity ID.  Without that list of components, the Entity is just a unique identifier, a glorified integer. Or a foreign key, if you’re familiar with relational databases. Because that’s all an ECS is: a hyper-specialized relational database.

Of course, there are some potential problems introduced here. For example, if an entity doesn’t know all its components, how can it destroy all of them when the entity is destroyed? Well, it can’t. But we can solve this problem on a per-system basis — query the entity list to see if a particular component’s owner is still alive, for example. Or, set up a way to register callbacks on entity destruction, for systems where the live check is too much overhead. For components that are cheap to update, maybe we don’t even care.

Entities

Okay, enough theory, let’s talk about something concrete. I’m borrowing a fair bit of the structure of my implementation from Niklas Gray’s blog series about the entity system in Bitsquid / Stingray.

An Entity is actually two numbers that together form a unique ID — index and generation. I use 32-bit and 16-bit integers for these. The index is unique among all *currently live* entities. The generation indicates how many times that index has been used. If I destroy an entity, its index is open for reuse, but the next entity created with that index will have a different generation.

The generation value is also stored in a list owned by the EntityManager. This list is indexed by the entity index. A given Entity value is valid if and only if its generation is equal to the generation value stored in the EntityManager at that index. This is not perfect — there are a finite number of bits in the generation value and it will eventually roll over and repeat. However, by increasing the number of bits, we can ensure this is *highly unlikely* in practice.

The EntityManager has three main operations: creating entities, destroying entities, and checking whether an entity is alive. Of these, the liveness check is by far the most common operation (any component that wants to ensure its owner is alive needs to call this!) In order for the IsAlive() call to be both fast and thread-safe, the generation lookup table can never be reallocated.

Easy enough, I’ll preallocate it. I need one slot for each possible (32-bit) entity index and each value is 16 bits, which gives us an array size of… 8 gigabytes.

Believe it or not this is fine. Because I’m going to cheat.

The important thing is that the address of the lookup table never changes, and that it can grow — without moving — to support however many entities end up being used. Instead of fully allocating the full size, we can instead use OS-specific calls to reserve that much *address space*. While the typical PC has somewhere between 2-16GB of physical memory, the address space of a 64-bit process is *enormous*. 8GB of address space is nothing. As more memory is needed, we can simply commit more pages in that range. And should we ever run on a PC with terabytes of RAM, there’s no artificial limitation keeping us from using all 4-billion or so entities.
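
A minimal sketch of that trick using the Windows VirtualAlloc API (an illustration, not hikari's actual code; entity creation, destruction, and index reuse are omitted):

#include <windows.h>
#include <stdint.h>

struct Entity {
    uint32_t index;
    uint16_t generation;
};

struct EntityManager {
    // One 16-bit generation slot per possible 32-bit entity index.
    static const size_t MAX_ENTITY_COUNT = 1ull << 32;

    uint16_t * generations;
    size_t     committed_bytes;

    void Init() {
        // Reserve 8GB of *address space*; no physical memory is committed yet.
        generations = (uint16_t *)VirtualAlloc(
            NULL, MAX_ENTITY_COUNT * sizeof(uint16_t), MEM_RESERVE, PAGE_READWRITE);
        committed_bytes = 0;
    }

    // Called when handing out a new entity index, before touching generations[index].
    void EnsureCommitted(uint32_t index) {
        size_t needed = ((size_t)index + 1) * sizeof(uint16_t);
        if (needed > committed_bytes) {
            // Commit in 64KB chunks; the base address never moves, so readers
            // of the table are unaffected.
            size_t new_size = (needed + 0xffff) & ~(size_t)0xffff;
            VirtualAlloc(generations, new_size, MEM_COMMIT, PAGE_READWRITE);
            committed_bytes = new_size;
        }
    }

    bool IsAlive(Entity e) {
        return generations[e.index] == e.generation;
    }
};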

Beyond that sketch, there’s not much more code worth listing here, as most of it is uninteresting. Next time, we’ll look at some actual component systems, which are where the real benefits show up.

A String Compare Bug

This is a bit of a departure from the current series. I encountered a bug last week that, while ultimately trivial to fix, was quite difficult to track down.

To start with, the root cause and how to fix it:

I rewrote my shader handling to work indirectly — instead of holding OpenGL handles directly, the renderer has an index into an array of shaders. The ultimate goal being that the asset system can recompile shaders at runtime, swap them out at the top of a frame, and the renderer is none the wiser.

In order to get that index in the first place, the renderer does a lookup using a name string. Asset names are stored in a NameString, which has a character pointer and size. However, the lookup query uses a raw C string. Herein lies the bug:

struct NameString {
    u32 len;
    char * str;

    bool operator==(char * other) {
        return (0 == strncmp(other, str, len));
    }
};

On first glance, this returns whether the strings match. However, it *actually* returns whether the first len characters of the string match. If str is a prefix of other, this will return true, even if the full strings don’t match.

The fix is to actually compute the length of the C string, and compare the lengths as well. As a precaution against bad data, we’d prefer strnlen over strlen. In order to make sure the check is correct, we use a maximum length of len + 1 for the strnlen — therefore, a longer string will still produce a longer length:

struct NameString {
    u32 len;
    char * str;

    bool operator==(char * other) {
        u32 other_len = strnlen(other, len + 1);
        return (other_len == len) && (0 == strncmp(other, str, len));
    }
};

(The *proper* fix, of course, is *don’t use strings for this*. That’s the direction I want to move towards.)

So that’s the error, but not quite the full story of the bug. I feel like it’s worth going into detail here, because it really underscored, to me, the way a subtle error can have wide, unexpected effects.

The problem originally showed itself as serious noise in the screen-space ambient occlusion (SSAO) buffer:
[Image: Screenshot 2017-12-15 01.52.21]

My implementation of SSAO has two passes: the first takes the depth buffer, reconstructs normals, and then tests points in a hemisphere above the surface. The percentage of the points that fall behind the depth buffer is used as the occlusion value. (The details get a bit messier but aren’t really relevant at the moment.) To get better coverage with fewer samples, the set of test points is rotated based on a noise texture tiled over the screen. The second pass takes this result and blurs it, using a blur exactly as big as our noise texture, which smooths out the result.

After determining I wasn’t getting NaNs or infinities anywhere, my initial thought was that the blur pass was simply ineffective. Noisy artifacts around depth discontinuities are expected when reconstructing normals from depth, and that’s similar to what we’re seeing here. The blurring *should* mitigate this.

In an attempt to fix it, I modified the blur shader to take the depth buffer as an input, and weight samples in the blur based on how close (depth-wise) they were to the current pixel. And it worked! I got a smooth(er) output:
[Image: Screenshot 2017-12-15 01.33.36]

Of course, this wasn’t the actual error. (And this isn’t the correct output, either!)

I wanted to get a screenshot of the bug for this blog, so I commented out the changes to my blur shader… and nothing changed, the bug did not reappear.

Well, if I can’t *UN*fix a bug, I probably haven’t actually fixed it.

The one change I had not backed out was binding the depth buffer. Backing that out reintroduced the bug. What’s more, if I bound *another* texture (say, the color framebuffer), I got a glitchy-but-recognizable copy of that texture’s red channel in the SSAO map.

Exasperated, I went to bed.

The next morning, I finally sat down and stepped through in the debugger. And noticed, finally, that the bound shader was the same for both SSAO passes. Ah.

The main SSAO pass shader was named “ssao”, the SSAO blur shader was named “ssao_blur”. Due to the string comparison bug this post began with, both passes ran the main shader. The original visual artifact was due to having the depth input unbound while running the pass. (Obviously it was reading *something*, but I’m not sure what.) Once I bound the depth input for (what I thought was) the updated blur shader, the smoother output was just the main SSAO output.

Fixing the bug for real gives us the blurred version of that, and the correct output:
[Image: Screenshot 2017-12-19 21.28.16]

There are a few things I could have done to catch this bug earlier and with less frustration.

– Warn or assert on invalid shader uniform names. I had expected the GL driver’s debug mode to catch this — obviously it didn’t. glGetUniformLocation returns -1 for uniform names not found in the shader, so I need to check for that myself; the driver is perfectly happy binding to that location. (This would have caught the bug immediately, when “ssao_input” wasn’t found in the main SSAO shader. A sketch of such a check follows after this list.)

– First priority is to check the code I just changed, not code that is “obviously” related to the bug. The *only* issue with the SSAO pass was using the incorrect shader (the depth-sensitive blur helps, but wasn’t a big difference.) Even if I’m sure the code I just wrote only uncovered a bug, not caused it: verify that.

– I don’t understand a bug if I can’t force it to happen again.

– Use the debugger more. Use it!
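
For the first item, the check itself is cheap. A sketch of what a checked uniform bind might look like (a hypothetical wrapper, not the actual hikari code):

#include <cstdio>
// GL declarations come from your loader of choice (glad, GLEW, etc.).

void SetUniform1fChecked(GLuint shader_program, const char * name, float value) {
    GLint location = glGetUniformLocation(shader_program, name);
    if (location == -1) {
        // The driver will happily bind to location -1 without complaint, so warn here.
        fprintf(stderr, "WARNING: uniform '%s' not found in shader %u\n", name, shader_program);
        return;
    }
    glUniform1f(location, value);
}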

hikari – Terrain Materials

Now that hikari’s material system can express ideas more complicated than “specific set of textures”, it’s time to implement the materials and shading for terrain.

Splat Maps:

The texture selection is going to be based off of splat maps, which means we need the vertex colors: something I conveniently ignored when doing mesh generation earlier. It’s a simple enough fix. I decided to go with layers of noise for testing again, both because it was easy (no need for tooling) and because it produces the worst-case scenario for performance (all pixels use all or most of the splats). If the splat / triplanar mapping is going to fall off the performance cliff I want to find out now, not later.

Splats are normalized (to sum to one) and packed into a 32 bit color per vertex:

inline u32
NormalizeSplat(float w0, float w1, float w2, float w3, float w4) {
    float sum = w0 + w1 + w2 + w3 + w4;
    Assert(sum != 0.0f, "Cannot have total weight of zero on terrain splat.");

    float r = w0 / sum;
    float g = w1 / sum;
    float b = w2 / sum;
    float a = w3 / sum;

    u32 ri = (((u32)(r * 255.0f) & 0xff) <<  0);
    u32 gi = (((u32)(g * 255.0f) & 0xff) <<  8);
    u32 bi = (((u32)(b * 255.0f) & 0xff) << 16);
    u32 ai = (((u32)(a * 255.0f) & 0xff) << 24);

    return ai | bi | gi | ri;
}

I wrote a basic shader to visualize the splat map (although, squeezing 5 values into 3 colors is not particularly effective):
[Image: splat_color.png]

Textures:

By packing textures tightly, we can reduce the number of texture reads for each splat material to two:
– albedo color
– normal + AO + roughness

This requires encoding the normal map in two channels (which is completely fine for tangent space, just discard the Z component and reconstruct in the shader). It also assumes that none of the splat materials have a metal channel. This is not necessarily true, but the metal mask could be packed into the albedo’s alpha channel. For the textures I’m working with, the metal mask is a constant 0.

So, as a rough calculation:
2 textures per splat material
x 3 splat materials per splat (so that we can have different textures on each planar face)
x 5 splats
= 30 textures.

This immediately rules out the naive approach of binding each texture separately. GL_MAX_TEXTURE_IMAGE_UNITS, the maximum number of bound textures per draw, on my midrange GPU is 32. Since the lighting code uses several textures as well (shadow maps, environment probes, a lookup texture for the BRDF), we run out of texture units.

Fortunately, we don’t have to do things that way.

All of the terrain textures should be the same size, so that the resolution and variation is consistent between them. This in turn means we can pack all of the terrain textures into a set of texture arrays. Instead of binding different textures, each splat material provides an index into the texture arrays.

This does make some things, like streaming textures in and out, more complicated. I’ve ignored this for now, as hikari simply loads everything at startup anyway. (hikari’s handling of assets is not very good, and will need to be addressed sometime soon.)
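
Setting the arrays up in OpenGL is straightforward: allocate every layer up front, then upload each terrain texture into its own layer. A minimal sketch, assuming uncompressed 8-bit RGBA data and ignoring mipmaps:

// Allocate every layer of a splat texture array in one immutable allocation.
// glTexStorage3D needs GL 4.2 / ARB_texture_storage; on older GL, a glTexImage3D
// call per mip level works as well.
GLuint CreateSplatAlbedoArray(u32 width, u32 height, u32 layer_count) {
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8, width, height, layer_count);
    return tex;
}

// Upload one terrain texture into its own layer; zoffset selects the layer.
void UploadSplatLayer(GLuint tex, u32 layer, u32 width, u32 height, const void * rgba_pixels) {
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, layer,
                    width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, rgba_pixels);
}

In the shader, the corresponding sampler becomes a sampler2DArray, and the splat id passed in vec3(uv, splat_id) is simply the layer index.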

Shader:

All of the shaders in hikari are plain GLSL, with some minor support for preprocessing (#includes). Material shaders share the same lighting code, via a function called AccumulateLighting() that takes a struct containing all of the surface parameters and returns an RGB color. The benefit of this is writing, debugging, and optimizing the lighting calculation and lookup *once*.

Writing a shader, then, is just a matter of filling out that structure.

For this terrain shader, we need to do two sets of blends: first, blending between splats; second, blending between the three planar projections.

The blending for splats is pretty much exactly what you’d expect:

vec3 SampleAlbedoSplat(vec2 uv, float spw0, float spw1, float spw2, float spw3, float spw4) {
    vec3 splat0 = texture(splat_albedo_array, vec3(uv, splat_id0)).rgb;
    vec3 splat1 = texture(splat_albedo_array, vec3(uv, splat_id1)).rgb;
    vec3 splat2 = texture(splat_albedo_array, vec3(uv, splat_id2)).rgb;
    vec3 splat3 = texture(splat_albedo_array, vec3(uv, splat_id3)).rgb;
    vec3 splat4 = texture(splat_albedo_array, vec3(uv, splat_id4)).rgb;

    vec3 splat_albedo = splat0 * spw0 +
                        splat1 * spw1 +
                        splat2 * spw2 +
                        splat3 * spw3 +
                        splat4 * spw4;

    return splat_albedo;
}

There is another, similar function for sampling and blending the surface properties (normal, roughness, AO).

The triplanar blending is more interesting. See these two blog posts for the fundamentals of what triplanar blending is and how it works:
Ben Golas – Normal Mapping for a Triplanar Shader
Martin Palko – Triplanar Mapping

The first step of triplanar blending is calculating the blend weights for each plane. I take the additional step of clamping weights below a threshold to zero. While this can create some visible artifacts if the threshold is too high, any plane we can avoid looking at is 10 fewer texture samples.

const float blend_sharpness = 4.0;
const float blend_threshold = 0.05;

// fs_in.normal is the interpolated vertex normal.
vec3 blend = pow(abs(fs_in.normal), vec3(blend_sharpness));

blend /= dot(blend, vec3(1.0));

if (blend.x < blend_threshold) blend.x = 0.0;
if (blend.y < blend_threshold) blend.y = 0.0;
if (blend.z < blend_threshold) blend.z = 0.0;

// Need to renormalize the blend
blend /= dot(blend, vec3(1.0));

By checking for a positive blend weight before sampling one of the projections, we can skip those that aren’t going to contribute much. This *does* seem to be a win (about a 0.25ms drop in render time in my test scene), but it’s pretty close to the noise level so I’m not certain.

Thresholding the splat weights may also be worthwhile; in the worst-case test I have set up it definitely isn’t, but actual artist-authored terrain is unlikely to use more than 2 or 3 channels per-pixel.

The actual blending is, mostly, exactly what you’d expect (multiply each planar projection with the corresponding weight and add them together.) The normal blending is slightly different, as described in Ben Golas’ blog post above:

// props_#.ts_normal is the tangent space normal for the plane facing that axis.
vec3 normal_x = vec3(0, props_x.ts_normal.yx);
vec3 normal_y = vec3(props_y.ts_normal.x, 0.0, props_y.ts_normal.y);
vec3 normal_z = vec3(props_z.ts_normal.xy, 0.0);
surf.N = normalize(fs_in.normal +
                   blend.x * normal_x +
                   blend.y * normal_y +
                   blend.z * normal_z);

The normal map for each plane is a linear blend of the splats. This is wrong, but looks okay in practice. I think the correct approach would be to swizzle *every* normal map sample out to world space and *then* blend? Not sure.

The result:
[Image: splat_texture.png]

Per-plane Textures:

Stepping back a bit, let’s look at this render:
[Image: per_plane_color]

The textures are placeholders, but each pixel is colorized based on the contribution by each planar mapping. For terrain rendering, this is really nice: it’s identified the slopes for us. In addition, we read a texture for each plane anyway — but there’s no need for it to be the *same* texture!

By exploiting this, we can use a different texture for slopes and level areas, “for free”:
[Image: grass_cliff]

There’s more we can do here, too, like using different variants of a texture for X and Z planes to increase the amount of variation.

Performance:

I don’t have a very good feel on what makes shaders fast or slow. I was expecting 30+ texture reads to be a problem, but it doesn’t really appear to be. The depth prepass is possibly saving me a lot of pain here, as it means very little of the expensive shading goes to waste. I did notice some issues after first implementing the terrain shader, dropping to 30FPS on occasion, but after adding some GPU profiling code it turns out *shadow rendering* is slow, and just tipped over the threshold to cause problems. (Why does rendering the sun’s shadow cascade take upwards of 5-7ms? I dunno! Something to investigate.)

That said, the method I am using to time GPU operations (glQueryCounter) appears to be less than perfectly accurate (i.e., the UI pass seems to get billed the cost of waiting for vblank.) The GL specification, meanwhile, manages to be either extremely unclear or contradictory on the exact meaning of the timestamp recorded.

For now, I’m going to say this is fine and investigate more later. (ノ`Д´)ノ彡┻━┻

Continuing Work:

At this point, I have most of what I set out to implement, but there are a few things that still need to be done before I’d consider this terrain system complete:

– LOD mesh generation. In addition to simplifying the mesh itself, probably clamp low-weight splats to zero to make shading cheaper.
– Revisit hex tiling. I suspect it really is a more natural fit.
– Fix seams with normals and splats. These are still noticeable with real terrain textures.
– More detailed shading. The terrain materials I’m using all come with height maps; adding occlusion mapping would be interesting and would help sell the look up close.

Until next time.

hikari – New Material System

The next step in building hikari’s terrain system is proper material handling. In order to do so, I need to completely revamp the material system. This is going to be extensive enough to earn its own post.

Current Material System:

Here is the current state of hikari’s material “system”, a single struct:

struct Material {
    GLuint normal_map;
    GLuint albedo_map;
    GLuint metallic_map;
    GLuint roughness_map;
    GLuint ao_map;

    bool has_alpha;
    bool is_leaf;
    bool is_terrain;
};

And its usage, from the per-mesh main render loop:

main_pass.Begin();
main_pass.SetRenderTarget(GL_FRAMEBUFFER, &hdr_buffer_rt_ms);

RenderMesh * rmesh = meshes + i;
IndexedMesh * mesh = rmesh->mesh;
Material * material = rmesh->material;
if (material->has_alpha) {
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
}
else {
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_EQUAL);
}
Assert(mesh && material, "Cannot render without both mesh and material!");

GLuint shader_for_mat = default_shader;
if (material->is_leaf) {
    shader_for_mat = leaf_shader;
}
if (material->is_terrain) {
    shader_for_mat = terrain_shader;
}
main_pass.SetShader(shader_for_mat);

main_pass.BindTexture(GL_TEXTURE_2D, "normal_map", material->normal_map);
main_pass.BindTexture(GL_TEXTURE_2D, "albedo_map", material->albedo_map);
main_pass.BindTexture(GL_TEXTURE_2D, "metallic_map", material->metallic_map);
main_pass.BindTexture(GL_TEXTURE_2D, "roughness_map", material->roughness_map);
main_pass.BindTexture(GL_TEXTURE_2D, "ao_map", material->ao_map);

// Material independant shader setup (lights list, shadow map binding,
// environment maps, etc.), elided for space.

// Draw the mesh.
mesh->SetAttribPointersForShader(main_pass.active_shader);
mesh->Draw();

main_pass.End();

The current notion of material is a set of textures and some flags used to determine the correct shader. Needless to say, this is neither flexible nor particularly efficient. Instead, what I’d like is a Material that consists of two bags of data: render states (shader program, blend mode, depth test) and shader uniforms (texture maps).

New Material System, First Pass:

struct Material {
    GLuint shader_program;
    GLenum depth_mask;
    GLenum depth_func;

    struct Uniform {
        const char * name;
        union {
            float value_f[4];
            Texture value_texture;
        };

        enum UniformType {
            Float1,
            Float2,
            Float3,
            Float4,

            TexHandle,
        } type;
    };

    std::vector<Uniform> uniforms;

    // Uniform setup methods elided

    inline void
    BindForRenderPass(RenderPass * pass) {
        pass->SetShader(shader_program);
        glDepthMask(depth_mask);
        glDepthFunc(depth_func);

        for (u32 i = 0; i < uniforms.size(); ++i) {
             Uniform u = uniforms[i];
             switch (u.type) {
                 case Uniform::UniformType::Float1: {
                     pass->SetUniform1f(u.name, u.value_f[0]);
                } break;
                case Uniform::UniformType::Float2: {
                    pass->SetUniform2f(u.name, u.value_f[0], u.value_f[1]);
                } break;
                case Uniform::UniformType::Float3: {
                    pass->SetUniform3f(u.name, u.value_f[0], u.value_f[1], u.value_f[2]);
                } break;
                case Uniform::UniformType::Float4: {
                    pass->SetUniform4f(u.name, u.value_f[0], u.value_f[1], u.value_f[2], u.value_f[3]);
                } break;
                case Uniform::UniformType::TexHandle: {
                    pass->BindTexture(u.name, u.value_texture);
                } break;
            }
        }
    }
};

A good first step. No more branching to select shaders inside the render loop, no hardcoded set of textures. Still doing some silly things, like re-binding the shader for every mesh. Also still looking up uniform locations every time, though there is enough information here to cache those at load time now.
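
One way to do that caching, as a sketch (a hypothetical extension of the Uniform struct above, not the actual hikari code):

struct Uniform {
    const char * name;
    GLint location;    // cached result of glGetUniformLocation
    // ... value / type fields as before.
};

// Resolve every location once, when the material is created and its shader is known.
void CacheUniformLocations(GLuint shader_program, std::vector<Uniform> & uniforms) {
    for (u32 i = 0; i < uniforms.size(); ++i) {
        uniforms[i].location = glGetUniformLocation(shader_program, uniforms[i].name);
    }
}

// BindForRenderPass can then call glUniform1f(u.location, ...) and friends directly,
// skipping the per-frame name lookup.  A cached location of -1 also makes missing
// uniforms easy to warn about at load time rather than at draw time.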

Let’s look at the entire main pass for a bit:

for (u32 i = 0; i < mesh_count; ++i) {
    main_pass.Begin();
    main_pass.SetRenderTarget(GL_FRAMEBUFFER, &hdr_buffer_rt_ms);
    if (!visible_meshes[i]) {
        continue;
    }
    RenderMesh * rmesh = meshes + i;
    IndexedMesh * mesh = rmesh->mesh;
    Material * material = rmesh->material;

    Assert(mesh && material, "Cannot render without both mesh and material!");

    material->BindForRenderPass(&main_pass);

    main_pass.SetUniformMatrix44("clip_from_world", clip_from_world);
    main_pass.SetUniformMatrix44("world_from_local", rmesh->world_from_local);
    main_pass.SetUniform3f("view_pos", cam->position);
    main_pass.SetUniform1f("time", current_time_sec);

    main_pass.BindUBO("PointLightDataBlock", point_light_data_ubo);
    main_pass.BindUBO("SpotLightDataBlock", spot_light_data_ubo);
    main_pass.BindUBO("LightList", light_list_ubo);

    main_pass.BindTexture(GL_TEXTURE_CUBE_MAP, "irradiance_map", skybox_cubemaps.irradiance_map);
    main_pass.BindTexture(GL_TEXTURE_CUBE_MAP, "prefilter_map", skybox_cubemaps.prefilter_map);
    main_pass.BindTexture(GL_TEXTURE_2D, "brdf_lut", brdf_lut_texture);

    main_pass.BindTexture("ssao_map", ssao_blur_texture);
    main_pass.SetUniform2f("viewport_dim", hdr_buffer_rt_ms.viewport.z, hdr_buffer_rt_ms.viewport.w);

    main_pass.SetUniform3f("sun_direction", world->sun_direction);
    main_pass.SetUniform3f("sun_color", world->sun_color);
    main_pass.BindTexture("sun_shadow_map", sun_light.cascade.shadow_map);
    for (u32 split = 0; split < NUM_SPLITS; ++split) {
        char buffer[256];
        _snprintf_s(buffer, sizeof(buffer) - 1, "sun_clip_from_world[%d]", split);
        main_pass.SetUniformMatrix44(buffer, sun_light.cascade.sun_clip_from_world[split]);
    }
    mesh->SetAttribPointersForShader(main_pass.active_shader);
    mesh->Draw();
    main_pass.End();
}

There is still a *lot* of uniform setup going on per-mesh, and almost all of it is unnecessary. But, since Material binds the shader each time, all of the other uniforms need to be rebound (because BindForRenderPass() may have bound a different shader).

Ideally, here’s the inner loop we’re aiming for:

for (u32 i = 0; i < mesh_count; ++i) {
    RenderMesh * rmesh = meshes + i;
    IndexedMesh * mesh = rmesh->mesh;
    Material * material = rmesh->material;
    Assert(mesh && material, "Cannot render without both mesh and material!");

    material->BindForRenderPass(&main_pass);
    main_pass.SetUniformMatrix44("world_from_local", rmesh->world_from_local);
    mesh->SetAttribPointersForShader(main_pass.active_shader);
    mesh->Draw();
}

Material Instances:

When rendering the Sponza test scene, there are a few dozen materials loaded. However, there are only 3 different sets of render states: leaves (a shader doing alpha test and subsurface scattering), alpha tested, and default PBR. Within each class of material the only difference is the texture set.

If we were to separate the set of render states and the set of uniforms into different entities, we’d be able to minimize modifications to the render state in this loop. So that’s what I’ve decided to do.

A Material is a bag of render states, while a MaterialInstance is a bag of uniforms associated with a particular Material. For example, the vases, walls, columns, etc. in Sponza would all be instances of the same Material. If we sort and bucket the mesh list according to Materials, we only need to bind the full render state for each material once. (This is also a convenient point to eliminate culled meshes, removing the visibility check in the main loop.)
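
MakeBucketedRenderList isn’t listed in this post; as a rough sketch, it only needs to filter out culled meshes, sort the rest by their instance’s parent Material (each instance keeps a pointer to its parent, as described just below), and record the contiguous ranges. The struct and field names here match the usage that follows, but the body is an illustration, not the actual implementation:

#include <algorithm>
#include <cstdlib>

BucketedRenderList
MakeBucketedRenderList(RenderMesh * meshes, u32 mesh_count, bool * visible_meshes) {
    BucketedRenderList list = {};
    list.mesh_list = (RenderMesh *)calloc(mesh_count, sizeof(RenderMesh));
    list.mesh_count = 0;

    // Drop culled meshes here, so the render loop never sees them.
    for (u32 i = 0; i < mesh_count; ++i) {
        if (visible_meshes[i]) {
            list.mesh_list[list.mesh_count++] = meshes[i];
        }
    }

    // Sort so meshes sharing a parent Material are contiguous.
    std::sort(list.mesh_list, list.mesh_list + list.mesh_count,
              [](const RenderMesh & a, const RenderMesh & b) {
                  return a.material_instance->material < b.material_instance->material;
              });

    // Record a [start, end) range per Material.
    u32 start = 0;
    for (u32 i = 1; i <= list.mesh_count; ++i) {
        if (i == list.mesh_count ||
            list.mesh_list[i].material_instance->material !=
                list.mesh_list[start].material_instance->material) {
            RenderBucket bucket;
            bucket.material = list.mesh_list[start].material_instance->material;
            bucket.start = start;
            bucket.end = i;
            list.buckets.push_back(bucket);
            start = i;
        }
    }
    return list;
}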

At least for now, I’ve done this in the most naive way possible; the list of uniforms is removed from Material and becomes MaterialInstance. Each instance also contains a pointer to its parent material. Done!

This is not a great solution; there are a lot of ways for one to shoot themselves in the foot. For example, a MaterialInstance that doesn’t contain the full set of expected uniforms (will render with stale data), or one containing extras (will assert when the uniform bind fails). The Material should probably have a “base instance” that defines what set of uniforms are required, and defaults for them; each instance would validate calls against this base. I have not implemented this yet.

Here’s where we end up with MaterialInstance:

BucketedRenderList render_list = MakeBucketedRenderList(meshes, mesh_count, visible_meshes);

for (RenderBucket bucket : render_list.buckets) {
    main_pass.Begin();
    main_pass.SetRenderTarget(GL_FRAMEBUFFER, &hdr_buffer_rt_ms);

    main_pass.BindMaterial(bucket.material);

    // Additional global setup, as above.

    for (u32 i = bucket.start; i < bucket.end; ++i) {
        assert(i < render_list.mesh_count);
        RenderMesh * rmesh = &render_list.mesh_list[i];
        IndexedMesh * mesh = rmesh->mesh;
        MaterialInstance * material_instance = rmesh->material_instance;

        RenderPass main_subpass = main_pass.BeginSubPass();

        main_subpass.BindMaterialInstance(material_instance);
        main_subpass.SetUniformMatrix44("world_from_local", rmesh->world_from_local);
        mesh->SetAttribPointersForShader(main_subpass.active_shader);
        mesh->Draw();

        main_pass.EndSubPass(main_subpass);
    }

    main_pass.End();
}

(Material binding has been moved into RenderPass, which is in hindsight a more natural place for it to live.)

By carving the mesh list up into buckets by Material, we only need to do the global setup once for each bucket, and then only set the specific instance materials per-mesh. Even better, the check for culled objects disappears. And this list is reusable between the depth prepass and main render pass.

Still a lot of room for performance improvement: the RenderMesh struct causes a lot of unnecessary pointer chasing, meshes using the same instance of a material could be batched as well, there are *sprintf* calls in the outer loop. It’s pretty clear I need to spend a lot more time here.

However, this is progress! More importantly, Materials are now general enough I can implement the terrain materials. So that’s next.

hikari – Terrain Mesh Generation

Continuing on from the last post, I have implemented the basic mesh generation for heightmapped terrain. This was probably the easiest part, but did require some cleanup of existing code first.

Mesh Handling Cleanup:

Up to this point, meshes have not been dealt with in a systematic manner in hikari. The Mesh struct contained buffer handles and an index count and that was it. All meshes were assumed to be triangle lists. This was workable, because all object meshes had a common vertex layout, and rendered with variants of a single shader with fixed attribute indices. Rendering a mesh was easy enough:

glBindVertexArray(mesh->vao);
glBindBuffer(GL_ARRAY_BUFFER, mesh->vertex_buffer);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, mesh->index_buffer);

glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);
glEnableVertexAttribArray(2);
glEnableVertexAttribArray(3);

glDrawElements(GL_TRIANGLES, mesh->index_count, GL_UNSIGNED_INT, NULL);

glDisableVertexAttribArray(3);
glDisableVertexAttribArray(2);
glDisableVertexAttribArray(1);
glDisableVertexAttribArray(0);

There were already other meshes with different layouts, but they were special cases: a quad used for full-screen render passes (with only position and uv), a cube used to render the skybox (position only), etc.

Rather than hack around the limitations here (such as using separate render pass for terrain), I decided a mesh should know its vertex layout and be able to do the attribute setup itself.

Enter VertexBufferLayout:

struct VertexBufferLayout {
    struct Property {
        char * name;
        u32 size;
        GLenum type;
        bool normalized;
        u32 offset;
    };
    
    u32 stride;
    std::vector<Property> props;

    u32 GetPropertySizeBytes(Property p) {
        static const u32 TypeToByteTable[] {
            GL_BYTE, 1,
            GL_UNSIGNED_BYTE, 1,
            GL_SHORT, 2,
            GL_UNSIGNED_SHORT, 2,
            GL_INT, 4,
            GL_UNSIGNED_INT, 4,
            GL_HALF_FLOAT, 2,
            GL_FLOAT, 4,
            GL_DOUBLE, 8,
        };

        for (u32 i = 0; i < array_count(TypeToByteTable); i += 2) {
            if (TypeToByteTable[i] == p.type) {
                u32 byte_size = TypeToByteTable[i + 1];
                return byte_size * p.size;
            }
        }
        return 0;
    }
    
    void AddProperty(const char * name, u32 size, GLenum type, bool normalized) {
        Property p;
        p.name = name;
        p.size = size;
        p.type = type;
        p.normalized = normalized;
        p.offset = stride;
        u32 byte_size = GetPropertySizeBytes(p);
        if (byte_size > 0) {
            stride += byte_size;
            props.push_back(p);
        }
        else {
            Assert(false, "Invalid vertex buffer property type.");
        }
    }
};

No real surprises here, just stuffing everything into an array, with a bit of extra bookkeeping to prevent errors in calculating stride and offset pointers. This isn’t a general solution: it assumes vertex attributes are always stored interleaved (not necessarily true), and it does not permit a base offset (for storing multiple meshes in one vertex buffer). But it works, and there’s little point buying more complexity than needed.
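As a usage sketch, the common object mesh layout (four attributes, matching the fixed indices 0-3 used previously) might be declared like this. The attribute names and exact composition here are assumptions; they just need to match the shader:

VertexBufferLayout object_layout;
object_layout.AddProperty("position", 3, GL_FLOAT, false);
object_layout.AddProperty("normal",   3, GL_FLOAT, false);
object_layout.AddProperty("tangent",  3, GL_FLOAT, false);
object_layout.AddProperty("uv",       2, GL_FLOAT, false);
// object_layout.stride is now 44 bytes, and each Property records its own offset.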

Meshes each store a pointer to a layout. Prior to rendering, the renderer passes the shader handle to the mesh, which then uses the handle plus its layout data to determine attribute indices in the shader and bind the pointer to the buffer:

void
SetAttribPointersForShader(GLuint shader_program) {
    Assert(layout != NULL && attrib_indices.size() == layout->props.size(),
            "Invalid vertex buffer layout.");

    glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer);

    for (u32 i = 0; i < attrib_indices.size(); ++i) {
        VertexBufferLayout::Property p = layout->props[i];
        attrib_indices[i] = glGetAttribLocation(shader_program, p.name);
        if (attrib_indices[i] != -1) {
            glEnableVertexAttribArray(attrib_indices[i]);
            glVertexAttribPointer(attrib_indices[i], p.size, p.type, p.normalized, layout->stride, (void *)p.offset);
            glDisableVertexAttribArray(attrib_indices[i]);
        }
    }
    
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

(Ideally we’d like to cache this — the index lookup should only need to be run once per layout/shader combination, not per-mesh.)
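Just to make that concrete, a hypothetical cache keyed on the (shader, layout) pair might look something like this; none of these names exist in the engine yet:

#include <map>
#include <utility>
#include <vector>

// Hypothetical: cache attribute locations per (shader, layout) pair, so the
// glGetAttribLocation loop runs only the first time a combination is seen.
static std::map<std::pair<GLuint, const VertexBufferLayout *>, std::vector<GLint>> attrib_cache;

const std::vector<GLint> &
GetAttribIndices(GLuint shader_program, const VertexBufferLayout * layout) {
    auto key = std::make_pair(shader_program, layout);
    auto it = attrib_cache.find(key);
    if (it == attrib_cache.end()) {
        std::vector<GLint> indices(layout->props.size());
        for (u32 i = 0; i < (u32)layout->props.size(); ++i) {
            indices[i] = glGetAttribLocation(shader_program, layout->props[i].name);
        }
        it = attrib_cache.insert(std::make_pair(key, indices)).first;
    }
    return it->second;
}

SetAttribPointersForShader could then copy the cached indices into attrib_indices instead of querying the shader every time.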

Now we have the list of attribute indices to enable, so we can render very similarly to how we did before.

This still leaves a lot to be desired, but it is enough to unblock the terrain work, so let’s get back on track.

Terrain Mesh Generation:

After all that, generating the terrain mesh itself seems almost trivial. Here is the entire loop for building the vertex positions:

u32 vertex_count = x_count*y_count;
TerrainVertex * verts = (TerrainVertex *)calloc(vertex_count, sizeof(TerrainVertex));

float xo = x_count * 0.5f;
float yo = y_count * 0.5f;
Vector3 bounds_min(FLT_MAX, FLT_MAX, FLT_MAX);
Vector3 bounds_max = -bounds_min;

for (u32 y = 0; y < y_count; ++y) {
    for (u32 x = 0; x < x_count; ++x) {
        float map_val = map->ReadPixel(x_off + x, y_off + y);

        Vector3 pos = Vector3(
            x - xo,
            map_val, 
            y - yo);

        // Scale up from pixel to world space.
        pos *= chunk_meters_per_pixel;

        verts[y * x_count + x].position = pos;

        bounds_min = MinV(bounds_min, pos);
        bounds_max = MaxV(bounds_max, pos);
    }
}

This is a regular grid of squares in the XZ plane, with the Y coordinate taken directly from the heightmap. Next we build an index buffer (in the most naive manner), and calculate surface normals from the resulting triangles.
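For completeness, the naive index build looks something like this (a sketch, not the exact engine code; the winding is one reasonable choice, flip it if your front-face convention differs):

u32 quad_count = (x_count - 1) * (y_count - 1);
u32 index_buffer_count = quad_count * 6;
u32 * indices = (u32 *)calloc(index_buffer_count, sizeof(u32));

u32 idx = 0;
for (u32 y = 0; y < y_count - 1; ++y) {
    for (u32 x = 0; x < x_count - 1; ++x) {
        // The four corners of this grid quad.
        u32 i0 = (y    ) * x_count + (x    );
        u32 i1 = (y    ) * x_count + (x + 1);
        u32 i2 = (y + 1) * x_count + (x    );
        u32 i3 = (y + 1) * x_count + (x + 1);

        // Split every quad along the same diagonal (i1-i2).
        indices[idx++] = i0;
        indices[idx++] = i2;
        indices[idx++] = i1;

        indices[idx++] = i1;
        indices[idx++] = i2;
        indices[idx++] = i3;
    }
}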

All of this is pretty straightforward, but there are a few subtle issues that come up:

– Triangulation. Unlike a triangle, the four vertices of a quad do not necessarily all fall in one plane; with terrain, in fact, they *usually* don’t. So how we decide to split each quad into triangles has an effect on the final shape:

quad_triangulation_illust

Currently, I split all quads uniformly. Another option I explored is switching the split direction in a checkerboard fashion. With the noise-based heightmap I’m testing with, the difference is negligible. This may prove more of an issue when producing sparser, low-LOD meshes for distant terrain. It also would *absolutely* matter if the terrain was flat-shaded (i.e., constant normal for each face), but that isn’t the case here.
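For what it’s worth, the checkerboard variant is only a small change when emitting each quad’s indices (using the same corner numbering i0..i3 as the index sketch above); a sketch:

// Flip the split diagonal on alternating quads (checkerboard pattern).
if (((x + y) & 1) == 0) {
    // Even quads: split along the i1-i2 diagonal, as before.
    indices[idx++] = i0;
    indices[idx++] = i2;
    indices[idx++] = i1;

    indices[idx++] = i1;
    indices[idx++] = i2;
    indices[idx++] = i3;
} else {
    // Odd quads: split along the i0-i3 diagonal instead.
    indices[idx++] = i0;
    indices[idx++] = i2;
    indices[idx++] = i3;

    indices[idx++] = i0;
    indices[idx++] = i3;
    indices[idx++] = i1;
}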

– Calculating vertex normals. I calculate normals for this terrain mesh the same as any other mesh: loop through the faces and add face normals to each vertex, then normalize in another pass:

for (u32 i = 0; i < index_buffer_count; i += 3) {
    u32 idx_a = indices[i + 0];
    u32 idx_b = indices[i + 1];
    u32 idx_c = indices[i + 2];

    Vector3 a = verts[idx_a].position;
    Vector3 b = verts[idx_b].position;
    Vector3 c = verts[idx_c].position;

    Vector3 normal = Cross(a - b, a - c);

    verts[idx_a].normal += normal;
    verts[idx_b].normal += normal;
    verts[idx_c].normal += normal;
}

// We sum the unnormalized normals for each triangle, and then normalize
// the sum.  Since the cross product's magnitude is proportional to the
// area of the triangle, larger triangles contribute more heavily to the
// resulting vertex normal.
for (u32 i = 0; i < vertex_count; ++i) {
    verts[i].normal = Normalize(verts[i].normal);
}

So far so good, but there is a crucial problem here.

We iterate the faces of the mesh. But some of the points in question actually belong to several meshes! When we’re at the edge of the terrain chunk, we only account for the face normals of the triangles inside the current chunk. Which means the same point, on two different chunks, may produce a different normal.

And so, we get seams:

normal_seams

This is not necessarily easy to fix. The simplest option would be to iterate over neighboring chunks, find the shared vertices, and average the normals — not 100% correct, but it eliminates the seam. It’s unclear, though, how this would interact with streaming: a new chunk being loaded might change the normals on a mesh that’s been loaded for a while; should we read that mesh data back from the GPU? Keep a CPU copy and update? In addition, one of the design goals is for the terrain system to be agnostic about the source of meshes. For a well-structured grid finding the adjacent vertex is simple, but how does that work with an arbitrary hand-produced mesh?
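To make that simplest option concrete, here is a sketch for the regular-grid case (not implemented in hikari). It assumes two horizontally adjacent chunks share their edge column of vertices, and it ignores the corner vertices shared by four chunks:

// Sketch only: average the normals along the shared edge of two adjacent
// grid chunks, so both chunks agree at those vertices.
void StitchChunkEdgeNormals(TerrainVertex * left_chunk, TerrainVertex * right_chunk,
                            u32 x_count, u32 y_count) {
    for (u32 y = 0; y < y_count; ++y) {
        // The last column of the left chunk and the first column of the right
        // chunk are the same points in world space.
        TerrainVertex * a = &left_chunk[y * x_count + (x_count - 1)];
        TerrainVertex * b = &right_chunk[y * x_count + 0];

        Vector3 stitched = Normalize(a->normal + b->normal);
        a->normal = stitched;
        b->normal = stitched;
    }
}

As noted, this isn’t quite correct (each chunk still only sees its own faces), and it doesn’t answer the streaming questions above about when to run it or how to get the changed vertices back onto the GPU.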

Ultimately, there’s also the question of whether this seam in the normals actually matters. Obviously it *is* a bug, but it is only so glaringly visible because of the shiny plastic material I’m using to debug. I’m not sure how many of these artifacts will remain visible when applying actual terrain materials.

I’m counting this as a bug but I’m going to leave it alone for now.

– Alternate tiling strategies. I started by laying a square grid over the world. Why squares? There are other good tiling arrangements. In fact, we can tile the plane with *just* triangles, and avoid the question of how to triangulate our tiles altogether.

It turns out this is *really* easy to do:

pos.x += (y & 1) * 0.5f;  // Offset odd rows by half a unit
pos.z *= sqrtf(3.0f) / 2.0f; // Scale the row spacing (Z) by sin(pi/3)

Make sure the triangles cut across the short diagonal of that skewed grid, and now the plane is tiled with equilateral triangles.
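In context, I read these as applying in grid units, i.e. before the pixel-to-world scale in the vertex loop above (the half-unit offset being half a grid cell). Roughly:

Vector3 pos = Vector3(
    (x - xo) + (y & 1) * 0.5f,        // Offset odd rows by half a grid unit.
    map_val,
    (y - yo) * (sqrtf(3.0f) / 2.0f)); // Scale the row spacing by sin(pi/3).

// Scale up from pixel to world space, as before.
pos *= chunk_meters_per_pixel;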

There isn’t a huge amount of difference with my noise terrain (just a scaling on the Z axis).


However, tiling square chunks with equilateral triangles has a sticking point when it comes to downsampling the terrain for distant LOD meshes. The sawtooth edge can’t be merged into larger triangles without either breaking the tiling pattern or leaving gaps or overlap between chunks.

With that said, I feel like the triangle tiling has some really nice properties. No triangulation needed, and no ambiguity on the final mesh’s shape. It can still be represented easily in an image map (although we’re pushing the limits on that). Grids make it easy to create 90-degree angles, which stick out, while a triangle tiling makes 60-degree angles, which are less obviously objectionable. (This one is decidedly a matter of taste, but: my engine, my taste.)

Things get *really* interesting when we decide chunks don’t need to be rectangular. A hexagonal chunk tiles perfectly with equilateral triangles, which means the LOD reduction works fine as well. Using hexes for chunks also fits into the half-offset tiling described by Alan Wolfe here. (In addition to the memory savings he mentions in the post, the hex tiling also means a constant streaming cost as you move about the world, which is a very nice property to have.)

I haven’t implemented hex tiling, and this post is getting quite long as it is. I do think it is an avenue worth exploring, and will probably return to this after fixing up the material system.

Until next time!