Graphics Programming Tips: Branch Removal

After a rather long hiatus I’ve decided to return once more to drop some new nuggets of learning I’ve picked up along my travels…

In this post I’m  going to be dealing with the subject of modern graphics development & a topic that alot of studios probably do but likely don’t tell anyone how they do it (in detail)…

In modern day console hardware we generally find that dynamic branching sucks..
This generally wouldn’t be a problem outside of the fact that sometimes when trying to maintain a well structured and organised shader library for your game, should you choose to go the uber shader route, you generally need to use some form of branches or switches to determine which features of the shader are used by each model material.
In an ideal world we’d just stick a load of “if” statements in the shader code (like we do in most modern HLLs) and walk away, however the apparent cost of these simple branches on a piece of hardware thats built for maximising data throughput, can easily render our efforts prohibitively crippled performance-wise…

So what’s the solution?

In short, static branch removal…

Now those of you who are familiar with uber shaders will know exactly what this is however for the rest of you who don’t I’ll try to explain..
The aim is to mitigate the cost of dynamic branches in your uber shader by effectively trading computation time for memory. This is done by simply compiling your shader multiple times, with each “permutation” defined by turning on or off various parts of/features in the code. These collections of shaders can then be used at run time so when a material is set to be rendered it chooses the correct permutation it needs & runs that. This method provides the benefit of both maintaining the centralisation of code of the uber shader paradigm whilst providing a necessary means of significantly simplifying the compiled code for materials that only use a subset of the uber shader’s featureset.

Sounds easy right?

Not so much in reality as it can take some careful planning and, depending on the shader asset pipeline for your game, may be much harder to engineer and integrate than initially perceived.
As far as the core implementation goes however, there are various approaches that can be taken in order to do this and since branch removal is performed as a pre-process before compilation, it can be done in any number of ways. The most flexible approach would probably be in setting up a tool that parses the shader source and locates branches on literal “if” statements on any feature switch/bool shader constant of interest, generating duplicate sources of the code with and without the conditional-code-to-be-executed present or not. This is however a very complex task & not one I’d advice anyone to attempt as it would take a lot of work to get the parsing code robust enough to be useable in the real world.
The alternative however is to get the compiler to do it for you by replacing branches of interest in the code with special macros defined which enable you to configure at compile time the code that is and isn’t present in the shader.
In general I’ve performed this step by doing the following:-

  • In the shader or a shader header file the preprocessor checks if a macro to define branch removal macros (brms) is defined & if not, defines it setting it to false (e.g. think #ifndef DEFINE_BRANCHREMOVALMACROS  #define DEFINE_BRANCHREMOVALMACROS=0 etc..)
  • If your brm’s define macro is set to false, define your brms to resolve to bog standard “if” statements (this allows an environment that’s not setup to deal with your branch removal system to still be able to compile and run your shader, useful if your artists are using it directly for visualising materials in Maya or FXComposer for example..) otherwise resolve them to a special case “if” statement where your condition is now a compiler input macro
  • During branch removal the compiler is setup and the brm’s define macro is set as a compiler macro define input as well as a list of branch condition input macros which resolve directly to either true or false depending on the current permutation you wish to generate
  • The compiler is then run multiple times with each run modifying the values of the branch condition input macros in order to generate different code for each permutation. This works as each branch condition now resolves fully to either “if( true )” or “if( false )” by the compiler, which will then proceed to optimise out/in code, mitigating the generation of any branch instructions in the compiled binary

In order to do this your system still needs to be able to somehow analyse the shader source code and identify all the dynamic branch bools/switches used in the vertex and pixel programs as it’s these bools/switches you’ll be replacing with your macros when the time comes to compile. If your shaders are strictly Cg, NVIDIA provides a Cg API that allows you to quickly and conveniently interrogate your shaders in this way in order to retrieve the information needed.

Once you’ve compiled your list of switches you’ll need to categorise and tag your permutations so that you can identify what configuration of these bools each permutation represents. This will allow you to select the correct shader permutation at run-time for each material by matching the permutation’s switch configuration tag against that of the correct shader binary, probably via some sort of hashing setup.
In the implementation I did, I used a 16-bit bitmask to represent the hashkey for each configuration of the uber shader where each bool was added into a switch-table (made up of a bit shift position and id pair) & each one represented a single bit in the mask. This meant that no uber shader could define anymore than 16 switches however, given that the total number of permutations for any given shader program  is 2^n (where n = number of switches), I figured trying to generate more than 2^16 or 65,536 possible permutations would be undesirable at best (both in terms of run-time memory overhead and compilation-time).
Also one of the benefits of using the 16-bit bitmask key was that the algorithm for generating the switch values for compiling the shader programs was simple as I could simply iterate from 0 to the maximum number of configurations of my shader, with each iteration simply taking the iterator index value, treating it as a bit mask and using those values to set each branch condition input macro values (e.g. MYMACROSWITCH=true or false).

Once you’ve got this far the only thing left to do is re-wire your run-time to load each group of permutations in-place of the old, single shader & during render setup, as each material sets the values of the bools to configure the shader for use, the system internally takes each bool, searches the switch-table for the corresponding bit in the current active config mask and sets the bit to the value of the bool for the material. Then when it comes to render with the material, the current active config mask is used to index into the hash structure of shader program permutations to select the correct one for use…

That’s about it folks!

Not the most straight-forward  system to engineer however a little patience, thought and proper code design can go along way into helping you put something together thats both robust and scalable.
Other factors to note also are generally related to the memory-footprint of the system and with so many permutations of each shader, you can quickly end up in a situation where it’s easy to run out of memory if you’re not careful. There are however steps that can be taken to reduce the memory load however these require a much more analytical approach as they’re generally much more specific to the needs of your team/game/engine (i.e. stripping permutations of shaders you know will never be used either because they don’t logically make sense given the semantics of the switches themselves, or because they’re known to never get assigned for use at run-time, if material parameters never change for example…)

For more details on this however you’ll probably want to check out an awesome tri-ace GDC presentation [here].

That’s all from me for now…

Hope this post has been useful and insightful…

More to come in future

– The Hog

Graphics Programming Tips: Branch Removal

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s