Research & Development

We use Perforce for revision control at Insomniac. Most significantly, we manage a large number of large binary assets and there are very few other RCSs that would handle that well. However, we are always interested in looking at other systems and we have a few people here who are fans of git for source code control specifically. Jonathan Adamczewski (Engine Programmer) is among those and presented the following to interested programmers as an introduction to git (and comparison to Perforce).

An Introduction to the Git Revision Control System

As part of our Core team internal training, Mike Day (Sr Engine Programmer, working on our Render team) has been guiding small groups of other engine programmers (specifically those NOT on the render team) to read an academic paper and assist them with whatever background is necessary so that they can explain it. With any specialized discipline, there is a lot of domain knowledge presumed in the explaination of an idea and helping others pick up that background is part of the goal of this exercise. Academic papers in particular also tend to be extremely hard to parse, unnecessarily verbose and missing details crucial for replicating results. Filling those gaps is another goal. Finally, getting folks that don’t normally work together every day to discuss ideas which interest them but does not have an immediate impact on their work has real value.

In this installment, Mike picked the 2001 paper, An Efficient Representation for Irradiance Environment Maps (PDF) by Ravi Ramamoorthi and Pat Hanrahan.

Below is the presentation from the group on their understanding of that paper.
Mike Day Paper Exercise – An Efficient Representation for Irradiance Environment Maps

At Insomniac we’ve been using the well-known Reinhard tone mapping operator for a while now, but have found its parameterization unintuitive, and its curve shape to be lacking the useful ‘toe’ feature. Presented here is an efficiently computed alternative model which has been designed to qualitatively capture the toe and shoulder features while at the same time using intuitive parameters.

Read more: an efficient and user-friendly tone mapping operator.pdf

This document (by Mike Day) accompanies the SIGGRAPH presentation (by Mike Acton) “CSM Scrolling: An acceleration technique for the rendering of cascaded shadow maps” and will explain how a bitmap-scrolling technique can be combined with a shadow map caching scheme to significantly increase the performance of real-time cascaded shadow mapping in games. The two systems integrate well into the standard model of cascaded shadow mapping, and take advantage of temporal coherence to preserve much of the rendered shadow map information across frames. The technique is well-suited to current games consoles, and will ship in “Overstrike”, the forthcoming title by Insomniac.

CSM-Scrolling.pdf

Here’s a short account of an easily overlooked difficulty with vector length and vector normalization functions, together with one way of solving the problem. We’ll use 3-component vectors by way of illustration, but the idea is easily extended to longer or shorter vectors, quaternions, etc. Single-precision floating point is assumed.

Vector-length-and-normalization-difficulties.pdf

This article attempts to fix a problem which came up when implementing Ken Shoemake’s Euler angle extraction in the context of a single-precision floating point library. The original Shoemake code uses double precision, which presumably maintains sufficient precision for the problem not to arise.

euler angles.pdf (Updated 2/24/2014 with typo noted by John-Michael Fischer below fixed.)

A ‘spotlight cone’ can be pictured as an ordinary right circular cone, the base of which has been inflated outwards to form part of a spherical surface centred on the cone’s apex. This shape is useful in lighting systems for finding all the objects which could be influenced by a given spotlight. This document shows a stripped-down calculation for testing whether a given spotlight cone overlaps a sphere, typically the bounding sphere of an object which may be lit. The calculation can be used as the basis for an optimized SIMD test for checking a batch of several spheres against a given spotlight cone.

spotlight-cone-vs-sphere.pdf

We follow the method of separating axes, optimized for the case at hand. The assumed scenario is a looping SIMD implementation, where it is often beneficial to perform the full calculation for each test unconditionally. In other situations, it may be preferable to early-out as soon as a separating axis is found. Also, the usual approach of preceding the test with a quick and conservative one to prune out easily classified cases, using bounding spheres for example, applies.

Read: obb vs obb.pdf

Reposted from Andreas’s blog:

At Insomniac Games we’re currently doing a performance push which means lots of opportunity to look at compiler output. In this post I want to share a story on fighting a particular compiler optimization that was causing me problems.

One of my areas has been the decal system which is very heavy on 3d data processing — a perfect system for SIMD work. While working on one of the bigger compute loops I noticed something odd about the generated PPC code. All of the actual computation work was being done on the vector unit (as it should!), but there was a lot of stack traffic originating from the integer unit. The compiler was overcommitting integer registers it seemed, in what was essentially a pure-SIMD loop. Puzzled, I dug in and tried to work out what was going on.

Let’s look at the code. The loop I was working on processes 4 triangles at a time and has a structure like this:

Vertex* input = ...;
Vertex* output = ...;
int count = ...;

for (int i = 0; i < count; i += 12)
{
  VecSimd t1a = SimdLoad(&input[0].m_Position);
  VecSimd t1b = SimdLoad(&input[1].m_Position);
  VecSimd t1c = SimdLoad(&input[2].m_Position);
  // ... 9 more loads
  VecSimd n1a = SimdLoad(&input[0].m_Normal);
  VecSimd n1b = SimdLoad(&input[1].m_Normal);
  VecSimd n1c = SimdLoad(&input[2].m_Normal);
  // ... 9 more loads
  
  // (crunching)

  SimdStore(out0, &output[0].m_Position);
  SimdStore(out1, &output[1].m_Position);
  SimdStore(out2, &output[2].m_Position);
  // lots more stores

  // increment input and output pointers
  input += 12;
  output += 12;
}

What I found was that the vector loads were all being done from different (integer) registers and they were all being incremented individually by the compiler where I was expecting a single base register and 12 offsets. Because there are so many loads and stores going on, the compiler would run out of registers and start spilling them to the stack, generating load-hit-stores and memory penalties all over the loop!

Mike Day offered the following insight:

… in the case where there’s a sensible number of pointers – small enough 
not to incur spilling onto the stack – the ‘create n pointers’ approach 
is sensible for the compiler as long as it follows through by doing all 
the loads using the indexed addressing form of the load instruction. 

[...] 

The benefit is that instead of incrementing n pointers per pass 
(or one pointer n times), it only needs to increment a single offset by 
n times the stride, saving n-1 adds.

It turns out the optimizers for both our PPC compilers were so keen on using the indexed vector load instruction with a single increment that they would blindly overcommit the integer register pool to achieve that particular instruction selection.

The question then became–how do we force the optimizer to stop doing it? The optimizer can see that the input and output pointers are being moved consistently by a certain stride and any attempt on my part to use variations in the indexing expressions just ended up generating slightly different variations of the same bad behavior.

So let’s look at what we’d want if we were writing the loop ourselves in assembly. In this case a better code generation would be to have a single input register and increment it after each SIMD load, using just one register. The key to accomplish this code generation is in C++ to invalidate the optimizer’s assumptions about the pointers and strides so it can’t create an array of independent pointers. But how do we do that? After all, everything here is using linear memory accesses with a stride that’s well known at compile time, enabling the unwanted optimization.

Time for some trickery! The loop structure I ended up using looks like this:

static volatile uintptr_t s_secret_sauce = uintptr_t(ptrdiff_t(-1));
static volatile uint32_t  s_stride       = sizeof(Vertex);

// load our constants in from memory into registers
const uintptr_t secret_sauce             = s_secret_sauce;
const uint32_t  stride                   = s_stride;

Vertex* input = ...;
Vertex* output = ...;
int count = ...;

for (int i = 0; i < count; i += 12)
{
  // establish base pointer for all loads in the loop
  uintptr_t base = (uintptr_t(input) & secret_sauce) +
                   offsetof(Vertex, m_Position);

  // load and bump base pointer with stride
  VecSimd t1a = SimdLoad((void*)base); base += stride;
  VecSimd t1b = SimdLoad((void*)base); base += stride;
  VecSimd t1c = SimdLoad((void*)base); base += stride;

  // ... more loads
 
  // update input pointer for next loop iteration
  input = (Vertex*) base;

  // ... rest of loop  

  // ... stores handled similarily
}

There are two key things going on in this code that breaks the optimization pattern:

  • We’re loading stride and secret_sauce from memory before the loop, thereby preventing any compile-time knowledge of them.
  • We’re breaking any loop-wide analysis of the input and output pointers by masking with a value that is unknown to the optimizer, forcing it to rely on the sequence of statements we’ve laid out exactly.

This generated the desired instruction selection and removed all stack spills from the loop. In one particular case I was using as a performance test case this saved over 30k instructions over the loop lifetime, a significant chunk of work. It also removed several hundred load-hit-store penalties.

This does generate an additional and instruction which is an artifact of the technique, but compared to the much worse stack spilling code this the way better deal. The and could also be replaced with an add of zero or something similar, but it will just amount to the same 1 or 2-cycle overhead in the end.

This has to have been the first time I’ve used static volatile variables to improve performance. Hopefully it’s useful to someone else encountering this behavior!