Research & Development
July 07, 2010

Procedural Content Generation in RCF2

Posted by Ryan Smith, Gameplay Programmer
Procedurally generated content has the potential to add interesting, replayable content to our games while avoiding spiraling development costs. Here is a presentation from Gameplay Programmer Chaz Pratt about his research into procedurally generating arena challenges for RCF2. Chaz describes his method for exposing procedural generation through our Lua scripting system, as well as the challenges he encountered trying to integrate procedural content into the game engine. Although this system was not used in the final game, it was a worthwhile investigation into a promising approach to content creation.
 
+9
June 01, 2010

A Dynamic Component Architecture for High Performance Gameplay

Posted by Terrance Cohen, Lead Systems Programmer
Delivered at GDC Canada on May 7, 2010, this presentation details a dynamic component architecture used in the Resistance franchise to represent aspects of entity and systems' behavior. This component system addresses several weaknesses of the traditional game object model for high-performance gameplay, particularly in multi-threaded or multi-processor environments. Dynamic components are allocated and deallocated on-demand from efficient memory pools, and the system provides a convenient framework for running updates in parallel and on different processors such as SPUs. The system can be layered on top of a traditional game object model, so a codebase can gradually migrate to this new architecture. The presentation discusses motivations and goals of the system as well as implementation details.
+2
April 13, 2010

2010-04-13 - TGC - MEL Scripting for Animators

Posted by Paul Robbins
With a little coding knowledge you can turn repetitive tasks in Maya into the click of a button. This presentation will show animators the basics of MEL scripting by introducing MEL commands, variables, syntax and the script editor. Walking through the creation of an example script will illustrate how a script is written and what it can do. Resources including the MEL command library, a few helpful books and links for further learning will also be presented.
+3
March 14, 2010

Three Big Lies: Typical Design Failures in Game Programming (GDC10)

Posted by Mike Acton, Engine Director
Recovering from another inspiring GDC. This presentation went pretty well I think. Unfortunately we had to turn away something like 200 people because the room wasn't big enough. And quite a lot of those people were exactly the folk I wanted to get feedback from! So my appologies to anyone who came to the door and wasn't able to get in.
 
I received a lot of comments from people afterward that it inspired them to rethink their approach. Which is absolutely the best result I could hope for. There are a lot of poor practices with years of traction in the programming industry as a whole (not just game development) and I think it's well past time that we take the time to remember our first principles and keep ourselves from straying further.
 
Fundamentally this whole thing comes down to a simple truth: Computers transform data and nothing else. It's our job as programmers to make sure that data gets transformed. Our job is *not* to write code, but to make sure that data is transformed properly. And when we forget that, we end up with poorly-performing, over-complex, over-abstracted programs that simply do not do the job they need to do and cannot be made to do it well without heroic effort, if it's at all possible.
 
Nothing in this presentation should come as a real surprise to those that know me, but I absolutely believe it remains a problem and it's a problem I'm passionate about fixing.
 
Mike.
 
+7
March 03, 2010

Synchronizing Bots

Posted by Ross McIntosh, Gameplay Programmer
Here’s an introduction to a method for synchronizing AI in multiplayer games. It addresses an implementation that uses the Sync Host system that was described in the previous presentation as well as outlining the methods it uses to synchronize the many different aspects of AI. You’ll find details on the rationale behind why systems are synchronized in the way that they are, as well as the problems and solutions programmers have to keep in mind when using this method of synchronization
 
0
February 25, 2010

Introduction to Sync Host

Posted by Peter Kao, Gameplay Programmer
Online functionality has become a fundamental component of our games. This is an introduction to Sync Host, a system developed to address some of the tougher issues we’ve encountered when implementing network synchronization.
 
+4
January 06, 2010

An Algorithm for Lockless Processing of Sound Data

Posted by David Thall, Engine Programmer
An algorithm has been developed to asynchronously load, unload, play and relocate sound files without the use of locks. The algorithm allows the programmer to arbitrarily assign different logical subsections of the underlying system to different threads, all while remaining completely lock-free.
 
An example implementation has been developed that asynchronously loads and unloads sound files in one thread while playing and relocating the same sound files in another thread. All of this happens in a single shared memory heap, all without locking. A complete working version of this system will be described, highlighting the simple yet elegant design principles.
+2
January 06, 2010

Wavelets for the Layman

Posted by Mike Acton, Engine Director
Here's an internal presentation I gave just about a year ago on using wavelets for image compression. A nice and gentle introduction to a subject that's really pretty simple that is often buried under so many details that people don't know how to approach learning it.
+1
October 23, 2009

Color Spaces and You

Posted by Andrew Dickinson, Senior FX TD
A basic introduction to the concept of a colorspace for artists, programmers, and anyone who's just curious about the meaning of terms like "sRGB, "gamma," and "color profile." Talks about where colorspaces come from, what they look like, and why you might need to think about them.
+4
September 21, 2009

Three New Systems of Dynamic Components

Posted by Terrance Cohen, Lead Systems Programmer
The Dynamic Component System is now firmly established in our gameplay architecture. We’ve realized substantial benefits in workflow and performance from implementing gameplay as Dynamic Components. In this presentation, I provide a brief introduction to a few gameplay systems of Dynamic Components.
0
August 15, 2009

Down the concurrency rabbit hole

Posted by Mike Acton, Engine Director
This presentation was an introduction to solving concurrency problems. In particular it was about breaking down previously held beliefs and approaches that in fact make multi-threading and multi-core problem solving seem much more difficult than it actually is. The key is understanding that concurrency is not a code problem, it's a data problem. It always comes back down to the data. I used a doubly-linked list as a reference point in this presentation in order to demonstrate that the "classic" data structures are actually "sequential" data structures and that by looking at concurrency problems from a data design point of view you'll find that those data structures in general, and doubly-linked lists in particular are *not* concurrent data structures. Certainly you can make them work, as many people have, but you're forcing an inherently sequential data design into a concurrent system and you're going to be paying a price in terms of both added complexity and performance loss. These slides are hopefully readable on their own. So enjoy! And remember we're always happy to get feedback.
0
August 14, 2009

GDC09 - SPU Wrangling

Posted by Jonathan Garrett, Senior Engine Programmer
Here's a GDC presentation outlining our SPU job-manager and some tried-and-tested SPU debugging practices - it touches on details such as job-layout, the job-building process and some of the debugging assistance it provides - it concludes by walking through two debugging case-studies from Resistance 2.
0
April 10, 2009

Resistance 2 Water Rendering

Posted by Mike Day, Senior Engine Programmer
Fundamentally, the system simulates waves propagating across a planar surface. Each component wave is a sinusoid, and a detailed height field is generated by summing many such waves. The Fast Fourier Transform (FFT) provides an efficient way to perform the accumulation of the waves. For general background in this technique, see the classic paper by Tessendorf. The paper describes the method used for creating ocean water in the film ‘Titanic’, where the offline generation of images provides the luxury of using very large arrays in the FFT calculations – 2048 × 2048 grid points, with a grid resolution of 3cm. In real-time applications though, it quickly becomes apparent that the size of FFT array must be severely restricted. Consider also the need to render views in which details smaller than 3cm are visible. Our solution is to combine detail at many different scales. We employ only 32 × 32 FFTs, but many such height fields are summed from long wavelengths down to short wavelengths as required by each particular viewpoint. This is not unlike rendering fractal terrain, the major difference being that in the case of water the detail added at each level of detail (LOD) must be animated in a particular way.
 
We decided early on to devote much of the development budget to the modelling of interactive waves – waves emitted by local disturbances such as the impact of a projectile or a character wading through the water. The result is that while we feel the system is still lacking in certain features, we believe it demonstrates a technological first in games: interactive waves with the property of dispersion (a physically-based dependency between wavelength and wave velocity).
 
Full details: water.pdf
Source code (via Nocturnal Initiative): Nocturnal_igR2O_1.0.zip
0
April 10, 2009

GTC09 - How a Physics Solver Works

Posted by Eric Christensen, Principal Engine Programmer
This year I spoke at GTC about how a physics solver works. My intention was to provide a simple run-through of rigid bodies, constraints, and how they are used to pre-condition a solution. The presentation is meant to be comprehensive for everyone, so regardless of math background the information should be relatively easy to grasp. My hope is that this presentation will demystify how core solvers work. As a supplement to the presentation, I’ve provided source code for a simple 3D constraint solver which you can find on our nocturnal site. I encourage you to download the source code and experiment with it. I’d also like to point out that the solver does not have to be used for physics, as illustrated in my presentation.
 
Source code (via Noctural Initiative): Nocturnal_SolverExample_1.0.zip
Presentation: Physics_GTC_2009.pdf
 
0
April 10, 2009

GDC09 - Gameplay Systems on the SPU

Posted by Joe Valenzuela, Engine Programmer
This presentation, given as part of the Insomniac tutorial at GDC 2009, describes some of the systems and conventions used to support gameplay code on the SPUs.
0
April 10, 2009

GDC09 - Resistance 2 Prelighting

Posted by Mark Lee, Senior Engine Programmer
Resistance 2 had a big dynamic lighting overhaul from previous Insomniac titles. Here Mark Lee describes the new lighting and shadowing schemes used and touches on some of the implementation gotchas.
0
April 10, 2009

GDC09 - Insomniac Physics

Posted by Eric Christensen, Principal Engine Programmer
This year I had the opportunity to be part of Insomniac’s all day tutorial for GDC. I gave a presentation on Insomniac’s proprietary physics system, describing the evolution of its pipeline from inception to current day. In this presentation I explain the importance of optimizing a system for the SPU via shaders (isolated code fragments) and library shaders, as well as explaining the importance of good data design for optimal concurrency.
 
0
November 19, 2008

Normal Maps and Cube Maps For Everyone

Posted by Mike Acton, Engine Director
Continuing our For Everyone series. Chester Hsieh (Associate Engine Programmer) had a go at describing how normal and cube maps work to the general studio population. This presentation is for the curious that might want to know a bit more about the math and logic behind these very common texture maps.
0
November 19, 2008

Online Co-op Bot Synchronization

Posted by Mike Acton, Engine Director
Heather Barclay (Multiplayer Lead Programmer) gave an introduction to the programming team on our approach to synchronizing the network data for Resistance 2 Co-op mode. For those of you playing the game now, you know how crazy these games can get, and yet everything stays pretty smooth (for the most part). Enjoy!
0
November 19, 2008

Ratchet & Clank Worldwide Studios Debrief

Posted by Mike Acton, Engine Director
In February, we gave a presentation to the Sony first-party studios on how we progressed during the development of Ratchet and Clank Future: Tools of Technology. The purpose of these presentations is to share with other developers and to give hints to the things that improved our development process. And not just our engine or tools technology, but anything that we think might be helpful. I've culled out a few slides due to NDA, as well as a couple that might have caused forum meltdowns. But otherwise, this should hopefully provide a good inside look into some of our processeses as well as the tech changes we made during RCF.
0
July 02, 2008

Sound Formats for Everyone

Posted by David Thall, Sound Programmer
Driven by hardware design and media requirements, sound file formats are constantly changing. For example, on the PS3 we encode and decode an AD-PCM format for the majority of our sound sources during playback. On computers and portable music devices, most of you will listen to perceptually-coded, lossy, variable bit-rate MP3s. Those who still have CD players will listen to CDDA-encoded, signed 16-bit PCM sampled at 44.1 kHz. And sound designers will work mainly with chunked file formats like WAV and AIFF that give them flexibility in storage, playback and interoperability.
 
In this talk, David will demystify all the jargon. You won't walk away an audiophile, but you'll definitely understand the difference between a VAG, CD-based PCM and an MP3.
 
Who should check this out? Anyone with an interest in learning more about sound file formats.
 
0
July 02, 2008

Dynamic Component System

Posted by Terrance Cohen, Gameplay Lead Systems Programmer
A year ago, we talked about a Gameplay Architecture for Performance, and since then, we’ve been hard at work putting our ideas into practice. The Dynamic Component System is our way of expressing gameplay in bite-size chunks that can be added to and removed from the game as needed. Dynamic components represent aspects of a game object or system’s behavior. They are allocated and deallocated on-demand from efficient memory pools, they’re hierarchical, and client code can query for an object’s component of a given type or a component that implements a given interface. Because of the way components are pooled, it is straightforward to run updates for a given component type in parallel with other component types, and on a different processor such as an SPU.
0
July 02, 2008

Gameplay on SPUs

Posted by Joe Valenzuela, Engine Programmer
What does a next-gen engine look like? Well, for one thing, it needs to take advantage of the multi-CPU architectures. Scalable systems don't just happen, they're designed that way. I was lucky enough to give a presentation at Sony Devcon 2008 about how the Cell (and multi-core architectures generally) need to influence the design of gameplay systems. Designing your gameplay to work with deferred systems is an important step to scaling your performance with future platforms. I hope that this illustrates some of the planning that should go into place to eliminate "gotchas" when writing your next-gen gameplay code.
0
April 26, 2008

Optimization 101

Posted by Mike Acton, Engine Director
Usually in presentations, what we (or anyone else really) usually give to the audience are answers (hopefully!) to some interesting problem. Especially when it comes to optimization techniques. It seemed to me there was a big gap there - certainly a lot of people can simply use the answers to make their code/game/application faster, but how do they get better at solving these problems themselves? Usually the answer is simply, "with experience."
 
But maybe there's more we can do. In this talk, I wanted to focus on the PROCESS that I (personally) went through to optimize a particular function. Including all the little mis-steps and shortcuts I took along the way. The hope was that people would find that the process itself isn't something magical and it's something that can be learnt.
 
Although there are some people *cough*John Edwards*cough* that might tell you make sure you get another opinion on your optimization advice. ;)
0
April 02, 2008

Moby Bounding Spheres

Posted by Jonathan Garrett, Senior Engine Programmer
On R1 we had the PPU dynamically update moby bounding-spheres using collision primitives bound to animated joints. This scheme was reworked for RCF to run asynchronously on the SPU and to also create separate bounding-spheres for rendering related tasks and broad-phase collision.
0
April 02, 2008

MobyAsyncUpdate

Posted by Joe Valenzuela, Engine Programmer
We here at Insomniac don't discriminate - we think all code belongs on the SPU. Engine programmer and good looking S.O.B. Joe Valenzuela describes the AsyncMobyUpdate and Guppy systems, which we use to make sure our gameplay and AI code can use all parts of the Cell.
0
April 02, 2008

AsyncCharacterX

Posted by Mike Acton, Engine Director
One of the things I've been most excited about lately here at Insomniac is our push to move higher-level gameplay code on to the SPUs - and have this be done by our gameplay programmers - AND make sure they get an understanding of the SPUs (and data design) in the process without giving up the "power" of the processor through crazy abstractions.
 
Daniel Gonzalez shared his first experience with re-writing some legacy-style character AI as a set of SPU shaders for our AsyncMobyUpdate system. At the end of it all, the coolest thing was Daniel said that he was suprised when ultimately everything was actually a lot easier!
 
Shaun McCabe (Lead Single-Player Gameplay Programmer on Resistance 2) describes Daniel Gonzalez as "a gameplay programmer working on features for Resistance 2. His previous experience includes S2 Games and Streamline Studios. In his free time, Daniel enjoys searching for metal so dark that light can't escape."
 
Note: You'll notice that we call the character Daniel worked on "Character X (or Y)" throughout these slides. We don't actually call them "Character X" - we just don't want to spoil the surprise in Resistance 2. You'll have to wait and see! :)
0
April 02, 2008

GPU Pipeline for Everyone

Posted by Abdul Bezrati, Associate Engine Programmer
This is article features a simple introduction to how Directx 9.0 graphic-processing units (also referred to as GPUs) operate.
The article will briefly cover polygons, vertices, indices, transforms as well as Fragment and Vertex programs.
 
Correction 10/26/09:
On page 19 of the slides, the node that flips the V component of the UVs should have been declared as float2(1.0, -1.0) instead of float(0.0, -1.0); the node float(0.0, -1.0) nullifies the U component which is not what we are aiming for in the example.
 
 
0
March 24, 2008

SPU Shuffle Mask Helper

Posted by Jonathan Garrett, Senior Engine Programmer
here’s a small snippet of code which calculates shuffle-masks using the
standard convention we use (which came from ICE / SCEE-ATG)
 
here’s the brief details of the naming convention
(see shuffles.pdf for more info):
 
characters A-P specify bytes 0-15
words use A-D
shorts use A-H
bytes use A-P
upper case characters pull from quadword 1
lower case from quadword 2
special characters 0 -> 0x00, X or x -> 0xff, 8 -> 0x80
 
the 16-byte mask is printed to the tty and also copied to the windows clipboard
 
 
0
March 14, 2008

Debugging Tips

Posted by Jonathan Garrett, Senior Engine Programmer
Methods of debugging aren't usually discussed that much - people have their own tried and tested methods - in this presentation we tried to share some of the strategies used by members of the tech team.
+1
March 14, 2008

Faster SPU Clamp

Posted by Mike Day, Senior Engine Programmer and Jonathan Garrett, Senior Engine Programmer
Mike Day wrote:
 
I just had a thought…
 
Using the ‘standard’ approach it takes 4 spu instructions to clamp a floating point value (or vector of 4 values) to the range [0.0f, 1.0f]… compare against zero, select x or zero, compare against 1, select x or 1.
 
But if you don’t mind a bit of added latency (i.e. you’re inside a pipelined loop), it should be possible in 2 instructions – convert f32 to u32, with a scale of 2^32, then convert back to float using the same scale. This takes advantage of the clamping to [0,2^32-1] that the f32->u32 conversion does. (It also doesn’t require setting up a vector of zeros and a vector of 1’s).
 
I expect you wouldn’t get exactly 1 at the top of the range – it would probably be rounded down to the next float below 1.0, which is 0x3F7FFFFF.
 
Certain other clamp ranges would be possible too… [0, 2^n] using a different scale, and [–2^n, 2^n] using signed conversion.
 
Mike
 
Jonathan Garrett wrote:
 
that's great
 
inline qword ClampZeroToOne(qword q_)
{
  qword a = si_cfltu(q_, 32);
  qword b = si_cuflt(a, 32);
  return b;
}
 
inline qword ClampZeroToTwo(qword q_)
{
  qword a = si_cfltu(q_, 31);
  qword b = si_cuflt(a, 31);
  return b;
}
 
inline qword ClampZeroToFour(qword q_)
{
  qword a = si_cfltu(q_, 30);
  qword b = si_cuflt(a, 30);
  return b;
}
 
inline qword ClampMinusOneToOne(qword q_)
{
 qword a = si_cflts(q_, 31);
 qword b = si_csflt(a, 31);
 return b;
}
 
and yes, we lose a bit at the top of the range
 
Jonny
 
 
0
March 14, 2008

Dyadic Interpolation

Posted by Mike Day, Senior Engine Programmer
Given a set of equally-spaced samples of some function, we’d like to be able to make educated guesses for the values which the function would take on at points located half-way between the samples. If we had a scheme for doing this, we could then apply it recursively to generate the values at the 1/4 and 3/4 points; then the 1/8, 3/8, 5/8 and 7/8 points, and so on. Points of the form n/2m are called dyadic points, so an interpolation process of this type is called a dyadic interpolation scheme.
0
March 14, 2008

Progressive Mesh

Posted by Mike Acton, Engine Director
For a while during the development of Ratchet and Clank Future: Tools of Destruction, we were using a progressive mesh system for our LODs. We've since abandoned this approach for various reasons which I'm not going to get in to yet, but Reddy Sambavaram walked us through the how the systems worked and what benefits it had. Read on!
 
0
February 26, 2008

GDC 2008 - Insomniac SPU Programming

Posted by Mike Acton, Engine Director
This year, Eric Christensen and I were planning on speaking together at GDC on our SPU programming practices. But we were told that they prefer presentations with only one speaker, since they tend to be rated higher. So I went it alone. But certainly not without Eric's help in the background in developing the slides and the talk itself.
 
I thought, all told, that the crowd seemed to respond to our ideas positively. But we do seem to be the odd men out when it comes to managing memory and resources (manually). I am a huge proponent of not automatically managing too many things (e.g. memory via malloc or time via a scheduler) for a console game. I believe very strongly that it forces the team to keep things simple.
 
If you did attend the GDC SPU talk, I'd be very interested to hear your feedback. If you didn't, then I hope these slides will have enough detail to give you the main ideas.
0
February 05, 2008

Prelighting

Posted by Mike Acton, Engine Director
During the development of Resistance 2, one of the things we decided we wanted to do was improve our lighting model toward more of a focus on runtime light performance. With deferred rendering all the rage, we certainly considered it. But ultimately we thought it both too large and too risky for our development process. So we went with a kind of semi-deferred model that allowed us to get much of the benefit of fully deferred rendering without having to completely re-write our pipeline. Mark Lee (Master Engine Programmer) explained the approach during a presentation to the programmers.
 
0
February 05, 2008

Remez Exchange Algorithm

Posted by Mike Day, Senior Engine Programmer
I was just telling Mark about some stuff I’ve been doing at home lately, and I thought I’d share. I’ve been trying my hand at the so-called Remez Exchange Algorithm, which is an iterative technique for finding coefficients of minimax polynomials (see example below). The results are promising… I’ve already got it to a state where we could actually use it, and once I’ve done a bit more work on it, maybe during post, I’ll put it up somewhere and write a wiki page on it.
 
For example…
 
Take a familiar case – the sine function. When you call sin() in a floating-point library (or the single-precision sinf(), where sin() is double-precision), or indeed on your calculator, you can more or less guarantee it’s doing something along these lines… first, range-reduction to map all argument values into the range 0 to 90 degrees (or -90 to 90), then, computing a 9th degree polynomial that looks like this:
 
c1*x + c3*x^3 + c5*x^5 + c7*x^7 + c9*x^9
 
The coefficients c1-c9 will have been specially chosen to ensure that the polynomial gives the best result over the given range – where ‘best’ means minimizing the worst-case error, either absolute or relative. Incidentally the reason for choosing degree 9 is that it’s accurate enough to give the nearest floating-point number to the true answer. Degree 7 isn’t enough, and degree 11 or more would be wasted effort in a single-precision answer. But the question is, where do you get the coefficients from? I’ve always had to look them up in a book. But where did the book get them from? And what would you do if the function you wanted wasn’t in the book? Or if you wanted a different degree of polynomial, perhaps for a coarser but faster approximation?
 
The answer seems to be the Remez Exchange Algorithm, till now I’ve seen as a bit of a black art, and I now realize why – when I tried to track down information that would allow me to implement it, everything seemed to be either much to advanced for me to understand, or too vague and simplistic to permit writing any code. Just about the only middle-ground resource I found was the wikipedia entry for a broader topic – approximation theory (http://en.wikipedia.org/wiki/Approximation_theorya>)
 
In the case of the above example, here’s what my code produced for the function sin(x*(pi/2)) over the interval -1.0 <= x <= +1.0 (corresponding to -90 to +90 degrees):
 
c1 = 1.5707963184477
c3 = -0.645963710599868
c5 = 0.0796896789479803
c7 = -0.0046737666126792
c9 = 0.000151485130863276
max relative error = 5.31399e-009
 
 
And for comparison, here are the values in the classic Cecil Hastings’ “Approximation for Digital Computers” (first printed over half a century ago!):
 
c1 = 1.57079631847
c3 = -0.64596371106
c5 = 0.07968967928
c7 = -0.00467376557
c9 = 0.00015148419
 
They agree pretty closely. I don’t yet know the main source of the differences; the wikipedia entry does point out that you need a much higher precision in computing the coefficients than the target precision, so it could just be that. (And now I think about it, I don’t actually know for sure that the figures in the book are closer to the correct answer than are mine.) However, it doesn’t really matter because the best way to judge the goodness of the results is to look at the error curve. The attached “my sin.jpg” shows the sin function in green, and the error curve – enormously magnified – in red. The fact that the tops and bottoms of the error curve all line up horizontally shows that this is the desired minimax polynomial, or near as dammit. Any other polynomial would have an error curve with a bigger peak somewhere.
 
Now compare that red error curve to the error curve shown in the Hastings book, which I scanned as “textbook sin.jpg”. Again, pretty close!
 
Incidentally, many of us faced with the problem of computing a function like sine would be tempted to use a Taylor series expansion about zero – I know I would have, before I knew about minimax polynomials – and this would be a disastrously bad thing to do. For comparison between minimax and Taylor, see the attached “taylor_sin.jpg”, where the yellow curve shows the relative error for the Taylor series expansion about 0, to the same number of terms (i.e. degree 9). You can see that towards the right-hand side of the interval, it zooms up to some massive value compared to the red minimax error curve. It’s actually 3 orders of magnitude bigger at its highest point. So, don’t ever use a Taylor series again.
 
What can we use Remez for? Well, among other things, we can overhaul and expand our spu and ppu maths libraries. It will help us find implementations for the functions we don’t have and for which we must currently fall back on slow library routines. I think it can also improve some of the implementations we do have, where we may have acquired code from somewhere without really understanding how it works or how it can be improved. But it can also be used when you want the fastest speed but you’re not so concerned about accuracy – say you wanted an exp() function that only computed up to degree 4 or something. Indeed, minimax polynomials can be combined with floating point abuse for doing range-reduction in certain functions (e.g. log2, log10, ln, exp, sqrt, 1/sqrt, pow) to get some really fast implementations which I definitely plan on using.
 
Mike
 
0
February 05, 2008

Morphing

Posted by Jonathan Garrett, Senior Engine Programmer
For Resistance 2 we added vertex-morphing support to allow more expressive facial animation. This was augmented with a system for compositing textures on the SPU to support features such as animator-driven dynamic-wrinkles.
0
January 30, 2008

Win32 Profile Tool

Posted by Rob Wyatt, Engine Programmer
I dug out the old win32 profiler code, it needs a little clean up but it’s all functional. Samples are 32 bytes each and each function call takes one sample, there is a buffer of samples in memory and this buffer is full it will stop recording. There is a dump function which will dump the samples to disk so they can be processed by an external tool. Its nowhere near as simple as you think to dump from the hook function when you detect the buffer being full, you cannot call any C function because there is no stack frame. Your options are to create a fake stack frame which you would only do if you detect you need to write the samples, then you could call into C. The other method is to issue the syscall instructions that correspond to createfile/writefile/closefile, this can be done straight from the asm and syscall doesn’t need a stack frame.
 
Also attached is the tool to decode the results, its fairly simple but it can sort functions by min time, max time, avg time, total time, number of timed called etc It can also display the entire call graph of the application. The tool can easily compute total time of a function (cycles from start to finish) and the local time of a function (time excluding all the called functions). It uses dbghelp to cross reference the function entry instruction pointer (the function being called) and the stack return address (the function who called you). There is a lot this profiler could do if some effort was put into it.
 
You can also use this tool to see if a function is being inlined, an inline function will never show up the profiler because its never called!! You can compare the number of calls in debug and release to see what percentage of the calls have been inlined.
 
0
January 30, 2008

Win32 NTFS Journal Dump

Posted by Rob Wyatt, Engine Programmer
Below is the code for a very simple tool that will dump the change journal, in its entirety, directly from the NTFS device driver. The journal normally holds a couple of weeks of disk history. The code below dumps everything and tries to decode the reason code indicating why the journal entry was created, it also displays when the change was made. The tool generates a lot of data but it is possible to pass filters to the driver.
 
Another thing to consider is it possible to mark kernel handles with user data and when those kernel handles are file handles that user data gets written to the journal with the journal events and that user data can be searched. This could be used to silently log modifications to certain files and its all done for you without any external data files and it cannot be bypassed.
 
The journal can be configured to perform full journaling across the file system including recording the actual changes made to a file, making it possible to roll back file changes without actually having a copy of the file and to undo any type of deleted file. This is not enabled by default as it makes things ridiculously slow and I don’t think there may any system tools to enable it the non-server versions of windows. It can be enabled directly at the driver level in any version of windows.
 
There is so much that can be done at this level to optimize disk access patterns that an entire book could be written on it. Search indexing and the now defunct vista file system work at this level.
 
In general be very careful with handles to partitions/drives as you can very easily mess them up. WriteFile() can be called on any kernel handle, most of the time it does nothing other than when the handle is a file in which case it writes data to the file. If the handle is a driver the data write entry point within the driver is called and for disks and partitions this will write sectors and by default it starts at sector 0. If you open device ‘hdd0’ (the physical boot drive) and write 512 bytes to it you’ll destroy the master boot record on the HDD. You have to be an administrator to do it but that’s the way we all run. You have been warned…
 
0
January 30, 2008

Ragdolls and IK

Posted by Eric Christensen, Principal Engine Programmer
The purpose of this presentation is to briefly outline the igPhysics system and to aid in understanding about Ragdoll and IK. As a primer, it describes some of the most common joint types used in the system as well as the basic fundamentals of how they work.
+1
January 30, 2008

More on SPU Shaders

Posted by Mike Acton, Engine Director
SCEA had a little get-together with a small group of SPU programmers recently. It was a chance for us to share the kinds of techniques we use and discuss our approaches with developers with similar levels of experience. It was definitely pretty cool. I presented our "SPU Shader" approach which we've posted about before. But you might find a few new details here. I also had a chance to incorporate some thoughts based on the comments I received from you guys - so, thanks for your thoughts - they're always appreciated!
0
January 23, 2008

Curiously Small Code

Posted by Mike Acton, Engine Director
Here’s an email thread Mike Day and I had on the tech (engine) mailing list a while back trying to find a really quick way to do a texture swizzle on the SPU. There’s a few techniques in there that might interest the SPU coders out there. It all started with this innocent little comment by Mike Day: “I tried swizzling my test texture and mipmaps and was surprised how little code was needed to do it”, and went sideways from there.
 
0
January 23, 2008

Introduction to B-Trees

Posted by Dave Schuyler, Multiplayer Principal Programmer
The b-tree data structure is container system somewhat like a binary tree (although the “b” stands for balanced and the tree is not binary). The API to a b-tree is usually similar to other key-to-data systems such as a hash table, dictionary, or associative array. B-trees are often used for file system indexing and database indexing. The performance gains for a b-tree rely on having a block based memory system (such as disk sectors or cache lines) wherein the cost to get another block far outweighs the cost of doing multiple tests on data in the currently accessible block.
0
January 23, 2008

Insomniac and SCEA Talk SPU Physics

Posted by Eric Christensen, Principal Programmer
Recently I had the pleasure of working closely with SCEA’s Erwin Coumans, author of the Bullet open-source physics engine. Our plan was to create a presentation for Sony’s SPU seminar in which we compared and contrasted the run-time pipelines of our respective physics systems, as well as the SPU programming techniques used for developing them. I’m delighted to say that the presentation was well received and showed the difference in our design processes as they corresponded to our specific needs. One being a system designed for a specific platform (PS3) and the other designed for multiple platforms. The entire experience was wonderful and I look forward to interacting with other developers on this level in the future. In our ongoing effort to share information with the community, Insomniac is happy to make this presentation public. Enjoy.
0
January 23, 2008

Resistance: Fall of Man Debriefing

Posted by Mike Acton, Engine Director
What's in this presentation?
Everything you wanted to know about Resistance: Fall of Man, but were afraid we wouldn't tell you.
 
Mark Lee (Master Engine Programmer) gave a talk to some Sony first-party studios last year on Resistance: Fall of Man. Beginning with how Insomniac made the leap into the PS3, Mark gave a very thorough overview on RFOM. Included are details on terminology, memory layout, shader trade-offs and a whole lot more.
 
These details are a little out-of-date (a whole lot has changed with the engine and tools since the release of RFOM!), but I hope it will still be interesting to our friends and peers that read this site.
0
November 20, 2007

Tier System

Giac Veltri (Tools Programmer), among other things, wrote the loading system for RCF. The loading system is where all the edge cases for the systems tend to collect - dealing with initialization, memory layouts, movie players, TRC issues, etc. - and it had to be running while the rest of the system was still active so we wouldn't have to stop the game to load a level. What Giac came up with in the end was a fairly simple solution that worked. Enjoy.
0
November 20, 2007

Texture Streaming

Posted by Mike Acton, Engine Director
One of the big changes we made in going from Resistance to Ratchet and Clank Future: Tools of Destruction was the adding our texture streaming system. Through a little Al Hastings's magic, we managed to increase our texture budgets significantly without introducing any noticable artifacts (You'd be really hard pressed to find any texture pop-in at all on RCF). Al's system is not just a good example of a simple system that works, it's a good example of a system that works precisely because it is simple.
0
November 20, 2007

Occlusion Systems

Posted by Mike Acton, Engine Director
In a Tech presentation, Al Hastings (CTO) shared the details of both the static and dynamic occlusion systems that we used for both Resistance: Fall of Man and Ratchet and Clank Future: Tools of Destruction.
 
Our static occlusion system is basically a straightforward grid-based PVS which has the advantage of being:
 
Very fast to query at run-time - almost free
Relatively low memory footprint
Doesn't impose any restrictions on level construction techniques
  
But this method also comes with some disadvantages:
 
Labor intensive and time intensive process
Hard to keep it up-to-date during production
Can't handle moving geometry, deforming geometry, destructible geometry
Suffers from some over-inclusion (large ties are the biggest problem, compression error is a secondary problem)
Monolithic database (hard to stream, etc) 
 
Our dynamic occlusion system is another straightforward system. We pre-render the bounding volumes on the GPU to determine visibility. The advantages:
 
Relatively lightweight system (rendering the bounding volumes usually takes a fraction of a millisecond)
Works with any level design
Typically it manages to cull that vast majority of objects that should be culled.
  
And the disadvantages:
 
Basing occlusion on bounding volume visibility works poorly for some objects (long, thin mobys in particular)
One frame lag issues
Visibility spikes can occur after camera cuts and mode transitions
  
Check out Al's slides to find out more about how they're used and implemented in practice.
0
October 28, 2007

Virtual Memory System on Nintendo Gamecube

Posted by Mike Acton, Engine Director
Here at Insomniac we’re dedicated to making the best games we can for Sony’s Playstation 3 – that’s all we do and that’s all we’re concentrating on. But at the same time, that’s not to say our guys haven’t worked with other platforms. Among our tech team we have experience on a very wide range of consoles. For example, Joe Valenzuela (Engine Programmer) has previous experience working on Nintendo’s Gamecube. The Gamecube is a PowerPC-based console, so shares a lot of common ground with the Cell-PPU in the Playstation 3. Joe shared with our programmers a summary of how one might implement a virtual memory system based on his experience on the Gamecube. For some of our programmers it was a fun trip into the details of a machine they hadn’t had a chance to play with.
0
October 28, 2007

Networks and Networking

Eric Ellis, our Multiplayer Lead Programmer, recently did an internal presentation on the basics of networking. Here Eric overviewed fundamental network concepts and summarized the techniques used in Insomniac’s previous networked games. While a lot of the detail that Eric went over isn’t in these slides, they’re a good introduction and a place to start for anyone interested in understanding the basic issues and figuring out what questions to ask in the first place.
0
October 28, 2007

Sound Occlusion and Diffraction

Posted by Mike Acton, Engine Director
Providing dynamic sound occlusion and diffraction effects enhances listener immersion in 3D space. In his presentation, David Thall (Sound Programmer) introduces a system for calculating and processing these effects in our game engine. Topics such as input data reliability, calculation time, audio DSP filter coefficient selection, and overall performance are covered.
0
October 28, 2007

Custom Shaders and Effects

Posted by Mike Acton, Engine Director
From another weekly presentation – Rob Wyatt reviewed with the programmers some of the different options for applying custom graphics shaders within the engine. Rob gave some really detailed implementation examples for anisotropic, cloaking and fur that were added during the development of Ratchet and Clank Future: Tools of Destruction (AKA: RCF).
 
 
 
0
October 28, 2007

Shadows in Ratchet & Clank Future: Tools of Destruction

Posted by Mike Acton, Engine Director
A couple of months ago Mark Lee presented a good breakdown of our dynamic shadow system for one of our internal Tech Presentations. Shadows are sometimes frustrating, often a pain, and always have more edge-cases than you anticipate. Mark spoke in detail about the tradeoffs we made and why those were best for our situation, and what kinds of improvements we could make. Don’t expect the attached slides to give all the implementation details, but they should give you a nice, if not rough, idea of what we’re doing and why.
 
 
 
 
0
September 12, 2007

SPU Shaders Introduction

Posted by Mike Acton, Engine Director
In a previous post (Dynamic SPU Code), I hinted at something we’re calling “SPU Shaders”. Now that we’re fairly comfortable, as an engine team, with building systems for the SPUs we’ve been working on how to step things up for the next set of games. The key issues we’ve struggled with are:
 
Reducing the synchronization points between the PPU and the SPUs
Encouraging use of the SPUs by more systems
Keeping the SPU code and data design straightforward and fast
Moving toward our goal of getting all code running as asynchronously as possible. 
 
Not only do I think SPU Shaders are going to help us accomplish these goals, but they will also make our systems more flexible. Out Gameplay programmers here are all about customizability and we definitely want to make sure that they have good solutions for adding those completely case-by-case touches that make Insomniac games what they are.
 
But still, it’s a very simple idea. Which is really the whole point.
 
Our first major system to make the transition to this new design style was igPhysics. In this introduction I’ll cover the basics of SPU Shaders and a little about how igPhysics was redesigned.
0
September 12, 2007

SPU Instruction Cheat Sheet

Posted by Mike Acton, Engine Director
Here’s a handy three page cheat sheet some of our guys use for quick reference when writing SPU code.
 
Text version: spu_ops.txt
0
September 12, 2007

Supporting Cylinder Collision

Posted by Mike Acton, Engine Director
Reddy Sambavaram did some work recently to add cylinder support to our collision system. He had to work out a couple of good, fast methods for testing cylinders versus every other type of collision primitive we already have. He shared his experience and methods with our programmers in a recent Tech Team Presentation. Attached are his slides which includes details on the actual collision algorithms, how they work and why they’re unique. Many of them are hand drawn – We’ve been pretty busy recently with getting Ratchet and Clank: Tools of Destruction completed – but there’s certainly enough detail in there for you programmers out there to make this work.
0
September 04, 2007

Autodesk Mental Ray

Posted by Mike Acton, Engine Director
As part of our static lighting pipeline here at Insomniac, we use Autodesk's mental ray. We often get questions about how we process this data and how we encorporate it into our tools.
 
Here's a quick summary for those of you who are interested from Dustin Reagan (Tools Programmer) who's done much of the work with our mental ray pipeline.
 
1. Wrote our own Content (insomniac tool format) to Mental Ray scene (.mi file) exporter.
This is not a full-blown Mental Ray scene exporter (doesn't support all or even most .mi objects)
It’s many times (at least an order of magnitude) faster than exporting the same scene from Maya.
 
2. We opportunistically create an .mi file for each art asset that is expected to go through the lighting pipeline.
Uses our dependency system to only generate this data when necessary.
 
3. Users create persistent "lighting jobs" in our level editor. Lighting jobs consist of:
Instances to create lighting data for.
Instances to cast shadows.
Lights to use.
Global Mental Ray render options.
 
4. Rendering a Lighting job causes a lightweight Content Scene --> Mental Ray scene export to occur.
All the heavy-weight Mental Ray data is referenced into this scene file.
All referenced files are done so with network paths to the users machine (for remote rendering, see below)
Only per-instance data needs to be exported ( transforms & per-instance lighting parameters ).
 
5. Remote rendering is accomplished via irush ( a renderfarm job scheduler ) on our farm of blade machines.
Once a render job is picked up by a blade machine the following occurs:
If necessary, the user’s tools release is copied onto the blade machine.
The exported scene (.mi file ) is copied up to the blade machine and rendered with Mental Ray.
All referenced files (.mi object files, textures, etc ) are referenced with network paths to the user’s machine.
This allows us to avoid sync issues, or copying large amounts of data between the user’s machine and the render machine.
The output lighting data is condensed into a single file that is unique among Lighting jobs and copied to a shared network location.
 
6. The Lighting job render system is set up to allow rendering to occur outside of our level editor. This will allow stuff like:
An automated render process that is run over every production level, loading up all Lighting jobs, determining "Out-of-date" jobs, and re-rendering these jobs or notifying the lighters about out-of-date jobs.
0
August 15, 2007

SPU Ninja Homework #3

Posted by Mike Acton, Engine Director
 
Here at Insomniac we’ve started an SPU training class which we call SPU Ninja Training. It’s an optional class open to our programmers. The main idea is that we introduce some problem to solve on the SPU and programmers will spend the next week or two working out a solution in their spare time. Each class we try to cover the details of different solutions and offer suggestions for better implementations.
 
For homework #3, the problem was to take this production code, originally written by Mike Day (engine programmer) about 4 years ago and optimize it for the SPU. The source: distance_la_original.c
 
Why don’t you try out the problem yourself?
 
Mark Lee, one of our engine programmers who has specialized in graphics work, turned in this result: distance_la_mlee.c. The difference between the original and Mark’s version was pretty impressive:
 
(Based on running 10000 iterations using identical randomly generated for each case):
Original: total time = 21464464, average per call = 2146 cycles
New:      total time =  3884055, average per call =  388 cycles
 
You might also want to check out Reddy Sambavaram’s version: distance_la_rsambavaram.cpp
 
+2
August 02, 2007

Configuring VNC on PS3

Posted by Mike Acton, Engine Director
André Leiradella, who has started doing some consulting work for us from Brazil has written a document describing how he set up VNC with his PS3/Linux box. We've been using PS3/Linux to do development here and there when we're working on something that can be built and tested in isolation from the rest of the system. We've found that to a good, fast way to iterate on some of our code.
0
August 02, 2007

Why rotqmbybi is Broken

Posted by Mike Day, Senior Engine Programmer
I feel that the operation carried out by the instruction rotqmbybi is broken. The instruction executes as documented, but I don’t believe it achieves what its designers originally planned. (Or, if it was designed that way, I contend that it was a bad design.)
 
In order to explain it’s necessary to first give a bit of background information. There are left shifts (and rotates), and there are right shifts (and rotates). On the spu, a right shift is treated as a left shift by a negative shift amount. This is an unfortunate convention for various reasons. It replaces mnemonics familiar to most assembly programmers, such as variants on ‘shr’ for shift-right, with unwieldy and not especially memorable abbreviations based on the ambiguous-sounding phrase ‘rotate and mask’. With it implied that a rotate by a positive amount is a rotate-left, then a rotate-and-mask by a negative count describes a right-shift, filling with zeros at the left (the filling with zeros corresponding to the ‘and-mask’ part of the name).
 
A more serious problem comes in the form of increased susceptibility to bugs. For example, to shift each word of a quadword by one bit to the right, one would use the immediate instruction rotmi (for rotate-and-mask-word-immediate), supplying an immediate shift amount of –1. The likelihood of forgetting the minus sign is very high, since programmers tend to think of the operation as a right-shift by +1, and this leads to many bugs. (At this point it’s also worth mentioning that the instruction rotma, which propogates the sign bit, is described in the manual as an ‘algebraic’ rotate-and-mask. I think this is inconsistent with the nomenclature used for the majority of architectures, where the letter ‘a’ in the name usually refers to an ‘arithmetic’ shift, contrasting it with the default kind, a ‘logical’ shift.)
 
It is also unfortunate that when using the variable-shift form of right-shifts, an extra instruction must be inserted just to negate the shift amount held in the register. While multi-instruction operations are often a sacrifice one must make when trading off against other properties, such as orthogonality of instruction set, or quantities like instruction word size, the point of contention here is inconsistency: an immediate shift-left in one instruction, a variable shift-left in one instruction, an immediate shift-right in one instruction, yet a variable shift-right in two instructions.
 
Shifts and rotates by any amount are possible within 32-bit words (or 16-bit halfwords) inside a quadword. When it comes to shifts & rotates of an entire quadword by an arbitrary amount, no single instruction can do it – instead these must be broken down into 2 operations: one which moves whole bytes around, and one which moves by a sub-byte number of bits to complete the shift. To illustrate, consider using si_ intrisics to left-shift a quadword q by an amount stored in s:
 
q = si_shlqbybi(q, s);
q = si_shlqbi(q, s);
 
If in a particular case s has the value 43, the first instruction corresponds to shifting left by 43 div 8 = 5 full bytes, and the second by 43 rem 8 = 3 extra bits. When these two instructions are combined, any left-shift amount is possible on the full 128-bit register, and this seems like a reasonable compromise. (The two instructions can be performed in either order; the net effect is the same.)
 
Corresponding to this pair are the instructions rotqmbybi / rotqmbi. Together these perform a right-shift of a quadword by any amount, and they maintain the convention of regarding the operation as a left-shift by a negative count. So, we might expect that if s held the value –43, the sequence
 
q = si_rotqmbybi(q, s);
q = si_rotqmbi(q, s);
 
ought to shift q to the right by 43 bits. In fact it will shift q to the right by 51 bits. Because of the way the rotqmbybi determines its byte-shift count, the first instruction here shifts not by 5 bytes but by 6. This is the respect in which I believe rotqmbybi to be ‘broken’ as alluded to at the top.
 
The specs for rotqmbybi show that the byte count is determined as (0 minus bits 24 to 28 of RB) modulo 32. It seems to me that the instruction would have worked more sensibly had the expression been (bits 24 to 28 of (0 minus RB)) modulo 32.
 
It is possible to patch up the value in s to account for this; however we must now supply two different values of s to the two component instructions. The net result is that where we would usually add in the extra instruction needed to negate the shift amount in preparation for a rotate-and-mask, we must now supply TWO extra instructions – the usual negation of the shift-amount for rotqmbi, but now also SEVEN minus the shift amount for rotqmbybi. Where variable rotate-and-mask operations are at best unwieldy and counterintuitive, in this case specific knowledge of this non-standard behaviour is also required.
 
0
August 02, 2007

A Hair-Tearing Out Bug

Posted by Mike Acton, Engine Director
A couple of weeks ago Mike Day (one of our engine programmers) sent this email out to the group. He collected up a particularly painful moment and turned it into a little quiz that everyone could enjoy. Check out the question and the answer. (Take the test yourself. Don’t cheat - It’s a good one!)
 
 
0
August 02, 2007

Beware of Statics

Posted by Carl Glave, Engine Programmer
While working on an optimization for the FXVis system, I managed to make some code slower just by rearranging some statics it was using. Not one to take something like this lying down, I looked into it, and learned something that seems worth sharing.
  
The whole issue stemmed around the use of some static values that existed at the top of the function. Here’s the original function. Note that it has static in various vector formats at the beginning.
 
qword SimdFloatToHalf2(vf32 a_, vf32 b_)
{
static vu8 shuf_mid16 =
{ 0x01, 0x02, 0x05, 0x06, 0x09, 0x0a, 0x0d, 0x0e, 0x11, 0x12, 0x15, 0x16, 0x19, 0x1a, 0x1d, 0x1e };
static vu8 shuf_hi16 =
{ 0x00, 0x01, 0x04, 0x05, 0x08, 0x09, 0x0c, 0x0d, 0x10, 0x11, 0x14, 0x15, 0x18, 0x19, 0x1c, 0x1d };
 
static vu32 m_mask = (vu32)si_ilh(0x03ff); //{e4}
static vu32 s_mask = { 0x80008000, 0x80008000, 0x80008000, 0x80008000 };
static vu32 e_mask = { 0x7c007c00, 0x7c007c00, 0x7c007c00, 0x7c007c00 };static vu32 exp_off = { 0x38003800, 0x38003800, 0x38003800, 0x38003800 };
 
// 0 1 2 3
// seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
 
qword m = si_shufb((qword)a_, (qword)b_, (qword)shuf_mid16); //{o4} m in hword | emmmmmmm mmmmmmmm |
m = si_rothmi(m, -5); //{e4} | 00000emm mmmmmmmm |
m = si_and(m, (qword)m_mask); //{e2} | 000000mm mmmmmmmm |
 
qword se = si_shufb((qword)a_, (qword)b_, (qword)shuf_hi16); //{o4} se in hword | seeeeeee emmmmmmm |
 
qword s = si_and(se, (qword)s_mask); //{e2} mask | s0000000 00000000 |
 
qword e = si_andc(se, (qword)s_mask); //{e2} mask | 0eeeeeee emmmmmmm |
 
qword d_mask = si_cgth((qword)exp_off, e); //{e2} (e <= 112 ?) (if e <= 0x3800, then exp_off would make <= 0)
 
e = si_sfh((qword)exp_off, e); //{e2} correct exponent (e = ((e - 127) + 15))
e = si_shlhi(e, 3); //{e4} e <<= 3 | eeeeeemm mmmmm000 |
e = si_and(e, (qword)e_mask); //{e4} mask | 0eeeee00 00000000 |
 
qword h = si_or(s, m); //{e2} s | m
h = si_or(h, e); //{e2} s | e | m
h = si_andc(h, d_mask); //{e2} set to 0 if underflow
 
// e11, o2
// load/store odd
// shuffle odd
 
return h;
}
 
 
Here’s how this function disassembled:
 
00000000 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_>:
0: 30 80 00 02 lqa $2,0
4: 30 80 00 05 lqa $5,0
8: 30 80 00 06 lqa $6,0
c: 30 80 00 07 lqa $7,0
10: 3e 80 00 89 cbd $9,0($1)
14: b0 a1 01 85 shufb $5,$3,$4,$5
18: b0 61 01 82 shufb $3,$3,$4,$2
1c: 30 80 00 04 lqa $4,0
20: 58 21 82 82 andc $2,$5,$6
24: 0f be c1 88 rothmi $8,$3,-5
28: 09 00 83 83 sfh $3,$7,$2
2c: 00 20 00 00 lnop
30: 49 00 83 87 cgth $7,$7,$2
34: 30 80 00 02 lqa $2,0
38: 0f e0 c1 83 shlhi $3,$3,3
3c: 3f 83 42 04 rotqbyi $4,$4,13
40: 18 21 82 86 and $6,$5,$6
44: 7e 00 02 04 ceqbi $4,$4,0
48: 18 20 c1 05 and $5,$2,$3
4c: 56 c0 02 04 xsbh $4,$4
50: 40 20 00 7f nop $127
54: 22 00 03 84 brhz $4,70 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x70> # 70
58: 41 81 ff 82 ilh $2,1023 # 3ff
5c: 30 80 00 03 lqa $3,0
60: 20 80 00 02 stqa $2,0
64: 40 80 00 82 il $2,1
68: b0 60 c1 09 shufb $3,$2,$3,$9
6c: 20 80 00 03 stqa $3,0
70: 30 80 00 03 lqa $3,0
74: 18 22 01 83 and $3,$3,$8
78: 08 20 c3 03 or $3,$6,$3
7c: 08 21 41 83 or $3,$3,$5
80: 58 21 c1 83 andc $3,$3,$7
84: 35 00 00 00 bi $0
  
In trying to get this code cleaner and more consistent, I switched the statics to qwords (to match the rest of the code in the function) like so:
 
static qword shuf_mid16 = (qword)(vu8){ 0x01, 0x02, 0x05, 0x06, 0x09, 0x0a, 0x0d, 0x0e, 0x11, 0x12, 0x15, 0x16, 0x19, 0x1a, 0x1d, 0x1e };
static qword shuf_hi16 = (qword)(vu8){ 0x00, 0x01, 0x04, 0x05, 0x08, 0x09, 0x0c, 0x0d, 0x10, 0x11, 0x14, 0x15, 0x18, 0x19, 0x1c, 0x1d };
static qword s_mask = (qword)(vu32){ 0x80008000, 0x80008000, 0x80008000, 0x80008000 };
static qword e_mask = (qword)(vu32){ 0x7c007c00, 0x7c007c00, 0x7c007c00, 0x7c007c00 }; //@@@ calc this in code!
static qword exp_off = (qword)(vu32){ 0x38003800, 0x38003800, 0x38003800, 0x38003800 }; //@@@ calc this in code!
static qword m_mask = si_ilh(0x03ff); //{e4}
 
This had… undesirable effects on the code generated. To show just how much the compiler dislikes this syntax, here’s the disassemble from the same function with only the above changes made to it (all lines of actual code were untouched, only these statics declarations were reformatted):
 
00000000 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_>:
0: 04 00 01 87 ori $7,$3,0
4: 30 80 00 02 lqa $2,0
8: 04 00 02 05 ori $5,$4,0
c: 3f 83 41 02 rotqbyi $2,$2,13
10: 7e 00 01 02 ceqbi $2,$2,0
14: 56 c0 01 02 xsbh $2,$2
18: 22 00 04 02 brhz $2,38 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x38> # 38
1c: 30 80 00 02 lqa $2,0
20: 30 80 00 03 lqa $3,0
24: 3e 80 00 84 cbd $4,0($1)
28: 20 80 00 02 stqa $2,0
2c: 40 80 00 82 il $2,1
30: b0 60 c1 04 shufb $3,$2,$3,$4
34: 20 80 00 03 stqa $3,0
38: 30 80 00 02 lqa $2,0
3c: 3f 83 41 02 rotqbyi $2,$2,13
40: 7e 00 01 02 ceqbi $2,$2,0
44: 56 c0 01 02 xsbh $2,$2
48: 22 00 04 02 brhz $2,68 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x68> # 68
4c: 30 80 00 02 lqa $2,0
50: 30 80 00 03 lqa $3,0
54: 3e 80 00 84 cbd $4,0($1)
58: 20 80 00 02 stqa $2,0
5c: 40 80 00 82 il $2,1
60: b0 60 c1 04 shufb $3,$2,$3,$4
64: 20 80 00 03 stqa $3,0
68: 30 80 00 02 lqa $2,0
6c: 3f 83 41 02 rotqbyi $2,$2,13
70: 7e 00 01 02 ceqbi $2,$2,0
74: 56 c0 01 02 xsbh $2,$2
78: 23 00 1e 02 brhnz $2,168 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x168> # 168
7c: 30 80 00 02 lqa $2,0
80: 3f 83 41 02 rotqbyi $2,$2,13
84: 7e 00 01 02 ceqbi $2,$2,0
88: 56 c0 01 02 xsbh $2,$2
8c: 22 00 04 02 brhz $2,ac <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0xac> # ac
90: 41 c0 00 02 ilh $2,32768 # 8000
94: 30 80 00 03 lqa $3,0
98: 3e 80 00 84 cbd $4,0($1)
9c: 20 80 00 02 stqa $2,0
a0: 40 80 00 82 il $2,1
a4: b0 60 c1 04 shufb $3,$2,$3,$4
a8: 20 80 00 03 stqa $3,0
ac: 30 80 00 02 lqa $2,0
b0: 3f 83 41 02 rotqbyi $2,$2,13
b4: 7e 00 01 02 ceqbi $2,$2,0
b8: 56 c0 01 02 xsbh $2,$2
bc: 22 00 04 02 brhz $2,dc <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0xdc> # dc
c0: 41 be 00 02 ilh $2,31744 # 7c00
c4: 30 80 00 03 lqa $3,0
c8: 3e 80 00 84 cbd $4,0($1)
cc: 20 80 00 02 stqa $2,0
d0: 40 80 00 82 il $2,1
d4: b0 60 c1 04 shufb $3,$2,$3,$4
d8: 20 80 00 03 stqa $3,0
dc: 30 80 00 02 lqa $2,0
e0: 3f 83 41 02 rotqbyi $2,$2,13
e4: 7e 00 01 02 ceqbi $2,$2,0
e8: 56 c0 01 02 xsbh $2,$2
ec: 22 00 04 02 brhz $2,10c <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x10c> # 10c
f0: 41 9c 00 02 ilh $2,14336 # 3800
f4: 30 80 00 03 lqa $3,0
f8: 3e 80 00 84 cbd $4,0($1)
fc: 20 80 00 02 stqa $2,0
100: 40 80 00 82 il $2,1
104: b0 60 c1 04 shufb $3,$2,$3,$4
108: 20 80 00 03 stqa $3,0
10c: 30 80 00 03 lqa $3,0
110: 30 80 00 02 lqa $2,0
114: 30 80 00 04 lqa $4,0
118: 30 80 00 06 lqa $6,0
11c: 35 80 00 12 hbr 164 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x164>,$0
120: b0 61 43 83 shufb $3,$7,$5,$3
124: b0 41 43 82 shufb $2,$7,$5,$2
128: 58 21 01 87 andc $7,$3,$4
12c: 00 20 00 00 lnop
130: 18 21 01 83 and $3,$3,$4
134: 30 80 00 04 lqa $4,0
138: 0f be c1 02 rothmi $2,$2,-5
13c: 09 01 c3 05 sfh $5,$6,$7
140: 49 01 c3 06 cgth $6,$6,$7
144: 0f e0 c2 85 shlhi $5,$5,3
148: 18 20 82 04 and $4,$4,$2
14c: 30 80 00 02 lqa $2,0
150: 08 21 01 83 or $3,$3,$4
154: 18 21 41 02 and $2,$2,$5
158: 08 20 81 83 or $3,$3,$2
15c: 58 21 81 83 andc $3,$3,$6
160: 40 20 00 7f nop $127
164: 35 00 00 00 bi $0
168: 41 81 ff 82 ilh $2,1023 # 3ff
16c: 10 00 00 09 hbra 190 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x190>,0
170: 30 80 00 03 lqa $3,0
174: 3e 80 00 84 cbd $4,0($1)
178: 20 80 00 02 stqa $2,0
17c: 40 80 00 82 il $2,1
180: b0 60 c1 04 shufb $3,$2,$3,$4
184: 40 20 00 7f nop $127
188: 40 20 00 7f nop $127
18c: 20 80 00 03 stqa $3,0
190: 32 7f dd 80 br 7c <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x7c> # 7c
194: 00 20 00 00 lnop
 
 
Yup, lots of lovely badness, including a handful of lovely branches in our branchless code.
 
My next step was to simply move the statics out of the function. That’s the only change made to get this disassembly:
 
00000000 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_>:
0: 04 00 01 85 ori $5,$3,0
4: 30 80 00 03 lqa $3,0
8: 30 80 00 02 lqa $2,0
c: 30 80 00 09 lqa $9,0
10: 30 80 00 07 lqa $7,0
14: 35 80 00 10 hbr 54 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x54>,$0
18: b0 61 02 83 shufb $3,$5,$4,$3
1c: b0 a1 02 82 shufb $5,$5,$4,$2
20: 41 81 ff 84 ilh $4,1023 # 3ff
24: 30 80 00 02 lqa $2,0
28: 58 22 41 88 andc $8,$3,$9
2c: 0f be c2 85 rothmi $5,$5,-5
30: 09 02 03 86 sfh $6,$7,$8
34: 18 22 41 83 and $3,$3,$9
38: 0f e0 c3 06 shlhi $6,$6,3
3c: 18 21 42 04 and $4,$4,$5
40: 49 02 03 87 cgth $7,$7,$8
44: 08 21 01 83 or $3,$3,$4
48: 18 21 81 02 and $2,$2,$6
4c: 08 20 81 83 or $3,$3,$2
50: 58 21 c1 83 andc $3,$3,$7
54: 35 00 00 00 bi $0
 
 
Interesting, not only did all the badness go away, but the function is also now almost half the size. (including a branch that was originally hanging out in the middle of the function)
 
The last change I tried before calling it a day on this was to put the value declarations back into the function, but as const, rather than static, like so:
 
const qword shuf_mid16 = (qword)(vu8){ 0x01, 0x02, 0x05, 0x06, 0x09, 0x0a, 0x0d, 0x0e, 0x11, 0x12, 0x15, 0x16, 0x19, 0x1a, 0x1d, 0x1e };
const qword shuf_hi16 = (qword)(vu8){ 0x00, 0x01, 0x04, 0x05, 0x08, 0x09, 0x0c, 0x0d, 0x10, 0x11, 0x14, 0x15, 0x18, 0x19, 0x1c, 0x1d };
const qword s_mask = (qword)(vu32){ 0x80008000, 0x80008000, 0x80008000, 0x80008000 };
const qword e_mask = (qword)(vu32){ 0x7c007c00, 0x7c007c00, 0x7c007c00, 0x7c007c00 }; //@@@ calc this in code!
const qword exp_off = (qword)(vu32){ 0x38003800, 0x38003800, 0x38003800, 0x38003800 }; //@@@ calc this in code!
const qword m_mask = si_ilh(0x03ff); //{e4}
 
This change ended up yielding the best results of all the setups:
 
00000000 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_>:
0: 04 00 01 85 ori $5,$3,0
4: 30 80 00 03 lqa $3,0
8: 41 c0 00 09 ilh $9,32768 # 8000
c: 30 80 00 02 lqa $2,0
10: 41 9c 00 07 ilh $7,14336 # 3800
14: 35 80 00 10 hbr 54 <_ZN3FXV16SimdFloatToHalf2EU8__vectorfS0_+0x54>,$0
18: b0 61 02 83 shufb $3,$5,$4,$3
1c: b0 a1 02 82 shufb $5,$5,$4,$2
20: 41 81 ff 84 ilh $4,1023 # 3ff
24: 41 be 00 02 ilh $2,31744 # 7c00
28: 58 22 41 88 andc $8,$3,$9
2c: 0f be c2 85 rothmi $5,$5,-5
30: 09 02 03 86 sfh $6,$7,$8
34: 18 22 41 83 and $3,$3,$9
38: 0f e0 c3 06 shlhi $6,$6,3
3c: 18 21 42 04 and $4,$4,$5
40: 49 02 03 87 cgth $7,$7,$8
44: 08 21 01 83 or $3,$3,$4
48: 18 21 81 02 and $2,$2,$6
4c: 08 20 81 83 or $3,$3,$2
50: 58 21 c1 83 andc $3,$3,$7
54: 35 00 00 00 bi $0
 
Notice that it’s the same number of instructions as when the static were moved out of the function, but a few of the lqa instructions were automatically changed to ilh instructions to better balance between even/odd.
 
So, the moral of the story is, be weary of using statics within a function. Use const instead. Also, for values that you know can be easily constructed by a single instructions such as ilh, just leave the value in there so the compiler can load it or construct it as needed to better balance your function. 
 
As a small addendum to the problem of this function, Mike Day optimized that function over the last weekend as just something to do. He managed to make it four instructions shorter (six if it’s inlined, as there’s a nop and an lnop).
 
Here’s Mike’s version of the function:
 
qword SimdFloatToHalf( vf32 a_, vf32 b_ )
{
const qword shuf_hi16 = (qword)(vu16){ 0x0001, 0x0405, 0x0809, 0x0c0d, 0x1011, 0x1415, 0x1819, 0x1c1d };
 
const qword s_mask = (qword)(vu16){ 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000 };
const qword blah = (qword)(vu16){ 0x4000, 0x4000, 0x4000, 0x4000, 0x4000, 0x4000, 0x4000, 0x4000 };
const qword exp_off = (qword)(vu16){ 0x3800, 0x3800, 0x3800, 0x3800, 0x3800, 0x3800, 0x3800, 0x3800 };
 
qword a_rot5 = si_shlqbii((qword)a_, 3);
qword b_rot5 = si_shlqbii((qword)b_, 3);
 
qword mid = si_shufb(a_rot5, b_rot5, shuf_hi16); // mid = | e5 e4 e3 e2 e1 e0 m23 m22 m21 m20 m19 m18 m17 m16 m15 m14 |
qword hi = si_shufb((qword)a_, (qword)b_, shuf_hi16); // hi = | s e7 e6 e5 e4 e3 e2 e1 e0 m23 m22 m21 m20 m19 m18 m17 |
 
qword h = si_xor(mid, blah); // h = | e5 e4' e3 e2 e1 e0 m23 m22 m21 m20 m19 m18 m17 m16 m15 m14 |
qword e = si_andc(hi, s_mask); // e = | 0 e7 e6 e5 e4 e3 e2 e1 e0 m23 m22 m21 m20 m19 m18 m17 |
h = si_selb(h, hi, s_mask); // h = | s e4' e3 e2 e1 e0 m23 m22 m21 m20 m19 m18 m17 m16 m15 m14 |
qword d_mask = si_cgth(exp_off, e);
h = si_andc(h, d_mask); // set to 0 if underflow
 
return h;
}
 
 
And here’s what it compiles into:
00000000 <_ZN3FXV15SimdFloatToHalfEU8__vectorfS0_>:
0: 04 00 01 86 ori $6,$3,0
4: 35 80 00 0f hbr 40 <_ZN3FXV15SimdFloatToHalfEU8__vectorfS0_+0x40>,$0
8: 41 c0 00 08 ilh $8,32768 # 8000
c: 30 80 00 07 lqa $7,0
10: 3f 60 c2 02 shlqbii $2,$4,3
14: 3f 60 c1 83 shlqbii $3,$3,3
18: 40 20 00 7f nop $127
1c: b0 c1 03 07 shufb $6,$6,$4,$7
20: 41 9c 00 04 ilh $4,14336 # 3800
24: b0 60 81 87 shufb $3,$3,$2,$7
28: 41 a0 00 02 ilh $2,16384 # 4000
2c: 58 22 03 05 andc $5,$6,$8
30: 48 20 81 83 xor $3,$3,$2
34: 49 01 42 04 cgth $4,$4,$5
38: 80 61 81 88 selb $3,$3,$6,$8
3c: 58 21 01 83 andc $3,$3,$4
40: 35 00 00 00 bi $0
44: 00 20 00 00 lnop
 
 
0
August 02, 2007

Dynamic SPU Code

Posted by Mike Acton, Engine Director
When Eric Christensen and I were speaking at Georgia Tech, one of the big points we'd made was that on the SPUs, "Code is just data." There's nothing special about it and really no reason at all why you shouldn't just be dynamically uploading the code you need to transform the data right along with the data. Using the OS provided system for launching SPUs and SPU code may carry too much overhead though. And sometimes dynamic code loading libraries are just over-complicated. So we've written up a little document to give those of you that want to dynamically load SPU fragments a really straightforward and simple way to do just that.
 
0
August 02, 2007

Process Memory Utility

Posted by Mike Acton, Engine Director
The amount of content that needs to be managed, converted and otherwise manipulated by our PC tools never stops growing. Games are getting bigger all the time. The struggle with hard drive space, memory and network speed is part of a constant battle to keep production moving. Rob Wyatt wrote this handy little utility to help track memory usage in different processes running under some of our Windows boxes.
 
From Rob:
“Attached is a very crude app that will dump the memory map of a running Win32 process, in addition to the memory map it tells you the type of pages in each allocation and tries to map each allocation to a loaded module, if the allocation is assigned to a loaded module the name of the EXE/DLL is printed along side.
 
After the memory map it will display the amount free address space, the largest free block, the amount of address space reserved and the amount committed.
 
If you run the tool without any options it will dump all the processed and their PIDs, to look at a specific process pass the PID on the command line, the process has to be running. If you want to put this code into an application so it can dump its own memory map while it is running you can just cut the guts out (basically you need to use VirtualQuery instead of VirtualQueryEx).
 
You could extend this to do much more, you could find the data and BSS sections of DLLs, memory that is mapped to a named section could display info about the section. You could find heaps within the process, you could locate the stack within the process etc etc.”
 
 
To build (with gcc):
$ g++ –O3 process_mem.cpp –o process_mem –lpsapi
 
0
August 02, 2007

Curved Surfaces on the RSX

Posted by Mike Acton, Engine Director
While the Cell processor is certainly the heart of the Playstation 3, understanding its graphics processor (the RSX) is also an important part of the engine team's work. The game we're currently working on, Ratchet and Clank Future: Tools of Destruction (check out the RCF trailer!), involves a lot of effects and other graphics processing. We're always experimenting with alternative ways to take advantage of the RSX. In another one of the Tech Team's weekly presentations, Rob Wyatt talked about the basics of drawing curves and curved surfaces completely on the RSX. Some content has been removed from this presentation because not all of the details are public, but the basics are definitely there!
 
0
August 02, 2007

Gameplay Architecture for Performance

Posted by Terrance Cohen, Engine Programmer
Here is a presentation given to the programmers at Insomniac Games on July 6, 2007. It covers two related topics. First we consider a case study of an optimization to NPC AI used on Resistance: Fall of Man. Then we discuss a broader proposal for a game-play architecture that supports and encourages higher performance code compared to the common game-object driven main loop.
0
August 02, 2007

Effects Conduit

Posted by Terrance Cohen, Engine Programmer
Here is a presentation given to the programmers at Insomniac Games on Febrary 28, 2007, about our Effects Conduit. The Conduit is a system for coordinating various types of effects (e.g. visual, sound, and dialogue). It allows programmers to focus on game-play logic while designers focus on effects. And it provides a convenient mechanism for quickly introducing generic effects early in a project and then specializing them as more assets become available.
0
August 02, 2007

Multithreading Optimization Basics

Posted by Mike Acton, Engine Director
We're now knee deep in the world of multi-core processing. Virtually everything we do must consider synchronization and parallelism as part of the fundamental design - it's not ported to multi-core, it's designed multi-core. However, there's still legacy code as would be expected. Part of the plan to make sure that we continue forward with designing everything to be more and more parallel is making sure that everyone understands the basics. This was from one of our internal Tech Weekly Presentations I did on the basics of multithreading and one specific simple lock-free optimization. The key to good multi-processing, or multi-threading, is to not be in a position where you need to "optimize out" synchronization because you will always end up with implicit dependencies that are tough to get rid of. What you want to do is start from the position of synchronizing as little as possible and adding sync points in only as necessary, and as few as possible.
0
August 02, 2007

igCollision

Posted by Mike Acton, Engine Director
Jonathan Garrett gave an internal talk about igCollision a while back. This was one of our weekly presentations where we try to share something we've done, something we're working on, or something we just think would be generally helpful to other programmers. He gave a broad overview of our low-level SPU collision system - touching on the core components, overall asynconous processing and balancing a system across multiple SPUs and the PPU.
0
LATEST ARTICLES