Dynamic Code Uploading on the SPU
---------------------------------

Version: 30 May 2007, Insomniac Games

    Eric Christensen
    Mike Acton

Introduction
------------

A very common issue when developing for the SPU is having a data
transform (function) that is too large to fit, along with the necessary
data, in the SPU local store.

The solution to this problem is quite simple once you realize two
important facts:

    (1) Code is just data.
    (2) There is no special section in SPU local store for code, nor
        special executable flags.

In other words, decompose code exactly the same way you would decompose
data and upload code exactly the same way you would upload data.

Conceptually, SPU code fragments are exactly the same as shader programs
on the GPU. Create code that is exactly (and nothing more than) what is
needed to transform some specific data, patch it if necessary, then
upload it along with the data.

Background
----------

(1) When is it appropriate to dynamically upload code fragments?

    (a) When the SPU code size plus the data that you need to operate
        on at one time is greater than 256K.
    (b) When you are sharing the SPU with another process and the
        necessary code is determined by the data that is being loaded
        to local store.

(2) When is it not appropriate to dynamically upload code fragments?

    (a) When you're doing something small for a short amount of time.
    (b) When the data can be organized so that enough work fits in the
        local store that remains after all the code is loaded.
    (c) When you're doing it just for the hell of it.

Things to Consider
------------------

(1) While it's feasible to call other functions directly from another
    code fragment, only do it if you know your functions are going to
    live at the same offset from each other as they did in the compiled
    object.
(2) Constants such as string literals will get stored in a read-only
    segment that is referenced directly by your code. This means that
    if you relocate that segment, your code needs a fix-up.
    While this is possible, it's currently not recommended.
(3) Debugging your dynamic code in a production environment can be very
    difficult. You need to have individually compiled ELFs for each
    piece of code so you can debug them with preconditioned data.

Organizing the Code
-------------------

(1) Collect common functions

    There are bound to be some functions that any dynamically loaded
    code will need. Create a structure that contains function pointers
    to all of the global functions that will be used by your uploaded
    code. Some examples are:

        Wrapper functions for DMA get/put
        Print functions for debugging

    While dynamically loaded code can certainly be fixed up to branch
    to the appropriate location of any common functions, it's generally
    simpler (especially for a first pass) to collect all your common
    functions in a structure of function pointers which is passed in to
    any dynamically loaded code. Using this method it doesn't matter
    where anything is located in local store, and it can still be
    easily optimized out later.

    For example, something like:

        typedef struct CommonFunctions CommonFunctions;

        /* Function types for the pre-loaded helpers. */
        typedef int  (PrintfProc)(const char *, ...);
        typedef void (PrintVectorProc)(vf32);
        typedef void (PrintIntegerProc)(int);
        typedef void (PrintFloatProc)(float);
        typedef void (PrintMatrix4Proc)(mtx4 *);
        typedef void (PrintMatrix3Proc)(mtx3 *);
        typedef void (DmaGetProc)(volatile void*, unsigned int, unsigned int, unsigned int);
        typedef void (DmaPutProc)(volatile void*, unsigned int, unsigned int, unsigned int);

        /* Table of helpers handed to every fragment. */
        struct CommonFunctions
        {
            PrintfProc*       m_print_proc;
            PrintVectorProc*  m_print_vector;
            PrintIntegerProc* m_print_integer;
            PrintFloatProc*   m_print_float;
            PrintMatrix4Proc* m_print_mtx4;
            PrintMatrix3Proc* m_print_mtx3;
            DmaGetProc*       m_dma_get;
            DmaPutProc*       m_dma_put;
        };

    Common functions are functions which would always be available on
    the SPU, pre-loaded.
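    To make the calling convention concrete, here is a minimal sketch of
    a fragment-style function that reaches everything through the table.
    It is written to build on an ordinary host compiler, so the table is
    cut down to two entries, the DMA wrapper is a memcpy stand-in, and
    the argument order (local store address, effective address, size,
    tag) is an assumption rather than something the typedefs above
    state. SumParams, SumBytesFragment, HostDmaGet and RunSumExample are
    hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reduced, hypothetical subset of the CommonFunctions table. */
typedef int  (PrintfProc)(const char *, ...);
typedef void (DmaGetProc)(volatile void *, unsigned int, unsigned int, unsigned int);

typedef struct CommonFunctions
{
    PrintfProc *m_print_proc;
    DmaGetProc *m_dma_get;
} CommonFunctions;

/* Hypothetical parameter block: everything the fragment needs is passed in. */
typedef struct SumParams
{
    void        *m_ls_buffer; /* local buffer chosen by the caller          */
    unsigned int m_ea;        /* source address (an offset in this sketch)  */
    unsigned int m_size;      /* number of bytes to fetch                   */
} SumParams;

/* Fragment-style function: no globals, no direct calls to other functions. */
int SumBytesFragment(const CommonFunctions *common, void *params)
{
    SumParams *p = (SumParams *)params;
    const unsigned char *bytes = (const unsigned char *)p->m_ls_buffer;
    unsigned int i;
    int sum = 0;

    common->m_dma_get(p->m_ls_buffer, p->m_ea, p->m_size, 0);
    for (i = 0; i < p->m_size; i++)
    {
        sum += bytes[i];
    }
    common->m_print_proc("sum = %d\n", sum);
    return sum;
}

/* Host-only stand-ins for main memory and the DMA wrapper (assumptions). */
static unsigned char s_fake_main_ram[64];

static void HostDmaGet(volatile void *ls, unsigned int ea,
                       unsigned int size, unsigned int tag)
{
    (void)tag;
    assert(ea + size <= sizeof(s_fake_main_ram));
    memcpy((void *)ls, s_fake_main_ram + ea, size);
}

/* Caller side: build the table once, pack the parameters, call the fragment. */
int RunSumExample(void)
{
    CommonFunctions common = { printf, HostDmaGet };
    unsigned char   local_buffer[16];
    SumParams       params = { local_buffer, 16, sizeof(local_buffer) };
    unsigned int    i;

    for (i = 0; i < 16; i++)
    {
        s_fake_main_ram[16 + i] = (unsigned char)(i + 1);
    }
    return SumBytesFragment(&common, &params);
}
```

    Because the fragment never names printf or the DMA routine directly,
    the same compiled bytes should behave the same wherever they land,
    provided the table it is handed is valid.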
(2) Create uploadable fragments

    Code fragments are individual SPU functions, or sets of functions,
    which are to be dynamically loaded.

    Eric Christensen Says:
        For sanity's sake, keep all fragment function parameters the
        same. You can put any function-specific parameters in a
        structure. It should be assumed that the function before it
        knew how to pack the data for you to use.

    Mike Acton Adds:
        Since the ABI is obviously not going to change from the time
        you compile the fragment to the time it's run on the SPU, it's
        not really necessary to worry about the fragment's parameters.
        So long as the prototype the calling function uses does, in
        fact, match the parameters the fragment was compiled with,
        there will be no problems. However, Eric's suggestion above
        does probably help reduce user error: if you standardize on a
        parameter list, that should never end up being the source of
        any error.

    Rules for fragment code:

    (1) Call other functions through the pointers which are passed as
        parameters. Note the common functions structure will be passed
        into fragment programs as a parameter. Any other functions
        needed will either be passed in as parameters, or their
        location in main RAM will be passed in so they can also be
        loaded dynamically.
    (2) Only make use of data specified by the current local store
        mapping. The calling function must inform the fragment of the
        location of any needed local store. Do NOT refer to globals.

    Generating fragment code:

    In its simplest form, fragment code is compiled into a header which
    is included in a PPU module, which is linked into the PPU
    executable, which is then DMA'd up to the SPU on demand. Generating
    this header is a fairly straightforward process:

    (1) Compile the SPU fragment program as usual, into an object file.
    (2) Dump the object file using spu-objdump (objdump is a standard
        GNU binutils utility).
    (3) Convert the dump generated into an array which can be included
        in PPU code.
    (4) Include the array in PPU code, and refer to that address when
        uploading.
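    As a rough illustration of step (4), the PPU side might collect the
    generated arrays into a small table of address/size pairs. The
    sketch below inlines the DmaGet array and DMAGET_SIZE from the
    generated header shown later in this document (normally you would
    #include code_fragments.h instead). FragmentEntry, s_fragment_table
    and FindFragment are hypothetical names, and lookup by string is
    only one possible scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Excerpt of the generated code_fragments.h, inlined for this sketch. */
uint32_t DmaGet[] __attribute__((aligned(128))) =
{
    0x40ffff84, 0x3580000a, 0x24ff4081, 0x34000182,
    0x1cf40081, 0x3ec00185, 0x1c0c0081, 0xb0408205,
    0x4020007f, 0x4020007f, 0x24000182, 0x35000000,
};
#define DMAGET_SIZE 48

/* Hypothetical table entry: where a fragment lives and how big it is. */
typedef struct FragmentEntry
{
    const char     *m_name;
    const uint32_t *m_code;
    uint32_t        m_size; /* bytes */
} FragmentEntry;

static const FragmentEntry s_fragment_table[] =
{
    { "DmaGet", DmaGet, DMAGET_SIZE },
};

/* Linear search by name; returns NULL if the fragment is unknown. */
const FragmentEntry *FindFragment(const char *name)
{
    size_t i;
    size_t count = sizeof(s_fragment_table) / sizeof(s_fragment_table[0]);

    for (i = 0; i < count; i++)
    {
        if (strcmp(s_fragment_table[i].m_name, name) == 0)
        {
            return &s_fragment_table[i];
        }
    }
    return NULL;
}
```

    In a real build the address of such a table would be handed to the
    SPU once at start-up, and the SPU would fetch entries from it as
    needed.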
    An example Makefile for the above process:

        code_fragments.h : dynamic_code_frag.o.dump
        	awk -f obj_dump_header_gen.awk dynamic_code_frag.o.dump > code_fragments.h

        dynamic_code_frag.o.dump : dynamic_code_frag.o
        	spu-objdump -D dynamic_code_frag.o > dynamic_code_frag.o.dump

        dynamic_code_frag.o : dynamic_code_frag.c
        	spu-gcc -O3 -fpic -c dynamic_code_frag.c -o dynamic_code_frag.o

    Step (2) above generates a file like this:

    dynamic_code_frag.o: file format elf32-spu

    Disassembly of section .text:

    00000000 <DmaGet>:
    0: 40 ff ff 84 il $4,-1
    4: 35 80 00 0a hbr 2c ,$0
    8: 24 ff 40 81 stqd $1,-48($1)
    c: 34 00 01 82 lqd $2,0($3)
    10: 1c f4 00 81 ai $1,$1,-48
    14: 3e c0 01 85 cwd $5,0($3)
    18: 1c 0c 00 81 ai $1,$1,48 # 30
    1c: b0 40 82 05 shufb $2,$4,$2,$5
    20: 40 20 00 7f nop $127
    24: 40 20 00 7f nop $127
    28: 24 00 01 82 stqd $2,0($3)
    2c: 35 00 00 00 bi $0

    00000030 <DmaPut>:
    30: 40 ff ff 84 il $4,-1
    34: 35 80 00 0a hbr 5c ,$0
    38: 24 ff 40 81 stqd $1,-48($1)
    3c: 34 00 01 82 lqd $2,0($3)
    40: 1c f4 00 81 ai $1,$1,-48
    44: 3e c0 01 85 cwd $5,0($3)
    48: 1c 0c 00 81 ai $1,$1,48 # 30
    4c: b0 40 82 05 shufb $2,$4,$2,$5
    50: 40 20 00 7f nop $127
    54: 40 20 00 7f nop $127
    58: 24 00 01 82 stqd $2,0($3)
    5c: 35 00 00 00 bi $0

    00000060 <MainKernel>:
    60: 34 00 02 02 lqd $2,0($4)
    64: 24 fe c0 d4 stqd $84,-80($1)
    68: 04 00 02 54 ori $84,$4,0
    6c: 24 00 40 80 stqd $0,16($1)
    70: 24 ff c0 d0 stqd $80,-16($1)
    74: 24 ff 80 d1 stqd $81,-32($1)
    78: 04 00 01 d0 ori $80,$3,0
    7c: 24 ff 40 d2 stqd $82,-48($1)
    80: 04 00 02 d1 ori $81,$5,0
    84: 3b 81 01 02 rotqby $2,$2,$4
    88: 04 00 03 52 ori $82,$6,0
    8c: 24 ff 00 d3 stqd $83,-64($1)
    90: 04 00 02 83 ori $3,$5,0
    94: 24 fe 80 d5 stqd $85,-96($1)
    98: 1c 10 02 d3 ai $83,$5,64 # 40
    9c: 24 fd c0 81 stqd $1,-144($1)
    a0: 1c dc 00 81 ai $1,$1,-144
    a4: 35 80 01 0a hbr cc ,$2
    a8: 40 80 0a 55 il $85,20 # 14
    ac: 40 20 00 7f nop $127
    b0: 40 20 00 7f nop $127
    b4: 40 20 00 7f nop $127
    b8: 40 20 00 7f nop $127
    bc: 40 20 00 7f nop $127
    c0: 40 20 00 7f nop $127
    c4: 40 20 00 7f nop $127
    c8: 40 20 00 7f nop $127
    cc: 35 20 01 00 bisl $0,$2
    d0: 34 00 2a 02 lqd $2,0($84)
    d4: 04 00 29 83 ori $3,$83,0
    d8: 3b 95 01 02 rotqby $2,$2,$84
    dc: 35 20 01 00 bisl $0,$2
    e0: 41 1f c0 02 ilhu $2,16256 # 3f80
    e4: 3e c0 00 83 cwd $3,0($1)
    e8: 40 80 02 04 il $4,4
    ec: 38 81 2a 05 lqx $5,$84,$4
    f0: 18 01 2a 06 a $6,$84,$4
    f4: b0 54 01 03 shufb $2,$2,$80,$3
    f8: 41 20 00 04 ilhu $4,16384 # 4000
    fc: 3e c1 00 83 cwd $3,4($1)
    100: b0 80 82 03 shufb $4,$4,$2,$3
    104: 3e c2 00 82 cwd $2,8($1)
    108: 41 20 20 03 ilhu $3,16448 # 4040
    10c: 3b 81 82 86 rotqby $6,$5,$6
    110: 3e c3 00 85 cwd $5,12($1)
    114: b0 61 01 82 shufb $3,$3,$4,$2
    118: 41 20 40 02 ilhu $2,16512 # 4080
    11c: b0 60 c1 05 shufb $3,$2,$3,$5
    120: 35 20 03 00 bisl $0,$6
    124: 38 95 68 08 lqx $8,$80,$85
    128: 40 80 0c 02 il $2,24 # 18
    12c: 34 00 68 04 lqd $4,16($80)
    130: 18 15 68 05 a $5,$80,$85
    134: 34 00 aa 06 lqd $6,32($84) # 20
    138: 18 00 aa 07 a $7,$84,$2
    13c: 38 80 aa 02 lqx $2,$84,$2
    140: 04 00 29 03 ori $3,$82,0
    144: 3b 81 44 05 rotqby $5,$8,$5
    148: 3b 94 02 04 rotqby $4,$4,$80
    14c: 3b 95 03 06 rotqby $6,$6,$84
    150: 3b 81 c1 02 rotqby $2,$2,$7
    154: 35 20 01 00 bisl $0,$2
    158: 18 15 68 02 a $2,$80,$85
    15c: 35 80 29 09 hbr 180 ,$82
    160: 04 00 28 85 ori $5,$81,0
    164: 38 95 68 06 lqx $6,$80,$85
    168: 04 00 2a 04 ori $4,$84,0
    16c: 3f e0 28 03 shlqbyi $3,$80,0
    170: 40 20 00 7f nop $127
    174: 40 20 00 7f nop $127
    178: 3b 80 83 06 rotqby $6,$6,$2
    17c: 18 01 a9 06 a $6,$82,$6
    180: 35 20 29 00 bisl $0,$82
    184: 34 00 2a 02 lqd $2,0($84)
    188: 1c 08 28 83 ai $3,$81,32 # 20
    18c: 3b 95 01 02 rotqby $2,$2,$84
    190: 35 20 01 00 bisl $0,$2
    194: 34 00 2a 04 lqd $4,0($84)
    198: 04 00 29 83 ori $3,$83,0
    19c: 3b 95 02 02 rotqby $2,$4,$84
    1a0: 35 20 01 00 bisl $0,$2
    1a4: 1c 24 00 81 ai $1,$1,144 # 90
    1a8: 40 80 00 03 il $3,0
    1ac: 34 ff c0 d0 lqd $80,-16($1)
    1b0: 34 ff 80 d1 lqd $81,-32($1)
    1b4: 34 ff 40 d2 lqd $82,-48($1)
    1b8: 34 ff 00 d3 lqd $83,-64($1)
    1bc: 34 fe c0 d4 lqd $84,-80($1)
    1c0: 34 fe 80 d5 lqd $85,-96($1)
    1c4: 34 00 40 80 lqd $0,16($1)
    1c8: 35 00 00 00 bi $0
    1cc: 00 20 00 00 lnop

    000001d0 <InternalFunctionTest>:
    1d0: 04 00 02 83 ori $3,$5,0
    1d4: 34 00 02 02 lqd $2,0($4)
    1d8: 24 ff 40 d2 stqd $82,-48($1)
    1dc: 04 00 02 52 ori $82,$4,0
    1e0: 24 00 40 80 stqd $0,16($1)
    1e4: 24 ff c0 d0 stqd $80,-16($1)
    1e8: 04 00 02 d0 ori $80,$5,0
    1ec: 24 ff 80 d1 stqd $81,-32($1)
    1f0: 1c 18 02 d1 ai $81,$5,96 # 60
    1f4: 3b 81 01 02 rotqby $2,$2,$4
    1f8: 24 fe 80 81 stqd $1,-96($1)
    1fc: 1c e8 00 81 ai $1,$1,-96
    200: 35 20 01 00 bisl $0,$2
    204: 34 00 29 02 lqd $2,0($82)
    208: 04 00 28 83 ori $3,$81,0
    20c: 3b 94 81 02 rotqby $2,$2,$82
    210: 35 20 01 00 bisl $0,$2
    214: 3e c0 00 87 cwd $7,0($1)
    218: 41 20 00 04 ilhu $4,16384 # 4000
    21c: 3e c1 00 88 cwd $8,4($1)
    220: 41 20 40 05 ilhu $5,16512 # 4080
    224: 3e c2 00 89 cwd $9,8($1)
    228: 41 20 80 06 ilhu $6,16640 # 4100
    22c: 3e c3 00 8a cwd $10,12($1)
    230: 41 20 c0 02 ilhu $2,16768 # 4180
    234: b0 74 02 07 shufb $3,$4,$80,$7
    238: 40 80 02 0b il $11,4
    23c: 38 82 e9 07 lqx $7,$82,$11
    240: 18 02 e9 0c a $12,$82,$11
    244: b0 60 c2 88 shufb $3,$5,$3,$8
    248: b0 60 c3 09 shufb $3,$6,$3,$9
    24c: 3b 83 03 84 rotqby $4,$7,$12
    250: b0 60 c1 0a shufb $3,$2,$3,$10
    254: 04 00 01 82 ori $2,$3,0
    258: 58 c0 c1 83 fm $3,$3,$3
    25c: 58 c0 81 83 fm $3,$3,$2
    260: 35 20 02 00 bisl $0,$4
    264: 34 00 29 02 lqd $2,0($82)
    268: 1c 08 28 03 ai $3,$80,32 # 20
    26c: 3b 94 81 02 rotqby $2,$2,$82
    270: 35 20 01 00 bisl $0,$2
    274: 34 00 29 04 lqd $4,0($82)
    278: 04 00 28 83 ori $3,$81,0
    27c: 3b 94 82 02 rotqby $2,$4,$82
    280: 35 20 01 00 bisl $0,$2
    284: 1c 18 00 81 ai $1,$1,96 # 60
    288: 34 ff c0 d0 lqd $80,-16($1)
    28c: 34 ff 80 d1 lqd $81,-32($1)
    290: 34 ff 40 d2 lqd $82,-48($1)
    294: 34 00 40 80 lqd $0,16($1)
    298: 35 00 00 00 bi $0
    29c: 00 20 00 00 lnop

    000002a0 <PrintVector>:
    2a0: 04 00 01 86 ori $6,$3,0
    2a4: 12 00 00 11 hbrr 2e8 ,0
    2a8: 42 00 00 02 ila $2,0
    2ac: 33 00 00 ca brsl $74,2b0 # 2b0
    2b0: 04 00 03 04 ori $4,$6,0
    2b4: 3f 83 03 07 rotqbyi $7,$6,12
    2b8: 42 00 00 03 ila $3,0
    2bc: 3f 81 03 05 rotqbyi $5,$6,4
    2c0: 77 00 02 04 fesd $4,$4
    2c4: 3f 82 03 06 rotqbyi $6,$6,8
    2c8: 77 00 03 87 fesd $7,$7
    2cc: 77 00 02 85 fesd $5,$5
    2d0: 08 12 81 4a sf $74,$2,$74
    2d4: 24 ff 40 81 stqd $1,-48($1)
    2d8: 77 00 03 06 fesd $6,$6
    2dc: 1c f4 00 81 ai $1,$1,-48
    2e0: 18 12 81 83 a $3,$3,$74
    2e4: 1c 0c 00 81 ai $1,$1,48 # 30
    2e8: 32 00 00 00 br 0
    2ec: 00 20 00 00 lnop

    000002f0 <PrintMatrix3>:
    2f0: 12 00 00 07 hbrr 30c ,0
    2f4: 24 ff c0 d0 stqd $80,-16($1)
    2f8: 04 00 01 d0 ori $80,$3,0
    2fc: 34 00 01 83 lqd $3,0($3)
    300: 24 00 40 80 stqd $0,16($1)
    304: 24 ff 00 81 stqd $1,-64($1)
    308: 1c f0 00 81 ai $1,$1,-64
    30c: 33 00 00 00 brsl $0,0
    310: 34 00 68 03 lqd $3,16($80)
    314: 33 00 00 00 brsl $0,0
    318: 1c 10 00 81 ai $1,$1,64 # 40
    31c: 34 00 a8 03 lqd $3,32($80) # 20
    320: 34 00 40 80 lqd $0,16($1)
    324: 34 ff c0 d0 lqd $80,-16($1)
    328: 32 00 00 00 br 0
    32c: 00 20 00 00 lnop

    00000330 <PrintMatrix4>:
    330: 12 00 00 07 hbrr 34c ,0
    334: 24 ff c0 d0 stqd $80,-16($1)
    338: 04 00 01 d0 ori $80,$3,0
    33c: 34 00 01 83 lqd $3,0($3)
    340: 24 00 40 80 stqd $0,16($1)
    344: 24 ff 00 81 stqd $1,-64($1)
    348: 1c f0 00 81 ai $1,$1,-64
    34c: 33 00 00 00 brsl $0,0
    350: 34 00 68 03 lqd $3,16($80)
    354: 33 00 00 00 brsl $0,0
    358: 34 00 a8 03 lqd $3,32($80) # 20
    35c: 33 00 00 00 brsl $0,0
    360: 1c 10 00 81 ai $1,$1,64 # 40
    364: 34 00 e8 03 lqd $3,48($80) # 30
    368: 34 00 40 80 lqd $0,16($1)
    36c: 34 ff c0 d0 lqd $80,-16($1)
    370: 32 00 00 00 br 0
    374: 00 20 00 00 lnop

    00000378 <PrintFloat>:
    378: 42 00 00 02 ila $2,0
    37c: 33 00 00 ca brsl $74,380 # 380
    380: 77 00 01 84 fesd $4,$3
    384: 12 00 00 0a hbrr 3ac ,0
    388: 42 00 00 03 ila $3,0
    38c: 24 ff 40 81 stqd $1,-48($1)
    390: 08 12 81 4a sf $74,$2,$74
    394: 1c f4 00 81 ai $1,$1,-48
    398: 18 12 81 83 a $3,$3,$74
    39c: 00 20 00 00 lnop
    3a0: 1c 0c 00 81 ai $1,$1,48 # 30
    3a4: 40 20 00 7f nop $127
    3a8: 40 20 00 7f nop $127
    3ac: 32 00 00 00 br 0

    000003b0 <PrintInteger>:
    3b0: 42 00 00 02 ila $2,0
    3b4: 33 00 00 ca brsl $74,3b8 # 3b8
    3b8: 04 00 01 84 ori $4,$3,0
    3bc: 12 00 00 0a hbrr 3e4 ,0
    3c0: 42 00 00 03 ila $3,0
    3c4: 24 ff 40 81 stqd $1,-48($1)
    3c8: 1c f4 00 81 ai $1,$1,-48
    3cc: 08 12 81 4a sf $74,$2,$74
    3d0: 1c 0c 00 81 ai $1,$1,48 # 30
    3d4: 00 20 00 00 lnop
    3d8: 18 12 81 83 a $3,$3,$74
    3dc: 40 20 00 7f nop $127
    3e0: 40 20 00 7f nop $127
    3e4: 32 00 00 00 br 0

    Disassembly of section .data:

    00000000 <data_section>:
    0: 45 78 65 63 xorhi $99,$74,481 # 1e1
    4: 75 74 69 6e mpyui $110,$82,465 # 1d1
    8: 67 20 46 75 .long 0x67204675
    c: 6e 63 74 69 .long 0x6e637469
    10: 6f 6e 3b 20 .long 0x6f6e3b20
    14: 20 20 20 20 brz $32,10114 # 10114
    18: 20 20 20 20 brz $32,10118 # 10118
    1c: 20 20 0a 00 brz $0,1006c # 1006c
    20: 4c 65 61 76 cgti $118,$66,405 # 195
    24: 69 6e 67 20 .long 0x696e6720
    28: 46 75 6e 63 xorbi $99,$92,469 # 1d5
    2c: 74 69 6f 6e mpyi $110,$94,421 # 1a5
    30: 3a 20 20 20 .long 0x3a202020
    34: 20 20 20 20 brz $32,10134 # 10134
    38: 20 20 20 20 brz $32,10138 # 10138
    3c: 20 20 0a 00 brz $0,1008c # 1008c
    40: 4d 61 69 6e cgthi $110,$82,389 # 185
    44: 20 4b 65 72 brz $114,25b6c
    48: 6e 65 6c 20 .long 0x6e656c20
    4c: 20 20 20 20 brz $32,1014c # 1014c
    50: 20 20 20 20 brz $32,10150 # 10150
    54: 20 20 20 20 brz $32,10154 # 10154
    58: 20 20 20 20 brz $32,10158 # 10158
    5c: 20 20 0a 00 brz $0,100ac # 100ac
    60: 54 65 73 74 .long 0x54657374
    64: 20 46 75 6e brz $110,2340c
    68: 63 74 69 6f .long 0x6374696f
    6c: 6e 20 41 20 .long 0x6e204120
    70: 20 20 20 20 brz $32,10170 # 10170
    74: 20 20 20 20 brz $32,10174 # 10174
    78: 20 20 20 20 brz $32,10178 # 10178
    7c: 20 20 0a 00 brz $0,100cc # 100cc
    80: 64 75 6d 6d .long 0x64756d6d
    84: 79 20 66 6f .long 0x7920666f
    88: 72 20 6d 61 .long 0x72206d61
    8c: 69 6e 20 20 .long 0x696e2020
    90: 20 20 20 20 brz $32,10190 # 10190
    94: 20 20 20 20 brz $32,10194 # 10194
    98: 20 20 20 20 brz $32,10198 # 10198
    9c: 20 20 0a 00 brz $0,100ec # 100ec
    ...

    Disassembly of section .rodata.str1.16:

    00000000 <.rodata.str1.16>:
    0: 25 66 20 25 bihnze $37,$64
    4: 66 20 25 66 .long 0x66202566
    8: 20 25 66 0a brz $10,12b38 # 12b38
    c: 00 00 00 00 stop
    10: 25 66 0a 00 bihnze $0,$20
    ...
    20: 25 64 0a 00 bihnze $0,$20
    ...

    Disassembly of section .comment:

    00000000 <.comment>:
    0: 00 47 43 43 synce
    4: 3a 20 28 47 .long 0x3a202847
    8: 4e 55 29 20 cgtbi $32,$82,340 # 154
    c: 34 2e 31 2e lqd $46,2944($98) # b80
    10: Address 0x0000000000000010 is out of bounds.

    Step (3) uses the following awk script (obj_dump_header_gen.awk) to
    convert the dump above to a header file. Note that it relies on
    gensub, so it needs GNU awk:

        BEGIN {
            section       = ".none";
            function_open = 0;
            function_size = 0;
            function_name = "";
        }

        {
            if ($1 == "Disassembly")
            {
                close_function();
                section = $4;
            }

            if (section == ".text:")
            {
                # Two fields, start of new function
                if (NF == 2)
                {
                    close_function();
                    function_name = gensub(/[<:> ]/,"","g",$2);
                    printf("uint32_t %s[] __attribute__((aligned(128))) = \n",function_name);
                    printf("{\n");
                    function_open = 1;
                }
                else if ( function_open && (NF >= 5) )
                {
                    printf(" 0x%s%s%s%s, // ",$2,$3,$4,$5);
                    for ( i=6;i<=NF;i++)
                    {
                        printf("%s ",$i);
                    }
                    printf("\n");
                    function_size += 4;
                }
            }

            if (section == ".data:") # just in case we want to do something different with the printing
            {
                # Two fields, start of new function
                if (NF == 2)
                {
                    close_function();
                    function_name = gensub(/[<:> ]/,"","g",$2);
                    printf("uint32_t %s[] __attribute__((aligned(128))) = \n",function_name);
                    printf("{\n");
                    function_open = 1;
                }
                else if ( function_open && (NF >= 5) )
                {
                    printf(" 0x%s%s%s%s,",$2,$3,$4,$5);
                    printf("\n");
                    function_size += 4;
                }
            }
        }

        END {
            close_function();
        }

        # Close the current array and emit its size (in bytes) as a macro.
        function close_function()
        {
            if ( function_open )
            {
                printf("};\n\n");
                printf("#define %s %d\n",toupper(function_name "_size"),function_size);
                printf("\n\n");
                function_open = 0;
                function_size = 0;
            }
        }

    The header file generated by the script above looks like this:

    uint32_t DmaGet[] __attribute__((aligned(128))) =
    {
     0x40ffff84, // il $4,-1
     0x3580000a, // hbr 2c ,$0
     0x24ff4081, // stqd $1,-48($1)
     0x34000182, // lqd $2,0($3)
     0x1cf40081, // ai $1,$1,-48
     0x3ec00185, // cwd $5,0($3)
     0x1c0c0081, // ai $1,$1,48 # 30
     0xb0408205, // shufb $2,$4,$2,$5
     0x4020007f, // nop $127
     0x4020007f, // nop $127
     0x24000182, // stqd $2,0($3)
     0x35000000, //
bi $0 }; #define DMAGET_SIZE 48 uint32_t DmaPut[] __attribute__((aligned(128))) = { 0x40ffff84, // il $4,-1 0x3580000a, // hbr 5c ,$0 0x24ff4081, // stqd $1,-48($1) 0x34000182, // lqd $2,0($3) 0x1cf40081, // ai $1,$1,-48 0x3ec00185, // cwd $5,0($3) 0x1c0c0081, // ai $1,$1,48 # 30 0xb0408205, // shufb $2,$4,$2,$5 0x4020007f, // nop $127 0x4020007f, // nop $127 0x24000182, // stqd $2,0($3) 0x35000000, // bi $0 }; #define DMAPUT_SIZE 48 uint32_t MainKernel[] __attribute__((aligned(128))) = { 0x34000202, // lqd $2,0($4) 0x24fec0d4, // stqd $84,-80($1) 0x04000254, // ori $84,$4,0 0x24004080, // stqd $0,16($1) 0x24ffc0d0, // stqd $80,-16($1) 0x24ff80d1, // stqd $81,-32($1) 0x040001d0, // ori $80,$3,0 0x24ff40d2, // stqd $82,-48($1) 0x040002d1, // ori $81,$5,0 0x3b810102, // rotqby $2,$2,$4 0x04000352, // ori $82,$6,0 0x24ff00d3, // stqd $83,-64($1) 0x04000283, // ori $3,$5,0 0x24fe80d5, // stqd $85,-96($1) 0x1c1002d3, // ai $83,$5,64 # 40 0x24fdc081, // stqd $1,-144($1) 0x1cdc0081, // ai $1,$1,-144 0x3580010a, // hbr cc ,$2 0x40800a55, // il $85,20 # 14 0x4020007f, // nop $127 0x4020007f, // nop $127 0x4020007f, // nop $127 0x4020007f, // nop $127 0x4020007f, // nop $127 0x4020007f, // nop $127 0x4020007f, // nop $127 0x4020007f, // nop $127 0x35200100, // bisl $0,$2 0x34002a02, // lqd $2,0($84) 0x04002983, // ori $3,$83,0 0x3b950102, // rotqby $2,$2,$84 0x35200100, // bisl $0,$2 0x411fc002, // ilhu $2,16256 # 3f80 0x3ec00083, // cwd $3,0($1) 0x40800204, // il $4,4 0x38812a05, // lqx $5,$84,$4 0x18012a06, // a $6,$84,$4 0xb0540103, // shufb $2,$2,$80,$3 0x41200004, // ilhu $4,16384 # 4000 0x3ec10083, // cwd $3,4($1) 0xb0808203, // shufb $4,$4,$2,$3 0x3ec20082, // cwd $2,8($1) 0x41202003, // ilhu $3,16448 # 4040 0x3b818286, // rotqby $6,$5,$6 0x3ec30085, // cwd $5,12($1) 0xb0610182, // shufb $3,$3,$4,$2 0x41204002, // ilhu $2,16512 # 4080 0xb060c105, // shufb $3,$2,$3,$5 0x35200300, // bisl $0,$6 0x38956808, // lqx $8,$80,$85 0x40800c02, // il $2,24 # 18 0x34006804, // 
lqd $4,16($80) 0x18156805, // a $5,$80,$85 0x3400aa06, // lqd $6,32($84) # 20 0x1800aa07, // a $7,$84,$2 0x3880aa02, // lqx $2,$84,$2 0x04002903, // ori $3,$82,0 0x3b814405, // rotqby $5,$8,$5 0x3b940204, // rotqby $4,$4,$80 0x3b950306, // rotqby $6,$6,$84 0x3b81c102, // rotqby $2,$2,$7 0x35200100, // bisl $0,$2 0x18156802, // a $2,$80,$85 0x35802909, // hbr 180 ,$82 0x04002885, // ori $5,$81,0 0x38956806, // lqx $6,$80,$85 0x04002a04, // ori $4,$84,0 0x3fe02803, // shlqbyi $3,$80,0 0x4020007f, // nop $127 0x4020007f, // nop $127 0x3b808306, // rotqby $6,$6,$2 0x1801a906, // a $6,$82,$6 0x35202900, // bisl $0,$82 0x34002a02, // lqd $2,0($84) 0x1c082883, // ai $3,$81,32 # 20 0x3b950102, // rotqby $2,$2,$84 0x35200100, // bisl $0,$2 0x34002a04, // lqd $4,0($84) 0x04002983, // ori $3,$83,0 0x3b950202, // rotqby $2,$4,$84 0x35200100, // bisl $0,$2 0x1c240081, // ai $1,$1,144 # 90 0x40800003, // il $3,0 0x34ffc0d0, // lqd $80,-16($1) 0x34ff80d1, // lqd $81,-32($1) 0x34ff40d2, // lqd $82,-48($1) 0x34ff00d3, // lqd $83,-64($1) 0x34fec0d4, // lqd $84,-80($1) 0x34fe80d5, // lqd $85,-96($1) 0x34004080, // lqd $0,16($1) 0x35000000, // bi $0 0x00200000, // lnop }; #define MAINKERNEL_SIZE 368 uint32_t InternalFunctionTest[] __attribute__((aligned(128))) = { 0x04000283, // ori $3,$5,0 0x34000202, // lqd $2,0($4) 0x24ff40d2, // stqd $82,-48($1) 0x04000252, // ori $82,$4,0 0x24004080, // stqd $0,16($1) 0x24ffc0d0, // stqd $80,-16($1) 0x040002d0, // ori $80,$5,0 0x24ff80d1, // stqd $81,-32($1) 0x1c1802d1, // ai $81,$5,96 # 60 0x3b810102, // rotqby $2,$2,$4 0x24fe8081, // stqd $1,-96($1) 0x1ce80081, // ai $1,$1,-96 0x35200100, // bisl $0,$2 0x34002902, // lqd $2,0($82) 0x04002883, // ori $3,$81,0 0x3b948102, // rotqby $2,$2,$82 0x35200100, // bisl $0,$2 0x3ec00087, // cwd $7,0($1) 0x41200004, // ilhu $4,16384 # 4000 0x3ec10088, // cwd $8,4($1) 0x41204005, // ilhu $5,16512 # 4080 0x3ec20089, // cwd $9,8($1) 0x41208006, // ilhu $6,16640 # 4100 0x3ec3008a, // cwd $10,12($1) 0x4120c002, 
// ilhu $2,16768 # 4180 0xb0740207, // shufb $3,$4,$80,$7 0x4080020b, // il $11,4 0x3882e907, // lqx $7,$82,$11 0x1802e90c, // a $12,$82,$11 0xb060c288, // shufb $3,$5,$3,$8 0xb060c309, // shufb $3,$6,$3,$9 0x3b830384, // rotqby $4,$7,$12 0xb060c10a, // shufb $3,$2,$3,$10 0x04000182, // ori $2,$3,0 0x58c0c183, // fm $3,$3,$3 0x58c08183, // fm $3,$3,$2 0x35200200, // bisl $0,$4 0x34002902, // lqd $2,0($82) 0x1c082803, // ai $3,$80,32 # 20 0x3b948102, // rotqby $2,$2,$82 0x35200100, // bisl $0,$2 0x34002904, // lqd $4,0($82) 0x04002883, // ori $3,$81,0 0x3b948202, // rotqby $2,$4,$82 0x35200100, // bisl $0,$2 0x1c180081, // ai $1,$1,96 # 60 0x34ffc0d0, // lqd $80,-16($1) 0x34ff80d1, // lqd $81,-32($1) 0x34ff40d2, // lqd $82,-48($1) 0x34004080, // lqd $0,16($1) 0x35000000, // bi $0 0x00200000, // lnop }; #define INTERNALFUNCTIONTEST_SIZE 208 uint32_t PrintVector[] __attribute__((aligned(128))) = { 0x04000186, // ori $6,$3,0 0x12000011, // hbrr 2e8 ,0 0x42000002, // ila $2,0 0x330000ca, // brsl $74,2b0 # 2b0 0x04000304, // ori $4,$6,0 0x3f830307, // rotqbyi $7,$6,12 0x42000003, // ila $3,0 0x3f810305, // rotqbyi $5,$6,4 0x77000204, // fesd $4,$4 0x3f820306, // rotqbyi $6,$6,8 0x77000387, // fesd $7,$7 0x77000285, // fesd $5,$5 0x0812814a, // sf $74,$2,$74 0x24ff4081, // stqd $1,-48($1) 0x77000306, // fesd $6,$6 0x1cf40081, // ai $1,$1,-48 0x18128183, // a $3,$3,$74 0x1c0c0081, // ai $1,$1,48 # 30 0x32000000, // br 0 0x00200000, // lnop }; #define PRINTVECTOR_SIZE 80 uint32_t PrintMatrix3[] __attribute__((aligned(128))) = { 0x12000007, // hbrr 30c ,0 0x24ffc0d0, // stqd $80,-16($1) 0x040001d0, // ori $80,$3,0 0x34000183, // lqd $3,0($3) 0x24004080, // stqd $0,16($1) 0x24ff0081, // stqd $1,-64($1) 0x1cf00081, // ai $1,$1,-64 0x33000000, // brsl $0,0 0x34006803, // lqd $3,16($80) 0x33000000, // brsl $0,0 0x1c100081, // ai $1,$1,64 # 40 0x3400a803, // lqd $3,32($80) # 20 0x34004080, // lqd $0,16($1) 0x34ffc0d0, // lqd $80,-16($1) 0x32000000, // br 0 0x00200000, // lnop }; 
#define PRINTMATRIX3_SIZE 64 uint32_t PrintMatrix4[] __attribute__((aligned(128))) = { 0x12000007, // hbrr 34c ,0 0x24ffc0d0, // stqd $80,-16($1) 0x040001d0, // ori $80,$3,0 0x34000183, // lqd $3,0($3) 0x24004080, // stqd $0,16($1) 0x24ff0081, // stqd $1,-64($1) 0x1cf00081, // ai $1,$1,-64 0x33000000, // brsl $0,0 0x34006803, // lqd $3,16($80) 0x33000000, // brsl $0,0 0x3400a803, // lqd $3,32($80) # 20 0x33000000, // brsl $0,0 0x1c100081, // ai $1,$1,64 # 40 0x3400e803, // lqd $3,48($80) # 30 0x34004080, // lqd $0,16($1) 0x34ffc0d0, // lqd $80,-16($1) 0x32000000, // br 0 0x00200000, // lnop }; #define PRINTMATRIX4_SIZE 72 uint32_t PrintFloat[] __attribute__((aligned(128))) = { 0x42000002, // ila $2,0 0x330000ca, // brsl $74,380 # 380 0x77000184, // fesd $4,$3 0x1200000a, // hbrr 3ac ,0 0x42000003, // ila $3,0 0x24ff4081, // stqd $1,-48($1) 0x0812814a, // sf $74,$2,$74 0x1cf40081, // ai $1,$1,-48 0x18128183, // a $3,$3,$74 0x00200000, // lnop 0x1c0c0081, // ai $1,$1,48 # 30 0x4020007f, // nop $127 0x4020007f, // nop $127 0x32000000, // br 0 }; #define PRINTFLOAT_SIZE 56 uint32_t PrintInteger[] __attribute__((aligned(128))) = { 0x42000002, // ila $2,0 0x330000ca, // brsl $74,3b8 # 3b8 0x04000184, // ori $4,$3,0 0x1200000a, // hbrr 3e4 ,0 0x42000003, // ila $3,0 0x24ff4081, // stqd $1,-48($1) 0x1cf40081, // ai $1,$1,-48 0x0812814a, // sf $74,$2,$74 0x1c0c0081, // ai $1,$1,48 # 30 0x00200000, // lnop 0x18128183, // a $3,$3,$74 0x4020007f, // nop $127 0x4020007f, // nop $127 0x32000000, // br 0 }; #define PRINTINTEGER_SIZE 56 uint32_t data_section[] __attribute__((aligned(128))) = { 0x45786563, 0x7574696e, 0x67204675, 0x6e637469, 0x6f6e3b20, 0x20202020, 0x20202020, 0x20200a00, 0x4c656176, 0x696e6720, 0x46756e63, 0x74696f6e, 0x3a202020, 0x20202020, 0x20202020, 0x20200a00, 0x4d61696e, 0x204b6572, 0x6e656c20, 0x20202020, 0x20202020, 0x20202020, 0x20202020, 0x20200a00, 0x54657374, 0x2046756e, 0x6374696f, 0x6e204120, 0x20202020, 0x20202020, 0x20202020, 0x20200a00, 
     0x64756d6d,
     0x7920666f,
     0x72206d61,
     0x696e2020,
     0x20202020,
     0x20202020,
     0x20202020,
     0x20200a00,
    };

    #define DATA_SECTION_SIZE 160

Uploading the Code
------------------

You can DMA data/code from anywhere in PPU memory. Your data/code start
address must be 16-byte aligned and your DMA size a multiple of 16
bytes.

Dynamic fragment loading should be managed by the SPU, not the PPU.
When an SPU module needs a new function, the SPU should initiate the
DMA. The only involvement from the PPU should be to pass the address of
the table of functions to the SPU when it is initialized.

Creating a Roadmap
------------------

(1) Take the time to map out your code fragment sizes, and the size you
    expect your data to be in local store at each point in the process.
(2) Make sure to have everything set up so you can debug your code in a
    controlled environment on an individual basis in case something
    breaks.
(3) Before writing one line of code for your dynamic code loading
    kernel, test your individual code fragments with dummy data. Don't
    make the mistake of waiting until everything is implemented. There
    aren't enough tissues in the world to dry your tears if something
    were to break.
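
Roadmap item (1) is mostly arithmetic, and it interacts with the DMA
size rule from the upload section: some of the sizes in the generated
header (PRINTFLOAT_SIZE is 56, PRINTMATRIX4_SIZE is 72) are not
multiples of 16, so the transfer size has to be rounded up. A minimal
sketch of that bookkeeping follows. The 256K figure and the sizes come
from this document; RoundUpTo16, StageFits and the idea of a single
"resident" byte count (pre-loaded code, common functions, stack) are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LOCAL_STORE_BYTES (256u * 1024u)

/* Sizes copied from the generated header. */
#define MAINKERNEL_SIZE   368
#define PRINTMATRIX4_SIZE  72
#define PRINTFLOAT_SIZE    56

/* Round a byte count up to the next multiple of 16 for DMA. */
uint32_t RoundUpTo16(uint32_t bytes)
{
    return (bytes + 15u) & ~15u;
}

/*
 * Returns 1 if one stage of the process fits in local store.
 * resident_bytes is whatever is assumed to stay loaded the whole time.
 */
int StageFits(uint32_t resident_bytes, uint32_t fragment_bytes, uint32_t data_bytes)
{
    uint32_t total = resident_bytes
                   + RoundUpTo16(fragment_bytes)
                   + RoundUpTo16(data_bytes);
    return total <= LOCAL_STORE_BYTES;
}
```

Running every planned stage through a check like this before writing
the loading kernel is a cheap way to find out early that a stage cannot
fit.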