VLIW Processors – Lecture 18 •P. Faraboschi, G. Desoli, J. Fisher, “The latest word in Digital and Media Processing,” IEEE Signal Processing Magazine, March 1998, pg. 59-85, Ingrid Verbauwhede Department of Electrical Engineering University of California Los Angeles ingrid@ee.ucla.edu 1 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 BUT: DSP Software Development • Complex DSP architecture not amenable to compiler technology • Algorithms are modeled in high level language (e.g. C++) • Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support HLL algorithmic hand coded assembler model optimize & debug prototype production code code Long, frustrating time to market Fragile legacy code Still used in handhelds, but change in basestations, Part II 2 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 1 Processor style CISC (Complex Instruction Set) RISC (Reduced Instruction Set) VLIW (Very Long Instruction word) • extension of RISC Ultimate goal:make the processor “compiler-friendly” CISC DSP’s: • hard for the compiler to detect the CISC instructions • compilers don’t like special registers (Accumulators, T-reg, address registers, etc.) • but extremely efficient in terms of power & performance ILP = Instruction level parallelism • how much is available in the “code” • how much is achievable on the hardware 3 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 ILP Hardware Data path units in parallel: • same hardware modules (SIMD) • different hardware modules (Superscalar or VLIW) Superscalar: • hardware detects available parallelism in the Instructions and rearranges the RISC like operations. • very complex hardware controller • unpredictable execution time VLIW: compiler decides the alignment of the RISC like instructions • might fill it up with NOP’s • DSP applications well suited for this because lots of inherent parallelism. • and because flow of operations known “at-compile-time” (recall the “synchronous data flow graph” lecture) 4 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 2 Support by compiler Key issue: • Detect available parallelism from the sequential software program Sequential: t3 = t2 + 1; store (add0) = t3; f6 = f7 * f14; t1 = p2 + p7; t5 = p2 + p10; t3 = f6 – t1; .. . Parallel: reorder the instructions, keep the data dependencies intact t3 = t2+1; f6 = f7*f14; t1 = p2 + p7; t5 = p2 + p10; store (add0) = t3; t3 = f6 – t1; 5 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Software pipelining: For tight loops: Original code: inner loop, 2 cycles per iteration (I: 1 to N, step 1) begin a[I] = c[I] * x[N-I-1]; s[I] = a[I] + s[I-1]; end; Software pipelining: Inner loop: 1 cycle per iteration, 2 operations in parallel a[0] = c[0] * x[N]; (I: 1 to N-1, step 1) begin s[I] = a[I] + s[I-1]; a[I+1] = c[I+1] * x[N-I]; end; s[N] = a[N] + s[N-1]; See also presentation by F. Catthoor, Monday June 3, 3pm! 6 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 3 VLIW architectures for DSP 5 architecture features: • ILP (SIMD versus VLIW) • memory hierarchy • register organization • branch support • code size 7 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Compiler Driven VLIW Instruction format: cond/branch ex1 ex2 ex3 ….. exn Data memory Register Array Interconnect ex1 (alu) ex2 (alu) ex3 (mpy) ex4 (ld/st) exn (ld/st) Large orthogonal register set, regular interconnect Atomic RISC-like operations => heavily pipelined, high freq. clock 8 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 4 Explicitly Parallel Instruction Computing Execution Clusters Data memory Register Array Register Array Interconnect ex1 (alu) ex2 (alu) Interconnect ex3 (ld/st) ex4 (alu) ex5 (mpy) ex6 (ld/st) 9 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 ILP: SIMD SIMD in general purpose processors: • cut larger container in smaller pieces e.g. 64 bit word = 8 byte operations adjust ALU for that (no overflow between bytes) e.g. Ueda’s presentation for Viterbi acceleration In GPP: called “subword parallelism” or multimedia extensions e.g. MMX extensions to Intel processors Compiler problem: • rewrite code with special instructions • use special libraries 10 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 5 SIMD code rewriting: Original code: (I: 1 to N, step 1) begin s[I] = a[I] + b[I]; end; Assume 32 bit operator cut in 4 pieces: (I: 1 to N/4, step 1) begin t1 = packed4_load a[I]; t2 = packed4_load b[I]; t3 = add4 t1, t2; packed4_store s[I]; end; Requires: • rewriting the code • alignment of data in memory • size is divisible by 4 • architecture dependent (what if other processor has add6?) 11 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 VLIW parallelism Original code: (I: 1 to N, step 1) begin s[I] = a[I] + b[I]; end; Loop unrolling 4 times (compiler can do this): (I: 1 to N, step 4) begin s[I] = a[I]+ b[I]; s[I+1] = a[I+1]+ b[I+1]; in parallel: s[I+2] = a[I+2]+ b[I+2]; s[I+3] = a[I+3]+ b[I+3]; end; Requires: • 4 times bigger code size • 4 execution units (each might have too large wordlength) • 4 times as many registers • flexible combination of primitive operations • no rewriting of the code 12 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 6 Register file architecture N unit VLIW: • 2N input ports, N output ports • would require 3N port register file • if N=8, would mean 24 ports! • area grows with N^^2 • delay grows with N • Today: around 15 ports register files • therefore execution clustering Schedule extra register copy operations: • done by hardware (invisible to the programmer) unpredictable execution time • done by compiler: • needs to cluster operations • create Direct acyclic graph + cluster algorithms Mix subword parallelism in VLIW: tricky 13 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Memory architecture Data memory access patterns: • spatial locality but no temporal locality (pixel streams, sample streams) • temporal locality but no spatial locality (small look-up tables often reused) Caches don’t work • unpredictable worst case execution time (worst case becomes too worst case) Therefore local memory (separate memory space) How to tell the compiler: • “pragma’s” or programmer directions: # pragma local_memory int local_memory_array[1000]; 14 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 7 Prefetch buffer • Technique to avoid memory accesses to be the bottleneck for throughput driven stream based applications (such as a stream of pixels, samples, etc.) • Bypass the cache • Special prefetch instructions: • at loop level: separate loop to fill the prefetch buffer • at interation level: (kind of similar to software pipelining) fetch the samples for instructions N iterations ahead (N is called the prefetch_stride) • at instruction level: prefetch for the current loop iteration but do it in the beginning of the loop. 15 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Branch architecture 16 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 8 Code size 17 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Applications & coding 18 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 9 Texas Instruments ‘C6201 Program Memory (16K x 32) 256 Instruction Dispatch & Decode Register Bank A (16 x 32) ALU shift mpy Register Bank B (16 x 32) add ALU shift mpy add Data Memory (32K x 16) 8-way VLIW with two execution clusters 256 bit (8x32) instruction fetch with variable length execute set Each 32 bit instruction individually predicated 11 stage pipeline 1600 MIPS, 400 MMACs @ 200 MHz 19 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 FIR Filter on TI ‘C6x Hand-coded assembly: 32-tap FIR filter loop: || ||[b0] ||[b0] || || || || ldw ldw sub b mpy mpyh add add .d1t1 .d2t2 .s2 .s1 .m1x .m2x .l1 .l2 *a4++,a5 *b4++,b5 b0,1,b0 loop a5,b5,a6 a5,b5,b6 a7,a6,a7 b7,b6,b7 • Outer Loop: 23 cycles, 180 bytes • All 8 exec units used in inner loop - maximum efficiency – 1 cycle in inner loop – 2 MACs per cycle Assembly syntax more difficult to learn. Hard to get full use of all 8 execution units at once. Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size). Courtesy: Gareth Hughes: Bell Labs Australia 20 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 10 Viterbi on TI ‘C6x • 16-state Viterbi decoder for GSM 3-cycle 2-ACS Inner-Loop LOOP: [b1] ||[b1] ||[!a2] ||[!a2] || || || b sub sth add cmpgt cmpgt mpy .s1 .s2 .d1 .d2 .l1 .l2 .m1x LOOP b1,1,b1 b12,*+a6[8] b0,b14,b14 a11,a10,a1 b11,b10,b0 1,b5,a4 [a2] ||[!a2] ||[a1] ||[b0] || || || sub sth add mpy mpy sub ldh .s1 .d1 .s2 .m2 .m1 .l2x .d2 a2,1,a2 a12,*a6++ 2,b0,b0 1,b11,b12 1,a10,a12 a7,b5,b10 *++b9,b5 shl ||[a1] mpy || add || sub || add || mpy || ldh || ldh ; end of LOOP .s2 .m1 .s1 .l1x .l2 .m2 .d2 .d1 from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm – 3 cycles per butterfly – 32 cycles per GSM timeslot (8 butterflies) x8 – Cycle 0 .D1 STH new 8 LDH old0 .D2 ADD tr LDH old1 MPY mj *MPY b0 MPY a0 MPY a8 *MPY b8 .M1 .M2 b14,2,b14 1,a11,a12 a7,a4,a10 b13,a4,a11 b13,b5,b11 1,b10,b12 *b4++[2],a7 *a5++[2],b13 2 MPY instructions used to move data 4 6 STH new 0 STH new 8 LDH mj MPY mj *MPY b0 MPY a0 MPY a8 *MPY b8 CMPGT t0 .L2 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 .S1 B JLOOP ADD a0 SUB k B JLOOP ADD a0 .S2 SUB j SHL tr *ADD t0,t8 SUB j SHL tr 1 3 7 9 .D1 .D2 STH new 0 STH new 8 CMPGT t0 5 LDH old0 10 12 LDH mj SUB b0 LDH old0 LDH old1 MPY mj *MPY b0 MPY a0 MPY a8 *MPY b8 CMPGT t0 SUB b0 SUB a8 CMPGT t8 ADD b8 SUB k B JLOOP ADD a0 *ADD t0,t8 SUB j SHL tr 13 15 LDH old0 16 STH new 0 STH new 8 18 STH new 0 STH new 8 ADD tr 11 STH new 0 STH new 8 14 STH new 0 STH new 8 LDH old1 .L1 Cycle SUB b0 8 LDH old0 ADD tr LDH mj 20 LDH old0 22 LDH old1 LDH mj MPY mj *MPY b0 MPY a0 MPY a8 *MPY b8 CMPGT t0 SUB b0 SUB a8 CMPGT t8 ADD b8 SUB a8 SUB k B JLOOP ADD a0 SUB k *ADD t0,t8 SUB j SHL tr *ADD t0,t8 19 21 17 LDH old0 STH new 0 STH new 8 26 28 30 LDH sd1 STH m[2] STH m[3] SUB m LDH sd0 STH m[5] STH m[4] ADD m0 SUB -m0 SUB old SUB -m1 SUB m1 ADD tr B JLOOP SUB I MVK j 23 25 27 29 31 LDH old0 STH new 0 STH m[0] STH m[1] LDH old1 STH trans STH m[1] STH m[6] LDH old0 ADD old ADD SP LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 LDH mj ADD tr LDH old1 .M1 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 MPY a0 MPY mj *MPY b0 .M2 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 MPY a8 *MPY b8 .L1 24 STH new 0 STH new 8 ADD tr MPY a8 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 CMPGT t0 SUB b0 .L2 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 SUB a8 CMPGT t8 ADD b8 .S1 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 SUB k B JLOOP ADD a0 .S2 *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr *ADD t0,t8 SUB j SHL tr MPY mj SUB new Utilization of execution units in Viterbi decoder MVK k B JLOOP 21 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Lucent / Motorola Star*Core SC140 Program / Data Memory Program Sequencer Instruction Dispatcher Data Registers (16) Address Registers (27) MAC MAC MAC MAC AAU AAU ALU ALU ALU ALU BFU BFU BFU BFU 6-way VLIW with 128 bit (8x16) instruction fetch Prefix instructions for high performance without sacrificing code density Each execution set (parallel instructions + prefix) predicated 5 stage pipeline 1800 MIPS, 1200 MMACs @ 300 MHz 22 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 11 Viterbi on Star*Core GSM (K=5, 16 states) [ move.2l (r0)+,d0:d1 move.2l (r1)+,d1:d2 [ add2 d0,d4 sub2 d6,d2 sub2 d4,d0 add2 d2,d6 [ max2vit d4,d2 max2vit d0,d6 [ vsl.4w d2:d6:d1:d3,(r2)+n0 vsl.4f d2:d6:d1:d3,(r3)+n0 • ] ] ] x4 ] Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction: max2vit d4,d2 max2vit d0,d6 SR • Hardware support for Viterbi algorithm: – max2vit instruction. – vsl instruction • 1 cycle per butterfly through software-pipelining vsl.4w d2:d6:d1:d3,(r2)+n0 D1 decisions D3 decisions D2 path metrics D6 path metrics Results written to memory Courtesy: Gareth Hughes: Bell Labs Australia 23 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 Parallel DSP Architectures Arch. Parallelism Compile? Power ? ûû S/scalar Dynamic instruction level VLIW Static instruction level SIMD Highly regular, data dependent MIMD Task level û ûû û MIMD with VLIW / SIMD provides high order parallel execution The future of high performance DSPs is MIMD 24 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 12 2G Basestation Baseband Processing • Multiple DSPs used for baseband processing. • RISC Microcontroller for timing, framing, I/O control • Software upgradable over the network • DSPs dominate cost and power consumption Channel Equalization Tx Rx Tx Rx Channel De/coding Encryption RISC Micro AFE DSP DSP RAM DSP DSP DSP Controller AFE DSP DSP RAM DSP DSP DSP Future trend - integrate baseband processing low cost Pico BTS I/O I/O T1/E1 I/O I/O Tx/Rx baseband processing board for 2-carrier GSM basestation ASIC 25 EE213A, Spring 2000, Ingrid Verbauwhede, UCLA, Lecture 18 13