VLIW Architectures for DSP: A Two-Part Lecture
Berkeley Design Technology, Inc.
www.BDTI.com
Copyright © 1999 Berkeley Design Technology, Inc.

Outline
- Part I: VLIW basics and a case study
  - What's VLIW?
  - Why VLIW?
  - The TMS320C62xx
  - Advantages, disadvantages of VLIW
- Part II: Other VLIW DSP architectures
  - StarCore SC140
  - ADI TigerSHARC
  - Infineon Carmel

Why VLIW?
- Until ~1997, most DSP processors were very similar
  - Specialized execution units
  - Specialized instruction sets
    • Difficult to program in assembly
    • Unfriendly compiler targets
  - One instruction per instruction cycle
- VLIW architectures execute multiple instructions/cycle and use simple, regular instruction sets
  - More parallelism, higher performance
  - Better compiler targets

VLIW vs Superscalar
[Figure: instructions INS 1 ... INS n are fetched from memory, pass through instruction scheduling and dispatch, and are issued over time to the execution units (ALU, MAC, BMU, ...).]

Characteristics of VLIW Processors
- Multiple independent instructions per cycle, packed into single large "instruction word" or "packet"
  - Instructions may be positional, or may include routing information within each sub-instruction
- Large complement of independent execution units
- More regular, orthogonal, RISC-like instructions
  - Usually wider than typical DSP instructions
  - Usually simpler than typical DSP instructions
- Large, uniform register sets
- Wide program and data buses
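
As a rough illustration of the "positional" packing option listed above, the following Python sketch places sub-instructions into a fixed-slot packet and fills unused slots with NOPs. The slot names, the instruction text, and the `pack_packet` helper are hypothetical, chosen only for illustration; they do not describe any real processor's encoding.

```python
# Hypothetical positional VLIW packet: one slot per execution unit.
# Slot names are illustrative assumptions, not a real processor's encoding.
SLOTS = ("ALU0", "ALU1", "MAC0", "MAC1")
NOP = "NOP"


def pack_packet(ops):
    """Place each (unit, text) sub-instruction in the slot for its unit.

    Positional encoding: the slot index alone selects the execution unit,
    so unused slots must be filled with explicit NOPs.
    """
    packet = {slot: NOP for slot in SLOTS}
    for unit, text in ops:
        if unit not in packet:
            raise ValueError(f"no such execution unit: {unit}")
        if packet[unit] != NOP:
            raise ValueError(f"unit {unit} already used in this packet")
        packet[unit] = text
    return [packet[slot] for slot in SLOTS]


packet = pack_packet([("MAC0", "mpy r2, r3, r4"), ("ALU0", "add r0, r1, r0")])
print(packet)
```

In this sketch, a packet that uses only two of four units still occupies four slots. Carrying routing information inside each sub-instruction is the alternative the slide mentions; it lets a packet omit idle units at the cost of extra bits per sub-instruction.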
Example VLIW DSP: The TI TMS320C62xx
- 2 independent data paths, 8 execution units
[Figure: block diagram. On-chip program memory feeds the dispatch unit over a 32x8 = 256-bit path (8 instructions). Units L1, S1, M1, D1 use register file A; units L2, S2, M2, D2 use register file B. Two 32-bit paths connect to on-chip data memory.]
Unit key: L: ALU; S: Shifter, ALU; M: Multiplier; D: Address gen.

FIR Filtering on the 'C62xx
Can execute up to eight 32-bit instructions in parallel

    LOOP:    ADD    .L1   A0,A3,A0
    ||       ADD    .L2   B1,B7,B1
    ||       MPYHL  .M1X  A2,B2,A3
    ||       MPYLH  .M2X  A2,B2,B7
    ||       LDW    .D2   *B4++,B2
    ||       LDW    .D1   *A7--,A2
    || [B0]  ADD    .S2   -1,B0,B0
    || [B0]  B      .S1   LOOP

Compare to a conventional DSP...

    dotprod: MR=MR+MX0*MY0(SS), MX0=DM(I0,M0), MY0=PM(I4,M4);

Advantages of VLIW Architectures
- Increased performance
- Better compiler targets
- Potentially easier to program
- Potentially scalable
  - Can add more execution units, allow more instructions to be packed into the VLIW instruction

Disadvantages of VLIW Architectures
- New kinds of programmer/compiler complexity
  • Programmer (or code-generation tool) must keep track of instruction scheduling
  • Deep pipelines and long latencies can be confusing, may make peak performance elusive
- Increased memory use
  • High program memory bandwidth requirements
- High power consumption
- Misleading MIPS ratings

Benchmark Results: Execution Time on Complex Block FIR
[Figure: bar chart. Y axis: execution time in microseconds, 0 to 45 (lower is faster). Processors: ADSP-2189 (75 MHz), DSP1620 (120 MHz), DSP56311 (150 MHz), '320C549 (120 MHz), TMS320C6202 (250 MHz).]
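
The "misleading MIPS ratings" point can be made concrete with some back-of-the-envelope arithmetic on the FIR loops shown earlier. The sketch below is not a benchmark result. It assumes both inner loops sustain their peak issue rate, and it uses only figures from the slides: the 'C62xx loop issues eight instructions per cycle, two of which are multiplies, at 250 MHz; the conventional DSP loop issues one instruction per cycle, which performs one multiply-accumulate, at 75 MHz. The function names are illustrative.

```python
# Figures taken from the slides: TMS320C6202 at 250 MHz issuing up to 8
# instructions/cycle; ADSP-2189 at 75 MHz issuing 1 instruction/cycle.
# Assumption: both inner loops sustain their peak issue rate.


def peak_mips(clock_mhz, issue_width):
    """Peak instruction rate in millions of instructions per second."""
    return clock_mhz * issue_width


def mmacs(clock_mhz, macs_per_cycle):
    """Multiply-accumulates per second, in millions."""
    return clock_mhz * macs_per_cycle


# 'C62xx FIR loop: 8 instructions per cycle, 2 of them multiplies.
c62_mips = peak_mips(250, 8)
c62_mmacs = mmacs(250, 2)

# Conventional DSP loop: 1 instruction per cycle, which performs 1 MAC.
conv_mips = peak_mips(75, 1)
conv_mmacs = mmacs(75, 1)

print(f"MIPS ratio:  {c62_mips / conv_mips:.1f}x")   # 26.7x
print(f"MMACS ratio: {c62_mmacs / conv_mmacs:.1f}x")  # 6.7x
```

Under these assumptions the peak MIPS ratio is about four times larger than the ratio of multiply-accumulate throughput, because each simple VLIW instruction does less work than one compound instruction on the conventional DSP.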
Benchmark Results: Memory Usage on FSM Benchmark
[Figure: bar chart. Y axis: bytes, 0 to 200 (lower is better). Processors: ADSP-218x, DSP16xx, DSP563xx, '320C54x, TMS320C62xx.]

For More Information...
Free resources on BDTI's web site, www.bdti.com
- DSP Processors Hit the Mainstream covers DSP architectural basics and new developments. Originally printed in IEEE Computer Magazine.
- Evaluating DSP Processor Performance, a white paper from BDTI.
- Numerous other BDTI article reprints, slides
- comp.dsp FAQ

Outline
- Part I: VLIW basics and a case study
  - What's VLIW?
  - Why VLIW?
  - The TMS320C62xx
  - Advantages, disadvantages of VLIW
- Part II: Other VLIW DSP architectures
  - StarCore SC140
  - ADI TigerSHARC
  - Infineon Carmel

StarCore SC140
- 16-bit fixed-point VLIW DSP core from Lucent/Motorola
- StarCore claims it's a scalable architecture
  - First VLIW machine to target low-power apps
- More execution units (13) than 'C62xx (8), but fewer instructions can be issued per cycle
  - Six for SC140 vs eight for 'C62xx
[Figure: execution units: four MAC, one BMU, four ALU, four BFU.]

StarCore SC140 (continued)
- Uses 16-bit instructions with optional 16-bit prefixes
  - Should have pretty good code density, better than 'C62xx ('C62xx uses fixed-width 32-bit instructions)
- Pipeline relatively simple and shallow (5 stages)
- Targeting 198 mW @ 300 MHz, 1.5 V
- Development chip expected late '99
- Lucent and Motorola will each create chips using the SC140 core
  - Motorola's MSC8101 sampling 1H00
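
To put the SC140 power target in per-cycle terms, the short sketch below divides the targeted power by the targeted clock rate. It assumes the 198 mW figure is average power at the full 300 MHz clock and that all six issue slots are filled every cycle; both inputs are projections from the slides, and the function name is illustrative.

```python
def energy_per_cycle_nj(power_mw, clock_mhz):
    """Average energy per clock cycle in nanojoules: E = P / f."""
    # mW / MHz = (1e-3 W) / (1e6 Hz) = 1e-9 J = 1 nJ
    return power_mw / clock_mhz


# Projected SC140 targets from the slides: 198 mW at 300 MHz.
sc140 = energy_per_cycle_nj(198, 300)
print(f"{sc140:.2f} nJ per cycle")

# Assumption: all six issue slots are filled every cycle (peak issue rate).
print(f"{sc140 / 6:.2f} nJ per issued instruction")
```

The per-instruction figure is a best case; any cycle that issues fewer than six instructions raises the energy per instruction.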
ADI TigerSHARC
- 8-, 16-, 32-bit fixed-point and 32-bit floating-point
  - Unusual data-type agility
- Combines VLIW with extensive SIMD (single instruction, multiple data) to get massive parallelism
  - Using SIMD, can perform eight 16x16-bit fixed-point multiplications per cycle (4X the 'C62xx)
[Figure: one SIMD multiply instruction drives two compute blocks, each with an ALU, a MAC, and a shifter; each block performs four 16-bit multiplies.]

ADI TigerSHARC (continued)
- "Hierarchical" SIMD is unusual
- Requires high on-chip data memory bandwidth
  - Sixteen 16-bit data words/cycle
- Issues and executes up to four instructions per cycle
- Uses 32-bit instructions, like 'C62xx
  - May have high program memory use
  - Memory use may also be increased by data- and algorithm-rearrangement needed for use of SIMD
- Targeting 250 MHz, expected to begin sampling in late 1999

Infineon Carmel
- 16-bit fixed-point VLIW DSP core from Infineon (Siemens)
  - In silicon at 120 MHz, 0.25 µm (development chip)
- Two data paths, six execution units
[Figure: execution units: ALU, MAC, EXP, Shift, ALU, MAC.]
- Mixed-width 24/48-bit instruction set
- Can execute in parallel:
  - One 48-bit instruction, or
  - One or two 24-bit instructions, or
  - Up to six instructions as part of a "CLIW"

Infineon Carmel: CLIW (Configurable Long Instruction Word)
General format:

    cliw name (operand1, ... , operand 4)
    { ALU1 || MAC1 || ALU2 || MAC2 || MOV1 || MOV2 }

Example CLIW:

    cliw fft4(r0+=rn0, r1, r4, r5)
    {
           a2 = a1l * a0h
        || *ma1 = ff1 + ff2
        || *ma2 = a2 - a1h * a0h
        || a1h = *ma3 - *ma4
        || ff1 = *ma3
        || ff2 = *ma4;
    }
Comparison

| Processor   | Issue width | Data memory bandwidth (16-bit words) | Instruction size           | Clock (MHz) | Pipeline depth | Notable characteristics            |
|-------------|-------------|--------------------------------------|----------------------------|-------------|----------------|------------------------------------|
| TMS320C62xx | 8           | 4 words/cycle                        | 32 bits                    | 250         | 11             | 1st VLIW-based DSP processor       |
| SC140       | 6           | 8 words/cycle                        | 16 bits w/ 16-bit prefixes | 300*        | 5              | Scalable, approach to compact code |
| TigerSHARC  | 4           | 16 words/cycle                       | 32 bits                    | 250*        | 8              | SIMD + VLIW, data type agility     |
| Carmel      | 2, 6        | 4 words/cycle                        | 24/48 bits                 | 120         | 8              | CLIW instructions, 4 AGUs          |

*Projected

Benchmark Results: Execution Time on Complex Block FIR
[Figure: bar chart. Y axis: 0 to 12. Processors: TMS320C6202 (250 MHz), Carmel (120 MHz), SC140 (300 MHz). The chart carries a "projected" label.]

For More Information...
Free resources on BDTI's web site, http://www.BDTI.com
- DSP Processors Hit the Mainstream covers DSP architectural basics and new developments. Originally printed in IEEE Computer Magazine.
- Evaluating DSP Processor Performance, a white paper from BDTI.
- Numerous other BDTI article reprints, slides
- comp.dsp FAQ