Computer Science MPhil dissertation (Computer Science PhD program, Microelectronics Option), presented by David Castells i Rufas and supervised by Jordi Carrabina i Bordoll. Bellaterra, September 2008.

Jordi Carrabina i Bordoll, lecturer in the Microelectronics and Electronics Systems Department of the Universitat Autònoma de Barcelona, CERTIFIES that the present research has been carried out under his direction by David Castells i Rufas as experimental work within the Computer Science PhD program, Microelectronics option, offered by the Universitat Autònoma de Barcelona. Bellaterra, September 2008. Jordi Carrabina i Bordoll.

This MPhil dissertation presents a new verification system for FPGA-based designs described in the JHDL hardware description language. The method consists of performing hardware emulation of designer-selected blocks within a co-simulation environment. Although JHDL already has a hardware execution mode, it does not provide fine-grained control over which blocks are executed in hardware, and it relies on Xilinx readback technology. In this work the simulation environment is extended to control the hardware emulation system, instrument the design for debugging, and automatically create the interface that communicates the simulator with the emulated hardware block. The resulting system does not offer 100% observability and controllability of hardware blocks. Nevertheless, its interactivity provides a solid basis for incremental verification while offering the possibility of substantial simulation speedups.

I would like to thank the many people who have helped me, in very different ways, to finish this never-ending story. My gratitude to all those I mention and to the ones I probably forget.
To my wife Mar Soldevila for the endless hours, her infinite patience, support and love. To my daughter Anna for contributing some of the published photos, for not erasing my notebook's entire hard disk in the many opportunities she had, and of course for being such a lovely kid. To my newborn daughter Carla: I wish you may see your father get his PhD before you get married. To my newborn daughter Marta, who allowed us to dream of a wonderful life, a dream that vanished too early. To Jordi Carrabina for his guidance on this work and for giving me the chance to enjoy an academic lifestyle after so many years on the corporate front line. To Eloi Ramon, Lluis Ribas, Toni Portero, Quim Saiz, and Lluís Terés for sharing their knowledge and for always being open to academic discussion. To Brent Nelson and Francky Cathoor for giving me valuable feedback on the topics of my research. To Jaume Joven for his optimism and for being such an easy person to work with. To Sergi Risueño, Eduard Fernàndez, Jorge Luis Zapata, and Juan Carlos Chak for listening to me and, from time to time, letting me think I can teach them something. To Aitor Rodriguez, Eric Teruel, and Pablo Romàn for helping me so much in my everyday work and for still being kind enough to return a smile even when I was in a bad mood (probably too often). To Oscar Navas, David Novo, Martí Bonamusa, Josep Mesado, Jordi Escrig, and Jordi Farré, for being great people to have around. To Alexis Morugó for his resolve to complete his project. To Borja Martinez for revealing to me some mysteries of Quartus. To Enric Pons for enduring the use of some of the first results of my tools. To my brother Enric Castells for awakening my curiosity during my childhood with constant challenges and puzzles, especially by hiding things to keep me from finding his VIC20 computer and electronic kits. To Toni Ubieto and Pere Joan Cardona for, without knowing it, showing me the little satisfactions of research, in its literal meaning. To my grandfather Esteve Rufas, who passed away recently, for always starting something that would never end and being too stubborn to ever admit it. To my parents, for their indispensable financial support during my youth and, of course, for giving me the most important thing in the world: life.

This work is a monograph which contains some unpublished material, but it is mainly based on the following publications. Copyright of the previously published material is owned by the copyright holders of the following publications.

[Castells04] D. Castells, M. Monton, R. Pla, D. Novo, A. Portero, O. Navas, J. Farré, L. Ribas, J. Carrabina, "Comparing Design Flows for Structural System Level Specifications facing FPGA Platforms", DCIS 2004.

[Castells04b] D. Castells-Rufas, J. Farré-Capel, J. Carrabina, "Experimentación con el lenguaje JHDL", in Proceedings of IV Jornadas de Computación Reconfigurable y Aplicaciones (JCRA), Barcelona, September 2004.

[Castells05] D. Castells-Rufas, E. Pons, J. Carrabina, "Implementación de un sistema OCR en FPGA", in Proceedings of V Jornadas de Computación Reconfigurable y Aplicaciones (JCRA),
Granada, September 2005.

[Castells06] D. Castells-Rufas, J. Carrabina, "Camera-Based Digit Recognition System", 13th International Conference on Electronics, Circuits and Systems (ICECS 2006), Nice, France, December 10-13, 2006.

[Castells06b] D. Castells i Rufas, A. Morugó, J. Carrabina, "Traducción automática de JHDL a VHDL", VI Jornadas sobre Computación Reconfigurable y Aplicaciones (JCRA 2006), Cáceres, Spain, September 12-14, 2006.

[Castells07] D. Castells-Rufas, J. Carrabina, "Jumble: A Hardware-in-the-Loop Simulation System for JHDL", IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, USA, April 23-25, 2007.

Acronyms and abbreviations:

AMR    Automatic Meter Reading
ASIC   Application Specific Integrated Circuit
CBSE   Cycle-Based Simulation Engine
CCM    Custom Computing Machine
CDFG   Control Data Flow Graph
CLB    Configurable Logic Block
COTS   Commercial Off-The-Shelf
DCT    Discrete Cosine Transform
DE     Driving Environment
DFG    Data Flow Graph
DSP    Digital Signal Processing / Digital Signal Processor
DUT    Device Under Test
EDA    Electronic Design Automation
EIA    Electronic Industries Alliance
ESL    Electronic System Level
FCCM   FPGA-based Custom Computing Machine or Field-Programmable Custom Computing Machine
FPGA   Field Programmable Gate Array
FPLD   Field Programmable Logic Device
FSM    Finite State Machine
FSMD   Finite State Machine with Data-path
GPP    General Purpose Processor
HDL    Hardware Description Language
HIL    Hardware In the Loop
HLL    High Level Language
IC     Integrated Circuit
IDCT   Inverse Discrete Cosine Transform
IP     Intellectual Property
ISA    Instruction Set Architecture
ISS    Instruction Set Simulator
LE     Logic Element
LUT    Look-Up Table
MAC    Multiply Accumulate
MPSoC  MultiProcessor System on Chip
NoC    Network on Chip
NRE    Non-Recurrent Engineering
PCB    Printed Circuit Board
PLA    Programmable Logic Array
PLD    Programmable Logic Device
PLI    Programming Language Interface
PLL    Phase Locked Loop
RTL    Register Transfer Level
SRAM   Static Random Access Memory
SoC    System on Chip
TLM    Transaction Level Modeling
VHDL   VHSIC Hardware Description Language
VHSIC  Very High Speed Integrated Circuit

Programmable logic devices are becoming essential platforms to prototype hardware-software solutions for commercial electronic systems, and sometimes even a good alternative to implement them. The reconfiguration capabilities of these devices offer great advantages over ASICs, as they reduce the risks of possible design errors. The importance of flexibility increases with every passing day.
Nowadays the non-recurrent engineering costs in ASIC design are about one million dollars. Every unexpected additional iteration in the design cycle means a potentially null profit or a significant loss of benefit. General Purpose Processors (GPP) and Digital Signal Processors (DSP) offer a great deal of flexibility and are broadly used, but they have limited hardware resources and are less energy efficient. On the other hand, FPGAs allow designing specific hardware to maximize parallelism, offering better performance at a competitive price with less energy consumption.

The roles of programmable logic devices have been increasing with time (Figure 1). The first devices were based on AND-OR planes, which were able to implement any combinational function and were used to simplify the connectivity of electronic systems (glue logic). Before their introduction, the area of printed circuit boards (PCBs) was dominated by circuits to interconnect the main operational circuits. Programmable devices assumed simple functions, like the ones offered by the TTL74 family, so that the area devoted to glue logic was greatly reduced and consequently cost was reduced as well. The steady increase of integration capacity and the technology change to SRAM-based FPGAs in 1985 drove their use for ASIC prototyping.

Figure 1. Timeline of programmable logic devices from the 1970s to the 2000s and their successive roles: glue logic, ASIC prototyping, co-processing, DSP and SoC.

In the early 90s people started to see FPGA devices as computing resources rather than as flexible interconnection systems. This led to their use as custom reconfigurable co-processors that could help to surpass the limitations of CPUs with custom computing units. The co-processor contribution to an application speedup is derived from Amdahl's law (1) [Amdahl67], in which α is the fraction of the application implemented in hardware:

\[
  \text{Speedup} = \frac{1}{(1-\alpha) + \dfrac{\alpha}{S_{coprocessor}}} \qquad (1)
\]

So, to get a significant reduction in execution time, a large fraction of the application must be accelerated [Edwards97], and this is not always possible. In addition, the coprocessor speedup factor is defined by (2) and, since FPGA-based co-processors are usually relatively far from the CPU, the overhead in communication often burdens the performance of the system [Benitez04]:

\[
  S_{coprocessor} = \frac{T_{software}}{T_{hardware} + T_{communication}} \qquad (2)
\]

Simultaneously, and thanks to the dynamic reconfiguration ability of some SRAM-based devices, it was feasible to think of computing machines with reconfigurable functional units. The concept was referred to as Custom Computing Machines (CCM) or Field-Programmable Custom Computing Machines (FCCM) [Sima00]. The difference between coprocessors and CCMs was that the former addressed a single function while the latter were designed to exploit reconfiguration to adapt better to different application scenarios. As FPGA devices are reconfigurable in essence, in practice the difference between CCMs and FPGA coprocessors has diluted over time, and both terms are often used to express the same concept.
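As a brief numeric illustration of equations (1) and (2), consider the following figures, which are assumed values chosen only for this example and are not measurements from this work:

% Assumed, for illustration only: 80% of the application runs in hardware (alpha = 0.8)
% and the coprocessor executes that fraction 10 times faster (S_coprocessor = 10).
\[
  S_{coprocessor} = 10, \quad \alpha = 0.8 \;\Rightarrow\;
  \text{Speedup} = \frac{1}{(1-0.8) + \frac{0.8}{10}} = \frac{1}{0.28} \approx 3.6
\]

Even with a tenfold coprocessor acceleration, the overall gain stays below 4, because the 20% of the application that remains in software dominates the execution time; this is why both a large accelerated fraction α and a low communication overhead are needed to obtain significant speedups.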
During the 90s, reconfigurable platforms were experimentally used for signal processing applications. In the late 90s, FPGA manufacturers introduced specific signal processing circuitry, like Phase Locked Loops (PLL) to enable multiple clock domains and Multiply-Accumulate (MAC) modules, starting a battle with the ASIC and DSP manufacturers that were the dominant players in that arena [Tessier01]. In the early 2000s, the integration capacity had increased enough to embed full microprocessors inside the FPGA device, either as a normal microprocessor sharing part of the silicon area of the FPGA (hard-core processor) or as an Intellectual Property block mapped onto the device (soft-core processor). The new "Intellectual Property" (IP) concept was to hardware what software components were to software: it should enable the flourishing of a market of resources ready to use in any new design. The combination of various IPs, including microprocessors, peripherals and buses, together with their programming environments, allowed the design of Systems on Chip (SoC) on an FPGA. To summarize, today reconfigurable systems exploit the tradeoff between flexibility and performance in their various roles as glue logic, ASIC prototyping, co-processing, DSP and SoCs.

FPGA applications are designed using a combination of several tools. The design flows depend on the design language and the EDA tool chain. Hardware description languages like VHDL, Verilog and AHDL share a similar design flow, as shown in Figure 2. The designer usually receives a specification in the form of a requirement list, which must be transformed into HDL source code. This first deliverable can be validated with functional simulation tools. This adds the need to develop additional test code, called test-benches, to generate stimuli for the circuit under test. After validation, the synthesis process translates the HDL language definitions into hierarchical definitions of the circuit structure based on basic device primitives (ands, ors, multiplexers, flip-flops, ...). There usually exist innumerable circuit structures that can implement the same function defined as HDL code; mapping HDL onto a given hardware structure is an NP-hard problem.

Figure 2. Typical HDL design flow: Specification, Coding, HDL Program, Functional Simulation, Synthesis, Netlist, Place & Route, Gate-level Simulation, Bitstream, Configuration and Execution, with validation feedback at each stage.

The result of synthesis, a circuit netlist, can be simulated with gate-level simulators, which can use detailed information about timing and other physical properties of the particular device primitives, like power consumption. Although this is possible, they are seldom used for large designs because of the long simulation times they take. The next step is to place the device primitives onto the actual resources of the FPGA and define the interconnection between the logic elements. This function is performed by Place & Route tools, which are usually provided by the device vendors because of the amount of technology information they need. Finally, the bit-stream produced by the Place & Route tools is downloaded into the device for its execution.

FPGA device manufacturers are the major providers of FPGA design flows. Since Place & Route is so technology dependent and synthesis algorithms are quite mature, there is little competition in the synthesis tools market. Moreover, FPGA device manufacturers try to offer design flows as a single tool and, although allowing it, do not encourage the decoupling of the process.
EDA tools that are based on higher abstraction level languages have few reasons to provide an equivalent synthesis step. Instead, they usually translate high-level descriptions into HDL descriptions that can be fed into an HDL design flow, as shown in Figure 3. Popular examples are SystemC compilers (like Forte Cynthesizer), which produce RTL VHDL, and model-based design tools built on MATLAB Simulink, like Xilinx System Generator [Hwang01] and Altera DSP Builder [Altera05].

Figure 3. High-level design flow: the HLL program is behaviorally synthesized into an HDL program, which then follows the standard HDL flow (functional simulation, synthesis, netlist, place & route, gate-level simulation, bitstream, configuration and execution).

Besides working at a higher level of abstraction, which is easier for human understanding, HLL design flows offer further benefits: as the code is more abstract it should also be shorter and, as a consequence, a simulator working at this level should take less time to execute than its equivalent HDL simulator. In addition, the speedup in simulation is greater when cycle accuracy is not needed and one can work at the TLM or ISA levels (Figure 4).

Speeding up simulation is very important. Simulation is the most usual way of performing the verification of a digital circuit design and is usually a very time-consuming task. In most projects, the time spent on design is exceeded by the time spent on verification ([Hunt02], [Molina07]). In the late 90s there was the widespread idea that the productivity of design teams was not following Moore's Law and that, as a consequence, there was a growing gap between chip capacity and design productivity. Sematech concluded that the number of transistors per chip was increasing by a factor of 58% per year, while the productivity of designers, measured in transistors per month produced by a design team, was increasing by a factor of 21%. This means that, although chips with many more transistors are available, designing a new chip of the same relative size takes more time than before. To address the problem and bridge the gap, some industry and research groups encourage the re-use of components and design at higher levels of abstraction.

However, some argued that this is not the case and that the gap has not been continuously increasing.
As memory is a so simple design, its area occupancy expansion adds little design effort in the development process, and productivity measured in transistors per month is very easily boosted. Furthermore, having more memory on-chip instead of having it off-chip provides some additional benefits because it is usually faster and more energy efficient. 0 1 2 $ # %3 (()* Nevertheless, there is little discussion about the fact that the complexity of future chips will increase drastically. While techniques like code reuse can reduce a lot the coding effort, they cannot eliminate the testing effort. In fact, the ratio between testing and coding effort keeps increasing steadily and by now testing is the major contribution to the overall development time (Figure 8). 45 $ # %6 (0)* This situation is even worse when various levels of abstraction are mixed. As explained in [Hemani04] this would happen when a HLL based design is reusing HDL blocks following the platform-based design hype. A full HLL design can be quickly verified at HLL level and synthesized assuming a correct-by-construction approach. But including a low level block, forces to include HDL verification tools in the development process, that slows it down while adding complexity in the synthesis step because of the interconnection of systems at various levels. Hemani encourages fully adopting more abstract levels of design and improving synthesis capabilities to avoid getting stuck in the current transient productivity gap. However, to my knowledge, this is happening very slowly and HDL design flows are still in very good shape. In this context, functional or logic simulation is still the main method to verify system correctness. As shown in Figure 4, RTL logic simulation offers cycle accuracy but its low speed slows-down the development process. Although hardware emulation based on FPGAs has been commercially available for about a decade and would provide a significant speedup in verification, it is seldom integrated in HDL design flows provided by FPGA vendors. Model based design tools such as Matlab/Simulink, Xilinx System GeneratorTM [Xilinx00] and Altera DSP BuilderTM [Altera05] have successfully revamped hardware emulation for the DSP domain with the concept of hardware-in-the-loop (HIL) simulation and proved that it can be a convenient and easy technology. 0 The motivation of this work is to show that emulation can also be integrated easily in classic HDL design flows so that verification time can be greatly reduced and so productivity increased. ) * ( I propose a method based on a developed tool, named Jumble, based on JHDL that integrates hardware emulation into the design flow following a simple approach. The principal idea is to allow designers to work in an interactive simulation environment from which they can select any block of the circuit hierarchy and instruct the tool to transparently download it into a supported hardware platform for real hardware execution. Synthesis of the custom hardware, and all the necessary communication between the simulator and the hardware implementation, is hidden to the user, greatly reducing the complexity of the process. In the Model Design world, the concept, known as Hardware-in-the-loop simulation, usually suffers from the need of a migration phase from high-level abstraction models to hardware implementation. This process is often done manually. 
It makes sense to use the same concept in HDL tools and, in fact, there exist some commercial offerings that contain some of the desired features. However, most of them are bound to a particular hardware platform, depend on a particular FPGA device, or are loosely coupled with the simulation environment. The method I propose integrates the following capabilities:

• Integration of hardware-in-the-loop simulation in the JHDL environment.
• Automation of the synthesis, place & route and device configuration tools, and operating system identification of the reconfigured system.
• Independence from the FPGA device used.
• Independence from the hardware platform used.

Neither full observability nor full controllability are mandatory requirements; these would be important issues for a hardware debugger but are not central for a HIL simulation system. In our case, a user design is viewed as a black box that is downloaded to hardware in order to speed up the whole simulation and possibly to verify that its hardware version behaves equivalently.

The integration of hardware emulation in system simulators has been a recurrent topic in EDA research and industry. An initial clean-room attempt was implemented in the JHDL project with its execution model [Bellows98]. Other attempts focused on integrating reconfigurable hardware platforms into MATLAB/Simulink [Alpha], [Lyr]. Simulink extensions have evolved much since then and have become popular among the signal processing community, as they allow accelerating long and complex simulations without leaving a familiar development environment. There are many examples of integrating hardware in the loop for system simulation of various applications, like bit error rate calculation [Singh03][Shirazi03], software defined radio [Dick01][Ramon05], sonar beamforming [George99], etc.

Probably because of EDA tool manufacturers and their marketing strategies, the integration of emulation in simulation has been presented under different terms. Sometimes these different flavors are caused by stressing the benefits of some of the techniques over others, for instance performance vs. design productivity.

Hardware simulation [Wisniewski01] is a technology that allows speeding up simulation time, turning weeks or months of simulation into days or even hours. The designer can "push" the whole design, or a part of it, into hardware. Because it is a rather new technology, every solution is different and has different features. Some vendors produce only hardware simulators; others manufacture hardware and also software simulators.

Xcite-2000 [Axis] offers a simulation performance of up to 100K cycles/second. The product is based on a PCI board containing an Altera FPGA that communicates with the simulator (Figure 9). The design description can be separated into three components: behavioral, RTL and gates. The Xcite compiler automatically maps the sections that can be RCC accelerated (RTL and gate-level components) and builds a native compiled simulation image for the behavioral sections, which need to stay within the Axis software simulator, Xsim. Using a "Hierarchy Extracted" mapping technique, the Xcite compiler automatically maps the design onto arrays of FPGAs. One of the unique capabilities of Xcite-2000 is its ability to swap software and RCC state in real time. Thus, during simulation, the user may choose to swap all RCC state into Xsim in order to debug the design and continue software simulation.
Once the circuit is fully diagnosed, the simulation state can be swapped back into RCC for maximum performance acceleration. Within Xcite RCC simulation, the simulation history for all nodes is compressed within RCC and stored on the workstation. Either during or after simulation, the Xcite VCD-on-Demand capability can extract all node history values without re-simulation. Thus design debugging becomes highly efficient without the high cost of disk storage or simulation slowdown.

Hardware Embedded Simulation (HES) [Aldec] is a technology that facilitates the incremental design verification of FPGA and ASIC devices while speeding up design verification. HES technology allows downloading selected modules of a design into an FPGA and performing hardware-software co-simulation. After a design block has been verified at the behavioral level, it is synthesized, implemented and downloaded into an FPGA residing on an accelerator board. HES technology supports up to four acceleration boards residing in one computer. The boards are PCI cards inserted into the slots of the computer. The entire design is simulated in the HES environment, which consists of an HDL software simulator and the PCI boards. This environment assures correct communication between modules located in silicon and modules simulated in software. Using the HES technology, verified modules of the design can be put into silicon after the synthesis of even a small part of the design. The user needs to synthesize the modules that should be pushed into silicon, and the HES Design Verification Manager (DVM) helps to configure the HES environment.

Aldec's simulator is based on incremental prototyping. Figure 10 shows the idea. When module A is finished, it is synthesized, implemented and finally downloaded to the HES board. Since module A resides in the hardware simulator, the designer can prototype module B in software. When module B is verified successfully at the software level, it goes through incremental synthesis and incremental place and route processes. Note that since module A now resides in the hardware, it is not synthesized and implemented again.

Co-emulation is a verification technique that maps portions of the design under verification into hardware while the rest is simulated in a software environment on a host computer. The first traces of this technology are found in [Bauer98], which migrates portions of the circuit under simulation into a commercial emulation system from Quickturn. In this work the synchronization between a cycle-accurate simulator and the emulation system is done at every clock cycle, which causes a bottleneck in the system performance. The maximum frequency reported in this work is 200 KHz for a 35 KG design, and no speedup is reported. In order to increase the overall system performance, other synchronization approaches are possible. [Fritsch99] reports poor speedups unless an enable-triggered approach is used, and in that case the co-emulation system is only 3 times faster than functional simulation. [Kudlugi01] proposes a transaction-based approach to synchronize the simulation kernel with an Ikos commercial emulation system (see Figure 11). Synchronization points between the driving environment (DE) and the device under test (DUT) are more abstract than clock cycles: each one involves several of them, reducing the total number of synchronization events needed.
As a result, this approach is faster, and the authors report clock speeds of 700 KHz and speedups of 320 for a 152 KG design.

[Kim04] proposes to split the testbench into the parts that show a dependence on the outputs of the DUT. The dependent part is moved into hardware so that the communication between the HW and SW parts can be buffered, allowing the use of burst transfers. This approach allows achieving simulation speeds of up to 649 KHz. As the transmission of data between the simulator and the emulator is the main bottleneck of co-emulation systems, reducing the amount of transmitted data gives a direct frequency increase. [Nakamura04] studies various scenarios of interaction of C++ based simulators with an FPGA-based emulator via a register interface. The results of this work show that the simulation frequency is about 100 KHz for some designs, but it can reach 1.1 MHz when a processor is emulated and only the clock and the Program Counter (PC) are transferred.

The main drawback of most of these systems is that custom code on both sides has to be developed for each device under test. [Kudlugi01] uses the Programming Language Interface (PLI), which allows a Verilog simulator to interface with external code, to provide some communication primitives based on Unix sockets to interconnect the DE with the DUT. But designers have to manually develop black boxes that use these primitives to redirect signals to the emulation system. This problem is addressed by [Sarmadi02], which defines how to systematically write HDL code that uses PLI code to interact with the emulated part of the design. Nevertheless, the approach is still manual, and they report a maximum speedup of 56. In [Çakir03] a semi-automatic process to generate the communication layers between simulator and emulator is presented; a tool called ProtoEnvGen [Çakır01] is used for the generation. No speedup is reported.

Another usual drawback of co-emulation systems is that they are usually limited to the emulation of a single DUT. But complex designs would benefit from the possibility of having several designs under test at the same time from a complex testbench. [Schumacher05] addresses this problem by defining Virtual Sockets to access multiple emulated circuits. Finally, relatively recent works [Siripokarpirom04] have addressed the use of run-time reconfiguration to efficiently reuse the FPGA resources to speed up large designs that would not fit in a single device.

The term Virtual Emulation was first used in [Borgatti96] and later in [Borgatti97] and [Dozza98]. This short-lived term was used to define a concept very similar to co-emulation. A virtual emulation system can be seen as a digital system made up of a prototype system implemented using available off-the-shelf components and a virtual system implemented by a behavioral model running on a simulator. These two systems communicate through a special-purpose HW/SW layer implementing the Emulation Interface Border (see Figure 13). The virtual system is typically under development and its specification is not detailed enough to make it synthesizable. The specialized HW/SW interface (the so-called Emulation POD) is implemented in an FPGA that converts electric signals from the prototype system into logic signals for the simulator and vice versa. Borgatti also stresses the benefits of the incremental verification of the system.
Starting from a completely behavioral description, one can smoothly go to a complete hardware implementation while maintaining the same benchmarks and tools (Figure 14).

As initially conceived in [Bellows98], in a JHDL design there is always a HWSystem object that, as its name implies, represents the whole hardware system. It implements the simulation kernel that invokes behavioral descriptions during software simulation. In some systems it is also responsible for talking with the FPGA through calls to the necessary APIs and device drivers. The purpose of this communication can be the configuration of the device or the transmission of data, most often to interface with a given functional unit programmed in the device. Using this transmission link, the input and output ports of any JHDL circuit can be redirected to the FPGA device to interact with its real hardware implementation, allowing its execution in real hardware (Figure 15a). However, to be able to perform this redirection, the circuit has to be synthesized beforehand and downloaded into the platform. This is done programmatically by instructing the HWSystem to download a given bitstream into the device. Figure 15b shows the steps involved in advancing a clock cycle in a hardware circuit:

1. The inputs of the circuit are passed to the hardware implementation through the necessary calls to device drivers.
2. The HWSystem issues a clock step to the whole system that eventually calls the device to advance its clock.
3. The device advances its clock and buffers the circuit outputs.
4. The outputs of the circuit are passed up to software and placed in the output ports' software buffers.

Running the above algorithm to interact with the synthesized circuit version has a price to be paid: the hardware platform must include some instrumentation so that it is possible to copy input and output values and control the advance of the circuit clock.
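The four steps above can be summarized with the following Java sketch of one emulated clock cycle as seen from the simulation kernel. The HardwareLink interface and every method name in it are hypothetical, introduced only for this illustration; they are not the actual JHDL or platform driver API.

// Hypothetical interface to an instrumented hardware platform (not the JHDL API).
interface HardwareLink {
    void writeInputs(long[] inputValues);   // step 1: copy circuit inputs down to the device
    void stepClock();                       // steps 2-3: advance the emulated clock once
    long[] readOutputs();                   // step 4: retrieve the buffered outputs
}

// Sketch of one emulated clock cycle driven from the simulation kernel.
class EmulatedBlockStepper {
    private final HardwareLink link;

    EmulatedBlockStepper(HardwareLink link) {
        this.link = link;
    }

    // Advances the emulated block by one cycle and returns its outputs, which the
    // kernel would then place on the software buffers of the output wires.
    long[] cycle(long[] inputValues) {
        link.writeInputs(inputValues);  // 1. pass inputs to hardware
        link.stepClock();               // 2-3. issue a single clock step; device buffers outputs
        return link.readOutputs();      // 4. bring outputs back up to software
    }
}

Every call in cycle() crosses the host-to-board link, which is why the per-cycle communication overhead discussed above for co-emulation systems dominates the achievable emulation frequency.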
Unfortunately, [Bellows98] does not specify the details of how this is done in the HotWorks platform. Later, [Hutchings99], while presenting how the support for the Annapolis Microsystems Wildforce platform is implemented, gives some clues about how the instrumentation is performed. Not just any circuit can be downloaded into the hardware platform, but only those that derive from a specific pelca class, which represents a programmable element of the platform. The pelca class adds some instrumentation to communicate with the host system, but the transmission of circuit output values after a clock cycle (step 4 of the previous algorithm) is performed using the readback technology of Xilinx FPGAs. The configuration of the device is no longer programmed in the source code; a configuration utility (Figure 16) is provided to do it interactively. Later work on JHDL [Hutchings00b], [Hutchings01], [Graham01], [Wheeler01], [Bellows04] keeps the same approach, addressing hardware debugging rather than hardware-in-the-loop simulation.

Simulink is a graphical system modeling tool that provides a simulation environment for continuous and discrete time systems modeling. Simulink was especially used for process control system modeling, but over the years it has broadened its scope towards DSP and hardware systems design. Simulink lets the user graphically describe system components with interconnected modules, which can be composed of other basic modules or can be described behaviorally with the Matlab S language or other languages like C and Fortran.

To integrate hardware-in-the-loop simulation in Simulink, there were two major problems to be solved:

1. How to generate hardware from Matlab models?
2. How to integrate the generated hardware modules in the simulation chain?

There has been some research on creating HDL code from S code [Banarjee99], [Banarjee00]. In fact, this is a problem of behavioral synthesis from a high-level language. However, the S language has some properties that make this task challenging: it is a dynamically typed language, and programs usually rely on dynamically allocated multidimensional arrays and on calls to a large pre-existing library of functions. This approach has not been very successful.

On the other hand, in the year 2000 Xilinx presented an alliance with MathWorks that led to the launch of Xilinx System Generator [Xilinx00]. Its approach is not based on behavioral synthesis but on structural design and IP reuse, which is similar to a previous proposal found in [Krukowski99]. In System Generator, a library of hardware blocks is provided by Xilinx. Hardware blocks are described twice: as S functions, which can be integrated in Simulink, and as VHDL code, which can be synthesized to a hardware platform. The blocks can be simple, such as primitive logic cells, or complex IP cores, like an FFT circuit. By combining blocks, the designer can implement and test a hardware design from Simulink and create its hardware implementation. Several IP blocks are provided: FFT, FIR filters, multipliers, etc. Users can also integrate their own existing VHDL code with a black-box model. The reported speedups (Table 1) depend on several factors, like the complexity of the design and the synchronization scheme.

Table 1. Reported System Generator hardware-in-the-loop simulation speedups.

Application                                        Software Simulation   Hardware Execution   Speedup Factor
5 x 5 Image Filter                                 170 s                 4 s                  43
Cordic Arc Tangent                                 187 s                 27 s                 7
Additive White Gaussian Noise Channel (AWGN)       600 s                 80 s                 7.5

It is relevant to note that hardware-in-the-loop simulation is provided in conjunction with platform manufacturers, because they have to provide all the middleware that allows the communication between the simulation kernel and the circuit under test. In 2000 only a few hardware platforms had support for Simulink ([Alpha], [Lyr]), but nowadays there are a large number of them from different manufacturers (Annapolis, Nallatech, Lyrtech, etc.).

The Ptolemy project [Lee01], [Lee01b] is focused on the modeling, simulation and design of concurrent embedded systems. Ptolemy aims to model various technologies, like mechanical systems, analog electronics, digital systems and software. Several computing domains are defined so that the models can accommodate the different properties of the different technologies, especially regarding their notion of time. Ptolemy is an actor-oriented design framework [Lee03], meaning that it is centered on modeling the actors that interact in a system without constraining their models of computation. Moreover, Ptolemy II allows domain-polymorphic definitions and the integration of hierarchical heterogeneous domains (Figure 17). Ptolemy can describe actors in a variety of languages, as programming languages tend to be designed for a certain computing domain.
Although there is no built-in support for hardware-in-the-loop simulation in Ptolemy, there is some interest within the Ptolemy community in having this feature. In fact, the group of Indrusiak is working on this topic. [Indrusiak03] describes a method to integrate remote actors into a Ptolemy system. Since remote actors execute in different process contexts, their implementation can be anything that communicates successfully with their proxies, including a hardware circuit running on an FPGA platform. This is a good approach to share a limited resource, like an expensive FPGA board, for educational purposes [Jimenez05], and it can also be useful to design complex systems like a WCDMA receiver [Indrusiak05]. The development process is similar to the process imposed by Simulink-based tools (Figure 18): a model in Ptolemy is created and refined in several iterations, first to convert from floating point to fixed point, then to create its equivalent JHDL model, and finally to netlist the resulting circuit. However, this kind of integration heavily relies on manual processes, especially in the interface between Ptolemy and JHDL. Figure 19 gives an idea of the coding that must be developed to intercommunicate both worlds. Despite the availability of the "Hardware Mode" of JHDL, which would allow the execution of JHDL-based actors on FPGA platforms, eventually achieving the goal of executing hardware in the simulation loop, to my knowledge this has not been done. To sum up, Ptolemy provides a very powerful environment to design in various computing domains but, for now, it requires an excessively manual approach to integrate hardware in the loop.

DSP designers can aggressively shorten the simulation time of complex systems with HIL simulation. However, HIL simulation is not the only application of hardware emulation. Some ASIC prototyping systems, like the products from Quickturn [Butts92], [Quickturn] and Mentor Graphics [Mentor] (originally IKOS), emulate large ASIC designs on FPGAs. Since a single FPGA does not have enough capacity to embed a typical ASIC design, these systems use multiple interconnected FPGAs enclosed in a big case and controlled by a host computer via a high bandwidth link. The drawbacks of these systems are that they are very expensive and need some expertise to be used. Additionally, the communication between the hardware and the simulation kernel is often based on transactions and is usually not transparent to designers.

Another related topic is the hardware debug concept [Tombs04], [Graham01], which pursues the equivalent features of software debuggers for hardware design. The main goal of a hardware debugging system is to allow detecting and removing bugs from a design, so speed is not the central point, although an important one. Software designers debug by running step-by-step, setting breakpoints, adding traces, watching variables, and modifying values while debugging. These features can be formalized as interactivity, controllability and observability. Hardware debuggers, for instance, should allow controlling the execution of the clock (or clocks), watching any part of the circuit, adding breakpoints (triggers), adding traces, and modifying register or memory contents. Simple versions of some of these functions are being integrated in EDA design flows.
Embedded logic analyzers like Altera Signal Tap [Altera01] and Xilinx Chip Scope [Xilinx00b] follow a simple approach to enable the acquisition of signal values over some time after a triggering event has been reached.

JHDL is a design environment that provides a Java API for describing FPGA circuits in a constructive way (mostly bottom-up), as well as a collection of tools and utilities for their simulation and hardware execution.

In JHDL, circuits are described as Java classes that follow a given design style. The typical levels of abstraction used by HDLs are specified by deriving classes from specific base classes and interfaces. The basic class hierarchy is shown in Figure 20. The Cell class is the main class; it represents a hardware block with an I/O interface. There are two different Java interfaces, Clockable and Propagateable, which denote sequential and combinational logic respectively. The Cell class has a number of subclasses that map more specifically the nature of the circuits. The CL class is a Cell-derived class that represents a completely combinational circuit, so it implements the Propagateable interface. The Synchronous class is a Cell-derived class that represents a completely synchronous circuit, so it implements the Clockable interface. It is mandatory that CL and Synchronous derived classes provide a behavioral model, and they are often used internally in JHDL to represent primitive logic from the hardware devices.

Figure 20. JHDL basic class hierarchy: the Cell class and its subclasses CL, Structural, Synchronous and HWProcess (+waitUntilClock()), together with the interfaces Propagateable (+propagate()), Clockable (+clock(), +reset()) and Runnable (+run()).

User-defined designs consist of the instantiation of existing cells following a constructional method, so their behavior can be inferred from their components. For this reason, there is an additional Cell-derived class named Structural. As building blocks can be either combinational or synchronous, the Structural class implements both the Clockable and Propagateable interfaces, but it is not mandatory to provide a behavioral model, since it can be inferred from the contained cells. The additional HWProcess class allows describing circuits by only providing a behavioral model based on a sequential description in which timing is specified by calling the waitUntilClock function. Since there is only a behavioral model, and they never map to primitive logic from the hardware device, HWProcess-derived classes are not synthesizable.

Whatever the base class of a circuit, a behavioral model consists in having a programmatic way of driving the circuit outputs, i.e. assigning values to the outputs of the system depending on the values of the inputs and an internal state. Inputs and outputs in JHDL are represented by a unique class named Wire. Wires also connect different circuits. One can examine the value of an input wire by calling a get method and assign a value to an output wire by calling a put method. There are a number of variants of the get and put methods depending on the width of the wires.

Once a circuit is compiled it can be simulated in two ways. First, a custom testbench can be developed by implementing the TestBench Java interface. The simulation kernel is exposed as an object and can be easily controlled from testbenches to feed stimuli to the circuit under test, extract and display outputs, and control execution by explicit invocation of clock advance functions. In the following example code a testbench is built to verify a median filter design.
In the execute method, the HWSystem is instructed to advance the system clock (the cycle call) after the data for the inputs of the system have been updated.

package org.cephis.MedianFilter;
...
public class tb_MedianFilter extends Logic implements TestBench {

    static HWSystem hw;
    Wire in[] = new Wire[9];
    int v[] = new int[9];
    ...

    public static void main(String argv[]) {
        hw = new HWSystem();
        tb_MedianFilter tb = new tb_MedianFilter(hw);
        tb.execute();
    }

    public tb_MedianFilter(Node parent) {
        super(parent);
        for (int i = 0; i < 9; i++)
            in[i] = wire(8, "in" + i);
        median = wire(8, "median");
        design = new MedianValue(this, in[0], in[1], in[2], in[3], in[4],
                                 in[5], in[6], in[7], in[8], median);
    }

    // Feeds a noisy image through the median filter, one 3x3 neighborhood per cycle.
    public void execute() {
        FileInputStream fis = new FileInputStream("c:\\test.jpg");
        JPEGImageDecoder decoder = JPEGCodec.createJPEGDecoder(fis);
        BufferedImage img = decoder.decodeAsBufferedImage();
        BufferedImage dst = new BufferedImage(img.getWidth(), img.getHeight(),
                                              BufferedImage.TYPE_INT_RGB);
        int shift = 0;

        addSaltAndPepperNoise(img, 0.2);

        for (int i = 0; i < 9; i++)
            v[i] = 0;
        getSystem().cycle(1);

        for (int mask = 0xFF; shift <= 16; mask <<= 8, shift += 8)
            for (int y = 1; y < img.getHeight() - 1; y++)
                for (int x = 1; x < img.getWidth() - 1; x++)
                {
                    fillKernelValues(v, img, x, y);
                    getSystem().cycle(1);

                    int rgb = dst.getRGB(x, y);
                    rgb &= ~mask;
                    rgb |= (median.get(this) << shift) & mask;
                    dst.setRGB(x, y, rgb);
                }
    }

    // Clear all inputs on reset.
    public void reset() {
        for (int i = 0; i < 9; i++)
            in[i].put(this, 0);
    }

    // Drive the DUT inputs with the current kernel values on every clock.
    public void clock() {
        for (int i = 0; i < 9; i++)
            in[i].put(this, v[i]);
    }
    ...
}

Second, a circuit can be loaded into the interactive simulation environment (DynamicTestBench, DTB), which provides several facilities for exercising and viewing the state of the circuit during simulation. The DTB includes a hierarchical circuit browser with a tabular view of signals, a schematic viewer that annotates signal values, a waveform viewer, a memory viewer and a command line interpreter (see Figure 21).

Since all tools are public and available from the designer's perspective, more complex hybrid testbenches can be developed. For instance, a TestBench-derived class could instantiate the schematic viewer and the waveform viewer for easy visual inspection of results while programmatically feeding stimuli. Conversely, DTB-based simulations could include behavioral modules that generate complex stimuli or display results in a custom way.

Most HDL simulators are based on an event-driven approach. In fact, as stated in [Kulmala], the semantics of the VHDL and Verilog languages assume there is an underlying event-driven simulator. An event-driven simulator is based on the existence of a queue of pending events, called the Time Wheel. An event has two components: the value of a signal and a time. At each simulation iteration the simulator takes the head of the Time Wheel and evaluates all the dependent circuits that the event could trigger. The circuits that depend on a signal are obtained from the sensitivity list that is always defined in VHDL and Verilog designs. The evaluation process may cause the insertion of new events in the Time Wheel; events are always inserted in the Time Wheel in time order. Simulation ends when the Time Wheel contains no pending events. A simple example of the simulation dynamics is depicted in Figure 22. In this example the Time Wheel is initialized with the test vectors for signals a, b and c. The first event in the Time Wheel indicates that signal b changes at time t1.
When the simulator takes this event it must look for all the circuits that depend on signal b. The sensitivity list of the behavioral model of the NAND2 gate includes signal b, so it is processed and a new event, associated with its output, is inserted in the Time Wheel. The same algorithm is repeated until no events are pending in the Time Wheel.

Figure 22. Example of event-driven simulation dynamics: test-vector events for signals a, b and c are consumed from the Time Wheel and the evaluation of the NAND2 gate inserts new events for its output.

Several events can happen at the same time, so the order of evaluation of events is critical for the accuracy of the results. If the behavioral models of the circuit under test include time semantics, the system can be verified with some time precision; on other occasions a zero delay is assumed.

JHDL, on the other hand, has a cycle-based simulation engine (CBSE). In cycle-based simulators, time advances at discrete intervals, i.e. clock cycles. Combinational logic is assumed to have zero delay and synchronous logic has a delay of one clock cycle. The JHDL simulator has to differentiate between synchronous and asynchronous circuits and, as a consequence, between synchronous and asynchronous wires. This allows the simulator to apply the correct method of value propagation to each circuit wire.

The model that allows the simulation of a JHDL circuit is built at the same time as the circuit itself. In fact, the simulation system is tightly coupled with the circuit modeling, and the simulation structures are created and maintained even when there is no intention of simulating. This approach is totally different from other simulators, like ModelSim, that are completely decoupled from circuit modeling. A fundamental class in the JHDL framework is the ValuePropagater class, which models a channel that can propagate a value between two endpoints. A ValuePropagater is associated with each line of each wire. During circuit building, the BuildListManager class is responsible for keeping track of all the propagators of the system, which are stored in the all_value_propagaters member variable. This array is populated progressively as wires are connected to Cells (Figure 23).

Figure 23. UML sequence diagram of circuit building: as Cells are added and Wires are connected (connect, linkSinkCell, addSinkCell), the BuildListManager registers clockable cells in all_clockable_cells and value propagaters in all_value_propagaters.

When simulation is initialized, all the ValuePropagaters are classified depending on the nature of the cells that drive them. This process is performed by PropagateManager.topologicalSort (Figure 24).

Figure 24. UML sequence diagram of simulator initialization: the Simulator/MCSimulator initializes the PropagateManager using the BuildListManager's all_value_propagaters, invokes topologicalSort(), retrieves the clock drivers and the global propagate schedule, and finally resets the system.

This classification is important in order to handle the different types of sources in different ways. For instance, constant cells do not vary and there is no reason to evaluate their value at every clock cycle, so they are handled differently from the rest of the elements of the circuit. Besides classification, the topologicalSort function also builds a directed graph with the dependencies of the different ValuePropagaters. This is useful for propagating the results of combinational logic, since all the cells are evaluated in the order of the list; it is also the major reason why asynchronous loops are not allowed in JHDL.
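The scheduling that results from this classification can be summarized with the Java sketch below. It is a simplified illustration, not the actual JHDL simulator code: the two interfaces only mirror the Clockable and Propagateable roles described above, and the propagation list is assumed to be already topologically sorted.

import java.util.List;

// Simplified roles of the JHDL interfaces described in the text (illustrative only).
interface ClockableElement     { void clock(); }     // latches new state on the clock edge
interface PropagateableElement { void propagate(); } // recomputes combinational outputs

// One clock cycle of a cycle-based engine: clock every synchronous element first,
// then propagate the combinational elements in topologically sorted order, so that
// every element sees up-to-date inputs when it is evaluated.
class CycleBasedKernel {
    private final List<ClockableElement> clockables;
    private final List<PropagateableElement> sortedPropagateables; // assumed pre-sorted

    CycleBasedKernel(List<ClockableElement> clockables,
                     List<PropagateableElement> sortedPropagateables) {
        this.clockables = clockables;
        this.sortedPropagateables = sortedPropagateables;
    }

    void cycle() {
        for (ClockableElement c : clockables)
            c.clock();                      // order irrelevant: new values not yet visible
        for (PropagateableElement p : sortedPropagateables)
            p.propagate();                  // order matters: follows the dependency graph
    }
}

Clocking first and propagating afterwards is exactly what makes the evaluation order of the synchronous elements irrelevant, as discussed below for the example circuit.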
The following example illustrates how the topological sort is performed and how it affects the simulation. Let us consider a very simple piece of JHDL code that creates a couple of registers and some simple logic gates. We assume that signals in0 and nor are the input and output signals of the circuit, respectively. The schematic view of this very simple circuit is shown in Figure 25.

...
Wire in0 = wire("in0");
new Stimulator(this, new Wire[]{in0});
Wire xor0 = wire();
Wire nor = wire();
Wire reg1 = reg(in0);
Wire reg2 = reg(nor);
xor_o(reg1, reg2, xor0);
Wire or0 = or(reg1, reg2);
Wire and0 = and(in0, or0);
nor_o(reg1, and0, nor);
...

As mentioned before, the topological sort during simulation initialization classifies the different elements of the circuit, including Wires and Cells. The resulting graph is shown in Figure 26 (Cells are yellow colored while Wires are white colored). Three main groups are created for this circuit: constant cells, clockable cells and propagatable elements. The constant cells contain all the constants of this circuit, just power (VCC) and ground (GND) connections. Since there are only two flip-flops in this circuit, the clockable elements contain only references to lpm_ff and lpm_ff-1. Finally, we have the list of the propagatable elements of the circuit. As the list has been built taking dependencies into account, the simulator can evaluate each element in order and be sure that no inconsistency occurs.

Figure 26. Result of the topological sort for the example circuit. Constant elements: vcc, vcc-1, vcc-2, vcc-3, gnd, gnd-1, gnd-2, gnd-3. Clockable elements: lpm_ff, lpm_ff-1. Propagatable elements (in dependency order): reg1_q, reg2_q, in0, lpm_xor, or2, or_out, and2, and_out, nor2, nor_out.

The simulator evaluates all synchronous blocks before propagating the asynchronous elements (see Figure 27). The order of evaluation of the synchronous blocks is irrelevant, since the values they compute are not made public to the rest of the circuits until Wire propagation occurs as part of the propagation of the asynchronous elements.

Figure 27. UML sequence diagram of a simulation step: the Simulator/MCSimulator executes step(), which calls risingEdgeClock()/clockAll() on the clock schedule and then SettlePropagate()/propagateAll() on the propagate schedules.

JHDL allows synthesizing circuits created by the user. The synthesis method is based on generating an EDIF [Edif] netlist from the circuit model, to be used by the FPGA provider's tools for the final Place & Route. EDIF (Electronic Design Interchange Format) is a data interchange format defined by the Electronic Industries Alliance (EIA), a US-based industry association, to make CAD tools interoperable. As previously described, the internal circuit model consists of a hierarchical tree of cells connected by wires. The EDIF format completely matches this model, so the generation of EDIF files from the model is straightforward. EDIF describes interconnections in text format by using reserved keywords (or tags) that are organized hierarchically; being a text-based hierarchical format, it has similarities with XML and HTML. The following example code shows the netlist of a simple cell in EDIF 2 0 0.
As seen below, the andX_g_1 cell defines two input ports and one output port and instantiates a primitive cell, and_2, which is part of the primitive elements of the FPGA library.

    (cell (rename andX_g_1 "andX_g_1")
      (cellType GENERIC)
      (view view_1 (viewType NETLIST)
        (interface
          (port in1 (direction INPUT))
          (port in0 (direction INPUT))
          (port out (direction OUTPUT))
        )
        (contents
          (instance andX (viewRef view_1 (cellRef and_2)))
          (net (rename in1 "in1")
            (joined
              (portRef (member i 1) (instanceRef andX))
              (portRef in1)
            )
          )
          (net (rename in0 "in0")
            (joined
              (portRef (member i 0) (instanceRef andX))
              (portRef in0)
            )
          )
          (net (rename out "out")
            (joined
              (portRef o (instanceRef andX))
              (portRef out)
            )
          )
        )
      )
    )

JHDL cannot create the final bit-stream to program the FPGA; it is mandatory to use the tools provided by the manufacturer of the device, e.g. ISE for Xilinx devices. This is not a shortcoming of JHDL but the result of industry tactics, as manufacturers are very reluctant to make the bit-stream format publicly available.

JHDL offers an integrated simulation/execution environment [Hutchings01], meaning that the designer can use the same facilities when working in simulation mode and when working in hardware mode; for instance, the clock control and the schematic viewer, whose signal value annotation is available in both modes. These features are based on the following facts: 1) Xilinx devices allow retrieving the state of the complete configuration memory, including flip-flop states, through readback. 2) JHDL classes that represent stateful device primitives, like flip-flops, implement the ExternallyUpdateable interface, so when the simulator kernel is running in hardware mode it only updates their value after retrieving readback data.

A drawback of this approach is that it is limited to devices that support readback or an equivalent technology, so in the end it is limited to a few Xilinx devices. A more general approach consists in instrumenting the designs with scan chains [Wheeler01b] to be able to access all circuit flip-flops independently of the kind of device used. Unfortunately, the cost in area overhead can be very high, from 30% to 100%, and speed is degraded by 20% on average.

The JHDL hardware execution model provides a method to transparently update the state of the model from the executing hardware, but lacks a method to update the state in the other direction, which might be based on JBits [Ballagh01], [Poetter04]. This drawback is solved by providing a transaction-based model for each currently supported hardware platform, i.e. testbenches communicate with circuits through register read/write operations. This makes it difficult to incrementally test parts of the design on its hardware implementation, because the interface would have to be redesigned in each iteration. Additionally, the JHDL hardware execution model requires having a bitstream of the design to be downloaded into the hardware platform, but the invocation of the Place & Route tools to produce this bitstream is not included in the design flow.

JHDL has support for a few hardware boards. Some information about the supported platforms can be obtained from http://splish.ee.byu.edu/lab/ but most of the detailed information is spread over several research papers that describe applications implemented on them.

The initial JHDL paper [Bellows98] reports support for the HotWorks platform, a PCI board from Virtual Computer Corp. Unfortunately, there is very little information about the details of how this platform was supported in JHDL.
The Systems Level Applications of Adaptive Computing (SLAAC) project was led by the Information Sciences Institute of the University of Southern California. As stated on their website, the mission of the project was to create an open, standards-based, scalable, COTS-based reference platform that could be used for demanding high-performance defense applications. The SLAAC1 platform (Figure 29, Figure 30) was built as part of the project. It consists of an FPGA-based accelerator on a full-sized 64-bit PCI board containing a user-programmable Xilinx 4085 device, two user-programmable Xilinx 40150 devices, and ten 256Kx18 100MHz ZBT synchronous SRAMs.

To implement a hardware design using the SLAAC platform, and be able to simulate and netlist it, the design class must first extend the superclass pelca and define the input/output ports of the circuit. Simulations and executions can be controlled automatically from programmatic testbenches or manually through a graphical user interface (Figure 31). Communication with the host is possible through the IF FPGA.

Annapolis Micro Systems Inc. manufactures various FPGA-based boards for rapid prototyping and educational purposes. The Wildcard board (Figure 32) is a CardBus board for which there is a JHDL execution model. However, the available model does not support readback. As shown in the logic block diagram (Figure 33), the board contains a processing element (a Virtex FPGA) connected to two memory chips, and has two I/O banks and a bus (LAD, Local Address Data bus) connected to the CardBus interface.

The JHDL platform model contains the WCBoard class, which describes the board as a whole. The application-specific circuits contained inside the board (like memories and the bus controller) are exposed to the JHDL user as behaviorally modeled circuits. In fact, user designs can only be implemented in the processing element (PE). To do so, the user must implement a Java class extending the LogicCore class. The elements external to the PE can be accessed through the PE interface, i.e. their input/output pins. Some helper interfaces are made available to ease the design. The communication with the external world is achieved by going through the LAD bus, which is transaction based. Since all the fixed-functionality hardware circuits (all but the PE) have alternative behavioral models, the user can simulate a host/board system. When execution mode is used, the real hardware is used, obtaining a significant speedup. Since the Wildcard platform does not support readback, when executing in hardware mode the visibility of the total circuit state is lost.

The Osiris platform (Figure 34) is another platform developed internally by USC/ISI and later commercialized by CoreTech, a division of Atlantic Coast Telesys. The Osiris platform uses a large FPGA connected to a large SDRAM memory and some ZBT RAM modules. It also integrates a current and thermal monitoring system. The platform is supported by JHDL [Osiris] by providing the typical behavioral models of the non-programmable blocks of the board. The user is not forced to extend a particular class to implement a circuit, but only to conform to a given interface, i.e. a list of the defined input and output pins. Readback support is missing, so the state of the system can only be obtained indirectly through the transactional interface.

Summing up, several platforms are available that support the execution mode of the JHDL framework.
These platforms provide a smooth path from circuit simulation to real system execution, while offering a good level of observability when the readback technology is available. The aim of the JHDL execution mode is to offer a final execution environment for hardware designs. However, the JHDL execution mode is unacceptably dependent on technologies that are not universal across FPGA manufacturers (like readback). Another great problem is that it usually forces the tools to identify the elements that were produced by the final Place & Route process and relate each one to its original design entity. Place & Route processes are often tightly integrated with the synthesis process, and they do a better and better job of getting rid of unused logic or refactoring circuits into more efficient ones. So in the end you can download a bitstream that implements a functionally equivalent circuit but uses different resources than you initially planned. In this situation the tools have trouble offering valuable information.

In the following section I propose a different approach to provide hardware execution on a broader range of platforms. However, the aim is not to offer the final execution platform but one that you can use to speed up simulations during your design process. It comes for free that you can eventually use it as the final execution platform as well.

The execution model proposed by JHDL has an important drawback: it is tightly coupled to the underlying hardware platform. The user has to design specific testbenches for the given platform, in which the platform is explicitly referenced. The user often knows the FPGA pin-out and uses it to access external resources or to communicate with the host. Moreover, the circuit to execute in hardware must often extend a particular class, e.g. pelca or LogicCore. This approach can be examined from the following point of view: the platform manufacturer provides good simulation models of the board, and the simulation environment is augmented so it can switch from board simulation to board execution in a very easy way, as depicted in Figure 36. In this model the user is designing a board application and, since all the examined platforms are connected to a PC host through a particular flavor of PCI, probably a PC accelerator. This kind of board application uses an HWSystem object, which can be executed in either hardware mode or simulation mode. This causes a switch between the behavioral circuit models and the interfaces to the real hardware. In most JHDL-supported platforms the necessary synthesis, bitstream generation and configuration steps are not fully automated, and user intervention is also needed in this step.

The Jumble proposal is more radical in the sense that it tries to hide as much as possible the existence of a hardware platform from the designer. Instead of developing a hardware accelerator bound to the platform, the user implements a hardware design totally unrelated to the platform. The user is not forced to conform to a given interface and creates exactly the same design that he would create without having the Jumble simulation feature; a minimal sketch of such a platform-agnostic block is shown below.
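The following minimal JHDL-style block is an invented example, written only to illustrate the point and following the Logic patterns shown elsewhere in this work; note that it contains no reference to any board, pin-out or platform class, so it is exactly the kind of design that can later be picked as a Jumble hardware target.

    import byucc.jhdl.base.*;
    import byucc.jhdl.Logic.*;

    // A platform-agnostic accumulator block (illustrative sketch, not taken from
    // the thesis examples). Nothing here references a board, a pin-out or a
    // platform-specific superclass, so the same class can be simulated in
    // software or selected later as a Jumble hardware target.
    public class Accumulator extends Logic {
        public static CellInterface[] cell_interface = {
            in("din", 8), out("acc", 16)
        };

        Wire din, acc;
        int value = 0;

        public Accumulator(Node parent, Wire din, Wire acc) {
            super(parent);
            this.din = connect("din", din);
            this.acc = connect("acc", acc);
        }

        public void reset() {
            value = 0;
            acc.put(this, 0);
        }

        public void clock() {
            value = value + din.get(this);   // accumulate the 8-bit input
            acc.put(this, value & 0xFFFF);   // drive the 16-bit output
        }
    }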
When a certain block (the target) of the design is selected to run in hardware, the Jumble tool automatically creates the logic to implement the selected block in the programmable element of the available platform. The target circuit is synthesized and downloaded to the platform, but some logic is added to make the communication with the simulation possible. On the software side, the target simulation block is substituted by a redirector that performs the communication with the hardware implementation of the target (Figure 37).

The added logic has to be able to feed inputs to and collect outputs from the circuit under test and to control its clock advance. To do so, the target design is wrapped with a boundary scan chain (Figure 38), which is built as a circular register with a window accessible from the host. This window is a 32-bit-wide register called Scan Chain Data (SC). A Scan Chain Control (SCC) register is used to control how the scan chain is shifted. The SCC controls two important signals: DoScan and EnabScan. EnabScan indicates that the scan chain should shift one bit. The least significant bits of SCC contain a counter value that instructs how many bits should be shifted along the scan chain. The DoScan flag indicates that a scan operation is being performed. It stays active during several transactions of the PCI bus, until all the data has been correctly shifted along the scan chain. While DoScan is active the circuit under test does not see any change on its inputs; only when DoScan goes low do the inputs reflect the values that have been fed to the system through the scan chain.

After all data has been shifted into the inputs, a clock cycle can be scheduled. The Clock Control (CC) register accepts a number of clock cycles to be run in the hardware system. A gated clock circuit is controlled by a countdown counter that stops running when zero is reached. Its design is shown in Figure 39.

When the clock cycles are complete, all output registers are shifted out through the scan chain. Full scan chain registers use the ScanOut value both for the chain connection and for the regular Q output. However, the toggling of the register outputs during shift operations could change the state of possible asynchronous designs, so an asynchronous-safe boundary scan chain node is used, as shown in Figure 40.

The simulation time reduction, or speedup, that I expect is mainly determined by the percentage of the circuit design that is implemented in hardware. This is quite an evident conclusion if we recall Amdahl's law: the more circuit is implemented in hardware, the faster the simulation.
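Before quantifying this, it helps to look at what one hardware-in-the-loop clock cycle costs on the host side. The following Java sketch drives the interface just described; the register offsets, the exact SCC bit layout and the read/write helpers are assumptions made only for illustration and do not correspond to the actual Jumble implementation.

    public class JumbleHostDriverSketch {
        // Register offsets and SCC layout are assumptions for this sketch.
        static final int SC  = 0x00;   // Scan Chain Data window (32 bits)
        static final int SCC = 0x04;   // Scan Chain Control (DoScan flag + shift count)
        static final int CC  = 0x08;   // Clock Control (number of clock cycles to run)
        static final int DO_SCAN = 1 << 31;

        private final RegisterAccess io;   // assumed memory-mapped register access

        public JumbleHostDriverSketch(RegisterAccess io) { this.io = io; }

        // Runs one emulated clock cycle for a target whose interface is 'width' bits wide.
        public int[] runOneCycle(int[] inputWords, int width) {
            int words = (width + 31) / 32;
            io.write(SCC, DO_SCAN | width);          // start shifting 'width' bits in
            for (int i = 0; i < words; i++) {
                io.write(SC, inputWords[i]);         // 32 bits per PCI transaction
            }
            io.write(SCC, 0);                        // DoScan low: inputs become visible
            io.write(CC, 1);                         // run exactly one gated clock cycle
            int[] outputWords = new int[words];
            io.write(SCC, DO_SCAN | width);          // shift the outputs back out
            for (int i = 0; i < words; i++) {
                outputWords[i] = io.read(SC);
            }
            io.write(SCC, 0);
            return outputWords;
        }

        public interface RegisterAccess {            // assumed low-level PCI access
            void write(int offset, int value);
            int read(int offset);
        }
    }

The important point for the cost model that follows is that the number of slow PCI accesses grows linearly with the interface width.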
However, in our design we have to take into account the width of the interface as well, since we spend quite a lot of "slow" cycles to place the correct values on the inputs and outputs of the hardware version of the target circuit. From here on we consider only the simulation of a single clock cycle; multiple clock runs can be generalized from the values that we get by a simple multiplication.

The speedup is the ratio between the standard simulation time and the time used by the hardware-in-the-loop (HIL) simulation. In the latter case we have to consider that there is a part that is implemented in hardware and another part that remains unmodified; their contributions to the simulation time are $T_{Jumble}$ and $T_{Rest}$ respectively.

$Speedup = \frac{T_{SW}}{T_{HIL}}$  (3)

$T_{HIL} = T_{Rest} + T_{Jumble}$  (4)

$T_{SW} = T_{Rest} + T_{SWTarget}$  (5)

The part of the circuit implemented in hardware needs to get input data from the simulator, run a clock cycle, and send the output data back to the simulator. This time can be grouped into a transfer time ($T_{I/O}$) and a clock run time ($T_{HWTarget}$):

$T_{Jumble} = T_{I/O} + T_{HWTarget}$  (6)

Input/output data transfer is performed by reading and writing the SC and SCC registers. Several PCI bus operations are needed to complete a clock cycle, and $T_{I/O}$ is related to the width $W$ of the circuit interface. As the slowest part of the process is the shifting of the SC register, which runs at the PCI clock speed, we can simplify and make $T_{I/O}$ linear in the width of the circuit interface:

$T_{I/O} = W \cdot T_{PCI}$  (7)

Since we only run a single clock cycle, we can totally ignore the contribution of $T_{HWTarget}$ to $T_{Jumble}$:

$T_{Jumble} = W \cdot T_{PCI}$  (8)

The last simplification we can make is to consider that the clock period of the host computer is a fraction of the period of the PCI clock, and then express the time of the software parts of the simulation in CPU operations (the numbers below correspond to a ratio $k = T_{PCI}/T_{CPU}$ of about 60):

$T_{PCI} = k \cdot T_{CPU}$  (9)

$T_{Rest} = OPS_{Rest} \cdot T_{CPU}$  (10)

$T_{SWTarget} = OPS_{Target} \cdot T_{CPU}$  (11)

Obviously, to get a significant speedup, $T_{Jumble}$ should be smaller than $T_{SWTarget}$, but also greater than $T_{Rest}$, in order to avoid the effects of a small $\alpha$ in Amdahl's law. Let us consider some situations to see how the numbers affect the expected speedup. In all the cases we will consider a target circuit with a 1000-wire interface.

If we had a big part of the circuit in hardware we would expect $T_{Rest} \ll T_{Jumble}$, which means that the software model of the rest of the circuit should use less than about 60K host processor operations per cycle (13):

$OPS_{Rest} \cdot T_{CPU} \ll W \cdot k \cdot T_{CPU}$  (12)

$OPS_{Rest} \ll W \cdot k = 1000 \cdot 60 = 60K$  (13)

In this case the speedup would be given by

$Speedup \approx \frac{T_{SWTarget}}{T_{Jumble}}$  (14)

$Speedup \approx \frac{OPS_{Target} \cdot T_{CPU}}{W \cdot k \cdot T_{CPU}} = \frac{OPS_{Target}}{60K}$  (15)

For instance, if we are looking for a speedup factor of 100, we should have a target that uses more than 6M host operations to simulate each clock cycle:

$OPS_{Target} > 100 \cdot 60K = 6M$  (16)

Comparing this value with the rest of the circuit, it is clear that moving a large part of the circuit to the target gives an opportunity to achieve an important speedup. If the rest part is small enough (much less than 60K operations), the ratio between the operations used by the target part and by the rest part approximates the obtainable speedup.

Consider now the opposed case, having $T_{Jumble} \ll T_{Rest}$. This does not necessarily mean that the target is small compared with the rest of the circuit. As $T_{Jumble}$ is constant for a given interface size, this means that the software model of the rest of the circuit is much more complex than 60K operations per cycle (18), but it gives no information about the complexity of the target:

$W \cdot k \cdot T_{CPU} \ll OPS_{Rest} \cdot T_{CPU}$  (17)
$OPS_{Rest} \gg W \cdot k = 60K$  (18)

Again, in this case the speedup is determined by the complexity of the target software model:

$Speedup = \frac{T_{Rest} + T_{SWTarget}}{T_{Rest} + T_{Jumble}} \approx \frac{T_{Rest} + T_{SWTarget}}{T_{Rest}}$  (19)

$Speedup \approx 1 + \frac{T_{SWTarget}}{T_{Rest}} = 1 + \frac{OPS_{Target}}{OPS_{Rest}}$  (20)

For instance, if we are looking for a speedup factor of 100, we should have a target that uses much more than 6M host operations to simulate each clock cycle (21):

$OPS_{Target} > (Speedup - 1) \cdot OPS_{Rest} \gg 99 \cdot 60K \approx 6M$  (21)

Until now we have considered two extreme cases, and in both of them we need a complex target to achieve a significant speedup. But what speedups can we get in more balanced cases? By balanced I mean having a similar complexity in the software models of the target part and the rest part. Let us consider such a case: a circuit in which the target part needs 100K operations per cycle and the rest part needs 100K operations as well. As in the previous examples, the interface has 1000 wires.

$Speedup = \frac{T_{Rest} + T_{SWTarget}}{T_{Rest} + T_{Jumble}} = \frac{OPS_{Rest} \cdot T_{CPU} + OPS_{Target} \cdot T_{CPU}}{OPS_{Rest} \cdot T_{CPU} + W \cdot k \cdot T_{CPU}}$  (22)

$Speedup = \frac{100K + 100K}{100K + 60K} = 1.25$  (23)

This math yields a rather moderate value for the obtained speedup, but we should expect this kind of result after looking at Amdahl's law. In this case we could improve the speedup if we had a smaller interface, but even with an interface of a single wire the speedup would be just short of 2. The only way of having significant speedups is to have a good percentage of the circuit implemented in the target or, in Amdahl's terminology, to have a value of $\alpha$ very close to 1.

The JHDL framework had several drawbacks that limited its potential for circuit design. One of the drawbacks of JHDL was its limited support for FPGA devices; this was a serious drawback as our research group has traditionally been working with Altera devices. Another important drawback was the lack of behavioral synthesis. Complex control schemes are better implemented with behavioral code (at the RTL level), and behavioral synthesis is required to transform this code into hardware blocks. Finally, another detected drawback was the faulty support of the sequential behavioral model. This feature was present in initial versions of JHDL through the HWProcess class but somehow became unsupported in newer versions. The following subsections describe the work undertaken to solve these problems and overcome these limitations. Parts of this work were done with Jordi Farré and Alexis Morugó as part of their respective final year projects and were later published in [Castells04b] and [Castells06b].

Most current FPGAs and CPLDs are based on the use of simple cells called configurable logic blocks (CLBs) or logic elements (LEs). These blocks are often built up from LUTs, registers and multiplexers. Any logic circuit, either combinational or sequential, can be built by combining several of these blocks. Every FPGA has a different CLB design. The goal of any design tool is to make the best use of the available CLB resources. This process is known as technology mapping [Cong94]. JHDL includes specific TechMappers for every supported FPGA device. Their function is to translate logic functions into their equivalent optimal structures for every FPGA device. Figure 41 is a clear example illustrating the objective of this process. On the left side, there is the structure of a CLB of the Virtex family devices from Xilinx. On the right, the result of mapping a 9-input AND gate, performed by the VirtexTechMapper class, is shown.
To make good use of the resources, the VirtexTechMapper has divided the and9 function into two and4 functions that can be implemented by two of the LUT4s present in the CLB, and has completed the function by using two multiplexers also present in the CLB structure.

To add support for Altera devices in JHDL, we would need to know the structure of the LE (the equivalent of the CLB in Altera technology) for all of their devices and then implement a TechMapper that performs an optimal adaptation of the logic functions to the LE structure. Instead of this, we took a simpler approach: using the LPM standard. The LPM standard [Altera96] allows including high-level elements in the netlist file, e.g. in EDIF format [Edif]. The LPM elements can include parameters and need a logic synthesis step before Place & Route. This previous logic synthesis process ensures an optimal mapping to the LE structure of Altera devices, but it causes a loss of control, from the JHDL viewpoint, over the number of FPGA resources used.

The first step consists in implementing the classes of the primitive logic elements based on LPMs. The primitive logic elements are not decomposable into other simpler elements; they are the leaves of the circuit hierarchy. They also need a behavioral model for simulation: either the propagate function, or the clock and reset functions, must be defined for combinational or sequential logic respectively. We have implemented LPM_AND, LPM_OR, LPM_XOR, LPM_INV, LPM_MUX, LPM_FF, LPM_ADD, LPM_ADD_SUB and LPM_CONSTANT in a new com.Altera.lpm package. These primitives are similar to the primitives for Xilinx devices but, as they have a higher level of abstraction, they make more use of generic parameters.

Once the primitives are implemented, the next step is to develop some technology mapping classes (ApexTechMapper, CycloneTechMapper, StratixTechMapper) that make use of the newly available LPM primitives. Thanks to the parametric nature of the LPM primitives, the mapping process is much simpler than the equivalent process for other technologies like Xilinx.

Finally, it is necessary to develop a custom Netlister due to the differences in the interpretation of EDIF files between Altera and Xilinx tools. The main problem is how GND and VCC signals are handled. Xilinx tools define two primitive logic elements for this purpose. They are like logic gates that have no inputs and drive a constant value. Each VCC or GND connection in a JHDL circuit ends up in an instantiation of one of these custom gates connected to each target. Altera tools do not define such primitives and assume VCC and GND to be global networks of the circuit. In addition, LPMs make heavy use of bidimensional signal arrays, which are not directly supported by the EDIF standard. For these reasons, a custom Netlister called CephisNetlister has been developed to address the particularities of the Altera Place & Route tools.

Behavioral JHDL code has two main advantages over structural code: it is much more human-readable when describing reactive systems and it is faster to simulate. Behavioral synthesis from Java code was proposed in previous works such as GALADRIEL [Cardoso98], NENYA [Cardoso99], Wirthlin's work [Wirthlin01] and Sea Cucumber [Tripp02]. Most of these approaches are based on the analysis of sequential code and do not match the usual RTL-like descriptions used in JHDL behavioral models. All of these methods are based on the analysis of the CFG and DFG derived from the Java code to build either EDIF or VHDL code.

The VHDL language [VHDL98] offers great flexibility to model digital electronic circuits.
Designs can be described at various levels of abstraction (sequential behavior, RTL and structural) and these can even be mixed in the same source code. Since not all descriptions are synthesizable, designers have to know which subset to use in order to avoid rewriting.

The following two code fragments show the same circuit: first a fragment of VHDL code mixing RTL and behavioral coding styles, and then its behavioral JHDL counterpart, which has many similarities.

    LIBRARY ieee;
    USE ieee.std_logic_1164.all;
    USE ieee.std_logic_unsigned.all;

    ENTITY count_a IS
      PORT(clk, rst, updn : in std_logic;
           q : out std_logic_vector(15 downto 0));
    END count_a;

    ARCHITECTURE logic OF count_a IS
    BEGIN
      PROCESS(rst, clk)
        VARIABLE tmp_q : std_logic_vector(15 downto 0);
      BEGIN
        IF rst = '0' THEN
          q <= (others => '0');
        ELSIF rising_edge(clk) THEN
          IF updn = '1' THEN
            tmp_q := tmp_q + 1;
          ELSE
            tmp_q := tmp_q - 1;
          END IF;
          q <= tmp_q;
        END IF;
      END PROCESS;
    END logic;

    import byucc.jhdl.base.*;
    import byucc.jhdl.Logic.*;

    public class Count extends Logic {
      public static CellInterface[] cell_interface = {
        clk("clk"), in("rst", 1), in("updn", 1), out("q", 16)
      };

      Wire q, updn;
      int tmp;

      public Count(Wire clk, Wire rst, Wire updn, Wire q) {
        connect("clk", clk);
        connect("rst", rst);
        this.updn = connect("updn", updn);
        this.q = connect("q", q);
      }

      public void reset() {
        q.put(this, 0);
      }

      public void clock() {
        if (updn.getB(this)) {
          tmp = tmp + 1;
        } else {
          tmp = tmp - 1;
        }
        q.put(this, tmp);
      }
    }

Both programs contain a section where the interface, i.e. the inputs and outputs of the circuit, is defined. Since this is a synchronous circuit, the VHDL process sensitivity list only contains the clock and reset signals; this definition is implicit in JHDL synchronous circuits. The reaction of the circuit to the reset and clock signals is clearly separated in both descriptions and is very similar, with minor syntax differences. Not all VHDL and JHDL circuits are suitable for such a comparison; however, it is applicable to a large number of designs, like FSMDs and reactive systems.

In this case, a new netlister that produces VHDL has been built to substitute the default EDIF netlister. The EDIF format only allows describing the circuit structure, whereas the VHDL language allows describing both structure and behavior in a single language.

The runtime model of the designed circuits can be manipulated by the implemented VHDL netlister to generate the desired output. The netlist generation for structural circuits is straightforward and mimics the approach followed by the EDIF netlister. Behavioral circuits are decompiled to extract the original Java code and translate it into its VHDL equivalent. The advantages of decompiling over parsing the source code are that decompilation is simpler than parsing, it can assume there are no syntax errors in the input and, moreover, the source code does not need to be located. The selected decompilation framework is the open source project JODE [Hoenicke01].

The translation to VHDL has three main blocks: the interface declaration, the variable declaration and the description of the behavioral process. The interface declaration contains the ENTITY clause and can be easily created from the runtime model; JHDL offers methods like getPortRecords that enable a full exploration of any circuit interface. The variable declaration section can be derived neither directly from runtime information nor from the member variables of the decompiled class, because some member variables may not be used by the behavioral model.
So, a deeper analysis of the usage of member variables in the behavioral model is needed to determine them and create the final VHDL section.

VHDL behavioral circuits contain the PROCESS keyword with a sensitivity list containing the signals that trigger a change of state in the circuit. In synchronous circuits, the sensitivity list always contains the clock and reset signals, and the process body contains the functional description derived from the translation of the clock and reset methods:

    if (reset = '1') then
      -- translation of the reset method
    elsif clk'event and clk = '1' then
      -- translation of the clock method
    end if;

On the other hand, combinational circuits can be expressed as a process with all the input signals as part of the sensitivity list. The body of the process is translated from the propagate method.

The structure of the control instructions in VHDL and Java is quite similar, so the translation process consists in adapting the final rendering process of JODE to generate VHDL instead of Java. Besides expressions and blocks, there are significant differences in how both languages handle variables and signals. In VHDL, signals are assigned using the <= operator and variables are assigned with :=. VHDL is a strongly typed language, so variables and signals have to have the same type and width to interoperate; to bypass these rules, conversion functions can be used. JHDL uses get and put to obtain and assign the values of signals. The behavioral models can use a few primitive types like boolean, int, long and the bit-vector (BV) class, and there are multiple versions of the get and put methods that accept these primitive types. It is necessary to keep track of the variables and signals used in the behavioral model, together with their type and size. This information is used to know which conversion function has to be applied in each situation during the translation process.

Signal width information is very important in VHDL. For instance, when assigning a constant to a signal, the constant must have exactly the same length as the signal it is assigned to, i.e. the same number of binary digits. Obtaining the width of the design elements is therefore crucial for a correct conversion. Design wires can be handled easily, since wire width is a fundamental property of JHDL designs and can be easily obtained. Design variables are a little bit trickier: variables are part of the behavioral description and often have Java types like int, long or boolean. We need to define an equivalent VHDL data type for each possible variable type so we can propose a width for design variables:

    JHDL data type    VHDL data type
    boolean           std_logic
    int               integer
    long              std_logic_vector(63 downto 0)
    BV(n)             std_logic_vector(n-1 downto 0)

The behavioral models can use expressions that contain operations involving variables and signals. The two languages have few operators in common, so an equivalence table is needed to perform the translation:

    Java operator     VHDL operator
    =                 :=
    ==                =
    !=                /=
    &                 and
    |                 or
    ^                 xor
    !                 not

As an example, the following code has been created with the translator.
    package org.cephis;

    import byucc.jhdl.base.*;
    import byucc.jhdl.Logic.*;

    public class SAdd extends Logic implements com.Altera.BehaviourallyModeled {

      public static CellInterface[] cell_interface = {
        in("start", 1), in("ops", 8), out("sum", 8)
      };

      Wire start, ops, sum;   // Wires
      int a, b, state = 0;    // FSM state

      public SAdd(Node parent, Wire start, Wire ops, Wire sum) {
        super(parent);
        this.start = connect("start", start);
        this.ops = connect("ops", ops);
        this.sum = connect("sum", sum);
      }

      public void reset() {
        state = 0;
        sum.put(this, 0);
      }

      public void clock() {
        switch (state) {
          case 0: // idle
            if (start.getB(this)) state = 1;
            break;
          case 1: // fetch A
            a = ops.get(this);
            state = 2;
            break;
          case 2: // fetch B
            b = ops.get(this);
            state = 3;
            break;
          case 3: // output
            sum.put(this, a + b);
            state = 0;
            break;
        }
        setDefaultValues();
      }

      public void setDefaultValues() {
        if (!sum.hasBeenPut()) sum.put(this, sum.get(this));
      }
    }

And the VHDL code generated by the translator:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_arith.all;

    entity SAdd is
      port (
        c     : in  std_logic;
        start : in  std_logic;
        ops   : in  std_logic_vector(7 downto 0);
        sum   : out std_logic_vector(7 downto 0);
        reset : in  std_logic
      );
    end SAdd;

    architecture JHDL of SAdd is
    begin
      SAdd: process(reset, c)
        variable b     : integer;
        variable a     : integer;
        variable state : integer;
      begin
        if (reset = '1') then
          state := 0;
          sum <= conv_std_logic_vector(0, sum'length);
        elsif c'event and c = '1' then
          case state is
            WHEN 0 =>
              if (conv_integer(start) /= 0) then
                state := 1;
              end if;
            WHEN 1 =>
              a := ieee.std_logic_unsigned.conv_integer(ops);
              state := 2;
            WHEN 2 =>
              b := ieee.std_logic_unsigned.conv_integer(ops);
              state := 3;
            WHEN 3 =>
              sum <= conv_std_logic_vector(a + b, sum'length);
              state := 0;
            WHEN OTHERS =>
          end case;
        end if;
      end process;
    end JHDL;

Behavioral JHDL descriptions are intuitive and convenient, but they were limited to the model known as RTL in most hardware design languages. To effectively follow a refinement process from a software specification, it is much more convenient to describe behavior in a sequential way, with some extensions to incorporate the notion of time and the parallel execution of statements. Most high-level languages, such as Handel-C, SystemC, VHDL and Verilog, allow using a programmer-friendly sequential description model which adds time semantics by means of wait statements. This level of abstraction is called the "behavioral model" in most HDLs, a bad choice from my point of view since RTL is also behavioral; it would be less ambiguous to use the term "sequential model". The original JHDL implementation included this design style with the HWProcess class, but its support was bound to the Single Clock Simulator and was lost when the Multi Clock Simulator was introduced.

The SystemC SC_CTHREAD construct forces processes to be described in a sequential way, which is more programmer-friendly. Since regular sequential descriptions have no time semantics, these must be incorporated through a language extension: the special instruction wait is used for this purpose. Any code between two consecutive wait statements must be executed in the same clock cycle. A SystemC SC_CTHREAD usually involves the creation of a real thread in the simulator process. During system execution, when the wait function is called the thread passes to a suspended state. All the SC_CTHREADs of the system behave in the same way. The SystemC simulator has to wait until all SC_CTHREADs are suspended before advancing the clock value and propagating signals. After this step the SC_CTHREADs return to the execution state.
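The following minimal Java sketch (illustrative names only, not the actual JHDL or ThreadedLogic code) shows the kind of monitor-based handshake needed to emulate this suspend/resume behavior on top of a cycle-based simulator: the worker thread blocks inside its wait call until the simulator signals that a new clock cycle has been applied.

    // Minimal sketch of a wait/notify handshake between a sequential worker thread
    // and a cycle-based simulator kernel. Names are illustrative; this is not the
    // actual ThreadedLogic implementation.
    public class SequentialBlockSketch {
        private final Object monitor = new Object();
        private boolean clockPending = false;   // plays the role of the bW flag

        /** Called by the worker thread instead of SystemC's wait(). */
        public void scWait() throws InterruptedException {
            synchronized (monitor) {
                clockPending = true;            // announce that we are waiting for a clock
                monitor.notifyAll();
                while (clockPending) {          // block until the simulator clears the flag
                    monitor.wait();
                }
            }
        }

        /** Called by the simulator kernel once per clock cycle. */
        public void clockCycle() throws InterruptedException {
            synchronized (monitor) {
                while (!clockPending) {         // wait until the worker reaches scWait()
                    monitor.wait();
                }
                // ... here the kernel would sample/drive the block's wires ...
                clockPending = false;           // release the worker for the next cycle
                monitor.notifyAll();
            }
        }
    }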
To revive the sequential design style, I have created a new class, ThreadedLogic, that supports the wait statement. All classes derived from ThreadedLogic must implement the functions thread_clock, thread_reset and thread_run. When an object of a ThreadedLogic-derived class is instantiated, the ThreadedLogic constructor automatically creates a worker thread that will call the thread_run method. The thread_run method should use the different flavors of the sc_wait function to advance the clock. All operations between two consecutive calls to the sc_wait function occur in the same clock cycle, as happens in SystemC or Handel-C.

To achieve this goal, the synchronization primitives of Java are used. The sc_wait function places a wait operation on the current object, which causes the thread to block; a notify is issued when the JHDL simulator calls the clock method of the ThreadedLogic class, after calling thread_clock. Special care must be taken with the synchronization of the worker thread with the rest of the system: different interleavings of the calls to sc_wait and thread_clock can produce unexpected results. A formal approach is followed to avoid this kind of non-determinism. We define a global invariant, as described in [Mueller01], in the following way, bW being an indication that there are pending clock cycles to run:

    clock method:               worker thread (sc_wait):
      await bW = true             await bW = false -> bW = true
      assign output values        await bW = false
      bW = false

The resulting ThreadedLogic class ensures that no race conditions will occur and reopens the richness of design styles to allow sequential descriptions.

JHDL allows defining platform models [Bellows04]. There are several supported platform models: Wildcard [Wildcard], SLAAC [Slaac], and Osiris [Osiris]. A platform model includes the description of the details of a hardware board that can host a design. These details include the characteristics of the resources mounted on the PCB, like memories and oscillators; the characteristics of the programmable element (FPGA), like pin details; and possible optional IP cores like external interfaces. A hardware model also includes an API to be able to communicate with the instantiated hardware from Java applications.

A hardware model for the PLD Applications PCI-X board (Figure 44) has been created. The board contains an Altera Stratix S30 FPGA device, four LEDs, a 100 MHz oscillator and two DDR memory banks. The model (Figure 45) includes the description of the external devices related to the FPGA and some IP cores that implement interfaces to them (PCI controller, DDR controller, LED interface and clock interface).

One of the nice features of JHDL is that the circuit object model can be manipulated at run time. In fact, one of the original goals of JHDL was to support Runtime Reconfiguration [Bellows98], which was to be addressed by using Java object construction/destruction as a method to dynamically program and release circuits on the FPGA. We manipulate the circuit design by replacing a designer-selected circuit with an implementation that redirects its input/output wires to its real hardware implementation.
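A hypothetical sketch of such a redirector (class and helper names are assumptions, not the actual Jumble classes) gives an idea of the mechanism: its clock method forwards the simulated input wires to the hardware through the scan interface, advances the hardware clock by one cycle, and drives the simulated output wires with the values read back.

    import byucc.jhdl.base.*;
    import byucc.jhdl.Logic.*;

    // Illustrative sketch of a redirector cell: it mimics the interface of the
    // replaced block but, instead of computing anything, talks to the hardware
    // implementation through a general redirector. Names are hypothetical.
    public class RedirectorSketch extends Logic {
        private final Wire[] inputs;
        private final Wire[] outputs;
        private final GeneralRedirector hw;   // assumed wrapper over the board registers

        public RedirectorSketch(Node parent, Wire[] inputs, Wire[] outputs,
                                GeneralRedirector hw) {
            super(parent);
            this.inputs = inputs;
            this.outputs = outputs;
            this.hw = hw;
        }

        public void clock() {
            // 1) Sample the simulated inputs and shift them into the hardware scan chain.
            for (int i = 0; i < inputs.length; i++) {
                hw.setInputBit(i, inputs[i].getB(this));
            }
            hw.shiftIn();
            // 2) Advance the hardware clock by exactly one cycle.
            hw.runClocks(1);
            // 3) Shift the outputs back and drive the simulated output wires.
            hw.shiftOut();
            for (int i = 0; i < outputs.length; i++) {
                outputs[i].putB(this, hw.getOutputBit(i));
            }
        }

        /** Assumed low-level interface to the scan chain and clock control registers. */
        public interface GeneralRedirector {
            void setInputBit(int index, boolean value);
            boolean getOutputBit(int index);
            void shiftIn();
            void shiftOut();
            void runClocks(int n);
        }
    }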
Although the designer is shielded from the underlying details, a number of systems are involved in this process.

The JHDL simulator is the application that hosts the circuit blocks being simulated. It presents a GUI that includes some buttons to control the clock advance, and a command-line interface. The command line is used by the user to instruct which block has to be executed on the hardware platform.

When the user orders the execution of a block in hardware, the original block is removed and replaced by a redirector. The ability to dynamically modify the hierarchy of the circuit by adding, removing or substituting design entities during a simulation session is unique to JHDL and becomes crucial for this work. The redirector copies the interface of the substituted class to keep an accounting of the input and output pins and their widths. This block contains a behavioral model that redirects inputs and outputs and controls the real hardware clock advance using a general redirector.

The general redirector has low-level control of the hardware platform interface by reading and writing its memory-mapped registers. On the one hand, it controls the transfer of inputs and outputs by commanding the scan chain. On the other hand, it controls the clock advance. In our case we always advance the clock by one step, but future applications could use a different approach.

The developed PLD board platform model includes a Java Native Interface (JNI) class to access OS-dependent communication primitives such as hardware detection, and reads and writes to memory-mapped registers. The main functions of the PCIXNative class are open, close, writeMem and readMem. In addition, another class (RenablePLDA) has been developed to control the execution of a utility application that forces PCI re-enumeration.

The final software interaction with the hardware platform is performed by a kernel driver. The development of a kernel device driver is a complex and error-prone task; bugs in this kind of software are catastrophic, since they cause the machine to reboot. So, for this kind of work, it is better to use a commercial driver such as WinDriver. The basic low-level functions are encapsulated into a Windows DLL to mitigate the hassles of kernel-mode programming. Nevertheless, there is an issue that is not covered by the DLL, which is the resource negotiation with the OS. PCI devices are Plug & Play. This means that the OS resources they need (memory ranges, I/O ranges, and interrupts) are flexible and are determined by a central resource manager, which is part of the OS. This avoids conflicting address spaces or interrupts like we had in the old ISA days. There is a special PCI configuration mode that allows the resource manager to resolve the resource needs of each plugged device during a process known as PCI enumeration. However, enumeration is only done at boot up, or after device insertion for hot-swappable devices. The user can force this process for new devices from the Windows Device Manager (as seen in Figure 47), but to rerun the negotiation for an already connected device, the device has to be manually disabled and enabled.
When the FPGA device of the PLD board is configured through JTAG, the host computer does not receive any event that causes a re-enumeration of the PCI bus. A re-enumeration is needed to reassign the resources needed by the device. Since the previously described way of forcing the re-enumeration is not practical for an automated framework, a utility application has been developed to programmatically disable and enable the device and resolve its resource needs.

The Windows kernel driver exposes very simple functions via IOCTLs to user-mode applications. These functions are basically read/write operations to memory.

The FPGA design is based on the developed hardware platform model. It combines a set of predefined blocks, like the PCI-X interface and the clock interface, with the user block wrapped by the scan chain and the register-based control interface.

PLD Applications provides a PCI-X core together with the PCI-X platform. The core is highly configurable via a Core Configuration Wizard (see Figure 48), supporting various advanced PCI features. A specific set of parameters has been used, and a VHDL implementation of this version of the core has been included as part of the platform model.

The host interface consists of three registers accessible from the PCI-X bus: the Scan Chain Control (SCC) register, the Scan Chain Data (SC) register and the Clock Control (CC) register.

The original block, selected by the user for hardware execution, is the central part of the FPGA design. The design inputs and outputs are wrapped by boundary scan registers controlled by the register interface. The target block runs in a different clock domain.

The process of hardware substitution is not implemented as a single push button but is separated into four steps, as shown in Figure 49. The makewrapper command writes the Java source code of the JHDL circuit that will go into the FPGA and compiles it. This circuit contains the target block wrapped with the scan chain, a clock control unit and the commercial PCI interface. The makeexe command loads the compiled class into the PLD Applications platform environment, produces the netlist, and calls the Quartus tools to end up with a binary file. The download command uses the Quartus programmer tool to download the created .SOF file into the board. It also calls a custom application that disables and enables the driver so that PCI re-enumeration is done. Finally, the replace command substitutes the selected block with a redirector to its implementation in the hardware platform.

In order to experimentally evaluate the expected benefits of Jumble simulation, I have developed three different designs of increasing complexity. In each example design I have selected different parts of the circuit for Jumble replacement (hardware execution) and measured the simulation speedups achieved this way.

The first test is performed with a simple system: a median filter application based on [Maheshwari97]. In this example the powerful testbench facilities can be clearly shown.
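For reference, a median filter replaces each pixel with the median of the pixels in a small window around it; the following minimal software sketch of the per-pixel operation assumes a 3x3 window purely for illustration.

    import java.util.Arrays;

    // Per-pixel operation of a median filter (illustrative sketch, 3x3 window assumed,
    // interior pixels only).
    public final class MedianFilterSketch {

        /** Returns the median of the 3x3 neighborhood centered at (x, y). */
        public static int medianAt(int[][] image, int x, int y) {
            int[] window = new int[9];
            int k = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    window[k++] = image[y + dy][x + dx];   // gather the neighborhood
                }
            }
            Arrays.sort(window);                           // a hardware version would use
            return window[4];                              // a sorting network instead
        }
    }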
We have developed two custom modules to integrate picture viewing into the schematic viewer. SchematicImageSource takes the path of a JPEG image in the host file system, decompresses it, and renders the image in the schematic viewer. Its behavioral model outputs one RGB pixel of the image per clock cycle following a row-scan order. SchematicImageSink receives the RGB value and the coordinates of a pixel each clock cycle and renders them into an image viewable from the schematic view. This allows a straightforward environment for the verification of the system; a comparable testbench in VHDL or Verilog would be extremely complex, if possible at all.

Figure 50 shows the schematic view of the system. The original image with added noise generates the signals that are sent to the median filter, which in turn sends the results to the image sink. The simulation is completed by instructing the simulator, through the interpreter command line, to run 90,000 clock cycles. The simulation is run twice, the first time with the default behavior and the second one replacing the median filter with its hardware version.

The speedup factor ranges between one and several orders of magnitude, depending on the complexity of the circuit being replaced and the capacity of the host computer. Table 4 shows some example circuits and the speedup factor achieved when Jumble HIL simulation is used instead of a regular simulation. The host system consists of a PC with a hyperthreaded Pentium IV CPU running at 2.80 GHz with 512 MB of RAM. The first design is the example shown above, consisting of a simple median filter circuit applied to a noisy input image. The standard simulation uses 150 seconds to simulate 90,000 cycles, enough cycles to produce the final filtered image. The Jumble simulation takes only 3 seconds, which yields a speedup factor of 50. The second and third examples consist of a different design that chains a number of median filter units: in the first case the number of chained filters is 4 and in the second case it is 10. The Jumble simulation time remains constant, as the interface to the hardware implementation is equivalent in all cases. The hardware implementation complexity increases with each design, but it still runs at the same speed because the clock frequency remains the same and hardware is inherently parallel.

Table 4: Simulation times and speedup factors obtained with Jumble HIL simulation
    Design                      Std. Simulation   Jumble Simulation   Speedup Factor
    Median Filter               150 s             3 s                 50
    4 Median Filter Datapath    754 s             3 s                 251.3
    10 Median Filter Datapath   1902 s            3 s                 634

The system is limited by the capacity of the FPGA. Obviously, designs that do not fit in a single FPGA cannot be completely simulated with Jumble. As an alternative, not the whole system but a subset can often be downloaded to achieve some speedup.

Table 5 gives the details of the resource usage of the second design example. The design uses a small fraction of the FPGA resources. The most important contribution to resource usage comes from the circuit under test, followed by the PCI-X core instance. The clock control and the LED interface have a very low contribution to resource usage. The third largest contribution comes from the scan and control system, which nonetheless represents an acceptable overhead.
" +; 8 1 ( Block of the design LEs Memory bits PCI-X interface 2253 (39%) 1152 (2%) Clock interface 57 (1%) 0 (0%) LED interface 76 (1%) 0 (0%) 4 Median Filter Data Path 3127 (54%) 61156 (98%) Scan & Control 309 (5%) 5826 (17% of device total) 0 (0%) 62308 (1% of device total) TOTAL ) The second set of tests is performed with an OCR system based on [Castells05],[Castells06]. The aim of the design is to include it into a commercial system (Figure 51) that will allow the remote reading of water meters by attaching a device on top of conventional mechanical meters, which will periodically take a picture, extract the counter reading and transmit it to a remote system through a wireless connection. The system has been patented [Ayuso06] and is currently commercialized by the company Mirakonta. 4-61 9 54 $ 11 & -$ + 6 ' # # # In a first version, the automated meter reading (AMR) device was only taking the picture and transmitting it to the remote system. But in order to extend battery life by reducing transmission time it was necessary to perform optical digit recognition inside the device. The challenge was to use the unused resources of the existing low cost FPGA, which was mainly used for the control of the radio link and ultra low power management. This is a tough goal since good OCR algorithms rely on performing several complex analysis steps on the images and need a non-negligible amount of memory and a microprocessor indeed. + 1 # The result was a novel optical digit algorithm very well suited for our specific problem that takes ideas from cross-crossing OCR algorithms to produce a symbol string and use sequence alignment algorithms, often used in genomics, to identify the best matching sequence with a given set of predefined digit patterns. The idea is quite simple. An image sensor produces pixels in a row-scan fashion. Segmentation and binarization can be performed by some simple data flow processing circuits. Once we have segmented and binarized a character, we process each row of the character and produce a symbol. The symbol basically classifies the row into a set of observed row patterns (Figure 53 a). The symbol generation does not require intermediate memory since a simple FSM that analyzes the occurrence of white pixels depending on the pixel position can be used (Figure 53 b). a) b) 5 " $) - & & R 1 & (c>w) e/R % @ 7 0 . 0 + 1 / r++ 0 & (c>w) & (r>5) 1& (c w) * 0 0 000 e/W e/W U 0 & (c>w) & (r 5) 0 2 W S 0 & (c w) 1 & (c w) L e/S e/L D e/D 1 & (c>w) + *8 #" ; "* 16 #" As a result, four sequences of symbols are created, one for each counter digit. When the sequence generation is complete a custom algorithmic machine is used to compute the maximum alignment of the sequences with a set of test patterns that describe the ideal sequence of each digit. The alignment is computed using the Smith-Watterman algorithm (24). " G = − 0" G − 0 − 0" G − " G −0 − +4"G (24) The digit reference pattern that obtains the higher score is chosen as the recognized digit. This process is repeated for each digit of the AMR. The overall system’s block diagram is depicted in Figure 54. # ! & 9 A 8 % < ! < ! < 8 < - 2 / < ! 1 & < + 8 1 # = ' # Most of the design units are easily implemented using an structural design style with only an exception: the EditDistanceProcessor which is implemented following a RTL design style since it has greater behavioral complexity. Figure 55 shows an schematic view of the implemented system. 
The RowColDeriver module processes the synchronization signals from the image sensor to create row and column coordinates for each pixel. The Downsampler module takes the red pixels from the Bayer pattern produced by the sensor, resulting in a downsampled monochrome (red channel) image. Since the digit positions are fixed in relation to a configurable point, the LocationGenerator module takes this point to derive all the digit positions, which are then used by the LocationMatch module to determine when the sensor is producing a pixel that is part of a digit. Since the image sensor produces a noisy image, the MedianFilter module filters the image, which is later binarized by the Threshold module using the max and min values obtained by the WindowMean module during the previous frame. The binary pixels identified as part of a digit are processed by the SymbolEncoder, and the resulting symbols are stored in a sequence memory by the SymbolWriter. Finally, the EditDistanceProcessor computes the local alignment algorithm, producing the recognized digit values.

The testbench for such a system is not straightforward: first we need an accurate model of the sensor, and then we need to view the results of the most sensitive parts of the process in order to easily identify possible errors. Two issues are of special interest: the resulting binary digits and the sequence patterns produced by the symbol encoder. By looking at the image of the binary digits we can immediately identify whether an error occurred in the segmentation or the binarization phase. Errors in the sequence production phase are also evident if the produced sequences can be seen in text form. Some custom schematic modules have been developed to allow such a rich interactive test environment (see Figure 56).

Additionally, unit tests for each module have been developed and simulated. Table 6 shows the simulation time for each unit test with and without using Jumble for the circuit under test. The maximum achieved speedup is 29.39, much lower than what was achieved with the previous median filter design. On the one hand, as the complexity of the design units is relatively low, the standard simulations do not run so slowly. On the other hand, the redirected blocks have a relatively large interface to synchronize at each clock cycle of the simulation (e.g. 130 bits for the complete system), which slows down the Jumble simulation.

Table 6: Unit test simulation times with and without Jumble
    Design                            Std. Simulation   Jumble Simulation   Speedup Factor
    RowColDeriver unit test           107 s             92 s                1.16
    Downsampler unit test             190 s             87 s                2.18
    LocationGenerator unit test       553 s             155 s               3.56
    LocationMatch unit test           715 s             226 s               3.16
    MedianFilter unit test            2,292 s           159 s               14.41
    WindowMean unit test              3,318 s           286 s               11.60
    Threshold unit test               3,662 s           244 s               15.00
    SymbolWriter unit test            3,951 s           266 s               14.85
    EditDistanceProcessor unit test   3,939 s           134 s               29.39
    OCR System                        3,426 s           227 s               15.09

The resource usage of the design, which can be seen in Table 7, is quite low, as intended by the project requirements.

Table 7: FPGA resource usage of the OCR design entities
    Design Entity                                              LEs                          Memory bits
    RowColDeriver                                              34                           0
    Downsampler                                                9                            0
    LocationGenerator                                          111                          0
    LocationMatch                                              35                           0
    MedianFilter                                               263                          3120
    WindowMean                                                 200                          0
    Threshold                                                  9                            0
    SymbolWriter                                               40                           0
    SymbolWriter                                               20                           0
    EditDistanceProcessor                                      871                          0
    Complete System (includes instrumentation and memories)    4,321 (13% of device total)  139,440 (4% of device total)

The third set of tests is performed on several IDCT designs. The different IDCT designs are tested from a complex testbench consisting of a full MPEG-1 [Mpeg1] decoder written in Java.
The original code used multithreading and synchronization between threads to perform the various tasks of the decoder. The code has been refactored to allow its integration into the JHDL simulation framework. To perform this refactoring, first a software implementation of the IDCT process was implemented and wrapped into a new JHDL circuit. This IDCT implementation was used to identify the circuit interface but had no notion of time, which means that a valid result was produced immediately, in a single clock cycle. Nevertheless, the circuit activation is based on the use of two signals, start and busy (inspired by the Bluespec methodology [Arvind04]), to make the circuit independent of the number of cycles needed to complete the processing. Next, the rest of the MPEG decoder code was wrapped in a new JHDL circuit, which includes all the necessary ports to communicate with the external IDCT modules. While the IDCT block is designed in an RTL way, which in JHDL terminology is called a behavioral model, this design style is not convenient for the rest of the circuit, as we already have a sequential implementation of the circuit that we would like to reuse with minor modifications.

So, finally, the Mpeg1Decoder uses a sequential behavioral design style and consists of a ThreadedLogic-derived class with a simple interface, in which the original main loop of the code has been moved inside the thread_run method, which calls the sc_wait function as needed when interfacing with the external IDCT circuit. The sequential code does not directly access the wire values through the get and put methods but modifies the values of interposed variables; a call to sc_wait leads to an invocation of the thread_clock function, which in turn uses the interposed variables to drive the wires.

    public class Mpeg1Decoder extends ThreadedLogic {
      public static CellInterface[] cell_interface = {
        in("reset", 1),
        out("x", 8), out("y", 8), out("rgb", 24), out("set", 1),
        out("dct_start", 1), in("dct_busy", 1),
      };

      ...

      Mpeg1Decoder(Node parent, Wire reset, Wire x, Wire y, Wire rgb, Wire set,
                   Wire dct_start, Wire dct_busy, Wire[] dct_ins, Wire[] dct_outs,
                   File file) {
        super(parent);
        // basic initialization
        ...
      }

      public void reset() {
        x.put(this, 0);
        y.put(this, 0);
        rgb.put(this, 0);
        set.put(this, 0);
        dct_start.putB(this, vdct_start = false);
      }

      public void thread_clock() {
        x.put(this, vx);
        y.put(this, vy);
        rgb.put(this, vrgb);
        set.putB(this, vset);
        dct_start.putB(this, vdct_start);
        for (int i = 0; i < 64; i++) {
          dct_ins[i].put(this, vdct_ins[i]);
        }
      }

      public void thread_run() {
        // Original code main loop
        ...
      }
    }

To enhance the verification experience, a schematic image viewer is used to get immediate feedback about the correctness of the system. So, instead of diving into huge waveforms or analyzing endless traces, we can just check whether the image resulting from the decoding process is the expected one in the circuit schematic view (Figure 57). Note again that this is shown interactively at simulation time, so we can still use all the other standard features, like waveforms, to detect a flaw in the design.

The DCT consists of a transformation of an N×N image block from the space domain to the frequency domain. This transformation of the data gives no compression by itself. The MPEG standard uses a value of 8 for N. In this way, simple implementations can be designed, both in hardware and in software, with reasonable requirements of memory and computational load.
The Discrete Cosine Transform

The DCT consists of a transformation of an NxN image block from the spatial domain to the frequency domain. This transformation gives no compression by itself. The MPEG standard uses a value of 8 for N; in this way, simple implementations can be designed, both in hardware and in software, with reasonable requirements of memory and computational load.

The mathematical expression of the DCT is

    H(u,v) = (2/N) C(u) C(v) Sum_{x=0..N-1} Sum_{y=0..N-1} F(x,y) cos[(2x+1)u*pi/(2N)] cos[(2y+1)v*pi/(2N)]    (25)

where

    C(w) = 1/sqrt(2) for w = 0;  C(w) = 1 otherwise    (26)

If we define the matrix T as the DCT transform of the identity matrix, we can rewrite the DCT expression in matrix form as

    H = T F T^T    (27)

Using the orthogonality property of T, the inverse transform (iDCT) can be written as

    F = T^T H T    (28)

A typical approach to simplify the computation of the iDCT transform (and the DCT as well) is to separate it using a row-column decomposition method. Using the matrix expression, this can be done by introducing a new variable Z. We end up with an expression for F consisting of two multiplications involving the same constant matrix T:

    Z = T^T H,    F = Z T    (29)

iDCT implementation alternatives

There are many possible designs to implement the iDCT operation. The designs can be classified by the method they use: some use the row-column decomposition method (RCM) and others are not based on it (NRCM). The designs can also be classified by their input/output interface, which can be either serial or parallel.

[Table 8: comparison of published IDCT designs (Cho, Chang, Gong) in terms of number of multipliers, number of adders, latency, cycles per block and speed in pixels per cycle; the numeric values could not be recovered from the source.]

If we look at the resulting form of the T matrix after computing (25), we get a matrix of the form

        | h  h  h  h  h  h  h  h |
        | e  g  i  k -k -i -g -e |
        | f  j -j -f -f -j  j  f |
    T = | g -k -e -i  i  e  k -g |
        | h -h -h  h  h -h -h  h |
        | i -e  k  g -g -k  e -i |
        | j -f  f -j -j  f -f  j |
        | k -i  g -e  e -g  i -k |

where e = cos(pi/16)/2, f = cos(2pi/16)/2, g = cos(3pi/16)/2, h = cos(4pi/16)/2, i = cos(5pi/16)/2, j = cos(6pi/16)/2 and k = cos(7pi/16)/2.

It is interesting to note that all the coefficients are constant values. The cost of a constant multiplier in hardware is much lower than the cost of a generic multiplier, so my proposal is to use constant multipliers instead of generic ones to compute the iDCT transform with a large combinational circuit when possible. However, some variations can be made in order to decrease the number of functional units used (multipliers, adders or subtractors) while introducing some sequencing. The following subsections analyze some of these design variations.

Brute force approach

We can simply implement all the operations of the matrix multiplication. For each cell we have N multiplications and N-1 additions. As we compute N^2 cells, the number of multipliers is N^3 and the number of adders is N^3 - N^2. As we are working with N = 8, this gives 512 multipliers and 448 adders.
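As a purely software reference for equations (27)-(29), the following Java sketch computes the 8x8 iDCT as two multiplications involving the constant matrix T. It is an illustrative floating-point model (class and method names are assumptions), not the hardware design itself.

    // Illustrative software reference for (27)-(29): the 8x8 iDCT computed as
    // a column pass (Z = T^T * H) followed by a row pass (F = Z * T).
    public class IdctReference {
        static final int N = 8;

        /** T[u][x] = C(u)/2 * cos((2x+1)*u*pi/(2N)): the DCT matrix used in (27). */
        static double[][] dctMatrix() {
            double[][] t = new double[N][N];
            for (int u = 0; u < N; u++) {
                double c = (u == 0) ? 1.0 / Math.sqrt(2) : 1.0;
                for (int x = 0; x < N; x++)
                    t[u][x] = 0.5 * c * Math.cos((2 * x + 1) * u * Math.PI / (2.0 * N));
            }
            return t;
        }

        static double[][] multiply(double[][] a, double[][] b) {
            double[][] r = new double[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    for (int k = 0; k < N; k++)
                        r[i][j] += a[i][k] * b[k][j];
            return r;
        }

        static double[][] transpose(double[][] a) {
            double[][] r = new double[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    r[j][i] = a[i][j];
            return r;
        }

        /** F = T^T * H * T, evaluated as in (29): Z = T^T * H, then F = Z * T. */
        static double[][] idct(double[][] h) {
            double[][] t = dctMatrix();
            double[][] z = multiply(transpose(t), h);   // column pass
            return multiply(z, t);                      // row pass
        }
    }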
[Figure: fully combinational implementation of the constant matrix multiplication (brute force approach).]

Product term reuse

If we look at the formulas that produce each result matrix cell, we realize that some product terms appear several times.
For instance, if we take R_{i,3} (30) and R_{i,4} (31), we can see that all the product terms in R_{i,3} also appear in R_{i,4}; the only difference is whether each product term is added or subtracted.

    R_{i,3} = h*I_{i,0} + k*I_{i,1} - f*I_{i,2} - i*I_{i,3} + h*I_{i,4} + g*I_{i,5} - j*I_{i,6} - e*I_{i,7}    (30)

    R_{i,4} = h*I_{i,0} - k*I_{i,1} - f*I_{i,2} + i*I_{i,3} + h*I_{i,4} - g*I_{i,5} - j*I_{i,6} + e*I_{i,7}    (31)

If we reuse the common terms, we end up using 176 distinct product terms.

[Figure: combinational iDCT datapath with product term reuse.]

Row multiplexing

In the previous solutions we compute all product terms in parallel using a big combinational circuit. Each row of R is computed using a single row of I and, as seen in (30) and (31), every row of R is computed with the same formulas. We can see that, for each row, each constant coefficient is multiplied by only a few input matrix values:

    h <- I_{i,0}, I_{i,4}
    f <- I_{i,2}, I_{i,6}
    j <- I_{i,2}, I_{i,6}
    e <- I_{i,1}, I_{i,3}, I_{i,5}, I_{i,7}
    g <- I_{i,1}, I_{i,3}, I_{i,5}, I_{i,7}
    i <- I_{i,1}, I_{i,3}, I_{i,5}, I_{i,7}
    k <- I_{i,1}, I_{i,3}, I_{i,5}, I_{i,7}    (32)

This gives 22 multipliers per row, and considering all 8 rows gives 176, which is the previous result. If we multiplex the rows over 8 cycles, we can produce all the product terms using only 22 multipliers.
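As a plain-software illustration of (32), the following sketch enumerates the 22 constant products needed for one input row, using the coefficient names of the T matrix above. It is a reference sketch under those naming assumptions, not the JHDL module.

    // Illustrative software model of (32): the 22 distinct constant products
    // needed to compute one output row of the iDCT.
    public class RowProducts {
        static final double E = 0.5 * Math.cos(1 * Math.PI / 16);
        static final double F = 0.5 * Math.cos(2 * Math.PI / 16);
        static final double G = 0.5 * Math.cos(3 * Math.PI / 16);
        static final double H = 0.5 * Math.cos(4 * Math.PI / 16);
        static final double I = 0.5 * Math.cos(5 * Math.PI / 16);
        static final double J = 0.5 * Math.cos(6 * Math.PI / 16);
        static final double K = 0.5 * Math.cos(7 * Math.PI / 16);

        /** Returns the 22 products of one input row in[0..7]. */
        static double[] products(double[] in) {
            double[] p = new double[22];
            int n = 0;
            p[n++] = H * in[0];                           // h only multiplies columns 0 and 4
            p[n++] = H * in[4];
            for (double c : new double[] { F, J })        // f and j only multiply columns 2 and 6
                for (int col : new int[] { 2, 6 })
                    p[n++] = c * in[col];
            for (double c : new double[] { E, G, I, K })  // e, g, i, k multiply the odd columns
                for (int col : new int[] { 1, 3, 5, 7 })
                    p[n++] = c * in[col];
            return p;                                     // 2 + 4 + 16 = 22 products
        }
    }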
[Figure: row-multiplexed product-term network using 22 constant multipliers.]

We do the same for the adders' network. We need 56 adders for each matrix row; if we compute the whole matrix at once, this means 56x8 adders, so using only 56 reduces the area usage.

[Figure: shared adder network computing one output row per cycle.]

Constant multiplier sharing

From (32) we can see that each constant multiplier takes at most 4 different input columns. So we can share the constant multipliers among the different terms by serializing the inputs and deserializing the outputs. As a result we use only 7 constant multipliers, which is the minimum possible number of constant multipliers.
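To illustrate the sharing idea, here is a minimal software sketch of one shared constant multiplier serving its (at most) four input columns over successive cycles, with an input serializer and an output deserializer. It is an illustrative model under assumed names, not the JHDL implementation.

    // Illustrative model of one shared constant multiplier: the inputs that need
    // the same coefficient are serialized, multiplied one per "cycle" by a single
    // multiplier, and the results are deserialized back into separate outputs.
    public class SharedConstantMultiplier {
        private final double coefficient;     // e.g. 0.5*cos(Math.PI/16) for coefficient e

        public SharedConstantMultiplier(double coefficient) {
            this.coefficient = coefficient;
        }

        /** Emulates the mux -> multiplier -> demux datapath over inputs.length cycles. */
        public double[] run(double[] inputs) {
            double[] outputs = new double[inputs.length];
            for (int cycle = 0; cycle < inputs.length; cycle++) {
                double selected = inputs[cycle];            // input mux selects one column per cycle
                double product = coefficient * selected;    // the single shared constant multiplier
                outputs[cycle] = product;                   // output demux stores the result
            }
            return outputs;
        }

        public static void main(String[] args) {
            SharedConstantMultiplier e = new SharedConstantMultiplier(0.5 * Math.cos(Math.PI / 16));
            double[] p = e.run(new double[] { 1, 2, 3, 4 }); // products for columns 1, 3, 5, 7
            for (double v : p) System.out.println(v);
        }
    }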
[Figure: iDCT datapath sharing 7 constant multipliers through input serialization and output deserialization.]

Constant multiplier implementation

A constant multiplier can be implemented as a sequence of add and shift operations; I denote this type of constant multiplier as an AS (add-shift) multiplier. For instance, Y = X*15 is expressed as Y = X*1111 in binary notation and can be decomposed as Y = X*2^3 + X*2^2 + X*2^1 + X*2^0. As the shift operation comes for free in digital logic, in this example we only need 3 adders to compute the product. In general, the number of adders used depends on the number of active bits in the constant: the more ones in the constant value, the more adders we use.

However, this can be further optimized if we notice that 15 = 16 - 1. We can rewrite the previous example as Y = X*(16 - 1), which in binary notation becomes Y = X*(10000 - 1) = X*10000 - X, that is, Y = X*2^4 - X*2^0. In essence, introducing the subtract operation yields a much more compact expression that saves computing resources: in this case we only need 1 subtractor to compute the product, compared with the previous 3 adders. I denote this type of constant multiplier as an ASS (add-subtract-shift) multiplier (a short software sketch of both decompositions is included below).

Experimental results

The design variations have been tested in the complex Mpeg1Decoder testbench. The brute force approach has been ignored, as it does not fit into the available FPGA. The first test consists of implementing the product term reuse with AS-type constant multipliers and replacing the whole ConstantMatrixMultiplier module with its FPGA execution. As it is a pure combinational circuit, it is possible to reuse its hardware implementation when using Jumble simulation: since the simulator evaluates the propagate function of each instance at different times, we can redirect the two different constant matrix multipliers towards the single hardware implementation. The second test follows this approach to execute the two ConstantMatrixMultiplier modules present in the iDCT design. The third test introduces the use of ASS-type constant multipliers; about 11% of the area of the original module is saved in this step (see the first and second rows of Table 10). The fourth test repeats the trick of reusing the hardware implementation, but now with the ASS-based constant matrix multiplier. The fifth test uses the design with 22 multipliers and 56 adders, but as this design is smaller, we replace the entire HwIdct block for FPGA execution. Finally, the sixth test uses the design with 7 multipliers and 56 adders.
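As referenced in the constant-multiplier discussion above, this plain-Java sketch contrasts the AS and ASS decompositions by counting the adders/subtractors each needs. It is illustrative only: it assumes positive constants and uses a canonical-signed-digit style recoding for the ASS form.

    // Illustrative comparison of AS (add-shift) and ASS (add-subtract-shift)
    // decompositions of a multiplication by a positive constant. Summing k
    // non-zero partial products needs k-1 adders/subtractors.
    public class ConstantMultiplier {

        /** AS form: one shifted partial product per '1' bit of the constant. */
        static int asOperations(int constant) {
            return Integer.bitCount(constant) - 1;
        }

        /** ASS form: partial products taken from a canonical signed digit recoding. */
        static int assOperations(int constant) {
            return csdDigits(constant).size() - 1;
        }

        /** Canonical signed digit recoding: returns the non-zero digits as +/-(1 << i). */
        static java.util.List<Integer> csdDigits(int constant) {
            java.util.List<Integer> digits = new java.util.ArrayList<>();
            int c = constant;   // assumed > 0
            int shift = 0;
            while (c != 0) {
                if ((c & 1) != 0) {
                    int digit = ((c & 3) == 3) ? -1 : 1;   // a run of ones: emit -1 and carry
                    digits.add(digit * (1 << shift));
                    c -= digit;
                }
                c >>= 1;
                shift++;
            }
            return digits;
        }

        public static void main(String[] args) {
            // Y = X*15: AS needs 3 adders (1111), ASS needs 1 subtractor (10000 - 1)
            System.out.println("15: AS ops = " + asOperations(15) + ", ASS ops = " + assOperations(15));
            System.out.println("15 as signed digits: " + csdDigits(15));   // prints [-1, 16]
        }
    }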
" 7; 6 Design ConstantMatrixMultiplier 2 ConstantMatrixMultiplier ConstantMatrixMultiplier using ASS multipliers 2 ASS ConstantMatrixMultiplier HwIdct (IDCT64_22cm_56ad) HwIdct (IDCT64_7cm_56ad) # Std. Simulation 183,186 s Jumble Simulation 100,489 s 183,186 s 11,619 s 15.76 143,957 s 76,654 s 1.87 143,957 s 11,384 s 12.64 36,133 s 3,569 s 10.12 27,372 s 3,558 s 7.69 Speedup Factor 1.82 As shown in Table 9, simulation speedups go from 2 to 12 approx. It is very interesting to observe that reusing combinational blocks have a great impact in the simulation speedup (goes from 1.82 to 15.76). Also noticeable is that the speedup achieved when replacing the whole HwIdct is lower than the previous one (achieved when replacing some of its parts). This is due to the width of the interface that must be synchronized in both cases; the interface of the HwIdct has 4097 bits while the ConstantMatrixMultiplier has an interface of 2049 bits. The greater the interface, the more time must the simulator dedicate to transfer data to the real hardware. , " (; 8 1 ( Block of the design LEs Complete Design LEs Constant Matrix Multiplier Computing Elements Memory bits Complete Design ConstantMatrixMultiplier 26,800 (82% of device total) 20,576 (63% of device total) 176 M 448 A/S 1,152 (<1% of device total) ConstantMatrixMultiplier using ASS multipliers 24,592 (75% of device total) 18,304 (56% of device total) 176 M 448 A/S 1,152 (<1% of device total) CMatrixMultiplier22 20,879 (64% of device total) 14,716 (45% of device total) 22 M 448 A/S 1,152 (<1% of device total) CMatrixMultiplier56 9,546 (29% of device total) 3,906 (12% of device total) 22 M 56 A/S 1,152 (<1% of device total) HwIdct (IDCT64_22cm_56ad) 14,150 (43% of device total) 3,616 & 3,906 (12% of device total) 22 M 56 A/S 2,097 (<1% of device total) HwIdct (IDCT64_7cm_56ad) 13,680 (42% of device total) 3,330 & 3,733 (11% of device total) 7M 56 A/S 2,097 (<1% of device total) ,, < $ , I have presented a method to interactively select a part of a design during a simulation session and download it into a supported hardware platform for hardware execution. The system could reduce simulation time by some orders of magnitude providing a convenient system for HIL verification, but this would be achieved only if Amdahl’s α is very close to 1. However, in more realistic experiments, I have got a speedup factor of 10 to 30. This could be optimized by implementing faster methods to transfer the data to the interface. In future work, I will try to improve the obtained speedups through the use of larger blocks instead of bit-by-bit scan chain. Another interesting idea is to directly map the input and output interface to the host memory and avoid the use of the shift registers. The actual TJumble is proportional to the width of the interface as formulated in (8). This approach would significantly reduce by a factor of 32 (33) the time spent in inputs and outputs transfers as would benefit from burst PCI transfers, since each PCI bus read/write operation would be enough to place inputs and outputs. A '& = @ ⋅ (33) 5? The current system is limited to download a single block at a time. Future work will address the need to download multiple independent blocks. Another pitfall of the system is that it does not provide a method to verify that the replaced block corresponds to the block that is currently programmed into the hardware platform. 
This problem will be addressed by adding some metadata to the synthesized design so that it can be compared with the object that is instructed to be replaced. The four-command process could also be merged into a single command offering a simple push-button solution for hardware emulation of selected blocks.

Jumble is practical for design verification, but it could also be useful in the deployment of final designs, especially if the designs being verified are intended to be PCI coprocessors. With some more work, and following a process similar to the one already used, we could create an automated way of wrapping the design and creating an API so that hardware functions are directly usable from end-user applications. This API could be a Windows DLL, a COM object or a Java class.

As has been stressed throughout this work, JHDL has the ability to manipulate the circuit hierarchy during simulation sessions. Here this ability is used to substitute a circuit and redirect its interface to its hardware implementation. By doing this, we can compare different simulation sessions and verify that the system is equivalent. However, it can be difficult to ensure the equivalence if we do not keep track of all the signals that get in and out of the circuit, and this is obviously prohibitive for most designs. As a possible improvement, we could avoid removing the software model of the hardware circuit and maintain both the software implementation and its hardware implementation (through the redirector), together with an additional module aimed at verifying their equivalence at each clock cycle. This would eliminate any speedup but could offer a safe intermediate step before going to the pure hardware implementation.

Finally, Jumble can be very useful for complex designs in which simulation costs are considerable. In our research group we are currently working on projects where this is the case, such as NoC and soft-core simulations.

Contributions

In order to clarify the extent of this work, the various contributions are detailed below.

Module                              Language         Authors                         Source Code Files   Source Code Lines
Altera support                      JHDL             Jordi Farré, David Castells     125                 29,491
Median Filter Test Case             JHDL             David Castells                  21                  1,679
Mpeg Test Case                      JHDL/Java        David Castells                  126                 39,236
OCR Test Case                       JHDL             David Castells                  108                 19,154
PCI re-enumerator                   C++              David Castells                  20                  4,339
VHDL netlister                      VHDL/JHDL/Java   Alexis Morugó, David Castells   3                   2,116
PLD Platform Model                  JHDL             David Castells                  49                  5,808
JNI native interface to PLD board   C++/Java         David Castells                  4                   1,961
Wrapping infrastructure             JHDL/Java        David Castells                  20                  3,159
Quartus Automation                  Java             David Castells                  5                   528
Threaded Logic                      JHDL/Java        David Castells                  1                   287
Common utility logic                JHDL             David Castells                  62                  7,567
TOTAL                                                                                544                 115,325

References

[Aldec] Aldec, Inc., http://www.aldec.com [Amdahl67] G.M. Amdahl. Validity of single-processor approach to achieve large-scale computing capability. Proceedings of AFIPS, pp. 483-485, 1967. [Altera96] Altera Corporation, “LPM (Library of Parameterized Modules) Quick Reference Guide”, December 1996. http://www.altera.com/literature/catalogs/lpm.pdf [Altera00] Altera Corporation, “Instantiating LPM in EDIF”. [Altera01] Altera, San Jose CA.
SignalTap Embedded Logic Analyzer Megafunction, April 2001 ver.2.0 [Altera05] Altera Corporation, Altera DSP Builder Reference Manual, January 2005, version 2.1.3, http://www.altera.com [Alpha] Alpha Data Systems, Simulink Board Support Blockset, http://www.alpha-data.com/simulink bsb dsheet.html [Annapolis04] Annapolis Micro Systems, “Wildcard Reference Manual Rev 3.4,” Annapolis Micro Systems, Inc, Annapolis, MD, 2004, (http://www.annapmicro.com/) [Arnold92] J. Arnold, D. Buell and E. Davis, "Splash II", 4th ACM Symposium on Parallel Algorithms and Architectures, San Diego, CA, USA, pp. 316-322, 1992. [Arnold93] J. M. Arnold, "The Splash 2 software environment", in Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, D. A. Buell and K. L. Pocek, Eds., Napa, CA, Apr. 1993, pp. 88-93. [Arvind04] Arvind, Rishiyur S. Nikhil, Daniel L. Rosenband, and Nirav Dave. “Highlevel Synthesis: An Essential Ingredient for Designing Complex ASICs.” in Proceedings of ICCAD'04, San Diego, CA, 2004. [Ayuso06] N. Ayuso, J. Pico, N. Benitez, J. Carrabina, E. Pons, B. Martinez, D. Castells-Rufas, M. Monton, L. Terés, J. Merino, E. Gonzalez, A. Guerendiain, G. Alvarez, C. Amuchastegui. UNIVERSAL RECONFIGURABLE SYSTEM AND METHOD FOR THE REMOTE READING OF COUNTERS OR EQUIPMENT COMPRISING VISUAL INDICATORS, European patent number 11446, 2006. [Axis] Axis Systems, Inc., http://www.axiscorp.com [Babb97] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal. “Logic emulation with virtual wires”. IEEE Transactions on CAD, 16(6):609–626, Jun. 1997 [Ballagh01] J. Ballagh, P. Athanas, and E. Keller, “Java Debug Hardware Models using JBits,” 8th Reconfigurable Architectures Workshop, San Francisco, CA, April 27, 2001. [Banarjee99] P. Banarjee et al, "MATCH: A MATLAB Compiler for Configurable Computing Systems”. Technical Report, Center for Parallel and Distributed Computing, Northwestem University, Aug. 1999, CPDCTR-9908-013. [Banarjee00] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, M. Haldar, P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, M. Walkden, and D. Zaretsky, “A MATLAB Compiler for Distributed Heterogeneous Reconfigurable Computing Systems”, International Symposium on FPGA Custom Computing Machines (FCCM’00) IEEE Computer Society Press, Los Alamitos, Calif., 2000. [Basu98] A. Basu, R. S. Mitra, and P. Marwedel. “Interface synthesis for embedded applications in a co-design environment”. In 11th IEEE International conference on VLSI design, pages 85{90, C, 1998. [Bauer94] T. J. Bauer. “The design of an efficient hardware subroutine protocol for FPGAs”. Master’s thesis, MIT, 1994. [Bauer98] J. Bauer, M. Bershteyn, I. Kaplan, and P. Vyedin. “A reconfigurable logic machine for fast event-driven simulation”. Design Automation Conference, .June 1998. [Bazeghi05] C. Bazeghi, F. J. Mesa-Martinez, J. Renau: “µComplexity: Estimating Processor Design Effort”. Proceedings of MICRO 2005: pp 209-218 [Bellows98] P. Bellows and B. L. Hutchings. “JHDL - an HDL for reconfigurable systems”. In J. M. Arnold and K. L. Pocek, editors, Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, pages 175-184, Napa, CA, April 1998. [Bellows04] Peter Bellows. ”High-Visibility Debug-by-Design for FPGA Platforms”. ERSA 2004: 247-258 [Benitez04] Domingo Benitez, “Análisis de Prestaciones de Coprocesadores Reconfigurables”, in Proceedings of IV Jornadas de Computación Reconfigurable y Aplicaciones JCRA. Barcelona, September, 2004. [Bershad90] B. N. Bershad, T. E. Anderson, E. D. 
Lazowska, and H. M. Levy. "Lightweight Remote Procedure Call". ACM Trans. on Computer Systems, 8(1), February 1990. [Birkner98] J. Birkner, “From Simple PALs to High-Speed, High-Density Leading Edge FPGAs, Their Technologies and Applications”, MAPLD 98 Proceedings, 1998. [Bishop97] W.D. Bishop, W.M. Loucks, “A Heterogeneous Environment for Hardware/Software Cosimulation,” Proceedings of the IEEE Annual Simulation Symposium, 1997, pp. 14-22. [Borgatti96] M. Borgatti, R. Rambaldi, G. Gori, R. Guerrieri, “A Smoothly Upgradable Approach to Virtual Emulation of HW/SW Systems,” Proceedings of the International Workshop on Rapid System Prototyping, 1996, pp. 83-88. [Borgatti97] M. Borgatti, E. Cevenini, R. Rambaldi, M. Felici, A. Ferrari, R. Guerrieri, "Fast board-level prototyping of a speech recognition system using virtual emulation", Proceedings of the 8th IEEE International Workshop on Rapid System Prototyping, 1997. [Budiu02] M. Budiu, M. Mishra, A. Bharambe, and S. C. Goldstein, "Peer-to-Peer Hardware-Software Interfaces for Reconfigurable Fabrics," in FCCM, 2002. [Butts92] M. Butts, J. Batcheller, and J. Varghese, “An efficient logic emulation system”, in IEEE 1992 International Conference on Computer Design: VLSI in Computers and Processors, 1992, pp. 138-141. [Canellas00] Canellas, N., Moreno, J. M., “Speeding up hardware prototyping by incremental Simulation/Emulation”, in Proceedings of the 11th International Workshop on Rapid System Prototyping, 2000. [Cardoso98] J. M. P. Cardoso, H. C. Neto, “Towards an Automatic Path from Java Bytecodes to Hardware Through High-Level Synthesis,” in Proc. of the 5th IEEE International Conference on Electronics, Circuits and Systems, Lisbon, Portugal, September 7-10, 1998, pp. 85-88. [Cardoso99] J. M. P. Cardoso and H. C. Neto, “Macro-based hardware compilation of Java bytecodes into a dynamic reconfigurable computing system,” in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (K. L. Pocek and J. M. Arnold, eds.), (Napa, CA), p. n/a, IEEE, 1999. [Carloni02] L.P. Carloni, F. De Bernardinis, A. Sangiovanni-Vincentelli, and M. Sgroi, “The art and science of integrated systems design”, in Proc. of 28th European Solid-State Circuits Conference (ESSCIRC 2002), 2002, Florence, Italy, pp. 25-36. [Cho01] "An approach to combining emulation and simulation for efficient debugging of system-on-chip design", CAD VLSI, pp. 210-214, 2001. [Chou92] P. Chou, R. Ortega, and G. Borriello. “Synthesis of the hardware/software interface in microcontroller-based systems”. In Proceedings of ICCAD, pp. 488-495, Nov. 1992. [Cong94] Cong, J., and Ding, Y. “On Area/Depth Trade-off in LUT-Based FPGA Technology Mapping”, IEEE Transactions on VLSI Systems, vol. 2, no. 2, pp. 137-148, June 1994. [Cosic92] K. Cosic, I. Kopriva, I. Miller, “Workstation for Integrated System Design and Development,” Simulation, vol. 58, no. 3, Mar. 1992, pp. 152-162. [Çakır01] M. Çakır, E. Grimpe, “ProtoEnvGen: Rapid ProtoTyping Environment Generator”, in Proceedings of VLSI-SOC 2001. [Çakir03] M. Çakir, E. Grimpe, “HW-Driven Emulation with Automatic Interface Generation”, in Proceedings of FPL 2003, pp. 627-637, 2003. [DeHon04] A. DeHon, J. Adams, M. DeLorimier, N. Kapre, Y. Matsuda, H. Naeimi, M. Vanier, M. Wrighton. “Design Patterns for Reconfigurable Computing”. FCCM 2004: 13-23. [Dick01] C. H. Dick and H. M. Pedersen, “Design and Implementation of High-Performance FPGA Signal Processing Datapaths for Software Defined Radios”, Embedded Systems Conference, Apr. 2001.
[Dozza98] D.Dozza, R.Rambaldi, M.Borgatti and R.Guerrieri , “OMI-Compliant Model for Virtual Emulation” in Proceedings of the Ninth International Workshop on Rapid System Prototyping, 1998. [Edenfeld03] D. Edenfeld, A. B. Kahng, M. Rodgers, and Y. Zorian. “2003 Technology Roadmap for Semiconductors”. IEEE Computer, 37(1):47-56, 2004. [Edif] http://www.edif.org [Edwards97] M. Edwards: “Software Acceleration Using Coprocessors: Is it Worth the Effort ?” Proceedings of 5th International Workshop on Hardware/Software Codesign (Codes/CASHE‘97), pp. 135-139, Braunschweig 1997 [Fischer98] F. Fischer, A. Muth, G. Fiirber “Towards interprocess communication and interface synthesis for a heterogeneous real-time rapid prototyping envimnment”. 6th International Workshop on Hardware/Software CO-Design (Codes/CASHE ’98). Seattle, USA, 1998. [Fritsch99] Ch. Fritsch, J. Haufe, Th. Berndt: Speeding Up Simulation by Emulation - A Case Study. in Design, Automation and Test in Europe Conference, Munich 1999, User Forum, 127-134 [George99] George, Alan D., Ryan B. Fogarty, Jeff S. Markwell, and Michael D. Miars, “An Integrated Simulation Environment for Parallel and Distributed System Prototyping,” Simulation, Vol. 75, No. 5, May 1999, pp. 283-294. [Graham00] P. Graham, B. Hutchings, and B. Nelson, “Improving the fpga design process through determining and applying logical-to-physical design mappings”, Technical Report CCL-2000-GHN-1, Brigham Young University, Provo, UT, April 2000. [Graham01] P. S. Graham. “Logical Hardware Debuggers for FPGA-based Systems”. PhD thesis, Brigham Young University, Provo, UT, USA, December 2001. 6 [Graham01b] Paul Graham, Brent Nelson, and Brad Hutchings, “Instrumenting Bitstreams for Debugging FPGA Circuits,” 2001 IEEE Symposium on Field-Programmable Custom Computing Machines, Rohnert Park, California April 29 - May 2, 2001. [Guerendiain05] A. Guerendiain, G. Alvarez, C. Amuchastegui, N. Ayuso, J. Pico, N. Benitez, J. Carabina, E. Pons, B. Martinez, D. Castells, M. Monton, L. Teres, J.L. Merino. ”Método y Sistema Universal y reconfigurable de lectura remota de contadores o equipos provistos de indicadores visuales”. Patent Number: P200500991, Spain, April, 2005 [Hanono95] S. Z. Hanono, "Innerview hardware debugger: A logic analysis tool for the virtual wires emulation system," M.S. Thesis, Massachusetts Univ. Technol., 1995. [Haufe98] J. Haufe, P. Schwarz, T. Berndt, J. Große, “Accelerated Logic Simulation by Using Prototype Boards”. In Proceedings of Design Automation and Test in Europe, Paris 1998, pages 183-189. [Hemani04] A. Hemani, “Charting the EDA Roadmap”, The Chip, IEEE Circuits & Devices Magazine, NovemberDecember, 2004. [Hoenicke01] J. Hoenicke, “Java Optimize http://jode.sourceforge.net/ , 2001 [Hunt02] W. Hunt, "Introduction: Special issue on microprocessor verification", in Formal Methods in System Design, Kluwer Academic Publishers, 2002. [Hutchings99] B. Hutchings, P. Bellows, J. Hawkins, S . Hemmert, B. Nelson, and M. Rytting, "A cad suite for highperformance fpga design," in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (K. L. Pocek and J. M. Arnold, eds.), (Napa, CA), p. n/a, IEEE Computer Society, IEEE, April 1999. [Hutchings00] B. L. Hutchings, B. E. Nelson, "Using general-purpose programming languages for FPGA design," in Proc. 37th Design Automation Conf., Los Angeles, CA, June 2000, pp. 561-566. [Hutchings00b] B. Hutchings, B. Nelson, M. 
Whirthlin, “Designing and Debugging Custom Computing Applications”, IEEE Design & Test of Computers, January-March 2000. [Hutchings01] B. L. Hutchings and B. E. Nelson. “Unifying simulation and execution in a design environment for FPGA systems”. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9:201-205, February 2001. and Decompile Environment (JODE) program”. [Hutchings04] B. Hutchings and B. Nelson. “Giga Op DSP On FPGA”. In Proceedings of ICASSP 2001, May 2001. [Hwang98] S. Hwang, T. Blank, K. Choi, “Fast Functional Simulation : An Incremental Approach,” in IEEE Transactions. CAD, 1988. [Hwang01] J. Hwang, B. Milne, N. Shirazi, J. D. Stroomer, “System Level Tools for DSP in FPGA”, in Proceeding of FPL2001, pp 534-543, 2001. [Indrusiak03] L. S. Indrusiak, F. Lubitz, M. Glesner, R. A. L. Reis, “Ubiquitous Access to Reconfigurable Hardware: Application Scenarios And Implementation Issues”. Proceedings of DATE ‘03, Munich, 2003. p.940 – 945. [Indrusiak05] Indrusiak, L.S.; Prudêncio, R. B.; Glesner, M. Modeling and Prototyping of Communication Systems using Java: a Case Study. In: Proceedings of 16th IEEE International Workshop on Rapid System Prototyping (RSP), 2005, Montreal,Canada. [Jimenez05] D. F. Jiménez, L. S. Indrusiak, M. Glesner, “Proxy-based Integration of Reconfigurable Hardware within Simulation Environments: Improving E-Learning Experience in Microelectronics”, in Proceedings of IEEE International Conference on Microelectronic Systems Education (MSE’05), 2005. [Kahng00] A. B. Kahng, "Futures for DSM Physical Implementation: Where is the Value, and Who Will Pay?", 12th DA Show keynote, July 14, 2000. [Kim96] K. Kim, Y. Kim, Y. Shin, K. Choi, “An Integrated Hardware-Software Cosimulation Environment with Automated Interface Generation,” Proceedings of the International Workshop on Rapid System Prototyping, 1996, pp. 66-71. [Kim04] Y. Kim, W. Yang, Y. Kwon, C. Kyung, “Communication-Efficient Hardware Acceleration for Fast Functional Simulation”, Proceedings of DAC 2004, June, 2004, pp 293-298. [Krukowski99] A. Krukowski and I. Kale. “Simulink/matlab-to-vhdl route for full custom/FPGA rapid prototyping of DSP algorithms”. In Matlab DSP Conference (DSP99), Tampere, Finland, November 16-17 1999. [Krupnova00] H. Krupnova, G. Saucier, “FPGA-Based Emulation: Industrial and Custom Prototyping Solutions”, in Proceedings of FPL 2000. [Kudlugi01] M. Kudlugi, S. Hassoun, C. Selvidge, and D. Pryor. “A Transaction-Based Unified Simulation/Emulation Architecture for Functional Verification”. In ACM/IEEE Design Automation Conference (DAC), June 2001. [Kulmala] A. Kulmala, “HDL Verification – Simulation Engines”, course materials of System Design I (TK2400) of Tampere University of Technology. 6 [Le97] T. Le, F.-M. Renner, M. Glesner, “Hardware in-the-loop Simulation - a Rapid Prototyping Approach for Designing Mechatronics Systems,” Proceedings of the International Workshop on Rapid System Prototyping, 1997, pp. 116-121. [Lee01] Edward A. Lee. “Overview of the Ptolemy Project. Technical Memorandum UCB/ERL M01/11”. March 6, 2001. [Lee01b] Edward A. Lee, "Design Methodology for DSP" Final Report 2000-01, University of California at Berkeley, 2001. [Lee03] Edward A. Lee, Stephan Neuendorfer, and Michael J. Wirthlin. “Actor-oriented design of embedded hardware and software systems”. Journal of Circuits, Systems, and Computers, 12(3):231 – 260, 2003. [Lehmann02] T. Lehmann, “Towards Device Driver Synthesis,” PhD. Thesis, University of Paderborn, 2002. [Liu03] Jie Liu and Edward A. 
Lee, “Timed Multitasking for Real-Time Embedded Software,” IEEE Control Systems, special issue on Software-Enabled Control, vol. 23, no. 1, January, 2003, pp 65-75. [Liu04] Jie Liu, Johan Eker, Jorn W. Janneck, Xiaojun Liu, and Edward A. Lee, “Actor-Oriented Control System Design: A Responsible Framework Perspective” IEEE Trans. on Control System Technology, vol. 12, No. 2, March 2004, pp. 250-262. [Lyr] Lyr Signal Processing, DSP Link, FPGA Link: DSP + FPGA co-design, hardware-in-the-loop cosimulation, http://www.signal-lsp.com/ [Ma03] J. Ma, "Incremental Design Techniques with Non-Preemptive Refinement for Million-Gate FPGAs", PhD Thesis, Virginia Tech, January 2003. [Maheshwari97] R. Maheshwari, S. S. S. P. Rao and P.G. Poonacha, “FPGA implementation of median filter”, 10th International Conference on VLSI Design, Jan’97, pp-523-524. [Mentor] http://www.mentor.com/products/fv/emulation/ [Model] http://www.model.com/ [Molina07] A. Molina, O. Cadenas, “Functional verification: approaches and challenges”. Latin American Applied Research, January 2007, vol.37, no.1, p.65-69. ISSN 0327-0793. [Mpeg1] MPEG1, ISO/IEC 11172-2:1993 “Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s, Part 2: Video” [Mueller01] W. Mueller, J. Ruf, D. Hoffmann, J. Gerlach, T. Kropf, W. Rosenstiehl, "The Simulation Semantics of SystemC", DATE 2001, Munich, 2001 [Muller97] Pierre-Alain Muller, “Modelado de objectos con UML”. Ed. Gestión 2000, Barcelona, 1997. ISBN 848088-226-3. [Nakamura04] Y. Nakamura, K. Hosokawa, I. Kuroda, K. Yoshikawa, T. Yoshimura, “A Fast Hardware/Software CoVerification Method for System-On-a-Chip by Using a C/C++ Simulator and FPGA Emulator with Shared Register Communication”, Proceedings of DAC 2004, San Diego, CA, USA, June 2004. [Ofner04] E. Ofner, J. Nurmi, J. Madsen, J. Isoaho, and H. Tenhunen, ”SoC-Mobinet – R&D and Education in System-on-Chip Design,” in Proceedings of International Symposium on System-on-Chip, November 2004, Tampere, Finland, pp. 77-80. [Osiris] http://splish.ee.byu.edu/lab/osiris/osiris.html [Poetter04] A. Poetter; J. Hunter; C. Patterson; P. Athanas; B. Nelson and N. Steiner: JHDLBits: The Merging of Two Worlds. Proceedings of the 14th Field-Programmable Logic and Applications (FPL’04), Leuven, Belgium, Springer 2004 ISBN 3-540-22989-2, pp. 414 - 423. [Price01] T. Price and C. Patterson, “Reconfigurable breakpoints for co-debug”, in Field-Programmable Logic and Applications. Proceedings of the 11th International Workshop, FPL 2001, G. Brebner and R. Woods, Eds., Belfast, Northern Ireland, August 2001, vol. 2147 of Lecture Notes in Computer Science, pp. 473–482, Springer-Verlag. [Quickturn] Quickturn Home Page, //www.quickturn.com [Ramaswamy02] Ramaswamy Ramaswamy, Russel Tessier: “The Integration of SystemC and Hardware-Assisted Verification”, in Proc. FPL 2002 [Ramon05] E. Ramon, J. Carrabina. "Using FPGAs for Software-Defined Radio Systems: a PHY layer for an 802.15.4 transceiver". V Jornadas de Computación reconfigurable y Aplicaciones (JCRA). Granada, 14-16 de September, 2005. [Sangiovanni01] Alberto Sangiovanni-Vincentelli and Grant Martin, "Platform-Based Design and Software Design Methodology for Embedded Systems", IEEE Design and Test of Computers, Volume 18, Number 6, November-December 2001, pp. 23-33. [Sarmadi02] S. B. Sarmadi, S. G. Miremadi, G. Asadi, A. R. Ejlali: “Fast prototyping with Co-operation of Simulation and Emulation”, in Proceedings of FPL 2002. 
6 [Schumacher05] Schumacher, P.; Mattavelli, M.; Chirila-Rus, A. and Turney, R. “A software/hardware platform for rapid prototyping of video and multimedia designs”. In Proceedings of the 5th International Workshop System-on-Chip for Real-Time Applications. IEEE, 2005. pp.30-33; (20-24 July 2005; Banff, Canada.) [Shirazi03] Shirazi, N., Ballagh, J., “Put Hardware in the Loop with Xilinx System Generator for DSP”, Xcell Journal, Issue 47, May 2003. [Sima00] Sima, M., S. Vassiliadis, S. Cotofana, J.T.J. van Eijndhoven, and K. Vissers, "A Taxonomy of Custom Computing Machines," in PROGRESS Workshop on Emheclcled Systems, Utrecht, The Netherlands, 2000, pp. 87-93. [Singh03] V. Singh, A. Root, E. Hemphill, N. Shirazi, J. Hwang. “Accelerating Bit Error Rate Testing Using a System Level Design Tool”. FCCM 2003: 62-68. [Siripokarpirom04] R. Siripokarpirom and F. Mayer-Lindenberg, “Hardware-Assisted Simulation and Evaluation of IP Cores Using FPGA-based Rapid Prototyping Boards”, International Workshop on Rapid System Prototyping, Geneva, Switzerland, June 2004. [Siripokarpirom06] R. Siripokarpirom, “Platform Development for Run-Time Reconfigurable Co-Emulation”, in Proceedings of 17th IEEE International Workshop on Rapid System Prototyping , Chania, Crete, June 14-16, 2006. [Slaac] http://splish.ee.byu.edu/lab/jhdl/slaac1/index.html [Slade03] A. Slade and B. Nelson. “Reconfigurable Computing Application Frameworks” . In Kenneth L. Pocek and Jeffrey M. Arnold, editors, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM ’03). IEEE Computer Society, IEEE Computer Society Press, April 2003. [Talavera03] G. Talavera “Hardware Software debugging techniques for Reconfigurable Systems on Chip”. M.S. Thesis, Universitat Autonoma de Barcelona, Sep. 2003. [Tessier01] R. Tessier and W. Burleson, "Reconfigurable Computing and Digital Signal Processing: A Survey," J. VLSI Signal Processing, pp. 7-27, vol. 28, no. 3, May 2001. [Tombs04] J. Tombs, M. Aguirre Echanóve, F. Muñoz, V. Baena, A. Torralba, A. Fernandez-León, F. Tortosa: “The Implementation of a FPGA Hardware Debugger System with Minimal System Overhead”. FPL 2004: 1062-1066 [Touhafi96] A. Touhafi, W. Brissninck, E.F. Dirkx: The Implementation of a Field Programmable Logic Based CoProcessor for Acceleration of Discrete Event Simulators. Proc. 6th International Workshop on FieldProgrammable Logic and Applications (FPL‘96), pp. 415-424, Springer Verlag 1996 [Tripp02] J. L. Tripp, P. A. Jackson, and B. L. Hutchings. Sea Cucumber: a synthesizing compiler for FPGAs. In M. Glesner, P. Zipf, and M. Renovell, editors, Field-Programmable Logic and Applications, volume 2438 of Lecture Notes in Computer Science, pages 875–885, Montpellier, France, September 2002. Springer-Verlag. [Turner99] R. Turner, "System-level verification -- a comparison of approaches," in Proc. 10th International Workshop on Rapid System Prototyping (RSP ’99), pp. 154-159, Clearwater, Fla, USA, June 1999. [Valderas04] M. G. Valderas, Eduardo de la Torre, F. Ariza, Teresa Riesgo: “Hardware and Software Debugging of FPGA Based Microprocessor Systems Through Debug Logic Insertion”. FPL 2004: 1057-1061 [VHDL98] IEEE Standard VHDL Language Reference Manual, IEEE, Inc., NY, March 1988 [Vuillemin96] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard, "Programmable active memories: Reconfigurable systems come of age", IEEE Transactions on VLSI Systems, vol. 4, no. 1, pp. 56-69, 1996. [Waterson01] Waterson, Mark F. 
“The hardware subroutine approach to developing custom co-processors”. M.S. thesis, University of Hawai'i at Mānoa, May 2001. [Wildcard] http://splish.ee.byu.edu/lab/wildcard/index.html [Wirthlin01] M. J. Wirthlin, B. L. Hutchings and C. Worth, “Synthesizing RTL Hardware from Java Byte Codes”, in Field Programmable Logic and Applications, G. Brebner and R. Woods (Eds.), pp. 123-132, Belfast, Northern Ireland, UK, August 2001. [Wisniewski01] R. Wisniewski, A. Bukowiec, M. Wegrzyn, “Benefits of Hardware Accelerated Simulation”, International Workshop on Discrete-Event System Design (DESDes'01), June 2001, Przytok, Poland. [Wheeler01] T. Wheeler, "Improving design observability and controllability for circuit debugging in FPGAs using design-level scan techniques," Master's thesis, Brigham Young University, Provo, UT, 2001. [Wheeler01b] T. Wheeler, P. Graham, B. Nelson, and B. Hutchings, “Using design-level scan to improve FPGA design observability and controllability for functional verification”, in Proceedings of the Eleventh International Workshop on Field Programmable Logic and Applications, pp. TBA, Belfast, Northern Ireland, August 2001. [Xilinx00] Xilinx Corp., The MathWorks and Xilinx Strategic Alliance, http://www.xilinx.com/ipcenter/dsp/mathworks_xilinx_presentation.pdf [XilinxSG] Xilinx Corp., Xilinx System Generator for DSP, Version 6.3i, http://www.xilinx.com [Xilinx00b] Xilinx, San Jose CA. ChipScope software and ILA Cores User Manual, v. 1.1, June 2000.