Energy- and Performance-Aware Mapping for Regular NoC Architectures
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 24, NO. 4, APRIL 2005
Abstract: In this paper, we present an algorithm which automatically maps a given set of intellectual property (IP) cores onto a generic
regular network-on-chip (NoC) architecture and constructs a
deadlock-free deterministic routing function such that the total
communication energy is minimized. At the same time, the performance of the resulting communication system is guaranteed to
satisfy the specified design constraints through bandwidth reservation. As the main theoretical contribution, we first formulate
the problem of energy- and performance-aware mapping in a
topological sense, and show how the routing flexibility can be
exploited to expand the solution space and improve the solution quality. An efficient branch-and-bound algorithm is then
proposed to solve this problem. Experimental results show that
the proposed algorithm is very fast, and significant communication energy savings can be achieved. For instance, for a
complex video/audio application, 51.7% communication energy
savings have been observed, on average, compared to an ad hoc
implementation.
Index Terms: Energy, low power, networks-on-chip (NoCs),
optimization, performance.
I. INTRODUCTION
Manuscript received June 25, 2003; revised December 25, 2003 and March
26, 2004. This work was supported in part by the National Science Foundation
under CAREER Award CCR-0093104, in part by DARPA/Marco Gigascale Research Center (GSRC), and in part by the Semiconductor Research Corporation
under Award 2001-HJ-898. Parts of this paper appeared as "Energy-Aware Mapping for Tile-Based NoC Architectures Under Performance Constraints," in the
Proceedings of the ASP-DAC, Kitakyushu, Japan, 2003, pp. 233-239, and as
"Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of
Regular NoC Architectures," in the Proceedings of the Design, Automation, and
Test in Europe Conference, Munich, Germany, 2003, pp. 688-693. This paper
was recommended by Associate Editor M. Pedram.
The authors are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA.
Digital Object Identifier 10.1109/TCAD.2005.844106
[...] communication problems [1], [4], [5]. As shown in the left side of Fig. 1,
such a chip consists of a grid of regular tiles where each tile can
be a general-purpose processor, a DSP, a memory subsystem,
etc. A router is embedded within each tile with the objective of
connecting it to its neighboring tiles. Thus, instead of routing
design-specific global on-chip wires, the intertile communication can be achieved by routing packets.
Three key concepts come together to make this tile-based
architecture very promising: 1) structured network wiring;
2) modularity; and 3) standard interfaces. More precisely, since
the network wires are structured and wired beforehand, their
electrical parameters can be very well controlled and optimized.
In turn, these controlled electrical parameters make it possible
to use aggressive signaling circuits that help reduce the power
dissipation and propagation delay significantly. Modularity and
standard network interfaces facilitate reusability and interoperability of the modules. Moreover, since the network platform
can be designed in advance and later reused directly with many
applications, it makes sense to highly optimize this platform
as its development cost can be easily amortized across many
applications.
On the other hand, the regular tile-based architecture may
lead to significant area overhead if applied to applications whose
IP sizes vary significantly. In order to achieve the best performance/cost tradeoff, the designer needs to select the right NoC
platform (e.g., the platform with the right size of tiles, routing
strategies, buffer sizes, etc.) and further customize it according
to the characteristics of the application under design. For most
applications, the area cost overhead is fully compensated by
the design time savings and performance gains because of the
regular NoC architecture. The advantages of using the regular
NoC approach can be further increased if the IPs in the library
are developed with regularity (in terms of size) taken into consideration as well. Moreover, partitioning the application with
regularity in mind can also help in reducing the cost overhead.
Finally, the region-based design [5] can be used to further reduce the area overhead by embedding irregular regions inside
the NoC, which can be insulated from the network.
From the design perspective, given a target application described as a set of concurrent tasks which have been assigned
and scheduled, to exploit the architecture in Fig. 1, two fundamental questions need to be answered: 1) to which tile each IP
should be mapped and 2) which routing algorithm is suitable
for directing the information among tiles, such that the metrics of interest are optimized. More precisely, in order to get
the best energy/performance tradeoff, the designer needs to determine the topological placement of these IPs onto different
tiles. Referring to Fig. 1, this means determining, for instance,
onto which tile each IP (e.g., DSP2, DSP3, etc.) should be placed.
Since there may exist multiple minimal
routing paths, one also needs to select one qualified path for
each communicating pair of tiles. For example, one has to determine
which path the packets should follow
in order to send data from DSP2 to DSP3 once these two IPs
have been placed onto their respective tiles.
While task assignment and scheduling problems have been
addressed before [2], the mapping and routing problems described above represent a new challenge, especially in the
context of the regular tile-based NoC architecture, as this
significantly impacts the energy and performance metrics of the
system. In this paper, we address this very issue and propose
an efficient algorithm to solve it. To this end, we first propose a
suitable routing scheme (Section III) and a new energy model
(Section IV) for NoCs. The problems of mapping and routing
path allocation are formulated in Section V. Next, an efficient
branch-and-bound algorithm is proposed to solve this problem
under performance constraints in Section VI. Experimental
results in Section VII show that significant communication energy savings can be achieved, while guaranteeing the specified
system performance. For instance, for a complex video/audio
application, on average, 51.7% communication energy savings have been observed compared to a randomly generated
implementation.
II. RELATED WORK
In [1], Dally et al. suggest using the on-chip interconnection
networks instead of ad hoc global wiring to structure the top-level wires on a chip and facilitate truly modular design. Along
the same lines, Hemani et al. [4] present a honeycomb structure
in which each processing core (resource) is located on a regular
hexagonal node connected to three switches. In [5], Kumar et al.
describe a NoC architecture implemented by a two-dimensional
(2-D) mesh of switches and resources.
While these papers discuss the overall advantages and challenges of the regular NoC architecture, to the best of our
knowledge, our work is the first to address the mapping and
routing path allocation problems for tile-based architectures
and provide an efficient way to solve them. Although routing
(especially wormhole-based routing [3]) has been a hot research topic in the area of direct networks for parallel and
distributed systems [6]-[8], the specifics of NoC design force
us to rethink the standard network techniques and adapt them
to the context of NoC architectures. In what follows, we address this issue by presenting a suitable routing technique for NoCs.
Resource limitation and stringent latency requirements: Compared to deterministic routers, implementing adaptive routers requires by far more
resources. Moreover, since in adaptive routing the
HU AND MARCULESCU: ENERGY- AND PERFORMANCE-AWARE MAPPING FOR REGULAR NoC ARCHITECTURES
E. Programmability
Since the traffic characteristics vary significantly across different applications, it is necessary to reallocate the routing paths
when the NoC platform is used for different applications. Since
reallocation only involves reprogramming the routing table of
each router, the cost of this programmability is almost negligible.
In summary, we argue that the appropriate routing technique
for NoCs should be deterministic, deadlock-free, minimal,1 and
wormhole-based. Moreover, traffic characteristics should be
considered when allocating the routing paths.
1Minimal
IV. PLATFORM DESCRIPTION
In this section, we describe the regular tile-based architecture
and the energy model for its communication network.
A. Architecture
The system under consideration is composed of tiles
interconnected by a 2-D mesh network (see Fig. 2). Each tile
in Fig. 2 is composed of a processing core and a router. The
router is connected to the four neighboring tiles and its local
processing core via channels (each consisting of two one-directional point-to-point links).
Due to limited resources, the buffers are implemented using
registers, typically in the size of one or two flits each. A 5 × 5
crossbar switch is used as the switching fabric in the router.
Each router has a routing table. Based on the source/destination address, the routing table decides which output link the
packet should be delivered to.
B. Energy Model
Ye et al. [9] proposed a model for the energy consumption of
network routers. The bit energy ($E_{bit}$) metric is defined as the
energy consumed when one bit of data is transported through
the router
$$E_{bit} = E_{S_{bit}} + E_{B_{bit}} + E_{W_{bit}} \tag{1}$$
where $E_{S_{bit}}$, $E_{B_{bit}}$, and $E_{W_{bit}}$ represent the energy consumed
by the switch, buffering, and interconnection wires inside the
switching fabric, respectively. Since $E_{W_{bit}}$ is the energy consumed
on the wires inside the switch fabric, the energy consumed
on the links between tiles ($E_{L_{bit}}$) should also be included.
Thus, the average energy consumed in sending one bit
of data from a tile to a neighboring tile can be calculated as
$$E_{bit} = E_{S_{bit}} + E_{B_{bit}} + E_{W_{bit}} + E_{L_{bit}} \tag{2}$$
Since the length of a link is typically in the order of millimeters,
the energy consumed by buffering ($E_{B_{bit}}$) and internal
wires ($E_{W_{bit}}$) is negligible2 compared to $E_{L_{bit}}$; (2) reduces to
$$E_{bit} = E_{S_{bit}} + E_{L_{bit}} \tag{3}$$
2We evaluated the energy consumption using Spice simulations for a 0.35-μm
technology. The results show that $E_{B_{bit}} = 0.073$ pJ, which is indeed negligible
compared to $E_{L_{bit}}$ (typically in the order of a few pJ).
Consequently, the average energy consumed in sending one bit of data from tile $t_i$ to tile $t_j$ is
$$E_{bit}^{t_i,t_j} = n_{hops} \times E_{S_{bit}} + (n_{hops} - 1) \times E_{L_{bit}} \tag{4}$$
where $n_{hops}$ is the number of routers the bit traverses from tile $t_i$
to tile $t_j$.
It is interesting to note that, with (4), the communication
energy consumption can now be analytically calculated independently of the underlying traffic model (e.g., Markovian,
long-range dependence, etc. [12]), provided that the communication volume between any communicating IP pair is known.
For 2-D mesh networks with minimal routing, (4) shows that
the average energy consumption of sending one bit of data
from $t_i$ to $t_j$ is determined by the Manhattan distance between
them.
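As a rough illustration of how (4) ties the bit energy to the Manhattan distance, the following minimal Python sketch evaluates it for two tiles given as (row, column) pairs. The per-bit energy values and the function names are placeholders chosen for the example, not figures from this paper, and the sketch assumes that $n_{hops}$ counts both the source and the destination router.

```python
# Placeholder per-bit energies in pJ (assumed for illustration, not measured values).
E_S_BIT = 0.5  # energy per bit consumed in one switch
E_L_BIT = 2.0  # energy per bit consumed on one inter-tile link


def manhattan(src, dst):
    """Manhattan distance between two tiles given as (row, col)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def bit_energy(src, dst, e_s=E_S_BIT, e_l=E_L_BIT):
    """Evaluate (4) for a minimal path in a 2-D mesh.

    Assumption: n_hops includes the routers of both the source and the
    destination tile, so a minimal path crosses n_hops - 1 links.
    """
    n_hops = manhattan(src, dst) + 1
    return n_hops * e_s + (n_hops - 1) * e_l


if __name__ == "__main__":
    print(bit_energy((0, 0), (0, 1)))  # neighboring tiles
    print(bit_energy((0, 0), (2, 1)))  # three links apart
```

Because every minimal path between two tiles has the same length, the result does not depend on which minimal path is eventually selected.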
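A. Problem Formulation
Simply stated, for a given application, our objective is to
decide onto which tile each IP should be mapped and how
the packets should be routed, such that the total communication energy consumption is minimized under given performance
constraints. To formulate this problem, we need the following
definitions.
Definition 1: An application characterization graph (APCG)
$G = G(C, A)$ is a directed graph, where each vertex $c_i \in C$ represents
one selected IP, and each directed arc $a_{i,j} \in A$ characterizes
the communication from $c_i$ to $c_j$.3 Each $a_{i,j}$ has the following
properties: $v(a_{i,j})$, the communication volume (in bits) from $c_i$ to $c_j$, and
$b(a_{i,j})$, the bandwidth requirement of the communication from $c_i$ to $c_j$.
Given an APCG and an architecture characterization graph (ARCG) that satisfy
$$\mathrm{size}(APCG) \le \mathrm{size}(ARCG) \tag{5}$$
find a mapping function $map()$ from the APCG to the ARCG and a deterministic, deadlock-free, minimal routing function that
$$\min \Bigl\{ Energy = \sum_{\forall a_{i,j}} v(a_{i,j}) \times e\bigl(r_{map(c_i),\,map(c_j)}\bigr) \Bigr\} \tag{6}$$
such that:
$$\forall c_i \in C,\quad map(c_i) \in T \tag{7}$$
$$\forall c_i \ne c_j \in C,\quad map(c_i) \ne map(c_j) \tag{8}$$
$$\forall\ \text{link}\ l_k,\quad B(l_k) \ge \sum_{\forall a_{i,j}} b(a_{i,j}) \times f\bigl(l_k,\, r_{map(c_i),\,map(c_j)}\bigr) \tag{9}$$
where $T$ is the set of tiles, $r_{i,j}$ is the routing path from tile $t_i$ to tile $t_j$, $e(r_{i,j})$ is the average energy of sending one bit along that path as given by (4), $B(l_k)$ is the bandwidth of link $l_k$,
and $f(l_k, r_{i,j})$ equals 1 if link $l_k$ belongs to $r_{i,j}$ and 0 otherwise.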
To give a little bit of intuition, conditions (7) and (8) mean that
each IP should be mapped to exactly one tile and no tile can host
more than one IP. Equation (9) specifies the communication performance constraints for the problem in terms of the aggregated
bandwidth requirements for each link. More precisely, the resulting network has to guarantee that the communication traffic
(workload) of any link does not exceed the available bandwidth,
such that the bandwidth requirements between each communicating IP pair can be satisfied.
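To make the objective and the constraints more concrete, the sketch below evaluates one candidate mapping: it accumulates the communication energy over all communicating IP pairs and checks that no link carries more traffic than its bandwidth allows. It is only an illustration under stated assumptions: column-first (XY-style) routing stands in for the routing function as one possible minimal deterministic choice, all links share a single capacity, and the names `xy_path`, `evaluate`, and the per-bit energy values are hypothetical.

```python
from collections import defaultdict


def xy_path(src, dst):
    """Tiles visited when moving along the column index first, then the row.

    Stand-in for the routing function: one minimal, deterministic path.
    """
    r, c = src
    path = [src]
    while c != dst[1]:
        c += 1 if dst[1] > c else -1
        path.append((r, c))
    while r != dst[0]:
        r += 1 if dst[0] > r else -1
        path.append((r, c))
    return path


def evaluate(mapping, volume, bandwidth, link_capacity, e_s=0.5, e_l=2.0):
    """Return (total energy, feasible) for a mapping {ip: (row, col)}.

    volume[(i, j)] is the number of bits sent from IP i to IP j and
    bandwidth[(i, j)] the bandwidth that this traffic requires.
    """
    # Each IP sits on exactly one tile and no tile hosts two IPs.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two IPs share a tile")
    energy = 0.0
    load = defaultdict(float)  # aggregated bandwidth per directed link
    for (src_ip, dst_ip), bits in volume.items():
        path = xy_path(mapping[src_ip], mapping[dst_ip])
        n_hops = len(path)  # routers traversed, both ends included
        energy += bits * (n_hops * e_s + (n_hops - 1) * e_l)
        for link in zip(path, path[1:]):
            load[link] += bandwidth[(src_ip, dst_ip)]
    # Bandwidth constraint: no link may be loaded beyond its capacity.
    feasible = all(used <= link_capacity for used in load.values())
    return energy, feasible


if __name__ == "__main__":
    mapping = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
    volume = {("a", "b"): 10, ("a", "c"): 4}
    bandwidth = {("a", "b"): 3, ("a", "c"): 2}
    print(evaluate(mapping, volume, bandwidth, link_capacity=5))
```

In this toy case both flows share the link from (0, 0) to (0, 1), so lowering the capacity below their combined requirement makes the same mapping infeasible even though its energy is unchanged.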
We need to note that although latency is another very important performance metric, it is also a very difficult metric to
evaluate, especially since the characterization of the traffic itself is difficult for most applications. Moreover, the accurate
latency estimation may depend on many other factors, such as
packet/flit size, router arbitration scheme, etc. Thus, similar
to other work in the literature (e.g., [13]), we use the bandwidth
requirement as a performance constraint. Another advantage is
that the bandwidth requirement is actually indirectly related to
the packet latency. For instance, the designer can calculate the
bound of packet latency with the specified bandwidth using the
techniques presented in [14].
Having this problem formulation, the network to be synthesized guarantees that the packets from the same source IP to the
same destination IP will always arrive in order, as the resulting
network is indeed deterministic. As shown in Definition 3, a deterministic
routing function maps each source/destination tile pair to exactly one routing path.
This means that, given a source IP and a
destination IP, the algorithm decides only one routing path
for all packets sent from the source to the destination. Thus, the packets that belong
to the same message will never arrive out of order since they
have the same source and destination.
B. Significance of the Problem
To prove that the choice of mapping heavily affects the
communication energy consumption, we consider the following
experiment. A series of task graphs are generated using the
TGFF package [10]. Then the output graph is randomly assigned
to a given number of IPs, with the computational times and
communication volumes randomly generated according to a
specified distribution. Our tool is then used to preprocess and
annotate these task graphs and build the communication task
graphs (CTGs), which characterize the application partitioning,
task assignment, scheduling, communication patterns, and task
execution time. Also, the bandwidth requirements between any
communicating IP pairs are calculated.
The number of IPs used in the experiment ranges from 3 × 3
to 13 × 13. For each benchmark, we generate 3000 random
mapping configurations and the corresponding energy consumption values are calculated. At the same time, an optimizer
based on simulated annealing (SA) was also developed and used
with the goal of finding the legal mapping which consumes the
least amount of communication energy. The resulting energy
ratios are plotted in Fig. 3.
The dashed line in Fig. 3 shows the energy consumption ratio
between the best solution among the 3000 random mappings
(Random_min) and the solution found by the SA (SA_sol). The
solid line shows the ratio between the median solution among
the 3000 random mappings (Random_med) and SA_sol.
As we can see, although the SA optimizer does not necessarily find the optimal solution, it still saves around 50% energy
compared to the median solution for the system consisting of
3 × 3 tiles. Moreover, the savings increase as the system size
scales up. For instance, for the system with 13 × 13 tiles, the
savings can be as high as 75%. Another observation is that the
best solution among the 3000 random mappings is far from satisfactory, even for a system as small as 3 × 3 tiles.
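The random-mapping part of this experiment can be sketched in a few lines of Python. The sketch below is not the tool used for the reported results: the traffic volumes are made up, the energy per bit follows the Manhattan-distance model with placeholder switch and link energies, and `sample_random_mappings` is a hypothetical helper that returns the best and the median energy among the sampled mappings.

```python
import random
import statistics


def mapping_energy(order, n, volume, e_s=0.5, e_l=2.0):
    """Energy of a mapping where order[k] is the row-major tile index of IP k."""
    total = 0.0
    for (i, j), bits in volume.items():
        ri, ci = divmod(order[i], n)
        rj, cj = divmod(order[j], n)
        n_hops = abs(ri - rj) + abs(ci - cj) + 1
        total += bits * (n_hops * e_s + (n_hops - 1) * e_l)
    return total


def sample_random_mappings(n, volume, trials=3000, seed=0):
    """Return (minimum, median) energy over random mappings on an n x n mesh."""
    rng = random.Random(seed)
    tiles = list(range(n * n))
    energies = []
    for _ in range(trials):
        rng.shuffle(tiles)  # a random one-to-one assignment of IPs to tiles
        energies.append(mapping_energy(tiles, n, volume))
    return min(energies), statistics.median(energies)


if __name__ == "__main__":
    # Made-up volumes (bits) between IPs 0..4 on a 3 x 3 mesh.
    volume = {(0, 1): 10, (1, 2): 5, (3, 4): 7}
    best, median = sample_random_mappings(3, volume)
    print(best, median, median / best)
```

Dividing these two values by the energy of an optimized mapping would give ratios of the kind plotted in Fig. 3.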
From the routing perspective, it is not unusual to apply
routing to this kind of system since minimal, deterministic, and
the newly occupied tile are allocated and added to its PAT. Care must be taken to ensure freedom from deadlock.
To further explain how our algorithm works, the following
definitions are needed.
Definition 4: The cost of a node is the total energy consumed
by the communication among all IPs that have already been
mapped.
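Definition 5: Let $M$
be the set of vertices in the APCG that
have already been mapped. A node is called a legal node
if and only if it satisfies the following two conditions: 1) the routing paths specified by its PAT are deadlock-free; and
2) for every link, the aggregated bandwidth requirement of the traffic between mapped IPs routed over that link does not exceed the link bandwidth,
where the routing path from tile $map(c_i)$ to tile $map(c_j)$, with $c_i, c_j \in M$, is the one
specified by the PAT.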
Definition 6: The upper bound cost (UBC) of a node is defined as a value that is no less than the minimum cost of its legal
descendant leaf nodes.
Obviously, based on this definition, if a node has a given UBC,
then it has at least one legal descendant leaf node whose
cost is no larger than that UBC.
Definition 7: The lower bound cost (LBC) of a node is defined as the lowest cost that its descendant leaf nodes can possibly achieve.
Differently stated, this means that if a node has a given LBC,
then each of its descendant leaf nodes has a cost at least
as large as that LBC.
B. Branch-and-Bound Algorithm
Given the above definitions, finding the optimal solution of
(6) is equivalent to finding the legal leaf node which has the
minimal cost. To achieve this, our algorithm searches for the optimal solution by alternating the following two steps.
Branch: In this step, an unexpanded node is selected and its
next unmapped IP is enumeratively assigned to the remaining
unoccupied tiles to generate the corresponding new child nodes.
The PAT of each child node is also generated by first copying its
parent node's PAT and then allocating the routing paths for the
traffic between the newly occupied tile and the other occupied
tiles. The routing paths specified by the PAT have to be deadlock-free.
Bound: Each of the newly generated child nodes is inspected
to see if it is possible to generate the best leaf nodes later. A node
can be trimmed away without further expansion if either its cost
or its LBC is higher than the lowest UBC that has been found so
far, since it is guaranteed that other nodes will eventually lead
to a better solution.
How the algorithm allocates the routing paths and computes
UBC/LBC is critical to its performance. Better routing path
allocation helps balance the traffic, which leads to better solutions, but needs more time to compute. Tight UBC and LBC
help in trimming away more nonpromising nodes early in the
search, but also demand more computational time.
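The alternation of the branch and bound steps can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of Fig. 6: routing path allocation and the bandwidth constraints are ignored, the cost of the best complete mapping found so far plays the role of the lowest UBC, and the lower bound is a crude one that charges every not-yet-fixed communication the energy of two neighboring tiles. All names and energy values are assumptions made for the example.

```python
from itertools import product


def hop_energy(a, b, e_s=0.5, e_l=2.0):
    """Energy per bit between tiles a and b (Manhattan-distance model)."""
    n_hops = abs(a[0] - b[0]) + abs(a[1] - b[1]) + 1
    return n_hops * e_s + (n_hops - 1) * e_l


def branch_and_bound(n, num_ips, volume):
    """Map IPs 0..num_ips-1 onto an n x n mesh with minimal energy."""
    tiles = list(product(range(n), range(n)))
    min_e = hop_energy((0, 0), (0, 1))  # cheapest energy between distinct tiles
    best = {"cost": float("inf"), "mapping": None}

    def added_cost(placed, ip, tile):
        # Energy of the traffic between `ip` and the IPs mapped before it.
        cost = 0.0
        for (i, j), bits in volume.items():
            if i == ip and j in placed:
                cost += bits * hop_energy(tile, placed[j])
            elif j == ip and i in placed:
                cost += bits * hop_energy(placed[i], tile)
        return cost

    def lower_bound(placed, cost):
        # Traffic with an unmapped endpoint costs at least min_e per bit.
        rest = sum(bits for (i, j), bits in volume.items()
                   if i not in placed or j not in placed)
        return cost + rest * min_e

    def expand(placed, cost):
        if len(placed) == num_ips:  # leaf node: all IPs are mapped
            if cost < best["cost"]:
                best["cost"], best["mapping"] = cost, dict(placed)
            return
        ip = len(placed)  # branch on the next unmapped IP
        used = set(placed.values())
        for tile in tiles:
            if tile in used:
                continue
            new_cost = cost + added_cost(placed, ip, tile)
            placed[ip] = tile
            # Bound: expand the child only if it may still beat the incumbent.
            if lower_bound(placed, new_cost) < best["cost"]:
                expand(placed, new_cost)
            del placed[ip]

    expand({}, 0.0)
    return best["cost"], best["mapping"]


if __name__ == "__main__":
    print(branch_and_bound(2, 3, {(0, 1): 10, (1, 2): 5, (0, 2): 1}))
```

A tighter lower bound or a good initial incumbent would prune more of the tree at the price of more work per node, which is the tradeoff described above.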
Next, we describe our routing path allocation heuristic which
can find a good routing path allocation within reasonably short
4North-last
5A
In each step, the next unmapped IP with the highest communication demand is selected and its ideal topological location
on the chip is calculated as:
(10)
(11)
where the computation uses the row id and column id of the tile
that each already-mapped IP is mapped onto, and ranges over the set of mapped
IPs. The selected IP is then mapped to the unoccupied tile whose topological
location has the smallest Manhattan distance to this ideal location.
This step is repeated until all IPs have been mapped, which
leads to a leaf node. The aforementioned heuristic is then used
to allocate routing paths for the unallocated traffic. If this leaf
node is illegal, then the UBC of the node under inspection is set
to be infinitely large; otherwise, it is set to be the cost of that
leaf node.
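The greedy completion used for the UBC can be pictured with the following Python sketch. Since the exact weights of the ideal-location formulas are not reproduced here, the sketch assumes a traffic-weighted centroid of the tiles of the already-mapped IPs; this weighting, the fallback to the mesh centre, and the helper name `greedy_complete` are assumptions made for illustration, and routing path allocation and the legality check of the resulting leaf are omitted.

```python
def greedy_complete(n, num_ips, volume, placed):
    """Greedily extend a partial mapping {ip: (row, col)} to a full one."""
    placed = dict(placed)

    def traffic(a, b):
        # Bits exchanged between IPs a and b, in both directions.
        return volume.get((a, b), 0) + volume.get((b, a), 0)

    demand = {k: sum(traffic(k, o) for o in range(num_ips) if o != k)
              for k in range(num_ips)}
    free = [(r, c) for r in range(n) for c in range(n)
            if (r, c) not in placed.values()]
    # Unmapped IPs are handled in order of decreasing communication demand.
    pending = sorted((k for k in range(num_ips) if k not in placed),
                     key=lambda k: -demand[k])
    for ip in pending:
        weight = {m: traffic(ip, m) for m in placed}
        total = sum(weight.values())
        if total:
            # Assumed ideal location: traffic-weighted centroid of mapped IPs.
            x = sum(weight[m] * placed[m][0] for m in placed) / total
            y = sum(weight[m] * placed[m][1] for m in placed) / total
        else:
            x = y = (n - 1) / 2  # assumption: no mapped partner, aim for the centre
        # Closest unoccupied tile in Manhattan distance to the ideal location.
        tile = min(free, key=lambda t: abs(t[0] - x) + abs(t[1] - y))
        free.remove(tile)
        placed[ip] = tile
    return placed


if __name__ == "__main__":
    print(greedy_complete(3, 3, {(0, 1): 10, (1, 2): 5}, {0: (1, 1)}))
```

The cost of the mapping returned by such a completion, when the leaf is legal, is what would be used as the UBC of the node under inspection.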
Step 3: LBC Calculation: The LBC of a node can be
decomposed into three components
LBC
(12)
(14)
C. Pseudocode of the Algorithm
Fig. 6 gives the pseudocode of our algorithm. Two speedup
techniques are proposed to trim away more nonpromising nodes
early in the search process.
have larger impact on the overall communication energy consumption, fixing their positions earlier helps
expose those nonpromising nodes earlier in the
searching process; this reduces the number of nodes to
be expanded. As most applications have nonuniform
traffic patterns, this heuristic is quite useful in practice.
TABLE II
COMPARISON BETWEEN SA AND EPAM-OE
number of dummy IPs for a region depends on the size of the region.
regular architectures with different network topologies. This remains to be done as future work.
The presented mapping algorithm takes the APCG as the
input, which assumes that the tasks and communication transactions have already been scheduled onto a set of selected IPs. The
separation of the scheduling procedure and the mapping/routing
procedure may lead to the suboptimality of the solution. Some
of our initial work has been presented in [15] to address the
communication and task scheduling for regular NoCs. However,
more work needs to be done in order to efficiently merge the
scheduling and mapping procedures.
Fig. 14. Mapping result obtained by the modified algorithm with the specified
region shape maintained.
ACKNOWLEDGMENT
The authors would like to thank the associate editor and
anonymous reviewers for their suggestions that contributed to
improving several drafts of this paper.
REFERENCES
[1] W. J. Dally and B. Towles, "Route packets, not wires: on-chip interconnection networks," in Proc. Design Automation Conf., Jun. 2001, pp.
684-689.
[2] J. Chang and M. Pedram, "Codex-dp: co-design of communicating systems using dynamic programming," IEEE Trans. Computer-Aided Design
Integr. Circuits Syst., vol. 19, no. 7, pp. 732-744, Jul. 2002.
[3] W. J. Dally and C. L. Seitz, "The torus routing chip," J. Distributed
Comput., vol. 1, no. 3, pp. 187-196, 1986.
[4] A. Hemani et al., "Network on a chip: An architecture for billion transistor era," in Proc. IEEE NorChip Conf., Nov. 2000, pp. 166-173.
[5] S. Kumar et al., "A network on chip architecture and design methodology," in Proc. Symp. VLSI, Apr. 2002, pp. 105-112.
[6] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques
in direct networks," Computer, vol. 26, no. 2, pp. 62-76, 1993.
[7] C. J. Glass and L. M. Ni, "The turn model for adaptive routing," in Proc.
Int. Symp. Comput. Archit. (ISCA), May 1992, pp. 278-287.
[8] G. Chiu, "The odd-even turn model for adaptive routing," IEEE Trans.
Parallel Distributed Syst., vol. 11, no. 7, pp. 729-738, Jul. 2000.
[9] T. T. Ye, L. Benini, and G. De Micheli, "Analysis of power consumption
on switch fabrics in network routers," in Proc. Design Automation Conf.,
Jun. 2002, pp. 524-529.
[10] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free,"
in Proc. Int. Workshop Hardware/Software Codesign, Mar. 1998, pp.
97-101.
[11] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide
to the Theory of NP-Completeness. Freeman, 1979.
[12] G. Varatkar and R. Marculescu, "On-chip traffic modeling and synthesis
for MPEG-2 video applications," IEEE Trans. VLSI Syst., vol. 12, no. 1,
pp. 108-119, Jan. 2004.
[13] A. Pinto, L. P. Carloni, and A. L. Sangiovanni-Vincentelli, "Constraint-driven communication synthesis," in Proc. Design Automation Conf.,
Jun. 2002, pp. 783-788.
[14] J.-Y. Le Boudec and P. Thiran, Network Calculus: A Theory of Deterministic Queuing Systems for the Internet. New York: Springer-Verlag,
2001.
[15] J. Hu and R. Marculescu, "Energy-aware communication and task
scheduling for network-on-chip architectures under real-time constraints," in Proc. Design, Automation, Test Eur., Feb. 2004, pp.
234-239.
[16] Mentor Graphics IP Core Catalog. [Online]. Available: http://www.mentor.com/products/ip/product_index.cfm