Design of Image Processing Embedded Systems Using Multidimensional Data Flow


Embedded Systems

  Next, a multidimensional data flow model of computation is intro-

Target Audience

  As a consequence of the encompassing description of a system level design methodology using multidimensional data flow, the book particularly addresses all those active or interested in the research, development, or deployment of new design methodologies for data-intensive embedded systems.

  Erlangen, Germany
  Joachim Keinert
  Jürgen Teich

Acknowledgments

  This book is the result of a 5-year research activity that I could conduct both at the Fraunhofer Institute for Integrated Circuits IIS in Erlangen and at the Chair for Hardware-Software-Co-Design belonging to the University of Erlangen-Nuremberg.

2.4.5 Lack to Simulate the Overall System
2.5 Requirements for System Level Design of Image Processing Applications
2.5.3 Capability to Represent Control Flow in Multidimensional Algorithms
2.5.4 Tight Interaction Between Static and Data-Dependent Algorithms
2.6 Multidimensional System Level Design
3 Fundamentals and Related Work
3.1 Behavioral Specification
4.2 Case Study for the Motion-JPEG Decoder
4.2.2 Influence of the Input Motion-JPEG Stream
5.4.1 Multidimensional FIFO
5.6.1 Derivation of the WSDF Balance Equation
6 Memory Mapping Functions for Efficient Implementation of WDF Edges
7 Buffer Analysis for Complete Application Graphs
7.4 Lattice Wraparound
7.4.2 Formal Description of the Lattice Wraparound
7.5 Scheduling of Complete WSDF Graphs
7.6 Buffer Size Calculation
7.10 Conclusion
8 Communication Synthesis
8.5 Results
8.6 Conclusion and Future Work
9 Conclusion
A.1.3 Determination of the Minimum Tables
A.1.4 Determination of the Maximum Tables
A.2.2 Determination of the Lexicographically Smallest Live Data Element
A.2.4 Complexity of the Algorithm

1 Book organization
8.8 Concatenation of two FIFOs in order to fulfill the preconditions for the multidimensional FIFO
A.7 Data structure for determination of min ≺Lh
6.1 Simulation results comparing the rectangular and the linearized memory model
2.1 Filter types for generation of the wavelet subbands

Chapter 1 Introduction

1.1 Motivation and Current Practices

  However, the constant increase of functionality provided per chip has led to a design productivity gap, which according to the International Technology Roadmap for Semiconductors [6] threatens to break the long-standing trend of progress in the semiconductor industry. Whereas Moore’s law, which postulates the doubling of components per chip within 2 years, is considered to outlast at least the year 2020, the number of available transistors grows faster than the ability to meaningfully design them.

1.2 Multidimensional System Level Design Overview

  In particular, this includes generation of communication channels for high-speed communication in form of one- and multidimensional FIFOs. They have to guarantee (i) that the source never overwrites data still required by the sink, (ii) that the sink only reads valid data, and (iii) that the data are presented in the correct order to the sink.
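  The three guarantees can be pictured with a minimal one-dimensional sketch. It is only an illustration under simplifying assumptions: the class name `BoundedFifo`, its methods, and the use of exceptions instead of blocking are hypothetical and do not describe the hardware FIFOs generated by the design flow.

```python
from collections import deque


class BoundedFifo:
    """One-dimensional FIFO sketch with a fixed capacity (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def can_write(self):
        # (i) A write is only allowed while a free place exists, so data
        # that the sink still needs is never overwritten.
        return len(self._items) < self.capacity

    def can_read(self):
        # (ii) A read is only allowed when a valid token is stored.
        return len(self._items) > 0

    def write(self, token):
        if not self.can_write():
            raise BufferError("FIFO full: the source has to wait")
        self._items.append(token)

    def read(self):
        if not self.can_read():
            raise BufferError("FIFO empty: the sink has to wait")
        # (iii) Tokens leave in the order in which they were written.
        return self._items.popleft()
```

  A source and a sink would poll `can_write()` and `can_read()` before each access; a multidimensional FIFO additionally has to handle windows and reordering, which this sketch does not attempt.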

Chapter 2 Design of Image Processing Applications

  In today’s information society, processing of digital images becomes a key element for inter-

  Corresponding examples can be found in constantly increasing function ranges of cellular phones, in various types of object recognition, in complex acquisition of volumetric scans for medical imaging, or in digitization of production, transmission, and projection of cinematographic content. First, it gives a rough survey of the different kinds of algorithms used in complex image processing systems in order to explain the focus of the multidimensional design methodologies discussed in this book.

2.1 Classification of Image Processing Algorithms

  Furthermore, as static algorithms often execute on huge amounts of data, it is of utmost importance to select the implementation alternative that best fits the application requirements. Global algorithms finally are often less regular and more difficult to parallelize, which makes them particularly adapted for

  In order to clarify these concepts, the next section considers a concrete application in form of a JPEG2000 encoder.

2.2 JPEG2000 Image Compression

  In order to achieve these capabilities, the image first has to be transformed into the input format required by the JPEG2000 standard. As each of the following operations¹ can be applied to an individual tile without requiring the other tile data, this tiling operation increases the achievable parallelism during compression, reduces the required memory resources for the individual processing steps, and simplifies extraction of sub-images.

1 With some restrictions for the EBCOT Tier-2

  Each subband is derived by filtering and downsampling the input image in both horizontal and vertical directions as summarized in Table 2.1. Figure 2.3 illustrates the principles of such an algorithm for the 5–3 kernel by means of a 1 × 5 tap-filter including vertical subsampling by a factor of 2.

2 Part-1 of the JPEG2000 standard demands the usage of the so-called Mallat decomposition [ 03 ], in which only the LL subband is further refined into corresponding subbands

  The lifting scheme, on the other hand, exploits the particular properties of the filter coefficients in order to combine both the low-pass and the high-pass filtering into one sliding window algorithm, thus reducing the required computational effort. Then, it returns to the beginning of the image row and moves by two in direction $e_1$.
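  The idea of combining low-pass and high-pass filtering in one pass can be sketched in one dimension as follows. The sketch assumes the integer lifting steps commonly given for the reversible 5–3 kernel (a predict step that produces the high-pass coefficients, followed by an update step that produces the low-pass coefficients), an even signal length, and symmetric extension at the borders. The function names are hypothetical and are not taken from the book.

```python
def _mirror(index, length):
    """Reflect an out-of-range index back into [0, length) (symmetric extension)."""
    if index < 0:
        return -index
    if index >= length:
        return 2 * (length - 1) - index
    return index


def lifting_53_forward(x):
    """Split an even-length integer signal into low-pass and high-pass parts."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("this sketch expects an even length of at least 2")
    half = n // 2
    # Predict: odd samples minus the rounded-down mean of their even neighbours.
    high = [x[2 * k + 1] - ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
            for k in range(half)]
    # Update: even samples plus a rounded quarter of the adjacent high-pass values.
    low = [x[2 * k] + ((high[max(k - 1, 0)] + high[k] + 2) >> 2)
           for k in range(half)]
    return low, high


def lifting_53_inverse(low, high):
    """Undo lifting_53_forward by running the two steps backwards."""
    half = len(low)
    n = 2 * half
    x = [0] * n
    for k in range(half):
        x[2 * k] = low[k] - ((high[max(k - 1, 0)] + high[k] + 2) >> 2)
    for k in range(half):
        x[2 * k + 1] = high[k] + ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
    return x
```

  Because every lifting step only adds a function of the other half of the samples, the inverse simply subtracts the same terms in reverse order, which is why the integer transform is exactly reversible.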

2.3 Parallelism of Image Processing Applications

  Operation parallelism finally corresponds to the fact that for the generation of a result value several independent calculations have to be performed.

2.4 System Implementation

  In order to implement such a real-time JPEG2000 encoder, the abstract algorithm description given in Section 2.2 has to be successively refined to a hardware–software system. This, however, requires an encoder model that uses the same arithmetic operations as in hardware, but employs a higher level of abstraction than RTL in order to ease analysis and debugging.

2.4.2 Lack of Architectural Verification

  This is, for example, the case when deciding whether to employ resource sharing in the wavelet transform as described in Section 2.2. For local algorithms, sliding windows with different sizes, movement, and border processing have to be supported as discussed in Section 2.2.

2.6 Multidimensional System Level Design

  In particular, it shows how these kinds of algorithms can be modeled by multidimensional data flow in order to enable

  As this is a very large and complex task, two well-defined sub-problems are considered, namely automatic scheduling and buffer analysis as well as efficient communication synthesis. Whereas the first approach [167] is based on simulation and can thus be combined with arbitrary scheduling strategies or even control flow, the second [163, 171] is limited to static subsystems in order to profit from powerful polyhedral analysis.

Chapter 3 Fundamentals and Related Work

  After having presented the challenges and requirements for system level design of image processing applications, this chapter aims to discuss fundamentals on system level design and to give an overview on related work. Communication and memory synthesis techniques are discussed separately in Section 3.4.

3.1 Behavioral Specification

  Specification of the application behavior is of utmost importance for efficient system design.

  It must be unambiguous, has to present all algorithm characteristics required for optimization and efficient implementation, and must not bother the designer with too many details while allowing him to guide the design flow in the desired manner [94]. As the latter belong to the category of data-dominant systems, special focus shall be paid to sequential languages (Section 3.1.2), one-dimensional (Section 3.1.3) and multidimensional data flow (Section 3.1.4).

3.1.1 Modeling Approaches

  The iteration variables $i_j$ are often grouped to an iteration vector $i \in \mathbb{Z}^n$. Similarly, the loop boundaries can be expressed by two vectors, each having one component per iteration variable (three in the considered case).

3.1.2 Sequential Languages

  In [94], several challenges are listed that have to be overcome in order to achieve these goals:

  Sequential languages only define an execution order but do not contain any informa-

  Consequently, in order to simplify these problems, several C-variants have been proposed that are more adapted to hardware implementations [94]. Other approaches like Handel-C [61] and SHIM [93, 289] enrich the C language with the capability to express communicating sequential processes.

Communicating Sequential Processes

  Consequently, in case of out-of-order communication as described in Section 2.2, an additional translation unit has to reorder the data due to different write and read orders of the data source and sink. Furthermore, as SHIM is limited to deterministic models, the user would be forced to decide exactly in which order to accept and to output the data in the translation unit.

SystemC

  Besides the extension of the C language by communicating sequential processes, SystemC [230] is another approach for the design of complex systems. It is based on the C++ language and provides a set of classes for application modeling.

3.1.3 One-Dimensional Data Flow

  For this purpose, the application is modeled as a graph $G = (A, E)$ consisting of vertices $a \in A$ interconnected by edges $e \in E \subseteq A \times A$. These edges represent communication, which takes place by transport of data items, also called tokens, from the source vertex $\mathrm{src}(e)$ to the sink $\mathrm{snk}(e)$ of the edge $e$.

Synchronous Data Flow (SDF)
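  As a rough illustration of the graph notation $G = (A, E)$ with $\mathrm{src}(e)$ and $\mathrm{snk}(e)$, the following Python sketch stores each edge together with a constant production and consumption rate, as synchronous data flow attaches them to edges, and derives the smallest number of firings per actor that leaves all buffer fill levels unchanged. The function name, the tuple layout, and the example rates are assumptions made for this illustration and are not taken from the book; the sketch also assumes a connected graph.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the balance equations r[src] * p == r[snk] * c for all edges.

    actors: list of actor names
    edges:  list of (src, snk, p, c) tuples with production rate p
            and consumption rate c (both positive integers)
    """
    rate = {actors[0]: Fraction(1)}
    progress = True
    while progress:
        progress = False
        for src, snk, p, c in edges:
            if src in rate and snk not in rate:
                rate[snk] = rate[src] * p / c
                progress = True
            elif snk in rate and src not in rate:
                rate[src] = rate[snk] * c / p
                progress = True
            elif src in rate and snk in rate:
                if rate[src] * p != rate[snk] * c:
                    raise ValueError("inconsistent rates: no periodic schedule")
    if len(rate) != len(actors):
        raise ValueError("this sketch expects a connected graph")
    # Scale the rational solution to the smallest positive integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    ints = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, ints.values())
    return {a: n // common for a, n in ints.items()}
```

  For a chain of three actors with rates (p = 2, c = 3) and (p = 1, c = 2), the function returns three, two, and one firings, so that every edge sees as many produced as consumed tokens per period.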

  In synchronous data flow (SDF) [193], data consumption and production occur at constant rates defined by two functions

  $$p : E \to \mathbb{N}, \qquad (3.1)$$
  $$c : E \to \mathbb{N}. \qquad (3.2)$$

  In other words, each actor invocation, also called firing, consumes $c(e_i)$ tokens from each input edge $e_i$ and produces $p(e_o)$ tokens on each output edge $e_o$. $\Gamma_G$ is the so-called topology matrix and can be calculated as

  $$\Gamma_G = \begin{bmatrix} \Gamma_{1,1} & \cdots & \Gamma_{1,a} & \cdots & \Gamma_{1,|A|} \\ \vdots & & \vdots & & \vdots \\ \Gamma_{|E|,1} & \cdots & \Gamma_{|E|,a} & \cdots & \Gamma_{|E|,|A|} \end{bmatrix}$$

  with

  $$\Gamma_{e,a} = \begin{cases} p(e) & \text{if } \mathrm{src}(e) = a \text{ and } \mathrm{snk}(e) \neq a \\ -c(e) & \text{if } \mathrm{snk}(e) = a \text{ and } \mathrm{src}(e) \neq a \\ \ldots \end{cases}$$

  Since such a schedule does not cause a net change of the buffered tokens on each edge, the overall application can be executed infinitely by periodic repetition of the minimal periodic schedule.

Cyclo-static Data Flow (CSDF)

  The number of phases $L(a)$ can vary for each graph actor $a \in A$ and is given by a function $L : A \to \mathbb{N},\ a \mapsto L(a)$. (3.1) is replaced by

4 The split actor belongs to the class of CSDF, described in Section .
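  The role of the phases can be pictured with a small sketch. It assumes the usual cyclo-static reading in which an actor steps through its $L(a)$ phases cyclically and each phase has its own constant rate; the function name and the example rates are hypothetical.

```python
def tokens_after(phase_rates, firings):
    """Total tokens produced (or consumed) on one edge after a number of firings.

    phase_rates: one constant rate per phase, visited cyclically
    firings:     number of actor invocations considered
    """
    phases = len(phase_rates)
    return sum(phase_rates[k % phases] for k in range(firings))
```

  With `phase_rates = (1, 0, 2)`, an actor with three phases produces one token in its first firing, none in the second, and two in the third, after which the pattern repeats.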

5 In [ 166 ], an alternative representation has been chosen in order to emphasize the behavior of a

  This fact can be exploited by hardware implementations in order to already read the next input image with the aim to reduce the execution time of the initial phase. In addition, since CSDF is a one-dimensional model of computation, the address calculation in order to find the correct data elements in the buffer is left to the user.

Fractional Rate Data Flow (FRDF)

  To this end, FRDF allows production and consumption of fractional tokens. For each data flow invocation of subsystem H, the data flow parameters of the latter have to be fixed.

Data-Dependent Data Flow

  For example, whereas BDF graphs permit to establish a balance equation, which depends on the ratio between true and false tokens on the Boolean inputs of each switch and select actor, the existence of a solution does not guarantee that the application can run with bounded memory, as would be the case for SDF graphs [50]. To sum up, none of them has been able to cover all requirements formulated in Section 2.5.

3.1.4 Multidimensional Data Flow

  As illustrated in Section 3.1.3, one-dimensional data flow models of computation are not able to represent image processing applications in an encompassing manner. The underlying problem can be found in the elimination of important information when transforming an algorithm working on a multidimensional array into a one-dimensional representation.

Multidimensional Synchronous Data Flow (MDSDF)

  3.4, the solution of the balance equations (one for each dimension) leads to $r_1^* = (3, 2)^T$ and $r_2^* = (2, 1)^T$, meaning that the first actor (the source) has to execute three times in direction $e_1$ and two times in $e_2$. Based on the origin $o_i$, the data elements $x_j$ belonging to a certain tile can be calculated by the fitting matrix $F$ using the following equation:

  $$\forall j,\ 0 \le j < s_{pattern} : \quad x_j = (o_i + F \times j) \bmod s_{array}.$$

  $s_{pattern}$ defines the pattern size and equals the required array size annotated at each port. Usage of so-called mode-automata permits the modeling of control flow [185, 186].
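  The fitting equation can be tried out with a few lines of Python. The sketch below enumerates all pattern indices $j$ and applies $(o + F \times j) \bmod s_{array}$ componentwise; the function name, the row-wise matrix layout, and the example values are assumptions chosen for illustration.

```python
from itertools import product


def tile_elements(origin, fitting, s_pattern, s_array):
    """Array coordinates of one tile: x_j = (o + F * j) mod s_array.

    origin:    tile origin o, one entry per array dimension
    fitting:   fitting matrix F as a list of rows (array dims x pattern dims)
    s_pattern: pattern size, one entry per pattern dimension
    s_array:   array size, one entry per array dimension
    """
    elements = []
    for j in product(*(range(s) for s in s_pattern)):
        x = tuple(
            (origin[d] + sum(fitting[d][k] * j[k] for k in range(len(j))))
            % s_array[d]
            for d in range(len(s_array))
        )
        elements.append(x)
    return elements


# A 2 x 2 tile starting in the last row of a 4 x 4 array wraps around.
example = tile_elements(origin=(3, 0), fitting=[[1, 0], [0, 1]],
                        s_pattern=(2, 2), s_array=(4, 4))
```

  The modulo operation is what makes the tile in the example wrap from index 3 back to index 0 in the first dimension.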

3.2 Behavioral Hardware Synthesis

  Independent of the methodology chosen for modeling, behavioral hardware synthesis is a key element for system level design of complex (image processing) applications. As this book does not primarily address behavioral synthesis, the current section is not intended to give an encompassing overview on existing tools or proposed approaches.

3.2.1 Overview

3.2.2 SA-C

  SA-C [91, 216, 250, 290] is a variant of the C language and has been developed in the context of the Cameron project [68, 128]. To this end, SA-C is limited to single assignment programs and introduces a special coding primitive for sliding window algorithms in order to describe image processing kernels like a median filter or a wavelet transform.


  To this end, the system is split

7 Note that this optimization differs from the JPEG2000 tiling described in Section 2.2 in that the latter introduces an additional border processing in order to avoid the resulting multiple data accesses.

  By means of different heuristics that try to optimize chip area and processing speed, the parameters for the different loop optimizations can be adjusted [25, 238] and evaluated in a new run of the above-described processing steps.

Data Reuse

  Each array $\alpha$ can be accessed with $P_\alpha$ array references obeying the following restrictions:

  • All array indices are affine functions $A_\alpha \times i + b_{\alpha,k}$, $1 \le k \le P_\alpha$, of the iteration vector $i = (i_1, i_2, \ldots)$.

  The difference between two such iterations is called a reuse vector, because it permits to derive the iteration $i_2$ that reads the same data element as the considered iteration $i_1$.

  $\mathrm{Ker}(A) + B$, where $B$ is called distance vector and solves $A \times B = b - b'$ for the offset vectors $b$ and $b'$ of the two considered array references
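  The relation between reuse vectors and the affine index functions can be checked numerically. The sketch assumes the following convention, which is not taken from the book: one reference uses the index $A \times i + b_1$ at iteration $i_1$, a second one uses $A \times i + b_2$ at iteration $i_2 = i_1 + r$, and both touch the same element exactly if $A \times r = b_1 - b_2$. The helper names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def is_reuse_vector(matrix, b1, b2, r):
    """True if iteration i + r with offset b2 hits the element of i with b1."""
    return mat_vec(matrix, r) == [x - y for x, y in zip(b1, b2)]


# Loop nest over (i, j, k) reading a[i][j] and a[i - 1][j]: k is a free variable.
A = [[1, 0, 0],
     [0, 1, 0]]
b_first = (0, 0)
b_second = (-1, 0)
```

  In this example every vector of the form (1, 0, k) is a reuse vector: it is the sum of one particular solution (1, 0, 0) and an arbitrary multiple of the kernel vector (0, 0, 1).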

  3.8a, i and j are such free variables leading to base vectors of $b_1 = (1, 0, 0)^T$ and $b_2 = (0, 1, 0)^T$. This helps in construction of modular systems and is in contrast to the DEFACTO communication, which takes place by direct handshaking between the sink and the source [317].

3.2.5 Synfora PICO Express

  A compact representation of the loop nest in form of a so-called polytope model together with the OMEGA software system [4] allows complex dependency analysis and makes it possible to map the different loop iterations to processing elements and to schedule their execution [252]. For instance, buffer sizes have to be determined during simulation, which can be problematic as shown later on in Section 7.2.

3.2.6 MMAlpha

  They are characterized by the fact that some of the actors or processes execute less often than others due to the presence of up- and downsamplers. Then, an integer linear program (ILP) can be established in order to determine the execution times of each individual

  considered multirate systems are limited to one-dimensional arrays instead of two- or more-dimensional arrays.

3.2.7 PARO

  Based on this dependence graph, the operation scheduling and allocation are performed via mixed integer linear programming. Both aspects can be efficiently implemented in hardware by means of a communication primitive called multidimensional FIFO.

3.3 Memory Analysis and Optimization

  After having intensively discussed how to model image processing applications in order to represent important characteristics as well as how to synthesize them from a higher level of abstraction, this section addresses related work that considers determination of the required memory sizes. Consequently, a variety of different techniques have been proposed that aim to determine the minimum amount of memory necessary to perform the desired task with the required throughput.

3.3.1 Memory Analysis for One-Dimensional Data Flow Graphs

  Assuming that each actor requires the same execution time, the buffer of edge $e_5$ has to be increased to 3 + 2 tokens because after the third invocation of actor $a_3$ it still takes two time units until actor $a_5$ finally consumes the first token.¹¹

11 It is assumed that the buffer has to be allocated at the beginning of the producing invocation and can be released at the end of the consuming invocation.

3.3.2 Array-Based Analysis

  In order to solve the question of good modular mappings, the authors present a corresponding theory that starts from a so-called index difference set $DS$. The latter contains the difference of all conflicting array indices or iteration vectors that cannot be assigned to the same memory cell:

  $$DS = \{\, i - j \mid i \text{ and } j \text{ cannot be assigned to the same memory cell} \,\}.$$
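  For a one-dimensional array, the use of the index difference set can be sketched as follows. The sketch assumes that two elements conflict when their lifetimes overlap and that a modulo mapping $i \mapsto i \bmod m$ is acceptable if no nonzero difference in $DS$ is a multiple of $m$; it finds the smallest such $m$ by brute force. The function names and the lifetime model are assumptions for illustration and do not reproduce the cited theory.

```python
def index_difference_set(lifetimes):
    """DS = {i - j | elements i and j are alive at the same time}.

    lifetimes: dict mapping an array index to its (first_write, last_read) time
    """
    ds = set()
    for i, (start_i, end_i) in lifetimes.items():
        for j, (start_j, end_j) in lifetimes.items():
            if start_i <= end_j and start_j <= end_i:
                ds.add(i - j)
    return ds


def smallest_modulus(ds):
    """Smallest m such that no nonzero difference in DS is mapped to cell 0."""
    m = 1
    while any(d != 0 and d % m == 0 for d in ds):
        m += 1
    return m


# Element i is written at time i and read for the last time at time i + 2.
lifetimes = {i: (i, i + 2) for i in range(6)}
ds = index_difference_set(lifetimes)
```

  In the example, elements whose indices differ by at most two are alive simultaneously, so three memory cells addressed by `i % 3` suffice.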

3.4 Communication and Memory Synthesis

  To this end, the authors establish an integer linear program that determines the number of required memory modules and their port configurations and bit widths such that the given number of parallel data accesses is possible while minimizing hardware costs. This is primarily based on the extended CSDF interpretation [85] as discussed in Section .

13 This kind of loop program indicates data dependencies instead of being supposed to execute all iterations sequentially

  Besides the techniques implemented in the DEFACTO compiler (see Section 3.2.4), Weinhardt and Luk [300] present a simpler approach that is able to exploit reuse in the innermost loop level by introduction of corresponding shift registers that delay the data until it is required again. Figure 3.16, for instance, shows the tiling operation occurring in JPEG2000 compression as discussed in Section 2.2.

3.5 System Level Design

  Nevertheless, description, analysis, and optimization of complete systems as well as communication generation for image processing applications still remain challenging due to several limitations of available tools and methodologies. Consequently, this section aims to present several approaches that use a system level view in order to perform, for instance, memory analysis, efficient communication synthesis, or system performance evaluation.

3.5.1 Embedded Multi-processor Software Design

  Efficient design of complex embedded systems requires both the generation of hardware accelerators and software running on multiple embedded processors. Since creation of energy-saving and multi-threaded software applications is very challenging, different tools have been proposed aiming to simplify this task.

Omphale

  In order to control the position and the size of the read and write windows, the tasks have to request and release data elements from the communication channel. For the source actor of the original task graph edge, this corresponds to the movement of the write window end pointer, while for the sink of the task graph edge, it is associated with the movement of the read window end pointer.

ATOMIUM

  ATOMIUM [57, 58, 60, 140] stands for “A Toolbox for Optimizing Memory I/O Using geometrical Models” and represents a complex tool for buffer size analysis and memory hierarchy optimization of C-programs mapped to MPSoCs in terms of power, performance, and area. Furthermore, the memory analysis is extended by throughput-optimized scheduling of out-of-order communication and by tradeoff analysis in multirate applications.

3.5.2 Model-Based Simulation and Design

  In this case, the application is typically represented by a more or less formal model of computation ranging from continuous time simulations with differential equations and discrete event systems over finite state machines to data flow models of computation. Furthermore, although Simulink channels are able to carry multidimensional arrays, derivation of an efficient hardware implementation still requires lots of user intervention in form of serialization, data buffering, and memory size determination.

Image Processing Centric Approaches

  References [33, 77], for instance, provide a C++ front-end for design of local

  that includes a general local algorithm core, frequent point and global algorithms, as well as components for arithmetic combination of images. Combination with a behavioral synthesis tool enables quick generation of the filter kernels, which can be connected to the optimized storage structures offering parallel data access [189].

System Level Design Tools for Multidimensional Signal Processing

  The FIFO sizes are determined by automatic buffer analysis performed on the original loop program [292] as described in Section 3.3.2. For the Daedalus framework, on the other hand, the input code has to be designed in a specific way in order to obtain efficient memory structures [292].

3.5.3 System Level Mapping and Exploration

  The input specification is given as Kahn process network modeled in UML. Based on this information, the SPARCS system automates the process of mapping (i) task computations to the FPGA resources, (ii) abstract memories to physical memories, and (iii) data flow to the interconnection fabric.

Chapter 4 Electronic System Level Design of Image Processing Applications with SystemCoDesigner

  As already discussed in Chapter

  Section 2.5 has identified several requirements on such a new design flow like, for instance, efficient representation of point, local, or global algorithms, the capacity to express different kinds of parallelism as well as control flow, and the support for out-of-order communication, high-level verification, simulation on different levels of abstraction, or high-level performance evaluation. Section 4.2 presents its application to a Motion-JPEG decoder in order to demonstrate how SystemCoDesigner is able to solve the requirements identified in Section 2.5.

4.1.1 Actor-Oriented Model

  The first step in the proposed ESL design flow is to describe the application by an actor-oriented data flow model. The multidimensional FIFO is able to handle sliding windows, out-of-order communication, and parallel access to several data elements and will be further discussed in Chapter 5.

4.1.4 Automatic Design Space Exploration

  As a result of the above synthesis steps, several implementations are available for both the actors and the communication channels, which can be characterized in terms of throughput, latency, or chip size. The task of the automatic design space exploration then consists of selecting a subset of the hardware resources and mapping both the actors and the communication channels on

  Figure 4.4 shows an extract of an architecture template that belongs to the Motion-JPEG decoder depicted in Fig.

4.2 Case Study for the Motion-JPEG Decoder

  After having described the overall design flow of SystemCoDesigner, this section presents and discusses the quality of results obtained for a Motion-JPEG Decoder described in SysteMoC. Table 4.1 illustrates the development effort spent for realization of the complete Motion-JPEG decoder and its different hardware–software implementations.

1 In Xilinx Virtex-II FPGAs, hardware multipliers and BRAMs partially share communication

  Consequently, they are combined to one

  represents the activity of breaking the JPEG standard [153] down into a SysteMoC graph of actors. Table 4.3 provides the performance values of the corresponding hardware implementations in order to give an idea about the achievable accuracy of the VPC estimations.

2 Its impacts are directly taken into account by the automatic design space exploration, because both the resource requirements and the execution times are determined after synthesis of the actors

  mathematical sense, the values typically observed are situated in the ranges resulting from Tables 4.2 and 4.3. The discrepancy between the VPC estimations for latency and throughput and those measured for the hardware-only solution could be traced back to the time spent in complex guards occurring, for instance, in the Huffman Decoder.

Evaluation of the Schedule Overhead

  In order to simulate a SysteMoC application, VPC uses the SystemC event mechanism

  This leads to a time consumption that cannot be taken into account by the VPC framework, as the latter uses an event-based scheduling strategy, thus leading to a discrepancy between the execution times predicted by the VPC and those measured in the final software implementation. The values have been obtained by means of a special hardware profiler that monitors the program counter of the MicroBlaze processor in order to determine when the latter is occupied by looking for the next action to execute.

Evaluation of the Influence of the Cache

  As can be seen, simply enabling both an instruction and a data cache with 16 KB each results in a performance gain of a factor of 3.4. However, since the cache behavior depends on the program execution history, a simple reordering of the program code can lead to significant changes in the action execution times, which cannot be taken into account by the VPC framework.

3 Note that this value divided by 4 does not result in the latency for one image. This is due to the

  Instead, SystemCoDesigner uses the fast VPC simulations in order to find a manageable set of good implementations that can then be further investigated by more precise simulation techniques. By providing an automatic synthesis path, SystemCoDesigner significantly simplifies this opportunity, and thus helps to evaluate a relatively large number of design points.

4.2.2 Influence of the Input Motion-JPEG Stream

  This, however, is a difficult task due to the data dependency of some of the involved actors, like the Huffman decoder. In other words, the achievable system throughput does not only depend on the chosen hardware implementation but also depends on parameters such as image size and image quality, which is related to the compression ratio or file size.

4.3 Conclusions

  This can, for instance, easily be seen by means of the shuffle actor, which is responsible for forwarding the Y, the $C_b$, and the $C_r$ value of a pixel in the correct order and in parallel to the $YC_bC_r$ decoder. As those cannot be represented directly in a one-dimensional model of computation (see also Section 3.1), the designer is responsible for handling the data reuse occurring due to multiple reads of the same data value.

Chapter 5 Windowed Data Flow (WDF)

  Modern image processing applications not only induce huge computational load but also

  Section 5.3 describes how data reordering can be captured, followed by a discussion on communication control in Section 5.4. Section 5.7 targets integration of WDF into the system level design tool SystemCoDesigner (see Chapter 4) before Section 5.8 presents two case studies in form of a morphological reconstruction and a lifting-based wavelet transform in order to demonstrate the usefulness of the chosen approach.

5.1 Sliding Window Communication

5.1.1 WDF Graph and Token Production

  Similar to any other data flow model of computation, applications are modeled in WDF by means of a data flow graph $G$, which operates on an infinite stream of data. $O^{write}$ and $O^{read}$ define the communication order and are discussed in Section 5.3.

$G = (A, E, p, v, c, \Delta c, \delta, b^s, b^t, O^{write}, O^{read})$

  In order to ease notation, the parameters are typically not listed explicitly in the tuple, leading to the following short-hand notation: $G = (A, E)$. The number of produced virtual tokens can be calculated by means of a local balance equation that will be derived in Section 5.2.

1 References [65, 66] additionally use the concept of virtual token unions of size u. However, since they have been superseded by the communication order described in Section 5.3, they are omitted in this monograph (u = 1).

  Figure 5.2 illustrates the corresponding principle by annotating some of the effective tokens with the vector $I^e_{src(e)}$ of the corresponding write operation.

5.1.2 Virtual Border Extension

  As can be seen, each virtual token is extended by a border whose size at the “upper left corner” is defined by $b^s$ (s stands for start), whereas $b^t$ (t stands for termination) is associated to the

  columns to add at the beginning, respectively, end of the virtual token.

  Fig. 5.3 Illustration of the virtual border extension in the WDF model of computation

  An important fact is that the data elements belonging to the extended border are not produced by the source actor.

5.1.3 Token Consumption

  For each read operation, the sink actor $\mathrm{snk}(e)$ accesses such a sliding window whose size is defined by the function $c : E \ni e \mapsto c(e) \in \mathbb{N}^{n_e}$. Inside a virtual token, the window movement is defined by the sampling vector² $\Delta c(e)$, where $\langle \Delta c(e), e_i \rangle$ defines the window shift in dimension $i$.
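  The interplay of window size, window shift, and virtual border can be illustrated for two dimensions. The sketch below enumerates all sliding windows over one virtual token in raster-scan order. It assumes that sizes are given as (width, height), that the border elements are filled with a constant value, and that the windows fit the extended token exactly; the function name and these conventions are assumptions for illustration, not the book's definition.

```python
def sliding_windows(token, c, dc, bs, bt, fill=0):
    """Yield all windows of size c over a 2-D virtual token with border extension.

    token: list of rows (the data elements produced by the source)
    c:     window size (width, height)
    dc:    window shift per dimension (width, height)
    bs:    border added at the start of each dimension
    bt:    border added at the end of each dimension
    fill:  value used for border elements (assumption: constant border)
    """
    height, width = len(token), len(token[0])

    def at(x, y):
        inside = 0 <= x < width and 0 <= y < height
        return token[y][x] if inside else fill

    for y0 in range(-bs[1], height + bt[1] - c[1] + 1, dc[1]):
        for x0 in range(-bs[0], width + bt[0] - c[0] + 1, dc[0]):
            yield [[at(x0 + dx, y0 + dy) for dx in range(c[0])]
                   for dy in range(c[1])]


token = [[1, 2, 3],
         [4, 5, 6]]
windows = list(sliding_windows(token, c=(3, 3), dc=(1, 1), bs=(1, 1), bt=(1, 1)))
```

  With a 3 × 3 window, a shift of one, and a border of one element on each side, the example yields exactly one window per pixel of the virtual token.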

2 In accordance with [ 15 ], [ 165 , 166 ] proposed to use a diagonal sampling matrix in order to

  To this end, each edge has an associated vector $\delta$, which defines the number of initial hyperplanes and whose interpretation is similar to MDSDF (see also Section 3.1.4):

  $$\delta : E \ni e \mapsto \delta(e) \in \mathbb{N}^{n_e}.$$

  5.6, a second one is started, which is supposed to be added in dimension e .

5.2 Local WDF Balance Equation

  5.7. The bold frames indicate the data elements produced by the source actor for the first virtual token with $v = (7, 4)^T$.

3 Assuming edge buffers of infinite size

  As a consequence, supposing one virtual token, the number of consumed data elements in dimension $e_i$ can be calculated to

  $$\langle c, e_i \rangle + \langle \Delta c, e_i \rangle \times \left( \langle r_{vt}(e), e_i \rangle - 1 \right) = \langle v, e_i \rangle + \langle b^s + b^t, e_i \rangle.$$

  Thus, the number of data elements produced by the source must be a multiple of the smallest common multiple of $\langle p, e_i \rangle$ and $\langle v, e_i \rangle$.
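  Solving the relation above for the number of read operations per virtual token gives, for each dimension, $(v + b^s + b^t - c) / \Delta c + 1$. The following sketch evaluates this per dimension and also computes the smallest common multiple (scm) of the effective and virtual token sizes. The function names are hypothetical, and raising an error when the windows do not fit exactly is an assumption of this sketch.

```python
from math import gcd


def reads_per_virtual_token(v, c, dc, bs, bt):
    """Number of sink read operations per virtual token in each dimension."""
    reads = []
    for i in range(len(v)):
        span = v[i] + bs[i] + bt[i] - c[i]
        if span < 0 or span % dc[i] != 0:
            raise ValueError("windows do not fit the extended virtual token")
        reads.append(span // dc[i] + 1)
    return tuple(reads)


def scm(a, b):
    """Smallest common multiple of two positive integers."""
    return a * b // gcd(a, b)


def produced_elements(p, v):
    """Smallest numbers of data elements that are multiples of both p and v."""
    return tuple(scm(pi, vi) for pi, vi in zip(p, v))
```

  For a virtual token of size (7, 4), a 3 × 3 window, a shift of one, and a border of one on each side, the sink reads 7 × 4 times per virtual token, once per pixel.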

5.3 Communication Order

  Principally, the communication order can be represented by two sequences of iteration vectors … The BB actor, on the other hand, reads 1 pixel per invocation using a non-overlapping window of size $c = (1, 1)^T$.

4 This block size has only been chosen for illustration purposes. In reality, blocks are much larger and their size has to be a power of 2.

  Definition 5.8 A sequence of firing blocks $O = \langle B_1, \ldots \rangle$ … 5.8 this would lead to $O^{read}_{BB} = \langle (3, 3)^T, (9, 6)^T \rangle$.

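  Inside each firing block, the BB actor also uses raster-scan order, leading to the read order shown in the corresponding figure. Note that by means of this notation, also $k_i^e$ of Section 5.2 is implicitly given as follows:

  $$\langle B^{src}_{q_{src}}, e_i \rangle = k_i^e \times \frac{\operatorname{scm}(\langle p, e_i \rangle, \langle v, e_i \rangle)}{\langle p, e_i \rangle},$$

  $$\langle B^{snk}_{q_{snk}}, e_i \rangle = k_i^e \times \frac{\operatorname{scm}(\langle p, e_i \rangle, \langle v, e_i \rangle)}{\langle v, e_i \rangle} \times \langle r_{vt}(e), e_i \rangle .$$

  $B^{src}_{q_{src}}$ is the largest firing block describing the write operation on edge e ∈ E.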
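  The hierarchical raster-scan order described by a sequence of two firing blocks can be sketched as follows. This is an illustration only: the function name `hierarchical_raster_order` and the convention that the first component counts invocations in the horizontal direction are assumptions, and the sketch handles exactly two nesting levels.

```cpp
#include <utility>
#include <vector>

using Pos = std::pair<int, int>;  // (x, y) position of one actor invocation

// Enumerates all invocations of the outer firing block. The inner blocks are
// visited in raster-scan order, and the invocations inside each inner block
// are visited in raster-scan order as well.
// Returns an empty sequence if the inner block does not tile the outer one.
std::vector<Pos> hierarchical_raster_order(Pos inner, Pos outer) {
    std::vector<Pos> order;
    if (inner.first <= 0 || inner.second <= 0 ||
        outer.first % inner.first != 0 || outer.second % inner.second != 0) {
        return order;
    }
    for (int by = 0; by < outer.second; by += inner.second) {
        for (int bx = 0; bx < outer.first; bx += inner.first) {
            for (int y = 0; y < inner.second; ++y) {
                for (int x = 0; x < inner.first; ++x) {
                    order.emplace_back(bx + x, by + y);
                }
            }
        }
    }
    return order;
}
```

  For the firing blocks $(3, 3)^T$ and $(9, 6)^T$ mentioned above, calling `hierarchical_raster_order({3, 3}, {9, 6})` produces 54 positions: the first nine cover the upper left 3 × 3 block, and the tenth position is the first invocation of the block to its right.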

5 It could be imagined to relax this condition in order to also support incomplete code-blocks, which can occur in JPEG2000 compression for certain relations between code-block and image size. As this, however, complicates both analysis and synthesis, it is not further detailed.

5.4 Communication Control

  In other words, it establishes a relation between the individual actor invocations or firings and the occurring read and write operations. Corresponding details can be found in Section 5.4.2.

5.4.1 Multidimensional FIFO

  It forwards the incoming data from the multidimensional port $i_1$ to either the multidimensional port $o_1$ or $o_2$, depending on the control value present on the one-dimensional port $i_c$. The size of both the effective token and the sliding window is defined by the corresponding … The data dependency of the actor is expressed in the state machine by taking into account the value of the control token: i_c(1) && (i_c[0] == 0) evaluates to true if at least one token is available on the one-dimensional channel associated with port $i_c$ and if the value of the first token is zero.
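  The evaluation of such a data-dependent guard can be mimicked in plain C++. The sketch below does not use the SysteMoC API; `ControlChannel`, `Route`, and `select_route` are hypothetical names. The mapping of the control value 0 to port $o_1$ and of any other value to $o_2$ is an assumption made for illustration.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the one-dimensional control channel at port i_c.
struct ControlChannel {
    std::deque<int> tokens;

    // Models i_c(n): true if at least n tokens are available.
    bool available(std::size_t n) const { return tokens.size() >= n; }

    // Models i_c[k]: value of the k-th available token (not consumed).
    int peek(std::size_t k) const { return tokens.at(k); }
};

enum class Route { None, ToO1, ToO2 };

// Evaluates the guards of the two transitions. No transition is enabled
// as long as the control channel is empty.
Route select_route(const ControlChannel& ic) {
    if (ic.available(1) && ic.peek(0) == 0) {
        return Route::ToO1;  // assumption: control value 0 selects o_1
    }
    if (ic.available(1) && ic.peek(0) != 0) {
        return Route::ToO2;  // assumption: any other value selects o_2
    }
    return Route::None;
}
```

  The availability check comes first in each condition, so the token value is only inspected when a token actually exists.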

  … They consume the requested data from the input FIFO connected to the input port $i_1$ and write it to the FIFO attached to the corresponding output port.

5.5 Windowed Synchronous Data Flow (WSDF)

  Although this presents a great benefit compared to the related work presented in Sections 3.1.3 and 3.1.4, it comes at the price of difficult analysis and verification techniques. Unfortunately, this contradicts the user’s needs identified in Section 2.5, such as system level analysis for determination of required buffer sizes as well as formal validation of the data flow specification.

6 More precisely, pos() returns the value of the hierarchical iteration vector defined later

  5.13a, this kind of model is supported by the buffer analysis presented in Chapter 7. For precise distinction between the general windowed data flow and its static variant, the latter is called windowed synchronous data flow (WSDF).

5.6 WSDF Balance Equation

  Implementation of a WSDF graph is only possible if it can be executed within finite memory

  … $\langle a_1, a_2, \ldots, a_k \rangle$. The current read and write positions are indirectly given by the invocation orders as defined in Section 5.3.

7 If such a schedule sequence cannot be found, the number of data elements stored in the edge

  This leads to the following definition: Definition 5.10 (Periodic WSDF Schedule) A periodic WSDF schedule is a finite schedule that invokes each actor of the graph G at least once and that returns the graph into its initial state. In other words, it must not cause a net change in the number of stored data elements on each edge e ∈ E, and both the source and the sink actor have to return to the equivalent of their initial firing positions.

5.6.1 Derivation of the WSDF Balance Equation

  Since an actor might have several inputs and outputs, care has to be taken that communication is valid on all edges e ∈ E of the WSDF graph. In order to ensure that an actor a ∈ A reads complete virtual tokens on each input, the minimal actor period $\langle L(a), e_i \rangle$ represents the minimal number of invocations in dimension i:

  $$\forall\, 1 \le i \le n : \quad L_i(a) := \langle L(a), e_i \rangle = \operatorname{scm}_{\operatorname{snk}(e) = a} \left( \langle r_{vt}(e), e_i \rangle \right) .$$
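  The minimal actor period is the smallest common multiple of the read counts per virtual token over all input edges of an actor, taken separately for each dimension. A minimal sketch, assuming C++17 and using the hypothetical function name `minimal_actor_period`:

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

// r_vt_inputs holds, for every input edge of one actor, the number of read
// operations per virtual token in each of the N dimensions.
// The result is the smallest common multiple per dimension. An actor
// without inputs gets the period 1 in every dimension.
template <std::size_t N>
std::array<long long, N> minimal_actor_period(
    const std::vector<std::array<long long, N>>& r_vt_inputs) {
    std::array<long long, N> period;
    period.fill(1);
    for (const auto& r_vt : r_vt_inputs) {
        for (std::size_t i = 0; i < N; ++i) {
            period[i] = std::lcm(period[i], r_vt[i]);
        }
    }
    return period;
}
```

  For an actor with two inputs that require (4, 6) and (6, 4) read operations per virtual token, the sketch returns the period (12, 12), so that complete virtual tokens are read on both inputs.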

8 WSDF currently requires that virtual tokens are produced and consumed completely. In other words, it is not allowed to process only part of a virtual token

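  Then, the number of actor invocations in dimension i in order to return the WSDF graph to a state equivalent to the initial one can be calculated by

  $$r_i = M_i \times q_i, \quad \text{with} \quad M_i = \begin{bmatrix} L_i(a_1) & 0 & \cdots & 0 \\ 0 & L_i(a_2) & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & L_i(a_{|A|}) \end{bmatrix} .$$

  Both $r_i$ and $q_i$ are strictly positive integer vectors. $r_i(a_j) := \langle r_i, e_j \rangle$ is the number of invocations of actor $a_j$ in dimension i. $q_i(a_j) := \langle q_i, e_j \rangle$ defines how often the minimal period of actor $a_j$ in dimension i is executed and can be calculated by

  $$\Gamma_i \times q_i = 0, \quad \text{with} \quad \Gamma_i = \begin{bmatrix} \Gamma_{1,1,i} & \cdots & \Gamma_{1,a,i} & \cdots & \Gamma_{1,|A|,i} \\ \vdots & & \vdots & & \vdots \\ \Gamma_{e,1,i} & \cdots & \Gamma_{e,a,i} & \cdots & \Gamma_{e,|A|,i} \\ \vdots & & \vdots & & \vdots \\ \Gamma_{|E|,1,i} & \cdots & \Gamma_{|E|,a,i} & \cdots & \Gamma_{|E|,|A|,i} \end{bmatrix} .$$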

  For each dimension i, the smallest possible integer vector $q_i^{min}$ defines the minimal repetition vector $r_i^{min}$ whose components correspond to the minimal number of actor invocations that are necessary to return the WSDF graph into a state equivalent to the initial one. In other words, each actor $a_j \in A$ has to fire $k_i \times \langle r_i^{min}, e_j \rangle$ times in dimension i, with $k_i \in \mathbb{N}$.
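  Because the matrix that maps $q_i$ to $r_i$ is diagonal, the repetition vector is obtained by scaling each component of $q_i$ with the minimal period of the corresponding actor. The sketch below illustrates this for one dimension. The function name `repetition_vector` is hypothetical, and the numbers used in the usage note are illustrative assumptions rather than results taken from the book.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Computes k * M_i * q_i for a diagonal matrix M_i whose diagonal holds the
// minimal actor periods L_i(a_j). k = 1 yields the minimal repetition vector
// when q_i is the smallest integer solution of the balance equation.
std::vector<long long> repetition_vector(const std::vector<long long>& periods,
                                         const std::vector<long long>& q,
                                         long long k = 1) {
    if (periods.size() != q.size()) {
        throw std::invalid_argument("periods and q must have the same size");
    }
    std::vector<long long> r(q.size());
    for (std::size_t j = 0; j < q.size(); ++j) {
        r[j] = k * periods[j] * q[j];
    }
    return r;
}
```

  With assumed periods (1, 8, 4092, 4092) and an assumed vector q = (2048, 32, 1, 1), the sketch returns (2048, 256, 4092, 4092). Passing k = 2 doubles every component, which corresponds to executing the minimal period twice.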

5.6.2 Application to an Example Graph

  Fig. 5.15 Example WSDF graph for illustration of the balance equation

  The graph behavior is borrowed from image processing, but generalized in some aspects in order to show all elements of the WSDF model.

1 The size of the second image shall be half the width and is cut into sub-images of 64 ×

  Actor $a_2$ takes these blocks, performs a complex downsampling in vertical direction and an upsampling in horizontal direction, and combines the resulting blocks to an image with 4096 × 1024 pixels. Furthermore, it consumes one image of actor a.

Actor Periods

  The corresponding results are shown in Tables 5.1 and 5.2.

  Table 5.1 Repetition counts for consumption of a single virtual token, based on the graph shown in Fig. 5.15

  The topology matrices for both dimensions allow the calculation of the vectors q.

  Table 5.2 Minimal actor periods for the example graph shown in Fig. 5.15

  Actor   a1   a2   a3     a4
  L_1     1    8    4092   4092
  L_2     1    16   1022   1022

  $$r_1^{min} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 4092 & 0 \\ 0 & 0 & 0 & 4092 \end{bmatrix} \times q_1^{min} = \begin{pmatrix} 2048 \\ 256 \\ 4092 \\ 4092 \end{pmatrix}$$

  $$r_2^{min} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 16 & 0 & 0 \\ 0 & 0 & 1022 & 0 \\ 0 & 0 & 0 & 1022 \end{bmatrix} \times q_2^{min} = \begin{pmatrix} 2048 \\ 512 \\ \vdots \end{pmatrix}$$

  Similarly, a choice of the firing-block sequences $O_{a_1}$, $O_{a_2}$, $O_{a_3}$, and $O_{a_4}$ in which actor $a_4$ does not consume all data elements is not allowed, since it leads to unbounded token accumulation on the corresponding edge.

5.7 Integration into SystemCoDesigner

  The latter is part of the SystemCoDesigner described in Chapter 4 and has been exemplarily extended in order to describe a possible implementation strategy of the previously introduced WDF model of computation. For the final hardware implementation, however, a more efficient memory management can be used as explained in Chapter 8.

5.8 Application Examples

  This section addresses more complex applications in order to illustrate the benefits of the model of computation. In particular, it shows that the restriction to rectangular windows and hierarchical line-based communication orders does not impact the applicability to less regular algorithms.

5.8.1 Binary Morphological Reconstruction

The binary morphological reconstruction [ 294 ] is an algorithm that extracts objects from monochrome images. It has been selected as case study because it is easily understandable,

9 Due to implementation details, the current read or write position within an action is returned via

  On the other hand, in the definition of the communication state machine, …

    class m_wdf_rot180_ex : public smoc_actor {
    public:
      /* Ports */
      smoc_md_port_in<int, 2> in;
      smoc_md_port_out<int, 2> out;
    private:
      /* Internal states */
      const unsigned size_x;
      const unsigned size_y;
      /* Action: copy one pixel to its position rotated by 180 degrees */
      void copy_pixel() {
        unsigned int x = out.iteration(0, 0);
        unsigned int y = out.iteration(0, 1);
        out[0][0] = in[size_x - x - 1][size_y - y - 1];
      }
      /* States of communication state machine */
      smoc_firing_state start;
    public:
      m_wdf_rot180_ex(sc_module_name name,
                      unsigned size_x, unsigned size_y)
        : smoc_actor(name, start),
          in(ul_vector_init[1][1], // comm. …

Definition of the Binary Morphological Reconstruction

  The binary morphological reconstruction is fed with two equally sized w × h monochrome input images, the mask image $I_m : D \to \{0, 1\}$ and the marker or seed image $I_s : D \to \{0, 1\}$, where $D = \{1, \ldots, w\} \times \{1, \ldots, h\}$.

Calculation by Iterative One-Dilatation
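  A minimal sketch of the reconstruction by iterative one-dilatation, given the mask and seed images defined above. It is not the book's Algorithm 1: the names `Image`, `dilate_under_mask`, and `reconstruct` are hypothetical, and the sketch assumes a 3 × 3 neighborhood (8-connectivity), equally sized images, and that a pixel may only be set where the mask image is 1.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Image = std::vector<std::vector<int>>;  // binary image, one row per entry

// One dilatation of the seed by a 3x3 neighborhood, restricted to the mask.
Image dilate_under_mask(const Image& seed, const Image& mask) {
    const std::size_t h = seed.size();
    const std::size_t w = h > 0 ? seed[0].size() : 0;
    Image out(h, std::vector<int>(w, 0));
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            if (mask[y][x] == 0) continue;  // never grow outside the mask
            const std::size_t y0 = y > 0 ? y - 1 : 0;
            const std::size_t y1 = y + 1 < h ? y + 1 : y;
            const std::size_t x0 = x > 0 ? x - 1 : 0;
            const std::size_t x1 = x + 1 < w ? x + 1 : x;
            for (std::size_t ny = y0; ny <= y1; ++ny) {
                for (std::size_t nx = x0; nx <= x1; ++nx) {
                    if (seed[ny][nx] == 1) out[y][x] = 1;
                }
            }
        }
    }
    return out;
}

// "Repeat until" loop: dilate until the image does not change any more.
Image reconstruct(Image seed, const Image& mask) {
    for (;;) {
        Image next = dilate_under_mask(seed, mask);
        if (next == seed) return seed;
        seed = std::move(next);
    }
}
```

  The loop terminates because, after the first step, the image is contained in the mask and can only grow. The number of iterations depends on the image content, which is why the iteration count cannot be fixed statically.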

  It extends all components of the seed image $I_s$ by a one-dilatation and cuts all pixels that overlap the mask $I_m$. This is necessary in order to model the “repeat until” loop contained in Algorithm 1.

Iterative Dilatation with Two Passes
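  A backward pass over an image can be expressed as a forward pass over the same image rotated by 180°. The following sketch shows such a rotation on a plain two-dimensional array. The names `Image` and `rotate180` are hypothetical, and the sketch is independent of the actor-based implementation.

```cpp
#include <algorithm>
#include <vector>

using Image = std::vector<std::vector<int>>;  // one row per entry

// Rotates the image by 180 degrees: the pixel at (x, y) moves to
// (width - 1 - x, height - 1 - y).
Image rotate180(Image img) {
    std::reverse(img.begin(), img.end());  // reverse the order of the rows
    for (auto& row : img) {
        std::reverse(row.begin(), row.end());  // reverse each row
    }
    return img;
}
```

  Applying `rotate180` twice returns the original image, so the result of the backward pass can be brought back into the original orientation by the same operation.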

  Let $I_{2i+0}$ be the input image in the ith iteration of the forward pass and the initial image be the seed image, i.e., $I_0 = I_s$.

Backward Pass

  In order to deliver correct results, the forward pass has to be followed by a second run with a seed and mask image rotated by 180°.

FIFO-Based Morphological Reconstruction

  As can be seen from Algorithm 2, the order by which the pixels of the mask and the seed image are accessed is data dependent and hence cannot be predicted. Whereas the mask image is directly accessed from the corresponding input port using non-consuming reads, the seed image can be retrieved from the output buffer, since the contained data have not yet been forwarded to the sink.

5.8.2 Lifting-Based Wavelet Kernel

  The input image enters the WT1 module on edge $e_1$ and is decomposed into a high-pass part on edge e and a low-pass part on edge e. As soon as this is true, the data are read from edge $e_4$ and decomposed into a high-pass part on edge $e_7$ and a low-pass part on edge e.
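  The decomposition of a signal into a low-pass and a high-pass part by lifting can be sketched in one dimension. The text does not state which wavelet kernel the WT modules use, so the sketch assumes the reversible 5/3 lifting steps known from JPEG2000, with symmetric extension at the borders and an even number of samples. The names `Subbands`, `floor_div`, and `lifting53` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Subbands {
    std::vector<int> low;   // low-pass coefficients
    std::vector<int> high;  // high-pass coefficients
};

// Integer division rounding toward negative infinity; b must be positive.
inline int floor_div(int a, int b) {
    int q = a / b;
    if (a % b != 0 && a < 0) --q;
    return q;
}

// One decomposition level. Returns empty subbands for unsupported lengths.
inline Subbands lifting53(const std::vector<int>& x) {
    Subbands out;
    const std::size_t n = x.size();
    if (n < 2 || n % 2 != 0) return out;
    const std::size_t half = n / 2;
    out.low.resize(half);
    out.high.resize(half);
    // First lifting stage (predict): odd samples minus the mean of their
    // even neighbors. The right border mirrors x[n] onto x[n - 2].
    for (std::size_t k = 0; k < half; ++k) {
        const int left = x[2 * k];
        const int right = (2 * k + 2 < n) ? x[2 * k + 2] : x[n - 2];
        out.high[k] = x[2 * k + 1] - floor_div(left + right, 2);
    }
    // Second lifting stage (update): even samples plus a rounded quarter of
    // the neighboring high-pass values. The left border mirrors high[-1]
    // onto high[0].
    for (std::size_t k = 0; k < half; ++k) {
        const int d_prev = out.high[k == 0 ? 0 : k - 1];
        out.low[k] = x[2 * k] + floor_div(d_prev + out.high[k] + 2, 4);
    }
    return out;
}
```

  The second stage depends on the results of the first one, which mirrors the chained lifting stages of the module. For a constant input signal, all high-pass coefficients are zero and the low-pass part reproduces the constant.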

5 In contrast to the previous cases, this time WT1 and WT2 contain a communication state

  In case of the manually designed JPEG2000 encoder (see Section 2.2), it has been preferred to only interleave lines in order to reduce the number of required multiplexers. Such a model would have allowed the architecture to be verified at a higher level of abstraction than the RTL used in the manual implementation.

5.9 Limitations and Future Work

  The input pixels enter the first lifting stage via edge $e_1$ and are forwarded to the second stage by means of edge e. In addition to these parametric applications, further potential for improvement can also be found in allowing for more complex out-of-order communication than what has been presented in Section 5.3.

Chapter 6 Memory Mapping Functions for Efficient Implementation of WDF Edges

  Selection of a good memory architecture is a crucial step during embedded system design

  Although the image sizes are rather small (176 × 144), and despite an additional external SD-RAM, SRAM occupies approximately a quarter of the overall chip, which is a significant portion. First, it helps avoid external memory, which increases achievable throughput and simplifies the layout of the printed circuit board due to the reduced number of required chips.
