Get Complete Project Material File(s) Now! »
Explicit token matching shifts the challenges in hardware design to compilation
In the non-preemptive thread model, each thread runs to completion once started. The data and control dependences need to be resolved and satised at partitioning stage. Unlike tagged-token data ow machines, the tokens are stored separately in the Frame Memory, and use an explicit token matching strategy in TSTAR Tagged-token data ow machines are implemented through a sophisticated matching hardware, which dynamically schedules operations with available operands Arvind & Culler [1986b] Memo & Culler [1983]. When a token arrives at a processor, the tag it carries is checked against the tags present in the token-store. If a matching token is found, it is extracted and the corresponding instruction is enabled for execution; otherwise, the incoming token is added to the store. This allows for a simple non-blocking processor pipeline that can overlap instructions from closely related or completely unrelated computations. However, the matching operation involves considerable complexity on the critical path of instruction scheduling Gajski et al. [1982].
The more subtle problem with the matching paradigm is that a failure to nd a match implicitly allocates resources within the token store. Thus in mapping computations to processors places an unspecied commitment on the token storage hardware, if this resource becomes overcommited, the program may deadlock.
Explicit Token Matching shifts the complications from the hardware design of token-matching system to the compiler. The compiler has to match the producers and consumers explicitly. In TSTAR execution model, as producer data ow threads communicate by writing directly in the data ow frame of their consumers, it is necessary that, along all data dependence edges of the program dependence graph, the producer nodes know the data ow frame of the consumer nodes. It becomes more complicates when the producer and consumer are created by separate control programs, which means there needs to be a way to communicate such information. We address this issue in chapter 5.
The complex data structure should be han- dled in an ecient way
There are two basic approaches to represent data structures such as arrays: direct and indirect access schemes Gaudiot & Wei [1989]. The direct access scheme treats each structure element as individual data tokens. The data ow functionality principle implies that all operations are side-eect free. The direct access scheme is compatible with the data ow principle. However, absence of side eect also means the token carries vectors, arrays or other complex data structures results in a new data structure, which will greatly increase the communication overhead in practice.
On the other hand, in an indirect access scheme, data structures are stored in special memory units and their elements are access through \read » and \write » operations. Literature has shown that storing the data structures and representing them by indirect access incurs less overhead that the direct access scheme in transmitting data, reconstructing the resultant array, and randomly accessing array elements.
However, the indirect access approach is not free of charge. When the array elements are stored in a separate storage, to preserve the functionality principle of data ow, a write to a single element in the array might result in copying the entire array. If the array elements are stored in a virtual memory that is global addressable to all nodes, the consistent view of the virtual memory to related threads becomes a demanding task. We address this problem in Chapter 6.
Compiling imperative programs to data- ow threads
The problem of compiling imperative programs for data- ow execution has been widely studied. Beck et al. Beck et al. [1989] propose a method for translating control ow to data ow, and show that data- ow graphs can serve as an executable intermediate representation in parallelizing compilers. Ottenstein et al. Ottenstein et al. [1990] study such a translation using the Program Dependence Web, an intermediate representation based on gated-SSA Tu & Padua [1995] that can directly be interpreted in either control-, data-, or demand-driven models of execution. Programs are transformed to the MIT data ow program graph Arvind & Nikhil [1990], targeting the Monsoon architecture. Najjar et al. evaluated multiple techniques for extracting thread-level data- ow Najjar et al. [1994]. These papers target a token-based, instruction-level data- ow model, analogous to the simulation of hardware circuits. In contrast, our data- ow model does not require tokens or associative maps, shifting the eort of explicitly assign the consumer threads to their producers to the compiler.
The comparison between our approach and the token-based solution is further discussed in chapter 5. In addition, thread-level data- ow requires additional eorts to coarsen the grain of concurrency, handling the dynamic creation of threads, and managing their activation records (data- ow frames).
SSA as an intermediate representation for data- ow compilation
The static single assignment form (SSA) is formally equivalent to a well-behaved subset of the continuation-passing style (CPS) Appel [1998]; Kelsey [1995] model, which is used in compilers for functional languages such as Scheme, ML and Haskell. The data- ow model has been tied closely to functional languages, since the edges in a data- ow graph can be seen both as encoding dependence information as well as continuation in a parallel model of execution. The SSA representation builds a bridge between imperative languages and the data- ow execution model. Our algorithm uses the properties of the SSA to streamline the conversion of general control ow into thread-level data- ow.
Decoupled software pipelining
Closely related to our work, and in particular to our analysis framework, is the decoupled software pipelining (DSWP) technique Ottoni et al. [2005]. It partitions loops into long-running threads that communicate via inter-core queues, following the execution model of Kahn process networks Kahn [1974]. DSWP builds a Program Dependence Graph (PDG) Ferrante et al. [1987], combining control and data dependences (scalar and memory). In contrast to DOALL and DOACROSS Cytron [1986] methods which partition the iteration space into threads, DSWP partitions the loop body into several stages connected with pipelining to achieve parallelism. It exposes parallelism in cases where DOACROSS is limited by loop-carried dependences on the critical path, it handles uncounted loops, complex control ow and irregular pointer-based memory accesses. Parallel- Stage Decoupled Software Pipelining (PS-DSWP) Raman et al. [2008] is an extension to combine pipeline parallelism with some stages executed in a DOALL, data-parallel fashion. For example, when there are no dependences between loop iterations of a DSWP stage, the incoming data can be distributed over multiple data-parallel worker threads dedicated to this stage, while the outgoing data can be merged to proceed with downstream pipeline stages.
These techniques have a few caveats however. They oer limited support for decoupling along backward control and data dependences. They provide a complex yet somewhat conservative code generation method to decouple dependences between source and target statements governed by dierent control ow.
EARTH thread partitioning
Another thread partitioning method closely related to our work is used in EARTH architecture compilation technique Tang et al. [1997]. The thread partitioning method operates on the Program Dependence Flow Graph (PDFG). A PDFG graph could be built based on the Program Structure Tree (PST) representation. A PST tree of a program is a hierarchical representation of the control structure of the program (similar to the control dependence graph we used in the thesis). In the PST representation, all the nodes control dependent on the same node belongs to the same region. The PDFG is built upon PST by introducing data dependences inside each basic region. The cross region data dependences is promoted up to the same region level.
Our thread partitioning method presents in Chapter 5 diers in several aspects:
In the EARTH partitioning method, the promotion of cross region dependences creates extra overhead upon each promotion | the outer regions needs to create continuations to wait for the data passing from the inner region. When the destination region get the data passing from the source, it still needs to pass the data down to the lower region where the actual dependence happens.
Our method, instead of dependence promotion, we reserve the point to point cross region dependence, passing the consumer’s thread information to its producer by building and analyzing the DataFlow Program Dependence Graph (DF-PDG). A detailed algorithm is described in Chapter 5.2.5.
Our method operates on the SSA-PDG form. There are two main reasons for relying on a SSA-based representation of the PDG: 1. SSA is formally equivalent to a well-behaved subset of the continuation-passing style model. By making the reaching denitions unique for each use, eectively converting the scalar ow into a functional program. 2. Reducing the complexity of analyzing def-use chains from O(N2) to O(N), as the number of def-use edges can become very large, sometimes quadratic in the number of nodes.
Formalization of the thread partitioning cost model
Sarkar Sarkar [1989] has formally dened the macro-data ow model and developed the cost model of the partitioning method targeting this data ow architecture, shows that the problem of determining the optimal partition, with the smallest execution time is NP-complete in the strong sense. Tang Tang & Gao [1999] Tang et al. [1997] has formalized the EARTH data ow execution model and developed the cost model for this architecture, shows the thread partitioning problem is NP complete. The TSTAR data ow architecture is similar to EARTH. The formalization is not our focus in this thesis since it has already been proven. Thus, in this thesis, we provide a simple heuristic algorithm for coarsening the granularity of the data ow threads.
Table of contents :
Contents
List of Figures
Nomenclature
1 Introduction
1.1 Hybrid Data ow for Latency Tolerance
1.1.1 Convergence of data ow and von Neumann
1.1.2 Latency Tolerance
1.1.3 TSTAR Multithreaded Data ow Architecture
1.2 Task Granularity
1.3 Motivation
1.4 Dissertation Outline
2 Problem Statement
2.1 Explicit token matching shifts the challenges in hardware design to compilation
2.2 The complex data structure should be handled in an ecient way
2.3 Related Work
2.3.1 Compiling imperative programs to data- ow threads
2.3.2 SSA as an intermediate representation for data- ow compilation
2.3.3 Decoupled software pipelining
2.3.4 EARTH thread partitioning
2.3.5 Formalization of the thread partitioning cost model .
3 Thread Partitioning I: Advances in PS-DSWP
3.1 Introduction
3.1.1 Decoupled software pipelining
3.1.2 Loop distribution
3.2 Observations
3.2.1 Replacing loops and barriers with a task pipeline
3.2.2 Extending loop distribution to PS-DSWP
3.2.3 Motivating example
3.3 Partitioning Algorithm
3.3.1 Denitions
3.3.2 The algorithm
3.4 Code Generation
3.4.1 Decoupling dependences across tasks belonging to dierent treegions
3.4.2 SSA representation
3.5 Summary
4 TSTAR Data ow Architecture
4.1 Data ow Execution Model
4.1.1 Introduction
4.1.2 Past Data-Flow Architectures
4.2 TSTAR Data ow Execution Model
4.2.1 TSTAR Multithreading Model
4.2.2 TSTAR Memory Model
4.2.3 TSTAR Synchronization
4.2.4 TSTAR Data ow Instruction Set
4.3 TSTAR Architecture
4.3.1 Thread Scheduling Unit
5 Thread Partitioning II: Transform Imperative C Program to Data ow Program
5.1 Revisit TSTAR Data ow Execution Model
5.2 Partitioning Algorithms
5.2.1 Loop Unswitching
5.2.2 Build Program Dependence Graph under SSA
5.2.3 Merging Strongly Connected Components
5.2.4 Typed Fusion
5.2.5 Data Flow Program Dependence Graph
5.3 Modular Code Generation
5.4 Implementation
5.5 Experimental Validation
5.6 Summary
6 Handling Complex Data Structures
6.1 Streaming Conversion of Memory Dependences (SCMD)
6.1.1 Motivating Example
6.1.2 Single Producer Single Consumer
6.1.3 Single Producer Multiple Consumers
6.1.4 Multiple Producers Single Consumer
6.1.5 Generated Code for Motivating Example
6.1.6 Discussion
6.2 Owner Writable Memory
6.2.1 OWM Protocol
6.2.2 OWM Extension to TSTAR
6.2.3 Expressiveness
6.2.4 Case Study: Matrix Multiplication
6.2.5 Conclusion and perspective about OWM
6.3 Summary
7 Simulation on Many Nodes
7.1 Introduction
7.2 Multiple Nodes Data ow Simulation
7.3 Resource Usage Optimization
7.3.1 Memory Usage Optimization
7.3.2 Throttling
7.4 Experimental Validation
7.4.1 Experimental Settings
7.4.2 Experimental Results
7.4.2.1 Gauss Seidel
7.4.2.2 Viola Jones
7.4.2.3 Sparse LU
7.5 Summary
8 Conclusions and Future Work
8.1 Contributions
8.2 Future Work
Personal Publications
References