% Notes from parallel intermediate representation workshop
%
% 2013-01-17

Previous work
=============

VCODE is not relevant for the project (its ideas, and more, are
implemented in LLVM).

PENCIL might be relevant, but we don't know much about it.

Current PIRs in our group
=========================

Feldspar
--------

Imperative representation from
[feldspar-compiler](https://github.com/Feldspar/feldspar-compiler):

* Close to C
* Decisions about parallelization have already been made

Parallel constructs:

* `ParLoop`
    - Currently compiled to a for loop
    - Parallelizing it would have to assume independent iterations.
      This can be guaranteed by the user language.

Programs using the Par monad and futures are just compiled to procedure
calls in the IR. The IR is completely unaware of these constructs.

Obsidian
--------

[Repository](https://github.com/svenssonjoel/MonadObsidian)

IR designed to target GPUs:

* Hierarchical, scalable architecture: grids of blocks of threads
    - This hierarchy seems like it will remain in the future
* Memory hierarchy

Two parallel for loops:

* One for coarse-grained and one for fine-grained parallelism
  (reflecting the GPU hierarchy)

The IR uses higher-order abstract syntax.

Koen's toy PIR
--------------

It is important to be able to transform the IR. The IR must allow the
low-level details "we care about" to be controlled.

The proposed IR:

* Aims to be compiled to CUDA
* Parallel constructs: for loop, composition

The IR allows flattening transformations (see the sketch below):

* Nested parallelism can be compiled as is if the target supports it,
  or flattened if the target does not support it
* Well-suited for heterogeneous targets
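A minimal Haskell sketch of such an IR and the flattening pass. All
names here (`Exp`, `Stmt`, `ParFor`, `flatten`, ...) are invented for
illustration; this is not Koen's actual implementation:

```haskell
type Var = String

data Exp
  = IntLit Int
  | VarE Var
  | Add Exp Exp
  | Mul Exp Exp
  | Div Exp Exp
  | Mod Exp Exp
  deriving Show

data Stmt
  = Assign Var Exp       -- simplified: scalar assignment only
  | Seq Stmt Stmt        -- sequential composition
  | ParFor Var Exp Stmt  -- ParFor i n body: body for i = 0..n-1, in parallel
  deriving Show

-- Flatten one level of nesting: a parallel loop of size m directly
-- containing a parallel loop of size n becomes a single parallel loop
-- of size m*n, recovering the original indices with div and mod.
-- Iterations must be independent for this to be valid.  A target that
-- supports nested parallelism directly (e.g. CUDA blocks of threads)
-- can skip this pass.
flatten :: Stmt -> Stmt
flatten (ParFor i m (ParFor j n body)) =
  ParFor k (Mul m n)
    (Assign i (Div (VarE k) n) `Seq`
     Assign j (Mod (VarE k) n) `Seq`
     flatten body)
  where k = i ++ "_" ++ j  -- crude fresh-name generation, for the sketch only
flatten (ParFor i n body) = ParFor i n (flatten body)
flatten (Seq s1 s2)       = Seq (flatten s1) (flatten s2)
flatten s                 = s
```

For example, flattening `ParFor "i" m (ParFor "j" n body)` yields one
loop of `m*n` iterations for a flat target, while a hierarchical target
like CUDA could instead map the outer loop to blocks and the inner one
to threads.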
Important aspects when designing the IR:

* Memory management: reuse of memory (hard in a functional language)
* Filters: hard to do in a deterministic language ("hard to make
  functional")
* We need either a functional language with "imperative promises" or
  an imperative language with "functional promises"
* A functional language (with controlled use of imperative features)
  is easier to transform than an imperative one

Discussion:

* Nikita: [Polyhedral
  compilation](http://en.wikipedia.org/wiki/Polytope_model) is a better
  way of merging and splitting parallelism.
* Koen: But we need to be able to control low-level details before we
  go for automatic approaches.
* Koen: The key question is: What details do we care about? It will be
  hard to agree on that.

Purpose of the PIR
==================

Josef: We also need to think about task parallelism (e.g. from the Par
monad in Feldspar) and pipeline parallelism (important in Feldspar's
domain).

* Synchronous streams can help model pipeline parallelism, but they
  don't seem to allow the flexibility needed for an efficient
  implementation

What kinds of back ends are we considering?

* We need to target both GPUs and CPUs, and combinations thereof
* CUDA/OpenCL
* We need task parallelism, but it's a waste to use task parallelism
  to simulate other forms of parallelism

Suggestion: Focus on data parallelism and pipeline parallelism; task
parallelism can always be slapped on top.

Scope of Ivar's thesis
----------------------

Focus on data parallelism:

* Pipeline parallelism is future work

Targets:

* Single-core CPU (C)
* Multicore CPU (one of: C + pthreads, OpenMP, OpenCL)
* GPU (one of: CUDA, OpenCL)
* Heterogeneous systems are future work

What should the user be able to control?

* How many threads there are and what they should do
* Memory reuse
* Where in memory data is located (regions)

The IR should probably be imperative:

* We can have other functional IRs higher up

Implement a mini Obsidian and a mini Feldspar and compile them to the
IR.

Examples (written in the mini DSLs or directly in the IR):

* Tiny ones like append, etc.
* FFT
* Matrix multiplication
* Sorting
* Filter (as in `Data.List.filter`, because it requires tricks on the
  GPU; see the compaction sketch at the end of these notes)
* FIR filter, CRC
    - If there are good/simple ways of parallelizing these
* Stencil computation

Action: Koen should send his initial implementation to Ivar.
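On the GPU trick needed for filter: the output length is data
dependent, so filter cannot be compiled to a simple parallel loop. The
standard approach is stream compaction: map the predicate to 0/1
flags, take an exclusive prefix sum of the flags to get each kept
element's write position, then scatter. Below is a sequential Haskell
model of that algorithm (illustrative only; `filterPar` and the
list-based encoding are made up for these notes):

```haskell
import Data.Array (array, elems)

-- Data-parallel filter ("stream compaction"), modelled sequentially.
-- Each step is a flat data-parallel primitive (map, scan, scatter),
-- which is why this formulation suits GPUs.
filterPar :: (a -> Bool) -> [a] -> [a]
filterPar p xs = elems (array (0, total - 1) writes)
  where
    flags  = [ if p x then 1 else 0 | x <- xs ] :: [Int]  -- map: keep-flags
    idxs   = init (scanl (+) 0 flags)            -- exclusive prefix sum
    total  = sum flags                           -- number of kept elements
    writes = [ (i, x)                            -- scatter kept elements
             | (x, f, i) <- zip3 xs flags idxs, f == 1 ]

-- e.g. filterPar even [1..10 :: Int] == [2,4,6,8,10]
```

On a real GPU the prefix sum is itself a well-known parallel primitive
(a work-efficient scan), so every stage of the pipeline stays parallel.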