Introduction To Parallel Processing
OUTLINE
Moore's Law
Different performance enhancement techniques
A simple model of parallel processing
Performance evaluation of parallelized programs
Classification of parallel programming - distributed and shared memory
Message Passing Interface
Synchronization and communication
Load balancing and granularity
Examples
Applications
Further research
Summary
MOORE'S LAW
Chip performance doubles every 18-24 months.
Power consumption is proportional to frequency.
Limits of serial computing: heating issues, limits to transmission speeds, leakage currents, limits to miniaturization.
Multi-core processors are already commonplace; most high-performance servers are already parallel.
PERFORMANCE ENHANCEMENT TECHNIQUES
Pipelining
Superscalar architecture
Out-of-order execution
Caches
Instruction set design advancements
Parallelism: multi-core processors, clusters, grids
PIPELINING
Illustration of a pipeline using the fetch, load, execute, and store stages. At the start of execution the pipeline winds up, i.e. the stages fill one by one before every unit is kept busy.
CACHE
There is a desire for fast, cheap, and non-volatile memory, yet memory performance has grown at roughly 7% per annum while processor performance has grown at roughly 50% per annum; caches bridge this gap.
PARALLELISM - A SIMPLISTIC UNDERSTANDING
Multiple tasks run at once: the work is distributed over multiple execution units. There are two broad approaches, illustrated after this list.
Data parallelism - divide the dataset into grids or sectors and solve each sector on a separate execution unit.
Functional parallelism - divide the 'problem' into different tasks and execute the tasks on different units.
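A concrete data-parallel sketch (illustrative only; the array size, thread count, and helper names such as sum_chunk are made up, and POSIX threads stand in for whatever execution units are available): each thread sums its own contiguous sector of the dataset, and the partial results are combined at the end.

/* Data parallelism sketch: split the dataset into equal chunks and let
 * each POSIX thread work on its own sector independently. */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];

struct chunk { int lo, hi; double partial; };

static void *sum_chunk(void *arg)
{
    struct chunk *c = (struct chunk *)arg;
    c->partial = 0.0;
    for (int i = c->lo; i < c->hi; i++)   /* each thread touches only its own sector */
        c->partial += data[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];

    for (int i = 0; i < N; i++) data[i] = 1.0;   /* dummy data */

    int step = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].lo = t * step;
        chunks[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * step;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].partial;              /* combine partial results */
    }
    printf("total = %f\n", total);
    return 0;
}

Functional parallelism would instead give each execution unit a different job (e.g. one unit filtering while another compresses), rather than the same job on different sectors of the data.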
AMDAHL'S LAW
How many processors can we really use? Let's say we have a legacy code, taking roughly 100 s to run, for which it is only feasible to convert half of the run time (the heavily used routines) to parallel:
AMDAHL'S LAW
If we run this on a parallel machine with five processors, our code now takes about 60 s (50 s serial plus 50/5 s parallel); we have cut the run time by about 40%. If we instead use a thousand processors, we have only sped our code up by about a factor of two.
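In formula form this is Amdahl's law: for a parallel fraction f of the run time and P processors, the speedup is bounded by the serial part. Plugging in f = 0.5 and the 100 s serial run time used in the example reproduces the figures above:

\[ S(P) \;=\; \frac{T_\mathrm{serial}}{T(P)} \;=\; \frac{1}{(1-f) + f/P} \]
\[ T(5) = 50 + \tfrac{50}{5} = 60\ \mathrm{s}, \qquad S(5) \approx 1.7 \]
\[ T(1000) = 50 + \tfrac{50}{1000} \approx 50\ \mathrm{s}, \qquad S(1000) \approx 2 \]
\[ \lim_{P \to \infty} S(P) \;=\; \frac{1}{1-f} \;=\; 2 \]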
AMDAHL'S LAW
This seems pretty depressing, and it does point out one limitation of converting old codes one subroutine at a time. However, most new codes, and almost all parallel algorithms, can be written almost entirely in parallel (usually the start-up or initial input I/O code is the exception), resulting in significant practical speed-ups. This can be quantified by how well a code scales, which is often measured as efficiency.
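Efficiency has a standard definition (the notation below is mine, it is not spelled out on the slide): the achieved speedup divided by the number of processors.

\[ E(P) \;=\; \frac{S(P)}{P} \;=\; \frac{T_\mathrm{serial}}{P \, T(P)}, \qquad 0 < E(P) \le 1 \]

For the five-processor example above, E(5) is roughly 1.7/5, i.e. about 33% efficiency.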
Flynn's Classical Taxonomy
Single Instruction, Single Data (SISD) - your single-core uniprocessor PC.
Single Instruction, Multiple Data (SIMD) - special-purpose, low-granularity multi-processor machine with a single control unit relaying the same instruction to all processors (with different data) every clock cycle.
Multiple Instruction, Single Data (MISD) - pipelining is a major example.
Multiple Instruction, Multiple Data (MIMD) - the most prevalent model. SPMD (Single Program, Multiple Data) is a very useful subset. Note that this is very different from SIMD. Why?
Note that data parallelism vs. control parallelism is another, independent classification.
Shared memory:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-Coherent NUMA (ccNUMA)
Distributed memory - the most prevalent architecture model when the number of processors exceeds about 8.
Hybrid distributed-shared memory
DISTRIBUTED MEMORY ARCH.
Each processor P (with its own local cache C) is connected to exclusive local memory, i.e. no other CPU has direct access to it. Each node comprises at least one network interface (NI) that mediates the connection to a communication network. On each CPU runs a serial process that can communicate with processes on other CPUs by means of the network.
DISTRIBUTED SHARED MEMORY ARCH.: NUMA
Memory is physically distributed but logically shared.
The physical layout is similar to the distributed-memory case, but the aggregated memory of the whole system appears as one single address space. Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (local vs. remote access). Two locality domains are linked through a high-speed connection such as HyperTransport.
Advantage: scalability.
Disadvantages: locality problems and connection congestion.
MESSAGE PASSING INTERFACE (MPI)
Works well with data parallelism. The same program runs on each processor/machine (SPMD, a very useful subset of MIMD); each process is distinguished by its rank.
The program is written in a sequential language (Fortran/C/C++). All variables are local; there is no concept of shared memory. Data is exchanged between processes through send/receive messages via the appropriate library calls.
The MPI system requires information about:
Which processor is sending the message.
Where the data is on the sending processor.
What kind of data is being sent.
How much data there is.
Which processor(s) are receiving the message.
Where the data should be left on the receiving processor.
How much data the receiving processor is prepared to accept.
A minimal send/receive sketch follows this list.
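The sketch below shows how those pieces of information map onto actual MPI calls; the buffer contents, message tag, and ranks are made up for illustration.

/* Minimal MPI send/receive sketch in C. Rank 0 sends an array of 5 doubles
 * to rank 1; the source, destination, buffer, datatype, count, and tag all
 * appear explicitly in the calls. Compile with mpicc, run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double buf[5] = {1.0, 2.0, 3.0, 4.0, 5.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

    if (rank == 0 && size > 1) {
        /* what data, how much, what type, to whom, which tag */
        MPI_Send(buf, 5, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* where to put it, how much we are prepared to accept, from whom */
        MPI_Recv(buf, 5, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %.1f ... %.1f\n", buf[0], buf[4]);
    }

    MPI_Finalize();
    return 0;
}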
SYNCHRONIZATION AND COMMUNICATION
Cost of communication: overheads and waiting due to synchronization.
Latency - the time it takes to send a minimal (0 byte) message from one point to another.
Bandwidth - the amount of data that can be communicated per unit of time.
Visibility of communication: in MPI it is explicit (as in all message-passing models); in data-parallel models it is transparent.
Blocking (wait until the data is transferred) versus non-blocking communication - see the sketch after this list.
Point-to-point versus collective communication.
Synchronization: barriers; consistent access to all information in variables with shared scope; locks/semaphores to prevent data races and deadlocks and to control access to critical regions; synchronous operations.
All of this hits performance!
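A sketch of the blocking vs. non-blocking distinction in MPI terms (illustrative; the message size and the busywork loop are made up). A blocking MPI_Recv would sit idle until the data arrived, whereas MPI_Irecv returns immediately so independent computation can be overlapped, with MPI_Wait enforcing completion before the buffer is used; the barrier at the end is a plain synchronization point.

/* Non-blocking communication with overlap, plus a barrier. Run with mpirun -np 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[1000] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 1000; i++) buf[i] = (double)i;
        MPI_Send(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* blocking send */
    } else if (rank == 1) {
        MPI_Request req;
        MPI_Irecv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        double busywork = 0.0;                 /* overlap: work not touching buf */
        for (int i = 0; i < 1000000; i++) busywork += i * 1e-6;

        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* complete before buf is used */
        printf("rank 1: buf[999] = %f (busywork %f)\n", buf[999], busywork);
    }

    MPI_Barrier(MPI_COMM_WORLD);               /* every process waits here */
    MPI_Finalize();
    return 0;
}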
LOAD BALANCING AND GRANULARITY
Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time. It can be considered a minimization of task idle time. If the data is fairly homogeneous then an equal partition of the work among the different tasks suffices; otherwise a dynamic work-assignment strategy is needed (see the sketch below).
Granularity is the computation-to-communication ratio.
Fine-grain parallelism - relatively small amounts of computation between communication events; needs very active low-level load balancing.
Coarse-grain parallelism - significant work done between communications; also needs load balancing, but at a higher level of granularity.
In general, since communication overheads are significant, coarse-grain parallelism is preferred, though depending on the architecture, hardware, and the actual problem, fine-grain parallelism can be more advantageous as it reduces load imbalances.
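A minimal sketch of the static (equal-partition) strategy, assuming a hypothetical helper named block_range; each worker gets a contiguous range of the n work items, with the remainder spread over the first few workers. This is enough when all items cost about the same; when they do not, a dynamic scheme (for example, a master process handing out chunks on demand) keeps idle time down.

/* Static block partition: worker 'rank' of 'nworkers' gets items [lo, hi). */
static void block_range(int n, int nworkers, int rank, int *lo, int *hi)
{
    int base = n / nworkers;          /* items every worker gets */
    int rem  = n % nworkers;          /* leftover items          */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}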
APPLICATIONS
Example application areas: JPEG, MPEG, convolution, IP processing, MATLAB, SVM, PCA, database hashing, fluid dynamics, linear program solvers, reverse kinematics/spring models (games), spectral clustering, texture maps, FFT, molecular dynamics, FEM, Lattice Boltzmann, smoothing/interpolation (games), weather modelling, belief propagation, expectation maximization, option pricing.
Recurring computational patterns behind such applications include:
Spectral methods
N-body methods
Structured grids
Unstructured grids
Monte Carlo
EXAMPLES
Embarrassingly parallel situation: each data part is independent, so no communication is required between the execution units solving two different parts.
Heat equation: the initial temperature is zero on the boundaries and high in the middle, and the boundary temperature is held at zero. The calculation of an element depends on its neighbouring elements, so neighbouring values must be exchanged when the grid is split across execution units (see the sketch below).
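A serial sketch of the heat-equation update (illustrative; the grid size, diffusion coefficient, and initial profile are made up). The i-1 and i+1 terms are exactly the neighbour dependence that forces halo exchange at sub-domain boundaries once the grid is distributed, unlike the embarrassingly parallel case.

/* 1-D explicit heat equation: each new value depends on its two neighbours. */
#include <stdio.h>

#define N 100

int main(void)
{
    double u[N] = {0.0}, unew[N] = {0.0};
    const double alpha = 0.25;                 /* diffusion coefficient * dt/dx^2 */

    for (int i = N/3; i < 2*N/3; i++) u[i] = 100.0;   /* hot in the middle       */
    /* boundaries u[0] and u[N-1] stay at zero */

    for (int step = 0; step < 1000; step++) {
        for (int i = 1; i < N - 1; i++)        /* neighbour dependence: i-1, i, i+1 */
            unew[i] = u[i] + alpha * (u[i-1] - 2.0*u[i] + u[i+1]);
        for (int i = 1; i < N - 1; i++)
            u[i] = unew[i];
    }
    printf("centre temperature after 1000 steps: %f\n", u[N/2]);
    return 0;
}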
FURTHER RESEARCH
GPUs (Graphics Processing Units) are already in the teraflop range. Increased use of parallelism would help model physical processes in animation better.
1000-core processors are envisaged as feasible when processor technology reaches the 30 nm range (currently around 45 nm).
Parallelism (despite the difficulties in programming) is definitely the way ahead.
Programming models and languages with inherent support for parallelism that are also more human-centric, and that deal with bit-level, instruction-level, task-level, and data-level parallelism.
Auto-tuners as a substitute for existing compilers.
Richer hardware support for maintaining cache coherency in 1000-core machines.
Pipelining, out-of-order execution, and other innovations of the single-core processor will probably be done away with to make room for more cores on a chip.
SUMMARY
Serial computers will probably not get much faster - parallelization is unavoidable.
Pipelining, cache, and other optimization strategies for serial computers
Shared Memory
Distributed Memory
Message Passing Interface
Synchronization and communication
Load balancing and granularity
Applications and examples