4003-532/4005-736 Parallel Computing II
Module 2 Lecture Notes -- Hybrid SMP Cluster Programming
N-Bodies Problem With Visualization
Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science
An N-Body Problem
The Antimatter Simulation program
calculates the motion of a number of antiprotons
moving in a two-dimensional plane.
The antiprotons have equal, negative charges.
Each antiproton experiences a repulsive force
from every other antiproton
that is directly proportional
to the product of the antiprotons' charges
and is inversely proportional
to the square of the distance
between the antiprotons.
The antiprotons are surrounded by an "antiproton trap" --
a square metal cage with sides of length R,
extending from coordinates (0,0) to (R,R)
in the (x,y) plane.
The antiproton trap has a negative charge.
Thus, each antiproton experiences a repulsive force
away from the sides of the trap.
Since the antiprotons are repelled from the sides of the trap,
the antiprotons will never touch the trap
and matter-antimatter annihilation will not happen.
The Antimatter Simulation program
maintains each antiproton's position and velocity.
The program calculates the positions and velocities
as a function of time
by doing a series of discrete time steps.
At each time step,
the program calculates the total force on each antiproton
(repulsive forces from all other antiprotons
plus repulsive forces from the sides of the trap),
updates the velocity based on the force,
and updates the position based on the velocity and force:
V' = V + F Δt
P' = P + V Δt + 1/2 F Δt2
where F is the vector force on the antiproton,
V is the antiproton's vector velocity before the time step,
V' is the antiproton's vector velocity after the time step,
P is the antiproton's vector position before the time step,
P' is the antiproton's vector position after the time step,
and Δt is the size of the time step.
(These formulas represent the first few terms
in the Taylor series expansions
for velocity and position as a function of time.)
Animation
Goals for a Parallel Program
- Shortcomings of the above animation program
- It is a sequential program; too slow for large numbers of antiprotons
- Visualization done "online" using a Java Swing UI; cannot examine the visualization "offline"
- Goals
- Calculate antiproton positions in parallel
- Render frames of the visualization in parallel
- Record visualization in a file or files for offline viewing
- Engineering Design Issue 1: Visualization file format
- Multiple-image Network Graphics (MNG) ("ming") file format
- Like PNG, but can store multiple images in one file
- A MNG player can then display the images in sequence to show an animation
- Simpler than MPEG; no need for audio
- For further information: http://www.libpng.org/pub/mng/
- Engineering Design Issue 2: How to partition the program into parallel processes and threads
How Long Does It Take to Compute Antiproton Positions?
How Long Does It Take to Send a PJ Message?
How Long Does It Take to Send a PJ Message With Type Double?
How Long Does It Take to Compute the Visualization?
- Package edu.rit.hyb.antimatter
- Rendering test program
- Create N random antiproton positions
- Render them into a WxW-pixel indexed-color buffered image
- Each antiproton is a 2-pixel diameter antialiased red circle, black background
- Store the buffered image in a PNG file
- Repeat a given number of times
- Example of one 800x800-pixel frame with N = 1400 antiprotons
- Rendering program's running time on the tardis.cs.rit.edu hybrid SMP cluster parallel computer
- 800x800-pixel frames
- 1000 frames
- PNG files stored on local hard disk, not on network file system
- Various numbers of particles N
- Three program runs for each N, full program
- Three program runs for each N, rendering time only (omit writing PNG files)
java edu.rit.hyb.antimatter.RenderSeq 142857 $N 800 1000 /var/tmp/ark/test
N T (msec), full T (msec), rendering only
---- --------------------- ------------------------
1000 61910 62563 62096 23028 22916 23144
1400 66944 67767 67786 26101 26635 26578
2000 75392 75882 75778 30835 31278 31635
2800 84413 85727 84864 39769 39044 39348
- Linear regression fit to the model T = A + B N (sec), for 1000 frames
- Full program: T = 4.97E1 + 1.27E-2 N (sec), correlation = 0.998
- Rendering: T = 1.37E1 + 9.04E-3 N (sec), correlation = 0.997
- File writing: T = 3.60E1 + 3.66E-3 N (sec) -- difference of the previous two
- Assume the model for one frame is 1/1000 times the model for 1000 frames
- (Rather than do a multivariate regression on N and frames)
- Full program: T = 4.97E-2 + 1.27E-5 N (sec)
- Rendering: T = 1.37E-2 + 9.04E-6 N (sec)
- File writing: T = 3.60E-2 + 3.66E-6 N (sec)
- File size for one frame (bytes)
- Stored as a rendered PNG image (average of 1000 files), versus . . .
- Stored as raw antiproton (x,y) positions in binary using java.io.DataOutput
N PNG Binary
---- ----- ------
1000 18608 16000
1400 23411 22400
2000 30291 32000
2800 39159 44800
Parallel Program Partitioning
- Design proposal
- Use K processors to compute the antiproton positions
- Each processor has the full position array
- Each processor computes one slice of the position array
- At the end of each time step, all-gather the slices
- Use one additional processor to compute the visualization
- Take a snapshot of the position array every S time steps
- This requires gathering the slices into the visualization processor
- Store each frame as a separate PNG file, afterwards combine them into one MNG file
- (I haven't found any Java code that can create MNG files directly)
- Running time for one time step (sec)
- Computation = (1.26E-4 + 2.47E-8 N2) / K
- All-gather = (2.11E-4 + 7.04E-8 (2 N / K)) * ceil (log2 K)
- Running time for one frame (sec)
- Gather = (2.11E-4 + 7.04E-8 (2 N / K)) * K
- Render and write file = 4.97E-2 + 1.27E-5 N
- Suppose N = 2000, K = 1
- One time step running time = 9.89E-2 sec
- One frame running time = 7.56E-2 sec
- Suppose N = 2000, K = 2
- One time step running time = 4.98E-2 sec
- One frame running time = 7.58E-2 sec
- Consequences
- As K increases, the time step running time decreases as 1/K
- As K increases, the frame running time increases slightly
- Therefore, we cannot render one frame per time step
- We must render one frame per many time steps (S >> K)
- Otherwise, the position processors will be spending all their time waiting for the visualization processor
- We want S >> K anyway, because typically Δt is very small and the positions do not change much at each time step
- Parallelism in the position processors
- One degree of parallelism from multiple processors (Kp)
- Another degree of parallelism from multiple threads within each processor (Kt)
- K = Kp * Kt
- Parallelism in the visualization processor
- We could get a speedup with multiple threads in the visualization processor
- E.g., overlap file write of previous frame with rendering of next frame
- However, if S >> K, the visualization processor is not the bottleneck
- Therefore, it doesn't make sense to spend the effort to parallelize the visualization processor code
Parallel Code
- Package edu.rit.hyb.antimatter
- Example
- Random seed = 142857
- Number of antiprotons N = 1000
- Trap size R = 10
- Number of visualization frames = 100
- Number of time steps per frame = 1000
- Time step size dt = 1x10-6
- Image size = 800x800 pixels
- Image file names = "/var/tmp/ark/test8_0000.png", "/var/tmp/ark/test8_0001.png", etc.
- Image files stored on backend processor's local hard disk
java -Dpj.np=$KP -Dpj.nt=$KT edu.rit.hyb.antimatter.AntiprotonHyb \
142857 1000 10 100 1000 1e-6 800 /var/tmp/ark/test8
- The PNG files are combined into one MNG file: test8.mng (1,490,030 bytes)
- Running times for the above example on the Tardis hybrid SMP cluster parallel computer
- K = (Kp - 1) * Kt = Antiproton position calculation parallelism
Due to Kp
Kp Kt K T (msec) Spdup Effi.
-----------------------------------
2 1 1 2342062 1.000 1.000
3 1 2 1187275 1.973 0.986
4 1 3 828839 2.826 0.942
5 1 4 648151 3.613 0.903
6 1 5 550160 4.257 0.851
7 1 6 487794 4.801 0.800
8 1 7 447627 5.232 0.747
9 1 8 422701 5.541 0.693
10 1 9 408864 5.728 0.636
Due to Kp+Kt Due to Kt
Kp Kt K T (msec) Spdup Effi. Spdup Effi.
-------------------------------------------------
2 4 4 594994 3.936 0.984 3.936 0.984
3 4 8 328938 7.120 0.890 3.609 0.902
4 4 12 257282 9.103 0.759 3.222 0.805
5 4 16 227412 10.299 0.644 2.850 0.713
6 4 20 214460 10.921 0.546 2.565 0.641
7 4 24 209890 11.159 0.465 2.324 0.581
8 4 28 209201 11.195 0.400 2.140 0.535
9 4 32 215415 10.872 0.340 1.962 0.491
10 4 36 227923 10.276 0.285 1.794 0.448
|
Parallel Computing II
|
|
•
|
|
4003-532-70/4005-736-70
|
|
•
|
|
Spring Quarter 2007
|
|
Course Page
|
|
Alan Kaminsky
|
|
•
|
|
Department of Computer Science
|
|
•
|
|
Rochester Institute of Technology
|
|
•
|
|
4486 +
2220 =
6706
|
|
Home Page
|
Copyright © 2007 Alan Kaminsky.
All rights reserved.
Last updated 08-Apr-2007.
Please send comments to ark@cs.rit.edu.
|