
4003-532/4005-736 Parallel Computing II
Module 2 Lecture Notes -- Hybrid SMP Cluster Programming
N-Bodies Problem With Visualization

Prof. Alan Kaminsky
Rochester Institute of Technology -- Department of Computer Science


An N-Body Problem

The Antimatter Simulation program calculates the motion of a number of antiprotons moving in a two-dimensional plane. The antiprotons have equal, negative charges. Each antiproton experiences a repulsive force from every other antiproton that is directly proportional to the product of the antiprotons' charges and is inversely proportional to the square of the distance between the antiprotons. The antiprotons are surrounded by an "antiproton trap" -- a square metal cage with sides of length R, extending from coordinates (0,0) to (R,R) in the (x,y) plane. The antiproton trap has a negative charge. Thus, each antiproton experiences a repulsive force away from the sides of the trap. Since the antiprotons are repelled from the sides of the trap, the antiprotons will never touch the trap and matter-antimatter annihilation will not happen.

The Antimatter Simulation program maintains each antiproton's position and velocity. The program calculates the positions and velocities as a function of time by doing a series of discrete time steps. At each time step, the program calculates the total force on each antiproton (repulsive forces from all other antiprotons plus repulsive forces from the sides of the trap), updates the velocity based on the force, and updates the position based on the velocity and force:
 
V' = V + F Δt
 
P' = P + V Δt + 1/2 F Δt²
 
where F is the vector force on the antiproton, V is the antiproton's vector velocity before the time step, V' is the antiproton's vector velocity after the time step, P is the antiproton's vector position before the time step, P' is the antiproton's vector position after the time step, and Δt is the size of the time step. (These formulas represent the first few terms in the Taylor series expansions for velocity and position as a function of time.)
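
In code, one time step of this calculation looks roughly like the following sketch. This is plain Java, not the actual Antimatter Simulation source; the constant qp (standing for the charge product times Coulomb's constant) and the assumption of unit mass are choices of the sketch, and the repulsive forces from the trap sides are omitted.

    // One time step of the N-body calculation -- a minimal sketch, not the
    // actual Antimatter Simulation code. Arrays x, y, vx, vy hold each
    // antiproton's position and velocity. qp is an assumed constant standing
    // for the charge product times Coulomb's constant; unit mass is assumed,
    // so force and acceleration coincide; forces from the trap sides are omitted.
    public class TimeStepSketch {
        static void timeStep(double[] x, double[] y, double[] vx, double[] vy,
                             double qp, double dt) {
            int n = x.length;
            double[] fx = new double[n];
            double[] fy = new double[n];

            // Total repulsive force on antiproton i from every other antiproton j:
            // magnitude qp / d^2, directed from j toward i.
            for (int i = 0; i < n; ++i) {
                for (int j = 0; j < n; ++j) {
                    if (i == j) continue;
                    double dx = x[i] - x[j];
                    double dy = y[i] - y[j];
                    double d = Math.sqrt(dx*dx + dy*dy);
                    double f = qp / (d*d);
                    fx[i] += f * dx / d;
                    fy[i] += f * dy / d;
                }
            }

            // P' = P + V dt + 1/2 F dt^2, then V' = V + F dt.
            for (int i = 0; i < n; ++i) {
                x[i] += vx[i]*dt + 0.5*fx[i]*dt*dt;
                y[i] += vy[i]*dt + 0.5*fy[i]*dt*dt;
                vx[i] += fx[i]*dt;
                vy[i] += fy[i]*dt;
            }
        }
    }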


Animation


Goals for a Parallel Program

  • Shortcomings of the above animation program
    • It is a sequential program; too slow for large numbers of antiprotons
    • Visualization done "online" using a Java Swing UI; cannot examine the visualization "offline"
       
  • Goals
    • Calculate antiproton positions in parallel
    • Render frames of the visualization in parallel
    • Record visualization in a file or files for offline viewing
       
  • Engineering Design Issue 1: Visualization file format
    • Multiple-image Network Graphics (MNG, pronounced "ming") file format
    • Like PNG, but can store multiple images in one file
    • A MNG player can then display the images in sequence to show an animation
    • Simpler than MPEG; no need for audio
    • For further information: http://www.libpng.org/pub/mng/
       
  • Engineering Design Issue 2: How to partition the program into parallel processes and threads
    • Timing data needed


How Long Does It Take to Compute Antiproton Positions?

  • Package edu.rit.vector
  • Package edu.rit.clu.antimatter
  • Sequential program's running time on the tardis.cs.rit.edu hybrid SMP cluster parallel computer
    • Δt = 0.00001
    • 1000 time steps
    • Various numbers of particles N
    • Three program runs for each N
    java edu.rit.clu.antimatter.AntiprotonSeq 142857 10 0.00001 1000 $N
    
    N      T (msec), three runs per N
    ----   ------------------------
    1000    24568    24579    24588
    1400    48607    48557    48560
    2000    99232    99219    99180
    2800   193493   193594   193503
    
  • Linear regression fit to the model T = A + B N² (sec), for 1000 time steps
    • A = 1.26E-1
    • B = 2.47E-5
    • Correlation = 1.000
       
  • Assume the model for one time step is 1/1000 times the model for 1000 time steps
    • (Rather than do a multivariate regression on N and the number of time steps)
    • T = 1.26E-4 + 2.47E-8 N² (sec)
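
  • For illustration, the fit can be reproduced with an ordinary least-squares regression of T against N² -- a minimal sketch in plain Java using the measured times in the table above (class and variable names are made up):

    // Fit T = A + B*N^2 to the sequential running times above
    // (averaging the three runs for each N, converted to seconds).
    public class FitQuadratic {
        public static void main(String[] args) {
            double[] n = {1000, 1400, 2000, 2800};
            double[] t = {
                (24568 + 24579 + 24588) / 3.0 / 1000.0,
                (48607 + 48557 + 48560) / 3.0 / 1000.0,
                (99232 + 99219 + 99180) / 3.0 / 1000.0,
                (193493 + 193594 + 193503) / 3.0 / 1000.0};

            // Ordinary linear regression of t on x = N^2.
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            int m = n.length;
            for (int i = 0; i < m; ++i) {
                double x = n[i] * n[i];
                sx += x;  sy += t[i];  sxx += x*x;  sxy += x*t[i];
            }
            double b = (m*sxy - sx*sy) / (m*sxx - sx*sx);
            double a = (sy - b*sx) / m;
            System.out.printf("T = %.2e + %.2e N^2 (sec)%n", a, b);
            // Prints approximately T = 1.3e-01 + 2.5e-05 N^2.
        }
    }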


How Long Does It Take to Send a PJ Message?

  • Package edu.rit.clu.timing
  • Results for various message sizes N on the tardis.cs.rit.edu hybrid SMP cluster computer
    • 10000 repetitions for each message size
    N (bytes)   T1 (sec)   T2 (sec)   T3 (sec)
    ---------   --------   --------   --------
    100         9.27E-5    1.13E-4    1.27E-4
    200         1.25E-4    1.22E-4    1.35E-4
    500         1.38E-4    1.34E-4    1.35E-4
    1000        1.45E-4    1.48E-4    1.54E-4
    2000        1.50E-4    1.65E-4    1.63E-4
    5000        2.10E-4    2.27E-4    2.15E-4
    10000       2.64E-4    2.64E-4    2.64E-4
    20000       3.79E-4    3.77E-4    3.77E-4
    50000       7.06E-4    6.85E-4    6.93E-4
    100000      1.14E-3    1.15E-3    1.15E-3
    200000      2.09E-3    2.09E-3    2.09E-3
    500000      4.88E-3    4.93E-3    4.94E-3
    1000000     9.52E-3    9.57E-3    9.52E-3
    


     
  • Sending a message consists of two parts:
    • Overhead portion, O(1) -- startup processing, transmitting fixed message header over the network, etc.
    • Data portion, O(N) -- copying data bytes between buffers, transmitting data bytes over the network, etc.
       
  • Therefore, the time to send a message should be given by T = A + B N
     
  • A linear regression on the above data for 10000 <= N <= 1000000 gives:
    T = 2.07E-4 + 9.35E-9 N (sec)
    Correlation = 1.000
     
  • This has vast implications for cluster parallel program design
     
  • Lesson 1: Your program must have much more computation than communication
    • With today's CPUs and networks, you can do a lot of computation in the time it takes to send one message
    • If you don't have lots of computation in between each communication, your running times will be mostly communication, and you will see little or no parallel speedups
    • Best if the asymptotic complexity of the computation is greater than the asymptotic complexity of the communication
      • E.g. O(N³) computation vs. O(N²) communication
      • Then by scaling up N, eventually computation will dominate communication
    • Be careful if the asymptotic complexity of the computation equals the asymptotic complexity of the communication
      • E.g. O(N) computation vs. O(N) communication
      • Then each computation needs to take much longer than each communication
         
  • Lesson 2: You want big messages with lots of data
    • This is because in the message time formula T = A + B N, A is typically several orders of magnitude bigger than B
    • If N is small, most of the message time will be spent on overhead rather than useful data transmission
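
  • As a quick check of Lesson 2, the sketch below (plain Java, using the fitted message-time model above) compares sending 100,000 bytes as one big message versus 1000 small messages:

    // Compare one big message against many small messages using the fitted
    // model T = A + B*N for one message (a model estimate, not a measurement).
    public class MessageCost {
        static final double A = 2.07e-4;   // per-message overhead (sec)
        static final double B = 9.35e-9;   // per-byte cost (sec)

        static double messageTime(long nBytes) {
            return A + B * nBytes;
        }

        public static void main(String[] args) {
            double oneBig = messageTime(100000);           // one 100000-byte message
            double manySmall = 1000 * messageTime(100);    // 1000 100-byte messages
            System.out.printf("One big message:     %.2e sec%n", oneBig);    // ~1.1e-3
            System.out.printf("1000 small messages: %.2e sec%n", manySmall); // ~2.1e-1
            // The same data takes almost 200 times longer in small pieces,
            // because the overhead A dominates every small message.
        }
    }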


How Long Does It Take to Send a PJ Message With Type Double?

  • Package edu.rit.clu.timing
  • Results for various message sizes N (measured in doubles) on the tardis.cs.rit.edu hybrid SMP cluster parallel computer
    • 10000 repetitions for each message size
    N (doubles)   T1 (sec)   T2 (sec)   T3 (sec)
    -----------   --------   --------   --------
    100            1.34E-4    1.39E-4    1.37E-4
    200            1.60E-4    1.85E-4    1.74E-4
    500            1.99E-4    1.98E-4    1.99E-4
    1000           2.41E-4    2.50E-4    2.48E-4
    2000           3.43E-4    3.43E-4    3.41E-4
    5000           5.81E-4    5.85E-4    5.77E-4
    10000          9.09E-4    9.11E-4    9.03E-4
    20000          1.61E-3    1.61E-3    1.61E-3
    50000          3.75E-3    3.75E-3    3.75E-3
    100000         7.24E-3    7.26E-3    7.25E-3
    


     
  • A linear regression on the above data for 2000 <= N <= 100000 gives:
    T = 2.11E-4 + 7.04E-8 N (sec)
    Correlation = 1.000
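
  • Dividing the per-double coefficient by 8 bytes per double gives a per-byte cost close to the one measured for byte messages -- a quick arithmetic check in plain Java (class name made up):

    // Relate the double-message fit to the byte-message fit.
    public class CompareFits {
        public static void main(String[] args) {
            double secPerDouble = 7.04e-8;                   // from the fit above
            double secPerByteViaDoubles = secPerDouble / 8;  // 8 bytes per double
            double secPerByte = 9.35e-9;                     // from the byte-message fit
            System.out.printf("per byte, via doubles:   %.2e sec%n", secPerByteViaDoubles); // ~8.8e-9
            System.out.printf("per byte, byte messages: %.2e sec%n", secPerByte);
            // Both are roughly 9 nsec per byte, and the per-message overheads
            // (2.11E-4 vs. 2.07E-4 sec) are nearly the same as well.
        }
    }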
     


How Long Does It Take to Compute the Visualization?

  • Package edu.rit.hyb.antimatter
  • Rendering test program (a minimal rendering sketch appears after this list)
    • Create N random antiproton positions
    • Render them into a WxW-pixel indexed-color buffered image
    • Each antiproton is a 2-pixel diameter antialiased red circle, black background
    • Store the buffered image in a PNG file
    • Repeat a given number of times
       
  • Example of one 800x800-pixel frame with N = 1400 antiprotons
     

     
  • Rendering program's running time on the tardis.cs.rit.edu hybrid SMP cluster parallel computer
    • 800x800-pixel frames
    • 1000 frames
    • PNG files stored on local hard disk, not on network file system
    • Various numbers of particles N
    • Three program runs for each N, full program
    • Three program runs for each N, rendering time only (omit writing PNG files)
    java edu.rit.hyb.antimatter.RenderSeq 142857 $N 800 1000 /var/tmp/ark/test
    
    N      T (msec), full          T (msec), rendering only
    ----   ---------------------   ------------------------
    1000   61910   62563   62096   23028   22916   23144
    1400   66944   67767   67786   26101   26635   26578
    2000   75392   75882   75778   30835   31278   31635
    2800   84413   85727   84864   39769   39044   39348
    
  • Linear regression fit to the model T = A + B N (sec), for 1000 frames
    • Full program: T = 4.97E1 + 1.27E-2 N (sec), correlation = 0.998
    • Rendering: T = 1.37E1 + 9.04E-3 N (sec), correlation = 0.997
    • File writing: T = 3.60E1 + 3.66E-3 N (sec) -- difference of the previous two
       
  • Assume the model for one frame is 1/1000 times the model for 1000 frames
    • (Rather than do a multivariate regression on N and frames)
    • Full program: T = 4.97E-2 + 1.27E-5 N (sec)
    • Rendering: T = 1.37E-2 + 9.04E-6 N (sec)
    • File writing: T = 3.60E-2 + 3.66E-6 N (sec)
       
  • File size for one frame (bytes)
    • Stored as a rendered PNG image (average of 1000 files), versus . . .
    • Stored as raw antiproton (x,y) positions in binary using java.io.DataOutput
    N      PNG     Binary
    ----   -----   ------
    1000   18608    16000
    1400   23411    22400
    2000   30291    32000
    2800   39159    44800
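
  • A minimal rendering sketch in plain Java (not the actual rendering test program; it uses the standard java.awt and javax.imageio APIs, draws into an RGB image rather than an indexed-color image, uses java.util.Random rather than the program's own PRNG, and leaves out the timing and repetition logic):

    // Render N antiproton positions into a WxW-pixel image and write it
    // as a PNG file -- a sketch only.
    import java.awt.Color;
    import java.awt.Graphics2D;
    import java.awt.RenderingHints;
    import java.awt.geom.Ellipse2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public class RenderSketch {
        // x[i], y[i] are antiproton positions in trap coordinates 0..R.
        static void renderFrame(double[] x, double[] y, double r,
                                int w, File pngFile) throws IOException {
            BufferedImage image = new BufferedImage(w, w, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = image.createGraphics();
            g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                               RenderingHints.VALUE_ANTIALIAS_ON);
            g.setColor(Color.BLACK);                    // black background
            g.fillRect(0, 0, w, w);
            g.setColor(Color.RED);                      // red antiprotons
            for (int i = 0; i < x.length; ++i) {
                double px = x[i] / r * w;               // scale trap coords to pixels
                double py = y[i] / r * w;
                g.fill(new Ellipse2D.Double(px - 1, py - 1, 2, 2));  // 2-pixel circle
            }
            g.dispose();
            ImageIO.write(image, "PNG", pngFile);
        }

        // Render one 800x800-pixel frame of N = 1400 random positions.
        public static void main(String[] args) throws IOException {
            int n = 1400, w = 800;
            double r = 10.0;
            double[] x = new double[n], y = new double[n];
            java.util.Random prng = new java.util.Random(142857);
            for (int i = 0; i < n; ++i) {
                x[i] = prng.nextDouble() * r;
                y[i] = prng.nextDouble() * r;
            }
            renderFrame(x, y, r, w, new File("test.png"));
        }
    }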
    


Parallel Program Partitioning

  • Design proposal
    • Use K processors to compute the antiproton positions
      • Each processor has the full position array
      • Each processor computes one slice of the position array
      • At the end of each time step, all-gather the slices
    • Use one additional processor to compute the visualization
      • Take a snapshot of the position array every S time steps
      • This requires gathering the slices into the visualization processor
      • Store each frame as a separate PNG file, afterwards combine them into one MNG file
      • (I haven't found any Java code that can create MNG files directly)
         
  • Running time for one time step (sec) -- these formulas are evaluated in the sketch after this list
    • Computation = (1.26E-4 + 2.47E-8 N²) / K
    • All-gather = (2.11E-4 + 7.04E-8 (2 N / K)) * ceil(log₂ K)
       
  • Running time for one frame (sec)
    • Gather = (2.11E-4 + 7.04E-8 (2 N / K)) * K
    • Render and write file = 4.97E-2 + 1.27E-5 N
       
  • Suppose N = 2000, K = 1
    • One time step running time = 9.89E-2 sec
    • One frame running time = 7.56E-2 sec
       
  • Suppose N = 2000, K = 2
    • One time step running time = 4.98E-2 sec
    • One frame running time = 7.58E-2 sec
       
  • Consequences
    • As K increases, the time step running time decreases as 1/K
    • As K increases, the frame running time increases slightly
    • Therefore, we cannot render one frame per time step
    • We must render one frame per many time steps (S >> K)
    • Otherwise, the position processors will be spending all their time waiting for the visualization processor
    • We want S >> K anyway, because typically Δt is very small and the positions do not change much at each time step
       
  • Parallelism in the position processors
    • One degree of parallelism from multiple processors (Kp)
    • Another degree of parallelism from multiple threads within each processor (Kt)
    • K = Kp * Kt
       
  • Parallelism in the visualization processor
    • We could get a speedup with multiple threads in the visualization processor
    • E.g., overlap file write of previous frame with rendering of next frame
    • However, if S >> K, the visualization processor is not the bottleneck
    • Therefore, it doesn't make sense to spend the effort to parallelize the visualization processor code
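
  • The sketch below (plain Java, class name made up) evaluates the timing model above for N = 2000 and various K, reproducing the numbers quoted earlier:

    // Evaluate the partitioning timing model for given N and K.
    public class PartitionModel {
        // Predicted running time of one time step (sec).
        static double timeStep(int n, int k) {
            double compute = (1.26e-4 + 2.47e-8 * (double) n * n) / k;
            double msg = 2.11e-4 + 7.04e-8 * (2.0 * n / k);   // one all-gather message
            return compute + msg * ceilLog2(k);
        }

        // Predicted running time of one frame (sec).
        static double frame(int n, int k) {
            double msg = 2.11e-4 + 7.04e-8 * (2.0 * n / k);
            double gather = msg * k;
            double renderAndWrite = 4.97e-2 + 1.27e-5 * n;
            return gather + renderAndWrite;
        }

        // Smallest integer >= log2(k), computed exactly with integers.
        static int ceilLog2(int k) {
            int log = 0;
            for (int p = 1; p < k; p <<= 1) ++log;
            return log;
        }

        public static void main(String[] args) {
            for (int k : new int[] {1, 2, 4, 8}) {
                System.out.printf("N=2000, K=%d: time step %.2e sec, frame %.2e sec%n",
                                  k, timeStep(2000, k), frame(2000, k));
            }
            // For K = 1 and K = 2 this prints the 9.89E-2 / 7.56E-2 and
            // 4.98E-2 / 7.58E-2 values quoted above.
        }
    }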


Parallel Code

  • Package edu.rit.hyb.antimatter
  • Example
    • Random seed = 142857
    • Number of antiprotons N = 1000
    • Trap size R = 10
    • Number of visualization frames = 100
    • Number of time steps per frame = 1000
    • Time step size Δt = 1×10⁻⁶
    • Image size = 800x800 pixels
    • Image file names = "/var/tmp/ark/test8_0000.png", "/var/tmp/ark/test8_0001.png", etc.
    • Image files stored on backend processor's local hard disk
    java -Dpj.np=$KP -Dpj.nt=$KT edu.rit.hyb.antimatter.AntiprotonHyb \
    142857 1000 10 100 1000 1e-6 800 /var/tmp/ark/test8
    
  • The PNG files are combined into one MNG file: test8.mng (1,490,030 bytes)
     
  • Running times for the above example on the Tardis hybrid SMP cluster parallel computer
    • K = (Kp - 1) * Kt = antiproton position calculation parallelism (one of the Kp processes is the visualization process)
                            Due to Kp
    Kp  Kt   K  T (msec)   Spdup  Effi.
    -----------------------------------
     2   1   1   2342062   1.000  1.000
     3   1   2   1187275   1.973  0.986
     4   1   3    828839   2.826  0.942
     5   1   4    648151   3.613  0.903
     6   1   5    550160   4.257  0.851
     7   1   6    487794   4.801  0.800
     8   1   7    447627   5.232  0.747
     9   1   8    422701   5.541  0.693
    10   1   9    408864   5.728  0.636
    
                           Due to Kp+Kt   Due to Kt
    Kp  Kt   K  T (msec)   Spdup  Effi.  Spdup  Effi.
    -------------------------------------------------
     2   4   4    594994   3.936  0.984  3.936  0.984
     3   4   8    328938   7.120  0.890  3.609  0.902
     4   4  12    257282   9.103  0.759  3.222  0.805
     5   4  16    227412  10.299  0.644  2.850  0.713
     6   4  20    214460  10.921  0.546  2.565  0.641
     7   4  24    209890  11.159  0.465  2.324  0.581
     8   4  28    209201  11.195  0.400  2.140  0.535
     9   4  32    215415  10.872  0.340  1.962  0.491
    10   4  36    227923  10.276  0.285  1.794  0.448
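
  • The speedups and efficiencies above can be recomputed directly from the running times -- a sketch in plain Java; the "due to Kt" columns appear to use the Kt = 1 run with the same Kp as their baseline, which matches the table:

    // Compute speedup and efficiency from measured running times,
    // using the Kp=3, Kt=4 (K=8) row as an example.
    public class SpeedupCalc {
        public static void main(String[] args) {
            double tSeq = 2342062;          // Kp=2, Kt=1 (K=1) running time (msec)
            double tSameKpKt1 = 1187275;    // Kp=3, Kt=1 running time (msec)
            double t = 328938;              // Kp=3, Kt=4 (K=8) running time (msec)
            int k = 8, kt = 4;

            double speedup = tSeq / t;             // due to Kp+Kt
            double efficiency = speedup / k;
            double speedupKt = tSameKpKt1 / t;     // due to Kt alone
            double efficiencyKt = speedupKt / kt;

            System.out.printf("Due to Kp+Kt: speedup %.3f, efficiency %.3f%n",
                              speedup, efficiency);
            System.out.printf("Due to Kt:    speedup %.3f, efficiency %.3f%n",
                              speedupKt, efficiencyKt);
            // Prints 7.120 / 0.890 and 3.609 / 0.902, matching the Kp=3, Kt=4 row.
        }
    }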
    

Copyright © 2007 Alan Kaminsky. All rights reserved. Last updated 08-Apr-2007. Please send comments to ark@cs.rit.edu.