
Parallel Java
An API for Developing Parallel Programs in 100% Java
Lecture Notes

Prof. Alan Kaminsky
Department of Computer Science
Rochester Institute of Technology

Presented to the RIT Research Computing Group
November 9, 2006


Overview


Modern Parallel Computing

  • Parallel computer architectures:
    • Shared memory multiprocessor (SMP) parallel computers
      • Even consumer desktop machines are starting to look like this
    • Cluster parallel computers
      • Soon you won't even be able to build these any more, since every node
        will itself be an SMP machine
    • Hybrid SMP cluster parallel computers
      • Soon all clusters will look like this

  • The RIT CS Department's parallel computers:
    • SMP parallel computers
      • paradise.cs.rit.edu -- Four Sun UltraSPARC-IV dual-core CPUs, eight processors, 1.35 GHz clock, 16 GB main memory
      • parasite.cs.rit.edu -- Four Sun UltraSPARC-IV dual-core CPUs, eight processors, 1.35 GHz clock, 16 GB main memory
      • paradox.cs.rit.edu -- Four Sun UltraSPARC-II CPUs, four processors, 450 MHz clock, 4 GB main memory
      • paragon.cs.rit.edu -- Four Sun UltraSPARC-II CPUs, four processors, 450 MHz clock, 4 GB main memory
    • Cluster parallel computer
      • Frontend computer -- paranoia.cs.rit.edu -- UltraSPARC-II CPU, 296 MHz clock, 192 MB main memory
      • 32 backend computers -- thug01 through thug32 -- each an UltraSPARC-IIi CPU, 440 MHz clock, 256 MB main memory
      • 100-Mbps switched Ethernet backend interconnection network
      • Aggregate 14 GHz clock, 8 GB main memory
    • Hybrid SMP cluster parallel computer
      • Under construction
      • 10 backend computers, each a 4-CPU SMP machine -- 40 CPUs
      • 1-Gbps switched Ethernet backend interconnection network

  • Standard middleware for SMP parallel programming: OpenMP
  • Standard middleware for cluster parallel programming: Message Passing Interface (MPI)


SMP Parallel Programming in Java with OpenMP

  • Monte Carlo technique for computing an approximate value of pi


       
    • The area of the unit square is 1
    • The area of the circle quadrant is pi/4
    • Generate a large number of points at random in the unit square
    • Count how many of them fall within the circle quadrant, i.e. distance from origin <= 1
    • The fraction of the points within the circle quadrant gives an approximation for pi/4
    • 4 x this fraction gives an approximation for pi
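
  • A minimal stand-alone sketch of the technique just described, assuming nothing
    beyond java.util.Random (the programs linked below live in package
    edu.rit.openmp.monte and use their own pseudorandom number generator; the class
    and variable names here are illustrative only, not the actual source):

        // Monte Carlo estimate of pi: throw N random points into the unit
        // square and count how many land inside the circle quadrant.
        // Usage: java PiSeqSketch <seed> <N>
        import java.util.Random;

        public class PiSeqSketch
            {
            public static void main (String[] args)
                {
                long seed = Long.parseLong (args[0]);   // e.g. 285714
                long N = Long.parseLong (args[1]);      // number of points
                Random prng = new Random (seed);
                long count = 0;
                for (long i = 0; i < N; ++ i)
                    {
                    double x = prng.nextDouble();
                    double y = prng.nextDouble();
                    if (x*x + y*y <= 1.0) ++ count;     // inside the quadrant?
                    }
                System.out.println ("pi =~ " + (4.0 * count / N));
                }
            }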
       
  • Sequential program
    • Package edu.rit.openmp.monte
  • Java version of OpenMP: JOMP
  • JOMP parallel program
    • Package edu.rit.openmp.monte
  • Process for using OpenMP: Precompiler
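
  • For comparison, the JOMP route expresses the parallel loop with an OpenMP-style
    directive written as a Java comment, which the JOMP precompiler expands into
    threaded code. The sketch below follows that pattern; the directive spelling is
    from memory and should be checked against the JOMP documentation. Since the
    directive is only a comment, the file also compiles and runs as ordinary
    sequential Java:

        // JOMP-style parallelization sketch (directive syntax approximate).
        // A real parallel version would also need one PRNG per thread rather
        // than a single shared java.util.Random, which is one reason the
        // actual programs are more involved than this sketch.
        import java.util.Random;

        public class PiJompSketch
            {
            public static void main (String[] args)
                {
                long seed = Long.parseLong (args[0]);
                int N = Integer.parseInt (args[1]);
                Random prng = new Random (seed);
                long count = 0;
                //omp parallel for reduction(+:count)
                for (int i = 0; i < N; ++ i)
                    {
                    double x = prng.nextDouble();
                    double y = prng.nextDouble();
                    if (x*x + y*y <= 1.0) ++ count;
                    }
                System.out.println ("pi =~ " + (4.0 * count / N));
                }
            }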
     


Criticisms

  • OpenMP and MPI were designed for use with Fortran and C, not Java
     
  • OpenMP and MPI are not object oriented
     
  • Existing forays into parallel programming middleware in Java leave much to be desired
     
  • There are a couple of Java versions of MPI, but each one is just a thin veneer on top of the non-object-oriented MPI API and does not feel like a natural Java API
  • JOMP mimics OpenMP, but the precompiler directive approach feels unnatural to Java programmers
     
  • JOMP is alpha-quality software: it is buggy, no longer maintained, and its source code has not been released
     
  • JOMP and mpiJava do not play well together
     
  • MPI and threads do not play well together
     
  • Parallel programs using MPI, when run on a hybrid SMP cluster, do not take full advantage of the SMP machines' capabilities
    • You want to use threading within each SMP machine and message passing between SMP machines
    • But MPI programs use message passing for everything, even between processors of the same SMP machine -- a performance penalty
       
  • I am not aware of any middleware standard that encompasses SMP, cluster, and hybrid SMP cluster parallel programming


Parallel Java (PJ)


SMP Parallel Programming with PJ

  • Monte Carlo calculation of pi -- sequential program
  • Monte Carlo calculation of pi -- PJ parallel program (a rough sketch of the PJ loop structure appears below)
  • Demonstrations
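
  • A rough, from-memory sketch of the loop structure a PJ SMP program uses
    (ParallelTeam / ParallelRegion / LongForLoop in package edu.rit.pj). The real
    edu.rit.smp.monte.PiSmp2 differs -- for example it uses PJ's own per-thread
    PRNG and shared reduction variables -- so treat the class names and hooks
    below as an approximation and consult the PJ documentation; here
    java.util.Random and AtomicLong stand in:

        import edu.rit.pj.LongForLoop;
        import edu.rit.pj.ParallelRegion;
        import edu.rit.pj.ParallelTeam;
        import java.util.Random;
        import java.util.concurrent.atomic.AtomicLong;

        public class PiSmpSketch
            {
            public static void main (String[] args) throws Exception
                {
                final long seed = Long.parseLong (args[0]);
                final long N = Long.parseLong (args[1]);
                final AtomicLong count = new AtomicLong (0);

                // One team of threads; the thread count comes from the pj.nt
                // property (hence the -Dpj.nt=$K in the commands below).
                new ParallelTeam().execute (new ParallelRegion()
                    {
                    public void run() throws Exception
                        {
                        // The index range 0..N-1 is divided among the threads.
                        execute (0, N - 1, new LongForLoop()
                            {
                            long myCount;
                            Random prng;
                            public void start()     // per-thread setup
                                {
                                myCount = 0;
                                prng = new Random (seed + Thread.currentThread().getId());
                                }
                            public void run (long first, long last)
                                {
                                for (long i = first; i <= last; ++ i)
                                    {
                                    double x = prng.nextDouble();
                                    double y = prng.nextDouble();
                                    if (x*x + y*y <= 1.0) ++ myCount;
                                    }
                                }
                            public void finish()    // per-thread wrap-up
                                {
                                count.addAndGet (myCount);
                                }
                            });
                        }
                    });

                System.out.println ("pi =~ " + (4.0 * count.get() / N));
                }
            }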
     
  • Running time measurements on the parasite.cs.rit.edu machine (30-Sep-2005)
    • Running times in milliseconds
    • K = number of processors
    • K=0 stands for the sequential version; command:
        java edu.rit.smp.monte.PiSeq2 285714 1000000000
    • K>0 stands for the parallel version; command:
        java -Dpj.nt=$K edu.rit.smp.monte.PiSmp2 285714 1000000000
      K   Run 1   Run 2   Run 3   Run 4   Run 5   Median
      -   -----   -----   -----   -----   -----   ------
      0   71158   71158   71388   71156   71158    71158
      1   74214   74085   74098   74077   74232    74085
      2   37142   37115   37229   37149   37101    37142
      3   24852   24852   24853   24766   24851    24852
      4   18715   18740   18636   18640   18716    18715
      5   14899   14930   14932   14933   14934    14932
      6   12515   12518   12457   12501   12499    12501
      7   10696   10724   10695   10778   10694    10696
      8    9492    9419    9465    9407    9371     9419
      
  • Speedup calculations (a small calculation sketch follows the table below)
    • Hold problem size constant, look at running time versus number of processors
    • Speedup(K) = (Sequential version time) / (Parallel version time on K processors)
    • Efficiency(K) = Speedup(K) / K
      K       T   Spdup   Effi.
      -   -----   -----   -----
      0   71158
      1   74085   0.960   0.960
      2   37142   1.916   0.958
      3   24852   2.863   0.954
      4   18715   3.802   0.951
      5   14932   4.765   0.953
      6   12501   5.692   0.949
      7   10696   6.653   0.950
      8    9419   7.555   0.944
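
  • The speedup and efficiency columns follow directly from the two formulas; the
    few lines below simply recompute them from the median running times in the
    table (no new data):

        // Recompute Speedup(K) and Efficiency(K) from the median times (ms).
        public class SpeedupTable
            {
            public static void main (String[] args)
                {
                long seqTime = 71158;                          // K = 0 median
                long[] parTime = { 74085, 37142, 24852, 18715, // K = 1..8 medians
                                   14932, 12501, 10696, 9419 };
                for (int k = 1; k <= parTime.length; ++ k)
                    {
                    double speedup = (double) seqTime / parTime[k-1];
                    double efficiency = speedup / k;
                    System.out.printf ("K=%d  speedup=%.3f  efficiency=%.3f%n",
                        k, speedup, efficiency);
                    }
                }
            }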
      
  • Tests done by Luke McOmber in the Spring 2005 quarter show that Java/PJ programs' performance equals or exceeds equivalent C/OpenMP programs' performance


Cluster Parallel Programming with PJ

  • Monte Carlo calculation of pi -- sequential program
  • Monte Carlo calculation of pi -- PJ parallel program (a rough sketch of the message-passing structure appears below)
  • Demonstrations
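
  • A rough, from-memory sketch of how a PJ cluster program is structured: one
    process per backend node, launched with -Dpj.np=<K>. Comm.init, Comm.world,
    rank, and size are as I recall the PJ API; the real edu.rit.clu.monte.PiClu
    additionally combines the per-process counts with one of PJ's collective
    reduction operations, which is deliberately omitted here rather than guessed
    at:

        import edu.rit.pj.Comm;
        import java.util.Random;

        public class PiCluSketch
            {
            public static void main (String[] args) throws Exception
                {
                Comm.init (args);                  // join the cluster job
                Comm world = Comm.world();
                int size = world.size();           // number of processes (pj.np)
                int rank = world.rank();           // this process, 0..size-1

                long seed = Long.parseLong (args[0]);
                long N = Long.parseLong (args[1]); // iterations per process

                // Each process does its own N iterations with its own PRNG,
                // so total problem size = N * K, as in the tables below.
                Random prng = new Random (seed + rank);
                long count = 0;
                for (long i = 0; i < N; ++ i)
                    {
                    double x = prng.nextDouble();
                    double y = prng.nextDouble();
                    if (x*x + y*y <= 1.0) ++ count;
                    }

                // The real program reduces the counts to rank 0 and prints one
                // combined estimate; this sketch reports per-process results.
                System.out.println ("Rank " + rank + "/" + size +
                    ": local pi =~ " + (4.0 * count / N));
                }
            }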
     
  • Running time measurements on the paranoia.cs.rit.edu cluster (09-Jun-2006)
    • Running times in milliseconds
    • N = number of iterations per processor
    • K = number of processors
    • Size = number of iterations = N * K
    • K=0 stands for the sequential version; command:
        java -Dpj.np=1 edu.rit.clu.monte.PiSeq 285714 $N
    • K>0 stands for the parallel version; command:
        java -Dpj.np=$K edu.rit.clu.monte.PiClu 285714 $N
              N   K        Size   Run 1   Run 2   Run 3   Run 4   Run 5   Median
              -   -        ----   -----   -----   -----   -----   -----   ------
       50000000   0    50000000   31016   31045   31023   31051   30983    31023
       50000000   1    50000000   30682   30718   30690   30638   30664    30682
       50000000   2   100000000   30716   30706   30749   30701   30698    30706
       50000000   3   150000000   30718   30748   30704   30706   30694    30706
       50000000   4   200000000   30690   30697   30810   30704   30664    30697
       50000000   5   250000000   30835   30763   30712   30640   30660    30712
       50000000   6   300000000   30729   30710   30676   30706   30670    30706
       50000000   7   350000000   30670   30664   30806   30648   30665    30665
       50000000   8   400000000   30701   30953   30741   30698   30653    30701
      
      100000000   0   100000000   61970   61933   61971   61946   61898    61946
      100000000   1   100000000   61387   61208   61184   61175   61160    61184
      100000000   2   200000000   61221   61193   61202   61245   61213    61213
      100000000   3   300000000   61135   61152   61202   61176   61150    61152
      100000000   4   400000000   61178   61183   61169   61189   61160    61178
      100000000   5   500000000   61173   61138   61265   61195   61171    61173
      100000000   6   600000000   61288   61387   61268   61154   61161    61268
      100000000   7   700000000   61163   61141   61275   61156   61190    61163
      100000000   8   800000000   61245   61568   61316   61178   61174    61245
      
  • Sizeup calculations
    • Hold running time constant, look at problem size versus number of processors
    • Use interpolation to find size S that would yield (for example) time T = 60000 (a worked example follows the table below)
    • Sizeup(K) = (Parallel version size on K processors) / (Sequential version size)
    • Sizeup efficiency(K) = Sizeup(K) / K
      K           S   Sizeup   Effi.
      -           -   ------   -----
      0    96853475
      1    98059144    1.012   1.012
      2   196023863    2.024   1.012
      3   294324378    3.039   1.013
      4   392270595    4.050   1.013
      5   490372936    5.063   1.013
      6   587553171    6.066   1.011
      7   686653223    7.090   1.013
      8   783695652    8.092   1.011
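
  • As a worked example of the interpolation step: for K = 1 the two median times
    bracket T = 60000 ms (30682 ms at size 50,000,000 and 61184 ms at size
    100,000,000), and linear interpolation between those two points lands very
    close to the 98,059,144 in the table above (the small difference presumably
    comes from rounding or from the exact interpolation used). The snippet below
    just carries out that arithmetic:

        // Linear interpolation for the sizeup calculation: find the size S
        // that would take T = 60000 ms, given two measured (size, time) points.
        // The numbers are the K = 1 medians from the tables above.
        public class SizeupInterpolation
            {
            public static void main (String[] args)
                {
                double t = 60000.0;                        // target time, ms
                double size1 = 50000000, time1 = 30682;    // N = 50000000 run
                double size2 = 100000000, time2 = 61184;   // N = 100000000 run
                double s = size1 + (t - time1) / (time2 - time1) * (size2 - size1);
                System.out.printf ("Interpolated size for T = %.0f ms: %.0f%n", t, s);
                }
            }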
      


Status

  • Shared memory parallel programming API complete
     
  • Message passing parallel programming API nearly complete
    • Some of the more esoteric collective communication operations are not implemented
       
  • Major redesign of the cluster parallel programming capabilities recently completed
    • Message passing classes redesigned to improve performance
    • Cluster middleware redesigned to make it easier for the user to run cluster parallel programs
    • Web interface for viewing cluster status and job queue
       
  • I use PJ in my Parallel Computing I class
  • I presented a "work in progress" poster on PJ at the SIGCSE 2006 Conference
    • Poster (PDF, 188,290 bytes, 36" x 24")
       

       
  • I started writing a parallel programming textbook using Java and PJ


Future Plans

  • Continue work on the PJ API
    • Continue improving performance of the message passing classes
    • Implement remaining collective communication operations
    • Expand the web frontend to include job submission and job control
       
  • Continue teaching the Parallel Computing I class with PJ
     
  • Finish writing the textbook and publish it
     
  • Accumulate more performance measurements
     
  • Work on solving scientific computing problems with PJ programs
    • Computational medicine: MRI spin relaxometry -- done with mpiJava; needs to be redone with PJ
    • Computational biology: Maximum parsimony phylogenetic tree construction

Copyright © 2006 Alan Kaminsky. All rights reserved. Last updated 07-Nov-2006. Please send comments to ark@cs.rit.edu.