
Parallel Java
An API for Developing Parallel Programs in 100% Java
Lecture Notes

Prof. Alan Kaminsky
Department of Computer Science
Rochester Institute of Technology

Presented at the Rochester Java User's Group Meeting
June 13, 2006


Overview


Modern Parallel Computing

  • Parallel computer architectures:
    • Shared memory multiprocessor (SMP) parallel computers
    • Cluster parallel computers
    • Hybrid SMP cluster parallel computers
     
  • The RIT CS Department's parallel computers:
     
    • SMP parallel computers
      • paradise.cs.rit.edu -- Four Sun UltraSPARC-IV dual-core CPUs, eight processors, 1.35 GHz clock speed, 16 GB main memory
      • parasite.cs.rit.edu -- Four Sun UltraSPARC-IV dual-core CPUs, eight processors, 1.35 GHz clock speed, 16 GB main memory
      • paradox.cs.rit.edu -- Four Sun UltraSPARC-II CPUs, four processors, 450 MHz clock speed, 4 GB main memory
      • paragon.cs.rit.edu -- Four Sun UltraSPARC-II CPUs, four processors, 450 MHz clock speed, 4 GB main memory
       
    • Cluster parallel computer
      • Frontend computer -- paranoia.cs.rit.edu -- one Sun UltraSPARC-II CPU, 296 MHz clock speed, 192 MB main memory
      • 32 backend computers -- thug01 through thug32 -- each with one Sun UltraSPARC-IIi CPU, 440 MHz clock speed, 256 MB main memory
      • 100-Mbps switched Ethernet backend interconnection network
      • Aggregate 14 GHz clock speed, 8 GB main memory
       
    • Hybrid SMP cluster parallel computer
      • The four SMP parallel computers are also clustered
      • No separate backend network yet
       
  • Standard middleware for SMP parallel programming: OpenMP
  • Standard middleware for cluster parallel programming: Message Passing Interface (MPI)


SMP Parallel Programming in Java with OpenMP

  • Monte Carlo technique for computing an approximate value of pi

    [Figure: a quarter circle of radius 1 inscribed in the unit square, with random points scattered over the square]
       
    • The area of the unit square is 1
    • The area of the circle quadrant is pi/4
    • Generate a large number of points at random in the unit square
    • Count how many of them fall within the circle quadrant, i.e. those at distance <= 1 from the origin
    • The fraction of the points within the circle quadrant gives an approximation for pi/4
    • 4 x this fraction gives an approximation for pi
       
  • Sequential program
    • Package edu.rit.openmp.monte
  • Java version of OpenMP: JOMP
  • JOMP parallel program
    • Package edu.rit.openmp.monte
  • Process for using OpenMP in Java: a precompiler expands the //omp comment directives into plain multithreaded Java source, which is then compiled with javac (see the sketches below)
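  • The heart of both versions is the same loop. Here is a minimal sketch of the sequential computation -- not the actual edu.rit.openmp.monte source; it uses java.util.Random, and it assumes (following the command lines shown later in these notes) that the two arguments are a PRNG seed and the number of points:

      import java.util.Random;

      public class PiSeqSketch
          {
          public static void main (String[] args)
              {
              long seed = Long.parseLong (args[0]); // e.g. 285714
              long N = Long.parseLong (args[1]);    // number of random points
              Random prng = new Random (seed);
              long count = 0;
              for (long i = 0; i < N; ++ i)
                  {
                  double x = prng.nextDouble();
                  double y = prng.nextDouble();
                  if (x*x + y*y <= 1.0) ++ count; // point inside the quadrant?
                  }
              System.out.println ("pi = " + (4.0 * count / N));
              }
          }

    In JOMP, the same loop is parallelized by a comment directive that the precompiler expands into multithreaded Java code. Schematically -- the directive syntax is recalled from the JOMP documentation, so treat it as approximate:

      int n = 1000000000; // hypothetical iteration count
      int count = 0;
      //omp parallel for reduction(+:count)
      for (int i = 0; i < n; ++ i)
          {
          // generate (x, y) using a per-thread PRNG and increment count
          // when x*x + y*y <= 1.0, exactly as in the sequential loop
          }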
     


Criticisms

  • OpenMP and MPI were designed for use with Fortran and C, not Java
     
  • OpenMP and MPI are not object oriented
     
  • Existing forays into parallel programming middleware in Java leave much to be desired
     
  • There is a Java version of MPI -- mpiJava -- but it is just a thin veneer on top of the non-object-oriented MPI API and does not smell like a Java API
  • JOMP mimics OpenMP, but the precompiler directive approach feels unnatural to Java programmers
     
  • JOMP is alpha-quality software: buggy, no longer maintained, and no source code has been released
     
  • JOMP and mpiJava do not play well together
     
  • MPI and threads do not play well together
     
  • I am not aware of any middleware standard that encompasses SMP, cluster, and hybrid SMP cluster parallel programming


Parallel Java (PJ)

  • "Parallel Java (PJ) is an API and middleware for parallel programming in 100% Java on shared memory multiprocessor (SMP) parallel computers, cluster parallel computers, and hybrid SMP cluster parallel computers. PJ was developed by Professor Alan Kaminsky and his student Luke McOmber in the Department of Computer Science at the Rochester Institute of Technology."
     
  • Main PJ page: http://www.cs.rit.edu/~ark/pj.shtml
     
  • Javadoc: http://www.cs.rit.edu/~ark/pj/doc/index.html
     
  • Parallel Computing I course web site: http://www.cs.rit.edu/~ark/531/


SMP Parallel Programming with PJ

  • Monte Carlo calculation of pi -- sequential program
  • Monte Carlo calculation of pi -- PJ parallel program (a sketch of its structure appears at the end of this section)
  • Demonstrations
     
  • Running time measurements on the parasite.cs.rit.edu machine (30-Sep-2005)
    • Running times in milliseconds
    • K = number of processors
    • K=0 stands for the sequential version; command:
        java edu.rit.smp.monte.PiSeq2 285714 1000000000
    • K>0 stands for the parallel version; command:
        java -Dpj.nt=$K edu.rit.smp.monte.PiSmp2 285714 1000000000
      K   Run 1   Run 2   Run 3   Run 4   Run 5   Aver.
      -   -----   -----   -----   -----   -----   -----
      0   71158   71158   71388   71156   71158   71204
      1   74214   74085   74098   74077   74232   74141
      2   37142   37115   37229   37149   37101   37147
      3   24852   24852   24853   24766   24851   24835
      4   18715   18740   18636   18640   18716   18689
      5   14899   14930   14932   14933   14934   14926
      6   12515   12518   12457   12501   12499   12498
      7   10696   10724   10695   10778   10694   10717
      8    9492    9419    9465    9407    9371    9431
      
  • Speedup calculations
    • Hold the problem size constant and look at running time versus number of processors
    • Speedup(K) = (sequential version time) / (parallel version time on K processors)
    • Efficiency(K) = Speedup(K) / K
    • For example, Speedup(8) = 71204 / 9431 = 7.550, so Efficiency(8) = 7.550 / 8 = 0.944
      K       T   Spdup   Effi.
      -   -----   -----   -----
      0   71204
      1   74141   0.960   0.960
      2   37147   1.917   0.958
      3   24835   2.867   0.956
      4   18689   3.810   0.952
      5   14926   4.771   0.954
      6   12498   5.697   0.950
      7   10717   6.644   0.949
      8    9431   7.550   0.944
      
  • Tests done by Luke McOmber in the Spring 2005 quarter showed that the performance of Java/PJ programs equals or exceeds that of equivalent C/OpenMP programs
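  • For reference, the parallel loop structure of a PJ SMP program like PiSmp2, sketched against the edu.rit.pj API -- the class and method names here are recalled from the PJ Javadoc linked above, so treat the exact signatures as approximate:

      import edu.rit.pj.LongForLoop;
      import edu.rit.pj.ParallelRegion;
      import edu.rit.pj.ParallelTeam;
      import edu.rit.pj.reduction.SharedLong;
      import java.util.Random;

      public class PiSmpSketch
          {
          public static void main (final String[] args) throws Exception
              {
              final long seed = Long.parseLong (args[0]);
              final long N = Long.parseLong (args[1]);
              final SharedLong count = new SharedLong (0); // reduction variable

              // One team of K threads (-Dpj.nt=K); the loop iterations
              // are divided among the team's threads automatically.
              new ParallelTeam().execute (new ParallelRegion()
                  {
                  public void run() throws Exception
                      {
                      execute (0, N-1, new LongForLoop()
                          {
                          Random prng;
                          long thrCount;

                          public void start()
                              {
                              // Per-thread state: each thread gets its own PRNG
                              prng = new Random (seed + getThreadIndex());
                              thrCount = 0;
                              }

                          public void run (long first, long last)
                              {
                              for (long i = first; i <= last; ++ i)
                                  {
                                  double x = prng.nextDouble();
                                  double y = prng.nextDouble();
                                  if (x*x + y*y <= 1.0) ++ thrCount;
                                  }
                              }

                          public void finish()
                              {
                              count.addAndGet (thrCount); // fold into the total
                              }
                          });
                      }
                  });
              System.out.println ("pi = " + (4.0 * count.get() / N));
              }
          }

    The anonymous inner classes play the roles of OpenMP's parallel region and parallel for constructs, and the per-thread counts folded into a SharedLong replace OpenMP's reduction clause.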


Cluster Parallel Programming with PJ

  • Monte Carlo calculation of pi -- sequential program
  • Monte Carlo calculation of pi -- PJ parallel program (a message-passing sketch appears at the end of this section)
  • Demonstrations
     
  • Running time measurements on the paranoia.cs.rit.edu cluster (09-Jun-2006)
    • Running times in milliseconds
    • N = number of iterations per processor
    • K = number of processors
    • Size = total number of iterations = N * K (for the sequential version, K = 0, the size is just N)
    • K=0 stands for the sequential version; command:
        java -Dpj.np=1 edu.rit.clu.monte.PiSeq 285714 $N
    • K>0 stands for the parallel version; command:
        java -Dpj.np=$K edu.rit.clu.monte.PiClu 285714 $N
              N   K        Size   Run 1   Run 2   Run 3   Run 4   Run 5   Aver.
              -   -        ----   -----   -----   -----   -----   -----   -----
       50000000   0    50000000   31016   31045   31023   31051   30983   31024
       50000000   1    50000000   30682   30718   30690   30638   30664   30678
       50000000   2   100000000   30716   30706   30749   30701   30698   30714
       50000000   3   150000000   30718   30748   30704   30706   30694   30714
       50000000   4   200000000   30690   30697   30810   30704   30664   30713
       50000000   5   250000000   30835   30763   30712   30640   30660   30722
       50000000   6   300000000   30729   30710   30676   30706   30670   30698
       50000000   7   350000000   30670   30664   30806   30648   30665   30691
       50000000   8   400000000   30701   30953   30741   30698   30653   30749
      
      100000000   0   100000000   61970   61933   61971   61946   61898   61944
      100000000   1   100000000   61387   61208   61184   61175   61160   61223
      100000000   2   200000000   61221   61193   61202   61245   61213   61214
      100000000   3   300000000   61135   61152   61202   61176   61150   61163
      100000000   4   400000000   61178   61183   61169   61189   61160   61176
      100000000   5   500000000   61173   61138   61265   61195   61171   61188
      100000000   6   600000000   61288   61387   61268   61154   61161   61252
      100000000   7   700000000   61163   61141   61275   61156   61190   61185
      100000000   8   800000000   61245   61568   61316   61178   61174   61296
      
  • Sizeup calculations
    • Hold the running time constant and look at problem size versus number of processors
    • Use logarithmic interpolation between the measured points to find the size S that would yield (for example) a time of T = 60000 ms (see the sketch after the table)
    • Sizeup(K) = (parallel version size on K processors) / (sequential version size)
    • Sizeup efficiency(K) = Sizeup(K) / K
      K           S   Sizup   Effi.
      -           -   -----   -----
      0    96854189
      1    97996174   1.012   1.012
      2   196013717   2.024   1.012
      3   294260112   3.038   1.013
      4   392265644   4.050   1.013
      5   490233964   5.062   1.012
      6   587694653   6.068   1.011
      7   686380207   7.087   1.012
      8   783005550   8.084   1.011
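  • The interpolation above is ordinary linear interpolation in log-log space between the two measured points that bracket the target time. A sketch -- the method and variable names here are mine:

      public class SizeupSketch
          {
          // Given measured points (s1, t1) and (s2, t2) with t1 < T < t2,
          // return the problem size S whose running time would be T,
          // assuming log(size) is linear in log(time) between the points.
          static double interpolateSize
              (double s1, double t1, double s2, double t2, double T)
              {
              double f = (Math.log(T) - Math.log(t1)) /
                  (Math.log(t2) - Math.log(t1));
              return Math.exp (Math.log(s1) + f * (Math.log(s2) - Math.log(s1)));
              }

          public static void main (String[] args)
              {
              // Sequential (K=0) measurements from the tables above:
              // 50000000 iterations took 31024 ms, 100000000 took 61944 ms
              double S = interpolateSize (50e6, 31024, 100e6, 61944, 60000);
              // Prints about 9.69e7, within 0.02% of the table's 96854189
              System.out.printf ("S = %.0f%n", S);
              }
          }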
      
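  • For reference, the message-passing structure of a PJ cluster program like PiClu, sketched against the edu.rit.pj.Comm API -- again, the class and method names are recalled from the PJ Javadoc, so treat the details as approximate:

      import edu.rit.mp.LongBuf;
      import edu.rit.mp.buf.LongItemBuf;
      import edu.rit.pj.Comm;
      import edu.rit.pj.reduction.LongOp;
      import java.util.Random;

      public class PiCluSketch
          {
          public static void main (String[] args) throws Exception
              {
              Comm.init (args);          // join the job (-Dpj.np=K processes)
              Comm world = Comm.world();
              int size = world.size();   // number of processes, K
              int rank = world.rank();   // this process's rank, 0 .. K-1

              long seed = Long.parseLong (args[0]);
              long N = Long.parseLong (args[1]); // iterations per process

              // Each process counts hits for its own N points with its own
              // PRNG -- no communication during the computation.
              Random prng = new Random (seed + rank);
              long count = 0;
              for (long i = 0; i < N; ++ i)
                  {
                  double x = prng.nextDouble();
                  double y = prng.nextDouble();
                  if (x*x + y*y <= 1.0) ++ count;
                  }

              // Sum-reduce all the per-process counts into process 0.
              LongItemBuf buf = LongBuf.buffer();
              buf.item = count;
              world.reduce (0, buf, LongOp.SUM);
              if (rank == 0)
                  {
                  System.out.println
                      ("pi = " + (4.0 * buf.item / ((double) N * size)));
                  }
              }
          }

    This embarrassingly parallel structure -- one reduction at the end and no other communication -- is why the sizeup efficiencies above stay at essentially 1.0 all the way out to eight processors.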


Status

  • Shared memory parallel programming API complete
     
  • Message passing parallel programming API nearly complete
    • Some of the more esoteric collective communication operations are not implemented
       
  • I used PJ in my Parallel Computing I class in the Winter 2005 quarter
  • I presented a "work in progress" poster on PJ at the SIGCSE 2006 Conference
    • Poster (PDF, 188,290 bytes, 36" x 24")
       


Future Plans

  • Continue work on the PJ API
    • Reimplement message passing operations to improve performance
    • Implement remaining collective communication operations
    • Change the cluster middleware to make it easier for the user to run cluster parallel programs
    • Add a web frontend
       
  • Continue teaching the Parallel Computing I class with PJ
     
  • Write a textbook on parallel programming, using Java/PJ as the programming language
     
  • Accumulate more performance measurements
     
  • Work on solving scientific computing problems with PJ programs

Copyright © 2006 Alan Kaminsky. All rights reserved. Last updated 13-Jun-2006. Please send comments to ark@cs.rit.edu.