
4003-440-02 Operating Systems I
Module 11. Parallel and Distributed Systems -- Lecture Notes

Prof. Alan Kaminsky -- Winter Quarter 2012
Rochester Institute of Technology -- Department of Computer Science


Little and Big

                  Little Data                 Big Data
    Little CPU    Word processor              Google PageRank
                  Spreadsheet                 Facebook ad targeting
    Big CPU       Cryptographic attacks       Climate modeling
                  NP-hard problems            Nuclear explosion modeling


An Example Problem

  • A toxic spill contaminates the soil

    [Figures: initial spill; the pollutant starts diffusing; after a longer time]

  • Partial differential equation: ∂u/∂t = D (∂²u/∂x² + ∂²u/∂y²)

  • Computation
    • Given diffusion coefficients D(x,y) . . .
    • And given initial pollutant concentration u(x,y) . . .
    • Compute pollutant concentration as a function of time . . .
    • On an N×N spatial grid (N might be several thousand) . . .
    • For many, many small time steps (thousands or millions) . . .
    • And make a movie to visualize the results

  • Class ToxicSpillSeq (download)
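
  • The heart of the sequential program is a pair of nested loops that apply an explicit finite difference update to every interior grid point, once per time step. Below is a minimal sketch of that update, assuming an N×N grid; the class and variable names (DiffusionStepSketch, u, uNew, d, dt, dx) are illustrative and are not the actual ToxicSpillSeq source.

    // Hedged sketch of one explicit finite-difference time step for
    // du/dt = D (d2u/dx2 + d2u/dy2) on an N x N grid.
    public class DiffusionStepSketch
        {
        // Update uNew from u for one time step of length dt, on a grid with
        // spacing dx and per-cell diffusion coefficients d.
        static void timeStep (double[][] u, double[][] uNew, double[][] d,
                              int N, double dt, double dx)
            {
            for (int r = 1; r < N-1; ++ r)
                for (int c = 1; c < N-1; ++ c)
                    {
                    double lap = (u[r-1][c] + u[r+1][c] + u[r][c-1]
                        + u[r][c+1] - 4.0*u[r][c]) / (dx*dx);
                    uNew[r][c] = u[r][c] + dt * d[r][c] * lap;
                    }
            }

        public static void main (String[] args)
            {
            int N = 8;
            double[][] u = new double [N][N], uNew = new double [N][N];
            double[][] d = new double [N][N];
            for (double[] row : d) java.util.Arrays.fill (row, 1.0);
            u[N/2][N/2] = 1.0;                            // initial "spill"
            for (int step = 0; step < 10; ++ step)
                {
                timeStep (u, uNew, d, N, 0.01, 1.0);
                double[][] tmp = u; u = uNew; uNew = tmp; // swap grids
                }
            System.out.println (u[N/2][N/2]);             // pollutant has spread out
            }
        }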


SMP Parallel Programming

  • Multicore (symmetric multiprocessor, SMP) parallel computer

  • SMP parallel program organization

  • OpenMP Home Page. http://www.openmp.org/

  • Parallel Java Library. http://www.cs.rit.edu/~ark/pj.shtml

  • Class ToxicSpillSmp (download)

  • Running times (msec) on one node of the tardis.cs.rit.edu computer
    • N = Grid size
    • NT = Number of parallel threads
    • SMP computation amount = Sequential computation amount ("strong scaling")
    • Ideally, SMP time = (1/NT)×sequential time
    • Speedup = Sequential time/SMP time
    • Efficiency = Speedup/NT
    $ java ToxicSpillSeq 142857 $N 0.01 50 400
    $ java -Dpj.nt=$NT ToxicSpillSmp 142857 $N 0.01 50 400
    
       N   NT      T1      T2      T3       T  Spdup  Effic
    1000  seq  262621  262497  262944  262497
    1000    1  264094  265627  262970  262970  0.998  0.998
    1000    2  159800  160433  158530  158530  1.656  0.828
    1000    3  118103  117248  123653  117248  2.239  0.746
    1000    4  106365   98636  100802   98636  2.661  0.665
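
  • The actual ToxicSpillSmp program uses the Parallel Java Library's parallel loop constructs. The sketch below shows the same strong-scaling organization using only plain java.util.concurrent threads: the same grid as the sequential program, with the outer row loop of each time step split among NT threads. The class name SmpStepSketch and the variable names are illustrative, not the ToxicSpillSmp source.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hedged sketch: split the outer row loop of one time step across NT
    // threads ("strong scaling" -- same problem, more threads). A real
    // program would reuse the thread pool across all time steps.
    public class SmpStepSketch
        {
        static void parallelTimeStep (final double[][] u, final double[][] uNew,
                                      final double[][] d, final int N,
                                      final double dt, final double dx, int NT)
            throws InterruptedException
            {
            ExecutorService pool = Executors.newFixedThreadPool (NT);
            final int chunk = (N - 2 + NT - 1) / NT;        // rows per thread
            final CountDownLatch done = new CountDownLatch (NT);
            for (int t = 0; t < NT; ++ t)
                {
                final int lb = 1 + t*chunk;
                final int ub = Math.min (N - 1, lb + chunk);
                pool.execute (new Runnable()
                    {
                    public void run()
                        {
                        for (int r = lb; r < ub; ++ r)
                            for (int c = 1; c < N-1; ++ c)
                                {
                                double lap = (u[r-1][c] + u[r+1][c] + u[r][c-1]
                                    + u[r][c+1] - 4.0*u[r][c]) / (dx*dx);
                                uNew[r][c] = u[r][c] + dt*d[r][c]*lap;
                                }
                        done.countDown();
                        }
                    });
                }
            done.await();        // barrier: all rows updated before the next step
            pool.shutdown();
            }
        }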
    


Amdahl's Law

  • There is a limit on the amount of speedup you can get with strong scaling

  • Sequential portion and parallelizable portion of a program

  • A parallel program with running time T(N,1) and sequential fraction F executing with different numbers of processors K

  • Amdahl's Law
    • Running time T(N,K) = F⋅T(N,1) + (1/K)⋅(1 − F)⋅T(N,1)
    • Speedup(N,K) = 1/(F + (1 − F)/K)
    • Efficiency(N,K) = 1/(K⋅F + 1 − F)
          where
    • N = Problem size
    • K = Number of processors
    • F = Sequential fraction
    • T(N,1) = Sequential program running time on one processor
    • T(N,K) = Parallel program running time on K processors

  • Limits on speedup
    • As K goes to infinity:
    • Speedup goes to 1/F
    • Efficiency goes to 0
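
  • As a worked example: solving the K = 4 strong-scaling measurement above (speedup 2.661) for F gives a sequential fraction of roughly 0.17, which would cap the speedup at about 1/F ≈ 6 no matter how many cores were added. The sketch below just evaluates the formulas; F = 0.17 is that illustrative value, not a measured quantity.

    // Hedged sketch: Amdahl's Law predictions for an assumed sequential
    // fraction F = 0.17 (roughly what the K = 4 measurement above implies).
    public class AmdahlSketch
        {
        public static void main (String[] args)
            {
            double F = 0.17;
            for (int K = 1; K <= 64; K *= 2)
                {
                double speedup = 1.0 / (F + (1.0 - F)/K);
                double efficiency = speedup / K;          // = 1/(K*F + 1 - F)
                System.out.printf ("K=%2d  speedup=%5.2f  efficiency=%4.2f%n",
                    K, speedup, efficiency);
                }
            System.out.printf ("Limit as K -> infinity: speedup = %.2f%n", 1.0/F);
            }
        }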


Cluster Parallel Programming

  • Cluster parallel computer

  • Cluster parallel program organization

  • The Message Passing Interface (MPI) Standard. http://www-unix.mcs.anl.gov/mpi/

  • Parallel Java Library. http://www.cs.rit.edu/~ark/pj.shtml

  • Class ToxicSpillClu (download)

  • Running times (msec) on 1 to 10 nodes of the tardis.cs.rit.edu computer
    • N = Grid size
    • NP = Number of parallel processes
    • Cluster computation amount = NP×Sequential computation amount ("weak scaling")
    • The grid side N is scaled by √NP (e.g. 1414 ≈ 1000×√2), so the N×N amount of work grows in proportion to NP
    • Ideally, cluster time = sequential time
    • Efficiency = Sequential time/Cluster time
    • Sizeup = Efficiency×NP
    $ java ToxicSpillSeq 142857 $N 0.01 50 400
    $ java -Dpj.np=$NP ToxicSpillClu 142857 $N 0.01 50 400
    
       N   NP      T1      T2      T3       T  Sizup  Effic
    1000  seq  262621  262497  262944  262497
    1000    1  265259  263263  263973  263263  0.997  0.997
    1414    2  273991  273570  277884  273570  1.919  0.960
    1732    3  267582  268205  267738  267582  2.943  0.981
    2000    4  280138  279639  277523  277523  3.783  0.946
    2236    5  281795  282604  281097  281097  4.669  0.934
    2449    6  292899  292335  293445  292335  5.388  0.898
    2646    7  292031  291103  291091  291091  6.312  0.902
    2828    8  301575  303670  303386  301575  6.963  0.870
    3000    9  274363  272176  273202  272176  8.680  0.964
    3162   10  293309  292879  294578  292879  8.963  0.896
    

  • Weak scaling often yields better efficiencies than strong scaling
    • You can do weak scaling on an SMP parallel computer too . . .
    • But you may run out of memory

  • A cluster parallel computer lets you scale the problem sizes larger than an SMP parallel computer
    • There can be more total memory in all the nodes of the cluster
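
  • The actual ToxicSpillClu program does its message passing with the Parallel Java Library. The outline below shows the usual cluster organization for this kind of stencil computation: each process owns a horizontal slice of the grid plus one "halo" row above and below, and exchanges halo rows with its neighbors every time step. sendRow() and receiveRow() are hypothetical placeholders standing in for the real message passing calls; they are not the ToxicSpillClu API.

    // Hedged outline of one cluster time step. A real program must order or
    // pair the sends and receives (or use non-blocking calls) to avoid deadlock.
    public class ClusterStepSketch
        {
        static void clusterTimeStep (double[][] slice, double[][] newSlice,
                                     double[][] d, int rank, int NP,
                                     int rowsPerProcess, int N,
                                     double dt, double dx)
            {
            // 1. Exchange halo rows with neighboring processes.
            if (rank > 0)
                {
                sendRow (rank - 1, slice[1]);                  // my top interior row
                receiveRow (rank - 1, slice[0]);               // neighbor's bottom row
                }
            if (rank < NP - 1)
                {
                sendRow (rank + 1, slice[rowsPerProcess]);     // my bottom interior row
                receiveRow (rank + 1, slice[rowsPerProcess + 1]);
                }
            // 2. Update only the rows this process owns.
            for (int r = 1; r <= rowsPerProcess; ++ r)
                for (int c = 1; c < N-1; ++ c)
                    {
                    double lap = (slice[r-1][c] + slice[r+1][c] + slice[r][c-1]
                        + slice[r][c+1] - 4.0*slice[r][c]) / (dx*dx);
                    newSlice[r][c] = slice[r][c] + dt*d[r][c]*lap;
                    }
            }

        // Hypothetical message passing helpers -- placeholders only.
        static void sendRow (int toRank, double[] row) { /* send over the network */ }
        static void receiveRow (int fromRank, double[] row) { /* receive into row */ }
        }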


Hybrid SMP Cluster Parallel Programming

  • Hybrid SMP cluster parallel computer

  • Hybrid parallel program organization

  • Parallel Java Library. http://www.cs.rit.edu/~ark/pj.shtml

  • Class ToxicSpillHyb (download)

  • Running times (msec) on 1 to 10 nodes of the tardis.cs.rit.edu computer
    • N = Grid size
    • NP = Number of parallel processes
    • NT = Number of parallel threads per process
    • Cluster computation amount = NP×Sequential computation amount ("weak scaling")
    • Ideally, cluster time (for NT = 1) = sequential time
    • Efficiency (for NT = 1) = Sequential time/Cluster time
    • Sizeup = Efficiency×NP
    • Speedup = NP×Sequential time/Cluster time
    • Efficiency (of speedup) = Speedup/(NP×NT)
    $ java ToxicSpillSeq 142857 $N 0.01 50 400
    $ java -Dpj.np=$NP -Dpj.nt=$NT ToxicSpillHyb 142857 $N 0.01 50 400
    
       N   NP   NT      T1      T2      T3       T  Sizup  Effic   Spdup  Effic
    1000  seq  seq  262621  262497  262944  262497
    
    1000    1    1  263261  263282  264324  263261  0.997  0.997   0.997  0.997
    1000    1    2  160303  159772  160856  159772                 1.643  0.821
    1000    1    3  120658  122926  119118  119118                 2.204  0.735
    1000    1    4   99424   99687   99130   99130                 2.648  0.662
    
    1414    2    1  277448  277208  278598  277208  1.894  0.947   1.894  0.947
    1414    2    2  159554  160026  159646  159554                 3.290  0.823
    1414    2    3  112005  112904  114167  112005                 4.687  0.781
    1414    2    4   99666  100582   99661   99661                 5.268  0.658
    
    1732    3    1  271478  268783  265674  265674  2.964  0.988   2.964  0.988
    1732    3    2  156087  157186  155149  155149                 5.076  0.846
    1732    3    3  129195  128524  130292  128524                 6.127  0.681
    1732    3    4  108991  109581  110394  108991                 7.225  0.602
    
    2000    4    1  278789  280878  282309  278789  3.766  0.942   3.766  0.942
    2000    4    2  174955  176111  176490  174955                 6.001  0.750
    2000    4    3  126928  126807  126678  126678                 8.289  0.691
    2000    4    4  119312  119057  117192  117192                 8.960  0.560
    
    2236    5    1  278990  282125  283478  278990  4.704  0.941   4.704  0.941
    2236    5    2  182128  182006  181720  181720                 7.223  0.722
    2236    5    3  139134  142793  139247  139134                 9.433  0.629
    2236    5    4  118087  117748  119232  117748                11.147  0.557
    
    2449    6    1  294478  292990  295282  292990  5.376  0.896   5.376  0.896
    2449    6    2  179972  180410  181517  179972                 8.751  0.729
    2449    6    3  144379  143871  144403  143871                10.947  0.608
    2449    6    4  126528  127588  128717  126528                12.448  0.519
    
    2646    7    1  290527  293894  292735  290527  6.325  0.904   6.325  0.904
    2646    7    2  180141  179757  181923  179757                10.222  0.730
    2646    7    3  147671  145664  146435  145664                12.615  0.601
    2646    7    4  128993  129066  135903  128993                14.245  0.509
    
    2828    8    1  301686  305434  301451  301451  6.966  0.871   6.966  0.871
    2828    8    2  190790  190606  191097  190606                11.017  0.689
    2828    8    3  154367  153908  155279  153908                13.644  0.569
    2828    8    4  138931  137272  136388  136388                15.397  0.481
    
    3000    9    1  274849  274722  274109  274109  8.619  0.958   8.619  0.958
    3000    9    2  189491  189447  190114  189447                12.470  0.693
    3000    9    3  154827  154031  155999  154031                15.338  0.568
    3000    9    4  139038  142835  139624  139038                16.992  0.472
    
    3162   10    1  295506  291521  292335  291521  9.004  0.900   9.004  0.900
    3162   10    2  180689  179803  179719  179719                14.606  0.730
    3162   10    3  156541  157270  158890  156541                16.769  0.559
    3162   10    4  141045  140720  135661  135661                19.349  0.484
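
  • To make the derived columns concrete, the sketch below recomputes one row of the table (N = 1414, NP = 2, NT = 4) from the formulas above, using the median measured times.

    // Hedged sketch: recomputing the derived columns of the table above.
    public class HybridMetricsSketch
        {
        public static void main (String[] args)
            {
            double seqTime = 262497.0;   // msec, sequential median (N = 1000)
            int NP = 2, NT = 4;
            double hybTime = 99661.0;    // msec, median for N = 1414, NP = 2, NT = 4
            double speedup = NP * seqTime / hybTime;      // 5.268
            double efficiency = speedup / (NP * NT);      // 0.658
            System.out.printf ("speedup = %.3f  efficiency = %.3f%n",
                speedup, efficiency);
            }
        }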
    


GPU Parallel Programming

  • GPU accelerator (NVIDIA Tesla C2075)

  • GPGPU: http://www.gpgpu.org/

  • NVIDIA CUDA Zone: http://developer.nvidia.com/category/zone/cuda-zone

  • OpenCL: http://www.khronos.org/opencl/

  • Program ToxicSpillCpu (download)

  • Program ToxicSpillGpu (download)

  • Other source files

  • Running times (msec) on a desktop CPU and on a GPU
    • Desktop: Intel Pentium, 2.7 GHz, 4 GB memory
    • GPU: NVIDIA Tesla C2050, 448 cores, 1.15 GHz, 3 GB global memory
    • N = Grid size
    • Speedup = ToxicSpillCpu time/ToxicSpillGpu time
    $ ToxicSpillCpu 142857 $N 0.01 50 400
    $ ToxicSpillGpu 142857 $N 0.01 50 400
    
          --------ToxicSpillCpu  ----ToxicSpillGpu
       N  pre     calc    total  pre   calc  total  Speedup
     100    2     5676     5678   61    899    960     5.91
     141    3    16095    16098   60    957   1017    15.83
     200    7    40180    40187   66   1095   1161    34.61
     283   11    56358    56369   66   1402   1468    38.40
     400   13    91177    91190   76   1857   1933    47.18
     566   30   169241   169271   90   2983   3073    55.08
     800   53   329403   329456  116   4656   4772    69.03
    1131  103   604256   604359  174   9073   9247    65.36
    1600  209  1120617  1120826  283  15743  16026    69.94
    


The Top 500 Supercomputers

  • Processing speed
    • 1 flops = 1 floating point operation per second
    • 1 gigaflops = 10^9 flops
    • 1 teraflops = 10^12 flops
    • 1 petaflops = 10^15 flops
    • 1 exaflops = 10^18 flops

  • Memory capacity
    • 1 gigabyte = 10^9 bytes
    • 1 terabyte = 10^12 bytes
    • 1 petabyte = 10^15 bytes
    • 1 exabyte = 10^18 bytes

  • Top500 List: http://www.top500.org/
    • Measures processing speed on the LINPACK linear algebra benchmark
    • CPU intensive benchmark

  • Number One supercomputer: Titan (November 2012)
    • Location: Oak Ridge National Laboratory, USA
    • Manufacturer: Cray
    • Model: XK7
    • CPU: Opteron 6274 16C
    • Clock speed: 2.2 GHz
    • Cores: 560,640
    • Processing speed: 17.5 petaflops
    • RAM: 0.71 petabytes
    • Accelerators: 261,632 NVIDIA K20x

  • Number Two supercomputer: Sequoia (November 2012)
    • Location: Lawrence Livermore National Laboratory, USA
    • Manufacturer: IBM
    • Model: BlueGene/Q
    • CPU: Power BQC 16C
    • Clock speed: 1.6 GHz
    • Cores: 1,572,864
    • Processing speed: 16.3 petaflops
    • RAM: 1.57 petabytes
    • Accelerators: None

  • Number Three supercomputer: K Computer (November 2012)
    • Location: RIKEN Advanced Institute for Computational Science, Japan
    • Manufacturer: Fujitsu
    • CPU: SPARC64 VIIIfx
    • Clock speed: 2.0 GHz
    • Cores: 705,024
    • Processing speed: 10.5 petaflops
    • RAM: 1.41 petabytes
    • Accelerators: None

  • Graph500 List: http://www.graph500.org/
    • Measures processing speed on selected graph algorithms
    • Data intensive benchmark

  • Green500 List: http://www.green500.org/
    • Uses the Top500 List data
    • Ranks computers based on energy efficiency -- performance per watt


Distributed Object Systems

  • A normal method call
    • Calling object → Method (arguments) → Target object
    • Calling object ← Return value ← Target object
      - or -
    • Calling object ← Thrown exception ← Target object

  • A remote method call
    • Calling object → Method (arguments) → Client-side proxy object
    • Client-side proxy object → Network message → Server-side proxy object
    • Server-side proxy object → Method (arguments) → Target object
    • Server-side proxy object ← Return value ← Target object
    • Client-side proxy object ← Network message ← Server-side proxy object
    • Calling object ← Return value ← Client-side proxy object
      - or -
    • Server-side proxy object ← Thrown exception ← Target object
    • Client-side proxy object ← Network message ← Server-side proxy object
    • Calling object ← Thrown exception ← Client-side proxy object

  • Another possibility
    • Calling object → Method (arguments) → Client-side proxy object
    • Client-side proxy object → Network message → Server-side proxy object
    • Client-side proxy object ← Network message ← Server-side proxy object
    • Calling object ← RemoteException ← Client-side proxy object
      • Possible reasons:
      • The client → server message was lost
      • The server → client message was lost
      • The server crashed in the middle of processing the message

  • Still another possibility
    • Calling object → Method (arguments) → Client-side proxy object
    • Calling object ← RemoteException ← Client-side proxy object
      • Possible reasons:
      • The server is down, so the client can't set up a connection

  • Terminology
    • Calling object
    • Target object
    • Remote object
      • A target object that can be called from other processes
    • Remote interface
      • An interface implemented by a remote object containing the methods that can be called from other processes
    • Non-remote interface
      • An interface implemented by a remote object containing the methods that cannot be called from other processes
    • Proxy object
    • Client-side proxy object; stub
      • Implements the same remote interface(s) as the remote object
      • Does not implement the remote object's non-remote interfaces
    • Server-side proxy object; skeleton
    • Remote reference
      • A reference to a client-side proxy object for a remote object
    • Remote exception
      • An exception thrown by the RMI system itself, not by the target object
    • Marshalling
    • Unmarshalling

  • Distributed object middleware; protocols
    • Common Object Request Broker Architecture (CORBA); IIOP/TCP/IP
    • Java Remote Method Invocation (RMI); JRMP/TCP/IP
    • Web services; SOAP/HTTP/TCP/IP, JSON

  • Web Services Example
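
  • As a concrete (hedged) illustration of the calling object / proxy / remote object structure above, here is a minimal Java RMI sketch. The Adder interface and the registry name "adder" are made up for this example; they are not part of the lecture's Web Services Example.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Remote interface: the methods callable from other processes.
    interface Adder extends Remote
        {
        int add (int a, int b) throws RemoteException;
        }

    // Remote (target) object; exporting it lets RMI build the proxy objects.
    class AdderImpl extends UnicastRemoteObject implements Adder
        {
        AdderImpl() throws RemoteException {}
        public int add (int a, int b) { return a + b; }
        }

    public class RmiSketch
        {
        public static void main (String[] args) throws Exception
            {
            // Server side: register the remote object.
            Registry reg = LocateRegistry.createRegistry (1099);
            reg.rebind ("adder", new AdderImpl());

            // Client side: lookup returns a remote reference, i.e. a
            // client-side proxy (stub) that implements Adder. The call below
            // is marshalled into a network message behind the scenes, and any
            // RMI-level failure would surface as a RemoteException.
            Adder a = (Adder) reg.lookup ("adder");
            System.out.println (a.add (2, 3));   // prints 5
            }
        }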


Distributed Hash Tables

  • Normal (non-distributed) hash table
    • Data item = (key, value)
    • All data items stored in a table
    • Data item's location in table computed from hash of key
    • Operations: Add data item, remove data item, query key

  • Distributed hash table (DHT)
    • Data item = (key, value)
    • Table of data items spread out across multiple nodes
    • Data item's location (node) computed from hash of key
    • Operations: Add data item, remove data item, query key -- requires routing from originating node to destination node
    • Additional operations: Add node, remove node

  • Chord
    • I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1):17-32, February 2003.
    • Chord is just a DHT algorithm, not a complete file sharing application

  • Keys and hashing
    • Each node has a key which is the hash of the node's identifier (e.g. IP address)
    • Each data item (file) has a key which is the hash of the file's identifier
    • Any hash function can be used
    • The Chord paper uses SHA-1 (for hashing, not for security)

  • The Chord ring
    • The N possible hash values are conceptually arranged in a circle; all key arithmetic is done modulo N
    • For SHA-1, which yields a 160-bit hash value, N = 2^160

  • Key storage
    • Each node "lives" at the location on the ring equal to the node's key (hash)
    • Each data item is stored in the node whose key is the smallest value greater than or equal to the data item key, modulo N
       

      [Figure: key placement on the Chord ring (Stoica et al., op. cit.)]

  • Key lookup
    • Each node has a finger table giving the nodes that store keys K+1, K+2, K+4, K+8, . . ., where K is the node's own key
    • To find a key, find the node in the finger table that is closest to the desired key without being greater than the desired key, and forward the query to that node
    • With O(log N) messages, the query will reach the node with the desired key
       

      [Figure: finger tables and key lookup on the Chord ring (Stoica et al., op. cit.)]

  • Other DHT algorithms:

  • CAN (Content Addressable Network)
    • S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM 2001), August 2001, pages 161-172.

  • Kademlia
    • P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. First International Workshop on Peer-to-Peer Systems (IPTPS 2002), LNCS 2429/2002, March 2002, pages 53-65.
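
  • A hedged sketch of Chord's key placement rule (each data key is stored at its successor node), on a toy ring of 2^6 positions instead of 2^160; the node positions below are made up. The real protocol locates the successor via finger tables in O(log N) messages rather than the direct lookup used here.

    import java.util.TreeSet;

    // Hedged sketch of Chord's placement rule on a tiny ring (mod 64).
    public class ChordPlacementSketch
        {
        static final int RING = 64;                       // 2^6 instead of 2^160
        static TreeSet<Integer> nodeKeys = new TreeSet<Integer>();

        // Successor: smallest node key >= k, wrapping around the ring.
        static int successor (int k)
            {
            Integer n = nodeKeys.ceiling (k % RING);
            return (n != null) ? n : nodeKeys.first();    // wrap past the top
            }

        public static void main (String[] args)
            {
            // Suppose four nodes hash to these ring positions.
            nodeKeys.add (8); nodeKeys.add (21); nodeKeys.add (42); nodeKeys.add (55);
            // A data item with key 30 is stored at node 42; key 60 wraps to node 8.
            System.out.println (successor (30));   // 42
            System.out.println (successor (60));   // 8
            }
        }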


BitTorrent

  • BitTorrent Protocol Specification. http://www.bittorrent.org/beps/bep_0003.html

  • Index of BitTorrent Enhancement Proposals. http://www.bittorrent.org/beps/bep_0000.html

  • The origin peer, who wants to make a file available:
    • Splits the file into pieces
    • Computes the SHA-1 hash of each piece
    • Sets up a seed file (.torrent file)
      • Name of the file
      • Length and hash of each piece
      • URL of the tracker
    • Publishes the seed file somewhere, e.g. a web site

  • The tracker is a central server
    • Keeps track of which peers have which pieces of the file

  • Each swarm peer, who wants to download the file:
    • Gets the seed file
    • Contacts the tracker
    • Tells the tracker to add itself to the swarm
    • Gets other peers in the swarm from the tracker
    • Downloads pieces of the file from peers
    • Notifies the tracker which pieces it has downloaded
    • Uploads those pieces to other peers

  • Trackerless BitTorrent
    • Torrent information is stored in a distributed hash table (DHT) instead of a central tracker
    • The Kademlia DHT is used
    • The DHT key is the SHA-1 hash of the piece of the file
    • BitTorrent DHT Protocol. http://www.bittorrent.org/beps/bep_0005.html

  • In theory, as time goes on, the peers download more and more from each other and less and less from the origin peer

  • Pros
    • Eliminates big server with enormous bandwidth, needed to download entire file to every user
    • Faster downloads due to multiple simultaneous connections with different peers

  • Cons
    • For the system to work well, peers must upload as well as download file pieces
    • Leecher peers that only download and never upload can degrade the system
    • To combat leechers, detect peers that don't upload enough and refuse to upload to them (tit for tat)
    • But this unfairly penalizes users with limited upload bandwidth (like home users)
    • There is little to no security against malicious peers
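
  • A hedged sketch of the origin peer's first two steps: split the file into pieces and compute each piece's SHA-1 hash, as recorded in the .torrent file. The 256 KB piece size is an illustrative choice, and this is not the code of any actual BitTorrent client.

    import java.io.FileInputStream;
    import java.security.MessageDigest;

    // Hedged sketch: hash each fixed-size piece of a file with SHA-1.
    // Usage: java PieceHashSketch somefile
    public class PieceHashSketch
        {
        public static void main (String[] args) throws Exception
            {
            MessageDigest sha1 = MessageDigest.getInstance ("SHA-1");
            byte[] piece = new byte [262144];              // 256 KB pieces (example)
            FileInputStream in = new FileInputStream (args[0]);
            int n, index = 0;
            while ((n = in.read (piece)) > 0)
                {
                sha1.update (piece, 0, n);
                byte[] hash = sha1.digest();               // 20 bytes, goes in the .torrent
                System.out.printf ("piece %d: %d bytes, hash %s%n",
                    index ++, n, toHex (hash));
                }
            in.close();
            }

        static String toHex (byte[] b)
            {
            StringBuilder s = new StringBuilder();
            for (byte x : b) s.append (String.format ("%02x", x));
            return s.toString();
            }
        }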


Map-Reduce Systems

Copyright © 2013 Alan Kaminsky. All rights reserved. Last updated 13-Feb-2013. Please send comments to ark@cs.rit.edu.