| Home Page |
Prerequisites -- The student must be familiar with:
Outcomes -- The student will:

Things to notice:
all : OuterProductCPU \
      OuterProductGPU

OuterProductCPU : OuterProductCPU.cu Util.cu Random.cu
	nvcc -o OuterProductCPU OuterProductCPU.cu

OuterProductGPU : OuterProductGPU.cu Util.cu Random.cu
	nvcc -arch compute_20 -code compute_20,sm_20 -o OuterProductGPU OuterProductGPU.cu

clean :
	rm -f OuterProductCPU
	rm -f OuterProductGPU
$ ./OuterProductGPU 142857 16384
CUDA device 0: Tesla C2050, compute capability 2.0
A[0] = 0.856121
A[16383] = 0.074512
B[0] = 0.927307
B[16383] = 0.063003
C[0][0] = 0.793887
C[0][16383] = 0.053938
C[16383][0] = 0.069096
C[16383][16383] = 0.004694
23 msec computation
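The sample output is consistent with each element being C[i][j] = A[i] * B[j]; for example, 0.856121 * 0.927307 is about 0.793887. A minimal CPU-side sketch of that computation follows. The function name, the use of double, and the flat row-major storage are assumptions for illustration, not taken from OuterProductCPU.cu.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the course sources): computes the N x N
// outer product C[i][j] = A[i] * B[j], stored row-major in a flat vector.
std::vector<double> outerProduct(const std::vector<double>& a,
                                 const std::vector<double>& b)
{
    assert(a.size() == b.size());  // both vectors have N elements
    const std::size_t n = a.size();
    std::vector<double> c(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] = a[i] * b[j];  // a GPU version could give each (i, j) its own thread
    return c;
}
```

Every element is independent of the others, which is why the computation maps naturally onto one GPU thread per element.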
       --CPU (msec)--   --GPU (msec)--
    N    Comp   Total     Comp   Total
 1024      26      32        0     156
 1024      25      32        1     192
 1024      25      32        0     244
 2048     101     112        0     284
 2048     102     112        0     284
 2048      74      88        1     284
 4096     250     268        1     352
 4096     229     244        1     284
 4096     209     224        1     224
 8192     711     748        5     664
 8192     734     768        5     668
 8192     737     772        5     564
16384    2625    2728       23    1752
16384    2549    2656       22    1784
16384    2600    2704       22    1776
Smallest time in each column over the three runs:
       --CPU (msec)--   --GPU (msec)--
    N    Comp   Total     Comp   Total
 1024      25      32        0     156
 2048      74      88        0     284
 4096     209     224        1     224
 8192     711     748        5     564
16384    2549    2656       22    1752
$ ssh guest@clairaut.cs.rit.edu
$ mkdir ark0114
$ cd ark0114
$ cp ../examples/* .
$ make
$ ./OuterProductCPU 142857 1024
$ ./OuterProductGPU 142857 1024
$ ./OuterProductCPU 142857 2048
$ ./OuterProductGPU 142857 2048
Initial state: Threads 0-7 have data that needs to be added together
Thread:     0     1     2     3     4     5     6     7
Data:     927   340    37   972   544   514   596   519
Step 1: Threads 0-3 add threads 4-7's data to their own data; synchronize
Thread:     0     1     2     3     4     5     6     7
Data:    1471   854   633  1491   544   514   596   519
Step 2: Threads 0-1 add threads 2-3's data to their own data; synchronize
Thread:     0     1     2     3     4     5     6     7
Data:    2104  2345   633  1491   544   514   596   519
Step 3: Thread 0 adds thread 1's data to its own data; synchronize
Thread:     0     1     2     3     4     5     6     7
Data:    4449  2345   633  1491   544   514   596   519
Final state: Thread 0's data is the sum of all the threads' original data
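The steps above can be sketched as a CPU-side simulation, in which a vector stands in for the block's shared array and the inner loop stands in for the active threads. The function name is hypothetical. In a real CUDA kernel each inner-loop iteration would be a separate thread, with a barrier such as `__syncthreads()` between steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the tree reduction traced above. 'data' holds one element per
// thread; its size must be a power of two. Returns the value that thread 0
// holds after the last step.
int treeReduce(std::vector<int> data)
{
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two thread count
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // Threads 0 .. stride-1 are active in this step; thread t adds the
        // data of thread t + stride to its own data.
        for (std::size_t t = 0; t < stride; ++t)
            data[t] += data[t + stride];
        // A GPU kernel would synchronize all threads here before the next step.
    }
    return data[0];
}
```

Calling `treeReduce({927, 340, 37, 972, 544, 514, 596, 519})` returns 4449 after three steps, the same value thread 0 ends with in the trace. With N threads the reduction takes log2(N) steps rather than N - 1 sequential additions.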
| Feature (by compute capability) | 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 |
|---|---|---|---|---|---|---|
| CUDA cores per multiprocessor | 8 | 8 | 8 | 8 | 32 | 48 |
| Single-precision floating point | Yes | Yes | Yes | Yes | Yes | Yes |
| Double-precision floating point | No | No | No | Yes | Yes | Yes |
| 32-bit integer atomic functions in global memory | No | Yes | Yes | Yes | Yes | Yes |
| 64-bit integer atomic functions in global memory | No | No | Yes | Yes | Yes | Yes |
| 32-bit integer atomic functions in shared memory | No | No | Yes | Yes | Yes | Yes |
| Single-precision floating point atomic addition in global and shared memory | No | No | No | No | Yes | Yes |
| Maximum threads per block | 512 | 512 | 512 | 512 | 1024 | 1024 |
| Maximum x- or y-dimension of a block | 512 | 512 | 512 | 512 | 1024 | 1024 |
| Maximum z-dimension of a block | 64 | 64 | 64 | 64 | 64 | 64 |
| Maximum x- or y-dimension of a grid | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 |
| Number of 32-bit registers per multiprocessor | 8 K | 8 K | 16 K | 16 K | 32 K | 32 K |
| Maximum amount of shared memory per multiprocessor | 16 KB | 16 KB | 16 KB | 16 KB | 48 KB | 48 KB |
$$\theta = \cos^{-1}\left(\frac{A \cdot B}{|A|\,|B|}\right)$$

$$A \cdot B = \sum_{i=0}^{N-1} A_i B_i$$

$$|A| = \left(\sum_{i=0}^{N-1} A_i^2\right)^{1/2}$$

$$|B| = \left(\sum_{i=0}^{N-1} B_i^2\right)^{1/2}$$
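A minimal CPU-side sketch of the formulas above. The function name and the use of double are assumptions, and the clamp before `acos` is an added safeguard against rounding error rather than part of the formulas. The three running sums are the parts a GPU version would compute with a parallel reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: angle in radians between vectors a and b.
double vectorAngle(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size() && !a.empty());
    double dot = 0.0;  // A . B
    double aa = 0.0;   // sum of A_i^2
    double bb = 0.0;   // sum of B_i^2
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        aa += a[i] * a[i];
        bb += b[i] * b[i];
    }
    assert(aa > 0.0 && bb > 0.0);  // the angle is undefined for a zero vector
    double c = dot / (std::sqrt(aa) * std::sqrt(bb));
    c = std::max(-1.0, std::min(1.0, c));  // keep rounding error out of acos
    return std::acos(c);
}
```

For example, `vectorAngle({1.0, 0.0}, {0.0, 1.0})` gives a right angle, pi/2 radians.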
Things to notice:
$ ./VectorAngleGPU 142857 1000000
NT = 1024, NBX = 977, NBY = 1, threads = 1000448
theta = 0.722877
154 msec initialization + computation
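The numbers in this output are consistent with rounding N up to a whole number of 1024-thread blocks: 977 blocks of 1024 threads give 1000448 threads for 1000000 elements. A small sketch of that arithmetic follows. The helper name is hypothetical, and how the program divides blocks between NBX and NBY for larger N is not shown here, so only the one-dimensional case is covered.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks of nt threads needed so that each of
// n elements gets its own thread (ceiling division).
long long blocksNeeded(long long n, long long nt)
{
    assert(n > 0 && nt > 0);
    return (n + nt - 1) / nt;
}
```

Because the thread count can exceed N (here by 448), a kernel launched this way normally has to check its index against N before touching the data.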
         N  CPU (msec)  GPU (msec)
      1000           1         152
      1000           1         147
      1000           1         147
     10000           4         153
     10000           4         146
     10000           4         143
    100000          33         152
    100000          33         147
    100000          33         147
   1000000         202         121
   1000000         199         148
   1000000         226         148
  10000000        1374         164
  10000000        1353         158
  10000000        1382         158
 100000000       12693         279
 100000000       12716         286
 100000000       12749         287
1000000000      127170        1258
1000000000      127027        1255
1000000000      126595        1255
Smallest time in each column over the three runs:
         N  CPU (msec)  GPU (msec)
      1000           1         147
     10000           4         143
    100000          33         147
   1000000         199         121
  10000000        1353         158
 100000000       12693         279
1000000000      126595        1255
$ ./VectorAngleCPU 142857 1000000
$ ./VectorAngleGPU 142857 1000000
$ ./VectorAngleCPU 142857 10000000
$ ./VectorAngleGPU 142857 10000000
$$|C| = \left(\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} C_{ij}^2\right)^{1/2}$$
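A minimal CPU-side sketch of this norm, assuming the N x N matrix is stored row-major in a flat vector of doubles. The function name and storage layout are illustrative choices. As with the vector norms, the double sum is a natural candidate for a parallel reduction on the GPU.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: |C| for an n x n matrix stored row-major in 'c'.
double matrixNorm(const std::vector<double>& c, std::size_t n)
{
    assert(c.size() == n * n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            sum += c[i * n + j] * c[i * n + j];  // accumulate C_ij^2
    return std::sqrt(sum);
}
```

When C is the outer product of A and B, the sum factors, so |C| equals |A| |B|; that identity gives a convenient check on the result.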
$$\mathrm{Prob}[\mathrm{popcount} = k] = \binom{n}{k}\, p^k (1 - p)^{n-k}$$

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$

$$\mathrm{Prob}[\mathrm{popcount} = k] = \binom{64}{k}\, 2^{-64}$$

$$E_k = \binom{64}{k}\, 2^{-64}\, T$$
$$\chi^2 = \sum_{k=0}^{64} \frac{(N_k - E_k)^2}{E_k}$$

where N_k is the actual count in bin k.
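A minimal CPU-side sketch of the expected counts and the test statistic. It assumes p = 1/2, 64-bit outputs, and the usual Pearson form, chi^2 = sum over all bins of (actual - expected)^2 / expected. The function names are hypothetical and not taken from PresentStatTest, and the p-value calculation is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Binomial coefficient C(n, k) as a double, built up multiplicatively to
// avoid overflowing integer factorials.
double binomial(int n, int k)
{
    assert(0 <= k && k <= n);
    double r = 1.0;
    for (int i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// Expected number of outputs with popcount k among 'trials' 64-bit outputs,
// assuming every bit is 1 with probability 1/2.
double expectedCount(int k, double trials)
{
    return binomial(64, k) * std::ldexp(1.0, -64) * trials;  // C(64,k) * 2^-64 * T
}

// Pearson chi-square statistic over all bins (assumed form).
double chiSquare(const std::vector<double>& actual,
                 const std::vector<double>& expected)
{
    assert(actual.size() == expected.size());
    double chi2 = 0.0;
    for (std::size_t k = 0; k < actual.size(); ++k) {
        const double d = actual[k] - expected[k];
        chi2 += d * d / expected[k];
    }
    return chi2;
}
```

For T = 40960000 encryptions, `expectedCount(32, T)` comes to about 4.06924e+06 and `expectedCount(0, T)` to about 2.22045e-12, in line with the Expected column of the 200 200 sample run.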
Things to notice:
$ ./PresentStatTest 0000000000000000 0123456789abcdef 0123 200 200
./PresentStatTest 0000000000000000 0123456789abcdef 0123 200 200
CUDA device 0: Tesla C2050, compute capability 2.0
Bin   Actual  Expected
  0        0  2.22045e-12
  1        0  1.42109e-10
  2        0  4.47642e-09
  3        0  9.25127e-08
  4        0  1.41082e-06
  5        0  1.69298e-05
  6        0  1.66477e-04
  7        0  1.37938e-03
  8        0  9.82806e-03
  9        0  6.11524e-02
 10        1  3.36338e-01
 11        1  1.65111e+00
 12        8  7.29242e+00
 13       24  2.91697e+01
 14      111  1.06261e+02
 15      349  3.54203e+02
 16     1110  1.08475e+03
 17     2994  3.06282e+03
 18     7966  7.99736e+03
 19    19319  1.93620e+04
 20    44000  4.35645e+04
 21    91539  9.12781e+04
 22   178819  1.78407e+05
 23   326411  3.25787e+05
 24   556210  5.56553e+05
 25   892012  8.90485e+05
 26  1334546  1.33573e+06
 27  1876935  1.87991e+06
 28  2486335  2.48417e+06
 29  3083521  3.08380e+06
 30  3597558  3.59776e+06
 31  3945805  3.94593e+06
 32  4071770  4.06924e+06
 33  3947480  3.94593e+06
 34  3594596  3.59776e+06
 35  3085671  3.08380e+06
 36  2483511  2.48417e+06
 37  1876226  1.87991e+06
 38  1336657  1.33573e+06
 39   890746  8.90485e+05
 40   556169  5.56553e+05
 41   325881  3.25787e+05
 42   178601  1.78407e+05
 43    91425  9.12781e+04
 44    43422  4.35645e+04
 45    19604  1.93620e+04
 46     8024  7.99736e+03
 47     3102  3.06282e+03
 48     1061  1.08475e+03
 49      328  3.54203e+02
 50      112  1.06261e+02
 51       31  2.91697e+01
 52        7  7.29242e+00
 53        1  1.65111e+00
 54        0  3.36338e-01
 55        1  6.11524e-02
 56        0  9.82806e-03
 57        0  1.37938e-03
 58        0  1.66477e-04
 59        0  1.69298e-05
 60        0  1.41082e-06
 61        0  9.25127e-08
 62        0  4.47642e-09
 63        0  1.42109e-10
 64        0  2.22045e-12
chi^2 = 59.985633
p-value = 0.619148
4893 msec computation
 NBX   NBY  Encryptions  Time (msec)  Encr/sec  Bits/sec
 100   100     10240000         1223    8.37e6    536.e6
 200   200     40960000         4893    8.37e6    536.e6
 400   400    163840000        19578    8.37e6    536.e6
 800   800    655360000        78290    8.37e6    536.e6
1600  1600   2621440000       313201    8.37e6    536.e6
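The columns of this table are related by simple arithmetic, sketched below. The sketch assumes 1024 threads per block with one encryption per thread (which reproduces the Encryptions column) and 64 bits per encryption (the PRESENT block size). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Total encryptions for an nbx-by-nby grid of blocks, assuming each of the
// threadsPerBlock threads in a block performs one encryption.
long long encryptions(long long nbx, long long nby, long long threadsPerBlock = 1024)
{
    assert(nbx > 0 && nby > 0 && threadsPerBlock > 0);
    return nbx * nby * threadsPerBlock;
}

// Rate per second given a count and an elapsed time in milliseconds.
double perSecond(double count, double msec)
{
    assert(msec > 0.0);
    return count / (msec / 1000.0);
}
```

For the first row, `encryptions(100, 100)` is 10240000, and `perSecond(10240000, 1223)` is about 8.37e6 encryptions per second, or about 536e6 bits per second after multiplying by 64. The rate stays the same in every row, so the running time grows in direct proportion to the number of encryptions.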
$ ./PresentStatTest 0000000000000000 0123456789abcdef 0123 10 10
$ ./PresentStatTest 0000000000000000 0123456789abcdef 0123 20 20