CUDA Parallel Programming Tutorial

Richard Membarth
richard.membarth@cs.fau.de
Hardware-Software-Co-Design University of Erlangen-Nuremberg
Friedrich-Alexander University of Erlangen-Nuremberg Richard Membarth
19.03.2009

Outline
◮ Tasks for CUDA
◮ CUDA programming model
◮ Getting started
◮ Example codes

Tasks for CUDA
◮ Provide ability to run code on GPU
◮ Manage resources
◮ Partition data to fit on cores
◮ Schedule blocks to cores

Data Partitioning
◮ Partition data into smaller blocks, each of which can be processed by one core
◮ Up to 512 threads in one block
◮ Together, all blocks form the grid
◮ All blocks execute the same program (kernel)
◮ Blocks are independent of each other
◮ Only ONE kernel at a time
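
The partitioning above can be sketched as a one-dimensional launch. This is a minimal sketch, not code from the slides: the kernel name `scale`, the array size of 1000, and the choice of 256 threads per block are assumptions made for illustration. The grid size is rounded up so that the last, partially filled block still covers the tail of the array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread doubles one array element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) data[i] *= 2.0f;                      // guard the partial last block
}

int main() {
    const int n = 1000;
    const int threadsPerBlock = 256;  // assumed; stays below the 512-thread limit
    // Round up so every element belongs to some block: ceil(1000 / 256) = 4.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // All blocks of the grid run the same kernel, independently of each other.
    scale<<<blocks, threadsPerBlock>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("blocks = %d, host[10] = %.1f\n", blocks, host[10]);
    return 0;
}
```

Error checking of the runtime calls is left out to keep the sketch short; real code would test each returned `cudaError_t`.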

Memory Hierarchy
Memory types (fastest memory first):
◮ Registers
◮ Shared memory
◮ Device memory (texture, constant, local, global)

Tesla Architecture
◮ 30 cores, 240 ALUs (1 mul-add each)
◮ Peak performance with 1 mul-add + 1 mul per ALU and cycle: 240 * (2+1) * 1.3 GHz = 936 GFLOPS
◮ 4.0 GB GDDR3, 102 GB/s memory bandwidth, 4 GB/s PCIe bandwidth to CPU
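
The peak figure can be retraced step by step. The sketch below assumes, as the slide's factor (2+1) suggests, that a multiply-add counts as two floating-point operations and the additional multiply as one; the mul-add-only figure and the bandwidth ratio are derived here from the slide's numbers and are not stated on the slide.

```latex
\begin{align*}
\text{ALUs per core} &= \frac{240}{30} = 8 \\
\text{Peak (mul-add + mul)} &= 240 \times (2 + 1) \times 1.3\,\text{GHz} = 936\ \text{GFLOPS} \\
\text{Peak (mul-add only)} &= 240 \times 2 \times 1.3\,\text{GHz} = 624\ \text{GFLOPS} \\
\frac{\text{device memory bandwidth}}{\text{PCIe bandwidth}} &= \frac{102\ \text{GB/s}}{4\ \text{GB/s}} = 25.5
\end{align*}
```

The last line shows why data should stay on the device between kernels where possible: by these numbers, a transfer to or from the CPU is roughly 25 times slower than an access to device memory.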

CUDA: Extended C
◮ Function qualifiers
◮ Variable qualifiers
◮ Built-in keywords
◮ Intrinsics
◮ Function calls

Function Qualifiers
◮ Functions: __device__, __host__, __global__
    __global__ void filter(int *in, int *out)
    {
        ...
    }
◮ Default: __host__
◮ No function pointers
◮ No recursion
◮ No static variables
◮ No variable number of arguments
◮ No return value (__global__ functions must return void)
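
A minimal sketch of how the three qualifiers fit together. Only the kernel name `filter` and its `in`/`out` parameters come from the slide; the kernel body, the extra length parameter `n`, and the helpers `clampToByte` and `brighten` are hypothetical and chosen purely for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: runs on the GPU and is callable only from device code.
__device__ int clampToByte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// __host__ __device__: compiled twice, once for the CPU and once for the GPU.
__host__ __device__ int brighten(int v) {
    return v + 50;
}

// __global__: a kernel; launched from the host, runs on the GPU, returns void.
__global__ void filter(int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = clampToByte(brighten(in[i]));
}

// No qualifier: an ordinary host function, equivalent to __host__.
int main() {
    const int n = 8;
    int hostIn[n] = {0, 40, 80, 120, 160, 200, 240, 255};
    int hostOut[n];

    int *devIn = NULL;
    int *devOut = NULL;
    cudaMalloc((void **)&devIn, n * sizeof(int));
    cudaMalloc((void **)&devOut, n * sizeof(int));
    cudaMemcpy(devIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);

    filter<<<1, n>>>(devIn, devOut, n);  // one block of n threads

    cudaMemcpy(hostOut, devOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);

    for (int i = 0; i < n; ++i) {
        // brighten() can also be called here, on the host.
        printf("%d -> %d (unclamped on host: %d)\n",
               hostIn[i], hostOut[i], brighten(hostIn[i]));
    }
    return 0;
}
```

Calling `clampToByte` from `main` would be rejected by the compiler, since a `__device__` function has no host version.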

Variable Qualifiers
◮ Variables: __device__, __constant__, __shared__
    __constant__ float matrix[10] = {1.0f, ...};
    __shared__ int ...[32][2];
◮ Default: Variables reside in registers
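
The following sketch puts the three storage classes side by side: `__constant__` for read-only coefficients, `__shared__` for a per-block staging buffer, and unqualified locals, which normally end up in registers. The three-tap smoothing filter, the names `weights`, `tile` and `smooth`, and the block size of 32 are assumptions for illustration, not taken from the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 32  // threads per block (assumed; the kernel relies on this value)

// __constant__: read-only for kernels, initialized at file scope.
__constant__ float weights[3] = {0.25f, 0.5f, 0.25f};

// Hypothetical kernel: out[i] = weighted sum of in[i-1], in[i], in[i+1].
__global__ void smooth(const float *in, float *out, int n) {
    // __shared__: one copy per block, with one extra cell on each side.
    __shared__ float tile[BLOCK + 2];

    // Unqualified local variables normally reside in registers.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;  // position inside the tile

    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)                    // left neighbour of the block
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)       // right neighbour of the block
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();  // the tile must be complete before anyone reads it

    if (i < n)
        out[i] = weights[0] * tile[t - 1] + weights[1] * tile[t]
               + weights[2] * tile[t + 1];
}

int main() {
    const int n = 64;
    float hostIn[n], hostOut[n];
    for (int i = 0; i < n; ++i) hostIn[i] = 1.0f;

    float *devIn = NULL;
    float *devOut = NULL;
    cudaMalloc((void **)&devIn, n * sizeof(float));
    cudaMalloc((void **)&devOut, n * sizeof(float));
    cudaMemcpy(devIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(devIn, devOut, n);

    cudaMemcpy(hostOut, devOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);

    printf("edge: %.2f, interior: %.2f\n", hostOut[0], hostOut[1]);
    return 0;
}
```

Staging the inputs in shared memory means each element is read from global memory about once per block instead of three times, which is the usual motivation for `__shared__`.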

Built-In Variables
◮ Available inside of kernel code
◮ Thread index within current block: threadIdx.x, threadIdx.y, threadIdx.z
◮ Block index within grid: blockIdx.x, blockIdx.y
◮ Dimension of grid: gridDim.x, gridDim.y
◮ Dimension of block: blockDim.x, blockDim.y, blockDim.z
◮ Warp size: warpSize
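
A common use of these variables is to turn the block and thread indices into a global 2D coordinate. The sketch below assumes a hypothetical 40 x 30 array and 16 x 16 thread blocks; the kernel name `fillIndex` is made up for illustration. Each thread writes its own linear index, which the host then checks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores the linear index of its own cell.
__global__ void fillIndex(int *out, int width, int height) {
    // Global coordinate = block offset + position inside the block.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // gridDim * blockDim threads are launched per axis, which may exceed
    // the array size, so out-of-range threads do nothing.
    if (x < width && y < height) out[y * width + x] = y * width + x;
}

int main() {
    const int width = 40;
    const int height = 30;
    int host[width * height];

    int *dev = NULL;
    cudaMalloc((void **)&dev, width * height * sizeof(int));

    dim3 block(16, 16);                         // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,  // 3 blocks in x
              (height + block.y - 1) / block.y);  // 2 blocks in y
    fillIndex<<<grid, block>>>(dev, width, height);

    cudaMemcpy(host, dev, width * height * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    int mismatches = 0;
    for (int i = 0; i < width * height; ++i)
        if (host[i] != i) ++mismatches;
    printf("mismatches: %d\n", mismatches);
    return 0;
}
```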

Intrinsics
◮ void __syncthreads();
◮ Synchronizes all threads of the current block
◮ Use in conditional code may lead to deadlocks
◮ Intrinsics exist for most mathematical functions, e.g. __sinf(x), __cosf(x), __expf(x), ...
◮ Texture functions
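
A per-block sum is a typical case where `__syncthreads()` is needed. The sketch below is an illustration under assumptions, not code from the slides: the kernel name `blockSum`, the block size of 256 (which must be a power of two for this halving scheme), and the input of 1000 ones are all made up. Note that the barrier sits outside the `if`, so every thread of the block reaches it; placing it inside the branch is the kind of conditional use the slide warns about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed; must be a power of two here)

// Hypothetical kernel: each block writes the sum of its elements to out[blockIdx.x].
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int partial[BLOCK];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // all loads must finish before the first addition

    // Halve the number of active threads in every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();  // outside the if: reached by every thread of the block
    }

    if (tid == 0) out[blockIdx.x] = partial[0];
}

int main() {
    const int n = 1000;
    const int blocks = (n + BLOCK - 1) / BLOCK;
    int hostIn[n];
    int hostOut[blocks];
    for (int i = 0; i < n; ++i) hostIn[i] = 1;

    int *devIn = NULL;
    int *devOut = NULL;
    cudaMalloc((void **)&devIn, n * sizeof(int));
    cudaMalloc((void **)&devOut, blocks * sizeof(int));
    cudaMemcpy(devIn, hostIn, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(devIn, devOut, n);

    cudaMemcpy(hostOut, devOut, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);

    int total = 0;  // the per-block results are combined on the host
    for (int b = 0; b < blocks; ++b) total += hostOut[b];
    printf("total = %d\n", total);
    return 0;
}
```

Because blocks are independent and `__syncthreads()` only covers one block, the final combination across blocks happens on the host in this sketch.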