NVIDIA Architecture
Produced by Tae Young Lee
Tesla (GPU architecture released by NVIDIA in 2008)
A Tesla GPU is built from a set of SMs (Streaming Multiprocessors).
In Tesla, each SM consists of 8 SPs (Stream Processors), 2 SFUs (Special Function Units), shared memory, and so on.
An SP (core) is usually called a CUDA core; the number of SMs and SPs differs from one GPU generation to the next.
Because the SP (Stream Processor) plays the role of a core, it performs logic and arithmetic operations, including MAD (multiply-add), much like the ALU at the heart of a CPU core.
The SFU (Special Function Unit) is used for transcendental functions, pixel attribute interpolation, and similar operations, and it also contains 4 floating-point multipliers.
https://89douner.tistory.com/159?category=913897
A transcendental function is a function that cannot be defined as the root of a polynomial equation.
Floating-point arithmetic
First, consider the fixed-point representation.
If the CPU uses a 32-bit instruction set, the 32 bits of a number can be divided into a sign (+/-), an integer part, and a fractional part. The problem is that the range of values the integer and fractional parts can express is limited.
The floating-point representation splits a number into a mantissa, which holds the significant digits, and an exponent, which gives the position of the radix point.
In floating point, the exponent field expresses positive and negative exponents relative to a reference value (the bias).
Representing 13.5 as a 32-bit floating-point number (float: 32 bits)
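The 13.5 example can be checked with a short script. This is a minimal sketch that assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 mantissa bits); the helper name float32_fields is made up for illustration.

```python
import struct

def float32_fields(x: float) -> tuple:
    """Split x, stored as an IEEE 754 single, into (sign, biased exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits, the leading 1 is implicit
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(13.5)
# 13.5 = 1101.1 (binary) = 1.1011 x 2^3, so the stored exponent is 3 + 127.
print(sign, exponent - 127, format(mantissa, "023b"))
```

The mantissa field holds only the digits after the leading 1, which is why it starts with 1011.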
FLOPS (FLoating-point Operations Per Second)
FLOPS is a very important metric for expressing a computer's performance.
As the name says, it is the number of floating-point operations that can be computed per second.
In deep learning, most computations are carried out in floating point (real numbers; the float data type).
For that reason, FLOPS is an important metric for estimating run time in scientific workloads built on real-number arithmetic, such as deep learning.
A GPU with higher FLOPS trains a model or runs inference faster than a GPU with lower FLOPS.
For example, VGG19 takes about 40 G-Ops per forward pass, so it calls for a GPU rated well above 40 GFLOPS.
1,000,000,000 FLOPS = 1 GFLOPS (giga FLOPS)
1,000 GFLOPS = 1 TFLOPS (tera FLOPS)
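The unit conversions above can be turned into a rough run-time estimate. The sketch below is only a lower bound under the assumption that the GPU sustains its peak rating; the 10 TFLOPS GPU is hypothetical, and the 40 G-Ops value is the VGG19 figure quoted above.

```python
GIGA = 1_000_000_000   # 1 GFLOPS = 1,000,000,000 FLOPS
TERA = 1_000 * GIGA    # 1 TFLOPS = 1000 GFLOPS

def min_seconds(total_ops: float, peak_flops: float) -> float:
    """Lower bound on run time: amount of work divided by peak throughput."""
    return total_ops / peak_flops

vgg19_ops = 40 * GIGA   # the ~40 G-Ops figure quoted for VGG19
gpu_peak = 10 * TERA    # hypothetical GPU rated at 10 TFLOPS
print(min_seconds(vgg19_ops, gpu_peak))
```

Real workloads run slower than this bound because memory bandwidth and utilization also limit throughput.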
SM (Streaming Multiprocessor)
If all 8 SPs and both SFUs are busy, an SM can perform up to 16 (= 8 + 4*2) floating-point multiplications per clock cycle.
Shared memory is where threads running within the same SM exchange data.
In Tesla, shared memory has a capacity of 16 KB.
SIMT (Single Instruction, Multiple Threads)
The SIMT model was devised when GPUs moved to (GP)GPU computing and began supporting CUDA.
On CPUs, the usual term is SIMD (Single Instruction, Multiple Data).
SIMD means processing several data elements with a single instruction in order to get the most out of the CPU.
With CUDA, it became necessary to drive many threads with a single instruction, which is why SIMT was devised.
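The SIMT idea, one instruction driving many threads that each hold private data, can be imitated in plain Python. This is a toy model rather than how the hardware is built: the warp width of 32 and the masked two-pass handling of a branch are assumptions added for illustration, and simt_step is a made-up helper.

```python
WARP_SIZE = 32  # assumed number of threads that share one instruction stream

def simt_step(instruction, registers, active_mask):
    """Apply one instruction to the private register of every active thread."""
    return [instruction(r) if active else r
            for r, active in zip(registers, active_mask)]

# Each thread starts with its own thread index as private data.
regs = list(range(WARP_SIZE))

# The branch "if tid is even: r *= 2, else: r += 100" runs as two masked passes.
even = [tid % 2 == 0 for tid in range(WARP_SIZE)]
odd = [not e for e in even]
regs = simt_step(lambda r: r * 2, regs, even)
regs = simt_step(lambda r: r + 100, regs, odd)
print(regs[:4])
```

Unlike SIMD, the programmer writes the code for a single thread; the grouping into lanes and the masking happen underneath.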
Fermi (NVIDIA GPU architecture released in 2010)
The 16 KB of shared memory provided per SM in Tesla grew to 64 KB.
Load/store instructions, which previously executed with help from a texture unit outside the SM, can now execute within the SM itself because Load/Store (LD&ST) units were added to it.
Each SM contains 32 SPs, four times as many as in Tesla.
Tesla's SPs support 32-bit floating point.
In Fermi, two CUDA cores that each support 32-bit floating point can be used together, which makes 64-bit floating-point operations possible.
Kepler (2012)
In Fermi, execution units such as the CUDA cores, LD&ST units, and SFUs ran at twice the speed of the other units; in Kepler, all units run at the same speed. (This design was reportedly chosen for performance-per-watt reasons.)
Starting with Kepler, the SM is renamed SMX.
Because the CUDA cores were slowed down to match the rest of the chip, more CUDA cores, LD&ST units, and SFUs were added to maintain the previous throughput.
A Kepler SMX consists of 192 CUDA cores, 64 DP (64-bit double-precision) units, 32 LD&ST units, and 32 SFUs.
With HPC (High Performance Computing) in mind, Kepler provides dedicated DP units for 64-bit floating point, so 32-bit and 64-bit floating-point operations can reportedly execute at the same time.
To manage the larger number of cores, the number of warp schedulers grew to 4, and the number of dispatch units per warp scheduler grew from 1 to 2.
As a result, an SMX can process up to 8 instructions at once.
The register file also grew fourfold to 128 KB, and the L1 cache grew to 128 KB.
The number of registers a single thread can use rose from 63 in Fermi to 255. Together with the additional dispatch units, this change can be seen as targeting performance in HPC applications (e.g. scientific computing) rather than graphics.
Maxwell (2014)
Because the process node stayed at 28 nm during the move from Kepler to Maxwell, NVIDIA judged that a radical change was out of reach. It instead optimized the existing Kepler design, with an eye to releasing a mobile version of the GPU architecture.
Pascal (2016)
NVIDIA introduced Pascal as a GPU architecture specialized for artificial intelligence.
From Pascal onward, products ship in two versions: one for HPC (High Performance Computing) (the GP100 GPU) and one for graphics (the GP104 GPU).
For HPC and deep learning, the architecture supports 64-bit and 16-bit floating point (FP64/FP16) and lets a single thread use many registers. For graphics, it mainly uses 32-bit (FP32), and because the programs are simpler, the register count was not increased.
The defining feature of the deep-learning-oriented Pascal design is support for 16-bit floating-point (FP16) operations.
Deep learning works with values such as weights, biases, and learning rates, which do not demand the extreme precision of other scientific fields, so computation was moved from 32-bit to 16-bit processing.
Switching to 16-bit floating point reduces the pressure on memory capacity and bandwidth compared with 32-bit processing.
In addition, a GP100 Pascal CUDA core can process two FP16 instructions per cycle, so FP16 throughput is twice FP32 throughput.
So if deep learning runs fine on GPU B, which has less memory than GPU A, but fails on GPU A, check for differences in their floating-point (FP) support.
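The memory saving and the precision cost of FP16 can both be seen with Python's struct module, whose "e" format is IEEE 754 half precision. This is a small sketch; the parameter count is a hypothetical model size, and roundtrip is a made-up helper.

```python
import struct

def roundtrip(value: float, fmt: str) -> float:
    """Store value in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

n_params = 1_000_000  # hypothetical number of model parameters
for name, fmt in (("FP32", "<f"), ("FP16", "<e")):
    size_mb = n_params * struct.calcsize(fmt) / 1e6
    # Show the storage cost and what a small value such as 0.001 becomes.
    print(name, size_mb, "MB", roundtrip(0.001, fmt))
```

FP16 halves the storage per value, and the stored value drifts further from 0.001 than it does in FP32.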
Volta (2018) V100
GV100, which adopts the Volta architecture, is built on TSMC's 12 nm process.
What used to be called the CUDA core is renamed the FP32 core.
Just as earlier GPUs moved from FP64 toward FP16, Volta adds INT (integer) cores that work in 8-bit units, which accelerates inference.
The points to note in Volta are INT32 operations and Tensor Cores, which greatly speed up real training and inference.
In deep learning, most computation takes the form "D = A*B + C" (A: input, B: weight, C: bias, D: output).
Tensor Cores were designed to perform this as a mixed-precision operation: A and B are computed in FP16 (16-bit floating point, half precision) for speed, while C and D are handled in FP32 (32-bit floating point, single precision) for better accuracy.
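The effect of keeping C and D in FP32 while A and B are FP16 can be imitated on the CPU. The sketch below reduces D = A*B + C to a single dot product and emulates the rounding with struct; it illustrates the numerics only, not how a Tensor Core is implemented, and all function names are made up.

```python
import struct

def to_fp16(x):
    """Round x to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round x to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mixed_precision_dot(a, b, c):
    """d = sum(a_i * b_i) + c with FP16 inputs and an FP32 accumulator."""
    acc = to_fp32(c)
    for a_i, b_i in zip(a, b):
        product = to_fp16(a_i) * to_fp16(b_i)   # inputs rounded to half precision
        acc = to_fp32(acc + product)            # running sum kept in single precision
    return acc

def fp16_only_dot(a, b, c):
    """Same computation with the accumulator also rounded to FP16."""
    acc = to_fp16(c)
    for a_i, b_i in zip(a, b):
        acc = to_fp16(acc + to_fp16(to_fp16(a_i) * to_fp16(b_i)))
    return acc

a = [0.01] * 4096
b = [1.0] * 4096
print(mixed_precision_dot(a, b, 0.0), fp16_only_dot(a, b, 0.0))
```

The exact answer is 40.96. The FP32 accumulator stays close to it, while the FP16 accumulator stalls once the running sum becomes too large for small addends to register.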
V100 GPU
The V100 has 8 Tensor Cores per SM, each performing 64 multiply-add operations per cycle, so one SM performs 64*8*2 = 1024 floating-point operations per cycle (counting the multiply and the add separately).
The V100 has 80 SMs, so 80*1024 operations are performed per cycle.
The Volta-based V100, which introduces mixed precision, delivers 9 to 10 times the performance of the Pascal-based P100.
cuDNN is sometimes updated to match newly released GPU models; pairing Volta, the latest GPU at the time, with the then-latest cuDNN software gives even better performance.
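The per-cycle arithmetic above can be carried one step further to a peak throughput figure. In this sketch the counts come from the text, while the clock frequency is an assumed value that the text does not state.

```python
FMA_PER_TENSOR_CORE = 64   # multiply-adds per Tensor Core per cycle
TENSOR_CORES_PER_SM = 8
FLOPS_PER_FMA = 2          # one multiply plus one add
SM_COUNT = 80

ops_per_sm = FMA_PER_TENSOR_CORE * TENSOR_CORES_PER_SM * FLOPS_PER_FMA
ops_per_gpu = ops_per_sm * SM_COUNT

clock_hz = 1.53e9          # assumed boost clock, not given in the text above
peak_tflops = ops_per_gpu * clock_hz / 1e12
print(ops_per_sm, ops_per_gpu, round(peak_tflops, 1))
```

Under that clock assumption the result lands around 125 TFLOPS of peak Tensor Core throughput.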
Turing (2018.02)
The Turing-based GeForce RTX 20 series was released.
Turing is regarded as a minor derivative that keeps the overall framework of Volta at a reduced scale.
Its notable features are RT Core support and 4-bit integer (INT4) operations.
RT Cores are a technology for ray tracing
GPUs keep evolving for HPC (High Performance Computing), but graphics cannot be neglected either, so RT (ray tracing) technology was incorporated.
1) Ray Tracing (RT)
Ray tracing renders a 3D graphics scene by taking into account all the interactions of light: reflection off objects, shadows cast by objects, and refraction.
The key feature is that RT Cores make this possible in real time.
Ampere (2020.03) A100
In March 2020, the NVIDIA GPU architecture Ampere was introduced as the successor to Volta.
Ampere is seen as a GPU designed squarely for deep learning.
Ampere uses TSMC's 7 nm process.
The point to note in Ampere is the introduction of TensorFloat-32 (TF32), a new precision optimized for deep learning.
TensorFlow has a tf.float32 data type; on Ampere, floating-point math on such tensors gets hardware support through TF32. (TF32 is a Tensor Core math format and is not the same thing as tf.float32, which is ordinary FP32.)
This keeps computation precise while keeping it fast.
TF32 works like FP32 and is said to accelerate deep learning models by up to 20x without code changes.
(For reference, the TF32 format is about 6 times faster than FP32.)
There is also support for mixed-precision computation, which is said to achieve both speed and accuracy by suitably mixing single-precision FP32 with half-precision FP16.
For instance, using FP16 lowers a model's memory requirements, so larger models can be loaded onto the GPU and larger mini-batch sizes become possible.
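TF32 keeps the sign bit and 8-bit exponent of FP32 but only 10 of its 23 mantissa bits, which is what lets it cover the FP32 range at roughly FP16 precision. The sketch below emulates that by truncating an FP32 bit pattern; truncation is a simplification chosen here, and to_tf32 is a made-up helper, not a library call.

```python
import struct

def to_tf32(x: float) -> float:
    """Keep FP32's sign and 8-bit exponent but only the top 10 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFFE000          # clear the low 13 of the 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# 13.5 needs few mantissa bits and survives; pi loses digits; 1e30 stays in
# range because the exponent field is as wide as FP32's.
print(to_tf32(13.5), to_tf32(3.14159265), to_tf32(1e30))
```

Values that fit in 10 mantissa bits pass through unchanged, so only the fine detail of each input is lost.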
Using Mixed Precision on A100
Sparse connection
Put simply, sparse connection sets unnecessary parameters in a deep learning model to 0, either to speed up computation or to reduce dimensionality and avoid overfitting.
The Ampere (A100) architecture provides support for such sparse models.
The A100 (Ampere) Tensor Cores deliver up to 2x higher performance on sparse models; this not only shortens inference time but can also be used to improve training performance.
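Setting unneeded parameters to 0 can be sketched as magnitude pruning. The text above does not name a sparsity pattern; the example assumes the 2-out-of-4 structured pattern that NVIDIA documents for A100 sparse Tensor Cores, and prune_2_of_4 is a made-up helper.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four weights."""
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group.
        keep = sorted(range(len(group)),
                      key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]))
```

Half of the values in each group become zero, which is the regular structure that allows the hardware to skip them.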
Multi-Instance with Kubernetes
An Ampere GPU can be partitioned into up to 7 sub-GPUs.
For example, an Ampere GPU with 40 GB of VRAM can be split into 2 sub-GPUs with 20 GB of VRAM each.
At most, 7 sub-GPUs of 5 GB each can be created, and the partitioned GPUs can later be merged again.
As a use case, 7 sub-GPUs can serve low-throughput inference during the day, and the device can be merged back into a single full GPU instance for model training at night after people leave work.
Because the sub-GPUs are independent of one another, different programs can run on them and CUDA executes each within its own instance.
MIG (Multi-Instance GPU) technology is especially useful to DevOps users working with containers or Kubernetes.
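The partitioning limits described for MIG (up to 7 instances, 40 GB in total) can be expressed as a simple capacity check. This is a simplified sketch: the profile names are assumed from NVIDIA's naming for the 40 GB A100, the fits helper is hypothetical, and the placement rules enforced by the real driver are ignored.

```python
# (compute slices, memory in GB) per instance profile; names are assumed
# from NVIDIA's MIG naming for the 40 GB A100.
PROFILES = {
    "1g.5gb": (1, 5),
    "2g.10gb": (2, 10),
    "3g.20gb": (3, 20),
    "7g.40gb": (7, 40),
}

def fits(requested, compute_slices=7, memory_gb=40):
    """Check that the requested instances stay within the GPU's totals."""
    used_compute = sum(PROFILES[name][0] for name in requested)
    used_memory = sum(PROFILES[name][1] for name in requested)
    return used_compute <= compute_slices and used_memory <= memory_gb

# Seven 5 GB instances fit, two 20 GB instances fit, and the full GPU plus
# one extra instance does not.
print(fits(["1g.5gb"] * 7), fits(["3g.20gb"] * 2), fits(["7g.40gb", "1g.5gb"]))
```

The daytime and nighttime layouts from the use case correspond to the first list and to a single "7g.40gb" instance.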
