{{Uncategorized|date=2023年5月}}
{{更新|date=2023-12}}
{{簡易区別|[[TensorFlow]]、[[Google Tensor]]、[[テンソル・プロセッシング・ユニット]]}}
'''Tensor Cores''' (Japanese: Tensorコア) are mixed-precision matrix multiply-accumulate [[アクセラレータ|accelerators]] developed by [[NVIDIA]].<ref name=":13">"Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy." NVIDIA. ''[https://www.nvidia.com/en-us/data-center/tensor-cores/ NVIDIA Tensor Cores]''. Retrieved 2023-05-16.</ref> They first shipped in [[2017年|2017]] in the [[データセンター|data center]] [[Graphics Processing Unit|GPU]] "[[NVIDIA Tesla|Tesla]] V100" (Volta generation). In the [[NVIDIA GeForce|GeForce]] and [[Quadro]] lines they first appeared in the RTX series (Turing generation).

== Overview ==
The [[Graphics Processing Unit|GPU]] is a processing unit originally dedicated to graphics, with high parallel computing capability. That parallelism attracted attention, and GPUs are now used as parallel computers in many fields such as [[高性能計算|HPC]] ([[GPGPU]]). As applications broadened, it became clear that some workloads, notably [[深層学習|deep learning]], demand high throughput more than high numerical precision. '''Tensor Cores''' are the matrix multiply-accumulate accelerators that [[NVIDIA]], the maker of the [[NVIDIA GeForce]] series and various other [[Graphics Processing Unit|GPUs]], developed to meet this demand.<ref name=":13" /><ref>"Tensor Cores are specialized high-performance compute cores for matrix math operations that provide groundbreaking performance for AI and HPC applications." NVIDIA. ''[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf NVIDIA A100 Tensor Core GPU Architecture]''. Retrieved 2023-05-16.</ref>

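Tensor Cores are specialized for mixed-precision [[積和演算#融合積和演算|fused multiply-add]] (FMA) on matrices: a single instruction performs a low-precision matrix product and adds it to a higher-precision accumulator. For example, when computing the multiply-add <math>D = A + BC</math> for FP32 matrices <math>A, B, C</math>, the product <math>BC</math> is computed from FP16 inputs and accumulated into the FP32 matrix <math>A</math>.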
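
The mixed-precision multiply-add described above can be emulated in software. The following is a minimal sketch, assuming that rounding through Python's <code>struct</code> half-precision and single-precision formats is an acceptable stand-in for the hardware data paths; the function names <code>to_fp16</code>, <code>to_fp32</code> and <code>mixed_precision_fma</code> are hypothetical, and the accumulation order and rounding of a real Tensor Core may differ.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mixed_precision_fma(a, b, c):
    """Emulate D = A + B @ C with FP16 multiply inputs and an FP32 accumulator.

    a is n x m, b is n x k, c is k x m (lists of lists of floats).
    This is an illustration, not a model of any specific hardware.
    """
    n, k, m = len(b), len(c), len(c[0])
    d = [[to_fp32(a[i][j]) for j in range(m)] for i in range(n)]
    for i in range(n):
        for j in range(m):
            acc = d[i][j]
            for t in range(k):
                # Inputs are rounded to FP16; the product of two FP16
                # values fits in FP32 without further rounding.
                prod = to_fp16(b[i][t]) * to_fp16(c[t][j])
                # The running sum is kept in FP32.
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d

A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[0.5, 0.25], [0.125, 1.0]]
print(mixed_precision_fma(A, B, C))  # [[1.75, 2.25], [2.0, 5.75]]
print(to_fp16(0.1))                  # 0.0999755859375: inputs lose precision
```

All inputs in this example are exactly representable in FP16, so the result matches the exact value; a value such as 0.1 is rounded before it is multiplied.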

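Because a dedicated circuit (the Tensor Core) performs this operation, a single instruction can process a large number of operations. The low-precision matrix product keeps the computational cost small, while the addition is carried out at high precision as part of the FMA, so the accumulation step itself adds little further error. For example, on the [[NVIDIA Tesla#Ampereマイクロアーキテクチャ|NVIDIA A100 GPU]], plain FP32 peaks at 19.5 TFLOPS, whereas FP16 low-precision products on Tensor Cores reach 312 TFLOPS, that is, 16 times the throughput. In this way, Tensor Cores function as accelerators for mixed-precision matrix FMA.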
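
The role of the high-precision accumulator, and the throughput ratio quoted above, can be checked with a short calculation. This is a sketch under the assumption that software rounding via <code>struct</code> approximates the behaviour of an FP16 or FP32 accumulator; the helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def accumulate(products, round_fn):
    """Sum products sequentially, rounding the running sum with round_fn."""
    acc = 0.0
    for p in products:
        acc = round_fn(acc + p)
    return acc

# 4096 identical FP16 products, each exactly 0.25 * 0.25 = 0.0625.
products = [to_fp16(0.25) * to_fp16(0.25)] * 4096

fp16_sum = accumulate(products, to_fp16)  # low-precision accumulator
fp32_sum = accumulate(products, to_fp32)  # high-precision accumulator

print(fp16_sum)  # 128.0: the FP16 sum stops growing once 0.0625 is half a ULP
print(fp32_sum)  # 256.0: the exact result

# Peak figures quoted in the text for the A100.
speedup = 312 / 19.5
print(speedup)   # 16.0
```

With an FP16 accumulator the running sum stalls at 128 because each addend falls on a rounding tie that resolves downward, whereas the FP32 accumulator reaches the exact sum.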

== Support ==
Tensor Cores come in generations, and each generation differs in speed and in the precisions it supports.
{| class="wikitable"
|+Table. Tensor Core generations and supported precisions
! rowspan="2" |Generation (architecture)
! colspan="8" |Multiply precision
! colspan="4" |Accumulate precision
|-
!FP64
!TF32
!FP16
!BF16
!FP8
!INT8
!INT4
!INT1
!FP64
!FP32
!FP16
!INT32
|-
|1 (Volta) <ref>"FP16 precision introduced on the Volta Tensor Core ... Volta GPU ... Results are accumulated into FP32 for mixed precision training or FP16 for inference." NVIDIA. ''[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf NVIDIA A100 Tensor Core GPU Architecture]''. Retrieved 2023-05-16.</ref>
| -
| -
|✔
| -
| -
| -
| -
| -
| -
|✔
|✔
| -
|-
|2 (Turing) <ref>"the INT8, INT4 and binary 1-bit precisions added in the Turing Tensor Core" NVIDIA. ''[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf NVIDIA A100 Tensor Core GPU Architecture]''. Retrieved 2023-05-16.</ref>
| -
| -
|✔
| -
| -
|✔
|✔
|✔
| -
|✔
|✔
|✔
|-
|3 (Ampere) <ref>"In addition to FP16 precision ... INT8, INT4 and binary 1-bit precisions ... the A100 Tensor Core adds support for TF32, BF16 and FP64 formats" NVIDIA. ''[https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf NVIDIA A100 Tensor Core GPU Architecture]''. Retrieved 2023-05-16.</ref>
|✔
|✔
|✔
|✔
| -
|✔
|✔
|✔
|✔
|✔
|✔
|✔
|-
|4 (Hopper) <ref>"NVIDIA H100 GPU ... New fourth-generation Tensor Cores" NVIDIA. ''[https://www.hpctech.co.jp/catalog/gtc22-whitepaper-hopper_v1.01.pdf NVIDIA H100 Tensor Core GPU Architecture]''. Retrieved 2023-05-16.</ref><ref>"FP8, FP16, BF16, TF32, FP64, and INT8 MMA data types are supported." NVIDIA. ''[https://www.hpctech.co.jp/catalog/gtc22-whitepaper-hopper_v1.01.pdf NVIDIA H100 Tensor Core GPU Architecture]''. Retrieved 2023-05-16.</ref>
|✔
|✔
|✔
|✔
|✔
|✔
| -
| -
|✔
|✔
|✔
|✔
|}
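
The multiply-precision columns of the table can also be written as a small lookup structure. This is only an illustrative sketch: the dictionary <code>MULTIPLY_PRECISIONS</code> and the helpers <code>supports</code> and <code>first_generation_with</code> are hypothetical names, not part of any NVIDIA API, and the data is transcribed from the table rather than queried from hardware.

```python
# Multiply-input precisions per Tensor Core generation, transcribed from
# the table above. Insertion order follows the generation number (1 to 4).
MULTIPLY_PRECISIONS = {
    "Volta": {"FP16"},
    "Turing": {"FP16", "INT8", "INT4", "INT1"},
    "Ampere": {"FP64", "TF32", "FP16", "BF16", "INT8", "INT4", "INT1"},
    "Hopper": {"FP64", "TF32", "FP16", "BF16", "FP8", "INT8"},
}

def supports(architecture, precision):
    """Return True if the table lists `precision` as a multiply precision."""
    return precision in MULTIPLY_PRECISIONS.get(architecture, set())

def first_generation_with(precision):
    """Return (generation number, architecture) of the first listed support."""
    for generation, (architecture, precisions) in enumerate(
        MULTIPLY_PRECISIONS.items(), start=1
    ):
        if precision in precisions:
            return generation, architecture
    return None

print(supports("Ampere", "TF32"))     # True
print(supports("Volta", "INT8"))      # False
print(first_generation_with("FP8"))   # (4, 'Hopper')
```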

== TensorFloat-32 ==
'''TensorFloat-32''' ('''TF32''') is one of the mixed-precision FMA modes of Tensor Cores.<ref>"TF32 is a new compute mode added to Tensor Cores in the Ampere generation of GPU architecture." NVIDIA Developer. ''[https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/ Accelerating AI Training with NVIDIA TF32 Tensor Cores]''. Retrieved 2023-06-18.</ref><ref name=":0">"TF32 is only exposed as a Tensor Core operation mode, not a type." NVIDIA Developer. ''[https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/ Accelerating AI Training with NVIDIA TF32 Tensor Cores]''. Retrieved 2023-06-18.</ref>

Tensor Cores are accelerators that execute low-precision matrix products at high speed; they cannot accelerate FP32 FMA as it stands. In TF32 mode, FP32 inputs are cast internally to a 19-bit representation, their matrix product is computed at high speed on the Tensor Cores, and the result is finally added to an FP32 accumulator. In other words, TensorFloat-32 is a fast mode for FP32 FMA that uses lower precision internally.<ref>"All storage in memory and other operations remain completely in FP32, only convolutions and matrix-multiplications convert their inputs to TF32 right before multiplication." NVIDIA Developer. ''[https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/ Accelerating AI Training with NVIDIA TF32 Tensor Cores]''. Retrieved 2023-06-18.</ref>
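
The cast, multiply and accumulate steps of TF32 mode can be sketched in software as follows. This is an emulation under stated assumptions: the cast is modelled as round-to-nearest-even on the FP32 bit pattern (the rounding used by real hardware is not specified here), infinities and NaNs are not handled, and the names <code>to_tf32</code> and <code>tf32_dot</code> are hypothetical.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_tf32(x):
    """Keep the sign, the 8 exponent bits and the top 10 mantissa bits of FP32.

    Assumes round-to-nearest-even for the 13 dropped mantissa bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 13  # FP32 has 23 mantissa bits, TF32 keeps 10
    lsb = (bits >> drop) & 1
    bits = (bits + (1 << (drop - 1)) - 1 + lsb) & ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def tf32_dot(xs, ys):
    """Dot product with TF32-rounded inputs and an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = to_fp32(acc + to_tf32(x) * to_tf32(y))
    return acc

print(tf32_dot([1.5, 2.0], [2.0, 0.25]))  # 3.5: inputs are exact in TF32

x = 1.0 + 2**-11                # representable in FP32, not in TF32
print(to_tf32(x))               # 1.0: the extra bit is rounded away
print(tf32_dot([x], [1.0]))     # 1.0, although the FP32 product would be x
```

Inputs and outputs of <code>tf32_dot</code> are ordinary FP32 values; only the multiplication sees the reduced mantissa, which mirrors the description above.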

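Low-precision matrix products inevitably incur some computational error. The 16-bit formats used previously (FP16 and BF16) keep the adverse effects to a minimum through mixed-precision computation, and the automatic mixed precision (AMP) features of frameworks made them usable with minimal code changes. However, in some cases the computational error significantly affects inference accuracy, and code changes, however small, were still required.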

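TF32 uses a 19-bit representation consisting of a 1-bit sign, an 8-bit exponent (the same as the BF16 exponent), and a 10-bit mantissa (the same as the FP16 mantissa), which reduces the loss of precision further compared with the 16-bit formats. Because the cast happens inside the Tensor Core and the result is accumulated in FP32, from the outside the operation looks like an ordinary FP32 FMA, and no code changes are needed. TF32 is therefore a mode that yields several times the throughput with minimal precision loss and without code changes. Since it only changes the precision inside the core, it cannot reduce data size (for example, GPU memory usage). Moreover, the speedup exists only because Tensor Cores can process the 19-bit representation quickly; TF32 is strictly one of the specialized functions or modes of Tensor Cores.<ref name=":0" />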
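
The bit layout described above can be compared with the 16-bit formats in a few lines. This is a sketch: the 7-bit mantissa of BF16 and the 5-bit exponent of FP16 are standard properties of those formats rather than statements from this article, round-to-nearest-even is assumed for the casts, and the names <code>FORMATS</code>, <code>round_mantissa</code> and <code>to_fp16</code> are hypothetical.

```python
import struct

# (sign bits, exponent bits, mantissa bits) for each format.
FORMATS = {
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
    "TF32": (1, 8, 10),
    "FP32": (1, 8, 23),
}
widths = {name: sum(fields) for name, fields in FORMATS.items()}
print(widths["TF32"])  # 19

def round_mantissa(x, kept_bits):
    """Round an FP32 value to `kept_bits` mantissa bits, keeping the 8-bit exponent."""
    drop = 23 - kept_bits
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> drop) & 1
    bits = (bits + (1 << (drop - 1)) - 1 + lsb) & ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x):
    """Round to half precision; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

fine = 1.0 + 2**-10  # needs 10 mantissa bits
big = 100000.0       # larger than the FP16 maximum of 65504

print(round_mantissa(fine, 10), round_mantissa(fine, 7))  # TF32 keeps it, BF16 gives 1.0
print(to_fp16(big), round_mantissa(big, 10))              # inf in FP16, 99968.0 in TF32
```

The two test values show the two halves of the design: the 10-bit mantissa preserves a value that BF16 rounds away, and the 8-bit exponent represents a value that overflows in FP16.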

== See also ==
* [[Graphics Processing Unit]]
* [[NVIDIA]]
* [[深層学習|Deep learning]]
* [[ディープ・ラーニング・スーパー・サンプリング|Deep Learning Super Sampling]]

== References ==
{{reflist}}

{{デフォルトソート:てんそるこあ}}
[[Category:NVIDIA]]
[[Category:AIアクセラレータ]]
[[Category:ディープラーニング]]