累積和のソースを表示

'''累積和'''（るいせきわ、{{lang-en-short|prefix sum, cumulative sum, scan}}）とは、[[計算機科学]]において、[[数列]] <math>x_0, x_1, x_2, \dots</math> に対して、その先頭部分の[[総和]]を求めることによって得られる数列 <math>y_0, y_1, y_2, \dots</math> の事。

:<math>y_0 = x_0</math>
:<math>y_1 = x_0 + x_1</math>
:<math>y_2 = x_0 + x_1 + x_2</math>
:<math>\dots</math>

例えば、[[自然数]]の累積和は[[三角数]]になる。

:{| class="wikitable" style="text-align:right;"
|-
!自然数
| 1 || 2 || 3 || 4 || 5 || 6 || ...
|-
!自然数の累積和
| 1 || 3 || 6 || 10 || 15 || 21 || ...
|}

累積和は、単に[[加法|足し算]]だけで無く、二項演算子 <math>\oplus</math> に一般化することが可能であり、そのため幅広い応用が可能である。これにより、[[関数型プログラミング言語]]では、scanと呼ばれる基本的な処理となっている。なお、途中の計算過程を記録する必要が無く、最終結果だけが必要な場合はfoldと呼ばれる。<ref name="blelloch11">{{citation
| last1 = Blelloch | first1 = Guy | author1-link = Guy E. Blelloch
| publisher = Carnegie Mellon University
| title = Prefix Sums and Their Applications (Lecture Notes)
| url = https://www.cs.cmu.edu/afs/cs/academic/class/15750-s11/www/handouts/PrefixSumBlelloch.pdf
| year = 2011}}</ref><ref name="callahan1995">{{citation
| last1 = Callahan | first1 = Paul 
| last2 = Kosaraju | first2 = S. Rao
| title = A Decomposition of Multi-Dimensional Point Sets with Applications to k-Nearest-Neighbors and n-Body Potential Fields
| journal = Journal of the ACM
 | volume = 42
| issue = 1
| pages = 67–90 
| doi = 10.1145/200836.200853
| year = 1995| s2cid = 1818562 
| doi-access = free}}</ref>

いくつかのアルゴリズムにおいて有用な基本処理であり、カウントソートなどのアルゴリズムで利用されている。二項演算子 <math>\oplus</math> に対して[[減法|引き算]] <math>\ominus</math> が存在する場合、事前に累積和を求めておくと、m番目からn番目までの総和が「n番目の累積和 <math>\ominus</math> (m-1)番目の累積和」により、高速に求めることができる。2次元の場合はこれを[[:en:summed-area table|summed-area table]]と呼ぶ。<ref name="clrs">{{citation
| last1 = Cormen | first1 = Thomas H.
| last2 = Leiserson | first2 = Charles E.
| last3 = Rivest | first3 = Ronald L.
| last4 = Stein | first4 = Clifford
| edition = 2nd
| isbn = 0-262-03293-7
| pages = 168–170
| publisher = [[MIT Press]] and [[McGraw-Hill]]
| title = Introduction to Algorithms
| year = 2001}}.</ref><ref name="cv86">{{Citation
| last1 = Cole | first1 = Richard
| last2 = Vishkin | first2 = Uzi
| year = 1986
| title = Deterministic coin tossing with applications to optimal parallel list ranking
| journal = Information and Control
| volume = 70
| issue = 1
| pages = 32–53
| doi = 10.1016/S0019-9958(86)80023-7
| doi-access = free
| url = https://archive.org/download/deterministiccoi00vish/deterministiccoi00vish_bw.pdf
}}</ref>

累積和は、逐次計算においては、単に前の結果と計算するだけで簡単に求まるが、[[並列計算]]の分野でも広く研究されており、foldやscanの二項演算子が <math>(a \oplus b) \oplus c = a \oplus (b \oplus c)</math> という[[結合法則]]を満たすと並列化することが可能であり、[[並列アルゴリズム]]の有用な基本処理になっている。<ref name="lf80">{{citation|first1=R. E.|last1=Ladner|first2=M. J.|last2=Fischer|author2-link=Michael J. Fischer|title=Parallel Prefix Computation|journal=[[Journal of the ACM]]|volume=27|issue=4|year=1980|pages=831–838|mr=0594702|doi=10.1145/322217.322232|citeseerx=10.1.1.106.6247|s2cid=207568668 }}</ref><ref name="tv85">{{citation
| last1 = Tarjan | first1 = Robert E. | authorlink1 = Robert E. Tarjan
| last2 = Vishkin | first2 = Uzi | authorlink2 = Uzi Vishkin
| title = An efficient parallel biconnectivity algorithm
| journal = [[SIAM Journal on Computing]]
| volume = 14 | issue = 4 | year = 1985 | doi = 10.1137/0214061 | pages = 862–874| citeseerx = 10.1.1.465.8898 }}</ref><ref name="Prefix.book.92">{{citation
| last1 = Lakshmivarahan
| first1 = S.
| last2 = Dhall
| first2 = S.K.
| isbn = 0-19508849-2
| publisher = [[Oxford University Press]]
| title = Parallelism in the Prefix Problem
| year = 1994
| url-access = registration
| url = https://archive.org/details/parallelcomputin0000laks
}}</ref>

== 高階関数のscan ==
[[関数型プログラミング言語]]の観点では、累積和は加算に限らず任意の二項演算子へと一般化できる。この一般化によって得られる[[高階関数]]はscanと呼ばれ、foldと密接に関連している。scanとfoldはどちらも与えられた二項演算子を同じ数列に適用するが、両者には違いがある。scanは演算の途中結果を含む全ての中間値の列を返すのに対し、foldは最終結果のみを返す。

例えば、階乗数列は、自然数列に対して加算の代わりに乗算を用いたscanを行うことで生成できる。

:{| class="wikitable" style="text-align:right;"
|-
!入力
|  1 ||  2 ||  3 ||  4 ||   5 ||   6 || ...
|-
!累積の積
|  1 ||  2 ||  6 || 24 || 120 || 720 || ...
|}

=== inclusiveとexclusive===
プログラミング言語やライブラリにおけるscanの実装には、inclusiveとexclusiveの2種類が存在する。

:{| class="wikitable" style="text-align:right;"
|-
!入力
| 1 || 2 || 3 || 4 || 5 || 6 || ...
|-
! inclusive
| 1 || 3 || 6 || 10 || 15 || 21 || ...
|-
! exclusive
| 0 || 1 || 3 || 6 || 10 || 15 || ...
|} 

inclusive scanでは、出力 <math>y_i</math> を計算する際に入力 <math>x_i</math> を含めるのに対し（すなわち、<math display="inline">y_i = \bigoplus_{j=0}^{i} x_j</math>）、exclusive scanでは <math>x_i</math> を含めない（すなわち、<math display="inline">y_i = \bigoplus_{j=0}^{i-1} x_j</math>）。後者の場合、<math>y_0</math> を未定義のままとするか、scanの初期値として特別な値 <math>x_{-1}</math> を受け取る実装が一般的である。  

inclusive scanとexclusive scanは相互に変換可能である。inclusive scanをexclusive scanに変換するには、scanで得られた配列を右に1つシフトし、左端に単位元（identity value）を挿入すればよい。逆に、exclusive scanをinclusive scanに変換するには、scanで得られた配列を左に1つシフトし、右端に「scanの最後の要素と入力配列の最後の要素の和」を挿入すればよい。<ref name="gpu_gem_3">{{Cite web |title=Chapter 39. Parallel Prefix Sum (Scan) with CUDA - GPU Gems 3 |author= |work=NVIDIA Developer |date= |access-date=15 March 2025 |url= https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda}}</ref>

以下はプログラミング言語でのscanの実装の一覧。

{| class="wikitable" style="text-align: left;"
!プログラミング言語
!inclusive scan
!exclusive scan
|-
|[[APL]]
|<code>+\</code>
|
|-
|[[C++]]
|[https://en.cppreference.com/w/cpp/algorithm/inclusive_scan <code>std::inclusive_scan</code>]
|[https://en.cppreference.com/w/cpp/algorithm/exclusive_scan <code>std::exclusive_scan</code>]
|-
|[[Clojure]]
|[https://clojuredocs.org/clojure.core/reductions <code>reductions</code> init無し]
|[https://clojuredocs.org/clojure.core/reductions <code>reductions</code> init有り]
|-
|[[CUDA]]
|[https://nvidia.github.io/cccl/thrust/api/function_group__prefixsums_1ga88178012310104145d0f3169e7fc371e.html <code>thrust::inclusive_scan</code>]<br>[https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html <code>cub::DeviceScan::InclusiveScan</code>]
|[https://nvidia.github.io/cccl/thrust/api/function_group__prefixsums_1gaa3f981950f16c9dae693590b79a9ff90.html <code>thrust::exclusive_scan</code>]<br>[https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html <code>cub::DeviceScan::ExclusiveScan</code>]
|-
|[[F Sharp|F#]]
|
|[https://fsharp.github.io/fsharp-core-docs/reference/fsharp-collections-listmodule.html#scan <code>scan</code>]
|-
|[[Haskell]]
|[https://hackage.haskell.org/package/base-4.17.0.0/docs/Prelude.html#v:scanl1 <code>scanl1</code>]
|[https://hackage.haskell.org/package/base-4.17.0.0/docs/Prelude.html#v:scanl <code>scanl</code>]
|-
|[[High Performance Fortran|HPF]]
|<code>sum_prefix</code>
|<code>sum_prefix(..., exclusive=.true.)</code>
|-
|[[Java]]
|[https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/stream/Gatherers.html#scan(java.util.function.Supplier,java.util.function.BiFunction) <code>Gatherers.scan</code>]<br>[https://docs.oracle.com/en/java/javase/24/docs/api/java.base/java/util/Arrays.html#parallelPrefix(T%5B%5D,java.util.function.BinaryOperator) <code>Arrays.parallelPrefix</code>]
|
|-
|[[Kotlin]]
|
|[https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/scan.html <code>scan</code>]
|-
|[[Message Passing Interface|MPI]]
|<code>MPI_Scan</code>
|<code>MPI_Exscan</code>
|-
|[[Python]]
|[https://docs.python.org/3/library/itertools.html#itertools.accumulate <code>itertools.accumulate</code>]<br>引数initialがNoneの場合
|[https://docs.python.org/3/library/itertools.html#itertools.accumulate <code>itertools.accumulate</code>]<br>引数initialがNoneでない場合
|-
|[[Rust (プログラミング言語)|Rust]]
|[https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.scan <code>scan</code>]
|
|-
|[[Scala]]
|
|[https://scala-lang.org/api/current/scala/collection/Iterable.html#scan-fffff9e1 <code>scan</code>]
|}

== 並列計算 ==
二項演算子 <math>\oplus</math> が <math>(a \oplus b) \oplus c = a \oplus (b \oplus c)</math> という[[結合法則]]を満たす場合は並列化が可能である。

この並列化手法は[[Graphics Processing Unit|GPU]]でも利用可能である。[[NVIDIA]]のGPU Gems 3のFigure 39-7によると、2007年当時、要素数nが1,000の場合はCPUの方が高速だが、要素数nが1,000,000ある場合はGPU([[CUDA]])の方が高速である。<ref name="gpu_gem_3"/>

並列計算において、累積和を求めるためのアルゴリズムは多数存在する。NVIDIAでは2013年にNVIDIA CUDAのCUB 1.0.1で実装され<ref>{{Cite web |title=cub::DeviceScan — cub 2.5 documentation |author= |work=nvidia.github.io |date= |access-date=18 March 2025 |url= https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html}}</ref>、2016年に論文が書かれたdecoupled look-back法を使用していて、二項演算子が単純な足し算の場合は、要素数nが十分大きければ、メモリ帯域がボトルネックとなっていて、要素数nのメモリコピーと同じ速度で動く<ref name="nvidia2016">{{Cite web |title=Single-pass Parallel Prefix Scan with Decoupled Look-back {{!}} Research |author= |work=research.nvidia.com |date= |access-date=17 March 2025 |url= https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back}}</ref>。この手法はIntel GPU用のIntel oneAPI DPC++でも使用されている<ref>{{Cite web |title=Inclusive Scan - Intel® oneAPI DPC++ Library Developer Guide and Reference |author= |work=Intel |date= |access-date=17 March 2025 |url= https://www.intel.com/content/www/us/en/docs/onedpl/developer-guide/2022-7/inclusive-scan.html}}</ref>。

以下、2007年に出版されたNVIDIAのGPU Gems 3のChapter 39で紹介されているアルゴリズムを説明する<ref name="gpu_gem_3"/>。これらはより効率が良いものが発見されているので現在はNVIDIAは使用していない。1つ目のアルゴリズムは、より短いスパン（計算の依存関係の深さ）を持ち、高い並列性を実現できるが、ワーク効率（計算量の総和）が低い。2つ目のアルゴリズムは、ワーク効率が高いものの、スパンが2倍となり並列性が低下する。以下に、それぞれのアルゴリズムについて説明する。二項演算子 <math>\oplus</math> の計算量は <math>O(1)</math> とする。

=== アルゴリズム1：短いスパン、高い並列性 ===
[[File:Hillis-Steele Prefix Sum.svg|class=skin-invert-image|thumb|300px|並列性の高い16入力の並列累積和の回路表現]]
HillisとSteeleは、以下の並列累積和アルゴリズムを提案している。<ref name="hs1986">
{{cite journal
| last1 = Hillis | first1 = W. Daniel
| last2 = Steele, Jr. | first2 = Guy L.
| title = Data parallel algorithms
| journal = Communications of the ACM
| date = December 1986
| volume = 29
| issue = 12
| pages = 1170–1183
| doi = 10.1145/7902.7903
| doi-access = free
}}</ref>

 '''for''' ''i'' <- 0 to floor(log<sub>2</sub>(''n'')) '''do'''
     '''for''' ''j'' <- 0 to ''n'' - 1 '''do in parallel'''
         '''if''' ''j'' < 2<sup>''i''</sup> '''then'''
             ''x''{{su|b=''j''|p=''i''+1}} <- ''x''{{su|b=''j''|p=''i''}}
         '''else'''
             ''x''{{su|b=''j''|p=''i''+1}} <- ''x''{{su|b=''j''|p=''i''}} <math>\oplus</math> ''x''{{su|b=''j'' - 2<sup>''i''</sup>|p=''i''}}

ここで、<math>x^i_j</math> は、ステップ <math>i</math> における配列 <math>x</math> の <math>j</math> 番目の要素の値を表す。  

このアルゴリズムを単一プロセッサで実行した場合、計算量は <math>O(n \log n)</math> となる。しかし、少なくとも <math>n</math> 個のプロセッサを用いて内側のループを並列実行できる環境であれば、外側のループの繰り返し回数に等しい <math>O(\log n)</math> 時間で計算を完了できる。<ref>{{Cite journal
|title=A Survey of General-Purpose Computation on Graphics Hardware
|last1=Owens
|first1=John D.
|last2=Luebke
|first2=David
|last3=Govindaraju
|first3=Naga
|last4=Harris
|first4=Mark
|last5=Krüger
|first5=Jens
|last6=Lefohn
|first6=Aaron E.
|last7=Purcell
|first7=Timothy J.
|volume=26
|issue=1
|year=2007
|pages=80–113
|doi=10.1111/j.1467-8659.2007.01012.x
|doi-access=free
|journal=Computer Graphics Forum}}</ref>

=== アルゴリズム2：ワーク効率が高い方法 ===
[[File:Prefix sum 16.svg|class=skin-invert-image|thumb|300px|ワーク効率が高い16入力の並列累積和の回路表現]]
ワーク効率の良い並列累積和は、以下の手順で計算できる。<ref name="lf80"/><ref name="offman">{{citation|first=Yu.|last=Ofman|authorlink=Yuri Petrovich Ofman|script-title=ru:Об алгоритмической сложности дискретных функций|journal=[[Proceedings of the USSR Academy of Sciences|Doklady Akademii Nauk SSSR]]|volume=145|issue=1|year=1962|pages=48–51|language=Russian|mr=0168423}}. English translation, "On the algorithmic complexity of discrete functions", ''Soviet Physics Doklady'' '''7''': 589–591 1963.</ref><ref name="krapchenko">{{citation|first=V. M.|last=Khrapchenko|title=Asymptotic Estimation of Addition Time of a Parallel Adder|journal=Problemy Kibernet.|volume=19|year=1967|pages=107–122|language=Russian}}. English translation in ''Syst. Theory Res.'' '''19'''; 105–122, 1970.</ref>

# 隣接する要素の和を計算する（ペアの先頭要素のインデックスが偶数であるものを対象とする）。  
#* 例：<math>z_0 = x_0 \oplus x_1, z_1 = x_2 \oplus x_3, \dots</math>  
# ステップ1で得た数列 <math>z_0, z_1, z_2, \dots</math> に対して、再帰的に累積和を計算する。  
#* 結果として、新たな数列 <math>w_0, w_1, w_2, \dots</math> を得る。  
# 最終的な累積和 <math>y_0, y_1, y_2, \dots</math> を、中間数列の値を用いて求める。  
#* 具体的には、各 <math>y_i</math> は、これまでに計算された中間数列の要素の和で表される。  
#* 例：  
#** <math>y_0 = x_0</math>  
#** <math>y_1 = z_0</math>  
#** <math>y_2 = z_0 \oplus x_2</math>  
#** <math>y_3 = w_1</math>  
#* 最初の値 <math>y_0</math> を決めた後、それ以降の各 <math>y_i</math> は、数列 <math>w</math> の半分の位置にある値をコピーするか、直前の値に数列 <math>x</math> の一部の値を加えることで求める。  

入力数列の長さを <math>n</math> とすると、このアルゴリズムの再帰の深さは <math>O(\log n)</math> となる。したがって、並列実行時の計算時間も <math>O(\log n)</math> に抑えられる。  

このアルゴリズムの総ステップ数は <math>O(n)</math> であり、<math>O(n / \log n)</math> 個のプロセッサを持つ[[並列ランダムアクセス機械]]（PRAM）上で、非対称的な遅延なしに実装可能である。これは、プロセッサの数よりも要素数が多い段階では、1つのプロセッサが複数のインデックスを処理するように割り当てることで実現される。

=== より一般的なfoldやscanに適用する方法 ===
本項では、要素 <math>\oplus</math> 要素（Haskellの表記では型a -> a -> a<ref>{{Cite web |title=scanl1 - Prelude |author= |work=hackage.haskell.org |date= |access-date=16 March 2025 |url= https://hackage.haskell.org/package/base-4.17.0.0/docs/Prelude.html#v:scanl1 }}</ref>）として書いているが、関数型プログラミング言語では、
* 初期状態（型b<ref name="haskell-scanl">{{Cite web |title=scanl - Prelude |author= |work=hackage.haskell.org |date= |access-date=16 March 2025 |url= https://hackage.haskell.org/package/base-4.17.0.0/docs/Prelude.html#v:scanl}}</ref>）
* f(状態, 要素) = 次の状態（型b -> a -> b<ref name="haskell-scanl"/>）
と考えることにより、foldやscanで逐次処理を表現している。foldは最終状態（型b）で、scanは状態遷移列（型[b]）である。

ここで、[[結合法則]]を満たす
* 状態 <math>\oplus</math> 状態 = 結合した状態（型b -> b -> b）
が存在すると、
# まず、f(状態, 要素) は f(初期状態, 要素) でしか計算しないので、要素から f(初期状態, 要素) へのmap変換を行う。
# その上で、状態 <math>\oplus</math> 状態 -> 結合状態を使用すると、上記の並列アルゴリズムが適用できる。
これにより、逐次処理のfoldやscanで表現していたものに対して、並列アルゴリズムが使えるようになる。

具体例として、指数[[移動平均]] <math>s_t = (1-\alpha)s_{t-1} + \alpha x_t,\ s_0 = 0</math> を考える。要素は <math>x_t</math> だが、状態を (指数移動平均の最後の値, 指数移動平均の長さ) の[[タプル]]とすると下記変換式により並列計算ができる。
* 初期状態 = <math>(0, 0)</math>
* f(初期状態 <math>(0, 0)</math>, 要素 <math>x_t</math>) = <math>(\alpha x_t, 1)</math>
* 状態 <math>(s_t, l_t)</math> <math>\oplus</math> 状態 <math>(s_u, l_u)</math> = 結合状態 <math>((1 - \alpha)^{l_u} s_t + s_u, l_t + l_u)</math>

これをNVIDIA [[CUDA]]のThrust([[C++]])で実装する。要素数nが十分大きく、下記のように二項演算子の計算量が小さい場合はinclusive_scanはメモリコピーの速度で動くが<ref name="nvidia2016"/>、要素から状態への並列map、状態の並列scan、状態から要素への並列mapが必要で、単純に実装すると3回分のメモリコピーが発生するが、Thrustではtransform_iteratorを使用すると1回のメモリコピー分の計算時間にまとめることができるので、それを使用している。
<syntaxhighlight lang="cpp">
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_output_iterator.h>

struct State { float v; int len; };

struct toState { // 要素→状態
    const float alpha;
    __device__ State operator()(const float x) const { return State{alpha * x, 1}; }
};
struct toElement { // 状態→要素
    __device__ float operator()(const State &s) const { return s.v; }
};
struct plusStates { // 状態＋状態
    const float alpha;
    __device__ State operator()(const State &a, const State &b) const {
        return State{std::pow(1 - alpha, (float) b.len) * a.v + b.v, a.len + b.len};
    }
};

int main() {
    const float alpha = 0.2f;
    thrust::device_vector<float> input = {3, 1, 4, 1, 5, 9, 2, 6};
    thrust::device_vector<float> output(input.size());

    // 指数移動平均の計算
    thrust::inclusive_scan(
        thrust::make_transform_iterator(input.begin(), toState{alpha}),
        thrust::make_transform_iterator(input.end(), toState{alpha}),
        thrust::make_transform_output_iterator(output.begin(), toElement{}),
        plusStates{alpha}
    );

    // 計算結果の出力
    for (float x: output) {
        std::cout << x << " ";
    }
    std::cout << std::endl;

    return 0;
}
</syntaxhighlight>

== 出典 ==
{{reflist}}

{{アルゴリズム}}

{{DEFAULTSORT:るいせきわ}}
[[Category:並列コンピューティング]]
[[Category:高階関数]]