
Commit cbcf750 ("updated v3.2"), parent 26e4827

27 files changed: +306 / -36 lines

docs/References.html

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ <h3>Contents</h3>
 <li><a href="#RefRecognition">Recognition</a></li>
 </ul>
 </div>
-<p>This page summarizes a list of publication related to Taskflow. If you are using Taskflow, please cite the following paper we publised at 2019 IEEE IPDPS:</p><p>Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, and Martin Wong, &quot;<a href="ipdps19.pdf">Cpp-Taskflow: Fast Task-based Parallel Programming using Modern C++</a>,&quot; <em>IEEE International Parallel and Distributed Processing Symposium (IPDPS)</em>, pp. 974-983, Rio de Janeiro, Brazil, 2019</p><section id="RefConference"><h2><a href="#RefConference">Conference</a></h2><ol><li>Dian-Lun Lin and Tsung-Wei Huang, &quot;Efficient GPU Computation using Task Graph Parallelism,&quot; <em>European Conference on Parallel and Distributed Computing (EuroPar)</em>, 2021</li><li>Tsung-Wei Huang, &quot;<a href="iccad20.pdf">A General-purpose Parallel and Heterogeneous Task Programming System for VLSI CAD</a>,&quot; <em>IEEE/ACM International Conference on Computer-aided Design (ICCAD)</em>, CA, 2020</li><li>Chun-Xun Lin, Tsung-Wei Huang, and Martin Wong, &quot;<a href="icpads20.pdf">An Efficient Work-Stealing Scheduler for Task Dependency Graph</a>,&quot; <em>IEEE International Conference on Parallel and Distributed Systems (ICPADS)</em>, Hong Kong, 2020</li><li>Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, and Martin Wong, &quot;<a href="ipdps19.pdf">Cpp-Taskflow: Fast Task-based Parallel Programming using Modern C++</a>,&quot; <em>IEEE International Parallel and Distributed Processing Symposium (IPDPS)</em>, pp. 974-983, Rio de Janeiro, Brazil, 2019</li><li>Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, and Martin Wong, &quot;<a href="mm19.pdf">A Modern C++ Parallel Task Programming Library</a>,&quot; <em>ACM Multimedia Conference (MM)</em>, pp. 2284-2287, Nice, France, 2019</li><li>Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, and Martin Wong, &quot;<a href="hpec19.pdf">An Efficient and Composable Parallel Task Programming Library</a>,&quot; <em>IEEE High-performance and Extreme Computing Conference (HPEC)</em>, pp. 1-7, Waltham, MA, 2019</li></ol></section><section id="RefJournal"><h2><a href="#RefJournal">Journal</a></h2><ol><li>Tsung-Wei Huang, Dian-Lun Lin, Yibo Lin, and Chun-Xun Lin, &quot;Cpp-Taskflow: A General-purpose Parallel Task Programming System at Scale,&quot; <em>IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems (TCAD)</em>, 2021</li><li>Tsung-Wei Huang, Dian-Lun Lin, Yibo Lin, and Chun-Xun Lin, &quot;<a href="2004.10908v2.pdf">Cpp-Taskflow v2: A General-purpose Parallel and Heterogeneous Task Programming System at Scale</a>,&quot; <em>Computing Research Repository (CoRR)</em>, arXiv:2004.10908, 2020</li></ol></section><section id="RefRecognition"><h2><a href="#RefRecognition">Recognition</a></h2><ol><li>Champion of the MIT/Amazon Graph Challenge at the 2020 IEEE High-performance Extreme Computing Conference</li><li>Second Prize of Open-Source Software Competition at the 2019 ACM Multimedia Conference</li><li>ACM SIGDA Outstanding PhD Dissertation Award at the 2019 ACM/IEEE Design Automation Conference</li><li>Best Poster Award at the 2018 Official C++ Conference, voted by thousands of developers</li></ol></section>
+<p>This page summarizes a list of publication related to Taskflow. If you are using Taskflow, please cite the following paper we publised at 2019 IEEE IPDPS:</p><p>Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, and Martin Wong, &quot;<a href="ipdps19.pdf">Cpp-Taskflow: Fast Task-based Parallel Programming using Modern C++</a>,&quot; <em>IEEE International Parallel and Distributed Processing Symposium (IPDPS)</em>, pp. 974-983, Rio de Janeiro, Brazil, 2019</p><section id="RefConference"><h2><a href="#RefConference">Conference</a></h2><ol><li>Dian-Lun Lin and Tsung-Wei Huang, &quot;Efficient GPU Computation using Task Graph Parallelism,&quot; <em>European Conference on Parallel and Distributed Computing (EuroPar)</em>, 2021</li><li>Tsung-Wei Huang, &quot;<a href="iccad20.pdf">A General-purpose Parallel and Heterogeneous Task Programming System for VLSI CAD</a>,&quot; <em>IEEE/ACM International Conference on Computer-aided Design (ICCAD)</em>, CA, 2020</li><li>Chun-Xun Lin, Tsung-Wei Huang, and Martin Wong, &quot;<a href="icpads20.pdf">An Efficient Work-Stealing Scheduler for Task Dependency Graph</a>,&quot; <em>IEEE International Conference on Parallel and Distributed Systems (ICPADS)</em>, Hong Kong, 2020</li><li>Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, and Martin Wong, &quot;<a href="ipdps19.pdf">Cpp-Taskflow: Fast Task-based Parallel Programming using Modern C++</a>,&quot; <em>IEEE International Parallel and Distributed Processing Symposium (IPDPS)</em>, pp. 974-983, Rio de Janeiro, Brazil, 2019</li><li>Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, and Martin Wong, &quot;<a href="mm19.pdf">A Modern C++ Parallel Task Programming Library</a>,&quot; <em>ACM Multimedia Conference (MM)</em>, pp. 2284-2287, Nice, France, 2019</li><li>Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, and Martin Wong, &quot;<a href="hpec19.pdf">An Efficient and Composable Parallel Task Programming Library</a>,&quot; <em>IEEE High-performance and Extreme Computing Conference (HPEC)</em>, pp. 1-7, Waltham, MA, 2019</li></ol></section><section id="RefJournal"><h2><a href="#RefJournal">Journal</a></h2><ol><li>Tsung-Wei Huang, Dian-Lun Lin, Yibo Lin, and Chun-Xun Lin, &quot;Cpp-Taskflow: A General-purpose Parallel Task Programming System at Scale,&quot; <em>IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems (TCAD)</em>, 2021</li><li>Tsung-Wei Huang, Dian-Lun Lin, Yibo Lin, and Chun-Xun Lin, &quot;<a href="2004.10908v2.pdf">Cpp-Taskflow v2: A General-purpose Parallel and Heterogeneous Task Programming System at Scale</a>,&quot; <em>Computing Research Repository (CoRR)</em>, arXiv:2004.10908, 2020</li></ol></section><section id="RefRecognition"><h2><a href="#RefRecognition">Recognition</a></h2><ol><li>Champion of Graph Challenge at the 2020 IEEE High-performance Extreme Computing Conference</li><li>Second Prize of Open-Source Software Competition at the 2019 ACM Multimedia Conference</li><li>ACM SIGDA Outstanding PhD Dissertation Award at the 2019 ACM/IEEE Design Automation Conference</li><li>Best Poster Award at the 2018 Official C++ Conference, voted by thousands of developers</li></ol></section>
 </div>
 </div>
 </div>

docs/Releases.html

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@
 <h1>
 Release Notes
 </h1>
-<p>This page summarizes the release notes of Taskflow. We classify each release with three numbers:</p><p><code>Major.Minor.Patch</code></p><p>A <em>major</em> release indicates significant codebase changes and API modifications, a <em>minor</em> release indicates technical improvement over a major release line, and a <em>patch</em> release indicates fixes of bugs and other issues.</p><p>All releases are available in <a href="https://github.com/taskflow/">project GitHub</a>.</p><ul><li><a href="release-roadmap.html" class="m-doc">Release Roadmap</a></li><li><a href="release-3-2-0.html" class="m-doc">Release 3.2.0 (Master)</a></li><li><a href="release-3-1-0.html" class="m-doc">Release 3.1.0 (2021/04/14)</a></li><li><a href="release-3-0-0.html" class="m-doc">Release 3.0.0 (2021/01/01)</a></li><li><a href="release-2-7-0.html" class="m-doc">Release 2.7.0 (2020/10/01)</a></li><li><a href="release-2-6-0.html" class="m-doc">Release 2.6.0 (2020/08/25)</a></li><li><a href="release-2-5-0.html" class="m-doc">Release 2.5.0 (2020/06/01)</a></li><li><a href="release-2-4-0.html" class="m-doc">Release 2.4.0 (2020/03/25)</a></li><li><a href="release-2-3-1.html" class="m-doc">Release 2.3.1 (2020/03/13)</a></li><li><a href="release-2-3-0.html" class="m-doc">Release 2.3.0 (2020/02/27)</a></li><li><a href="release-2-2-0.html" class="m-doc">Release 2.2.0 (2019/06/15)</a></li><li><a href="release-2-1-0.html" class="m-doc">Release 2.1.0 (2019/02/15)</a></li><li><a href="release-2-0-0.html" class="m-doc">Release 2.0.0 (2018/08/28)</a></li><li><a href="release-1-x-x.html" class="m-doc">Release 1.x.x (before 2018)</a></li></ul>
+<p>This page summarizes the release notes of Taskflow. We classify each release with three numbers:</p><p><code>Major.Minor.Patch</code></p><p>A <em>major</em> release indicates significant codebase changes and API modifications, a <em>minor</em> release indicates technical improvement over a major release line, and a <em>patch</em> release indicates fixes of bugs and other issues.</p><p>All releases are available in <a href="https://github.com/taskflow/">project GitHub</a>.</p><ul><li><a href="release-roadmap.html" class="m-doc">Release Roadmap</a></li><li><a href="release-3-3-0.html" class="m-doc">Release 3.3.0 (Master)</a></li><li><a href="release-3-2-0.html" class="m-doc">Release 3.2.0 (2021/07/29)</a></li><li><a href="release-3-1-0.html" class="m-doc">Release 3.1.0 (2021/04/14)</a></li><li><a href="release-3-0-0.html" class="m-doc">Release 3.0.0 (2021/01/01)</a></li><li><a href="release-2-7-0.html" class="m-doc">Release 2.7.0 (2020/10/01)</a></li><li><a href="release-2-6-0.html" class="m-doc">Release 2.6.0 (2020/08/25)</a></li><li><a href="release-2-5-0.html" class="m-doc">Release 2.5.0 (2020/06/01)</a></li><li><a href="release-2-4-0.html" class="m-doc">Release 2.4.0 (2020/03/25)</a></li><li><a href="release-2-3-1.html" class="m-doc">Release 2.3.1 (2020/03/13)</a></li><li><a href="release-2-3-0.html" class="m-doc">Release 2.3.0 (2020/02/27)</a></li><li><a href="release-2-2-0.html" class="m-doc">Release 2.2.0 (2019/06/15)</a></li><li><a href="release-2-1-0.html" class="m-doc">Release 2.1.0 (2019/02/15)</a></li><li><a href="release-2-0-0.html" class="m-doc">Release 2.0.0 (2018/08/28)</a></li><li><a href="release-1-x-x.html" class="m-doc">Release 1.x.x (before 2018)</a></li></ul>
 </div>
 </div>
 </div>

docs/kmeans.html

Lines changed: 2 additions & 2 deletions
@@ -54,7 +54,7 @@ <h3>Contents</h3>
 <ul>
 <li><a href="#KMeansProblemFormulation">Problem Formulation</a></li>
 <li><a href="#ParallelKMeansUsingCPUs">Parallel k-means using CPUs</a></li>
-<li><a href="#udaflow_1KMeansBenchmarking">Benchmarking</a></li>
+<li><a href="#KMeansBenchmarking">Benchmarking</a></li>
 </ul>
 </div>
 <p>We study a fundamental clustering problem in unsupervised learning, <em>k-means clustering</em>. We will begin by discussing the problem formulation and then learn how to write a parallel k-means algorithm.</p><section id="KMeansProblemFormulation"><h2><a href="#KMeansProblemFormulation">Problem Formulation</a></h2><p>k-means clustering uses <em>centroids</em>, k different randomly-initiated points in the data, and assigns every data point to the nearest centroid. After every point has been assigned, the centroid is moved to the average of all of the points assigned to it. We describe the k-means algorithm in the following steps:</p><ul><li>Step 1: initialize k random centroids</li><li>Step 2: for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it</li><li>Step 3: for every centroid, move the centroid to the average of the points assigned to that centroid</li><li>Step 4: go to Step 2 until converged (no more changes in the last few iterations) or maximum iterations reached</li></ul><p>The algorithm is illustrated as follows:</p><img class="m-image" src="kmeans_1.png" alt="Image" /><p>A sequential implementation of k-means is described as follows:</p><pre class="m-code"><span class="c1">// sequential implementation of k-means on a CPU</span>
@@ -414,7 +414,7 @@ <h3>Contents</h3>
 </g>
 </g>
 </svg>
-</div><p>The scheduler starts with <code>init</code>, moves on to <code>clean_up</code>, and then enters the parallel-for task <code>paralle-for</code> that spawns a subflow of 12 workers to perform parallel iterations. When <code>parallel-for</code> completes, it updates the cluster centroids and checks if they have converged through a condition task. If not, the condition task informs the scheduler to go back to <code>clean_up</code> and then <code>parallel-for</code>; otherwise, it returns a nominal index to stop the scheduler.</p></section><section id="udaflow_1KMeansBenchmarking"><h2><a href="#udaflow_1KMeansBenchmarking">Benchmarking</a></h2><p>Based on the discussion above, we compare the runtime of computing various k-means problem sizes between a sequential CPU and parallel CPUs on a machine of 12 Intel i7-8700 CPUs at 3.2 GHz.</p><table class="m-table"><thead><tr><th>N</th><th>K</th><th>M</th><th>CPU Sequential</th><th>CPU Parallel</th></tr></thead><tbody><tr><td>10</td><td>5</td><td>10</td><td>0.14 ms</td><td>77 ms</td></tr><tr><td>100</td><td>10</td><td>100</td><td>0.56 ms</td><td>86 ms</td></tr><tr><td>1000</td><td>10</td><td>1000</td><td>10 ms</td><td>98 ms</td></tr><tr><td>10000</td><td>10</td><td>10000</td><td>1006 ms</td><td>713 ms</td></tr><tr><td>100000</td><td>10</td><td>100000</td><td>102483 ms</td><td>49966 ms</td></tr></tbody></table><p>When the number of points is larger than 10K, the parallel CPU implementation starts to outperform the sequential CPU implementation.</p></section>
+</div><p>The scheduler starts with <code>init</code>, moves on to <code>clean_up</code>, and then enters the parallel-for task <code>paralle-for</code> that spawns a subflow of 12 workers to perform parallel iterations. When <code>parallel-for</code> completes, it updates the cluster centroids and checks if they have converged through a condition task. If not, the condition task informs the scheduler to go back to <code>clean_up</code> and then <code>parallel-for</code>; otherwise, it returns a nominal index to stop the scheduler.</p></section><section id="KMeansBenchmarking"><h2><a href="#KMeansBenchmarking">Benchmarking</a></h2><p>Based on the discussion above, we compare the runtime of computing various k-means problem sizes between a sequential CPU and parallel CPUs on a machine of 12 Intel i7-8700 CPUs at 3.2 GHz.</p><table class="m-table"><thead><tr><th>N</th><th>K</th><th>M</th><th>CPU Sequential</th><th>CPU Parallel</th></tr></thead><tbody><tr><td>10</td><td>5</td><td>10</td><td>0.14 ms</td><td>77 ms</td></tr><tr><td>100</td><td>10</td><td>100</td><td>0.56 ms</td><td>86 ms</td></tr><tr><td>1000</td><td>10</td><td>1000</td><td>10 ms</td><td>98 ms</td></tr><tr><td>10000</td><td>10</td><td>10000</td><td>1006 ms</td><td>713 ms</td></tr><tr><td>100000</td><td>10</td><td>100000</td><td>102483 ms</td><td>49966 ms</td></tr></tbody></table><p>When the number of points is larger than 10K, the parallel CPU implementation starts to outperform the sequential CPU implementation.</p></section>
 </div>
 </div>
 </div>

docs/kmeans_cudaflow.html

Lines changed: 2 additions & 2 deletions
@@ -55,7 +55,7 @@ <h3>Contents</h3>
 <li><a href="#DefineTheKMeansKernels">Define the k-means Kernels</a></li>
 <li><a href="#DefineTheKMeanscudaFlow">Define the k-means cudaFlow</a></li>
 <li><a href="#RepeatTheExecutionofTheKMeanscudaFlow">Repeat the Execution of the k-means cudaFlow</a></li>
-<li><a href="#KMeansBenchmarking">Benchmarking</a></li>
+<li><a href="#KMeanscudaFlowBenchmarking">Benchmarking</a></li>
 </ul>
 </div>
 <p>Following up on <a href="kmeans.html" class="m-doc">k-means Clustering</a>, this page studies how to accelerate a k-means workload on a GPU using <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a>.</p><section id="DefineTheKMeansKernels"><h2><a href="#DefineTheKMeansKernels">Define the k-means Kernels</a></h2><p>Recall that the k-means algorithm has the following steps:</p><ul><li>Step 1: initialize k random centroids</li><li>Step 2: for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it</li><li>Step 3: for every centroid, move the centroid to the average of the points assigned to that centroid</li><li>Step 4: go to Step 2 until converged (no more changes in the last few iterations) or maximum iterations reached</li></ul><p>We observe Step 2 and Step 3 of the algorithm are parallelizable across individual points for use to harness the power of GPU:</p><ol><li>for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it</li><li>for every centroid, move the centroid to the average of the points assigned to that centroid.</li></ol><p>At a fine-grained level, we request one GPU thread to work on one point for Step 2 and one GPU thread to work on one centroid for Step 3.</p><pre class="m-code"><span class="c1">// px/py: 2D points</span>
@@ -870,7 +870,7 @@ <h3>Contents</h3>
 </g>
 </g>
 </svg>
-</div><p>We can see from the above taskflow the condition task is removed.</p></section><section id="KMeansBenchmarking"><h2><a href="#KMeansBenchmarking">Benchmarking</a></h2><p>We run three versions of k-means, sequential CPU, parallel CPUs, and one GPU, on a machine of 12 Intel i7-8700 CPUs at 3.20 GHz and a Nvidia RTX 2080 GPU using various numbers of 2D point counts and iterations.</p><table class="m-table"><thead><tr><th>N</th><th>K</th><th>M</th><th>CPU Sequential</th><th>CPU Parallel</th><th>GPU (conditional taksing)</th><th>GPU (using offload_n)</th></tr></thead><tbody><tr><td>10</td><td>5</td><td>10</td><td>0.14 ms</td><td>77 ms</td><td>1 ms</td><td>1 ms</td></tr><tr><td>100</td><td>10</td><td>100</td><td>0.56 ms</td><td>86 ms</td><td>7 ms</td><td>1 ms</td></tr><tr><td>1000</td><td>10</td><td>1000</td><td>10 ms</td><td>98 ms</td><td>55 ms</td><td>13 ms</td></tr><tr><td>10000</td><td>10</td><td>10000</td><td>1006 ms</td><td>713 ms</td><td>458 ms</td><td>183 ms</td></tr><tr><td>100000</td><td>10</td><td>100000</td><td>102483 ms</td><td>49966 ms</td><td>7952 ms</td><td>4725 ms</td></tr></tbody></table><p>When the number of points is larger than 10K, both parallel CPU and GPU implementations start to pick up the speed over than the sequential version. We can see that using the built-in predicate, <a href="classtf_1_1cudaFlow.html#ac2269fd7dc8ca04a294a718204703dad" class="m-doc">tf::<wbr />cudaFlow::<wbr />offload_n</a>, can avoid repetitively creating the graph over and over, resulting in two times faster than conditional tasking.</p></section>
+</div><p>We can see from the above taskflow the condition task is removed.</p></section><section id="KMeanscudaFlowBenchmarking"><h2><a href="#KMeanscudaFlowBenchmarking">Benchmarking</a></h2><p>We run three versions of k-means, sequential CPU, parallel CPUs, and one GPU, on a machine of 12 Intel i7-8700 CPUs at 3.20 GHz and a Nvidia RTX 2080 GPU using various numbers of 2D point counts and iterations.</p><table class="m-table"><thead><tr><th>N</th><th>K</th><th>M</th><th>CPU Sequential</th><th>CPU Parallel</th><th>GPU (conditional taksing)</th><th>GPU (using offload_n)</th></tr></thead><tbody><tr><td>10</td><td>5</td><td>10</td><td>0.14 ms</td><td>77 ms</td><td>1 ms</td><td>1 ms</td></tr><tr><td>100</td><td>10</td><td>100</td><td>0.56 ms</td><td>86 ms</td><td>7 ms</td><td>1 ms</td></tr><tr><td>1000</td><td>10</td><td>1000</td><td>10 ms</td><td>98 ms</td><td>55 ms</td><td>13 ms</td></tr><tr><td>10000</td><td>10</td><td>10000</td><td>1006 ms</td><td>713 ms</td><td>458 ms</td><td>183 ms</td></tr><tr><td>100000</td><td>10</td><td>100000</td><td>102483 ms</td><td>49966 ms</td><td>7952 ms</td><td>4725 ms</td></tr></tbody></table><p>When the number of points is larger than 10K, both parallel CPU and GPU implementations start to pick up the speed over than the sequential version. We can see that using the built-in predicate, <a href="classtf_1_1cudaFlow.html#ac2269fd7dc8ca04a294a718204703dad" class="m-doc">tf::<wbr />cudaFlow::<wbr />offload_n</a>, can avoid repetitively creating the graph over and over, resulting in two times faster than conditional tasking.</p></section>
 </div>
 </div>
 </div>
