First, we need to build the Spark library that will execute the timeline. Apache Spark is a platform for massively parallel data processing. Although we are running this on a single file, Spark is designed to work on thousands of files distributed across many machines. Explaining HDFS, Hive, and Spark is beyond the scope of this tutorial, but for large datasets it's important to understand these concepts, and to know that ReAgent can run in a distributed environment simply by changing the input location from a local file to an HDFS folder.
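To illustrate the last point, here is a minimal PySpark sketch (not part of ReAgent itself; the file paths and application name are hypothetical) showing that switching between a single local file and a distributed HDFS folder is just a change of input URI:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timeline-example").getOrCreate()

# Single-machine case: read one local JSON file (hypothetical path).
df_local = spark.read.json("file:///tmp/training_data.json")

# Distributed case: point at an HDFS folder instead. Spark reads every
# file in the folder and distributes the work across the cluster; the
# rest of the pipeline code is unchanged.
df_hdfs = spark.read.json("hdfs:///user/me/training_data/")
```

The same principle applies to the timeline job's input: only the location string changes, not the processing logic.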