Integrating MLflow with Apache Spark
Apache Spark is a highly scalable and popular big data framework for processing data at large scale. For more details and documentation, please go to https://spark.apache.org/. As a big data tool, it can be used to speed up parts of your ML workflow, since it can be applied at either the training or the inference stage.
In this particular case, we will illustrate how to use the model developed in the previous section on the Databricks environment to scale the batch-inference job to larger amounts of data.
In order to explore the Spark integration with MLflow, we will execute the following steps:
- Create a new notebook named inference_job_spark in Python, linking it to a running cluster where the bitpred_poc.ipynb notebook was just created.
- Upload your data to dbfs through the File/Upload data link in the environment.
- Execute the following script in a cell of the notebook, changing the logged_model and df filenames for the ones...