João Pedro
Towards Data Science
Apache Spark is one of the main tools for data processing and analysis in the big data context. It's a very complete (and complex) data processing framework, with functionalities that can be roughly divided into four groups: Spark SQL & DataFrames, for general-purpose data processing; Spark Structured Streaming, for handling data streams; Spark MLlib, for machine learning and data science; and GraphX, the graph processing API.
I've already featured the first two in other posts: creating an ETL process for a Data Warehouse and integrating Spark and Kafka for stream processing. Today it's time for the third one: let's play with machine learning using Spark MLlib.
Machine learning has a special place in my heart because it was my entrance door to the data science field and, like probably many of you, I started with the classic Scikit-Learn library.
I could write an entire post on why Scikit-learn is such a marvelous piece of software: it's beginner-friendly, easy to use, covers most of the machine learning cycle, has very well-written documentation, and so on.
But why am I talking about this? If you are like me, used to coding with sklearn, keep in mind that the path in Apache Spark is not so straightforward. It's not hard, but it has a steeper learning curve.
Throughout this post, we'll learn how to run the "full machine learning cycle" of data preprocessing, feature engineering, model training, and validation with a hands-on example.
Apache Spark is a distributed memory-based data transformation engine. It is geared to operate in distributed environments to parallelize processing between machines, achieving high-performance transformations by using its lazy evaluation philosophy and query optimizations.
And that’s the main reason to learn such a tool — Performance.
Even with optimizations, the sklearn package (and other Python packages) struggles when the dataset gets too big. That's one of the blind spots Spark covers: since it scales horizontally, it's easier to increase the computational power available to train models on big data.
I've chosen the Avocado Prices dataset for this project. Our task is to predict the average avocado price given the avocado type, the date, the number of avocado bags available, and other features. See the dataset page on Kaggle for more information.
All you need is docker and docker-compose installed. The code is available on GitHub.
The architecture described in the docker-compose.yaml is depicted in the image below.
All the code was developed inside a jupyter/pyspark-notebook container with all the pyspark dependencies already configured.
To start the environment, just run:
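The exact command depends on your Docker setup, but with a standard docker-compose installation it should be something like:

```bash
# Spin up the containers defined in docker-compose.yaml (detached mode)
docker-compose up -d
```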
Our objective is to learn how to implement our usual ML pipeline using Spark, covering: loading the data and splitting it into train/test sets, cleaning the data, preprocessing and feature engineering, model definition, hyperparameter tuning, and final scoring.
The following sections will detail how to make each of these steps.
The first thing to do is connect to the Spark cluster, which is quite straightforward.
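A minimal sketch of the connection; the master URL and app name below are assumptions based on the docker-compose setup:

```python
from pyspark.sql import SparkSession

# Connect to the Spark cluster. The master URL is an assumption about the compose file;
# a local session (.master("local[*]")) would also work inside the notebook container.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("avocado-price-regression")
    .getOrCreate()
)
```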
It may take a few seconds in the first run.
Now it’s time to work with the data. This part still has nothing to do with Spark’s MLlib package, just the usual data loading using Spark SQL.
As Spark is lazily evaluated, it’s interesting to cache the dataset in memory to speed up the execution of the next steps.
Let’s have a look at the data:
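Something along these lines, assuming the CSV was copied into the container (the file path and name are assumptions):

```python
# Read the Avocado Prices CSV, letting Spark infer the schema.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/avocado.csv")
)

# Cache the DataFrame to avoid re-reading the file on every action.
df = df.cache()

# Peek at the first rows.
df.show(5)
```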
Again, more details about the columns can be found on the original dataset’s Kaggle page.
Let's move on and split the DataFrame into train (75%) and test (25%) sets using the randomSplit() method.
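A sketch of the split (the seed is arbitrary, just for reproducibility):

```python
# 75% for training, 25% for testing.
train_df, test_df = df.randomSplit([0.75, 0.25], seed=42)
```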
Before continuing, let's get to know the tools we'll be using. The Spark MLlib package has two main types of objects: Transformers and Estimators.
A Transformer is an object able to transform DataFrames: it receives a raw DataFrame and returns a processed one. Common transformers include PolynomialExpansion, SQLTransformer, and VectorAssembler (very important, discussed later).
Estimators, on the other hand, are objects that need to be fitted/trained on the data to generate a Transformer. These include machine learning predictors (Linear Regression, Logistic Regression, Decision Trees, etc), dimensionality reduction algorithms (PCA, ChiSquare Selector), and also other column transformers (StandardScaler, MinMaxScaler, TF-IDF, etc).
Let's start by using the SQLTransformer on our data. This is a powerful transformer that allows one to select and transform columns using SQL queries.
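A sketch of the kind of statement used here; the column names follow the Kaggle dataset, and the exact set of transformed columns is an assumption:

```python
from pyspark.ml.feature import SQLTransformer

sql_transformer = SQLTransformer(
    statement="""
        SELECT
            AveragePrice,
            type,
            LOG(`Total Volume`)   AS log_total_volume,
            LOG(`Total Bags` + 1) AS log_total_bags,
            year - 2000           AS year_index,
            MONTH(Date)           AS month
        FROM __THIS__
    """
)

transformed_df = sql_transformer.transform(train_df)
```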
The code above selects the AveragePrice and type columns, transforms the numerical columns using the Log function, and creates two new columns extracting the year (after 2000) and the month.
__THIS__ is the placeholder that refers to the DataFrame currently being transformed.
The result:
Scaling is a prevalent practice in data preprocessing. Let's scale the month column using the min-max scaling technique, putting all values in the [0, 1] interval. The MinMaxScaler is an estimator, so it needs to be fitted to the data before being used to transform it (a sketch is shown after the VectorAssembler below).
Most of the estimators (including all the prediction models) require the input columns to be in Vector form. Vector is a special column type used mostly in Spark MLlib. It is just what the name suggests, a fixed-size array of numbers.
To join columns into a single Vector column, we use the VectorAssembler Transformer.
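Putting the two pieces together, a sketch of how the month column could be assembled and scaled (column names are assumptions carried over from the previous snippet):

```python
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Wrap the "month" column in a single-element vector, since MinMaxScaler
# (like most estimators) expects a Vector input column.
month_assembler = VectorAssembler(inputCols=["month"], outputCol="month_vec")
assembled_df = month_assembler.transform(transformed_df)

# MinMaxScaler is an estimator: fit it first, then use the fitted model to transform.
month_scaler = MinMaxScaler(inputCol="month_vec", outputCol="month_scaled")
scaled_df = month_scaler.fit(assembled_df).transform(assembled_df)
```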
The image below details the process.
The result:
With these concepts in mind, it's just a matter of knowing the available transformers and using them in our pipeline.
For example, the column type has two values, “conventional” and “organic”, that need to be mapped into numbers. The transformer responsible for this is the StringIndexer.
It assigns a numerical value to each category in a column. As the "type" column has only two categories, it will be transformed into a column with only two values, 0 and 1, which in this binary case is equivalent to applying the one-hot encoding technique.
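A sketch of the indexing step:

```python
from pyspark.ml.feature import StringIndexer

# StringIndexer is an estimator: it learns the category-to-index mapping from the data.
type_indexer = StringIndexer(inputCol="type", outputCol="type_index")
indexed_df = type_indexer.fit(scaled_df).transform(scaled_df)
```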
From now on I’ll summarize what was done.
The generated numerical features (all columns except type_index) were assembled into a single vector called "features_num", and this vector is then passed through a StandardScaler.
The categorical column "type_index" is then appended to form the final feature vector.
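In code, these two steps could look like the sketch below (column names carried over from the previous snippets). The stages are only defined here; they are fitted as part of the pipeline in the next step.

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Assemble the numerical columns into a single vector and standardize it.
num_assembler = VectorAssembler(
    inputCols=["log_total_volume", "log_total_bags", "year_index", "month_scaled"],
    outputCol="features_num",
)
num_scaler = StandardScaler(inputCol="features_num", outputCol="features_num_scaled")

# Append the categorical index to build the final feature vector.
final_assembler = VectorAssembler(
    inputCols=["features_num_scaled", "type_index"],
    outputCol="features",
)
```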
The final step is to join all the transformers created into a pipeline. A pipeline is just an object that encapsulates a set of transformers and estimators and applies them to the data sequentially. This saves us from dealing with each intermediate transformation step individually (as we've been doing so far).
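A sketch of the pipeline, reusing the (assumed) stage names from the previous snippets:

```python
from pyspark.ml import Pipeline

preprocessing_pipeline = Pipeline(stages=[
    sql_transformer,
    month_assembler,
    month_scaler,
    type_indexer,
    num_assembler,
    num_scaler,
    final_assembler,
])

# Fitting the pipeline fits every estimator stage and returns a single PipelineModel.
preprocessing_model = preprocessing_pipeline.fit(train_df)
train_features_df = preprocessing_model.transform(train_df)
```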
The final result:
That's the moment we've all been waiting for.
After the long path of data preprocessing, all the features are already in their desired final vector form and we’re ready to train the model.
Unfortunately, this part will be very short compared with the previous one ¯\_(ツ)_/¯
As mentioned earlier, ML models are just estimators, so the process repeats: Instantiate, Fit and Transform.
Let’s train a Linear Regression model:
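A sketch, using the column names from the preprocessing steps above:

```python
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(
    featuresCol="features",       # input feature vector
    labelCol="AveragePrice",      # target column
    predictionCol="prediction",   # name of the output column
)

# Fit on the preprocessed training data and add the prediction column.
lr_model = lr.fit(train_features_df)
train_predictions_df = lr_model.transform(train_features_df)
```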
It's necessary to specify the features column, the target/label column, and a name for the prediction column. Just like the other estimators we’ve met, an ML model will just add another column to the DataFrame.
See the result below:
To measure the model's performance, we need an evaluator. Its name is self-explanatory: it computes a performance metric between the real labels and the model's predictions.
In the cell below, a RegressionEvaluator is instantiated to measure the RMSE (Root Mean Squared Error) between the predictions and the real values (on the train data).
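A sketch of the evaluator (column names as above):

```python
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol="AveragePrice",
    metricName="rmse",
)

# RMSE of the model on the training data.
train_rmse = evaluator.evaluate(train_predictions_df)
print(f"Train RMSE: {train_rmse:.4f}")
```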
Hyperparameter Tuning is one of the last stages in a Machine Learning Pipeline, so our adventure is coming to an end.
In this step, we test several variations of hyperparameters of our model/pipeline to pick the best one according to the chosen metric. A common way of doing this is using a cross-validation technique.
Here we meet the last building blocks of today's post: the ParamGridBuilder and the CrossValidator.
Starting with the ParamGridBuilder: It is the object used to build a hyperparameter grid.
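The hyperparameter values below are illustrative, not necessarily the ones used in the original notebook:

```python
from pyspark.ml.tuning import ParamGridBuilder

# Note how the parameters are referenced through the original lr object.
param_grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build()
)
```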
In the code above, several values are specified for the linear regression's regParam and elasticNetParam. It's important to note that the original estimator object is used to reference its parameters.
The CrossValidator then joins everything together (estimator, hyperparameter grid, and evaluator) and performs cross-validation with the given number of folds when the fit() method is called with the training data.
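A sketch with three folds (the fold count is an assumption):

```python
from pyspark.ml.tuning import CrossValidator

cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
)

# Runs the cross-validation and refits the best model on the full training data.
cv_model = cv.fit(train_features_df)
```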
The results are accessed through the fitted CrossValidatorModel object. The code below prints the best model's name and score.
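Something along these lines (avgMetrics holds one cross-validated score per parameter combination):

```python
# Lower RMSE is better, so the best score is the minimum of the averaged metrics.
best_score = min(cv_model.avgMetrics)
best_model = cv_model.bestModel

print(type(best_model).__name__, best_score)
```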
Let’s also see the linear regression’s best parameters.
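A sketch of how to inspect them (the getters below assume Spark 3+, where models expose their params):

```python
# Hyperparameters chosen for the best linear regression model.
print("regParam:", best_model.getRegParam())
print("elasticNetParam:", best_model.getElasticNetParam())
```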
Output:
And it all comes down to this moment: the step where we measure the performance of the best model on the test data.
Fortunately, there is nothing new to learn here; it's just a matter of applying the best model to the test data and passing the result to the evaluator.
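A sketch, reusing the fitted preprocessing pipeline and the evaluator from before:

```python
# Apply the same fitted preprocessing to the test set, then score the best model.
test_features_df = preprocessing_model.transform(test_df)
test_predictions_df = cv_model.bestModel.transform(test_features_df)

test_rmse = evaluator.evaluate(test_predictions_df)
print(f"Test RMSE: {test_rmse:.4f}")
```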
The performance was very similar to the one obtained in the cross-validation step.
As ML applications become popular and their requirements become more complex, knowledge of a wider variety of tools with different purposes becomes essential.
In this post, we learned a little about how Apache Spark can be used in the context of Machine Learning through the Spark MLlib module. With a hands-on project, we created a generic ML pipeline, covering the main concepts and basic topics of this module.
Learning a new tool mainly involves becoming familiar with its vocabulary, i.e., understanding the fundamental parts that compose it and how they can be used to solve a problem. Therefore, we focused on understanding the basics of Spark MLlib: Estimators, Transformers, Evaluators, and Pipelines.
I hope this brief post helped you understand how Spark can be used in Machine Learning applications.
As always, this post just scratches the surface of the topics explored, so I strongly encourage further reading, see the references below.
Thank you for reading 😉
All the code is available in this GitHub repository.
Data used — Avocado Prices, ODbL v1.0: Open database, Kaggle.
[1] Chambers, B., & Zaharia, M. (2018). Spark: The Definitive Guide: Big Data Processing Made Simple. O'Reilly Media, Inc.
[2] Spark by examples — https://sparkbyexamples.com/
[3] Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.
[4] Overview: estimators, transformers and pipelines — spark.ml. Spark Official documentation.
Bachelor of IT at UFRN. Graduate of BI at UFRN — IMD. Strongly interested in Machine Learning, Data Science and Data Engineering.