Ensemble Learning Techniques: A Walkthrough with Random … – KDnuggets

A practical walkthrough for random forests in Python.
Machine learning models have become an integral component of decision-making across multiple industries, yet they often encounter difficulty when dealing with noisy or diverse data sets. That is where Ensemble Learning comes into play.
This article will demystify ensemble learning and introduce you to its powerful random forest algorithm. No matter if you are a data scientist looking to hone your toolkit or a developer looking for practical insights into building robust machine learning models, this piece is meant for everyone!
By the end of this article, you will gain a thorough knowledge of Ensemble Learning and how Random Forests in Python work. So whether you are an experienced data scientist or simply curious to expand your machine-learning abilities, join us on this adventure and advance your machine-learning expertise!
 
 
Ensemble learning is a machine learning approach in which predictions from multiple weak models are combined with each other to get stronger predictions. The concept behind ensemble learning is decreasing the bias and errors from single models by leveraging the predictive power of each model. 
To have a better example let’s take a life example imagine that you have seen an animal and you do not know what species this animal belongs to. So instead of asking one expert, you ask ten experts and you will take the vote of the majority of them. This is known as hard voting
Hard voting is when we take into account the class predictions for each classifier and then classify an input based on the maximum votes to a particular class. On the other hand, soft voting is when we take into account the probability predictions for each class by each classifier and then classify an input to the class with maximum probability based on the average probability (averaged over the classifier’s probabilities) for that class.
 
 
Ensemble learning is always used to improve the model performance which includes improving the classification accuracy and decreasing the mean absolute error for regression models. In addition to this ensemble learners always yield a more stable model. Ensemble learners work at their best when the models are not correlated then every model can learn something unique and work on improving the overall performance. 
 
 
Although ensemble learning can be applied in many ways, however when it comes to applying it to practice there are three strategies that have gained a lot of popularity due to their easy implementation and usage. These three strategies are:
Let’s dive deeper into each of these strategies and see how we can use Python to train these models on our dataset. 
 
 
Bagging takes random samples of data, and uses learning algorithms and the mean to find bagging probabilities; also known as bootstrap aggregating; it aggregates results from multiple models to get one broad outcome.
This approach involves:
Scikit-learn provides us with the ability to implement both a BaggingClassifier and BaggingRegressor. A BaggingMetaEstimator identifies random subsets of an original dataset to fit each base model, then aggregates individual base model predictions?—?either through voting or averaging?—?into a final prediction by aggregating individual base model predictions into an aggregate prediction using voting or averaging. This method reduces variance by randomizing their construction process.
Let’s take an example in which we use the bagging estimator using scikit learn:
 
The bagging classifier takes into consideration several parameters:
Now we will fit this classifier on the training set and score it.
 
We can do the same for regression tasks, the difference will be that we will be using regression estimators instead.
 
 
Stacking is a technique for combining multiple estimators in order to minimize their biases and produce accurate predictions. Predictions from each estimator are then combined and fed into an ultimate prediction meta-model trained through cross-validation; stacking can be applied to both classification and regression problems.
 

Ensemble Learning Techniques: A Walkthrough with Random Forests in Python
Stacking ensemble learning

 
Stacking occurs in the following steps:
In this example below, we begin by creating two base classifiers (RandomForestClassifier and GradientBoostingClassifier) and one meta-classifier (LogisticRegression) and use K-fold cross-validation to use predictions from these classifiers on training data (iris dataset) for input features for our meta-classifier (LogisticRegression). 
After using K-fold cross-validation to make predictions from the base classifiers on test data sets as input features for our meta-classifier, predictions on test sets using both sets together and evaluate their accuracy against their stacked ensemble counterparts.
 
 
Boosting is a machine learning ensemble technique that reduces bias and variance by turning weak learners into strong learners. These weak learners are applied sequentially to the dataset; firstly by creating an initial model and fitting it to the training set. Once errors from the first model have been identified, another model is designed to correct them.
There are popular algorithms and implementations for boosting ensemble learning techniques. Let’s explore the most famous ones. 
 
 
AdaBoost is an effective ensemble learning technique, that employs weak learners sequentially for training purposes. Each iteration prioritizes incorrect predictions while decreasing weight assigned to correctly predicted instances; this strategic emphasis on challenging observations compels AdaBoost to become increasingly accurate over time, with its ultimate prediction determined by aggregating majority votes or weighted sum of its weak learners.
AdaBoost is a versatile algorithm suitable for both regression and classification tasks, but here we focus on its application to classification problems using Scikit-learn. Let’s look at how we can use it for classification tasks in the example below:
 
In this example, we used the AdaBoostClassifier from scikit learn and set the n_estimators to 100. The default learn is a decision tree and you can change it. In addition to this, the parameters of the decision tree can be tuned.
 
 
eXtreme Gradient Boosting or is more popularly known as XGBoost, is one of the best implementations of boosting ensemble learners due to its parallel computations which makes it very optimized to run on a single computer. XGBoost is available to use through the xgboost package developed by the machine learning community.
 
 
LightGBM is another gradient-boosting algorithm that is based on tree learning. However, it is unlike other tree-based algorithms in that it uses leaf-wise tree growth which makes it converge faster. 
 
Ensemble Learning Techniques: A Walkthrough with Random Forests in Python
Leaf-wise tree growth / Image by LightGBM

 
In the example below we will apply LightGBM to a binary classification problem:
 
Ensemble learning and random forests are powerful machine learning models that are always used by machine learning practitioners and data scientists. In this article, we covered the basic intuition behind them, when to use them, and finally, we covered the most popular algorithms of them and how to use them in Python.  
 
 
 
 
Youssef Rafaat is a computer vision researcher & data scientist. His research focuses on developing real-time computer vision algorithms for healthcare applications. He also worked as a data scientist for more than 3 years in the marketing, finance, and healthcare domain.
 
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.

By subscribing you accept KDnuggets Privacy Policy
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.


By subscribing you accept KDnuggets Privacy Policy
Subscribe To Our Newsletter
(Get The Complete Collection of Data Science Cheat Sheets & Great Big NLP Primer ebook)
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.
By subscribing you accept KDnuggets Privacy Policy
Get the FREE ebook ‘The Great Big Natural Language Processing Primer’ and ‘The Complete Collection of Data Science Cheat Sheets’ along with the leading newsletter on Data Science, Machine Learning, AI & Analytics straight to your inbox.
By subscribing you accept KDnuggets Privacy Policy

source

Leave a Comment