## Ensemble Learning

Ensemble learning is a machine learning paradigm in which multiple learners are trained on, or applied to, a dataset to solve the same problem, and their individual predictions are combined into one composite prediction. "It is a process that uses a set of models, each of them obtained by applying a learning process to a given problem. This set of models (ensemble) is integrated in some way to obtain the final prediction" (Moreira et al. 2012, 3).

In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods construct a set of hypotheses and combine them for use.

Ensemble learning is a powerful way to improve the performance of a model, and it usually pays off to apply it on top of the various individual models you might be building. Time and again, competitors on platforms such as Kaggle have used ensemble models and benefited from them (Analytics Vidhya, 2018).

Ensemble learning is also known as committee-based learning, multiple classifier systems, or classifier combination.

### HISTORY OF ENSEMBLE LEARNING

It is hard to point out exactly when ensemble methods started, since the basic idea of deploying multiple models has been in use for a long time, but it is clear that the wave of research on ensemble learning that began in the 1990s owes much to two works. The first is an applied study conducted by Hansen and Salamon at the end of the 1980s, which showed that predictions made by a combination of classifiers are often more accurate than predictions made by the best single classifier (Hansen & Salamon, 1990). The second is a theoretical study from 1989, in which Schapire proved that weak learners can be boosted into strong learners; the proof resulted in Boosting, one of the most influential ensemble methods (Schapire, 1990).

### WHAT IS MACHINE LEARNING

Machine learning is simply the semi-automated extraction of knowledge from data. It is a subfield of artificial intelligence (AI) that uses statistical techniques to give computers the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed (Samuel, 1959).

According to Tom M. Mitchell, "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E" (Mitchell, T. 1997).

### MACHINE LEARNING CLASSIFICATION

Machine learning tasks are typically classified into two broad categories, depending on whether there is a learning “signal” or “feedback” available to a learning system:

**1. Supervised learning:** This is also known as predictive modelling, i.e., making predictions using data. An example is determining whether a given email is "spam" or "ham". There is an outcome we are trying to predict.

The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs. As special cases, the input signal can be only partially available, or restricted to special feedback. Supervised learning includes the following subtypes:

a. Semi-supervised learning: the computer is given only an incomplete training signal: a training set with some (often many) of the target outputs missing.

b. Active learning: the computer can only obtain training labels for a limited set of instances (based on a budget), and also has to optimize its choice of objects to acquire labels for.

c. Reinforcement learning: training data (in form of rewards and punishments) is given only as feedback to the program’s actions in a dynamic environment, such as driving a vehicle or playing a game against an opponent.

**2. Unsupervised learning:** This is the process of extracting structure from data, or learning the best way to represent data; there is no "right answer". An example is segmenting all college students into clusters that exhibit similar behaviours.

No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

### ENSEMBLE THEORY

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.

Experimentally, ensembles seem to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity.

**Ensemble Methods**

Ensemble methods are mathematical procedures that start with a set of base learner models. Multiple forecasts based on the different base learners are constructed and combined into an enhanced composite model superior to the individual base models. The final composite model provides better prediction accuracy than the average of the individual base models' predictions. This integration of good individual models into one improved composite model generally leads to higher accuracy.

### TYPES OF ENSEMBLES

**Bayes optimal classifier**

The Bayes Optimal Classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it. The Naive Bayes Optimal Classifier is a version of this that assumes the data is conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To accommodate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes Optimal Classifier can be expressed with the following equation:
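In its standard form (as given in the ensemble-learning literature, with $C$ the set of classes, $H$ the hypothesis space, and $T$ the training data), the equation reads:

$$y = \underset{c_j \in C}{\operatorname{argmax}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)$$

Here $P(T \mid h_i)$ is the likelihood vote of hypothesis $h_i$ and $P(h_i)$ its prior, matching the two weighting factors described above.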

**Bootstrap aggregating (bagging)**

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.

Bagging fits similar learners on small bootstrap samples of the population and then takes a mean of all the predictions. In generalized bagging, you can use different learners on different sub-populations. As you would expect, this helps to reduce the variance error.

*Figure: Bagging (Source: Analytics Vidhya)*
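As a concrete sketch (assuming scikit-learn is available; the synthetic dataset and parameter values here are illustrative, not from the text), bagging with its default decision-tree base learner looks like this:

```python
# Minimal bagging sketch with scikit-learn (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# A small synthetic classification problem stands in for real data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 10 base learners (decision trees by default) is trained on a
# bootstrap sample of the training set; predictions are combined by vote.
bagging = BaggingClassifier(n_estimators=10, random_state=0)
bagging.fit(X_train, y_train)
score = bagging.score(X_test, y_test)
```

Because every model votes with equal weight, variance is reduced without changing the bias of the base learner.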

**Boosting**

Boosting is an iterative technique which adjusts the weight of each observation based on the previous classification. If an observation was classified incorrectly, boosting increases its weight, and vice versa. Boosting generally decreases the bias error and builds strong predictive models; however, boosted models may sometimes overfit the training data.

*Figure: Boosting (Source: Analytics Vidhya)*
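The reweighting idea above can be sketched with scikit-learn's AdaBoost implementation (an assumption; the dataset and the number of rounds are illustrative):

```python
# AdaBoost sketch: each round reweights the training observations so that
# misclassified points gain weight and the next weak learner focuses on them.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 boosting rounds over the default weak learner (a decision stump).
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
score = boost.score(X_test, y_test)
```

Monitoring `score` on held-out data, as done here, is one way to catch the overfitting the text warns about.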

**Bayesian parameter averaging**

Bayesian parameter averaging (BPA) is an ensemble technique that seeks to approximate the Bayes Optimal Classifier by sampling hypotheses from the hypothesis space, and combining them using Bayes’ law (which describes the probability of an event, based on prior knowledge of conditions that might be related to the event). Unlike the Bayes optimal classifier, Bayesian model averaging (BMA) can be practically implemented. Despite the theoretical correctness of this technique, early work showed experimental results suggesting that the method promoted over-fitting and performed worse compared to simpler ensemble techniques such as bagging; however, these conclusions appear to be based on a misunderstanding of the purpose of Bayesian model averaging vs. model combination.

**Bayesian model combination**

Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. The results from BMC have been shown to be better on average (with statistical significance) than BMA, and bagging.

**Bucket of models**

A “bucket of models” is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.

The most common approach used for model-selection is cross-validation selection (sometimes called a “bake-off contest”). It is described with the following pseudo-code:

```
For each model m in the bucket:
    Do c times (where 'c' is some constant):
        Randomly divide the training dataset into two datasets: A and B
        Train m with A
        Test m with B
Select the model that obtains the highest average score
```

Cross-Validation Selection can be summed up as: “try them all with the training set, and pick the one that works best”.
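The bake-off can be sketched in Python as follows (a minimal sketch assuming scikit-learn; the bucket contents, `c = 5`, and the 70/30 split are illustrative choices, not from the text):

```python
# Cross-validation selection: repeatedly split the data, score every model
# in the bucket, and keep the one with the highest average score.
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

bucket = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

c = 5  # number of random splits per model
scores = {}
for name, model in bucket.items():
    trials = []
    for i in range(c):
        # Randomly divide the dataset into A (train) and B (test).
        A_X, B_X, A_y, B_y = train_test_split(X, y, test_size=0.3, random_state=i)
        model.fit(A_X, A_y)                    # Train m with A
        trials.append(model.score(B_X, B_y))   # Test m with B
    scores[name] = mean(trials)

# Select the model that obtains the highest average score.
best = max(scores, key=scores.get)
```

The same splits (fixed by `random_state=i`) are reused for every model, so the comparison is fair.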

Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide which of the models in the bucket is best-suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the “best” model, or it can be used to give a linear weight to the predictions from each model in the bucket.

When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best. (Wikipedia, 2018).

**Stacking**

Stacking is a very interesting way of combining models: a learner is used to combine the output of different learners. This can lead to a decrease in either the bias or the variance error, depending on the combining learner we use.

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a logistic regression model is often used as the combiner.

Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression, classification and distance learning) and unsupervised learning (density estimation). It has also been used to estimate bagging’s error rate. It has been reported to out-perform Bayesian model-averaging. The two top-performers in the Netflix competition utilized blending, which may be considered to be a form of stacking.
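The two-level scheme described above can be sketched with scikit-learn's stacking implementation (an assumption; the base learners and dataset are illustrative), using logistic regression as the combiner, the common choice noted in the text:

```python
# Stacking sketch: level-0 learners produce predictions, and a logistic
# regression combiner learns how to weight them into a final prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
score = stack.score(X_test, y_test)
```

Internally, scikit-learn trains the combiner on out-of-fold predictions of the base learners, which keeps the level-1 training data honest.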

**Voting**

Voting is perhaps the simplest ensemble algorithm, and is often very effective. It can be used for classification or regression problems.

Voting works by creating two or more sub-models. Each sub-model makes predictions which are combined in some way, such as by taking the mean or the mode of the predictions, allowing each sub-model to vote on what the outcome should be.
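A minimal voting sketch, assuming scikit-learn (the three sub-models chosen here are illustrative):

```python
# Hard voting: each sub-model casts one vote and the majority class wins.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier())],
    voting="hard",   # mode of the predictions; "soft" would average probabilities
)
vote.fit(X_train, y_train)
score = vote.score(X_test, y_test)
```

For regression the same idea applies with the mean of the sub-models' predictions instead of the mode.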

### COMMON APPROACHES TO ENSEMBLE METHODS

The ensemble learning process can be broken into different stages depending on the application and the approach implemented. We choose to categorize the learning process into three steps following (Roli et al. 2001):

a. Ensemble generation,

b. Ensemble pruning and

c. Ensemble integration.

In the ensemble generation phase, a number of base learner models are generated according to a chosen learning procedure, to be used to predict the final output. In the ensemble pruning step, some base models are filtered out using various mathematical procedures to improve the overall ensemble accuracy. In the ensemble integration phase, the remaining learner models are combined intelligently to form one unified prediction that is more accurate than the average of the individual base models.
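The three stages can be sketched end to end (a minimal sketch assuming scikit-learn and NumPy; the bootstrap generation, the accuracy threshold, and majority-vote integration are illustrative choices, not prescribed by the text):

```python
# Three-stage ensemble sketch: generation, pruning, integration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# 1. Generation: train base learners on bootstrap samples.
rng = np.random.default_rng(0)
models = []
for _ in range(15):
    idx = rng.integers(0, len(X_train), len(X_train))
    m = DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx])
    models.append(m)

# 2. Pruning: keep only learners that beat a validation-accuracy threshold.
kept = [m for m in models if m.score(X_val, y_val) > 0.5]
if not kept:          # fall back to the full ensemble if pruning removes all
    kept = models

# 3. Integration: combine the survivors by majority vote.
votes = np.stack([m.predict(X_val) for m in kept])
final = np.round(votes.mean(axis=0)).astype(int)
accuracy = (final == y_val).mean()
```

Real pruning criteria (e.g. diversity-based selection) are more sophisticated than this single accuracy cutoff, but the pipeline shape is the same.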

### IMPLEMENTATIONS IN STATISTICS PACKAGES

• R: at least three packages offer Bayesian model averaging tools, including the BMS (an acronym for Bayesian Model Selection) package, the BAS (an acronym for Bayesian Adaptive Sampling) package, and the BMA package. The H2O-package offers a lot of machine learning models including an ensembling model, which can also be trained using Spark.

• Python: Scikit-learn, a package for machine learning in Python offers packages for ensemble learning including packages for bagging and averaging methods.

• MATLAB: classification ensembles are implemented in Statistics and Machine Learning Toolbox.

• Weka: a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from your own Java code. It features machine learning, data mining, preprocessing, classification, regression, clustering, association rules, visualization etc.

• TensorFlow: designed by Google, it is an open-source software library used for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google, often replacing its closed-source predecessor, DistBelief.

**Written by Onwuka Ugochukwu C.**

**REFERENCES**

Analytics Vidhya (July 2018). Basics of Ensemble Learning Explained in Simple Terms. Retrieved from https://www.analyticsvidhya.com/blog/2015/08/introduction-ensemble-learning.

Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10), 993–1001.

Mitchell, T. (1997). Machine Learning. McGraw Hill. p. 2. ISBN 0-07-042807-7.

Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers". IBM Journal of Research and Development, 3(3), 210–229.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.

Scott Fortmann-Roe (June 2012). Understanding the Bias-Variance Tradeoff. Retrieved from http://scott.fortmann-roe.com/docs/BiasVariance.html.

Wikipedia (July 2018). Ensemble Learning. Retrieved from wikipedia.org/ensemble-learning.

Zhi-Hua Zhou (2013). National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China.