An Introduction to Decision Trees and Random Forests
Decision Trees
The Structure of the Decision Tree Model
A decision tree is a method for classification and regression; this article focuses mainly on decision trees used for classification. A decision tree has a tree-like structure, and in classification problems it represents the process of classifying data based on features. It can usually be regarded as a collection of if-then rules, or as a conditional probability distribution defined over the feature space and the class space. Its main advantages are good model interpretability and fast classification. During training, the decision tree model is built from the training data according to the principle of minimizing a loss function. During prediction, the decision tree is used to classify new data. Learning a decision tree usually involves three steps: feature selection, decision tree generation, and decision tree pruning. These decision-tree ideas come mainly from the ID3 algorithm proposed by Quinlan in 1986 and the C4.5 algorithm proposed in 1993, as well as the CART algorithm proposed by Breiman et al. in 1984.
A decision tree used for classification is a tree-like structure for classifying data. A decision tree is composed mainly of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, while a leaf node represents a class. Its structure is shown in the figure:

Feature Selection
Feature selection consists of choosing features that have classification ability for the training data, which can improve the learning efficiency of the decision tree. If classifying with a particular feature produces results that are not much different from random classification, that feature is said to have no classification ability. Discarding such features usually has little effect on the learning accuracy of the decision tree. The criterion for feature selection is typically information gain or the information gain ratio.
In information theory, entropy is a measure of the uncertainty of a random variable. Let X be a discrete random variable that takes a finite number of values, with probability distribution:
Then the entropy of the random variable is defined as:
The larger the entropy, the greater the uncertainty of the random variable. From the definition one can verify that
When or , , and the random variable has no uncertainty at all; when , , the entropy reaches its maximum, and the uncertainty of the random variable is greatest.
The conditional entropy represents the uncertainty of a random variable given that another random variable is known. The conditional entropy of random variable given random variable is defined as the mathematical expectation, with respect to , of the entropy of the conditional probability distribution of given
Here . Information gain represents the degree to which the uncertainty of class is reduced by knowing the information of feature .
The information gain of feature with respect to training set is defined as the difference between the empirical entropy of set and the empirical conditional entropy of given feature , that is
The ID3 Algorithm for Building Decision Trees
The main decision tree algorithms are ID3, C4.5, and CART. The core of the ID3 algorithm is to apply the information gain criterion to select features at each node of the decision tree, recursively building the tree. The following gives the ID3 algorithm for building a decision tree: Input: training data set D, feature set A, threshold ; Output: decision tree T.
(1) If all instances in D belong to the same class , then T is a single-node tree, and is taken as the class label of that node; return T; (2) If A=, then T is a single-node tree, and the class with the largest number of instances in D is taken as the class label of that node; return T; (3) Otherwise, compute the information gain of each feature in A with respect to D according to the information gain algorithm, and select the feature with the largest information gain; (4) If the information gain of is less than the threshold , then T is a single-node tree, and the class with the largest number of instances in D is taken as the class label of that node; return T; (5) Otherwise, for each possible value of , partition D into several non-empty subsets according to , take the class with the largest number of instances in as the label, and construct child nodes; the node together with its child nodes forms the tree T; return T. For the i-th child node, with as the training set and as the feature set, recursively call steps 1 through 5 to obtain the subtree ; return .
Ensemble Learning
Ensemble learning accomplishes a learning task by building and combining multiple learners. It is sometimes also called a multi-classifier system, committee-based learning, and so on.
The figure below shows the general structure of ensemble learning: first build a group of “individual learners,” then combine them using some strategy. Individual learners are usually produced from the training data by an existing learning algorithm, such as the ID3 decision tree algorithm, the BP neural network algorithm, and so on. In this case the ensemble contains only individual learners of the same type—for example, a “decision tree ensemble” consists entirely of decision trees, and a “neural network ensemble” consists entirely of neural networks. Such an ensemble is “homogeneous.” The individual learners in a homogeneous ensemble are also called “base learners,” and the corresponding learning algorithm is called the “base learning algorithm.” An ensemble may also contain individual learners of different types—for example, containing both decision trees and neural networks—in which case the ensemble is “heterogeneous.” The individual learners in a heterogeneous ensemble are produced by different learning algorithms, so there is no longer a single base learning algorithm; correspondingly, the individual learners are generally not called base learners but are commonly called “component learners,” or simply individual learners.

By combining multiple learners, ensemble learning can often achieve generalization performance that is significantly superior to that of a single learner. This is especially evident for “weak learners,” and as a result much of the theoretical research on ensemble learning has been conducted with weak learners in mind, so base learners are sometimes directly referred to as weak learners. It should be noted, however, that although in theory an ensemble of weak learners is sufficient to achieve good performance, in practice—out of various considerations, such as the desire to use fewer individual learners, or to reuse some experience with common learners—people often use relatively strong learners.
According to how the individual learners are generated, current ensemble learning methods can be roughly divided into two categories: sequential methods, in which there are strong dependencies among the individual learners and they must be generated serially; and parallel methods, in which there are no strong dependencies among the individual learners and they can be generated simultaneously. The representative of the former is Boosting, while the representatives of the latter are Bagging and “Random Forest.”
The Random Forest Algorithm
Random Forest (RF for short) [Breiman, 2001a] is an extended variant of Bagging. Building on Bagging ensembles that use decision trees as base learners, RF further introduces random attribute selection into the training process of the decision trees. Specifically, when a traditional decision tree selects a splitting attribute, it chooses an optimal attribute from the attribute set of the current node (assume there are d attributes); whereas in RF, for each node of a base decision tree, a subset containing k attributes is first randomly selected from the node’s attribute set, and then an optimal attribute is selected from this subset for splitting. Here the parameter k controls the degree to which randomness is introduced: if we set k=d, then the construction of the base decision tree is the same as that of a traditional decision tree; if we set k=1, then an attribute is randomly selected for splitting. In general, the recommended value is .
Random Forest is simple, easy to implement, and computationally inexpensive, and surprisingly it has demonstrated strong performance in many real-world tasks, earning it the reputation of being “a method representative of the state of the art in ensemble learning techniques.” It can be seen that Random Forest makes only a small modification to Bagging, but unlike Bagging—where the “diversity” of the base learners comes only from sample perturbation—the diversity of the base learners in Random Forest comes not only from sample perturbation but also from attribute perturbation. This allows the generalization performance of the final ensemble to be further improved through an increase in the differences among the individual learners. The structure of the decision tree is shown in the figure below:

For programming with random forests, you can use the scikit-learn framework scikit-learn User Guide