Today, I am going to talk about creating a decision tree from a data table. Surprisingly, for this fairly simple operation, many professionals pay large amounts of money to other professionals, typically external data mining consultants, just to have someone give meaning to their own enterprise data.
Here is the operation:
First, we are going to analyze the columns (attributes) that comprise the data table. The point of the analysis is to identify the classification attribute and the attributes whose values carry the highest amount of information. The attributes carrying the most information are going to become the decision nodes of the tree.
A sample data table is given below (fig. 1). The table is about passing a course at a college. The pink column, term result, is the classification attribute, and the available classes are "passed" and "failed". We are trying to reveal the effect of "self studied", "team work", "relations with lecturer" and "course attendance" on passing the course.
Figure 1.
There are some key points to the attribute analysis. The amount of information an attribute provides about the classification target is called the Information Gain. The equation of information gain is as follows:
Equation 1.
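In words, equation 1 says that the gain of splitting the parent table on an attribute is the impurity of the parent minus the weighted average impurity of the subsets created by the attribute's distinct values. Written out in the standard form, which matches the numbers worked through later in this post, it is:

Gain(attribute) = Gini(Parent) - Σ_k P(k) × Gini(k)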
Where P() denotes probability and k ranges over the distinct values of the attribute being evaluated. To make this clear, please check the example below:
Figure 2.
Figure 2 shows the sorted values of the attribute "course attendance" and their corresponding classes; it is a subset of fig. 1. The P(k) in equation 1 will take the values P(Bad), P(Good), P(Moderate) and P(None). The value of P(Bad) is 0.3 because 3 out of the 10 tuples in fig. 2 are labeled Bad. Similarly, the value of P(None) is 0.1 (1/10).
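These probabilities are just relative frequencies of the attribute values. Here is a minimal Python sketch; the attendance counts below are the ones implied by fig. 2, with the Good and Moderate counts taken from the split used later in fig. 6:

from collections import Counter

# Attendance values as implied by fig. 2 / fig. 6:
# 3 Bad, 3 Good, 3 Moderate, 1 None
attendance = ["Bad", "Bad", "Bad",
              "Good", "Good", "Good",
              "Moderate", "Moderate", "Moderate",
              "None"]

counts = Counter(attendance)
probabilities = {value: count / len(attendance) for value, count in counts.items()}
print(probabilities)   # {'Bad': 0.3, 'Good': 0.3, 'Moderate': 0.3, 'None': 0.1}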
The term Gini in equation 1 is a measure of impurity, and it is applied to the attributes of our table. Some attributes may be homogeneous, meaning that all the values of the attribute map to the same class. On the contrary, some attributes, like "course attendance" in fig. 2, have values addressing different classes, and we call those attributes heterogeneous or impure. There are different measures of impurity, such as classification error, entropy and the Gini index. I am giving the equations of all three but will be using the Gini index in this text for measuring impurity.
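For a set of records whose classes occur with probabilities P(k), the three measures have the following standard forms:

Classification error = 1 - max_k P(k)
Entropy = - Σ_k P(k) · log2 P(k)
Gini index = 1 - Σ_k P(k)^2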
In the equations above, P(k) is the probability of class k among the records being measured (here k ranges over the classes, not over the attribute values as in equation 1). We can look back at figure 2 to understand this. Examine the rows containing "moderate" course attendance, which are given in figure 3 below.
Figure 3.
The P(k) terms in the Gini index equation will be P(Failed) = 1/3 and P(Passed) = 2/3, so the Gini index of the "moderate" value of course attendance will be 1-[(1/3)^2+(2/3)^2] = 0.44, where ^2 denotes the power of 2, in other words the square operator.
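Here is a minimal Python sketch of this calculation, using the class labels of the three moderate rows in fig. 3:

from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

# The three "moderate" rows of fig. 3: one Failed, two Passed
print(gini(["Failed", "Passed", "Passed"]))   # 0.444...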
As you can see, we will first compute the impurity for each distinct value of every attribute. Second, we will calculate the information gain of every attribute by using those impurity values. Third, we will compare the information gains of the attributes and pick the one with the highest gain as the decision node of our tree. That is the end of one node description cycle; similar cycles follow in a recursive manner if further splits in the tree are needed.
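Building on the gini() helper above, a minimal sketch of one such cycle might look like this, assuming the table is held as a list of dictionaries whose classification column is named "term result" (the names are illustrative placeholders):

def information_gain(rows, attribute, target="term result"):
    """Equation 1: parent Gini minus the weighted Gini of the subsets for each attribute value."""
    gain = gini([row[target] for row in rows])
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * gini(subset)
    return gain

def best_attribute(rows, candidate_attributes, target="term result"):
    """One node description cycle: pick the candidate attribute with the highest information gain."""
    return max(candidate_attributes, key=lambda a: information_gain(rows, a, target))

In cycle 1, best_attribute would be called on the whole table with all four attributes as candidates; in later cycles it is called on the filtered subsets, with the attribute already used for splitting removed from the candidate list.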
Let's start the procedure. The Gini of our initial table is calculated using our classification column, term result. There are 5 Failed and 5 Passed tuples in the table, so:
P(Failed)=0.5
P(Passed)=0.5
Gini(Parent) = 1-[0.5^2+0.5^2] = 1 - 0.5 = 0.5
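As a quick check of equation 1 on this split: taking Gini(Moderate) = 0.44 from above, Gini(Good) = 0.44 (the value reused as the parent in cycle 2 below), and assuming the Bad and None rows each map to a single class so that Gini(Bad) = Gini(None) = 0, we get

Gain(course attendance) = 0.5 - [0.3×0 + 0.3×0.44 + 0.3×0.44 + 0.1×0] ≈ 0.23

which is the value reported for the winning attribute below.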
Using that parent Gini value, the figure below shows the evaluation of our attributes with respect to their information gain.
Figure 4.
The winner is course attendance, with a gain of 0.23. So our decision tree now has a root and some leaves. The very first form of the tree is given below (fig. 5).
Figure 5.
Now cycle 1 has ended, and the initial table in fig. 1 becomes shorter, as given in fig. 6 below. We filtered the rows with the condition that course attendance is Good or Moderate. We have to split the nodes for those values.
Figure 6.
We will use the first 3 rows of data while running the second node description cycle, so in this cycle those 3 rows become our parent table. The Gini(Good) and Gini(Moderate) values were already calculated in cycle 1 and can be examined in fig. 4; Gini(Good) will be our Gini(Parent) in cycle 2. At this stage, course attendance will no longer be evaluated as a node candidate; we will focus on the other 3 columns: self studied, team work and relations with lecturer. The next figure (fig. 7) shows the information gain values of those attributes for the first 3 rows of the table in fig. 6.
Figure 7.
We have 2 attributes with an information gain of 0.44 (team work and self studied), so we can choose either of them as our second decision node in the tree. My choice will be team work. After cycle 2, the tree looks as follows.
Figure 8.
It is time for cycle 3, where the situation of the moderate course attendants will be resolved. We will use the remaining 3 rows of the table in figure 6 (the moderate ones), Gini(Parent) will be 0.44, and our candidate attributes will again be self studied, team work and relations with lecturer. Let us see the calculation in figure 9:
Figure 9.
The winner of cycle 3 is the attribute self studied, with an information gain of 0.44. Finally, the decision tree is fully constructed. Here you are.
Figure 10.
According to the decision tree, relations with lecturer has no effect on passing this course.
As you can see, a software specialist can easily write a program that builds a decision tree from a data set. No need to pay extra for this :)
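For instance, a minimal sketch with pandas and scikit-learn might look like the following. The column names follow fig. 1, the few inline rows are illustrative placeholders rather than the original table, and criterion="gini" selects the same impurity measure used in this post:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# A few illustrative rows in the shape of fig. 1 (placeholder values, not the original data)
data = pd.DataFrame({
    "self studied":            ["Yes", "No", "Yes", "No"],
    "team work":               ["Yes", "No", "No", "Yes"],
    "relations with lecturer": ["Good", "Bad", "Good", "Bad"],
    "course attendance":       ["Good", "Bad", "Moderate", "None"],
    "term result":             ["Passed", "Failed", "Passed", "Failed"],
})

# One-hot encode the categorical attributes so scikit-learn can work with them
X = pd.get_dummies(data.drop(columns=["term result"]))
y = data["term result"]

tree = DecisionTreeClassifier(criterion="gini")
tree.fit(X, y)

print(export_text(tree, feature_names=list(X.columns)))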
Formally, decision trees belong to the classification branch of predictive data mining, which is itself a branch of machine learning.
God bless mathematics ;)