Bora writes on...: July 2009

The Virus

Do you think that we, human beings, are the most dominant species on the Earth?
I really don't think so. I believe that microscopic life forms, especially the viral ones are the most powerful species around the world.
National Center for Health Statistics announced the top 10 causes of death in United States by 2006 as:

According to the some articles in the book, Next Fifty Years, by John Brockman, different sorts of long running viral infections cause most of the cancer types, and some diseases of those we know no proven cause yet, like Alzheimer, MS etc.
Therefore, I hate those God damn micro life forms. They're killing us cruelly since the beginning of time and we cannot fight against them in a sufficient way. Remember tuberculosis, black death, AIDS, SARS, ebola, Crimean-Congo haemorrhagic fever (CCHF), fatal influenza forms appearing annually in the recent years. What are we doing about them? We have beaten the historic forms of some viruses but we are very vulnerable to the evolved viral forms.
Come on doctors, work harder! Save us!
Nowadays, beautiful Charlize Theron is striving to beat the viral infection in her stomach. As far as I know, she spent 4-5 days in the intensive care unit of Cedars-Sinai Hospital, Los Angeles.
I am trying the understand the game of evolution... [Possibly, homo sapiens can be considered as the viruses of the universe, if we look at the scene from an objective macro perspective. But I don't want to sheer this way now...] What is the aim of this kind of a natural selection? One ugly ribonucleic acid chain (RNA) covered by a protein shell is trying to conquer one of the most beautiful organisms of the world: Charlize Theron.
It is insane! Really unfair.
Pray for Ms. Theron, my friends, we are not ready to lose her yet!
Get up girl! Get up and smile us again.

Decision Tree

Today, I am going to mention about creating a decision tree from a data table. Actually, for this simple operation, a lot of professionals pay huge amounts of money to the other professionals, let's say external data mining consultants, for having them give a meaning to their own enterprise data.
Here is the operation:
Firstly, we are going to analyze our columns (attributes) comprise the data table. The point of the analysis is to determine the classification attribute and the attributes which bring the highest level of information in their values. The attributes carrying the highest level of information are going to be our decision nodes in the tree.
Sample data table is below (fig. 1). The table is about passing a course at a college. The pink column, term result, is the classification attribute and available classes are "passed" and "failed". We are trying to reveal the effect of "self studying", "team work", "relations with the lecturer" and "the course attendance" to passing the course.

Figure 1.

There are some key points of the attrbute analysis. The value of an attribute regarding to the corresponding classification target is called Information Gain. The equation of information gain is as follows:

Equation 1.

Where P() denotes the probability and k stands for the label of the distinct values of an attribute. To make this clear, please check the example below:

Figure 2.

Figure 2 shows the sorted values and their corresponding classes of the attribute "course attendance". And it is a subset of fig. 1. The P(k) in equation 1 will take the values P(Bad), P(Good), P(Moderate) and P(None). The value of P(Bad) is 0.3 beacuse there are 3 out of 10 tuples labeled as Bad in fig. 2. Similarly the value of P(None) will be 0.1 (1/10).

The term Gini in equation 1 is the measure of impurity. And it is applied to the attributes of our table. Some attributes might be homogeneous which means that all the values of this attribute map the same class. In contrary, some attributes, like "course attendance" in fig. 2, have different values addressing the different classes and we call those attributes heterogenous or impure. There are different measures of impurity like classification error, entropy and Gini index. I am giving the equations of all but will be using the Gini in this text for measuring the impurity.

In the equations above P(k) is the probability of the class mapped by the values of the attribute. We can look back to figure 2 for understanding it. Examine the rows containing the "moderate" course attendance which are given in figure 3, below.

Figure 3.

The P(k) in the equation of Gini index will be P(Failed) = 1/3 and P(Passed)=2/3. And the Gini index for course attendance attribute will be 1-[(1/3)^2+(2/3)^2] = 0.44 where ^2 denotes the power of 2, in other words square operator.

As you can see, we will compute the impurity for each distinct value of every attribute, firstly. Secondly, we will calculate the information gain by every attribute by using their impurity degrees. Third, we will compare the information gain of the attributes and pick the one with the highest information gain level as the decision node of our tree. It will be end of a node description cycle. The similar cycles may follow in a recursive manner if further splits in the tree needed.

Let's start the procedure... The Gini of our initial table will be calculated by using our classification column, term result. There are 5 Failed and 5 Passed classes in the table. So

P(Failed)=0.5

P(Passed)=0.5

Gini(Parent) = 1-[0.5^2+0.5^2] = 1 - 0.5 = 0.5

According to that parent Gini value, the below figure shows the evaluation of our attributes regarding to their information gain.

Figure 4.

The winner is course attendance with the gain of 0.23. So our decision tree has a root and some leaves now. The very first form of the tree is given below (Fig. 5).

Figure 5

Now, cycle 1 ended and initial table in fig. 1 became shorter as given in fig. 6 below. We filtered the rows by giving the condition of Course Attendance in Good and Moderate. We have to split the nodes for those values.

Figure 6.

We will use the first 3 rows of data while we are running the second decision node description cycle. So that, in this cycle those 3 rows will become our parent table. Gini(Good) and Gini(Moderate) values have been calculated in cycle 1 and can be examined in fig. 4. Gini(Good) will be our Gini(Parent) in cycle 2. In this stage, course attendance will no longer be evaluated as a node candidate, we will focus on the other 3 columns: Self studied, team work and relations with lecturer. Next figure (Fig. 7) shows the information gain values of those attributes for the first 3 columns of the table in fig. 6.

Figure 7.

We have 2 attributes with the info gain of 0.44 (team work and self studied) so we can chose any of them as our second decision node in the tree. My choice will be team work. After cycle 2, the tree becomes as follows.

Figure 8.

It is time for cycle 3, where the situation with the moderate course attendants will be defined. We will use the second 3 rows of the table in figure 6, Gini(Parent) will be 0.44 and our candidate attributes will be self studied, team work and relations with lecturer. Let us see calculation in figure 9:

Figure 9.

Winner of cycle 3 is the attribute self studied with 0.44 of information gain. Finally, the decision tree is fully constructed. Here you are.

Figure 10.

According to the decision tree, relations with lecturer has no effect on passing this course.

As you can see, a software specialist can easily write a program for creating a decision tree by using a data set. No need to pay extra for this :)

Formally, subject of the decision trees is a vein of classification of predictive data mining which is a branch of machine learning.

God bless mathematics ;)

Envy-Free Cake Division 2

In my previous post, I discussed the mathematics of fair division and I introduced the Selfridge & Conway's algorithm.

While we are talking about the deterministic rules working for setting up controlled decision environments, everything might seem very clear and understandable so that we all might be persuaded about the conditions of the environment and motivations of the rules which are to solve the problems across the environment.

But.

Real life is much more vaguer than the well defined solution environments and because of this, most of the solutions devised can't satisfy the stakeholders of the problem domain. For instance, if I were the party A in the Selfridge-Conway procedure, I could have still been envying to a piece other than mine by arguing that "Why am I chosen for cutting the cake first? If cannot cut the cake in 3 equal pieces, it will be a disadvantage for me! I'd like to be the first picker!" And nobody can criticise me becuse of my opinion about the procedure.

One can say that, "Okay, you have to agree on the pillars of the procedure firstly!".

Yes, it is a good starting point. And a powerful statement that ties me up about obeying the rules. However, even if I agreed the rules of the procedure, after the procedure ends, I might still be unhappy with my piece and crying about it. It is possible!

Then, would it be my psychological problem or would it be the truth that such problems are unsolvable if you are dealing with human beings and their feeling of justice and satisfaction? Is there a way for highly intellectual people to share things without using brute force? In personal interactions, human beings had come a significant way when it comes to solve the problems in a civilized way. On the other hand, in international relations brute force is being used vastly between parties.

I believe that we are standing just besides a threshold!

After a revolutionary step, which has been led by "pure" science, over this threshold, the big question marks in our minds will disappear very fast. And we all are going to take our most relieved breath! ...before finding us new and bigger question marks :))

Think of this!

Envy-Free Cake Division

During my military service in Ankara, in one of the lectures at the military academy, a commander was talking about how to command the team in a fair manner. And he asked "How do you think 2 guys share a cake without envying to each other's piece?" Our answers involved the use of measurement tools but the commander gave the prerequisite that they cannot use any measurement tools, but the knife, hands and eyes. Then, the answer was very clever: "I cut, you choose" procedure works pretty fine for two guys. The first one cuts the cake into 2 equal pieces according to his perception of equity. And the second guy chooses the piece to pick for himself. No one envies.

When you are dealing with such a problem, and when you are a young engineer, your brain generates subsequent problems immediately: What if there are 3 guys to share the cake? I was not able to google this under those circumstances so I and a couple of my "brother-in-arms" discussed the problem and offered some premature solutions. Most successful one was reducing the problem t0 2 guys model after 1 arbitrary guy cuts the cake into 3 pieces. Procedure was like this for players A, B, C:

One of the guys, let's say A, cuts the cake into 3 pieces
B chooses one of the pieces
C chooses one of the pieces
A picks the last piece on the table
A-C, A-B, C-B pairs try to agree that all the pieces are equal. If they cannot agree, they use the "I cut, you choose" procedure for their pieces.

For example, after step 4, A checks the piece which belongs to B and if he envies B's piece, B cuts his piece into two and A picks one of the pieces and then, A cuts his piece into two and B picks one of the A's pieces and at the end they end up with an envy-free division. As a conclusion, A and B have 4 smaller pieces. After that, player C may want to do "I cut, you choose" with B and this operation results in C and B having 6 pieces. And the round goes...

The problem with this approach was this procedure tends to last forever because after A-B, B-C, A-C agreements, B might envy C's pieces and they start sharing again, the second tour... Perhaps, in practice it never goes too far but the procedure must be able to halt in a meaningful fashion.

After all the discussions, I have not focused on the problem for about 4 years.

Then suddenly the problem grabbed me again. But this time, I googled :)) And of course, I found all the literature about envy-free cake division problem. There are more than 6 algorithms designed. To me, two of them are very good: The first one was announced by Selfridge & Conway in 1960s. The latter is called "moving knives" and devised by Stromquist. I will explain the first one here.

The problem: Parties A, B and C want to share the cake fairly. They will use just a knife.

Figure 1. Selfridge & Conway stage 1.

The procedure comprises 2 stages. In stage 1,

Party A cuts the cake into 3 pieces regarding just his feelings of equity (step II in fig. 1).
Party B controls the pieces, if he thinks that at least two of the pieces are good for choosing (means those two are equal), he does nothing.
Party C picks a piece.
Party B picks a piece.
Party A picks the remaining piece.

Everything is perfect. However, if party B thinks that one piece is quite bigger than the other ones, in the second bullet above, he takes the knife and trims the biggest piece in order to produce at least 2 tied pieces. (See the III in fig. 1) Then, no one can pick the "trm" bit in the stage 1. And procedure goes:

Party C chooses a piece.
If party C did not choose the trimmed piece (piece number 3 in fig. 1), party B has to pick trimmed piece.
Party A picks the remaining one.

Stage 1 is done. No one envies the other. Now, it is time to divide the "trm" bit.

Figure 2. Selfridge & Conway stage 2.

Among B and C, the party who did not pick the trimmed piece in stage 1 cuts the trm bit into 3 by his knife.
If the party who cuts the trimming into 3 is B, then the parties pick the pieces (t1, t2, t3 in fig. 2) in C, A, B order. If the party that cuts the trm bit is C then the picking order is B, A, C.

Stage 2 ends. It is an envy-free division process. We're all done with this.

The problem, of course, goes too far by considering n parties to share the cake :)) but let us leave the rest of the story to the mathematicians.