ID3 Algorithm Implementation in Python

Introduction

ID3 is a classification algorithm which, for a given set of attributes and class labels, generates a model (a decision tree) that categorizes a given input into one of the class labels Ck from the set {C1, C2, …, Ck}. The algorithm follows a greedy approach: at each step it selects the attribute that yields the maximum information gain (IG), or equivalently the minimum entropy (H). It then splits the data-set (S) recursively on the remaining unused attributes until it reaches the stop criteria (no further attributes to split on). The non-terminal nodes in the decision tree represent the attributes on which the splits occur, and the terminal nodes represent the class labels.

Implementation

In this blog you can find a step-by-step implementation of the ID3 algorithm. The functions used in the implementation are also discussed.

Software Used

  1. Python 2.7
  2. Spyder IDE

The major steps involved in the implementation are:

  • Entropy Calculation
  • Attribute Selection
  • Split the data-set
  • Build Decision Tree

Step 1: Entropy Calculation

The method takes the data-set as an argument and calculates its entropy from the class-label distribution.

The entropy of the data-set is calculated as

H(S) = − Σ p(x) · log₂ p(x), summed over all class labels x in X

Where,

H(S) –> entropy of the data-set

S –> data-set

X –> set of class labels

p(x) –> ratio of the number of elements in class x to the number of elements in the entire data-set S

Information gain is the measure of the difference in entropy before and after the data-set is split.

IG(S) = H(S) − Σ p(t) · H(t), summed over all subsets t in T

Where,

H(S) –> entropy of the data-set S before the split

T –> set of subsets created by splitting the data-set

p(t) –> ratio of the number of elements in subset t to the number of elements in the entire data-set S

H(t) –> entropy of subset t
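
As a quick worked example with hypothetical counts: for a data-set of 14 records with 9 'yes' and 5 'no' class labels,

H(S) = −(9/14) · log₂(9/14) − (5/14) · log₂(5/14) ≈ 0.940

If a split produces subsets whose weighted entropy sums to, say, 0.694, the information gain of that split is IG = 0.940 − 0.694 = 0.246.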

#Entropy Calculation method
#the log function from the math module is needed for the entropy formula
from math import log

def calc_entropy(data):
    #Calculate the length of the data-set
    entries = len(data)
    labels = {}
    #Count the occurrences of each class label in the dict object "labels"
    for rec in data:
        label = rec[-1]
        if label not in labels.keys():
            labels[label] = 0
        labels[label] += 1
    #entropy variable is initialized to zero
    entropy = 0.0
    #For every class label (x) calculate the probability p(x)
    for key in labels:
        prob = float(labels[key])/entries
        #Entropy formula calculation
        entropy -= prob * log(prob, 2)
    #print "Entropy -- ", entropy
    #Return the entropy of the data-set
    return entropy
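
As a quick sanity check, calling the method on a small hand-made data-set (hypothetical records, class label in the last column) should reproduce the value from the worked example above:

#Illustrative usage of calc_entropy (hypothetical records)
sample = [[1, 'yes']]*9 + [[0, 'no']]*5
print calc_entropy(sample)   #prints approximately 0.940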

Step 2: Attribute Selection

We have to determine the attribute on which the data-set (S) is to be split. The attribute_selection method returns the index of the attribute with the maximum information gain, which is used in building the decision tree.

#Function to determine the best attribute for the split criteria
def attribute_selection(data):
    #get the number of features available in the given data-set
    features = len(data[0]) - 1
    #function call to calculate the base entropy (entropy of the entire data-set)
    baseEntropy = calc_entropy(data)
    #initialize the info-gain variable to zero
    max_InfoGain = 0.0
    bestAttr = -1
    #iterate through the features identified
    for i in range(features):
        #store the values of the feature in a variable
        AttrList = [rec[i] for rec in data]
        #get the unique values from the feature values
        uniqueVals = set(AttrList)
        #initialize the weighted attribute entropy to zero
        attrEntropy = 0.0
        #iterate through the list of unique values and perform split
        for value in uniqueVals:
            #function call to split the data-set
            newData = dataset_split(data, i, value)
            #probability calculation
            prob = len(newData)/float(len(data))
            #weighted entropy calculation for the attribute value
            newEntropy = prob * calc_entropy(newData)
            attrEntropy += newEntropy
        #calculation of Information Gain
        infoGain = baseEntropy - attrEntropy
        #identify the attribute with max info-gain
        if infoGain > max_InfoGain:
            max_InfoGain = infoGain
            bestAttr = i
    #return the attribute identified
    return bestAttr
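
As a quick illustration of the selection step, assuming the function above (it relies on dataset_split from the next step) and a hypothetical two-attribute data-set whose second column separates the class labels perfectly, the method should return the index of that column:

#Illustrative usage of attribute_selection (hypothetical records, label in the last column)
records = [['sunny', 'high',   'no'],
           ['sunny', 'normal', 'yes'],
           ['rainy', 'high',   'no'],
           ['rainy', 'normal', 'yes']]
print attribute_selection(records)   #prints 1 -- the second attribute has the highest information gain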

Step 3: Split the data-set

In this step, taking as input the attribute index (the axis, or arc in the code below) selected in the previous step and an attribute value, a split is performed on the data-set. This method is called from the attribute-selection step above and, recursively, from the tree-building step (Step 4) for every attribute present in the given data-set, in the order of decreasing information gain, until the algorithm reaches the stop criteria.

#Function to split the data-set based on the attribute that has maximum information gain
#input values: data-set, attribute index (arc) and attribute value
def dataset_split(data, arc, val):
    #declare a list variable to store the newly split data-set
    newData = []
    #iterate through every record in the data-set and split the data-set
    for rec in data:
        if rec[arc] == val:
            #copy the record without the attribute column used for the split
            reducedSet = list(rec[:arc])
            reducedSet.extend(rec[arc+1:])
            newData.append(reducedSet)
    #return the new list that holds the records matching the split value
    return newData
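
A small illustration of the split, assuming the function above and two hypothetical records: rows matching the value are kept and the split column is dropped:

#Illustrative usage of dataset_split (hypothetical records)
records = [['sunny', 'high', 'no'],
           ['rainy', 'normal', 'yes']]
print dataset_split(records, 0, 'sunny')   #prints [['high', 'no']]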

Step 4: Build Decision Tree

In this step we will integrate the above process flow into a single function and generate the rules in the decision tree.

#Function to build the decision tree
def decision_tree(data, labels):
    #list variable to store the class-labels (terminal nodes of decision tree)
    classList = [rec[-1] for rec in data]
    #stop criteria: all records carry the same class label
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    #stop criteria: no attributes left to split on, return the majority class label
    if len(data[0]) == 1:
        return max(set(classList), key=classList.count)
    #function call to identify the attribute for split
    maxGainNode = attribute_selection(data)
    #variable to store the selected attribute name
    treeLabel = labels[maxGainNode]
    #dict object to represent the nodes in the decision tree
    theTree = {treeLabel:{}}
    del(labels[maxGainNode])
    #get the unique values of the attribute identified
    nodeValues = [rec[maxGainNode] for rec in data]
    uniqueVals = set(nodeValues)
    for value in uniqueVals:
        #copy of the remaining attribute names for the recursive call
        subLabels = labels[:]
        #update the non-terminal node values of the decision tree
        theTree[treeLabel][value] = decision_tree(dataset_split(data, maxGainNode, value), subLabels)
    #return the decision tree (dict object)
    return theTree
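
Putting it together on a hypothetical toy data-set (the records and attribute names below are made up for illustration), the function returns a nested dict whose keys are attribute names and attribute values:

#Illustrative usage of decision_tree (hypothetical records and attribute names)
records = [['sunny', 'high',   'no'],
           ['sunny', 'normal', 'yes'],
           ['rainy', 'high',   'no'],
           ['rainy', 'normal', 'yes']]
attr_names = ['outlook', 'humidity']
#note: decision_tree modifies the labels list passed to it
tree = decision_tree(records, attr_names)
print tree   #prints {'humidity': {'high': 'no', 'normal': 'yes'}} (key order may vary)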

Once the tree is built, it can be used to predict class labels in a classification task. The rules in the decision tree can also be derived and visualized, for example with libraries like Graphviz in Python.
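
The post itself does not include a prediction routine, but a minimal sketch could walk the nested dict until a terminal label is reached. The classify function below is an assumption for illustration, not part of the original implementation; it expects a fresh copy of the attribute names (decision_tree deletes entries from the list it is given) and will raise a KeyError for attribute values that never appeared in the training data.

#A minimal prediction sketch (hypothetical helper, not part of the original implementation)
def classify(tree, attr_names, sample):
    #a non-dict node is a terminal class label
    if not isinstance(tree, dict):
        return tree
    #read the attribute stored at this node and follow the branch for the sample's value
    attr = tree.keys()[0]
    branch = tree[attr][sample[attr_names.index(attr)]]
    return classify(branch, attr_names, sample)

#continuing the toy example above
print classify(tree, ['outlook', 'humidity'], ['sunny', 'high'])   #prints 'no'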

 


Implementing K-Nearest Neighbors (KNN) algorithm for beginners in Python

Introduction: KNN is a simple machine learning algorithm for regression and classification problems. It is widely used for classification because one can build a model with the KNN algorithm and gain quick insight into the data in a matter of minutes. As the name suggests, we find the K closest neighbours of a data point and let them decide the label. Suppose, in the figure shown, we try to predict the class of the green circle in the middle: if the K value is 3, we find the 3 nearest neighbours, and since two of them are triangles and only one is a square, the KNN model predicts the triangle class.

[Figure: KNN classification example (KnnClassification.svg), a query point and its K = 3 nearest neighbours]

Step 1: Finding the distance between two data points

import math

def euclidean_distance(datapoint1, datapoint2, no_of_parameters):
    distance = 0
    # iterate through the parameters
    for i in range(no_of_parameters):
        # sum up the squared differences
        distance += pow((datapoint1[i] - datapoint2[i]), 2)
    return math.sqrt(distance)

Here no_of_parameters represents the number of dimensions (features) of a data point.
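
A quick check with two hypothetical 2-dimensional points (the class label in the last position is ignored because only no_of_parameters values are compared):

#Illustrative usage of euclidean_distance (hypothetical points)
point_a = [1.0, 2.0, 'A']
point_b = [4.0, 6.0, 'B']
print euclidean_distance(point_a, point_b, 2)   #prints 5.0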

Step 2: Finding the Neighbors

import operator

def find_Neighbours(trainingset, testcase, k, no_of_features):
    # List to store the (class label, distance) pairs
    distances = []
    # Iterate through the training data
    for row in trainingset:
        # Compute the distance, then add the class label (last element of the row) and distance to the list
        distances.append((row[-1], euclidean_distance(row, testcase, no_of_features)))

    # Sort the pairs based on distance
    distances.sort(key=operator.itemgetter(1))

    # List to store the neighbours
    neighbours = []
    # Take the K nearest neighbours
    for i in range(k):
        # Append the neighbour
        neighbours.append(distances[i])

    return neighbours

The function returns the K nearest neighbours in the training set for the given test data point.
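
A small illustration, assuming the function above and a hypothetical training set (class label in the last position):

#Illustrative usage of find_Neighbours (hypothetical training data)
training = [[1.0, 1.0, 'A'],
            [1.2, 0.8, 'A'],
            [6.0, 6.0, 'B']]
print find_Neighbours(training, [1.1, 1.0], 2, 2)
#prints the two closest (label, distance) pairs, e.g. [('A', 0.1), ('A', 0.223...)]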

Step 3: Predicting the output class

def get_classification(neighbours):
    # Dictionary to count the votes
    class_votes = {}
    # Iterate through the neighbours
    for i in range(len(neighbours)):
        # Read the classification label (first element of the (label, distance) pair)
        response = neighbours[i][0]
        # Check if the label is already in the dictionary
        if response in class_votes:
            # The label is already recorded, increase its vote
            class_votes[response] += 1
        else:
            # Add the label to the dictionary with one vote
            class_votes[response] = 1
    # Sort the labels based on the number of votes, in descending order
    sortedVotes = sorted(class_votes.items(), key=operator.itemgetter(1), reverse=True)
    # Return the class label with the most votes
    return sortedVotes[0][0]
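
Finally, the three steps can be combined for a single prediction, again with hypothetical data (class label in the last position):

#Illustrative end-to-end usage (hypothetical training data)
training = [[1.0, 1.0, 'A'],
            [1.2, 0.8, 'A'],
            [6.0, 6.0, 'B']]
neighbours = find_Neighbours(training, [1.1, 1.0], 2, 2)
print get_classification(neighbours)   #prints 'A'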