Genetic Algorithms – A Friendly Introduction

A few days back, I was going through one of the research papers and realised I have a new topic to cover and have my hands dirty with in the field of Machine Learning (ML).

Yeah, you guessed it right..it’s what we call Genetic Algorithms (GAs). While to my surprise, I myself found very less content of GAs being used in ML. If you find some, do link them in the comments. I’ll love to see others’ work on the same.

Without further gossips, lets just dig in and understand what GAs are all about.

In their documentation MathWorks® define Genetic Algorithms as:

The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution.

For the sake of simplicity, let’s just say GAs are methods(set of algorithms) of search, often applied to optimization and learning.

While GAs are a set of algorithms, the bigger picture is that they are encapsulated by a much bigger set of algorithms known as Evolutionary Algorithms(EAs). So, what is so special about GAs? Well, their basis of evolutionary analogy is a thing that separates them from others in the same set of EAs. GAs follow the evolutionary Analogy – Survival of the Fittest. And if you weren’t sleeping in your Biology classes, you know that this evolutionary process is what we see in nature.

Now that we have a clear picture of what GAs are, let me introduce you to the concept of Search Space(a term constantly used in GAs).

Search Space

While solving a given problem, we generally look for a solution, which is best
among a set of possible solutions.
Going by the definition:

The space of all feasible solutions (it means objects among those the desired solution is) is called search space (also state space).

By the definition, it is clear that every point in search space refers to a feasible solution. To visualize these solutions, we can mark each point by its value or fitness for the problem at hand. The solution to the problem will be point or set of points from the search space.

How do we look for a solution? Well, looking for a solution generally follows the process of looking for an extreme(minimum or maximum) in the search space.

This might be a little confusing. Let’s just take an example. Consider a problem of finding a password which is 8 characters long containing only lower case alphabets. What do you think are plausible solutions to this problem? Well, since I attended my Permutations and Combinations classes, I know that the search space contains 26^8 choices. If you are confused, just think of 8 blanks and try thinking of possible choices(26 lower case alphabets) for each blank. The example was all good but the thing we have to keep in mind is in the case of a problem solved by GA – we usually know a few points from the search space and are generating other points as the process of finding solution continues.

Problem with search:
The problem is that search can be very COMPLICATED!
One does not know where to look for the solution and where to start.

To overcome this problem and find suitable solution (that may not necessarily be best solution), some nice guys came up with various methods, for example, hill climbing, Simulated annealing, etc. Solutions found using these methods are generally considered good because it is often not possible to prove what is real optimum(who knows in future maybe we’ll be able to – checkout these Quora answers on NP problems).

Before going ahead, we’ve to understand the concept of Chromosomes(don’t worry I’m not gonna blabber about the whole concept and go deep in the world of those nuclei).

Chromosome
We all know that every living being, from the mosquito you killed from that spray to you my friend, possess cells. Now, these cells have a fascinating thing about them, that is, every cell in an organism has the same set of chromosomes, for example, in humans there is a set of 23 pairs while in Gorillas there is a set of 24 pairs of chromosomes(1 pair can make a hell lot of difference, guys).

Chromosomes in general form are strings of DNA(I don’t want to go in detail of those ATGCs though) and serve as a model for the whole organism. Now, these chromosomes consist of genes (blocks of DNA).

Each gene(in certain terms) encodes a trait of the organism, example, color of eyes. Now, if you aren’t living all by yourself, you know that every individual organism in the same species have many distinctions from each other. This is because of different settings of the traits encoded in the genes. The possible settings for a trait(example blue eyes, brown hair, etc) are called alleles. These genes have a special position in each chromosome, which is reffered to as locus. The set of genetic material(a set of chromosomes) is called genome.
The last two terms of this biology lecture(bear with me for 2 more minutes guys)
So, we know every organism has this “internally coded, inheritable information”(set of chromosomes) and we refer to it as genotype. While these traits are being passed by parents, we only see a few of them(dominant ones). These actual observed properties of an organism is referred to as phenotype.

That was a heck of a biology lecture I know…you are a survivor man. This table gives quick view whenever you need to check what we are referring to in the Methodology below.

Methodology of A Genetic Algorithm

In one liner, we can explain the methodology used as,

In a genetic algorithm, a population of candidate solutions (called individuals, creatures or phenotypes) to an optimization problem is evolved towards better solutions.

Going deeper, each of these candidate solutions have a set of properties (its genotype or chromosomes) which can be mutated and altered. Traditional way of representing a candidate solution is to use binary numbers to form strings of 0s and 1s(but other encodings are also possible).

Representation Example:
Bits or subset of bits represent choice of some features, for example, let us represent choice of a smart TV using a 10 bit string using following protocol: Continuing to our discussion on Methodology…
Since we now understand how we can represent the solutions, let us understand how we go about making changes to them so as to reach a good solution.

In general terms, evolution of a population of randomly generated individuals is an iterative process, with population present in each iteration is called a generation. In every iteration, fitness of every individual in the population is evaluated(fitness is usually the value of objective function in the optimization problem being solved). To form a new generation with a bit of a randomness, we select more fit individuals stochastically from the current population and then modify the genome(recombine and randomly mutate) of each individual in the population. This new generation is then used in next iteration.Now, the question which might be popping in your mind is..Hey, you said its an iterative process, so when should it be brought to a hault?
Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.

With power comes responsibility. GAs give you a power to get good optimum solutions for problems but for every new problem to be solved using GAs, a typical GA requires:
1. A genetic representation of the solution domain(i.e., the genotype will be different for a different problem domain).
2. A fitness function to evaluate the solution domain(i.e., the fitness function can be different for different problem domains).
These two items must be developed again whenever a new problem is specified.

General Criteria Followed By Genetic Algorithms

Every GA follows a basic criteria for computing solutions. Let us have a look at these steps. So, the questions that arise after giving a thought over the algorithm are:
1. How do we select parents for crossover?. Though this can be achieved in many ways, but the main idea followed is to select the better parents(this is done in hopes that better parents will produce better offspring).
2. If the new population is generated only by new offspring then this can cause loss of the best chromosomes from the last population.. This is indeed true; To overcome this problem, a method named elitism is used. It simply means that at least one best solution is copied without changes to the new population so the best solution found can survive to the end of run.

Looking at the algorithm, it is clear that three more operations other than fitness evaluation are being performed – Selection, Mutation and Crossover. These are done using three Genetic operators. Let us visit the territory of these Genetic Operators and have a clear look.

Genetic Operators

Genetic operators, as mentioned above, are methods to perform Selection, Mutation and Crossover. And not so surprisingly, they are named as Selection, Mutation and Crossover. We will now visit each of these one at a time.

Selection

To perform crossover and produce new offspring, we need to select parents(chromosomes). The problem that lies in the selection process is how to select these chromosomes such that the best ones should survive and create new offspring. Traditional way of doing so is to use Proportional Selection, i.e., parents are chosen to mate with probability proportional to their fitness. Also, in traditional selection process, the children replace their parents. There are many methods for selecting the best chromosomes, for example, Roulette Wheel Selection, Boltzmann Selection, Tournament Selection, Rank Selection and Steady-State Selection(Code for these will be produced with explanation in future post).

Looking at the broader picture, one can clearly see the principle being used is Survival of the Fittest.

Mutation

Mutation(reminds us of X-MEN I know..oh Logan) is one of the most simple processes to explain in GA. This operator operates on a single parent at a time to produce an offspring. Think of it as a magic wand that can only change one (or more) bit in a binary representation, where each bit has same probability of mutation. Let us remember our previous example string.
Example String:  0101110011 (Represents a Samsung LED Smart TV in range 46″-49″ with 3D and HDR capabilities but no 4K UHD support and no curved display)
Could Mutate to: 0101111011 (Represents a Samsung LED Smart TV in range 46″-49″ with 3D and HDR capabilities along with 4K UHD support but no curved display)

Oh God..if only my algorithm can really mutate my smart TV!
It might or might not happen in future but we must keep going(with my broken heart though).

Crossover

This, my boy, is how you were born(not exactly though). I guess I must put a hault on my lame jokes now.So, this operator as you might have guessed, operates on two parents selected by Selection operator to produce one or two offspring. Classical crossover occurs at one or 2 points in the selected parents.

Example of generating offspring using 1 point crossover: Looking at the image, it is clear that we just swapped the sub-strings on either side of the point of crossover.

Example of generating offspring using 2 point crossover: Just like 1 point crossover, in 2 point crossover, we swapped the sub-strings enclosed in the two points of crossover.

With this, I think, you guys have a pretty good idea of how GAs work. Giving some rest to our minds, I’ll be back with a post on a very simple GA with an example problem to apply GA upon.

Connect with me on:

Logistic Regression – Hands on Experience

In my last post, I gave you a theoretical knowledge of how Logistic Regression works. Along with the intuition, I provided you with a dataset to apply the theoretical knowledge on your own at first. In this post, I’ll explain you my approach to get a working model for the dataset I provided.

Let us first revist and have a deeper look into our dataset..because daahh..data is what matters the most for making predictions (Well, that’s a topic for another day).

Dataset Explained.

The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’.

Where,
‘admit’ – represents whether a given student is provided with admit to the college(1) or not(0).
‘rank’ – represents the ranking of the college
‘gpa’ (or Grade Point Average) – represents the GPA of student in his previous academics.
‘gre’ (or Graduate Record Examination) – represents marks of student in the GRE exam.

Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to find whether a person will be admitted to the college or not.

If you haven’t yet read my last post, I recommend doing it before going ahead.

I strongly recommend to try to code it out yourself before looking at my solution.

Libraries used

We’ll be using Pandas for data extraction and NumPy for matrix operations. So, make sure you have a little knowledge of the stuff. Although, I’ll try my best to explain all the minor details.

So, grab your cup of coffee and follow along. It is going to be a little of a long ride.

Step 1 – Setting Up

We’ll be making a function called run, the main purpose of which will be to set up our data and call all the functions to perform gradient descent.

def run():
#Collect data
dataframe = pd.read_csv('binary.csv', sep = ',')
data = dataframe.as_matrix()
y = data[:, 0]
X = data[:, 1:]
#Scale Data
X = feature_scaling(X)
#Step 2 - define hyperparameters
learning_rate = 3
num_iters = 1000
initial_theta = np.random.random(3)
#train our model
print('Starting gradient descent at theta = {0}, error = {1}, accuracy = {2}'
.format(initial_theta, compute_error_for_separator_given_data(initial_theta, X, y), accuracy(initial_theta, X, y)))
theta = gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters)
print('Ending gradient descent at Iteration = {0} theta = {1}, error = {2}, accuracy = {3}'
.format(num_iters, theta, compute_error_for_separator_given_data(theta, X, y), accuracy(theta, X, y)))

In line 3, pd.read_csv() takes in two arguments for our purpose. First file name and second separator used in file (For those of you who don’t know, .csv means comma separated values). The function returns a dataframe. Converting this dataframe to a matrix (as in line 4) makes it easy to operate on data. In next two lines, we separate our labels (‘admit’) from features (‘rank’, ‘gpa’ and ‘gre’).

Rest of the functions will be covered one at a time.

Step 2 – Feature Scaling

From the heading itself its clear that we are targeting the black box we know nothing about at line 8 of run function.

While I’ll be discussing this topic in full depth in a future post, let’s look at what it means to scale the features here. Looking at the data, it is clear that ‘gre’ marks are in hundreds while both other features are in unit place. If this variation in feature is reduced, then we can reach to global optimum more easily and effectively. For this purpose, we will be using what we call Normalization. Doing this ensures that all the values are in the range of 0 and 1.

The formula we’ll be using here is: def feature_scaling(X):
for i in range(X.shape):
X[:, i] = (X[:, i] - min(X[:, i])) / (max(X[:, i]) - min(X[:, i]))
return X

This could have also been done using NumPy as

def feature_scaling(X):
return (X - np.min(X, axis = 0)) / (np.max(X, axis = 0) - np.min(X, axis = 0))

Where the second keyword argument, as the name suggests apparently decides which axis to find the minimum or maximum on. Not using it will return minimum or maximum value out of the complete matrix.

Step 3 – Calculating Error

Before Calculating Error we need to define 2 more functions.
1. Sigmoid
2. Predict absolute value

Sigmoid
If you read my last post, then you definitely know what sigmoid function is and how we are going to use it here. So, without going into details lets jump to the code.

def sigmoid(z):
return 1 / (1 + np.exp(-z))

Here, -z multiplies -1(to be precise, it just flips the sign bit) to each element of z. Then, np.exp(-z) applies element-wise exponential function on negative of z. The addition and division shown here works in element-wise manner.

Predict absolute value
Let us just directly jump to the code.

def predict_abs(theta, X):
return sigmoid(np.dot(X, theta))

Here, np.dot(X, theta) performs simple matrix multiplication.

Now, since our base work is clear let us make our error computing function. As we already know, we use mean square error formula for this sake, i.e., def compute_error_for_separator_given_data(theta, X, y):
preds = predict_abs(theta, X)
total_error = np.sum((y - preds) ** 2)

We’ve scaled our features, we’ve computed the error.The only thing left to do is to run gradient descent on our data with our hyper-parameters.

Step 4 – Gradient Descent

From my last article we know that gradient descent works as: Ummm.. there is surely a call for separation of things to make the code easier to implement. So we will tackle each step of the Gradient Descent using a function called step_gradient.

def step_gradient(theta_current, X, y, learning_rate):
preds = predict_abs(theta_current, X)
theta_gradient = -(2 / len(y)) * np.dot(X.T, (y - preds))
theta = theta_current - learning_rate * theta_gradient
return theta

Going line by line, first we compute absolute predictions from given theta values and feature values. Then using them, we compute the partial differentiation of our cost function w.r.t theta and hence in the third line we find the value of new theta for current iteration of Gradient Descent.
Pretty Neat..!

So the only thing left is to iterate over this function a certain number of times as mentioned in our hyper-parameters.

def gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters):
theta = initial_theta
for i in range(num_iters):
theta = step_gradient(theta, X, y, learning_rate)
return theta

Well, I don’t think this last piece of code needs any explanation though.

So, we are finally here..
We first setup our parameters and data.
Then we setup our probability calculating function – Sigmoid
After that, we hit the rock bottom of calculating the error
In the after math, we calculate new theta values from initial values using Gradient Descent.

Source Code: Github Link
Part 1 of this Article: Logistic Regression – Let’s Classify Things..!!

PS: There might be many implementations far better than this. It will be great to see people raising issues in my repository on github. After all, we all are here to learn. And yeah this article was so delayed I’m sorry for I was really busy in some meetups and examinations. Many new articles coming this month.

Connect with me on:

Linear Regression – Giving Best Fit Line

In my last post, I gave you an intuition of types of Machine Learning problems. Here, we’ll be undertaking the problem of regression(or real values fitting) which is a sub-category of Supervised Machine Learning.

But first things first!
We’ll be using Python for implementation along with its open source libraries like – NumPy, Pandas, Matplotlib and Scikit-learn. So, knowledge of basic python syntax is a must. At the end of the post, there is a list of all the resources to help you out.

So, here comes the time of a deep dive…!

What is Linear Regression?

Linear Regression is the problem of fitting the best possible line to the data. Woooaahh!!

Let us go through it step by step…

Why..?
Because we (Programmers) break down our work into smaller, simpler pieces
Step 1: Collect data
Step 2: Find a mapping from input data to labels, which gives minimum error… Magic of Mathematics!!
Step 3: Use this mapping to find label to new or unseen data-sets.

Wait! Don’t leave I know its MATHS, but we’ll make it easy.

How to find the Line of Best Fit?

A line that will fit the data in the best possible manner will be the line that has the minimum error. So, here is our objective – to minimize the error between the Actual Labels and Calculated Labels. And in order to accomplish it, we’ll be using one of the widely used algorithms in the world of ML – Gradient Descent.

So lets just dive right into the coding part:

Step 1: Collecting Data

For this example, we’ll be using NumPy to read data from an excel file – “data.csv”. The file contains data comparing the amount of hours a student studied and how much he or she scored.

def run():
#Step 1 - Collect the data
points = np.genfromtxt('data.csv', delimiter = ',')

Step 2: Finding The mapping

Our goal is to find a mapping between the hours studied and marks obtained, i.e.

y = mx + b

where,
x = amount of hours studied
y = marks obtained
m = slope of the line
b = y intercept

Well, this step has further sub-steps..

Step 2.1: Computing Error

To Compute Error at each iteration, we’ll be using the cost function(or objective function): def compute_error_for_line_given_points(b, m, points):
total_error = 0
for ip in range(len(points)):
x = points[ip, 0]
y = points[ip, 1]
total_error += (y - (m * x + b )) ** 2

Looking at the mapping, it is clear that the only possible parts we can play around with to reduce the error are- m(the slope of the line) and b(the y intercept).

Step 2.2: Setting Hyper-parameters

In ML, we have certain parameters to tune in our model over the data we are analysing. These parameters are known as Hyper-parameters and they define certain terms that make sure our function(or model) fits data on a particular rate and in a particular way. We have a whole bunch of hyper-parameters with their own importance. More on this will be in a future post.

In our case, the Hyper-parameters that we must take into consideration are the Learning Rate, initial value of ‘b’, initial value of ‘m’ and number of iterations. Learning Rate defines how fast should our model converge. By converge, it means how fast model reaches the global optimum(in this case global minimum).

def run():
#Step 1 - Collect the data
points = np.genfromtxt('data.csv', delimiter = ',')

#Step 2 - define hyperparameters
learning_rate = 0.0001
initial_b = 0
initial_m = 0
num_iterations = 1000

Well, then why isn’t Learning Rate big like a million?

The main reason being there must be a balance. If learning rate is too big, our error function might not converge(decrease), but if its too small, our error function might take too long to converge.

Step 2.3: Perform Gradient Descent

In this step, we will modify the value of y intercept ‘b’ and slope of line ‘m’ to decrease the value of the error (or objective function).

To do this, we will subtract the partial differentiation of our objective function with respect to b and m from former values of b and m respectively.

#Computes new values of b and m for a given iteration of gradient_descent_runner
def step_gradient(points, b_current, m_current, learning_rate):
N = float(len(points))
for i in range(len(points)):
x = points[i, 0]
y = points[i, 1]
b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
new_b = b_current - learning_rate * b_gradient
new_m = m_current - learning_rate * m_gradient
return [new_b, new_m]

#Finds the best fit line to the data.
def gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations):
b = initial_b
m = initial_m
for i in range(num_iterations):
b, m = step_gradient(points, b, m, learning_rate)
return [b, m]

Step 3: Visualizing

Here, we will use Matplotlib – an open source library to plot graph of our mapping over the scatter plot of data.

def run():
#Step 1 - Collect the data
points = np.genfromtxt('data.csv', delimiter = ',')

#Step 2 - define hyperparameters
learning_rate = 0.0001
initial_b = 0
initial_m = 0
num_iterations = 1000

#Step 3 - train our model
print('Starting gradient descent at b = {0}, m = {1}, error = {2}'
.format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
b, m = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)

print('Ending gradient descent at Iteration = {0} b = {1}, m = {2}, error = {3}'
.format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

#Step 4 - Visualization
x = [ix for ix in points]
y = [iy for iy in points]
y_predict = [m * ix + b for ix in x]

plt.figure(0)
plt.scatter(x, y)
plt.plot(x, y_predict)
plt.show() Output : Graph Displaying the Best Fit line over the Data

To see a visual comparison between decrease in error and line fitting – go to link.

GitHub link to Source Code

Resources for further reading:

PS: To get a good grip on the concept of Linear Regression visit this link.

Logistic Regression – Let’s Classify Things..!!

In my post on Categorising Deep Seas of ML, I introduced you to problems of Classification (a subcategory of Supervised Learning).

But wait..we are talking Logistic “Regression”. Blame history for it, but the only thing common between Logistic Regression and Regression is the word itself.

Logistic Regression Intuition

Consider a problem where you have to find the probability of a student being studious one or goofy one.
What can we do?
Data..key to every solution..yeah

So, we grade our test students on certain basis. Lets just say we rate them out of 10 on following points – study hrs/day(x1), attention in class(x2), interaction in class(x3), behaviour with peers(x4)..well for sake of simplicity lets just take these 4 features.

Representing this in a matrix Now, you might be thinking – Hey, I have seen probability in my mathematics class and I know it always lies in range [0, 1].

Hold your horse my friend. We are getting to that very step.

Activation Functions

The world of ML has taken many inputs from the field of Mathematics and perhaps this very part of activation function is taken entirely from the latter field.

A function used to transform the activation level of a unit(neuron) into an output signal is called an Activation Function.

Well, we’re going a little off from our topic here. But you can think of activation function as a function which provides us with the probability of our test being positive. In this case, it gives us the probability of a student being studious.

The function we’ll be using here is the Sigmoid function. Lets just have a look at the graph of the function. ‘z’ on x-axis vs. sigmoid(z) on y-axis

The graph clearly depicts that for any value of z, sigmoid function will return us a value between 0 and 1….MIND == BLOWN.. (“==” because for Programmers “=” != “==”).

So what’s left?
The only thing left for us to do is to define a mapping from our test data to z in sigmoid(z) and “minimize the error” in the mapping to get the best result.

“Minimize the error”..hmm..we have done something similar in Linear Regression too. Gradient Descent is the key.

So what’s our hypothesis? It will be nothing but We’ll calculate these coefficients by minimizing our cost function.

The very basic step of Gradient descent is to find a Cost Function. I know all these functions are getting on your nerve so lets just depict these by using a flow chart. Stick with me and we’ll make it easy. Flow Chart Depicting Logistic Regression.

Our Cost function here will be : Don’t worry you don’t have to memorise it. But lets just understand how this Cost function is implemented. Consider for our test data when y = 1 then our cost function is -log(h(theta)). The graph for the same is: h(theta) on x-axis vs. -log(h(theta)) on y-axis

This shows that as the value of calculated hypothesis goes from 0 to 1(required value) our cost function decreases.

Now, consider when y = 0 then our cost function is -log(1-h(theta)). The graph for same is: h(theta) on x-axis vs. -log(1-h(theta)) on y-axis

This shows that as the value of the calculated hypothesis goes from 1 to 0(required value) our cost function decreases. Pretty much what required.

Now, as we are familiar with our cost function lets just remember how Gradient Descent works. Simultaneously,  update for every Theta. Alpha being the Learning Rate.

Hmmmm..partial differentiation and our apparent hide and seek with it..Let me make your task easy. So, now we know how to get our coefficients tuned and how to run our gradient descent.

What about making predictions?
Well, that’s easy! A student with higher probability of being a studious one is of course more studious. But how will I compute it?. Deciding a threshold is upto you. For me.. a student with a probability greater than or equal to 0.5 works just fine. I am a little lenient I know. 😉

So, now you have it. Every tiny detail of logistic regression.

Now I’ve a task for you. I’ll be providing you with a dataset and you have to apply logistic regression on your own. No worries though, my next post will explain my way of logistic regression on the same dataset.

Explanation of dataset: The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’. When given the ‘rank’ of the college then the ‘admit’ shows whether the person is provided with the admit to the college(1) or not(0) provided he has a corresponding ‘gpa’ and ‘gre’ scores. Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to find whether a person will be admitted to college or not.