Genetic Algorithms – A Friendly Introduction

A few days back, I was going through a research paper and realised I had a new topic to cover and get my hands dirty with in the field of Machine Learning (ML).

Yeah, you guessed it right… it’s what we call Genetic Algorithms (GAs). To my surprise, I found very little content on GAs being used in ML. If you find some, do link it in the comments. I’d love to see others’ work on the same.

Without further ado, let’s just dig in and understand what GAs are all about.

In their documentation, MathWorks® defines genetic algorithms as:

The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution.

For the sake of simplicity, let’s just say GAs are methods (a set of algorithms) of search, often applied to optimization and learning.

While GAs are a set of algorithms, the bigger picture is that they belong to a much bigger family of algorithms known as Evolutionary Algorithms (EAs). So, what is so special about GAs? Well, their evolutionary analogy is what separates them from the others in the EA family. GAs follow the evolutionary analogy of Survival of the Fittest. And if you weren’t sleeping in your Biology classes, you know that this evolutionary process is what we see in nature.

Now that we have a clear picture of what GAs are, let me introduce you to the concept of the Search Space (a term constantly used in GAs).

Search Space

While solving a given problem, we generally look for a solution which is the best among a set of possible solutions.
Going by the definition:

The space of all feasible solutions (that is, the objects among which the desired solution lies) is called the search space (also state space).

By this definition, it is clear that every point in the search space represents a feasible solution. To visualize these solutions, we can mark each point with its value or fitness for the problem at hand. The solution to the problem will then be a point or set of points in the search space.

How do we look for a solution? Well, looking for a solution generally means looking for an extreme (minimum or maximum) in the search space.

This might be a little confusing, so let’s take an example. Consider the problem of finding a password which is 8 characters long and contains only lower-case alphabets. What do you think the plausible solutions to this problem are? Well, since I attended my Permutations and Combinations classes, I know that the search space contains 26^8 choices. If you are confused, just think of 8 blanks and the possible choices (26 lower-case alphabets) for each blank. The example is all good, but the thing to keep in mind is that in the case of a problem solved by a GA, we usually know a few points from the search space and generate other points as the process of finding the solution continues.
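Just to drive home how hopeless brute force is here, a quick back-of-the-envelope check in Python (nothing fancy, just the arithmetic from above):

search_space_size = 26 ** 8
print(search_space_size)  # 208827064576 -- over 200 billion candidate passwords

No wonder we need something smarter than trying them all one by one.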

Problem with search:
The problem is that search can be very COMPLICATED!
One does not know where to look for the solution or where to start.

To overcome this problem and find a suitable solution (which may not necessarily be the best solution), some nice guys came up with various methods, for example, hill climbing, simulated annealing, etc. Solutions found using these methods are generally considered good, because it is often not possible to prove what the real optimum is (who knows, maybe in the future we’ll be able to – check out the Quora answers on NP problems).

Before going ahead, we have to understand the concept of Chromosomes (don’t worry, I’m not gonna blabber about the whole concept and go deep into the world of those nuclei).

Chromosome
We all know that every living being, from the mosquito you killed with that spray to you, my friend, is made of cells. Now, these cells have a fascinating thing about them: every cell in an organism has the same set of chromosomes. For example, humans have a set of 23 pairs while gorillas have a set of 24 pairs of chromosomes (1 pair can make a hell of a lot of difference, guys).

Chromosomes, in their general form, are strings of DNA (I don’t want to go into detail about those ATGCs, though) and serve as a blueprint for the whole organism. Now, these chromosomes consist of genes (blocks of DNA).

Each gene (in certain terms) encodes a trait of the organism, for example, the color of the eyes. Now, if you aren’t living all by yourself, you know that individual organisms of the same species have many distinctions from each other. This is because of different settings of the traits encoded in the genes. The possible settings for a trait (for example blue eyes, brown hair, etc.) are called alleles. Each gene has a special position in the chromosome, which is referred to as its locus. The complete set of genetic material (all chromosomes) is called the genome.
The last two terms of this biology lecture (bear with me for 2 more minutes, guys):
So, we know every organism has this “internally coded, inheritable information” (its set of chromosomes), and we refer to it as the genotype. While these traits are passed on by the parents, we only see a few of them (the dominant ones). The actual observed properties of an organism are referred to as its phenotype.

That was a heck of a biology lecture, I know… you are a survivor, man.

[Image: chromosomes_table – quick-reference table for the terms above]

This table gives you a quick view whenever you need to check what we are referring to in the Methodology below.

Methodology of A Genetic Algorithm

In one line, we can explain the methodology as:

In a genetic algorithm, a population of candidate solutions (called individuals, creatures or phenotypes) to an optimization problem is evolved towards better solutions.

Going deeper, each of these candidate solutions has a set of properties (its genotype or chromosomes) which can be mutated and altered. The traditional way of representing a candidate solution is as a string of 0s and 1s (but other encodings are also possible).

Representation Example:
Bits or subsets of bits represent the choice of some features. For example, let us represent the choice of a smart TV using a 10-bit string with the following protocol:

[Image: representation_protocol – the 10-bit encoding protocol]
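To make the idea concrete, here is a minimal sketch of such an encoding in Python. The bit positions and feature names below are my assumptions for illustration only; the actual layout is whatever the protocol above defines:

# Hypothetical mapping from bit position to feature -- assumed for illustration,
# not the actual protocol from the image above.
FEATURES = ['samsung', 'led', 'size_46_49', 'smart',
            'threed', 'hdr', 'uhd_4k', 'curved', 'wifi', 'voice']

def decode(bits):
    # Return the set of features whose bit is switched on (set to '1').
    return {name for name, bit in zip(FEATURES, bits) if bit == '1'}

print(decode('0101110011'))  # -> led, smart, threed, hdr, wifi, voice (in some order)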

Continuing our discussion on Methodology…
Since we now understand how we can represent the solutions, let us understand how we go about making changes to them so as to reach a good solution.

In general terms, the evolution of a population of randomly generated individuals is an iterative process, with the population present in each iteration called a generation. In every iteration, the fitness of every individual in the population is evaluated (fitness is usually the value of the objective function in the optimization problem being solved). To form a new generation with a bit of randomness, we stochastically select the fitter individuals from the current population and then modify the genome of each selected individual (recombine and randomly mutate). This new generation is then used in the next iteration. Now, the question that might be popping up in your mind is: hey, you said it’s an iterative process, so when should it be brought to a halt?
Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
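Putting that whole loop into code, here is a minimal sketch of the flow (select, crossover and mutate are placeholders for the genetic operators we visit below; none of this is tied to any specific problem):

def genetic_algorithm(population, fitness, select, crossover, mutate,
                      max_generations=100, target_fitness=None):
    # Evolve the population until we run out of generations
    # or some individual reaches a satisfactory fitness level.
    for generation in range(max_generations):
        scores = [fitness(individual) for individual in population]
        if target_fitness is not None and max(scores) >= target_fitness:
            break
        new_population = []
        while len(new_population) < len(population):
            parent1 = select(population, scores)
            parent2 = select(population, scores)
            for child in crossover(parent1, parent2):
                new_population.append(mutate(child))
        population = new_population[:len(population)]
    # Return the best individual found in the final generation.
    return max(population, key=fitness)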

With power comes responsibility. GAs give you the power to get good optimum solutions for problems, but for every new problem to be solved using GAs, a typical GA requires:
1. A genetic representation of the solution domain (i.e., the genotype will be different for a different problem domain).
2. A fitness function to evaluate the solution domain (i.e., the fitness function can be different for different problem domains).
These two items must be developed again whenever a new problem is specified, as sketched below for our password example.
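For instance, for the 8-character password problem from earlier, one natural choice (assuming we can score a guess against the target, which is fine for a toy example) would be:

import random
import string

SECRET = 'genetics'  # the 8-letter password we are hunting for (toy example)

def random_individual():
    # Genetic representation: a string of 8 lower-case letters.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(8))

def fitness(individual):
    # Fitness: how many positions match the secret password.
    return sum(a == b for a, b in zip(individual, SECRET))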

General Criteria Followed By Genetic Algorithms

Every GA follows some basic criteria for computing solutions. Let us have a look at these steps.

[Image: general_GA.png – steps of a general genetic algorithm]

So, the questions that arise after giving the algorithm some thought are:
1. How do we select parents for crossover? Though this can be achieved in many ways, the main idea is to select the better parents (in the hope that better parents will produce better offspring).
2. Won’t generating the new population only from new offspring cause the loss of the best chromosomes from the last population? This is indeed true. To overcome this problem, a method named elitism is used. It simply means that at least the one best solution is copied without changes to the new population, so the best solution found can survive to the end of the run (see the sketch below).
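A minimal sketch of elitism, assuming the individuals of the old generation are still around to copy from (the function name is mine, not from any library):

def apply_elitism(old_population, new_population, fitness, elite_count=1):
    # Copy the best individual(s) of the old generation, unchanged,
    # into the new generation so the best solution found cannot be lost.
    elites = sorted(old_population, key=fitness, reverse=True)[:elite_count]
    return elites + new_population[:len(new_population) - elite_count]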

Looking at the algorithm, it is clear that three more operations are performed besides fitness evaluation – Selection, Mutation and Crossover. These are done using three genetic operators. Let us visit the territory of these genetic operators and have a closer look.

Genetic Operators

Genetic operators, as mentioned above, are the methods that perform Selection, Mutation and Crossover. And not so surprisingly, they are named Selection, Mutation and Crossover. We will now visit each of these one at a time.

Selection

To perform crossover and produce new offspring, we need to select parents (chromosomes). The problem with the selection process is how to select these chromosomes such that the best ones survive and create new offspring. The traditional way of doing so is Proportional Selection, i.e., parents are chosen to mate with probability proportional to their fitness. Also, in the traditional selection process, the children replace their parents. There are many methods for selecting the best chromosomes, for example, Roulette Wheel Selection, Boltzmann Selection, Tournament Selection, Rank Selection and Steady-State Selection (code for these will be provided with explanations in a future post).

Looking at the broader picture, one can clearly see the principle being used is Survival of the Fittest.
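As a small preview of that future post, here is a minimal sketch of Roulette Wheel Selection, the classic proportional scheme: each individual gets a slice of the wheel proportional to its fitness (this assumes non-negative fitness values):

import random

def roulette_wheel_select(population, fitnesses):
    # Spin a wheel where each individual's slice is proportional to its fitness.
    total = sum(fitnesses)
    spin = random.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= spin:
            return individual
    return population[-1]  # guard against floating-point round-off

(In modern Python, random.choices(population, weights=fitnesses, k=1)[0] performs the same weighted spin in one call.)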

Mutation

Mutation (reminds us of X-MEN, I know… oh, Logan) is one of the simplest processes to explain in a GA. This operator works on a single parent at a time to produce an offspring. Think of it as a magic wand that can change one (or more) bits in a binary representation, where each bit has the same probability of mutation. Let us recall our previous example string.
Example String:  0101110011 (Represents a Samsung LED Smart TV in range 46″-49″ with 3D and HDR capabilities but no 4K UHD support and no curved display)
Could Mutate to: 0101111011 (Represents a Samsung LED Smart TV in range 46″-49″ with 3D and HDR capabilities along with 4K UHD support but no curved display)
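In code, a bit-flip mutation over such strings might look like this minimal sketch (the per-bit mutation probability of 0.1 is an arbitrary assumption; it is a knob you tune):

import random

def mutate(bitstring, p_flip=0.1):
    # Flip each bit independently with probability p_flip.
    return ''.join(
        ('1' if bit == '0' else '0') if random.random() < p_flip else bit
        for bit in bitstring
    )

print(mutate('0101110011'))  # e.g. '0101111011' if only the 4K UHD bit flips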

Oh God… if only my algorithm could really mutate my smart TV!
It might or might not happen in the future, but we must keep going (with my broken heart, though).

Crossover

This, my boy, is how you were born (not exactly, though). I guess I must put a halt to my lame jokes now. So, this operator, as you might have guessed, works on two parents selected by the Selection operator to produce one or two offspring. Classical crossover occurs at one or two points in the selected parents.

Example of generating offspring using 1 point crossover:

[Image: point_1_cross – 1-point crossover example]

Looking at the image, it is clear that we just swapped the sub-strings on either side of the point of crossover.

Example of generating offspring using 2 point crossover:

[Image: point_2_cross – 2-point crossover example]

Just like in 1-point crossover, in 2-point crossover we swap the sub-strings enclosed between the two points of crossover.
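Both variants take only a couple of lines of Python. Here is a minimal sketch where the crossover points are chosen at random (returning both children, though returning just one is also common):

import random

def one_point_crossover(p1, p2):
    # Swap everything after a single randomly chosen point.
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2):
    # Swap the segment enclosed between two randomly chosen points.
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]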

With this, I think you guys have a pretty good idea of how GAs work. Giving our minds some rest for now, I’ll be back with a post on a very simple GA along with an example problem to apply it to.


Logistic Regression – Hands on Experience

In my last post, I gave you theoretical knowledge of how Logistic Regression works. Along with the intuition, I provided you with a dataset so you could apply the theory on your own first. In this post, I’ll explain my approach to getting a working model for the dataset I provided.

Let us first revisit and have a deeper look at our dataset… because duhh… data is what matters the most for making predictions (well, that’s a topic for another day).

Dataset Explained.

The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’.

Where,
‘admit’ – represents whether a given student was admitted to the college (1) or not (0).
‘rank’ – represents the ranking of the college.
‘gpa’ (or Grade Point Average) – represents the student’s GPA in their previous academics.
‘gre’ (or Graduate Record Examination) – represents the student’s marks in the GRE exam.

Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to predict whether a student will be admitted to the college or not.

If you haven’t yet read my last post, I recommend doing so before going ahead.

I strongly recommend trying to code it yourself before looking at my solution.

Libraries used

We’ll be using Pandas for data extraction and NumPy for matrix operations, so make sure you have a little knowledge of both. I’ll try my best to explain all the minor details, though.

So, grab your cup of coffee and follow along. It is going to be a bit of a long ride.

Step 1 – Setting Up

We’ll be making a function called run, the main purpose of which is to set up our data and call all the functions that perform gradient descent.

import numpy as np
import pandas as pd

def run():
    # Collect data
    dataframe = pd.read_csv('binary.csv', sep = ',')
    data = dataframe.values  # convert the dataframe to a NumPy matrix
    y = data[:, 0]   # labels: 'admit'
    X = data[:, 1:]  # features: 'rank', 'gpa' and 'gre'
    # Scale features so they all lie in the range [0, 1]
    X = feature_scaling(X)
    # Step 2 - define hyperparameters
    learning_rate = 3
    num_iters = 1000
    initial_theta = np.random.random(3)  # one weight per feature
    # Train our model
    print('Starting gradient descent at theta = {0}, error = {1}, accuracy = {2}'
    .format(initial_theta, compute_error_for_separator_given_data(initial_theta, X, y), accuracy(initial_theta, X, y)))
    theta = gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters)
    print('Ending gradient descent at Iteration = {0} theta = {1}, error = {2}, accuracy = {3}'
    .format(num_iters, theta, compute_error_for_separator_given_data(theta, X, y), accuracy(theta, X, y)))

Here, pd.read_csv() takes two arguments for our purposes: first the file name, and second the separator used in the file (for those of you who don’t know, .csv means comma-separated values). The function returns a dataframe. Converting this dataframe to a matrix (via .values) makes it easy to operate on the data. In the next two lines, we separate our labels (‘admit’) from the features (‘rank’, ‘gpa’ and ‘gre’).

Rest of the functions will be covered one at a time.

Step 2 – Feature Scaling

From the heading itself, it’s clear that we are targeting the black box we know nothing about yet: the feature_scaling call inside the run function.

While I’ll be discussing this topic in full depth in a future post, let’s look at what it means to scale the features here. Looking at the data, it is clear that the ‘gre’ marks are in the hundreds while both other features are in the units place. If this variation across features is reduced, we can reach the global optimum more easily and effectively. For this purpose, we will be using what we call Normalization, which ensures that all the values are in the range of 0 to 1.

The formula we’ll be using here is:

x_scaled = (x - min(x)) / (max(x) - min(x))

def feature_scaling(X):
    # Normalize each feature column to the range [0, 1].
    for i in range(X.shape[1]):
        X[:, i] = (X[:, i] - min(X[:, i])) / (max(X[:, i]) - min(X[:, i]))
    return X

This could have also been done using NumPy as

def feature_scaling(X):
    return (X - np.min(X, axis = 0)) / (np.max(X, axis = 0) - np.min(X, axis = 0))

Here, the second keyword argument, as the name suggests, decides which axis to find the minimum or maximum along. Not using it would return the minimum or maximum value of the complete matrix.
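A tiny demonstration of the difference on a made-up 2x3 matrix:

import numpy as np

A = np.array([[1, 200, 3],
              [4,  50, 6]])
print(np.min(A, axis = 0))  # column-wise minima: [ 1 50  3]
print(np.min(A))            # global minimum: 1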

Step 3 – Calculating Error

Before calculating the error, we need to define 2 more functions:
1. Sigmoid
2. Predict absolute value

Sigmoid
If you read my last post, then you definitely know what the sigmoid function is and how we are going to use it here. So, without going into details, let’s jump to the code.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Here, -z multiplies each element of z by -1 (to be precise, it just flips the sign bit). Then, np.exp(-z) applies the element-wise exponential function to the negated z. The addition and division shown here also work element-wise.
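A quick sanity check of that element-wise behaviour, using the function we just defined:

z = np.array([-1.0, 0.0, 1.0])
print(sigmoid(z))  # approximately [0.269  0.5  0.731]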

Predict absolute value
Let us jump directly to the code.

def predict_abs(theta, X):
    return sigmoid(np.dot(X, theta))

Here, np.dot(X, theta) performs a simple matrix-vector multiplication, giving one score per student; the sigmoid then squashes each score into a probability.
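One function that run() calls but that we haven’t written down in this post is accuracy. Here is a minimal sketch of what it could look like, assuming the usual convention of thresholding the predicted probability at 0.5 (check the source code linked below for the exact version I used):

def accuracy(theta, X, y):
    # Threshold predicted probabilities at 0.5 and compare against the labels.
    preds = predict_abs(theta, X) >= 0.5
    return np.mean(preds == y)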

Now, since our base work is done, let us write our error-computing function. As we already know, we use the mean squared error formula for this purpose, i.e.,

Error(theta) = (1/N) * Σ (y_i - prediction_i)^2

def compute_error_for_separator_given_data(theta, X, y):
    # Mean squared error between the labels and the predicted probabilities.
    preds = predict_abs(theta, X)
    total_error = np.sum((y - preds) ** 2)
    return total_error / float(len(y))

We’ve scaled our features and computed the error. The only thing left to do is to run gradient descent on our data with our hyper-parameters.

Step 4 – Gradient Descent

From my last article we know that gradient descent works as:

theta := theta - learning_rate * ∂Error/∂theta    (repeated num_iters times)

Ummm… there is surely a call for separating things out to make the code easier to implement. So we will tackle each step of Gradient Descent using a function called step_gradient.

def step_gradient(theta_current, X, y, learning_rate):
    # Predicted probabilities for the current theta
    preds = predict_abs(theta_current, X)
    # Gradient of the error with respect to theta
    theta_gradient = -(2 / float(len(y))) * np.dot(X.T, (y - preds))
    # Move theta a small step against the gradient
    theta = theta_current - learning_rate * theta_gradient
    return theta

Going line by line: first we compute the absolute predictions from the given theta values and feature values. Then, using them, we compute the partial derivative of our cost function w.r.t. theta, and hence in the third line we find the new value of theta for the current iteration of Gradient Descent. (Strictly speaking, differentiating the squared error through the sigmoid would add an extra preds * (1 - preds) factor; the simpler expression used here matches the gradient of the cross-entropy loss up to a constant factor, and it works well in practice.)
Pretty Neat..!

So the only thing left is to iterate this function the number of times specified in our hyper-parameters.

def gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters):
    theta = initial_theta
    # Repeat the gradient step for the requested number of iterations
    for i in range(num_iters):
        theta = step_gradient(theta, X, y, learning_rate)
    return theta

Well, I don’t think this last piece of code needs any explanation though.

So, we are finally here…
We first set up our parameters and data.
Then we set up our probability-calculating function – Sigmoid.
After that, we hit the rock bottom of calculating the error.
Finally, we compute the new theta values from the initial ones using Gradient Descent.

Important Links:
Source Code: Github Link
Gradient Descent: Youtube Link
Part 1 of this Article: Logistic Regression – Let’s Classify Things..!!

PS: There might be many implementations far better than this. It would be great to see people raising issues in my repository on GitHub. After all, we are all here to learn. And yeah, this article was quite delayed; I’m sorry, I was really busy with some meetups and examinations. Many new articles are coming this month.
