Single Layer NN using TensorFlow

The following code classifies the MNIST dataset using a single-layer NN with a softmax activation function, a cross-entropy loss function and the mini-batch technique.

Softmax Function

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K}e^{z_k}}

Cross Entropy

Cross Entropy = -\sum_{i=1}^{n} Y'_i \log(Y_i)
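
To make both formulas concrete, here is a minimal NumPy sketch (an illustration only, not part of the notebook's graph) computing the softmax of some made-up logits and the cross-entropy against a one-hot label:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # made-up logits
y_true = np.array([1.0, 0.0, 0.0])   # one-hot correct label
y_pred = softmax(z)                  # probabilities, summing to 1
loss = -np.sum(y_true * np.log(y_pred))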

Mini-batch Technique

Taking a batch of 100 images in a single iteration. Two reasons to use it:

  1. Analyzing a single image at a time results in a curvy, noisy descent. Knowledge of 100 images at a time gives a more precise consensus on the gradient.
  2. We do distributed processing using GPUs, and matrix multiplications over batches are exactly what GPUs are optimised for.
In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline

Load Data

The MNIST dataset is a dataset of handwritten digits. We download it via the TensorFlow examples module.

Make sure to change the path according to your needs.

In [2]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("tutorials/data/MNIST/", one_hot=True)
print("Extraction of images is complete.")
Extracting tutorials/data/MNIST/train-images-idx3-ubyte.gz
Extracting tutorials/data/MNIST/train-labels-idx1-ubyte.gz
Extracting tutorials/data/MNIST/t10k-images-idx3-ubyte.gz
Extracting tutorials/data/MNIST/t10k-labels-idx1-ubyte.gz
Extraction of images is complete.

TensorFlow Placeholders

TensorFlow placeholders are like variables waiting for input. They are access points to the computational graph, through which we feed data into the graph.

In [3]:
X = tf.placeholder(tf.float32, [None, 784])   # input: flattened 28x28 images, batch size left open
W = tf.Variable(tf.zeros([784, 10]))          # trainable weights
b = tf.Variable(tf.zeros([10]))               # trainable biases
init = tf.global_variables_initializer()

Model

Our model is a softmax function applied over the weighted sum of the image pixels, with a bias added to it.
Y = softmax(X.W + b)
Where,

X = flattened image vector (1, 784)

W = weights matrix of shape (784, 10) -> one row of weights per pixel (784 rows) and one column per class [0-9]

b = bias vector with 10 columns, each column representing a class

Role of Mini-batch

Since we are using the mini-batch technique, X will be a matrix with 100 rows, each row representing one flattened image.

In [4]:
Y = tf.nn.softmax(tf.matmul(X, W) + b)

Placeholder for correct Label

We need a placeholder to hold the correct labels, which will help us compute the accuracy and cross-entropy of our model.

In [5]:
Y_ = tf.placeholder(tf.float32, [None, 10])

Loss Function

As discussed above, we are using the cross-entropy loss function:
Cross Entropy = -\sum_{i=1}^{n} Y'_i \log(Y_i)

In [6]:
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))  # summed (not averaged) over all examples fed in

% of correct answers found in batch

Graph nodes to compute the accuracy of our model

In [7]:
is_correct = tf.equal(tf.argmax(Y_, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

Optimizer

We are using the simplest Gradient Descent technique as the optimizer, with a learning rate of 0.003. This means that at every step we move the weights by 0.3% of the gradient, in the downhill direction.

Why Gradient?

Gradients have a useful property: the negative gradient points in the direction of steepest decrease of the loss. Following it changes the weights so as to reach a minimum of the loss function.
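
As a toy illustration (made-up numbers, not the notebook's code), a single step with this learning rate looks like:

w = 0.5                 # current weight
grad = 2.0              # gradient of the loss w.r.t. w
w = w - 0.003 * grad    # step 0.3% of the gradient downhill -> 0.494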

In [8]:
optimizer = tf.train.GradientDescentOptimizer(0.003)
train_step = optimizer.minimize(cross_entropy)

Session

According to the definition in the TensorFlow documentation:

A Session object encapsulates the environment in which Operation objects are executed, and Tensor objects are evaluated.

Well, we can say this is a kind of main function for our computational graph. That is, it starts the execution of the graph over the computational nodes we added.

In [9]:
sess = tf.Session()
sess.run(init)

Training

Time to train the system. We run 2000 iterations of the train step and compute the accuracy and cross-entropy at each step. Alongside, we compute the accuracy and cross-entropy on the test set at each iteration, to see the improvement on unseen data.

In [10]:
# no. of iterations
n_iter = 2000

# test set
test_data = {X: mnist.test.images, Y_: mnist.test.labels}

# lists to hold train accuracy and cross-entropy
acc_train_li = []
cross_train_li = []

# lists to hold test accuracy and cross-entropy
acc_test_li = []
cross_test_li = []

for i in range(n_iter):
    # load batch of images and correct answer
    batch_X, batch_Y = mnist.train.next_batch(100)
    train_data = {X: batch_X, Y_: batch_Y}
    
    # train
    sess.run(train_step, feed_dict=train_data)
    
    # find accuracy and cross entropy on current data
    a, c = sess.run([accuracy, cross_entropy], feed_dict=train_data)
    acc_train_li.append(a)
    cross_train_li.append(c)
    
    # find accuracy and cross entropy on test data
    a, c = sess.run([accuracy, cross_entropy], feed_dict=test_data)
    acc_test_li.append(a)
    cross_test_li.append(c)

Plot the graph

We plot 2 graphs:

  1. Accuracy graph – To display the accuracy on the train data and test data
  2. Cross-entropy graph – To display the cross-entropy loss on train and test data
In [11]:
x = list(range(n_iter))

blue_patch = mpatches.Patch(color='blue', label='Train Data')
red_patch = mpatches.Patch(color='red', label='Test Data')

plt.figure(0, figsize=(10, 12))

plt.subplot(211)
plt.title("Accuracy")
plt.legend(handles=[blue_patch, red_patch])
plt.plot(x, acc_train_li, color='blue')
plt.plot(x, acc_test_li, color='red')

plt.subplot(212)
plt.legend(handles=[blue_patch, red_patch])
plt.title("Cross-Entropy Loss")
plt.plot(x, cross_train_li, color='blue')
plt.plot(x, cross_test_li, color='red')

plt.show()
[Figure: accuracy and cross-entropy curves for train (blue) and test (red) data]

Final Loss and Accuracy

Let’s have a peek at the final loss and accuracy of the training and test sets

In [12]:
print('Train Set Accuracy: {} \t Train Set cross-entropy Loss: {}'.format(acc_train_li[-1], cross_train_li[-1]))
print('Test Set Accuracy: {} \t Test Set cross-entropy Loss: {}'.format(acc_test_li[-1], cross_test_li[-1]))
Train Set Accuracy: 0.9200000166893005 	 Train Set cross-entropy Loss: 25.540407180786133
Test Set Accuracy: 0.921000063419342 	 Test Set cross-entropy Loss: 2800.803466796875

Conclusion

Using a single-layer neural network resulted in an accuracy of approximately 92%. Considering the situation of using this system in a post office to detect handwritten digits, that can be devastating.
Why? Because, according to our findings, it will misinterpret 8 out of every 100 digits (92% accuracy).

Code available at this repository

Found an issue? Or just want to say hello? You can contact me on my website

Genetic Algorithms – A Friendly Introduction

A few days back, I was going through one of the research papers and realised I have a new topic to cover and get my hands dirty with in the field of Machine Learning (ML).

Yeah, you guessed it right... it's what we call Genetic Algorithms (GAs). To my surprise, I found very little content on GAs being used in ML. If you find some, do link it in the comments. I'd love to see others' work on the same.

Without further ado, let's just dig in and understand what GAs are all about.

In their documentation, MathWorks® defines Genetic Algorithms as:

The genetic algorithm is a method for solving both constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution.

For the sake of simplicity, let's just say GAs are methods (a set of algorithms) of search, often applied to optimization and learning.

While GAs are a set of algorithms, the bigger picture is that they are encapsulated by a much bigger set of algorithms known as Evolutionary Algorithms (EAs). So, what is so special about GAs? Well, their evolutionary analogy is what separates them from the others in the same set of EAs. GAs follow the evolutionary analogy of Survival of the Fittest. And if you weren't sleeping in your Biology classes, you know that this evolutionary process is what we see in nature.

Now that we have a clear picture of what GAs are, let me introduce you to the concept of Search Space (a term constantly used with GAs).

Search Space

While solving a given problem, we generally look for the best solution among a set of possible solutions.
Going by the definition:

The space of all feasible solutions (it means objects among those the desired solution is) is called search space (also state space).

By the definition, it is clear that every point in the search space refers to a feasible solution. To visualize these solutions, we can mark each point with its value or fitness for the problem at hand. The solution to the problem will be a point or set of points from the search space.

How do we look for a solution? Well, looking for a solution generally follows the process of looking for an extreme (minimum or maximum) in the search space.

This might be a little confusing, so let's take an example. Consider the problem of finding a password which is 8 characters long and contains only lower-case alphabets. What do you think the plausible solutions to this problem are? Well, since I attended my Permutations and Combinations classes, I know that the search space contains 26^8 choices. If you are confused, just think of 8 blanks and the possible choices (26 lower-case alphabets) for each blank (see the quick check below). The example is all good, but the thing to keep in mind is that in the case of a problem solved by a GA, we usually know a few points from the search space and generate other points as the process of finding a solution continues.
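
A quick sanity check of that count in Python:

26 ** 8   # 8 positions, 26 choices each -> 208827064576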

Problem with search:
The problem is that the search can be very COMPLICATED!
One does not know where to look for the solution or where to start.

To overcome this problem and find a suitable solution (which may not necessarily be the best one), some nice guys came up with various methods, for example, hill climbing, simulated annealing, etc. Solutions found using these methods are generally considered good, because it is often not possible to prove what the real optimum is (who knows, maybe in the future we'll be able to – check out these Quora answers on NP problems).

Before going ahead, we have to understand the concept of Chromosomes (don't worry, I'm not gonna blabber about the whole concept and go deep into the world of those nuclei).

Chromosome
We all know that every living being, from the mosquito you killed with that spray to you, my friend, is made of cells. Now, these cells have a fascinating thing about them: every cell in an organism has the same set of chromosomes. For example, humans have a set of 23 pairs while gorillas have a set of 24 pairs of chromosomes (1 pair can make a hell of a lot of difference, guys).

Chromosomes, in general form, are strings of DNA (I don't want to go into detail about those ATGCs, though) and serve as a model for the whole organism. These chromosomes consist of genes (blocks of DNA).

Each gene (in certain terms) encodes a trait of the organism, for example, eye color. Now, if you aren't living all by yourself, you know that individual organisms of the same species have many distinctions from each other. This is because of different settings of the traits encoded in the genes. The possible settings for a trait (for example, blue eyes, brown hair, etc.) are called alleles. Each gene has a special position in the chromosome, which is referred to as its locus. The complete set of genetic material (all the chromosomes) is called the genome.
The last two terms of this biology lecture (bear with me for 2 more minutes, guys):
So, we know every organism has this "internally coded, inheritable information" (its set of chromosomes), and we refer to it as the genotype. While these traits are passed on by the parents, we only see a few of them (the dominant ones). The actually observed properties of an organism are referred to as its phenotype.

That was a heck of a biology lecture I know…you are a survivor man.

[Table: quick reference for the biology terms above – chromosome, gene, allele, locus, genome, genotype, phenotype]

This table gives a quick view whenever you need to check what we are referring to in the Methodology below.

Methodology of A Genetic Algorithm

In one liner, we can explain the methodology used as,

In a genetic algorithm, a population of candidate solutions (called individuals, creatures or phenotypes) to an optimization problem is evolved towards better solutions.

Going deeper, each of these candidate solutions has a set of properties (its genotype or chromosomes) which can be mutated and altered. The traditional way of representing a candidate solution is to use binary numbers to form strings of 0s and 1s (but other encodings are also possible).

Representation Example:
Bits or subsets of bits represent the choice of some features. For example, let us represent the choice of a smart TV using a 10-bit string with the following protocol:

[Table: 10-bit representation protocol mapping bits to smart TV features]

Continuing our discussion on Methodology...
Since we now understand how we can represent the solutions, let us understand how we go about making changes to them so as to reach a good solution.

In general terms, the evolution of a population of randomly generated individuals is an iterative process, with the population in each iteration called a generation. In every iteration, the fitness of every individual in the population is evaluated (fitness is usually the value of the objective function in the optimization problem being solved). To form a new generation with a bit of randomness, we stochastically select the more fit individuals from the current population and then modify the genome (recombine and randomly mutate) of each individual. This new generation is then used in the next iteration. Now, the question which might be popping into your mind is: Hey, you said it's an iterative process, so when should it be brought to a halt?
Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.

With power comes responsibility. GAs give you the power to get good optimum solutions for problems, but for every new problem to be solved using GAs, a typical GA requires:
1. A genetic representation of the solution domain (i.e., the genotype will be different for a different problem domain).
2. A fitness function to evaluate the solution domain (i.e., the fitness function can be different for different problem domains).
These two items must be developed again whenever a new problem is specified.
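
To tie the methodology together, here is a minimal sketch of that loop in Python; fitness, select, crossover and mutate are placeholders for the problem-specific pieces described above, not a fixed API:

def genetic_algorithm(fitness, select, crossover, mutate,
                      population, max_generations, target_fitness):
    for generation in range(max_generations):
        scores = [fitness(ind) for ind in population]   # evaluate the whole generation
        if max(scores) >= target_fitness:               # satisfactory fitness reached
            break
        next_generation = []
        while len(next_generation) < len(population):
            # stochastically pick fitter parents, recombine, then mutate
            mom = select(population, scores)
            dad = select(population, scores)
            next_generation.append(mutate(crossover(mom, dad)))
        population = next_generation
    return max(population, key=fitness)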

General Criteria Followed By Genetic Algorithms

Every GA follows a basic set of steps for computing solutions. Let us have a look at these steps.

[Figure: general flow of a genetic algorithm – initialise population, evaluate fitness, select, crossover, mutate, repeat until termination]

So, the questions that arise after giving the algorithm some thought are:
1. How do we select parents for crossover? Though this can be achieved in many ways, the main idea is to select the better parents (in the hope that better parents will produce better offspring).
2. If the new population is generated only from new offspring, doesn't this cause the loss of the best chromosomes from the last population? This is indeed true. To overcome this problem, a method named elitism is used. It simply means that at least one best solution is copied without changes to the new population, so the best solution found can survive to the end of the run.

Looking at the algorithm, it is clear that three operations other than fitness evaluation are being performed – Selection, Mutation and Crossover. These are done using three genetic operators. Let us visit the territory of these genetic operators and have a closer look.

Genetic Operators

Genetic operators, as mentioned above, are methods to perform Selection, Mutation and Crossover. And not so surprisingly, they are named as Selection, Mutation and Crossover. We will now visit each of these one at a time.

Selection

To perform crossover and produce new offspring, we need to select parents (chromosomes). The problem that lies in the selection process is how to select these chromosomes such that the best ones survive and create new offspring. The traditional way of doing so is to use Proportional Selection, i.e., parents are chosen to mate with probability proportional to their fitness. Also, in the traditional selection process, the children replace their parents. There are many methods for selecting the best chromosomes, for example, Roulette Wheel Selection, Boltzmann Selection, Tournament Selection, Rank Selection and Steady-State Selection (code for these will be covered with explanations in a future post).
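
As a concrete example of proportional selection, a minimal roulette wheel sketch (assuming non-negative fitness values) could look like:

import random

def roulette_wheel_select(population, fitnesses):
    # each chromosome gets a slice of the wheel proportional to its fitness
    pick = random.uniform(0, sum(fitnesses))
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chromosome
    return population[-1]   # guard against floating-point round-off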

Looking at the broader picture, one can clearly see the principle being used is Survival of the Fittest.

Mutation

Mutation (reminds us of X-MEN, I know... oh, Logan) is one of the simplest processes to explain in a GA. This operator operates on a single parent at a time to produce an offspring. Think of it as a magic wand that can change one (or more) bits in a binary representation, where each bit has the same probability of mutation. Let us recall our previous example string.
Example String:  0101110011 (represents a Samsung LED Smart TV in range 46″-49″ with 3D and HDR capabilities but no 4K UHD support and no curved display)
Could mutate to: 0101111011 (represents a Samsung LED Smart TV in range 46″-49″ with 3D and HDR capabilities along with 4K UHD support but no curved display)
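
A bit-flip mutation like the one above can be sketched in a few lines of Python (the mutation_rate value is my own choice for illustration):

import random

def mutate(bits, mutation_rate=0.1):
    # flip each bit independently with the same small probability
    return ''.join(('1' if b == '0' else '0') if random.random() < mutation_rate else b
                   for b in bits)

mutate('0101110011')   # might, with some luck, return '0101111011'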

Oh God... if only my algorithm could really mutate my smart TV!
It might or might not happen in the future, but we must keep going (with my broken heart, though).

Crossover

This, my boy, is how you were born (not exactly, though). I guess I must put a halt to my lame jokes now. So, this operator, as you might have guessed, operates on two parents selected by the Selection operator to produce one or two offspring. Classical crossover occurs at one or two points in the selected parents.

Example of generating offspring using 1 point crossover:

[Figure: one-point crossover example]

Looking at the image, it is clear that we just swapped the sub-strings on either side of the point of crossover.

Example of generating offspring using 2 point crossover:

[Figure: two-point crossover example]

Just like in one-point crossover, in two-point crossover we swap the sub-strings enclosed between the two points of crossover.
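
Both variants come down to plain string slicing; a sketch in Python (crossover points chosen at random):

import random

def one_point_crossover(mom, dad):
    p = random.randint(1, len(mom) - 1)            # single crossover point
    return mom[:p] + dad[p:], dad[:p] + mom[p:]

def two_point_crossover(mom, dad):
    p1, p2 = sorted(random.sample(range(1, len(mom)), 2))   # two distinct points
    return (mom[:p1] + dad[p1:p2] + mom[p2:],
            dad[:p1] + mom[p1:p2] + dad[p2:])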

With this, I think you guys have a pretty good idea of how GAs work. Giving some rest to our minds, I'll be back with a post on a very simple GA applied to an example problem.


Logistic Regression – Hands on Experience

In my last post, I gave you theoretical knowledge of how Logistic Regression works. Along with the intuition, I provided a dataset so you could apply the theory on your own first. In this post, I'll explain my approach to getting a working model for the dataset I provided.

Let us first revisit and have a deeper look into our dataset... because, daahh, data is what matters the most for making predictions (well, that's a topic for another day).

Dataset Explained.

The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’.

Where,
‘admit’ – represents whether a given student is admitted to the college (1) or not (0).
‘rank’ – represents the ranking of the college.
‘gpa’ (or Grade Point Average) – represents the student’s GPA from previous academics.
‘gre’ (or Graduate Record Examination) – represents the student’s marks in the GRE exam.

Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to predict whether a person will be admitted to the college or not.

If you haven’t yet read my last post, I recommend doing it before going ahead.

I strongly recommend trying to code it yourself before looking at my solution.

Libraries used

We’ll be using Pandas for data extraction and NumPy for matrix operations, so make sure you have a little knowledge of both. That said, I’ll try my best to explain all the minor details.

So, grab your cup of coffee and follow along. It is going to be a bit of a long ride.

Step 1 – Setting Up

We'll be making a function called run, the main purpose of which is to set up our data and call all the functions that perform gradient descent.

def run():
    #Collect data
    dataframe = pd.read_csv('binary.csv', sep = ',')
    data = dataframe.as_matrix()
    y = data[:, 0]
    X = data[:, 1:]
    #Scale Data
    X = feature_scaling(X)
    #Step 2 - define hyperparameters
    learning_rate = 3
    num_iters = 1000
    initial_theta = np.random.random(3)
    #train our model
    print('Starting gradient descent at theta = {0}, error = {1}, accuracy = {2}'
    .format(initial_theta, compute_error_for_separator_given_data(initial_theta, X, y), accuracy(initial_theta, X, y)))
    theta = gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters)
    print('Ending gradient descent at Iteration = {0} theta = {1}, error = {2}, accuracy = {3}'
    .format(num_iters, theta, compute_error_for_separator_given_data(theta, X, y), accuracy(theta, X, y)))

In line 3, pd.read_csv() takes two arguments for our purpose: first the file name, and second the separator used in the file (for those of you who don't know, .csv means comma-separated values). The function returns a dataframe. Converting this dataframe to a matrix (as in line 4) makes it easier to operate on the data. In the next two lines, we separate our labels ('admit') from the features ('rank', 'gpa' and 'gre').

The rest of the functions will be covered one at a time.

Step 2 – Feature Scaling

From the heading itself it's clear that we are targeting the black box we know nothing about at line 8 of the run function.

While I'll be discussing this topic in full depth in a future post, let's look at what it means to scale the features here. Looking at the data, it is clear that 'gre' marks are in the hundreds while both other features are in the units place. If this variation across features is reduced, we can reach the global optimum more easily and effectively. For this purpose, we will use what we call Normalization. Doing this ensures that all the values are in the range 0 to 1.

The formula we'll be using here is:

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}

def feature_scaling(X):
    for i in range(X.shape[1]):
        X[:, i] = (X[:, i] - min(X[:, i])) / (max(X[:, i]) - min(X[:, i]))
    return X

This could have also been done using NumPy as

def feature_scaling(X):
    return (X - np.min(X, axis = 0)) / (np.max(X, axis = 0) - np.min(X, axis = 0))

where the second keyword argument, as its name suggests, decides which axis to compute the minimum or maximum along. Not using it returns the minimum or maximum over the complete matrix.
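
For instance (a toy matrix, not our dataset):

import numpy as np

X_demo = np.array([[1, 200],
                   [3, 600]])
np.min(X_demo, axis=0)   # column-wise minimums -> array([  1, 200])
np.min(X_demo)           # minimum over the whole matrix -> 1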

Step 3 – Calculating Error

Before calculating the error, we need to define 2 more functions:
1. Sigmoid
2. Predict absolute value

Sigmoid
If you read my last post, then you definitely know what the sigmoid function is and how we are going to use it here. So, without going into details, let's jump to the code.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Here, -z multiplies each element of z by -1 (to be precise, it just flips the sign bit). Then, np.exp(-z) applies the element-wise exponential function to the negated z. The addition and division shown here also work element-wise.

Predict absolute value
Let us just directly jump to the code.

def predict_abs(theta, X):
    return sigmoid(np.dot(X, theta))

Here, np.dot(X, theta) performs simple matrix multiplication.

Now, since our groundwork is done, let us write our error-computing function. As we already know, we use the mean squared error formula for this, i.e.,

Error = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2

def compute_error_for_separator_given_data(theta, X, y):
    preds = predict_abs(theta, X)
    total_error = np.sum((y - preds) ** 2)
    return total_error / float(len(y))
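
Note that run() also calls an accuracy helper that never appears in the post; a minimal sketch, assuming a 0.5 decision threshold, could be:

def accuracy(theta, X, y):
    # classify as admitted when the predicted probability crosses 0.5
    preds = predict_abs(theta, X) >= 0.5
    return np.mean(preds == y)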

We've scaled our features and we've computed the error. The only thing left to do is to run gradient descent on our data with our hyper-parameters.

Step 4 – Gradient Descent

From my last article we know that gradient descent works as:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Ummm... there is surely a case for separating things out to make the code easier to implement. So we will tackle each step of gradient descent using a function called step_gradient.

def step_gradient(theta_current, X, y, learning_rate):
    preds = predict_abs(theta_current, X)
    theta_gradient = -(2 / len(y)) * np.dot(X.T, (y - preds))
    theta = theta_current - learning_rate * theta_gradient
    return theta

Going line by line: first, we compute the absolute predictions from the given theta values and feature values. Then, using them, we compute the partial derivative of our cost function w.r.t. theta, and hence, in the third line, we find the new value of theta for the current iteration of gradient descent.
Pretty neat..!

So the only thing left is to iterate over this function a certain number of times as mentioned in our hyper-parameters.

def gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters):
    theta = initial_theta
    for i in range(num_iters):
        theta = step_gradient(theta, X, y, learning_rate)
    return theta

Well, I don’t think this last piece of code needs any explanation though.

So, we are finally here..
We first set up our parameters and data.
Then we set up our probability-calculating function – Sigmoid.
After that, we hit the rock bottom of calculating the error.
In the aftermath, we calculated new theta values from the initial values using Gradient Descent.

Important Links:
Source Code: Github Link
Gradient Descent: Youtube Link
Part 1 of this Article: Logistic Regression – Let’s Classify Things..!!

PS: There might be many implementations far better than this. It would be great to see people raising issues in my repository on GitHub. After all, we are all here to learn. And yeah, I'm sorry this article was so delayed; I was really busy with some meetups and examinations. Many new articles are coming this month.


Logistic Regression – Let’s Classify Things..!!

In my post on Categorising Deep Seas of ML, I introduced you to problems of Classification (a subcategory of Supervised Learning).

But wait..we are talking Logistic “Regression”. Blame history for it, but the only thing common between Logistic Regression and Regression is the word itself.

Logistic Regression Intuition

Consider a problem where you have to find the probability of a student being a studious one or a goofy one.
What can we do?
Data... the key to every solution... yeah.

So, we grade our test students on a certain basis. Let's just say we rate them out of 10 on the following points – study hrs/day (x1), attention in class (x2), interaction in class (x3), behaviour with peers (x4). Well, for the sake of simplicity, let's just take these 4 features.

Representing this in a matrix

[Figure: feature matrix X – one row per student, columns x1 to x4]

Now, you might be thinking – Hey, I have seen probability in my mathematics class and I know it always lies in the range [0, 1].

Hold your horses, my friend. We are getting to that very step.

Activation Functions

The world of ML has taken many inputs from the field of Mathematics, and perhaps this very concept of activation functions is taken entirely from the latter.

A function used to transform the activation level of a unit(neuron) into an output signal is called an Activation Function.

Well, we're going a little off-topic here. But you can think of an activation function as a function which provides us with the probability of our test being positive. In this case, it gives us the probability of a student being studious.

The function we’ll be using here is the Sigmoid function.

sigmoid(z) = \frac{1}{1 + e^{-z}}

Let's just have a look at the graph of the function.

[Figure: graph of the sigmoid function]

‘z’ on x-axis vs. sigmoid(z) on y-axis

The graph clearly depicts that for any value of z, the sigmoid function will return a value between 0 and 1... MIND == BLOWN ("==" because for programmers "=" != "==").

So what’s left?
The only thing left for us to do is to define a mapping from our test data to z in sigmoid(z) and “minimize the error” in the mapping to get the best result.

“Minimize the error”..hmm..we have done something similar in Linear Regression too. Gradient Descent is the key.

So what’s our hypothesis? It will be nothing but

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

We’ll calculate these coefficients by minimizing our cost function.

The very basic step of Gradient Descent is to find a Cost Function. I know all these functions are getting on your nerves, so let's just depict them using a flow chart. Stick with me and we'll make it easy.

[Figure: flow chart depicting Logistic Regression]

Our Cost function here will be :

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]

Don't worry, you don't have to memorise it. But let's just understand how this cost function behaves. Consider our test data: when y = 1, the cost function is -log(h(theta)). The graph for the same is:

[Figure: graph of -log(h(theta))]

h(theta) on x-axis vs. -log(h(theta)) on y-axis

This shows that as the value of the calculated hypothesis goes from 0 to 1 (the required value), our cost function decreases.

Now, consider when y = 0: the cost function is -log(1-h(theta)). The graph for the same is:

[Figure: graph of -log(1-h(theta))]

h(theta) on x-axis vs. -log(1-h(theta)) on y-axis

This shows that as the value of the calculated hypothesis goes from 1 to 0 (the required value), our cost function decreases. Pretty much what's required.

Now, as we are familiar with our cost function, let's just recall how Gradient Descent works:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Simultaneously update every theta, with alpha being the learning rate.

Hmmmm... partial differentiation and our apparent hide-and-seek with it... Let me make your task easy:

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}

So, now we know how to get our coefficients tuned and how to run our gradient descent.

What about making predictions?
Well, that's easy! A student with a higher probability of being studious is, of course, more studious. But how will I compute it? Deciding a threshold is up to you. For me, a student with a probability greater than or equal to 0.5 works just fine. I am a little lenient, I know. 😉
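
In code, that decision is a one-liner (a sketch; the 0.5 threshold is my lenient choice from above):

def classify(probability, threshold=0.5):
    # studious (1) if the sigmoid output crosses the threshold, goofy (0) otherwise
    return 1 if probability >= threshold else 0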

So, now you have it. Every tiny detail of logistic regression.

Now I’ve a task for you. I’ll be providing you with a dataset and you have to apply logistic regression on your own. No worries though, my next post will explain my way of logistic regression on the same dataset.

Explanation of the dataset: The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’. ‘admit’ shows whether a person with the given ‘gpa’ and ‘gre’ scores is admitted (1) or not (0) to the college of the corresponding ‘rank’. Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to predict whether a person will be admitted to college or not.

Linear Regression – Giving Best Fit Line

In my last post, I gave you an intuition of the types of Machine Learning problems. Here, we'll be undertaking the problem of regression (or real-value fitting), which is a sub-category of Supervised Machine Learning.

But first things first!
We'll be using Python for the implementation along with its open-source libraries – NumPy, Pandas, Matplotlib and Scikit-learn. So, knowledge of basic Python syntax is a must. At the end of the post, there is a list of resources to help you out.

So, here comes the time of a deep dive…!

What is Linear Regression?

Linear Regression is the problem of fitting the best possible line to the data. Woooaahh!!

Let us go through it step by step…

Why..?
Because we (Programmers) break down our work into smaller, simpler pieces
Step 1: Collect data
Step 2: Find a mapping from input data to labels, which gives minimum error… Magic of Mathematics!!
Step 3: Use this mapping to find labels for new or unseen data-sets.

Wait! Don't leave. I know it's MATHS, but we'll make it easy.

How to find the Line of Best Fit?

A line that fits the data in the best possible manner will be the line that has the minimum error. So, here is our objective – to minimize the error between the actual labels and the calculated labels. In order to accomplish it, we'll be using one of the most widely used algorithms in the world of ML – Gradient Descent.

So lets just dive right into the coding part:

Step 1: Collecting Data

For this example, we'll be using NumPy to read data from a CSV file – "data.csv". The file contains data comparing the number of hours students studied and the marks they scored.


def run():
 #Step 1 - Collect the data
 points = np.genfromtxt('data.csv', delimiter = ',')

Step 2: Finding The mapping

Our goal is to find a mapping between the hours studied and marks obtained, i.e.

y = mx + b

where,
x = amount of hours studied
y = marks obtained
m = slope of the line
b = y intercept

Well, this step has further sub-steps..

Step 2.1: Computing Error

To compute the error at each iteration, we'll be using the following cost function (or objective function), the mean squared error:

Error(m, b) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (m x_i + b)\right)^2


def compute_error_for_line_given_points(b, m, points):
 total_error = 0
 for ip in range(len(points)):
  x = points[ip, 0]
  y = points[ip, 1]
  total_error += (y - (m * x + b )) ** 2
 return total_error / float(len(points))

Looking at the mapping, it is clear that the only parts we can play around with to reduce the error are m (the slope of the line) and b (the y-intercept).

Step 2.2: Setting Hyper-parameters

In ML, we have certain parameters to tune our model over the data we are analysing. These parameters are known as Hyper-parameters, and they define certain terms that make sure our function (or model) fits the data at a particular rate and in a particular way. There is a whole bunch of hyper-parameters, each with its own importance. More on this in a future post.

In our case, the hyper-parameters that we must take into consideration are the Learning Rate, the initial value of 'b', the initial value of 'm', and the number of iterations. The Learning Rate defines how fast our model should converge; by converge, we mean how fast the model reaches the global optimum (in this case, the global minimum).


def run():
 #Step 1 - Collect the data
 points = np.genfromtxt('data.csv', delimiter = ',')

 #Step 2 - define hyperparameters
 learning_rate = 0.0001
 initial_b = 0
 initial_m = 0
 num_iterations = 1000

Well, then why isn't the Learning Rate big, like a million?

The main reason is that there must be a balance. If the learning rate is too big, our error function might not converge (decrease), but if it's too small, our error function might take too long to converge.

Step 2.3: Perform Gradient Descent

In this step, we will modify the value of the y-intercept 'b' and the slope of the line 'm' to decrease the value of the error (or objective function).

To do this, we will subtract the partial derivative of our objective function with respect to b and m from the former values of b and m respectively.


#Computes new values of b and m for a given iteration of gradient_descent_runner
def step_gradient(points, b_current, m_current, learning_rate):
 b_gradient = 0
 m_gradient = 0
 N = float(len(points))
 for i in range(len(points)):
  x = points[i, 0]
  y = points[i, 1]
  b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
  m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
 new_b = b_current - learning_rate * b_gradient
 new_m = m_current - learning_rate * m_gradient
 return [new_b, new_m]

#Finds the best fit line to the data.
def gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations):
 b = initial_b
 m = initial_m
 for i in range(num_iterations):
  b, m = step_gradient(points, b, m, learning_rate)
 return [b, m]

 

Step 3: Visualizing

Here, we will use Matplotlib – an open source library to plot graph of our mapping over the scatter plot of data.


def run():
 #Step 1 - Collect the data
 points = np.genfromtxt('data.csv', delimiter = ',')

 #Step 2 - define hyperparameters
 learning_rate = 0.0001
 initial_b = 0
 initial_m = 0
 num_iterations = 1000

 #Step 3 - train our model
 print('Starting gradient descent at b = {0}, m = {1}, error = {2}'
 .format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
 b, m = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)

 print('Ending gradient descent at Iteration = {0} b = {1}, m = {2}, error = {3}'
 .format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

 #Step 4 - Visualization
 x = [ix[0] for ix in points]
 y = [iy[1] for iy in points]
 y_predict = [m * ix + b for ix in x]

 plt.figure(0)
 plt.scatter(x, y)
 plt.plot(x, y_predict)
 plt.show()
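
Note that the snippets above use np and plt without ever showing the imports; to run everything as a single script, you would put these at the top of the file and call run() at the bottom:

import numpy as np
import matplotlib.pyplot as plt

if __name__ == '__main__':
    run()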

 

[Output: graph displaying the best-fit line over the data]

To see a visual comparison between decrease in error and line fitting – go to link.

GitHub link to Source Code

Resources for further reading:

Python tutorial
NumPy tutorial
Matplotlib tutorial

PS: To get a good grip on the concept of Linear Regression visit this link.

Categorising Deep Seas of Machine Learning…

In my last post, I gave you a teaser of what "Machine Learning" is. Taking things up a notch, let us dive into the deep seas of Machine Learning (ML).

Deep ML seas (not to be confused with Deep Learning) have variations, curves, biases, data and its diversity. There are different types of ML algorithms to tackle these.

Broadly, these ML algorithms can be divided into 3 categories:

Supervised Learning – To tackle the data which have a known answer.
Unsupervised Learning – To tackle the data which have no known answer.
Reinforcement Learning – To tackle the data based on final outcome.

Supervised Learning

The game of the labeled data resides in this colony of ML..

The term labeled data signifies that for given values of a set of input variables (x) we will have a given value of the output variable (y; the label). The algorithms belonging to this category generally give you a mapping

y = f(x)

which gives the best possible relation between input and output variables.

Think of the algorithm as a student supervised by a teacher. The teacher, being aware of the correct answers, corrects the student (algorithm) on each iteration till the student (algorithm) gives an acceptable level of performance.

Going further down the lane, we can divide supervised learning problems into two sub-categories:

Classification: A problem where output variable is a category or a class, such as “Cancer positive” or “Cancer negative”.
Regression: A problem where output variable is a real value, such as “rupees” or “weight”.

We’ll further look into these deep corners of supervised learning in upcoming posts.

Unsupervised Learning

Have cluttered, unlabeled data – we've got you covered..!!

Unsupervised Learning problems are where you have input data but no corresponding output.

The algorithms covering this domain have the goal of modelling the underlying data into meaningful structures in order to learn more about the data gathered.

Consider these algorithms as those innovative students who don't have a supervisor (teacher) to guide them. They are left on their own with the data, to fiddle with it and give it a structure as they like. "...Mischievous!!"

Unsupervised Learning problems can further be divided into two sub-categories:

Clustering: A problem to find the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: A problem to discover associations between large portions of data, such as people who buy X also tending to buy Y.

We’ll cover these sub-categories in the future posts, tackling one at a time.

Reinforcement Learning

Greed for reward is the key..!!

In Reinforcement Learning problems, a software agent adapts in an environment so as to maximise its rewards.

To understand it, consider teaching a dog a new trick – there is no way to tell a dog “what to do!”, but you can reward or punish it depending on it being right or wrong.

These algorithms can, in a similar manner, train computers to perform complex tasks, such as playing chess. In Reinforcement Learning problems, if the modelling of the problem is well handled, some algorithms can even converge to the global optimum (that is, to the ideal behavior that maximises reward).

While we are talking of greed... let me give you a greedy motivation to give your best in this field and keep going.

A breakthrough in machine learning would be worth ten Microsofts

-Bill Gates

Machine Learning – A Dive into the Mystery

Machine Learning – you have heard about it, you have seen it in action, but you are still confused about how it works. The fact that you are here shows your interest and curiosity about what all the buzz is.

A formal definition

Tom Mitchell, in his book Machine Learning, gives a slightly informal definition in the opening line of the preface:

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.

I like this definition as it gives a simple understanding of what our goal is while developing such computer programs. Going a little to the formal side, Mitchell gave a definition in the introduction which you will see repeatedly in every Machine Learning introduction article:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

This formalism kind of creeps out people reading the definition. But don't let it scare you off, as this definition can help you out in a way no other machine learning definition can. We can use it as a basis, put E, T and P at the top of a table, and list out complex problems with less ambiguity. Take it as a pattern to avoid a narrow approach and to think over what data to collect (E), what decisions the program needs to make (T) and how we will evaluate its results (P). This power of resolving the ambiguity in a complex problem is like a super power for the already strongest people on the planet – PROGRAMMERS.

Now, let me answer the most awaited question:

What’s all the buzz about Machine Learning?

As I already said, the strongest people on the planet are PROGRAMMERS, and why so? Because there isn't any industry left that doesn't require its own IT department to develop software for its business to grow, or to handle loads of data effectively. But there is still one kind of software that is eating up the industry more rapidly than the other types, and that is self-improving software (in short, software working on the concept of Machine Learning). Various industry products, like Google's messaging app Allo, Amazon's product recommendations, and Netflix suggesting which movies you will love to watch, plus many more (check out this YouTube link), make use of machine learning to give their users a product that almost seems to have a self-consciousness.

Being a programmer myself, I can appreciate how these high-level definitions and this formalism can take their sweet time to sink in. So let's just turn to the thing we are best at, and give this formalism a programmatic approach.

Programmatic Approach

In real-world scenarios, you will find complex problems which show that it is not feasible to write every single if-statement for the instructions you need to cover. Let us take the example most commonly used to give a glimpse of the idea of machine learning – spam email detection. How would you write a program to filter incoming emails into an inbox folder and a spam folder?

A general programmer's approach would be to collect a number of examples and then find patterns in the emails that are spam and those that are not. Most commonly, you'd abstract these patterns in the emails and design heuristics which would work on new emails in the future as well. You'd go for crafty things for edge cases and would try to bring up the accuracy.

This manually derived, hard-coded program will only be as good as the programmer's ability to understand the vital differences between spam and non-spam emails. And the thing that will finally haunt you is maintenance nightmares.

I know the programmer inside you must be shouting at this point – “Automation! Automation!“.

Considering the above example in terms of machine learning:
Examples (E) are the emails we have collected.
Task (T) is a decision problem (classification) – deciding whether an email is spam or not and placing it in the correct folder.
Performance (P) is an accuracy measure, such as a percentage.
Then applying a certain machine learning algorithm (to be discussed in upcoming articles) to get a model (heuristics) that works on new examples is the basic approach to automation.

Some terminology used regularly in machine learning:
Preparing a decision program is called training.
The collected examples are called the training set.
The program is referred to as a model.

Now arises the biggest of all the questions you have in your mind:

Where to get started?

I know there are people out there with different preferences. Some prefer books over videos while others prefer video tutorials over books. So, I have made a list of things that can get you started.

Books:
Machine Learning by Mitchell
The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani and Friedman
Pattern Recognition and Machine Learning by Bishop
Machine Learning: An Algorithmic Perspective by Marsland.

Video Tutorials:
Coursera Machine Learning by Andrew Ng
Intro To Machine Learning by Udacity
Siraj Raval YouTube channel