Logistic Regression – Hands on Experience

In my last post, I gave you a theoretical knowledge of how Logistic Regression works. Along with the intuition, I provided you with a dataset to apply the theoretical knowledge on your own at first. In this post, I’ll explain you my approach to get a working model for the dataset I provided.

Let us first revist and have a deeper look into our dataset..because daahh..data is what matters the most for making predictions (Well, that’s a topic for another day).

Dataset Explained.

The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’.

‘admit’ – represents whether a given student is provided with admit to the college(1) or not(0).
‘rank’ – represents the ranking of the college
‘gpa’ (or Grade Point Average) – represents the GPA of student in his previous academics.
‘gre’ (or Graduate Record Examination) – represents marks of student in the GRE exam.

Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to find whether a person will be admitted to the college or not.

If you haven’t yet read my last post, I recommend doing it before going ahead.

I strongly recommend to try to code it out yourself before looking at my solution.

Libraries used

We’ll be using Pandas for data extraction and NumPy for matrix operations. So, make sure you have a little knowledge of the stuff. Although, I’ll try my best to explain all the minor details.

So, grab your cup of coffee and follow along. It is going to be a little of a long ride.

Step 1 – Setting Up

We’ll be making a function called run, the main purpose of which will be to set up our data and call all the functions to perform gradient descent.

def run():
    #Collect data
    dataframe = pd.read_csv('binary.csv', sep = ',')
    data = dataframe.as_matrix()
    y = data[:, 0]
    X = data[:, 1:]
    #Scale Data
    X = feature_scaling(X)
    #Step 2 - define hyperparameters
    learning_rate = 3
    num_iters = 1000
    initial_theta = np.random.random(3)
    #train our model
    print('Starting gradient descent at theta = {0}, error = {1}, accuracy = {2}'
    .format(initial_theta, compute_error_for_separator_given_data(initial_theta, X, y), accuracy(initial_theta, X, y)))
    theta = gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters)
    print('Ending gradient descent at Iteration = {0} theta = {1}, error = {2}, accuracy = {3}'
    .format(num_iters, theta, compute_error_for_separator_given_data(theta, X, y), accuracy(theta, X, y)))

In line 3, pd.read_csv() takes in two arguments for our purpose. First file name and second separator used in file (For those of you who don’t know, .csv means comma separated values). The function returns a dataframe. Converting this dataframe to a matrix (as in line 4) makes it easy to operate on data. In next two lines, we separate our labels (‘admit’) from features (‘rank’, ‘gpa’ and ‘gre’).

Rest of the functions will be covered one at a time.

Step 2 – Feature Scaling

From the heading itself its clear that we are targeting the black box we know nothing about at line 8 of run function.

While I’ll be discussing this topic in full depth in a future post, let’s look at what it means to scale the features here. Looking at the data, it is clear that ‘gre’ marks are in hundreds while both other features are in unit place. If this variation in feature is reduced, then we can reach to global optimum more easily and effectively. For this purpose, we will be using what we call Normalization. Doing this ensures that all the values are in the range of 0 and 1.

The formula we’ll be using here is:


def feature_scaling(X):
    for i in range(X.shape[1]):
    X[:, i] = (X[:, i] - min(X[:, i])) / (max(X[:, i]) - min(X[:, i]))
    return X

This could have also been done using NumPy as

def feature_scaling(X):
    return (X - np.min(X, axis = 0)) / (np.max(X, axis = 0) - np.min(X, axis = 0))

Where the second keyword argument, as the name suggests apparently decides which axis to find the minimum or maximum on. Not using it will return minimum or maximum value out of the complete matrix.

Step 3 – Calculating Error

Before Calculating Error we need to define 2 more functions.
1. Sigmoid
2. Predict absolute value

If you read my last post, then you definitely know what sigmoid function is and how we are going to use it here. So, without going into details lets jump to the code.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Here, -z multiplies -1(to be precise, it just flips the sign bit) to each element of z. Then, np.exp(-z) applies element-wise exponential function on negative of z. The addition and division shown here works in element-wise manner.

Predict absolute value
Let us just directly jump to the code.

def predict_abs(theta, X):
    return sigmoid(np.dot(X, theta))

Here, np.dot(X, theta) performs simple matrix multiplication.

Now, since our base work is clear let us make our error computing function. As we already know, we use mean square error formula for this sake, i.e.,


def compute_error_for_separator_given_data(theta, X, y):
    preds = predict_abs(theta, X)
    total_error = np.sum((y - preds) ** 2)
    return total_error / float(len(y))

We’ve scaled our features, we’ve computed the error.The only thing left to do is to run gradient descent on our data with our hyper-parameters.

Step 4 – Gradient Descent

From my last article we know that gradient descent works as:


Ummm.. there is surely a call for separation of things to make the code easier to implement. So we will tackle each step of the Gradient Descent using a function called step_gradient.

def step_gradient(theta_current, X, y, learning_rate):
    preds = predict_abs(theta_current, X)
    theta_gradient = -(2 / len(y)) * np.dot(X.T, (y - preds))
    theta = theta_current - learning_rate * theta_gradient
    return theta

Going line by line, first we compute absolute predictions from given theta values and feature values. Then using them, we compute the partial differentiation of our cost function w.r.t theta and hence in the third line we find the value of new theta for current iteration of Gradient Descent.
Pretty Neat..!

So the only thing left is to iterate over this function a certain number of times as mentioned in our hyper-parameters.

def gradient_descent_runner(X, y, initial_theta, learning_rate, num_iters):
    theta = initial_theta
    for i in range(num_iters):
        theta = step_gradient(theta, X, y, learning_rate)
        return theta

Well, I don’t think this last piece of code needs any explanation though.

So, we are finally here..
We first setup our parameters and data.
Then we setup our probability calculating function – Sigmoid
After that, we hit the rock bottom of calculating the error
In the after math, we calculate new theta values from initial values using Gradient Descent.

Important Links:
Source Code: Github Link
Gradient Descent: Youtube Link
Part 1 of this Article: Logistic Regression – Let’s Classify Things..!!

PS: There might be many implementations far better than this. It will be great to see people raising issues in my repository on github. After all, we all are here to learn. And yeah this article was so delayed I’m sorry for I was really busy in some meetups and examinations. Many new articles coming this month.

Connect with me on:

fb logo   insta-logojpg.jpg   gmail.png


Linear Regression – Giving Best Fit Line

In my last post, I gave you an intuition of types of Machine Learning problems. Here, we’ll be undertaking the problem of regression(or real values fitting) which is a sub-category of Supervised Machine Learning.

But first things first!
We’ll be using Python for implementation along with its open source libraries like – NumPy, Pandas, Matplotlib and Scikit-learn. So, knowledge of basic python syntax is a must. At the end of the post, there is a list of all the resources to help you out.

So, here comes the time of a deep dive…!

What is Linear Regression?

Linear Regression is the problem of fitting the best possible line to the data. Woooaahh!!

Let us go through it step by step…

Because we (Programmers) break down our work into smaller, simpler pieces
Step 1: Collect data
Step 2: Find a mapping from input data to labels, which gives minimum error… Magic of Mathematics!!
Step 3: Use this mapping to find label to new or unseen data-sets.

Wait! Don’t leave I know its MATHS, but we’ll make it easy.

How to find the Line of Best Fit?

A line that will fit the data in the best possible manner will be the line that has the minimum error. So, here is our objective – to minimize the error between the Actual Labels and Calculated Labels. And in order to accomplish it, we’ll be using one of the widely used algorithms in the world of ML – Gradient Descent.

So lets just dive right into the coding part:

Step 1: Collecting Data

For this example, we’ll be using NumPy to read data from an excel file – “data.csv”. The file contains data comparing the amount of hours a student studied and how much he or she scored.

def run():
 #Step 1 - Collect the data
 points = np.genfromtxt('data.csv', delimiter = ',')

Step 2: Finding The mapping

Our goal is to find a mapping between the hours studied and marks obtained, i.e.

y = mx + b

x = amount of hours studied
y = marks obtained
m = slope of the line
b = y intercept

Well, this step has further sub-steps..

Step 2.1: Computing Error

To Compute Error at each iteration, we’ll be using the cost function(or objective function):


def compute_error_for_line_given_points(b, m, points):
 total_error = 0
 for ip in range(len(points)):
  x = points[ip, 0]
  y = points[ip, 1]
  total_error += (y - (m * x + b )) ** 2
 return total_error / float(len(points))

Looking at the mapping, it is clear that the only possible parts we can play around with to reduce the error are- m(the slope of the line) and b(the y intercept).

Step 2.2: Setting Hyper-parameters

In ML, we have certain parameters to tune in our model over the data we are analysing. These parameters are known as Hyper-parameters and they define certain terms that make sure our function(or model) fits data on a particular rate and in a particular way. We have a whole bunch of hyper-parameters with their own importance. More on this will be in a future post.

In our case, the Hyper-parameters that we must take into consideration are the Learning Rate, initial value of ‘b’, initial value of ‘m’ and number of iterations. Learning Rate defines how fast should our model converge. By converge, it means how fast model reaches the global optimum(in this case global minimum).

def run():
 #Step 1 - Collect the data
 points = np.genfromtxt('data.csv', delimiter = ',')

 #Step 2 - define hyperparameters
 learning_rate = 0.0001
 initial_b = 0
 initial_m = 0
 num_iterations = 1000

Well, then why isn’t Learning Rate big like a million?

The main reason being there must be a balance. If learning rate is too big, our error function might not converge(decrease), but if its too small, our error function might take too long to converge.

Step 2.3: Perform Gradient Descent

In this step, we will modify the value of y intercept ‘b’ and slope of line ‘m’ to decrease the value of the error (or objective function).

To do this, we will subtract the partial differentiation of our objective function with respect to b and m from former values of b and m respectively.

#Computes new values of b and m for a given iteration of gradient_descent_runner
def step_gradient(points, b_current, m_current, learning_rate):
 b_gradient = 0
 m_gradient = 0
 N = float(len(points))
 for i in range(len(points)):
  x = points[i, 0]
  y = points[i, 1]
  b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
  m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
 new_b = b_current - learning_rate * b_gradient
 new_m = m_current - learning_rate * m_gradient
 return [new_b, new_m]

#Finds the best fit line to the data.
def gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations):
 b = initial_b
 m = initial_m
 for i in range(num_iterations):
  b, m = step_gradient(points, b, m, learning_rate)
 return [b, m]


Step 3: Visualizing

Here, we will use Matplotlib – an open source library to plot graph of our mapping over the scatter plot of data.

def run():
 #Step 1 - Collect the data
 points = np.genfromtxt('data.csv', delimiter = ',')

 #Step 2 - define hyperparameters
 learning_rate = 0.0001
 initial_b = 0
 initial_m = 0
 num_iterations = 1000

 #Step 3 - train our model
 print('Starting gradient descent at b = {0}, m = {1}, error = {2}'
 .format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
 b, m = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)

 print('Ending gradient descent at Iteration = {0} b = {1}, m = {2}, error = {3}'
 .format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

 #Step 4 - Visualization
 x = [ix[0] for ix in points]
 y = [iy[1] for iy in points]
 y_predict = [m * ix + b for ix in x]

 plt.scatter(x, y)
 plt.plot(x, y_predict)



Output : Graph Displaying the Best Fit line over the Data

To see a visual comparison between decrease in error and line fitting – go to link.

GitHub link to Source Code

Resources for further reading:

Python tutorial
NumPy tutorial
Matplotlib tutorial

PS: To get a good grip on the concept of Linear Regression visit this link.

Logistic Regression – Let’s Classify Things..!!

In my post on Categorising Deep Seas of ML, I introduced you to problems of Classification (a subcategory of Supervised Learning).

But wait..we are talking Logistic “Regression”. Blame history for it, but the only thing common between Logistic Regression and Regression is the word itself.

Logistic Regression Intuition

Consider a problem where you have to find the probability of a student being studious one or goofy one.
What can we do?
Data..key to every solution..yeah

So, we grade our test students on certain basis. Lets just say we rate them out of 10 on following points – study hrs/day(x1), attention in class(x2), interaction in class(x3), behaviour with peers(x4)..well for sake of simplicity lets just take these 4 features.

Representing this in a matrix


Now, you might be thinking – Hey, I have seen probability in my mathematics class and I know it always lies in range [0, 1].

Hold your horse my friend. We are getting to that very step.

Activation Functions

The world of ML has taken many inputs from the field of Mathematics and perhaps this very part of activation function is taken entirely from the latter field.

A function used to transform the activation level of a unit(neuron) into an output signal is called an Activation Function.

Well, we’re going a little off from our topic here. But you can think of activation function as a function which provides us with the probability of our test being positive. In this case, it gives us the probability of a student being studious.

The function we’ll be using here is the Sigmoid function.


Lets just have a look at the graph of the function.


‘z’ on x-axis vs. sigmoid(z) on y-axis

The graph clearly depicts that for any value of z, sigmoid function will return us a value between 0 and 1….MIND == BLOWN.. (“==” because for Programmers “=” != “==”).

So what’s left?
The only thing left for us to do is to define a mapping from our test data to z in sigmoid(z) and “minimize the error” in the mapping to get the best result.

“Minimize the error”..hmm..we have done something similar in Linear Regression too. Gradient Descent is the key.

So what’s our hypothesis? It will be nothing but


We’ll calculate these coefficients by minimizing our cost function.

The very basic step of Gradient descent is to find a Cost Function. I know all these functions are getting on your nerve so lets just depict these by using a flow chart. Stick with me and we’ll make it easy.


Flow Chart Depicting Logistic Regression.

Our Cost function here will be :


Don’t worry you don’t have to memorise it. But lets just understand how this Cost function is implemented. Consider for our test data when y = 1 then our cost function is -log(h(theta)). The graph for the same is:


h(theta) on x-axis vs. -log(h(theta)) on y-axis

This shows that as the value of calculated hypothesis goes from 0 to 1(required value) our cost function decreases.

Now, consider when y = 0 then our cost function is -log(1-h(theta)). The graph for same is:


h(theta) on x-axis vs. -log(1-h(theta)) on y-axis

This shows that as the value of the calculated hypothesis goes from 1 to 0(required value) our cost function decreases. Pretty much what required.

Now, as we are familiar with our cost function lets just remember how Gradient Descent works.


Simultaneously,  update for every Theta. Alpha being the Learning Rate.

Hmmmm..partial differentiation and our apparent hide and seek with it..Let me make your task easy.


So, now we know how to get our coefficients tuned and how to run our gradient descent.

What about making predictions?
Well, that’s easy! A student with higher probability of being a studious one is of course more studious. But how will I compute it?. Deciding a threshold is upto you. For me.. a student with a probability greater than or equal to 0.5 works just fine. I am a little lenient I know. 😉

So, now you have it. Every tiny detail of logistic regression.

Now I’ve a task for you. I’ll be providing you with a dataset and you have to apply logistic regression on your own. No worries though, my next post will explain my way of logistic regression on the same dataset.

Explanation of dataset: The provided dataset contains 4 columns, namely – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’. When given the ‘rank’ of the college then the ‘admit’ shows whether the person is provided with the admit to the college(1) or not(0) provided he has a corresponding ‘gpa’ and ‘gre’ scores. Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’ so as to find whether a person will be admitted to college or not.