Logistic Regression – Let’s Classify Things..!!

In my post on Categorising Deep Seas of ML, I introduced you to problems of Classification (a subcategory of Supervised Learning).

But wait.. we are talking about Logistic “Regression”. Blame history for the name, but the only thing Logistic Regression and Regression have in common is the word itself.

Logistic Regression Intuition

Consider a problem where you have to find the probability of a student being a studious one or a goofy one.
What can we do?
Data..key to every solution..yeah

So, we grade our test students on a few criteria. Let’s just say we rate them out of 10 on the following points – study hrs/day (x1), attention in class (x2), interaction in class (x3), behaviour with peers (x4). Well, for the sake of simplicity, let’s just take these 4 features.

Representing this in a matrix, with one row per student and one column per feature:

$$X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)} \\ \vdots & \vdots & \vdots & \vdots \\ x_1^{(m)} & x_2^{(m)} & x_3^{(m)} & x_4^{(m)} \end{bmatrix}$$

where $m$ is the number of students and $x_j^{(i)}$ is feature $j$ of student $i$.
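To make this concrete, here is a tiny NumPy sketch of such a feature matrix (the scores below are made up purely for illustration):

```python
import numpy as np

# Each row is one student; columns are x1..x4 (study hrs/day, attention in class,
# interaction in class, behaviour with peers), each rated out of 10.
X = np.array([
    [8, 7, 6, 9],   # student 1
    [2, 3, 5, 4],   # student 2
    [6, 8, 7, 7],   # student 3
])
print(X.shape)  # (3, 4) -> 3 students, 4 features
```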

Now, you might be thinking – Hey, I have seen probability in my mathematics class and I know it always lies in the range [0, 1].

Hold your horses, my friend. We are getting to that very step.

Activation Functions

The world of ML borrows heavily from the field of Mathematics, and this very idea of an activation function is taken almost entirely from it.

A function used to transform the activation level of a unit (neuron) into an output signal is called an Activation Function.

Well, we’re drifting a little off topic here. But you can think of an activation function as a function which gives us the probability of our test being positive. In this case, it gives us the probability of a student being studious.

The function we’ll be using here is the Sigmoid function.

$$\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$

Let’s just have a look at the graph of the function.

[Graph: ‘z’ on the x-axis vs. sigmoid(z) on the y-axis; an S-shaped curve squashed between 0 and 1.]

The graph clearly shows that for any value of z, the sigmoid function returns a value between 0 and 1….MIND == BLOWN.. (“==” because for Programmers “=” != “==”).
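If you want to verify the squashing yourself, here is a minimal sketch of the sigmoid in NumPy:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of them) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(6))    # ~0.9975, saturating towards 1
print(sigmoid(-6))   # ~0.0025, saturating towards 0
```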

So what’s left?
The only thing left for us to do is to define a mapping from our test data to z in sigmoid(z) and “minimize the error” in the mapping to get the best result.

“Minimize the error”..hmm..we have done something similar in Linear Regression too. Gradient Descent is the key.

So what’s our hypothesis? It will be nothing but

$$h_\theta(x) = \mathrm{sigmoid}(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$$

We’ll calculate these coefficients by minimizing our cost function.
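In code, the hypothesis is just the sigmoid applied to a linear combination of the features. A sketch, reusing the sigmoid defined above and assuming X carries a leading column of ones so that theta_0 acts as the intercept:

```python
def hypothesis(theta, X):
    """Predicted probability for each row of X.
    Assumes X includes a leading column of ones for the intercept theta_0."""
    return sigmoid(X @ theta)
```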

The very basic step of Gradient Descent is to find a Cost Function. I know all these functions are getting on your nerves, so let’s just depict the whole process with a flow chart. Stick with me and we’ll make it easy.

[Flow chart depicting Logistic Regression.]

Our cost function here will be:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

Don’t worry, you don’t have to memorise it. But let’s just understand how this cost function behaves. Consider a training example with y = 1; its cost is then $-\log h_\theta(x)$. The graph for the same is:

[Graph: $h_\theta(x)$ on the x-axis vs. $-\log h_\theta(x)$ on the y-axis; the curve falls from very large values near 0 down to 0 at $h_\theta(x) = 1$.]

This shows that as the value of the calculated hypothesis goes from 0 to 1 (the required value), our cost decreases.

Now, consider when y = 0; then the cost is $-\log(1 - h_\theta(x))$. The graph for the same is:

[Graph: $h_\theta(x)$ on the x-axis vs. $-\log(1 - h_\theta(x))$ on the y-axis; the curve starts at 0 for $h_\theta(x) = 0$ and blows up as $h_\theta(x)$ approaches 1.]

This shows that as the value of the calculated hypothesis goes from 1 to 0 (the required value), our cost decreases. Pretty much what we required.
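A direct translation of the cost function into NumPy, reusing the hypothesis from above (the tiny epsilon guards against log(0) in floating point; it is an implementation detail, not part of the formula):

```python
def cost(theta, X, y):
    """Average cross-entropy cost J(theta) over the m training examples."""
    m = len(y)
    h = hypothesis(theta, X)
    eps = 1e-12  # keeps log() finite when h hits exactly 0 or 1
    return -(1.0 / m) * np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```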

Now that we are familiar with our cost function, let’s just recall how Gradient Descent works.

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Simultaneously update every $\theta_j$, with $\alpha$ being the learning rate.

Hmmmm.. partial differentiation and our apparent game of hide and seek with it.. Let me make your task easy. Working out the partial derivative of $J(\theta)$ turns the update rule into:

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

So, now we know how to get our coefficients tuned and how to run our gradient descent.
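Putting the update rule into a loop gives batch gradient descent. A sketch building on the functions above; the values for alpha and the iteration count are illustrative defaults, not tuned ones:

```python
def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Tune theta by repeatedly stepping against the gradient of the cost."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = hypothesis(theta, X)
        gradient = (X.T @ (h - y)) / m    # vectorised form of the sum above
        theta = theta - alpha * gradient  # simultaneous update of every theta_j
    return theta
```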

What about making predictions?
Well, that’s easy! A student with a higher probability of being studious is, of course, more studious. But how do we turn a probability into a label? Deciding on a threshold is up to you. For me, a probability greater than or equal to 0.5 works just fine. I am a little lenient, I know. 😉
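Turning probabilities into labels is then a one-liner, with the threshold defaulting to my lenient 0.5:

```python
def predict(theta, X, threshold=0.5):
    """Label a student 1 (studious) when the predicted probability clears the threshold."""
    return (hypothesis(theta, X) >= threshold).astype(int)
```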

So, now you have it. Every tiny detail of logistic regression.

Now I’ve a task for you. I’ll be providing you with a dataset and you have to apply logistic regression on your own. No worries though, my next post will explain my way of logistic regression on the same dataset.

Explanation of dataset: The provided dataset contains 4 columns, namely ‘admit’, ‘rank’, ‘gpa’ and ‘gre’. For a person with a given ‘gre’ score, ‘gpa’ and college ‘rank’, the ‘admit’ column shows whether they were admitted to the college (1) or not (0). Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’, so as to predict whether a person will be admitted to college or not.
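To get you started, here is one possible way to load the dataset and wire up the pieces from this post. The filename ‘admissions.csv’ is a placeholder for whatever the provided file is called, and the snippet assumes the sigmoid, hypothesis, cost, gradient_descent and predict functions sketched above are already defined. Scaling the features first helps one learning rate suit all of them:

```python
import numpy as np
import pandas as pd

data = pd.read_csv('admissions.csv')  # placeholder filename

X = data[['gre', 'gpa', 'rank']].to_numpy(dtype=float)
y = data['admit'].to_numpy(dtype=float)

# Scale features to zero mean and unit variance so gradient descent converges smoothly.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Leading column of ones for the intercept theta_0, as assumed above.
X = np.hstack([np.ones((X.shape[0], 1)), X])

theta = gradient_descent(X, y, alpha=0.1, iterations=5000)
print(cost(theta, X, y))        # final training cost
print(predict(theta, X)[:10])   # first ten predictions
```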


3 comments

  1. Abhishek Pathak · March 23

    The flow chart actually cleared out things.
    Is Sigmoid function here the actual machine learning step, because it feasts upon the final theta value and unseen data and gives the prediction?


    • piyush2804 · March 23

      Well ML is not just about final value..it’s about how u approach your data..Sigmoid here will give you a probability of how much positive your result is..you have to make a balanced choice of your threshold..which I took here to be 0.5.. More about choosing a threshold and balancing between accuracy and recall will be in future post..remember that threshold matters a lot here


  2. Pingback: Logistic Regression – Hands on Experience | codeflaunt
