Machine Learning – A Dive into the Mystery

Machine Learning – you have heard about it and you have seen it in action, but you are still confused about how it works. The fact that you are here shows your interest and curiosity about what all the buzz is.

A formal definition

Tom Mitchell, in his book Machine Learning, gives a slightly informal definition in the opening line of the preface:

The field of machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience.

I like this definition as it gives a simple understanding of what our goal is while developing these computer programs. Going a little more formal, Mitchell gives a definition in the introduction that you will see repeated in almost every Machine Learning introduction article:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

This formalism tends to creep out people reading the definition. But don’t let it scare you off, as this definition can help you in a way no other machine learning definition can. We can use it as a basis: put E, T and P at the top of a table and list out complex problems with less ambiguity. It works as a pattern to avoid a narrow approach and think through what data to collect (E), what decisions the program needs to make (T) and how we will evaluate its results (P). This power of resolving the ambiguity in a complex problem is like a superpower for the already strongest people on the planet – PROGRAMMERS.

Now, let me answer the most awaited question:

What’s all the buzz about Machine Learning?

As I already said, the strongest people on the planet are PROGRAMMERS. Why so? Because there is hardly an industry left that doesn’t need its own IT department to develop software that grows the business or handles loads of data effectively. But one kind of software is eating up the industry more rapidly than any other: self-improving software – in short, software built on the concept of Machine Learning. Industry products like Google’s messaging app Allo, Amazon’s product recommendations, Netflix’s suggestions of movies you will love to watch, and many more (check out this YouTube link) all use machine learning to give their users a product that almost seems self-aware.

Being a programmer myself, I can appreciate how these high-level definitions and formalism can take their sweet time to sink in. So let’s turn to the thing we do best and give this formalism a programmatic approach.

Programmatic Approach

In real-world scenarios you will find complex problems which show that it is not feasible to write every single if-statement to cover all the cases you need. Let us take the example most commonly used to give a glimpse of the idea of machine learning – spam email detection. How would you write a program to sort incoming emails into the inbox folder and the spam folder?

A typical programmer’s approach would be to collect a number of example emails and then find patterns that separate the spam from the rest. Most commonly you’d abstract these patterns into heuristics that should work on new emails in the future as well. You’d get crafty with the edge cases and try to push up the accuracy.

This manually derived, hard-coded program will only be as good as the programmer’s ability to understand the vital differences between spam and non-spam emails. And the thing that will finally haunt you is the maintenance nightmare.
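To see why, here is what such a hand-coded filter looks like in practice – a minimal sketch, with a keyword list and thresholds invented purely for illustration. Every new spam trick means yet another rule bolted on:

```python
# A hand-coded heuristic spam filter. Every rule below is a programmer's
# guess; the keyword list and thresholds are made up for illustration.

SPAM_WORDS = {"winner", "free", "prize", "lottery", "claim"}

def is_spam(email: str) -> bool:
    words = email.lower().split()
    # Count how many known "spammy" words appear
    hits = sum(1 for w in words if w.strip(".,!?") in SPAM_WORDS)
    # Another ad-hoc rule: an all-caps email looks like shouting
    shouting = email.isupper()
    return hits >= 2 or (hits >= 1 and shouting)

print(is_spam("You are a WINNER, claim your FREE prize now"))  # True
print(is_spam("Meeting moved to 3pm, see agenda attached"))    # False
```

Each new spam pattern forces a new rule, and the rules start to interact – which is exactly the maintenance nightmare described above.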

I know the programmer inside you must be shouting at this point – “Automation! Automation!”

Considering the above example in terms of machine learning:
The examples (E) are the emails we have collected.
The task (T) is a decision problem (classification) – deciding whether an email is spam or not and placing it in the correct folder.
The performance measure (P) will be accuracy, for instance as a percentage.
Then applying a machine learning algorithm (to be discussed in upcoming articles) to obtain a model (the heuristics) that works on new examples is the basic approach to automation.

Some terms that are used regularly in machine learning:
Preparing the decision program is called training.
The collected examples are called the training set.
The program itself is referred to as a model.
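Putting the spam example and these terms together, here is a minimal sketch of the automated route – the tiny labeled dataset and the word-counting model are invented for illustration, not a production classifier:

```python
from collections import Counter

# E: the training set - labeled example emails (invented for illustration)
training_set = [
    ("win free prize money now", "spam"),
    ("free lottery winner claim prize", "spam"),
    ("project meeting tomorrow at noon", "ham"),
    ("please review the attached report", "ham"),
]

# Training: build a word-frequency table per class - this table is our model
model = {"spam": Counter(), "ham": Counter()}
for text, label in training_set:
    model[label].update(text.split())

# T: the task - classify a new email by which class its words fit better
def classify(text):
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in model.items()}
    return max(scores, key=scores.get)

# P: the performance measure - accuracy on a held-out test set
test_set = [("claim your free prize", "spam"),
            ("meeting report attached", "ham")]
correct = sum(classify(text) == label for text, label in test_set)
print(f"accuracy: {correct / len(test_set):.0%}")  # accuracy: 100%
```

Notice that no spam rule was written by hand: the “heuristics” fell out of the training set, and feeding in more examples improves the model instead of growing a pile of if-statements.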

Now arises the biggest of all the questions you have in your mind:

Where to get started?

I know people out there have different preferences: some prefer books over videos, while others prefer video tutorials over books. So I have made a list of resources that can get you started.

Books:
Machine Learning by Mitchell
The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie, Tibshirani and Friedman
Pattern Recognition and Machine Learning by Bishop
Machine Learning: An Algorithmic Perspective by Marsland

Video Tutorials:
Coursera Machine Learning by Andrew Ng
Intro To Machine Learning by Udacity
Siraj Raval YouTube channel


Logistic Regression – Let’s Classify Things!

In my post on Categorising Deep Seas of ML, I introduced you to problems of Classification (a subcategory of Supervised Learning).

But wait – we are talking about Logistic “Regression”. Blame history for the name: the only thing Logistic Regression and Regression have in common is the word itself.

Logistic Regression Intuition

Consider a problem where you have to find the probability of a student being a studious one or a goofy one.
What can we do?
Data – the key to every solution.

So, we grade our test students on certain criteria. Let’s say we rate them out of 10 on the following points – study hours/day (x1), attention in class (x2), interaction in class (x3), behaviour with peers (x4). For the sake of simplicity, let’s just take these 4 features.

Representing this as a feature vector:

x = [x1, x2, x3, x4]
Now, you might be thinking – hey, I have seen probability in my mathematics class and I know it always lies in the range [0, 1].

Hold your horses, my friend. We are getting to that very step.

Activation Functions

The world of ML has borrowed a lot from the field of mathematics, and activation functions are perhaps taken entirely from it.

A function used to transform the activation level of a unit(neuron) into an output signal is called an Activation Function.

Well, we’re drifting a little off topic here. But you can think of an activation function as a function which gives us the probability of our test being positive – in this case, the probability of a student being studious.

The function we’ll be using here is the Sigmoid function:

sigmoid(z) = 1 / (1 + e^(-z))
Let’s just have a look at the graph of the function.

[Graph: ‘z’ on the x-axis vs. sigmoid(z) on the y-axis – an S-shaped curve rising from 0 to 1, passing through 0.5 at z = 0]

The graph clearly shows that for any value of z, the sigmoid function returns a value between 0 and 1…. MIND == BLOWN (“==” because for programmers “=” != “==”).
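You can verify this squashing behaviour in a few lines of Python – a direct translation of the formula above:

```python
import math

def sigmoid(z):
    """Squash any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-10, -1, 0, 1, 10):
    print(f"sigmoid({z:3}) = {sigmoid(z):.4f}")
# sigmoid(-10) = 0.0000
# sigmoid( -1) = 0.2689
# sigmoid(  0) = 0.5000
# sigmoid(  1) = 0.7311
# sigmoid( 10) = 1.0000
```

However large or small z gets, the output only ever approaches 0 or 1 – exactly the property we need for a probability.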

So what’s left?
The only thing left for us to do is to define a mapping from our data to the z in sigmoid(z), and “minimize the error” in that mapping to get the best result.

“Minimize the error” – hmm, we did something similar in Linear Regression too. Gradient Descent is the key.

So what’s our hypothesis? It will be nothing but:

h_theta(x) = sigmoid(theta^T · x) = 1 / (1 + e^(-(theta0 + theta1·x1 + theta2·x2 + theta3·x3 + theta4·x4)))
We’ll calculate these coefficients by minimizing our cost function.

The very first step of Gradient Descent is to find a Cost Function. I know all these functions are getting on your nerves, so let’s just depict them with a flow chart. Stick with me and we’ll make it easy.

[Flow chart depicting Logistic Regression: features x → hypothesis h_theta(x) = sigmoid(theta^T · x) → cost function J(theta) → gradient descent updates the thetas]

Our cost function here will be:

J(theta) = -(1/m) · Σ [ y · log(h_theta(x)) + (1 - y) · log(1 - h_theta(x)) ]
Don’t worry, you don’t have to memorise it. But let’s just understand what this cost function does. For a training example with y = 1, the cost reduces to -log(h_theta(x)). The graph for it:

[Graph: h_theta(x) on the x-axis vs. -log(h_theta(x)) on the y-axis – the cost falls from infinity at h = 0 to zero at h = 1]

This shows that as the value of the calculated hypothesis goes from 0 to 1 (the required value), our cost function decreases.

Now, consider when y = 0; the cost reduces to -log(1 - h_theta(x)). The graph for it:

[Graph: h_theta(x) on the x-axis vs. -log(1 - h_theta(x)) on the y-axis – the cost is zero at h = 0 and rises to infinity at h = 1]

This shows that as the value of the calculated hypothesis goes from 1 to 0 (the required value), our cost function decreases. Pretty much what’s required.

Now that we are familiar with our cost function, let’s recall how Gradient Descent works:

repeat until convergence:
    theta_j := theta_j - alpha · ∂J(theta)/∂theta_j

Update every theta_j simultaneously, alpha being the learning rate.

Hmm, partial differentiation and our apparent game of hide and seek with it. Let me make your task easy – the partial derivative works out to:

∂J(theta)/∂theta_j = (1/m) · Σ (h_theta(x^(i)) - y^(i)) · x_j^(i)
So, now we know how to get our coefficients tuned and how to run our gradient descent.

What about making predictions?
Well, that’s easy! A student with a higher probability of being studious is, of course, more studious. But how do I compute it? Deciding a threshold is up to you. For me, a student with a probability greater than or equal to 0.5 works just fine. I am a little lenient, I know. 😉
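The whole pipeline above – hypothesis, cost gradient, simultaneous updates, and a 0.5 threshold – can be sketched in plain Python. The toy dataset (a single study-hours feature) and the learning settings are invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set (invented): each row is [1, study_hours], with a
# leading 1 for the intercept theta0; y = 1 means studious, 0 means goofy.
X = [[1, 1], [1, 2], [1, 3], [1, 7], [1, 8], [1, 9]]
y = [0, 0, 0, 1, 1, 1]

def hypothesis(theta, x):
    # h_theta(x) = sigmoid(theta^T . x)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Partial derivative of the cost: (1/m) * sum((h - y) * x_j)
        grads = [0.0] * n
        for x_i, y_i in zip(X, y):
            err = hypothesis(theta, x_i) - y_i
            for j in range(n):
                grads[j] += err * x_i[j] / m
        # Simultaneous update of every theta_j
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta

theta = gradient_descent(X, y)

def predict(x, threshold=0.5):
    # A probability >= 0.5 counts as studious
    return 1 if hypothesis(theta, x) >= threshold else 0

print(predict([1, 2]))   # 0 - few study hours -> goofy
print(predict([1, 8]))   # 1 - many study hours -> studious
```

With batch gradient descent the decision boundary settles between the two groups of students, so low study hours map below the 0.5 threshold and high study hours map above it.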

So, there you have it – every tiny detail of logistic regression.

Now I have a task for you. I’ll provide you with a dataset and you have to apply logistic regression on your own. No worries though – my next post will explain my way of doing logistic regression on the same dataset.

Explanation of the dataset: it contains 4 columns – ‘admit’, ‘rank’, ‘gpa’ and ‘gre’. For a given college ‘rank’ and a candidate’s corresponding ‘gpa’ and ‘gre’ scores, ‘admit’ shows whether the person was admitted to the college (1) or not (0). Your task is to find a mapping from ‘rank’, ‘gre’ and ‘gpa’ to ‘admit’, so as to predict whether a person will be admitted to college.