Since publishing this blog post I’ve started a whole website dedicated to teaching machine learning concepts and giving people hands-on experience with data: databriefing.com. If you enjoy this article be sure to check out DataBriefing and to sign up for the newsletter there.
Are you interested in machine learning? Many people I know are, but they never start learning machine learning (ML) on their own because taking the first steps in such a vast subject feels daunting.
In this guide I’ll take you by the hand and show you one possible path through the land of machine learning that, in my experience, is very effective.
As long as you have a bit of programming and math background, the basics are actually easy to learn. Getting some real-life practice is simple too, thanks to a combination of excellent online courses, open-source libraries, documentation, and freely available data.
Do you worry about your hardware? You won’t need a fancy data center to work on interesting challenges; a standard laptop will be enough.
One small warning: depending on your OS, setting up the tools and libraries can be tricky. But there’s Stack Overflow, so nothing a bit of googling won’t fix.
Of course it helps to have some prior knowledge. Machine learning sits at the intersection of computer science and statistics, so background in either helps. If you don’t know anything about statistics, try Head First Statistics. It’s an easy (albeit superficial) intro to the subject.
Andrew Ng’s Machine Learning course on Coursera is probably the best place to start. You’ll learn fundamental concepts and powerful algorithms. The course uses GNU Octave (“open-source Matlab”) for the programming assignments. By the end of the course you will have implemented several algorithms yourself, including a neural network, and you’ll have a thorough understanding of concepts like the bias/variance trade-off. You will also have glimpsed the challenges and solutions of large-scale machine learning. My advice: really do the quizzes and the assignments to get the most out of the course.
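To give a taste of those assignments, here is a rough sketch of batch gradient descent for linear regression, one of the first algorithms you implement in the course. The course itself uses Octave; this Python/numpy version, its function name, learning rate, and toy data are my own illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Fit theta minimizing mean squared error via batch gradient descent."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend an intercept column
    theta = np.zeros(n + 1)
    for _ in range(iterations):
        predictions = Xb @ theta
        gradient = (Xb.T @ (predictions - y)) / m  # gradient of MSE cost
        theta -= alpha * gradient                  # step downhill
    return theta

# Toy data: y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.01, size=100)

theta = gradient_descent(X, y)
print(theta)  # approximately [1.0, 2.0]
```

The course has you implement exactly this kind of loop by hand before ever touching a library, which is what makes the later material stick.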
Once you know the basics it’s time to…
Get Your Hands Dirty
At this point you’ve implemented many of the algorithms yourself, so you already know a lot about ML. (Congratulations, by the way!) Now you probably can’t wait to get some real data into your hands and train models, and kaggle.com is the perfect place for that. Pick either an intro competition or a simple real one; by now you should have a feel for what’s realistic. Just browse the competition descriptions and pick whatever interests you. To have the most fun here, use IPython/Jupyter notebooks with pandas to explore the data. There’s a low-barrier intro to pandas, and en route you’ll also pick up the basics of numpy.

You should also look around the Kaggle website. There’s a forum with a very helpful community, plus IPython notebooks and scripts that show how to load and transform data and finally how to feed it into ML libraries, the most popular of which is scikit-learn. So try to load the data, drop features you don’t think are important or that are too complicated to transform, and feed the rest into a suitable scikit-learn algorithm. (You already know which problems call for which algorithm.)
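A first attempt at that workflow might look roughly like the sketch below. The DataFrame here is made-up toy data standing in for a competition CSV (normally you’d start with `pandas.read_csv`), and the column names and choice of model are my own illustration, not anything Kaggle prescribes.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for a Kaggle dataset; normally: df = pd.read_csv("train.csv")
df = pd.DataFrame({
    "age":      [22, 38, 26, 35, 28, 54, 2, 27, 14, 58],
    "fare":     [7.3, 71.3, 7.9, 53.1, 8.1, 51.9, 21.1, 11.1, 30.1, 26.6],
    "cabin":    ["", "C85", "", "C123", "", "E46", "", "", "", "C103"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
})

# Drop a feature that is too messy to transform on a first attempt,
# and separate the target column from the inputs
X = df.drop(columns=["cabin", "survived"])
y = df["survived"]

# Hold out part of the data so you can estimate generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out rows
```

On a real competition the interesting work is in the middle step: deciding which columns to keep and how to transform the messy ones, which is exactly what the shared notebooks on Kaggle demonstrate.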
If you ever get stuck, take a look at how other people solved the problem in your competition.
Once you’ve made it to your first submission, you’ve actually come a long way.
The only thing cooler than working with data is working with even more data, so you could look into deep learning. Google has just released TensorFlow, and there’s a Udacity course about deep learning with TensorFlow. It’s not as guided as the ML course, but you’ll get some example IPython notebooks and should be able to figure everything out. Along the way you’ll be exposed to some interesting problems. At this point you may find yourself googling for powerful GPUs.
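To make “deep learning” a bit more concrete, here is a minimal sketch of what a framework like TensorFlow automates: a one-hidden-layer network trained by hand-written gradient descent, in plain numpy. The architecture, learning rate, and XOR toy task are my own illustration, not taken from the Udacity course.

```python
import numpy as np

# A tiny one-hidden-layer network learning XOR, written by hand.
# TensorFlow automates the gradient computation (autodiff), runs it
# on GPUs, and scales it up; the math underneath looks like this.

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of binary cross-entropy loss
    d_out = out - y
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # tanh derivative
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)
    # Gradient descent step (averaged over the batch)
    W2 -= lr * dW2 / len(X); b2 -= lr * db2 / len(X)
    W1 -= lr * dW1 / len(X); b1 -= lr * db1 / len(X)

pred = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
print(pred.round().ravel())  # rounded predictions for the four XOR inputs
```

Writing the backward pass by hand once is instructive; after that you’ll appreciate why frameworks compute gradients for you.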
If you can pick an OS for ML, prefer them in this order: Linux, macOS, Windows.