Although I tend to use a lot of Python and data science in my work, I recently decided to gain some formal postgraduate qualifications in the field. As part of my postgraduate coursework, I've been performing k-Nearest Neighbor analysis on the Kaggle Titanic data set. If you haven't heard of the Titanic data set, it's one of Kaggle's introductory Machine Learning tutorials and a good starting point for anyone wanting to understand how to apply a relatively simple classification algorithm to a sample data set.
What's been interesting in the course so far is that we aren't allowed to use any of the standard Python tools for data analysis such as NumPy, pandas and scikit-learn; instead we're hand-coding the matrix algebra and sorting behind the k-NN components. At first I was resistant to this approach, as I use NumPy and pandas all the time and, well... being separated from them was a little more work. Also, doesn't scikit-learn just do all that for you? It actually raised some interesting philosophical questions for me. The most attractive thing about Python is that it's supported by a ridiculous number of modules covering most tasks. While it's very easy to get accustomed to this, it can actually be counterproductive from a learning perspective, as in this case. If you really want to understand the workings of something as intricate as machine learning, there's no better way than being forced to deal with the underlying code. Hand-coding, while not a huge amount of work, is considerably more effort than the ten or so lines of code you'd need to run the algorithm in scikit-learn.
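To make the contrast concrete, here's a minimal sketch of what the hand-coded approach looks like in plain Python using only the standard library. This isn't my assignment code, just the core idea: measure the distance from a query point to every training point, sort by distance, and take a majority vote among the k closest. The feature values and labels below are made up purely for illustration.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point with its distance to the query and sort by distance.
    distances = sorted(
        ((euclidean_distance(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    # Majority vote over the labels of the k closest points.
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical, already-scaled features (say, age and fare); 0 = died, 1 = survived.
train_X = [[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.7, 0.8]]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, [0.75, 0.85], k=3))  # -> 1
```

For comparison, the scikit-learn version really is only a handful of lines (reusing the same toy data as above):

```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)           # training features and labels
print(model.predict([[0.75, 0.85]]))  # -> [1]
```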
So, having completed the assignment, I've uploaded the implementation to my git repository, in case anyone is in need of a bare-bones k-NN algo or is just curious.