Thursday, May 11, 2017

NetworkX to Sigma.js Graph Generator

NetworkX is a Python package for creating, analysing and visualizing network graphs. These are graphs of interconnected nodes used for representing relationships such as those found in social networks. A similar browser-based network graph framework is Sigma.js, a JavaScript library for displaying interactive network graphs, similar to the widely used and all-powerful D3. They both excel in different areas: NetworkX excels in analysis and is also great for dynamically creating graph trees, while Sigma.js is a front-end interactive GUI framework. I've been using both of these lately and thought I would share some Python code I've written for assembling a graph network with NetworkX and exporting the resulting JSON in a format Sigma can display.

The example code below is part of a larger program that displays word associations from Twitter posts. I haven't included the entire Twitter mining component as it is quite complex and uses a lot of natural language processing which isn't related to the use of either NetworkX or Sigma. The input (list_in) is a list of word associations (indexes 0 and 1), their ranking (index 2, an integer) and whether they are a primary 'P' or tertiary 'T' node (index 3). This is based on their position in relation to a central node, in this case the word 'country'.

The output from the print statement can be dropped into any Sigma.js canvas to be displayed and looks like this.
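The actual output isn't reproduced here, but for reference, Sigma.js reads a JSON object with "nodes" and "edges" arrays of roughly the following shape. The labels, coordinates and sizes below are purely illustrative, not the program's real output:

{
  "nodes": [
    {"id": "n0", "label": "country", "x": 0.0, "y": 0.0, "size": 3},
    {"id": "n1", "label": "music",   "x": 1.0, "y": 0.5, "size": 1}
  ],
  "edges": [
    {"id": "e0", "source": "n0", "target": "n1"}
  ]
}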


Wednesday, April 19, 2017

Naive Bayes Classifier in Java Tutorial

Lately I've been implementing some machine learning using the Naive Bayes algorithm in Java. I wanted a side project to get under the hood with Java, and this has also coincided with courses in probability theory I've been taking. There are a lot of Python tutorials online for writing this type of classifier but not many in the Java realm. While Java isn't used as much in analytics as Python and R, it is used a lot in data engineering pipelines, so hopefully this might be a useful contribution.

Some of the math involved


This classifier is based around Bayes' theorem, which underpins a lot of conditional statistics and describes the probability of an event based on prior knowledge of conditions that might be related to the event. Where X and Y are events and P(Y) ≠ 0, the probability of event X occurring given that Y has occurred is as follows.

$$ P(X|Y) = \dfrac{P(Y|X)P(X)}{P(Y)} $$

Probability theory is fundamental to a lot of machine learning, but that isn't the focus of this tutorial. What is important is how this works as a machine learning classifier. The equation we'll be using in the code is the normal distribution approximation (or event model) of the Naive Bayes classifier, which uses the probability density function described below. Probability density functions are used for numerical data and describe the probability of a random variable occurring within a given distribution. This requires that we know the standard deviation and mean of each variable in the training set in order to model that distribution. These are easily calculated, and once we have them for the vectors associated with each class label (0 and 1 in this case), we can predict the probability of a random variable belonging to one class or the other.

$$ \displaystyle pdf(x) = \frac{1}{\sigma \sqrt{2 \pi}}\ e^{-\frac{(x-\mu)^2}{2 \sigma^2}} $$

The thing that can be difficult to grasp initially is how the decision is made for the individual classes, and it's something I've often found explained poorly from a beginner's perspective. Even if we have an equation that determines the probability of one variable occurring within a certain distribution, how can each vector be assigned a class label? Assuming our class label is Y and the variables in our vector are X1...Xn, the Naive Bayes algorithm simplifies things by assuming that the variables are independent of each other given the class label, which just means there is no influence between them in determining which class the vector belongs to.

$$ P(X_1...X_n|Y) = \prod_{i=1}^n P(X_i|Y) $$

This equation implies that the product of all the individual probabilities in the vector will give us an overall probability of all the variables being as they are for a specific class label. As we calculate the individual probability of each variable in the vector using the normal distribution approximation above, it's a matter of multiplying these together using the mean and standard deviation associated with each class label. Once we have these values, the class with the highest probability is chosen as the prediction.
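Written out, the decision just described looks like the following. Note that comparing the two products directly amounts to treating the two class labels as equally likely; a full Naive Bayes classifier would also multiply each product by the class prior P(Y = y).

$$ \hat{Y} = \underset{y \in \{0, 1\}}{\arg\max} \; \prod_{i=1}^n P(X_i \mid Y = y) $$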

From a programming point of view, it's easier to see how these pieces fit together, so the diagram below shows how the statistics are applied as the data flows through the program.



Programming and code examples


The dataset I'll be using for testing is an implementation of Leo Breiman's twonorm example, in which each observation is a vector of 20 random variables drawn from one of two overlapping normal distributions. This is a synthetic data set, so it will return a very high accuracy, but it works for the purpose of testing the classifier.

The program itself is split into four classes, in addition to the class containing the standard public static void main() method that Java executes at run time. Here is a list of the packages I'll be using in the code, which you may need to import manually if your IDE doesn't do it for you. Thanks IntelliJ :)
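The original list isn't embedded here, but based on the classes described below, the imports would most likely include something along these lines:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;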


CSVReader


The first class, CSVReader, contains a single method, parseCSV, that is used to import the data into an array of arrays, an ArrayList<ArrayList<Double>> to be exact. Not to be confused with regular arrays, ArrayLists are more flexible than standard arrays since they are resizable. This data structure will be the basis of the program and mimics a 2D matrix.

The data is read in via a BufferedReader, which we define as br, and we start reading each line. Each row of the data contains a vector of numbers, which are the numerical values for the variables (X1 ... Xn). The last entry is either a 0 or 1 and is the class label, the category that this vector belongs to.

Our CSV file consists of a vector on each line, with the class label in the final column. The length of the vector determines the number of columns in our matrix, and this is defined as Integer len. We parse each string as a double since we can't be certain that our data consists purely of integers. We then loop through the line with a for loop, enter the values into an ArrayList<Double>, and add that row to our final matrix.
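The original code isn't embedded here, so as a rough sketch of that logic (the file name argument and comma delimiter are assumptions), parseCSV might look something like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class CSVReader {

    // Reads each line of the CSV into an ArrayList<Double> and collects the rows
    // into an ArrayList<ArrayList<Double>> that mimics a 2D matrix.
    public ArrayList<ArrayList<Double>> parseCSV(String fileName) throws IOException {
        ArrayList<ArrayList<Double>> matrix = new ArrayList<>();
        BufferedReader br = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = br.readLine()) != null) {
            String[] tokens = line.split(",");  // assumes comma-separated values
            Integer len = tokens.length;        // number of columns in the matrix
            ArrayList<Double> row = new ArrayList<>();
            for (int i = 0; i < len; i++) {
                // parse each string as a double; the last column is the 0/1 class label
                row.add(Double.parseDouble(tokens[i]));
            }
            matrix.add(row);
        }
        br.close();
        return matrix;
    }
}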



DataProcess


Our next class is DataProcess, where we separate out the data we intend to use for our test and training sets. Our main method, splitSet, takes the matrix we created and a split ratio between 0 and 1. A random number is generated based on the size of our data and we use this as an index to retrieve a random vector for inclusion in our training set. We remove this vector from the test set and add it to our training set. Finally, both sets are returned via an array.
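Again, a minimal sketch of that splitting logic might look like the following (returning the two sets in a generic array is an assumption):

import java.util.ArrayList;
import java.util.Random;

public class DataProcess {

    // Moves a random selection of vectors from a copy of the data into the
    // training set until the split ratio is reached, then returns both sets.
    public ArrayList<ArrayList<Double>>[] splitSet(ArrayList<ArrayList<Double>> matrix,
                                                   double splitRatio) {
        int trainSize = (int) (matrix.size() * splitRatio);
        ArrayList<ArrayList<Double>> testSet = new ArrayList<>(matrix);  // start with every vector
        ArrayList<ArrayList<Double>> trainSet = new ArrayList<>();
        Random random = new Random();
        while (trainSet.size() < trainSize) {
            // random index based on the current size of the remaining data
            int index = random.nextInt(testSet.size());
            trainSet.add(testSet.remove(index));
        }
        @SuppressWarnings("unchecked")
        ArrayList<ArrayList<Double>>[] sets = new ArrayList[]{trainSet, testSet};
        return sets;
    }
}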


Statistics


Next we need to generate some statistics for our sets, and we do this within the Statistics class. We will use this class to produce the mean and standard deviation of each variable: one set for the positive class results and one for the negative results.

We begin by sending the trainSet to the classStats method, after which we use the getCol method in a for loop to retrieve each column (variable). These are then separated into the positive and negative class results by checking whether the idCol (the last column in the matrix) matches 0 or not. Although this is a binary classifier, it would be possible to split out separate classification groups by adding more conditional controls here. Once these class separations are performed, each variable is sent to the individualStats method, which calls the meanList and stdDev methods.

The classStats method eventually returns a HashMap of the form {0: n × [mean, stdDev], 1: n × [mean, stdDev]}, where n is the number of variables.
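A condensed sketch of the Statistics class as described might look like this; the getCol helper from the post is folded into the main loop here for brevity:

import java.util.ArrayList;
import java.util.HashMap;

public class Statistics {

    // Returns {0: [mean, stdDev] per variable, 1: [mean, stdDev] per variable}.
    public HashMap<Integer, ArrayList<double[]>> classStats(ArrayList<ArrayList<Double>> trainSet) {
        HashMap<Integer, ArrayList<double[]>> summaries = new HashMap<>();
        summaries.put(0, new ArrayList<>());
        summaries.put(1, new ArrayList<>());
        int idCol = trainSet.get(0).size() - 1;  // the last column holds the class label
        for (int col = 0; col < idCol; col++) {
            ArrayList<Double> negatives = new ArrayList<>();
            ArrayList<Double> positives = new ArrayList<>();
            for (ArrayList<Double> row : trainSet) {
                // separate this variable's values by class label
                if (row.get(idCol) == 0.0) {
                    negatives.add(row.get(col));
                } else {
                    positives.add(row.get(col));
                }
            }
            summaries.get(0).add(individualStats(negatives));
            summaries.get(1).add(individualStats(positives));
        }
        return summaries;
    }

    // [mean, standard deviation] for a single variable
    public double[] individualStats(ArrayList<Double> column) {
        return new double[]{meanList(column), stdDev(column)};
    }

    public double meanList(ArrayList<Double> column) {
        double sum = 0.0;
        for (double value : column) {
            sum += value;
        }
        return sum / column.size();
    }

    public double stdDev(ArrayList<Double> column) {
        double mean = meanList(column);
        double sumSquares = 0.0;
        for (double value : column) {
            sumSquares += Math.pow(value - mean, 2);
        }
        return Math.sqrt(sumSquares / (column.size() - 1));  // sample standard deviation
    }
}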


Predictions


The final class to be implemented is the Predictions class, which will be used to calculate the probabilities and decide which class label each vector is predicted to belong to.

The summaries and the testSet are sent to the goPredict method, and it's now time to calculate predictions for the remaining data. Each vector from the testSet is sent to the decidePredict method and the result is added to the finalPredictions ArrayList.

The decidePredict method obtains a combined probability from the classProbability method for each of the class labels. classProbability determines the probability of each variable in the vector by sending the summary statistics of each class label (0 and 1), along with the variable, to the densityFunc method, which calculates the probability using the probability density function for the normal distribution outlined in the math section above. The class label with the higher of the two combined values is selected and returned by decidePredict.

The last method, accuracy, is used to calculate what percentage of the class predictions were correct. We pass the testSet and the resulting class predictions to the method and count the number of successes by comparing each prediction to the last column in the testSet. A percentage is then calculated and returned.
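As before, this is only a sketch of how those methods might fit together; the method names follow the post but the exact signatures are assumptions:

import java.util.ArrayList;
import java.util.HashMap;

public class Predictions {

    // Normal probability density function for a single value
    public double densityFunc(double x, double mean, double stdDev) {
        double exponent = Math.exp(-Math.pow(x - mean, 2) / (2 * Math.pow(stdDev, 2)));
        return (1 / (stdDev * Math.sqrt(2 * Math.PI))) * exponent;
    }

    // Product of the individual variable probabilities for one class label
    public double classProbability(ArrayList<double[]> classSummary, ArrayList<Double> vector) {
        double probability = 1.0;
        for (int i = 0; i < classSummary.size(); i++) {
            double[] stats = classSummary.get(i);  // [mean, stdDev] for variable i
            probability *= densityFunc(vector.get(i), stats[0], stats[1]);
        }
        return probability;
    }

    // Chooses the class label with the higher combined probability
    public int decidePredict(HashMap<Integer, ArrayList<double[]>> summaries, ArrayList<Double> vector) {
        double prob0 = classProbability(summaries.get(0), vector);
        double prob1 = classProbability(summaries.get(1), vector);
        return (prob1 > prob0) ? 1 : 0;
    }

    // Predicts a class label for every vector in the test set
    public ArrayList<Integer> goPredict(HashMap<Integer, ArrayList<double[]>> summaries,
                                        ArrayList<ArrayList<Double>> testSet) {
        ArrayList<Integer> finalPredictions = new ArrayList<>();
        for (ArrayList<Double> vector : testSet) {
            finalPredictions.add(decidePredict(summaries, vector));
        }
        return finalPredictions;
    }

    // Percentage of predictions matching the class label in the last column of the test set
    public double accuracy(ArrayList<ArrayList<Double>> testSet, ArrayList<Integer> predictions) {
        int idCol = testSet.get(0).size() - 1;
        int correct = 0;
        for (int i = 0; i < testSet.size(); i++) {
            if (testSet.get(i).get(idCol) == (double) predictions.get(i)) {
                correct++;
            }
        }
        return 100.0 * correct / testSet.size();
    }
}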

Lastly, we arrange all the method calls in the public static void main method, which performs all the operations above and outputs our prediction accuracy.
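Finally, a rough sketch of how the main method might wire everything together; the outer class name, file name and split ratio here are all assumptions:

import java.util.ArrayList;
import java.util.HashMap;

public class NaiveBayes {

    public static void main(String[] args) throws Exception {
        // read, split, summarise, predict, score
        ArrayList<ArrayList<Double>> matrix = new CSVReader().parseCSV("twonorm.csv");
        ArrayList<ArrayList<Double>>[] sets = new DataProcess().splitSet(matrix, 0.67);
        HashMap<Integer, ArrayList<double[]>> summaries = new Statistics().classStats(sets[0]);
        Predictions predictions = new Predictions();
        ArrayList<Integer> results = predictions.goPredict(summaries, sets[1]);
        System.out.println("Accuracy: " + predictions.accuracy(sets[1], results) + "%");
    }
}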

Sunday, February 12, 2017

PyMySQL script

Nearly two months since an update. I'm going to put that down to holiday time taking its toll on my personal geek time. Sure. That's a good excuse.

I've been working on a web app for Android mostly in my spare time. It's a web scraping and data visualization project for... wait for it... Twitter. As if there aren't enough Twitter data apps I hear you say. Well, get ready for another one. Also since I'd like to get more experience in Java/Android development, I'm going to be releasing it on the Play Store. But more on that later.

For now, here's a link to an upload on GitHub I made last year.

It's a pretty simple Python script that uses PyMySQL for inserting data from an XLS file into two MySQL tables, one of which has a foreign key constraint to an auto-increment ID in the parent table.

This was designed for splitting an XLS template into several tables in MySQL. Each entry in the template was a spatial observation, and the challenge was to upload the data whilst retaining the same ID for the points so that they could be queried later from the database. Compounding this was that some observations were often missing, and in some instances rows could be empty in either one table or the other.

One Excel sheet can be divided into numerous SQL tables based on variable indexing. This includes initial table creation.

Empty rows are cleaned out of the database; however, the indexes between the tables are not altered, ensuring that the data can still be retrieved.


More at the link here.



Thursday, November 17, 2016

Data Visualization in R

Thought I'd share some R visualization work I've recently completed. Although I prefer Python, I have to admit R is an incredibly powerful language for translating data into graphs, and I might just be warming to it, especially for visualization.

The data is from the journey to work census conducted by the NSW Bureau of Transportation. It is a snapshot of daily commuter behavior, showing which suburbs people travel to and from on a single weekday in Sydney. The first image is a heatmap of the origin and destination suburbs that workers travel between. It does a nice job of clustering together commercial hubs in the Sydney area and showing the suburbs that people travel from to get there.



The next is a network connection map showing the number of people travelling between each origin and destination, with less transparent lines indicating more people. While not really quantifying much information, it shows hundreds of connections in a single image. The human eye can process a whole lot of information, and this drops all of that on you at a glance, which I think is pretty amazing.






Sunday, October 16, 2016

Past and Future

I have a background in 3D art and design and have worked in 3D animation, games and the VFX industry. Working in that field taught me a lot about the complexity of vision and just how much information can be processed by the human brain, and also how easy it can be to offend that same visual process with the wrong aesthetics. So with this background, I find the visualization of data fascinating and have always been drawn to presenting data in as pleasing and informative a way as possible.

Right now, I think we're on the verge of a second golden age in data visualization. The first golden age apparently occurred in the latter half of the 19th century. This was steamrolled by the advent of what we consider modern statistics in the 20th century: t-tests, z-scores, ANOVA, lookup tables. All necessary for quantification, but lacking in visual appeal. Graphs and charts have been the mainstay of this era, but creativity and, dare I say, storytelling have been absent.

It seems like there is a perfect storm of programming literacy, public interest in data and need in the public and private sectors that is driving the development of tools like D3.js, Bokeh and of course R.

Looking at the dataisbeautiful subreddit, you would also be forgiven for thinking that there is a hunger at the moment to see stories told through data, with a healthy competitiveness to see who can reveal some new insight into what is happening in the world around us.

Tuesday, September 13, 2016

Python K-NN on GIT

Although I tend to use a lot of Python and data science in my work, I recently decided to gain some formal postgraduate qualifications in the field. As part of my postgraduate courses, I've been performing k-Nearest Neighbor analysis on the Kaggle Titanic data set. If you haven't heard of it, the Titanic data set is one of Kaggle's introduction to machine learning tutorials and a good starting point for people wanting to understand how to apply a relatively simple classification algorithm to a sample data set.

What's been interesting in this course so far is that we aren't allowed to use any of the standard Python tools for data analysis such as numpy, pandas and scikit-learn, and instead are hand coding the matrix algebra and sorting of the k-NN components. At first I was resistant to this approach as I use numpy and pandas all the time and, well... being separated from them was a little more work. Also, doesn't scikit-learn just do all that for you? It raised some interesting philosophical questions for me, actually. The most attractive thing about Python is that it's supported by a ridiculous number of modules for most tasks. While it's very easy to get accustomed to this, it can actually be counterproductive from a learning perspective, such as in this case. If you really want to understand the workings of something as intricate as machine learning, there's no better way than to be forced to deal with the underlying code. Hand coding, while not a huge amount of work, is considerably more effort than the 10 or so lines of code you'd need to run the algorithm in scikit-learn.

So, having completed the assignment, I've uploaded the implementation to my Git repository if anyone is in need of a bare-bones k-NN algorithm or is just curious.


Sunday, August 28, 2016

Anatomy of a Tweet

I've been doing some Twitter API mining lately using the Tweepy module in Python. There is something quite zen about watching the tsunami of real-time updates whizz past faster than you can read them. Real Matrix stuff... "blonde, brunette, redhead..."... I digress.

Using the Twitter API is nothing new, but I've got some ideas about how to utilize the data with network graphs using the NetworkX module, another awesome toolkit for building network graphs and exporting them in JSON or Gephi format.

While mucking around with figuring out how Twitter API data is returned, I came across this nice map of a Twitter Status Object and thought it deserved a post.