Thursday, November 17, 2016

Data Visualization in R

Thought I'd post an update with some R visualizations I've recently completed. Although I prefer Python, I have to admit R is an incredibly powerful language for translating data into graphs, and I might just be warming to it, especially for visualization.

The data is the Journey to Work census conducted by the NSW Bureau of Transportation. It is a snapshot of daily commuter behavior, showing which suburbs people travel to and from on a single weekday in Sydney. The first image is a heatmap of the origin and destination suburbs that workers travel between. It does a nice job of clustering together the commercial hubs in the Sydney area and showing the suburbs that people travel from to get there.
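For anyone wondering what sits underneath a heatmap like this, here is a minimal sketch of the aggregation step, written in Python since that's what I use elsewhere on this blog. It is not the R code behind the image. The trip rows and the `od_matrix` helper are made up for illustration; the real data set has its own column names and far more suburbs.

```python
from collections import Counter

# Hypothetical trip records: (origin suburb, destination suburb, commuters).
# These rows are invented for illustration, not taken from the census.
trips = [
    ("Parramatta", "Sydney CBD", 120),
    ("Chatswood", "Sydney CBD", 80),
    ("Parramatta", "North Sydney", 30),
    ("Parramatta", "Sydney CBD", 50),
]

def od_matrix(records):
    """Sum commuters per (origin, destination) pair and lay them out as a grid."""
    totals = Counter()
    for origin, destination, count in records:
        totals[(origin, destination)] += count
    origins = sorted({o for o, _ in totals})
    destinations = sorted({d for _, d in totals})
    # Rows are origins, columns are destinations; missing pairs count as 0.
    grid = [[totals[(o, d)] for d in destinations] for o in origins]
    return origins, destinations, grid

origins, destinations, grid = od_matrix(trips)
```

Each cell of `grid` is what a heatmap would colour: one row per origin suburb, one column per destination suburb.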



The next is a network connection map showing the number of people travelling between each origin and destination, with the transparency of each line indicating how many people make that trip. While it doesn't really quantify much, it shows hundreds of connections in a single image. The human eye can process a whole lot of information, and this drops all that on you at a glance, which I think is pretty amazing.






Sunday, October 16, 2016

Past and Future

I have a background in 3D art and design and have worked in 3D animation, games and the VFX industry. Working in that field taught me a lot about the complexity of vision and just how much information can be processed by the human brain. Also, how easy it is to offend that same visual process with the wrong aesthetics. So with this background, I find the visualization of data fascinating and have always been drawn to presenting data in as pleasing and informative a way as possible.

Right now, I think we're on the verge of a second golden age in data visualization. The first golden age apparently occurred in the latter half of the 19th century. This was steamrolled by the advent of what we consider modern statistics in the 20th century: t-tests, z-scores, ANOVA, lookup tables. All necessary for quantification, but lacking in visual appeal. Graphs and charts have been the mainstay of this era, but creativity and, dare I say, storytelling have been absent.

There seems to be a perfect storm of programming literacy, public interest in data, and demand from the public and private sectors that is driving the development of tools like D3.js, Bokeh and of course R.

Looking at the dataisbeautiful subreddit, you would also be forgiven for thinking that there is a hunger at the moment to see stories told through data, with a healthy competitiveness to see who can reveal some new insight into what is happening in the world around us.

Tuesday, September 13, 2016

Python k-NN on Git

Although I tend to use a lot of Python and data science in my work, I recently decided to gain some formal postgraduate qualifications in the field. As part of my postgraduate courses, I've been performing k-Nearest Neighbor analysis on the Kaggle Titanic data set. If you haven't heard of the Titanic data set, it's one of Kaggle's introductory machine learning tutorials and a good starting point for people wanting to understand how to apply a relatively simple classification algorithm to a sample data set.

What's been interesting in this course so far is that we aren't allowed to use any of the standard Python tools for data analysis such as numpy, pandas and scikit-learn, and instead are hand coding the matrix algebra and sorting that make up k-NN. At first I was resistant to this approach, as I use numpy and pandas all the time and, well... being separated from them was a little more work. Also, doesn't scikit-learn just do all that for you? It actually raised some interesting philosophical questions for me. The most attractive thing about Python is that it's supported by a ridiculous number of modules to do most tasks. While it's very easy to get accustomed to this, it can actually be counterproductive from a learning perspective, such as in this case. If you really want to understand the workings of something as intricate as machine learning, there's no better way than being forced to deal with the underlying code. Hand coding, while not a huge amount of work, is considerably more effort than the 10 or so lines of code you'd need to run the algorithm in scikit-learn.

So, having completed the assignment, I've uploaded the implementation to my git repository if anyone is in need of a bare bones k-NN algo or just curious.
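To give a flavour of what "hand coded" means here, below is a minimal sketch of k-NN using only the standard library. It is not the code in my repository, just the general shape of the idea: measure the distance from a query point to every training row, sort, and take a majority vote among the k nearest. The feature rows (passenger class, age) and labels are made up, and the function names are my own for this example.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_rows, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training rows."""
    distances = sorted(
        (euclidean(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Made-up rows: (passenger class, age); label 1 = survived, 0 = did not.
train_rows = [(1, 29.0), (1, 35.0), (3, 22.0), (3, 40.0), (2, 30.0)]
train_labels = [1, 1, 0, 0, 1]

prediction = knn_predict(train_rows, train_labels, (1, 31.0), k=3)
```

Note that the features are not scaled here, so age dominates the distance. A real run on the Titanic data would want to normalise the columns first.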


Sunday, August 28, 2016

Anatomy of a Tweet

I've been doing some Twitter API mining lately using the Tweepy module in Python. There is something quite zen about watching the tsunami of real-time updates whizz past faster than you can read them. Real Matrix stuff... "blonde, brunette, redhead..." I digress.

Using the Twitter API is nothing new, but I've got some ideas about how to use the data with the NetworkX module, which is another awesome toolkit for building network graphs and exporting them as JSON or in a Gephi-compatible format.

While mucking around with figuring out how Twitter API data is returned, I came across this nice map of a Twitter Status Object and thought it deserved a post.
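As a rough idea of how that structure can feed a network graph, here is a small standard-library sketch that pulls "who mentioned whom" pairs out of a status. It assumes you already have the status as a plain dict in the shape of the REST API JSON. The sample status is invented and heavily trimmed, and `mention_edges` is just a name I made up for this example.

```python
import json

# A made-up, heavily trimmed status in the shape of the REST API JSON.
status = {
    "id_str": "1",
    "text": "Mapping commutes with @alice and @bob #dataviz",
    "user": {"screen_name": "carol"},
    "entities": {
        "hashtags": [{"text": "dataviz"}],
        "user_mentions": [{"screen_name": "alice"}, {"screen_name": "bob"}],
    },
}

def mention_edges(status):
    """Return (author, mentioned user) pairs for one status dict."""
    author = status["user"]["screen_name"]
    mentions = status.get("entities", {}).get("user_mentions", [])
    return [(author, m["screen_name"]) for m in mentions]

edges = mention_edges(status)

# The same edges as JSON, ready to hand to whatever graph tool you like.
edges_json = json.dumps([{"source": s, "target": t} for s, t in edges])
```

Each pair is a directed edge from the tweet's author to someone they mentioned, which is about the simplest graph you can build from a stream of tweets.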




Thursday, August 11, 2016

Play framework for Java





I'm playing around with the Java framework Play, ideally as a front end to a MySQL database for a personal data project I'm working on. I could do this in Python with the Django or Flask frameworks, but Play seems like a good way to learn some more Java. So here I go.

I came across an error and thought the solution would be worth sharing.

Play uses Lightbend Activator to install the required runtimes. After I set the system path as it requires you to, I tried to run activator new from the command line. It threw the following hissy fit:



Unable to access jarfile C:Program Files\activator-dist-1.3.10\bin\test\play-java\libexec\activator-launch-1.3.9.jar 

Turns out the problem is spaces in the path names. At line 42 of activator.bat, the directory needs to be wrapped in quotes so the whole path is treated as a single string.

Simply add quotes as follows: 

before: 
for %%d in (%BIN_DIRECTORY%) do set ACTIVATOR_HOME=%%~dpd 

after: 
for %%d in ("%BIN_DIRECTORY%") do set ACTIVATOR_HOME=%%~dpd
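If it's not obvious why the quotes matter, here is a rough Python model of what the batch `for` loop does with its list of items. This is a simplification, not how cmd is actually implemented: it only treats spaces and tabs as separators and keeps a double-quoted run together as one item. The `for_set_items` helper is made up, and the directory is my guess at the install path based on the error message above.

```python
import re

def for_set_items(text):
    """Rough model of how a batch `for ... in (set)` splits its set.

    Simplification: only spaces and tabs separate items here, and a
    double-quoted run is kept together as a single item.
    """
    return re.findall(r'"[^"]*"|[^ \t]+', text)

# Assumed install directory, for illustration only.
bin_directory = r"C:\Program Files\activator-dist-1.3.10\bin"

unquoted = for_set_items(bin_directory)              # path splits at the space
quoted = for_set_items('"' + bin_directory + '"')    # path stays in one piece
```

Without quotes the loop sees two items, neither of which is the real directory, so whatever ends up in ACTIVATOR_HOME points at the wrong place. With quotes there is a single item and the path survives intact.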