Monday, 29 July 2013

Crossfilter, dc.js and d3.js for Data Discovery

The following post is a portion of the D3 Tips and Tricks book, which is free to download. To use this post in context, consider it with the others in the blog or just download the book as a pdf, epub or mobi.
----------------------------------------------------------
The ability to interact with visual data is the third step on the road to data nirvana in my humble opinion.

Step 1: Raw data
Step 2: Visualize data
Step 3: Interact with data

But I think that there might be a 4th step where data is a more fluid construct, where the influences of interaction have a more profound impact on how information is presented and perceived. I think that the visualization tools that we're going to explore in this chapter take that 4th step.

Step 4: Data immersion

The tools we're going to use are not the only way to achieve the effect of immersion, but they are simple enough for me to use and they incorporate d3.js at their core.

Introduction to Crossfilter

Crossfilter is a JavaScript library for exploring, in the browser, large datasets that include many variables. It supports extremely fast interactions with concurrent views and was built to power analytics for Square Register so that online merchants can slice and dice their payment history fluidly. It was developed for Square by (amongst other people) the ever tireless Mike Bostock and was released under the Apache Licence.
Crossfilter applies a map-reduce style of processing to data using 'dimensions' and 'groups'. Map-reduce is an interesting concept in itself, and understanding it in a basic form makes it easier to understand crossfilter.

Map-reduce

Wikipedia tells us that "MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster". Loosely translated into language I can understand, I think of a large data set having one dimension 'mapped' or loaded into memory ready to be worked on. In practical terms, this could be an individual column of data from a larger group of information. This column of data has 'key' values which we can define as being distinct. In the case of the data below, this could be earthquake magnitudes.
The reduce function then takes that dimension and 'reduces' it by grouping it according to a specific aspect. For instance, with the earthquake data we may want to group each unique value of magnitude (by counting how many occurrences of each there are) to know how many earthquakes of a specific magnitude have taken place. This leaves us with a very specific summary of our data, as in the table below.
Magnitude Count
2.6       63
2.7       134
2.8       292
2.9       299
3.0       378
3.1       351
3.2       403
3.3       455
3.4       512
3.5       688
Please don't think that this is the sum total of information you need to know to be the master of map-reduce. This is a ridiculously simplistic view which is only intended to supply enough information to get you familiar with the way that we will use crossfilter later :-).
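
To make the idea a little more concrete, here is a minimal sketch of the same 'map' then 'reduce' steps in plain JavaScript. It does not use crossfilter at all, and the `quakes` records are made-up values for illustration rather than the data behind the table above.

```javascript
// Hypothetical sample records; the values are invented for illustration.
const quakes = [
  { magnitude: 2.6, depth: 10 },
  { magnitude: 2.7, depth: 33 },
  { magnitude: 2.6, depth: 5 },
  { magnitude: 3.0, depth: 12 },
  { magnitude: 2.7, depth: 8 },
  { magnitude: 2.7, depth: 20 },
];

// "Map": pull out the single column (dimension) we care about.
const magnitudes = quakes.map(q => q.magnitude);

// "Reduce": collapse that column into a count per distinct key.
const counts = magnitudes.reduce((acc, m) => {
  const key = m.toFixed(1); // use a string key such as "2.6"
  acc.set(key, (acc.get(key) || 0) + 1);
  return acc;
}, new Map());

// Sort by magnitude so the result reads like the table above.
const table = [...counts.entries()]
  .map(([magnitude, count]) => ({ magnitude, count }))
  .sort((a, b) => Number(a.magnitude) - Number(b.magnitude));

console.log(table);
```

Each entry in `table` pairs a magnitude with the number of records that share it, which is the same shape as the Magnitude / Count listing.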

What can crossfilter do?

The best way to get a feel for the capabilities of crossfilter is to visit the demo page for crossfilter and to play with their example.

Here we are presented with five separate views of a data set that represents flight records demonstrating airline on-time performance. There are 231,083 flight records in the database being used, so getting that rendered in a web page is no small feat in itself.
The bottom view is a table showing data for individual flights. The top, left view is of the number of flights that occur at a specific hour of the day.
The top, middle graph shows the amount of delay for flights grouped in 10 minute intervals.
The top, right graph shows the distance covered by each flight grouped in 50 mile chunks.

The wider bar graph in the second row shows the number of flights per day.
This particular graph is the first to give a hint at how cool this visualization really is, because it includes a section in the middle of the graph which is selected with 'handles' on either side of the selection. You can move these handles with a mouse and as a result you will find all the data represented in the other graphs adjusting dynamically to follow your selection.
This same feature is available in all the graphs. So you are able to filter dynamically and have the results presented virtually instantaneously. This is where you can start to have fun and discover things that might not be immediately obvious.
For instance, if we select only the flights that arrived late, we can see a marked skew in the time of day. Does this mean that flights that are delayed will typically be in the late evening?
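
The 'select in one graph, watch the others change' behaviour can be sketched in plain JavaScript as below. This is only an illustration of the idea under some assumptions: the `flights` records and the `countBy` helper are hypothetical, it is not crossfilter's API, and it simply recounts everything from scratch, which would not be fast enough for a data set of the size used in the demo.

```javascript
// Hypothetical flight records (made-up values, not the demo's data set).
const flights = [
  { hour: 7, delay: -5, distance: 300 },
  { hour: 9, delay: 0, distance: 450 },
  { hour: 18, delay: 25, distance: 300 },
  { hour: 21, delay: 40, distance: 1200 },
  { hour: 21, delay: 15, distance: 800 },
  { hour: 8, delay: -2, distance: 650 },
];

// Hypothetical helper: count records per key, where keyOf picks the
// value to group on.
function countBy(records, keyOf) {
  const counts = new Map();
  for (const r of records) {
    const key = keyOf(r);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// The "selection": keep only the flights that arrived late.
const lateFlights = flights.filter(f => f.delay > 0);

// Each view is a grouping of the records that survive the selection.
const allByHour = countBy(flights, f => f.hour);
const lateByHour = countBy(lateFlights, f => f.hour);

// Distance grouped in 50 mile chunks, like the demo's top right graph.
const lateByDistance = countBy(
  lateFlights,
  f => Math.floor(f.distance / 50) * 50
);

console.log("All flights by hour:", allByHour);
console.log("Late flights by hour:", lateByHour);
console.log("Late flights by distance:", lateByDistance);
```

Comparing `allByHour` with `lateByHour` is the code equivalent of dragging the selection on the delay graph and watching the time of day graph redraw.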
So this is why tools like crossfilter are cool. All we need to do now is learn how to make them ourselves :-). But we'll have to wait for the next blog post...

The description above (and heaps of other stuff) is in the D3 Tips and Tricks book that can be downloaded for free (or donate if you really want to :-)).

2 comments:

  1. I think it may be already asked somewhere(I couldn't find on stackoverflow) -

    how to highlight the bars related to the selected row in data-table??

    Replies
    1. The table should update automatically so that only the selected data is visible. However, if you're asking how the mouse hover highlighting works, that can be enabled using bootstrap (only for individual lines) by using '.table-hover' in the class definition at the start of the table (http://getbootstrap.com/2.3.2/base-css.html#tables).
