The following post is a portion of the D3 Tips and Tricks document which is free to download. To use this post in context, consider it with the others in the blog or just download the pdf and / or the examples from the downloads page:-)
-------------------------------------------------------
The following is a follow on from the previous posts on generating Sankey diagrams in d3.js and goes over some mechanisms for ingesting data.
From a JSON file with numeric link values
As explained in the previous section, data to form
a Sankey diagram needs to be a combination of nodes and links.
{ "nodes":[ {"node":0,"name":"node0"}, {"node":1,"name":"node1"}, {"node":2,"name":"node2"}, {"node":3,"name":"node3"}, {"node":4,"name":"node4"} ], "links":[ {"source":0,"target":2,"value":2}, {"source":1,"target":2,"value":2}, {"source":1,"target":3,"value":2}, {"source":0,"target":4,"value":2}, {"source":2,"target":3,"value":2}, {"source":2,"target":4,"value":2}, {"source":3,"target":4,"value":4} ]}As we also noted earlier, the `“node”` entries in the `”nodes”` section of the json file are superfluous and are really only there for our benefit since D3 will automatically index the nodes starting at zero. As a test to check this out we can change our data to the following;
{ "nodes":[ {"name":"Barry"}, {"name":"Frodo"}, {"name":"Elvis"}, {"name":"Sarah"}, {"name":"Alice"} ], "links":[ {"source":0,"target":2,"value":2}, {"source":1,"target":2,"value":2}, {"source":1,"target":3,"value":2}, {"source":0,"target":4,"value":2}, {"source":2,"target":3,"value":2}, {"source":2,"target":4,"value":2}, {"source":3,"target":4,"value":4} ]}(for reference this file is saved as sankey-formatted-names-and-numbers.json and the html file is Sankey-formatted-names-and-numbers.html)
This will produce the following graph;
As you can see, essentially the same, but with
easier to understand names.
As you can imagine, while the end result is great,
the creation of the JSON file manually would be painful at best and
doing something similar but with a greater number of nodes / links
would be a nightmare.
So let's see if we can make the process a bit
easier and more flexible.
From a JSON file with links as names
It would make thing much easier if you are
building the data from hand to have nodes with names, and the
'source' and 'target' links have those same name values as
identifiers.
In other words a list of unique names for the
nodes (and perhaps some details) and a list of the links between
those nodes using the names for the nodes.
So, something like this;
{ "nodes":[ {"name":"Barry"}, {"name":"Frodo"}, {"name":"Elvis"}, {"name":"Sarah"}, {"name":"Alice"} ], "links":[ {"source":"Barry","target":"Elvis","value":2}, {"source":"Frodo","target":"Elvis","value":2}, {"source":"Frodo","target":"Sarah","value":2}, {"source":"Barry","target":"Alice","value":2}, {"source":"Elvis","target":"Sarah","value":2}, {"source":"Elvis","target":"Alice","value":2}, {"source":"Sarah","target":"Alice","value":4} ]}Once again, D3 to the rescue!
The little piece of code that can do this for us
is here;
var nodeMap = {}; graph.nodes.forEach(function(x) { nodeMap[x.name] = x; }); graph.links = graph.links.map(function(x) { return { source: nodeMap[x.source], target: nodeMap[x.target], value: x.value }; });This elegant solution comes from here; http://stackoverflow.com/questions/14629853/json-representation-for-d3-networks and was provided by Chris Pettitt (nice job).
So if we sneak this piece of code into here...
d3.json("data/sankey-formatted.json", function(error, graph) { // <= Put the code here. sankey .nodes(graph.nodes) .links(graph.links) .layout(32);… and this time we use our JSON file with just names (sankey-formatted-names.json) and our new html file (sankey-formatted-names.html) we find our Sankey diagram working perfectly!
Looking at our new piece of code...
var nodeMap = {}; graph.nodes.forEach(function(x) { nodeMap[x.name] = x; });… the first thing it does is create an object called `nodeMap` (The difference between an array and an object in JavaScript is one that is still a little blurry to me and judging from online comments, I am not alone).
Then for each of the `graph.node` instances (where
`x` is a range of numbers from 0 to the last node), we assign each
node name to a number.
Then in the next piece of code...
graph.links = graph.links.map(function(x) { return { source: nodeMap[x.source], target: nodeMap[x.target], value: x.value };
… we go through all the links we have and for
each link, we map the appropriate number to the correct name.
Very clever.
From a CSV with 'source', 'target' and 'value' info only
In the first iteration of this section I had no solution to creating a Sankey diagram using a csv file as the source of the data.
But cometh the hour, cometh the man. Enter @timelyportfolio who, while claiming no expertise in D3 or JavaScript was able to demonstrate a [solution](http://bl.ocks.org/timelyportfolio/5052095) to exactly the problem I was facing! Well done Sir! I salute you and name the technique the timelyportfolio csv method!
So here's the cleverness that @timelyportfolio demonstrated;
Using a csv file (in this case called `sankey.csv`) that looks like this;
source,target,value
Barry,Elvis,2
Frodo,Elvis,2
Frodo,Sarah,2
Barry,Alice,2
Elvis,Sarah,2
Elvis,Alice,2
Sarah,Alice,4
We take this single line from our original Sankey diagram code;
And finally...
There you have it. A Sankey diagram from a csv file. Well played @timelyportfolio!
Both the html file for the diagram (`Sankey.formatted-csv.html`) and the data file (`sankey.csv`) can be found in the downloads section of d3noob.org.
But cometh the hour, cometh the man. Enter @timelyportfolio who, while claiming no expertise in D3 or JavaScript was able to demonstrate a [solution](http://bl.ocks.org/timelyportfolio/5052095) to exactly the problem I was facing! Well done Sir! I salute you and name the technique the timelyportfolio csv method!
So here's the cleverness that @timelyportfolio demonstrated;
Using a csv file (in this case called `sankey.csv`) that looks like this;
source,target,value
Barry,Elvis,2
Frodo,Elvis,2
Frodo,Sarah,2
Barry,Alice,2
Elvis,Sarah,2
Elvis,Alice,2
Sarah,Alice,4
We take this single line from our original Sankey diagram code;
d3.json("data/sankey-formatted.json", function(error, graph) {And replace it with the following block;
d3.csv("data/sankey.csv", function(error, data) { //set up graph in same style as original example but empty graph = {"nodes" : [], "links" : []}; data.forEach(function (d) { graph.nodes.push({ "name": d.source }); graph.nodes.push({ "name": d.target }); graph.links.push({ "source": d.source, "target": d.target, "value": +d.value }); }); // return only the distinct / unique nodes graph.nodes = d3.keys(d3.nest() .key(function (d) { return d.name; }) .map(graph.nodes)); // loop through each link replacing the text with its index from node graph.links.forEach(function (d, i) { graph.links[i].source = graph.nodes.indexOf(graph.links[i].source); graph.links[i].target = graph.nodes.indexOf(graph.links[i].target); }); //now loop through each nodes to make nodes an array of objects // rather than an array of strings graph.nodes.forEach(function (d, i) { graph.nodes[i] = { "name": d }; });The comments in the code (and they are fuller in @timelyportfolio's [original gist solution](http://bl.ocks.org/timelyportfolio/5052095)) explain the operation;
d3.csv("data/sankey.csv", function(error, data) {… Loads the csv file from the data directory.
graph = {"nodes" : [], "links" : []};… Declares `graph` to consist of two empty arrays called `nodes` and `links`.
data.forEach(function (d) { graph.nodes.push({ "name": d.source }); graph.nodes.push({ "name": d.target }); graph.links.push({ "source": d.source, "target": d.target, "value": +d.value }); });… Takes the `data` loaded with the csv file and for each row loads variables for the `source` and `target` into the `nodes` array then for each row loads variables for the `source` `target` and `value` into the `links` array.
graph.nodes = d3.keys(d3.nest() .key(function (d) { return d.name; }) .map(graph.nodes));… Is a routine that Mike Bostock described on [Google Groups](https://groups.google.com/forum/#!msg/d3-js/pl297cFtIQk/Eso4q_eBu1IJ) that (as I understand it) nests each node name as a key so that it returns with only unique nodes.
graph.links.forEach(function (d, i) { graph.links[i].source = graph.nodes.indexOf(graph.links[i].source); graph.links[i].target = graph.nodes.indexOf(graph.links[i].target); });… Goes through each `link` entry and for each `source` and `target`, it finds the unique index number of that name in the nodes array and assigns the link source and target an appropriate number.
And finally...
graph.nodes.forEach(function (d, i) { graph.nodes[i] = { "name": d }; });… Goes through each node and (in the words of @timelyportfolio) “*make nodes an array of objects rather than an array of strings*” (I don't really know what that means :-(. I just know it works :-).)
There you have it. A Sankey diagram from a csv file. Well played @timelyportfolio!
Both the html file for the diagram (`Sankey.formatted-csv.html`) and the data file (`sankey.csv`) can be found in the downloads section of d3noob.org.
From MySQL as link information only automatically.
So, here we are. Faced with a dilemma of trying to
get my csv formatted links into a Sankey diagram. In theory we then
need to go through our file, identify all the unique nodes and format
the entire blob into JSON for use.
There must be a better way!
Well, I'm not going to claim that this is any
better since it's a little like cracking a walnut with a
sledgehammer. But to a man with just a sledgehammer, everything’s a
walnut.
So, let's use our newly developed MySQL and PHP
skills to solve our problem. In fact, let's make it slightly harder
for ourselves. Let's imagine that we don't even have a value
associated with our data, just a big line of source and target links.
Something like this;
source,target Barry,Elvis Barry,Elvis Frodo,Elvis Frodo,Elvis Frodo,Sarah Frodo,Sarah Barry,Alice Barry,Alice Elvis,Sarah Elvis,Sarah Elvis,Alice Elvis,Alice Sarah,Alice Sarah,Alice Sarah,Alice Sarah,Alice
First thing first, just as we did in the example
on using MySQL, import your csv file into a MySQL table which we'll
call `sankey1` in database `homedb`.
Now we want to write a query that pulls out all
the DISTINCT names that appear it the 'source' and 'target' columns.
This will form our 'nodes' portion of the JSON data.
SELECT DISTINCT(`source`) AS name FROM `sankey1` UNION SELECT DISTINCT(`target`) AS name FROM `sankey1` GROUP BY nameThis query actually mashes two separate queries together where each returns DISTINCT instances of each `source` and `target` from the source and target columns. By default, the UNION operator eliminates duplicate rows from the result which means we have a list of each node in the table.
Exxxeellennt....... (channelling Mr Burns)
Now we run a separate query that pulls out each
distinct 'source' and 'target' combination and the number of times
(COUNT(*)) that it occurs.
SELECT `source` AS source, `target` as target, COUNT(*) as value FROM `sankey1` GROUP BY source, targetThis query gets all the sources and all the targets and groups them by first the source and then the target. Each line is therefore unique and the `COUNT(*)` sums up the number of times that each unique combination occurs.
That was surprisingly easy wasn't it?
MySQL is good like that for the simple jobs, but
of course we're a long way from finished since at this stage all we
have is what looks like two tables in a spreadsheet.
So now we turn to PHP.
Remember from our previous exposure, we described
PHP as the glue that could connect parts of web pages together. In
this case we will use it to glue our MySQL database to our
JavaScript.
What we need it to do is to carry out our queries
and return the information in a format that d3.js can understand. In
this instance we will select JSON as it's probably the most
ubiquitous and it suits the format of our original manual data.
Let's cut to the chase and look at the code that
we'll use.
<?php $username = "homedbuser"; $password = "homedbuser"; $host = "localhost"; $database="homedb"; $server = mysql_connect($host, $username, $password); $connection = mysql_select_db($database, $server); $myquery = " SELECT DISTINCT(`source`) AS name FROM `sankey1` UNION SELECT DISTINCT(`target`) AS name FROM `sankey1` GROUP BY name "; $query = mysql_query($myquery); if ( ! $myquery ) { echo mysql_error(); die; } $nodes = array(); for ($x = 0; $x < mysql_num_rows($query); $x++) { $nodes[] = mysql_fetch_assoc($query); } $myquery = " SELECT `source` AS source, `target` as target, COUNT(*) as value FROM `sankey1` GROUP BY source, target "; $query = mysql_query($myquery); if ( ! $myquery ) { echo mysql_error(); die; } $links = array(); for ($x = 0; $x < mysql_num_rows($query); $x++) { $links[] = mysql_fetch_assoc($query); } echo "{"; echo '"links": ', json_encode($links), "\n"; echo ',"nodes": ', json_encode($nodes), "\n"; echo "}"; mysql_close($server); ?>Astute readers will recognise that this is very similar to the script that we used to extract data from the MySQL database for generating a simple line graph. If you haven't checked it out, and you're unfamiliar with PHP, you will want to read that section first.
We declare all the appropriate variables that we
will then use to connect to the database, then we connect to the
database and run our query.
After that we store the nodes data in an array
called `$nodes`.
Then we run our second query (we don't close the
connection to the database since we're not finished with it yet).
The second query returns the link results into a
second array called `$links` (pretty imaginative).
Now we come to a part that's a bit different. We
still need to echo out the data in the same way that was required for
our line graph, but in this case we need to add the data together
with the associated `links` and `nodes` identifiers.
echo "{"; echo '"links": ', json_encode($links), "\n"; echo ',"nodes": ', json_encode($nodes), "\n"; echo "}";(if you look closely, the syntax will produce our a JSON formatted output)
So lastly, we need to call this PHP script from
our html file in the same way that we did for the line graph. So
amend the html file to change the loading of the JSON data to be from
our PHP file thusly;
d3.json("php/sankey.php", function(error, graph) {And there you have it! So many ways to get the data.
Both the PHP file (sankey.php) and the html file
(sankey-mysql-import.html) are available in the downloads section on
d3noob.org.
Sankey diagram case study
So armed with all this new found knowledge on
building Sankey diagrams, what can you do?
Well, I suppose it all depends on your data set,
but remember, Sankey diagrams are good at flows, but they won't do
loops / cycles easily (although there has been some good work done in
this direction here and here).
So let's choose a flow.
In this case we'll selected the flow of data that
represents a view of global, anthropogenic greenhouse gas (GHG)
emissions. The diagram is a re-drawing of the excellent diagram on
the World Resources Institute and as
such my version pales in comparison to theirs.
However, the aim is to play with the technique,
not to emulate :-).
So starting with the data presented in the
original diagram, we have to capture the links into a csv file. I did
this the hard way (since there didn't appear to be an electronic
version of the data) by reading the graph and entering the figures
into a csv file. From here we import it into our MySQL database and
then convert it into sankey formatted JSON by using our PHP script
that we played with in the example of extracting information from a
MySQL database. In this case instead of needing to perform a
`COUNT(*)` on the data, it's slightly easier since the value is
already present.
Then, because we want this diagram to be hosted on
Gist and accessible on bl.ocks.org, we run the PHP file directly into
the browser so that it just shows the JSON data on the screen. We
then save this file with the suffix `.json` and we have our data (in
this case the file is named `sankeygreenhouse.json`).
Then we amend our html file to look at our new
`.json` file and voila!
Sankeytastic!
You can find this as a live example and with all
the code and data on bl.ocks.org.
The above description (and heaps of other stuff) is in the D3 Tips and Tricks document that can be accessed from the downloads page of d3noob.org (Hey! It's free. Why not?)