Wednesday, 20 February 2013

Sankey Diagrams: A Description of the d3.js Code


The following post is a portion of the D3 Tips and Tricks document which is free to download. To use this post in context, consider it with the others in the blog or just download the pdf  and / or the examples from the downloads page:-)
-------------------------------------------------------

Description of the code

The code for the Sankey diagram is significantly different to that for a line graph although it shares the same core language and programming methodology.

The code we’ll go through is an adaptation of the version first demonstrated by Mike Bostock so it’s got a pretty good pedigree. I will begin with a version that uses data that is formatted so that it can be used directly with no manipulation, then in subsequent sections I will describe different techniques for getting data from different formats to work.

I found that getting data into the correct format was the biggest hurdle in getting a Sankey diagram to work, and I assume this may be a similar story for others. We will start off assuming that the data is perfectly formatted, then look at the case where only the link data is available, then the case where there are just names to work with (no numeric node values) and lastly, one that can be used by people with changeable data from a MySQL database.

I won’t try to go over every inch of the code as I did with the previous simple graph example (I’ll skip things like the HTML header) and will focus on the style sheet (CSS) portion and the JavaScript.
The complete code for this will also be available as an appendix and in the downloads section at d3noob.org.

On to the code…
<style>
.node rect {
  cursor: move;
  fill-opacity: .9;
  shape-rendering: crispEdges;
}
.node text {
  pointer-events: none;
  text-shadow: 0 1px 0 #fff;
}
.link {
  fill: none;
  stroke: #000;
  stroke-opacity: .2;
}
.link:hover {
  stroke-opacity: .5;
}
</style>

<body>
<p id="chart">
<script type="text/javascript" src="d3/d3.v3.js"></script>
<script src="js/sankey.js"></script>
<script>

var units = "Widgets";

var margin = {top: 10, right: 10, bottom: 10, left: 10},
    width = 700 - margin.left - margin.right,
    height = 300 - margin.top - margin.bottom;

var formatNumber = d3.format(",.0f"),    // zero decimal places
    format = function(d) { return formatNumber(d) + " " + units; },
    color = d3.scale.category20();

// append the svg canvas to the page
var svg = d3.select("#chart").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", 
          "translate(" + margin.left + "," + margin.top + ")");

// Set the sankey diagram properties
var sankey = d3.sankey()
    .nodeWidth(36)
    .nodePadding(40)
    .size([width, height]);

var path = sankey.link();

// load the data
d3.json("data/sankey-formatted.json", function(error, graph) {

  sankey
      .nodes(graph.nodes)
      .links(graph.links)
      .layout(32);

// add in the links
  var link = svg.append("g").selectAll(".link")
      .data(graph.links)
    .enter().append("path")
      .attr("class", "link")
      .attr("d", path)
      .style("stroke-width", function(d) { return Math.max(1, d.dy); })
      .sort(function(a, b) { return b.dy - a.dy; });

// add the link titles
  link.append("title")
        .text(function(d) {
            return d.source.name + " → " + 
                d.target.name + "\n" + format(d.value); });

// add in the nodes
  var node = svg.append("g").selectAll(".node")
      .data(graph.nodes)
    .enter().append("g")
      .attr("class", "node")
      .attr("transform", function(d) { 
          return "translate(" + d.x + "," + d.y + ")"; })
    .call(d3.behavior.drag()
      .origin(function(d) { return d; })
      .on("dragstart", function() { 
          this.parentNode.appendChild(this); })
      .on("drag", dragmove));

// add the rectangles for the nodes
  node.append("rect")
      .attr("height", function(d) { return d.dy; })
      .attr("width", sankey.nodeWidth())
      .style("fill", function(d) { 
          return d.color = color(d.name.replace(/ .*/, "")); })
      .style("stroke", function(d) { 
          return d3.rgb(d.color).darker(2); })
    .append("title")
      .text(function(d) { 
          return d.name + "\n" + format(d.value); });

// add in the title for the nodes
  node.append("text")
      .attr("x", -6)
      .attr("y", function(d) { return d.dy / 2; })
      .attr("dy", ".35em")
      .attr("text-anchor", "end")
      .attr("transform", null)
      .text(function(d) { return d.name; })
    .filter(function(d) { return d.x < width / 2; })
      .attr("x", 6 + sankey.nodeWidth())
      .attr("text-anchor", "start");

// the function for moving the nodes
  function dragmove(d) {
    d3.select(this).attr("transform", 
        "translate(" + (
            d.x = Math.max(0, Math.min(width - d.dx, d3.event.x))
        )
        + "," + (
            d.y = Math.max(0, Math.min(height - d.dy, d3.event.y))
        ) + ")");
    sankey.relayout();
    link.attr("d", path);
  }
});
So, going straight to the style sheet bounded by the <style> tags;
.node rect {
  cursor: move;
  fill-opacity: .9;
  shape-rendering: crispEdges;
}

.node text {
  pointer-events: none;
  text-shadow: 0 1px 0 #fff;
}

.link {
  fill: none;
  stroke: #000;
  stroke-opacity: .2;
}

.link:hover {
  stroke-opacity: .5;
}
The CSS in this example is mainly concerned with formatting of the mouse cursor as it moves around the diagram.

The first part…
.node rect {
  cursor: move;
  fill-opacity: .9;
  shape-rendering: crispEdges;
}
… provides the properties for the node rectangles. It changes the icon for the cursor when it moves over the rectangle to one that looks like it will move the rectangle (there is a range of different icons that can be defined here http://www.echoecho.com/csscursors.htm), sets the fill colour to mostly opaque and keeps the edges sharp.

The next block…
.node text {
  pointer-events: none;
  text-shadow: 0 1px 0 #fff;
}
… sets the properties for the text at each node. The mouse is told to essentially ignore the text in favour of anything that’s under it (in the case of moving or highlighting something else) and a slight shadow is applied for readability.

The following block…
.link {
  fill: none;
  stroke: #000;
  stroke-opacity: .2;
}
… makes sure that the link has no fill (it actually appears to be a bendy rectangle with very thick edges that make the element appear to be a solid block), colours the edges black (#000) and makes the edges almost transparent.

The last block….
.link:hover {
  stroke-opacity: .5;
}
… simply changes the opacity of the link when the mouse goes over it so that it’s more visible. If so desired, we could change the colour of the highlighted link by adding in a line to this block changing the colour like this stroke: red;.

Just before we get into the JavaScript, we do something a little different for d3.js. We tell it to use a plug-in with the following line;

<script src="js/sankey.js"></script>

The concept of a plug-in is that it is a separate piece of code that adds functionality to a core block (which in this case is d3.js). There is a range of plug-ins available, and we will need to source the sankey.js file from the repository and place it somewhere our HTML code can access it. In this case I have put it in the js directory that resides in the root directory of the web page.


The start of our JavaScript begins by defining a range of variables that we’ll be using. 

Our units are set as ‘Widgets’ (var units = "Widgets";), which is just a convenient generic (nonsense) term to provide the impression that the flow of items in this case is widgets being passed from one person to another.
We then set our canvas size and margins…
var margin = {top: 10, right: 10, bottom: 10, left: 10},
    width = 700 - margin.left - margin.right,
    height = 300 - margin.top - margin.bottom;
… before setting some formatting.
var formatNumber = d3.format(",.0f"),    // decimal places
    format = function(d) { return formatNumber(d) + " " + units; },
    color = d3.scale.category20();
The formatNumber function acts on a number to set it to zero decimal places in this case. In the original Mike Bostock example it was to three places, but for ‘widgets’ I’m presuming we don’t divide :-).

format is a function that returns a given number formatted with formatNumber as well as a space and our units of choice (‘Widgets’). This is used to display the values for the links and nodes later in the script.
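To see what those two helpers produce without loading d3, here is a minimal sketch in plain JavaScript. The regular-expression version of formatNumber is my own stand-in for d3.format(",.0f") (an assumption for illustration, not d3's implementation); it fixes the number to zero decimal places and adds commas as thousands separators.

```javascript
// Stand-in for d3.format(",.0f"): zero decimal places, comma-grouped.
// This is an illustrative substitute, not the code d3 uses internally.
var units = "Widgets";

function formatNumber(value) {
  // toFixed(0) drops the decimals; the regex inserts a comma
  // before every group of three digits counted from the right.
  return value.toFixed(0).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

// Same shape as the format function in the post: number, space, units.
function format(value) {
  return formatNumber(value) + " " + units;
}

console.log(format(2));       // "2 Widgets"
console.log(format(1234.4));  // "1,234 Widgets"
```

In the real script the d3 version does the same job and is what ends up in the tool tips for the links and nodes.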
The color = d3.scale.category20(); line is really interesting and provides access to a colour scale that is pre-defined for your convenience! Later in the code we will see it in action.

Our next block sites our canvas onto our page in relation to the size and margins we have already defined;
var svg = d3.select("#chart").append("svg")
    .attr("width", width + margin.left + margin.right)
    .attr("height", height + margin.top + margin.bottom)
  .append("g")
    .attr("transform", 
          "translate(" + margin.left + "," + margin.top + ")");
Then we set the variables for our Sankey diagram;
var sankey = d3.sankey()
    .nodeWidth(36)
    .nodePadding(40)
    .size([width, height]);
Without trying to state the obvious, this sets the width of the nodes (.nodeWidth(36)), the padding between the nodes (.nodePadding(40)) and the size of the diagram (.size([width, height]);).
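It can help to see the actual numbers behind width and height. The sketch below just repeats the margin arithmetic from the script; the names svgWidth and svgHeight are mine (hypothetical, for illustration) and stand for the values given to the svg element's width and height attributes.

```javascript
// The margin arithmetic from the script, worked through with its own numbers.
var margin = {top: 10, right: 10, bottom: 10, left: 10},
    width = 700 - margin.left - margin.right,    // inner drawing width
    height = 300 - margin.top - margin.bottom;   // inner drawing height

// Hypothetical names for the full size of the svg element,
// i.e. the inner size with the margins added back on.
var svgWidth = width + margin.left + margin.right,
    svgHeight = height + margin.top + margin.bottom;

console.log(width, height);        // 680 280
console.log(svgWidth, svgHeight);  // 700 300
```

So the Sankey layout is given a 680 by 280 area to work in, sitting inside a 700 by 300 svg element.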

The following line defines the path variable as a pointer to the sankey function that makes the links between the nodes do their clever thing of bending into the right places;
var path = sankey.link();
I make the presumption that this is a defined function within sankey.js. Then we load the data for our sankey diagram with the following line;
d3.json("data/sankey-formatted.json", function(error, graph) {
As we have seen in previous usage of the d3.json, d3.csv and d3.tsv functions, this is a wrapper that acts on all the code within it, bringing the data in the form of graph to the remaining code.

I think it’s a good time to take a slightly closer look at the data that we’ll be using;
{
"nodes":[
{"node":0,"name":"node0"},
{"node":1,"name":"node1"},
{"node":2,"name":"node2"},
{"node":3,"name":"node3"},
{"node":4,"name":"node4"}
],
"links":[
{"source":0,"target":2,"value":2},
{"source":1,"target":2,"value":2},
{"source":1,"target":3,"value":2},
{"source":0,"target":4,"value":2},
{"source":2,"target":3,"value":2},
{"source":2,"target":4,"value":2},
{"source":3,"target":4,"value":4}
]}
I want to look at the data now, because it highlights how it is accessed throughout this portion of the code. It is split into two different blocks, ‘nodes’ and ‘links’. The subset of variables available under ‘nodes’ is ‘node’ and ‘name’. Likewise under ‘links’ we have ‘source’, ‘target’ and ‘value’. This means that when we want to act on a subset of our data we define which piece by defining the hierarchy that leads to it. For instance, if we want to define an action onto all the links, we would use graph.links (they’re kind of chained together).
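One thing worth spelling out is that the ‘source’ and ‘target’ numbers are positions in the ‘nodes’ list. As far as I can tell, sankey.js swaps those numbers for the node objects themselves, which is why later code can ask for d.source.name. The sketch below is a simplified illustration of that lookup using the same data; the resolveLinks function is a hypothetical helper of mine, not part of sankey.js.

```javascript
// The sample data from the post.
var graph = {
  nodes: [
    {node: 0, name: "node0"},
    {node: 1, name: "node1"},
    {node: 2, name: "node2"},
    {node: 3, name: "node3"},
    {node: 4, name: "node4"}
  ],
  links: [
    {source: 0, target: 2, value: 2},
    {source: 1, target: 2, value: 2},
    {source: 1, target: 3, value: 2},
    {source: 0, target: 4, value: 2},
    {source: 2, target: 3, value: 2},
    {source: 2, target: 4, value: 2},
    {source: 3, target: 4, value: 4}
  ]
};

// Hypothetical helper: build new link objects whose source and target
// are the node objects found at those positions in graph.nodes.
function resolveLinks(graph) {
  return graph.links.map(function(link) {
    return {
      source: graph.nodes[link.source],
      target: graph.nodes[link.target],
      value: link.value
    };
  });
}

var resolved = resolveLinks(graph);
console.log(resolved[0].source.name + " → " + resolved[0].target.name);
// node0 → node2
```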

Let me take this opportunity to apologise to all those programmers who actually know exactly what is going on here. It’s a mystery to me, but this is how I like to tell myself it works to help me get by :-)

So, now that we have our data loaded, we can assign the data to the sankey function so that it knows how to deal with it behind the scenes;
  sankey
      .nodes(graph.nodes)
      .links(graph.links)
      .layout(32);
In keeping with our previous description of what’s going on with the data, we have told the sankey function that the nodes it will be dealing with are in graph.nodes of our data structure.

I’m not sure what the .layout(32); portion of the code does, but I’d be interested to know from any more knowledgeable readers. I’ve tried changing the value to no apparent effect and googling has drawn a blank. Internally, the sankey.js file seems to treat it as ‘iterations’ while it runs computeNodeLinks, computeNodeValues, computeNodeBreadths, computeNodeDepths(iterations) and computeLinkDepths.
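Of those steps, computeNodeValues is the easiest to picture. My reading of sankey.js (an assumption on my part, so treat it as a sketch rather than the plug-in's code) is that each node's value is the larger of its total incoming and total outgoing link values. The nodeValues function below is a hypothetical stand-in that works directly on the index-based links from our data file.

```javascript
// The links from the sample data (source and target are node indices).
var links = [
  {source: 0, target: 2, value: 2},
  {source: 1, target: 2, value: 2},
  {source: 1, target: 3, value: 2},
  {source: 0, target: 4, value: 2},
  {source: 2, target: 3, value: 2},
  {source: 2, target: 4, value: 2},
  {source: 3, target: 4, value: 4}
];

// Hypothetical stand-in for the idea behind computeNodeValues:
// a node is as big as the larger of what flows in and what flows out.
function nodeValues(nodeCount, links) {
  var incoming = [], outgoing = [], i;
  for (i = 0; i < nodeCount; i++) {
    incoming.push(0);
    outgoing.push(0);
  }
  links.forEach(function(link) {
    outgoing[link.source] += link.value;
    incoming[link.target] += link.value;
  });
  return incoming.map(function(total, index) {
    return Math.max(total, outgoing[index]);
  });
}

console.log(nodeValues(5, links));  // [ 4, 4, 4, 4, 8 ]
```

This is why the data file needs no numeric values on the nodes: they can be worked out from the links.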

Then we add our links to the diagram with the following block of code;
  var link = svg.append("g").selectAll(".link")
      .data(graph.links)
    .enter().append("path")
      .attr("class", "link")
      .attr("d", path)
      .style("stroke-width", function(d) { return Math.max(1, d.dy); })
      .sort(function(a, b) { return b.dy - a.dy; });
This is analogous to the block of code we examined back when we explained the code for our first simple graph.

We append svg elements for our links based on the data in graph.links, then add in the paths (using the appropriate CSS). We set the stroke width to the width associated with each link’s value (d.dy) or ‘1’, whichever is the larger (by virtue of the Math.max function). As an interesting sideline, if we force this value to ‘10’ thusly…
      .style("stroke-width", 10)
… the graph looks quite interesting.

I have to admit that I don’t know what the sort line (.sort(function(a, b) { return b.dy - a.dy; });) is supposed to achieve. Again, I’d be interested to know from any more knowledgeable readers. I’ve tried changing the values to no apparent effect.
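One way to at least see what the comparator does is to try it on a plain array. The function b.dy - a.dy orders items from thickest to thinnest, and since later svg elements are painted over earlier ones, that ordering would leave the thin links on top where the mouse can reach them. The sketch below uses Array sort rather than d3's selection sort, and the three links with their dy values are made up for illustration.

```javascript
// Hypothetical links with made-up thicknesses (dy).
var links = [
  {name: "thin", dy: 4},
  {name: "thick", dy: 40},
  {name: "medium", dy: 12}
];

// The same comparator as the script: larger dy sorts first.
var drawOrder = links.slice().sort(function(a, b) { return b.dy - a.dy; });

console.log(drawOrder.map(function(d) { return d.name; }));
// [ 'thick', 'medium', 'thin' ]
```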

The next block adds the titles to the links;
  link.append("title")
        .text(function(d) {
                return d.source.name + " → " + 
                        d.target.name + "\n" + format(d.value); });
This code appends a title element to each link, which shows up as a tool tip when the link is moused over. It contains the source and target names (with a neat little arrow in between) and the value (which, when passed through the format function, gains the units).

The next block appends the node objects (but not the rectangles or text) and contains the instructions to allow them to be arranged with the mouse.
  var node = svg.append("g").selectAll(".node")
      .data(graph.nodes)
    .enter().append("g")
      .attr("class", "node")
      .attr("transform", function(d) { 
          return "translate(" + d.x + "," + d.y + ")"; })
    .call(d3.behavior.drag()
      .origin(function(d) { return d; })
      .on("dragstart", function() { 
          this.parentNode.appendChild(this); })
      .on("drag", dragmove));
While it starts off in familiar territory with appending the node objects using the graph.nodes data and putting them in the appropriate place with the transform attribute, I can only assume that there is some trickery going on behind the scenes to make sure the mouse can do what it needs to do with the d3.behavior.drag function. There is some excellent documentation on the wiki (https://github.com/mbostock/d3/wiki/Drag-behavior), but I can only presume that it knows what it’s doing :-). The dragmove function is laid out at the end of the code, and we will explain how that operates later.

I really enjoyed the next block;
  node.append("rect")
      .attr("height", function(d) { return d.dy; })
      .attr("width", sankey.nodeWidth())
      .style("fill", function(d) { 
          return d.color = color(d.name.replace(/ .*/, "")); })
      .style("stroke", function(d) { 
          return d3.rgb(d.color).darker(2); })
    .append("title")
      .text(function(d) { 
          return d.name + "\n" + format(d.value); });
It starts off with a fairly standard appending of a rectangle with a height generated by its value { return d.dy; } and a width taken from the node width we set earlier (.attr("width", sankey.nodeWidth())).
Then it gets interesting.

The colours are assigned in accordance with our earlier colour declaration and the individual colours are added to the nodes by finding the first part of the name for each node and assigning it a colour from the palette (the script looks for the first space in the name using a regular expression). For instance: ‘Widget X’, ‘Widget Y’ and ‘Widget’ will all be coloured the same even if the ‘Widget X’ and ‘Widget Y’ are inputs on the left and ‘Widget’ is a node in the middle.
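The regular expression is easy to try on its own. In the sketch below, colorKey repeats the replace call from the script, while the color function is a small hypothetical stand-in for the d3 colour scale (it hands out colours from a short made-up palette in the order new keys turn up, which is my understanding of how an ordinal scale behaves).

```javascript
// Hypothetical short palette; d3.scale.category20() supplies twenty colours.
var palette = ["#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78"];
var assigned = {};   // remembers which colour each key was given
var nextIndex = 0;

// Same regular expression as the script: drop the first space
// and everything after it, leaving the first word of the name.
function colorKey(name) {
  return name.replace(/ .*/, "");
}

// Stand-in for the colour scale: a new key gets the next palette colour,
// a key seen before gets the colour it was given the first time.
function color(key) {
  if (!assigned.hasOwnProperty(key)) {
    assigned[key] = palette[nextIndex % palette.length];
    nextIndex += 1;
  }
  return assigned[key];
}

console.log(colorKey("Widget X"));  // "Widget"
console.log(color(colorKey("Widget X")) === color(colorKey("Widget")));  // true
```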

The stroke around the outside of the rectangle is then drawn in the same shade, but darker. Then we return to the basics where we add the title of the node in a tool tip type effect along with the value for the node.

Then we add the titles for the nodes;
   node.append("text")
      .attr("x", -6)
      .attr("y", function(d) { return d.dy / 2; })
      .attr("dy", ".35em")
      .attr("text-anchor", "end")
      .attr("transform", null)
      .text(function(d) { return d.name; })
    .filter(function(d) { return d.x < width / 2; })
      .attr("x", 6 + sankey.nodeWidth())
      .attr("text-anchor", "start");
Again, this looks pretty familiar. We position the text titles carefully to the left of the nodes, all except for those affected by the filter function (return d.x < width / 2;). There, if the position of the node on the x axis is less than half the width, the title is placed on the right of the node and anchored at the start of the text. Very neat.
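The left or right decision can be pulled out into a little function to make it easier to follow. This is only a sketch of the logic: labelPlacement is a hypothetical helper of mine, and the 680 and 36 in the examples are the inner width and node width from this post.

```javascript
// Hypothetical helper mirroring the filter logic in the script.
// d.x is the node's left edge; width is the inner drawing width.
function labelPlacement(d, width, nodeWidth) {
  if (d.x < width / 2) {
    // Node sits in the left half: put the label to its right.
    return {x: 6 + nodeWidth, anchor: "start"};
  }
  // Otherwise keep the default: label to the left, anchored at its end.
  return {x: -6, anchor: "end"};
}

console.log(labelPlacement({x: 0}, 680, 36));    // { x: 42, anchor: 'start' }
console.log(labelPlacement({x: 644}, 680, 36));  // { x: -6, anchor: 'end' }
```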

The last block is also pretty neat, and contains a little surprise for those who are so inclined.
  function dragmove(d) {
    d3.select(this).attr("transform", 
       "translate(" + d.x + "," + (
                d.y = Math.max(0, Math.min(height - d.dy, d3.event.y))
            ) + ")");
    sankey.relayout();
    link.attr("d", path);
  }
This declares the function that controls the movement of the nodes with the mouse. It selects the item that it’s operating over (d3.select(this)) and then allows translation in the y axis while maintaining the link connection (sankey.relayout(); link.attr("d", path);).

But that’s not the cool part. A quick look at the code should reveal that if you can move a node in the y axis, there should be no reason why you can’t move it in the x axis as well!

Sure enough, if you replace the code above with this…
  function dragmove(d) {
    d3.select(this).attr("transform", 
        "translate(" + (
            d.x = Math.max(0, Math.min(width - d.dx, d3.event.x))
        )
        + "," + (
            d.y = Math.max(0, Math.min(height - d.dy, d3.event.y))
        ) + ")");
    sankey.relayout();
    link.attr("d", path);
  }
… you can move your nodes anywhere on the canvas.
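The Math.max and Math.min pairing in dragmove is what stops a node being dragged off the canvas. Here is a minimal sketch of that clamping on its own; clamp is a hypothetical helper name, 280 is the inner height from this post and the node height of 50 is made up for illustration.

```javascript
// Hypothetical helper: keep a node of the given size inside 0..extent.
// This is the same Math.max(0, Math.min(extent - size, position))
// expression that dragmove applies to d.x and d.y.
function clamp(position, size, extent) {
  return Math.max(0, Math.min(extent - size, position));
}

console.log(clamp(-20, 50, 280));  // 0   (dragged above the top edge)
console.log(clamp(100, 50, 280));  // 100 (inside the canvas, left alone)
console.log(clamp(400, 50, 280));  // 230 (dragged below the bottom edge)
```

The same function would serve for the x axis by passing the node's width (d.dx) and the canvas width.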

I know it doesn’t seem to add anything to the diagram (in fact, it could be argued that there is a certain aspect of detraction) however, it doesn’t mean that one day the idea doesn’t come in handy :-). You can find a live version of this on Github via bl.ocks.org.

So, that’s the description for our basic Sankey diagram. From here we will look at different ways to get data formatted for use in them.


The above description (and heaps of other stuff) is in the D3 Tips and Tricks document that can be accessed from the downloads page of d3noob.org (Hey! It's free. Why not?)

25 comments:

  1. Hi, stumbled upon your work here and think it's great to be putting this kind of stuff out there! New to D3, but as a long-time programmer, I thought I'd provide answers for the two questions you posed:

    1) the layout function lets you set how many 'passes' are performed by an algorithm trying to optimally place the nodes so they don't overlap. The higher the number, the better the placement -- but the longer it takes to run.

    2) the sort function looks like it is choosing the order in which the pieces are drawn on the screen -- note that the mouse only highlights one element at a time when things are overlapping.

    Thanks for the blog/book! I'll try to get my company to pay for a copy for us :)
    -scott southworth

    Replies
    1. Wow! Thanks for the info Scott. That's really good to know. I hope you enjoy D3. It certainly seems to be rising in popularity!

  2. This comment has been removed by the author.

  3. Hi,
    It is nice work.Thank you for the article.
    I want to use sankey for generating visitors flow diagram as in google analytics.

    can you please help me how to go about that -- using 1 to many and repetitions

    Thanks
    Madhu
    (madhusudan.k70@gmail.com)

    Replies
    1. Thank you for your kind comments. What you're trying to achieve would be a body of work that would consume more time than I have available I'm afraid. However, what I can recommend is that you use the D3 Google Groups forum and the StackOverflow Q and A facility to work through the process. Start by creating a Github account and getting your code up there and this will assist others to troubleshoot for you where you have problems. Sorry I couldn't be of more help, but my time is extremely limited and I know that the community has a wealth of people more experienced than me that can assist. Good luck and I look forward to seeing your results.

  4. Hey! Great article!

    I've been trying to create a sankey diagram of some travel patterns, and every time I run it I get an error telling me that 'nodes' is undefined (despite the fact that my json is defining them). Thoughts?

    Replies
    1. Possibly even though your json might be correctly formatted, if your code isn't ingesting it and assigning it correctly you might have problems. If you get desperate post your code and a sample of the json onto stack overflow as a question. But as a first step, pare your json and your code down to a minimum to test or start with a known good example and then start integrating your json data and adapt from there. Good luck and let us know how you get on.

  5. Thanks for laying all these examples out - I'm new to D3 and not a programmer, so it definitely helps to have your explorations!

    A question on the sankey diagram... is it fairly easy to disable the mouse drag?

    Replies
    1. Glad you're having a go at d3!
      Disabling the mouse drag should be as easy as commenting out (or just removing) the section that calls the 'dragmove' function
      .on("dragstart", function() {
      this.parentNode.appendChild(this); })
      .on("drag", dragmove))
      Get rid of that and see what happens. First and foremost though, have a play with the code! Make a tiny change to a part and see what happens. learn from that and then try something else. You'll be amazed what you learn!

    2. oh, my head is hurting from all the learning! But yes, it's fantastic (when it works)!

      One more question ... I'm assuming that I can add additional information to both nodes and links - the data set your working with just has source, target, and value for links. Can I have additional key:value pairs in there, or will that break sankey? (I'm asking because sankey is choking on something and trying to consider what is different from my data and the example.

    3. :-). I know what you mean!

      Yes, you can certainly add more information to the nodes and links. They could be used for tooltip info or perhaps now that I think about it, colours or opacity. (Hmmm.... I should save that thought for later).
      The json certainly does need to be correctly formed. The way I try to solve these problems when they occur is to pare the data down to as minimal an amount as possible (perhaps just a couple of nodes) and then confirm whether it's the code or the data. Try that (or using a known good data set.)

  6. Hi,
    I'm new to D3, D3 i think is brilliant in putting across the data in a very effective visualization. And Noob thank you for breaking down the each script into chunks. This actually helped me a lot in understand what each script is doing

    Here i'm directly using a CSV data to generate the customer flow, what i'm trying to do is, add a hyperLink to the Nodes. For some reason, this is not working with the csv data. How ever i'm able to include a Hyperlink to the 'Links' between the nodes and it works fine

    Can some one help me out with this, how to add a hyperlink to the nodes of the sankey diagram

    Dataset schema
    [ source, target , value , linkurl , Nodeurl* ]

    Replies
    1. Cool idea with the hyperlinks on the diagram. I would imagine that if you had the hyperlinks working for the links, they should be working for the nodes as well. Could post your question and code onto Stack Overflow (http://stackoverflow.com/questions/tagged/d3.js) so that I (and Others) can have a better look at what is going on? (Putting info into the comments in the blog doesn't work too well).
      Cheers

  7. this is awesome! I will look at this closer … I tried to generate a JSON file from scratch, but working with IDs makes my head explode. With this csv file, I can use excel that can make sure the data is not duplicate.

    I will try to change the sankey behaviour to do »auto« links. I need to define a link to collect all the input and deliver it to the target, similar to what timelyportfolio did here in r: http://timelyportfolio.github.io/rCharts_d3_sankey/example_build_network_sankey.html

    I don't know how I can do that, but I'll try to go through your code and hopefully understand the sankey logic better to get this done. will post when I know more!

  8. yeah, it took me *ages* to figure out where to manipulate the code and a lot of console.logs later, I figured out I only need to change sankey's computeNodeValues() and got this:
    http://bl.ocks.org/frischmilch/7667996

    Thank you for this great resource as I am a d3 noob too :)

  9. Thank you for posting this! I've been looking for Sankey tutorial for a long time :)

    One question. After I created a separated JSON file with my data, and put it in the same folder as my sankey.html, I always got error message saying ''XMLHttpRequest cannot load file:///blah/blah/sankey.json. Cross origin requests are only supported for HTTP. "

    I don't know what's wrong with my code, so I googled it and some people said I need to build server and put my JSON file there because d3.json() only takes http format. I am not sure if it is the right way to do.

    http://stackoverflow.com/questions/10752055/cross-origin-requests-are-only-supported-for-http-error-but-im-loading-a-co

    Thanks for helping!

    Replies
    1. I'm glad you're finding it useful.
      Unfortunately you have the reason for your problem correctly identified. Your browser won't allow you to load the file because of Cross Domain Restrictions. You have a few options.
      1. You could try Firefox. I have used some versions that permit browsing local files and loading supplementary ones.
      2. You could use XAMPP or WAMPSERVER to use your normal desktop as a web server (this is the option I use) or...
      3. You could use the advice on the d3.js site and use the python simple server (https://github.com/mbostock/d3/wiki#installing) (I've actually never tried this)

  10. Great stuff. Thanks very much for taking the time and trouble to share this with us. Much appreciated.

    Replies
    1. You're welcome. I hope you're enjoying d3 :-)

  11. 2 axis movement http://jsfiddle.net/csaladenes/jqRRY/

  12. Nice work.. Could you please help me how to use rest call in creating sankey diagram

    Replies
    1. Sorry, but I have never done any work in that area before and I wouldn't have the time to bring myself up to speed so that I could help. I would recommend that you make good use of Google groups and stack overflow and see what you can discover. Good luck

    2. could you please share any example for single sankey diagram with multiple json and each link vary with different color..

    3. I don't have one I'm afraid.
      This would make a good question on Stack Overflow though.
