For this ShipIt day I chose to implement HTML version diffs for Confluence.

Confluence shows the differences between versions in terms of the markup which produced each version of a page – this isn’t always very clear. Often a paragraph is duplicated, with additions in one copy and deletions in the other:


This looks better when the HTML is diffed. Or rather, it would if the algorithm I chose hadn’t made some unfortunate choices. (I converted Python’s difflib to Java because the only HTML diff I found via Google was based on it. I think I would have been better off using jrcs, the library we use now, even though it seems to be a bit of an orphan.)


The strategy I use is:

  1. Tokenize each version of the HTML into tags and words (so <img width=200 height=100 src="/xxx/foo.jpg"/> is a single token from the diff point of view, but <p>A Paragraph</p> is four: <p>, A, Paragraph and </p>).
  2. Run the diff algorithm and concatenate all the operations it produces – ‘equal’, ‘changed’ (which becomes an insert followed by a delete), ‘added’ and ‘deleted’. Any text tokens get surrounded with a <span> with an appropriate class, and some other tags (like IMG) do too.
  3. Turn the HTML produced by the previous step into a DOM tree and traverse it, marking block-level constructs (like <P> and <TR>) with a class to indicate that part of their contents has changed – that produces the blue lines in the margin.
  4. Replace the <a … /> tags produced in the previous step by Neko with <a …></a>, because the former breaks Safari and IE (at least).
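
Steps 1, 2 and 4 above can be sketched roughly as follows. This is a minimal Python illustration built on the standard difflib module, not the Java conversion written for ShipIt day. The function names, the diff-added and diff-deleted class names, and the WRAPPED_TAGS set are all assumptions made for the example. Step 3 (marking block-level elements in the DOM) is left out, and whitespace between tokens is simply normalised to single spaces.

```python
import difflib
import re

# A token is either a whole tag (attributes included) or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")

# Hypothetical set of tags that get a <span> when added or deleted.
WRAPPED_TAGS = ("img",)


def tokenize(markup):
    """Step 1: split HTML into tag tokens and word tokens."""
    return TOKEN_RE.findall(markup)


def wrap(tokens, css_class):
    """Surround text tokens (and a few chosen tags) with a classed <span>."""
    out = []
    for token in tokens:
        name = re.match(r"</?\s*([A-Za-z0-9]+)", token)
        is_text = not token.startswith("<")
        if is_text or (name and name.group(1).lower() in WRAPPED_TAGS):
            out.append('<span class="%s">%s</span>' % (css_class, token))
        else:
            out.append(token)
    return out


def html_diff(old, new):
    """Step 2: diff the token lists and concatenate the resulting operations."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        # difflib's 'replace' is treated as an insert followed by a delete.
        if tag in ("replace", "insert"):
            out.extend(wrap(b[j1:j2], "diff-added"))
        if tag in ("replace", "delete"):
            out.extend(wrap(a[i1:i2], "diff-deleted"))
    return " ".join(out)


def expand_empty_anchors(markup):
    """Step 4: rewrite self-closed <a ... /> tags as <a ...></a>."""
    return re.sub(r"<a(\s[^<>]*?)?\s*/>", r"<a\1></a>", markup)
```

For example, html_diff('<p>A Paragraph</p>', '<p>A new Paragraph</p>') wraps only the word "new" in a diff-added span and leaves the surrounding tags and words untouched. The regex-based tokenizer will be confused by a ">" inside an attribute value, so a real implementation would want a proper HTML parser.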

There’s still a lot to do:

  1. Try jrcs instead of my Python difflib conversion.
  2. Apply the change anchors more sparingly, just once to each element which justifies a blue marker.
  3. Figure out what to do for lists which have just had indentation changes – these are not handled well at present.
  4. Look at more corner cases – for instance, what about a change which just changes the class of a DIV? How can we show that?
