For this ShipIt day I chose to implement HTML version diffs for Confluence.
Confluence shows the differences between versions as a diff of the markup which produced each version of a page – this isn’t always very clear. Often a paragraph is duplicated, with additions in one copy and deletions in the other:
This looks better when the HTML is diffed. Or rather, it would if the algorithm I chose hadn’t made some unfortunate choices. I converted the Python difflib to Java, because the only HTML diff I found via Google was based on it. I think I would have been better off using jrcs, the library we use now, even though it seems to be a bit of an orphan:
The strategy I use is:
- Tokenize each version of the HTML into tags and words (so <img width=200 height=100 src="/xxx/foo.jpg"/> is a single token from the diff point of view, but <p>A Paragraph</p> is four).
- Run the diff algorithm and concatenate all the operations it produces – ‘equal’, ‘changed’ (which becomes an insert followed by a delete), ‘added’ and ‘deleted’. Any text tokens get surrounded with a <span> with an appropriate class, and some other tags (like IMG) do too.
- Turn the HTML produced by the previous step into a DOM tree and traverse it, marking block-level constructs (like <P> and <TR>) with a class to indicate that part of their contents has changed – that produces the blue lines in the margin.
- Replace the <a … /> tags produced in the previous step by Neko with <a …></a>, because the former breaks Safari and IE (at least).
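To make the first, second and last steps concrete, here is a minimal sketch in Python, using the standard difflib directly rather than my Java conversion of it. The function names (tokenize, html_diff, expand_empty_anchors) and the CSS class names (diff-added, diff-deleted) are made up for illustration and are not what the real code uses. It only wraps text tokens, not IMG tags, it rejoins tokens with single spaces instead of preserving the original whitespace, and it skips the DOM traversal step entirely.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag (attributes included) or a single word.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split HTML into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def wrap(tokens, css_class):
    """Wrap text tokens in a span; tags pass through untouched in this sketch."""
    return [
        t if t.startswith("<") else '<span class="%s">%s</span>' % (css_class, t)
        for t in tokens
    ]


def html_diff(old_html, new_html):
    """Concatenate the diff operations into a single marked-up HTML string."""
    old, new = tokenize(old_html), tokenize(new_html)
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(new[j1:j2])
        # difflib calls a change 'replace'; emit it as an insert then a delete.
        if op in ("replace", "insert"):
            out.extend(wrap(new[j1:j2], "diff-added"))
        if op in ("replace", "delete"):
            out.extend(wrap(old[i1:i2], "diff-deleted"))
    return " ".join(out)


def expand_empty_anchors(html):
    """Rewrite self-closed <a ... /> tags as <a ...></a>."""
    return re.sub(r"<a\b([^>]*?)\s*/>", r"<a\1></a>", html)


if __name__ == "__main__":
    print(html_diff("<p>A Paragraph</p>", "<p>A new Paragraph</p>"))
    print(expand_empty_anchors('<a name="change1"/>'))
```

Treating a whole tag as one token means an attribute change shows up as a changed tag rather than as a jumble of changed words, which is the point of tokenizing this way.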
There’s still a lot to do:
- Try jrcs instead of my Python difflib conversion.
- Apply the change anchors more sparingly, just once to each element which justifies a blue marker.
- Figure out what to do for lists which have just had indentation changes – these are not handled well at present.
- Look at more corner cases – for instance, what about a change which just changes the class of a DIV? How can we show that?