Recently in the Confluence team we’ve been increasing the amount of code review we do before new code is released in the product. As the resident Confluence code review practice champion, this makes me happy. In fact, in three separate code reviews I was involved with in a single iteration, we prevented three fairly significant defects from making their way into Confluence 2.7 (and yes, one of those was in my code).

Code reviews are great because not only can they prevent defects from shipping to customers, they are also a great learning tool for both author and reviewer. But why should the learning stop there, trapped in an obscure Crucible review or in a conversation that is quickly receding into the thick fog of time past? There are some things I see quite often in code reviews that would be best shared amongst all developers, so that we can get a global reduction in certain problems within our code bases. Sharing the results of code reviews also serves as a meta-review of the reviewer’s findings – after all, reviewers may have their own misconceptions and ill-founded beliefs.

As part of the reviewing process in the Confluence team I plan to put together a regular news post highlighting any common problems or particularly interesting cases turned up in reviews performed within the team. So without any further ado I present you with the inaugural edition of the Monthly Code Smell.

Sets are great filters

I saw this quite often in a code test I administered at my previous employer, and happened to see it again incidentally while reviewing some changes to the JIRA issues macro. Given the task of filtering duplicates out of a collection, much code will go to the effort of iterating over the collection, checking for the existence of each element in another collection and acting accordingly.

This piece of code filtered out any columns that weren’t supported by the macro when the user specified their own list of columns.

private List prepareDisplayColumns(String columns)
{
    if (columns == null || columns.equals(""))
    {
        // No "columns" defined so using the defaults!
        return defaultColumns;
    }
    else
    {
        StringTokenizer tokenizer = new StringTokenizer(columns, ",;");
        List list = new LinkedList();
        while (tokenizer.hasMoreTokens())
        {
            String col = tokenizer.nextToken().toLowerCase().trim();
            if (defaultColumns.contains(col) && !list.contains(col))
                list.add(col);
        }
        if (list.isEmpty())
            return defaultColumns;
        else
            return list;
    }
}

Compare this with the following, functionally equivalent code:

private Set prepareDisplayColumns(String columns)
{
    if (!TextUtils.stringSet(columns))
        return defaultColumns; // An unmodifiable collection of course

    Set columnSet = new LinkedHashSet(Arrays.asList(columns.split(",|;")));
    columnSet.retainAll(defaultColumns);
    return columnSet.isEmpty() ? defaultColumns : Collections.unmodifiableSet(columnSet);
}

A LinkedHashSet was employed because the order in which columns are specified in the delimited string is significant for the table rendering order: a LinkedHashSet guarantees that the iteration order of the set is the same as the insertion order. Passing a collection to the set constructor produces a set of all the unique elements in the original collection, effectively replacing the while loop and condition checking of the original code. The retainAll method on Collection is often overlooked as an excellent way of enforcing membership constraints.

Finally, it’s worth noting that the JavaDoc for the StringTokenizer class discourages its use and recommends String.split instead. It’s hard to find out exactly why Sun weakly discourages its use rather than just flat out deprecating and removing it. It would appear this is because of a few idiosyncrasies in how the class performs tokenisation, which Sun would like to preserve for backwards compatibility. Next time you think StringTokenizer, think String.split instead (or java.util.Scanner if you are using Java 5).
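
To make one of those idiosyncrasies concrete, here is a quick sketch (my own example, not from a review) showing how the two treat empty tokens differently:

// StringTokenizer silently skips the empty token between the commas...
StringTokenizer tokenizer = new StringTokenizer("a,,b", ",");
while (tokenizer.hasMoreTokens())
    System.out.println(tokenizer.nextToken()); // prints "a" then "b"

// ...while String.split preserves it
for (String token : "a,,b".split(","))
    System.out.println("[" + token + "]"); // prints "[a]", "[]", "[b]"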

In Soviet Russia, InputStreams abstract you

InputStream handling is something that is frequently done badly. One of the main reasons for this is that clients are often so abstracted from the actual source of bytes that they can’t really understand the ramifications of certain actions, or inactions, on their part.

Consider the following interface for a flexible resource reading strategy:

public interface ResourceReader
{
    String getName();
    String getContentType();
    Date getLastModificationDate();
    InputStream getStreamForReading();
}

What can I, as a client of this interface, assume about the stream returned by getStreamForReading? Nothing, really. It could be a ByteArrayInputStream, which I can get away with not closing when I’m finished, or it could be a FileInputStream, which will keep an operating system file handle open until I call close().

There are other questions that I, as a consumer of this interface, would like answered:

  1. Whose responsibility is it to close the stream returned?
  2. If I call getStreamForReading multiple times do I get a fresh InputStream ready for reading from the start or do I get the same instance as the first call which may have been read already?
  3. Is it actually legal for me to call getStreamForReading more than once on the same instance of this interface?

The answers to these questions all have ramifications for how code interacts with this interface and the InputStream it returns. Authors of such abstractions should always address these questions when writing any public interface that returns a stream. The answer to question one is almost always that the client should close the stream when it is finished with it, but there are exceptions.
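
The cheapest fix is simply to document the contract. A sketch of how that might look (my hypothetical Javadoc, not the actual Confluence interface):

public interface ResourceReader
{
    /**
     * Returns a stream of the resource's bytes.
     *
     * Each call returns a fresh stream positioned at the start of the
     * resource, and the method may be called any number of times on the
     * same instance. The caller is responsible for closing the returned
     * stream when it is finished with it.
     */
    InputStream getStreamForReading();
}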

It is pretty hard in this case for the interface author to make any assertions about what type of InputStream will be returned by implementations. It might be a fully memory resident stream, a file stream or a network stream.

The point to remember is this: getting an InputStream can be a very expensive operation. In Confluence the expense of getting an InputStream of attachment data is dependent on the attachment storage configuration. When attachments are stored on the file system, getting an InputStream involves opening a FileInputStream onto the attachments disk location, counting as one file handle opened for the process. This is fairly inexpensive by itself (but could still put needless stress on the process file handle limit on a busy instance). However, when attachments are stored in the Confluence database, as is required in clustered instances of Confluence, there is a major expense in acquiring an attachment InputStream.

Streaming an attachment from a database requires holding a database connection, of which there are only a limited number available, for the duration of the streaming operation. The duration of this operation is dependent on two parameters:

  • The size of the attachment
  • The available bandwidth between the client and the database.

Given a large attachment and/or a slow client, the associated database connection could be out of action for a significant amount of time. To mitigate this problem, when a client asks for a database attachment stream, Confluence will first spool the attachment data to the local filesystem and return a FileInputStream onto the attachment data once spooling is complete. However this operation is still a significant investment of time, disk space and I/O effort.
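
A minimal sketch of that spooling idea (my illustration, assuming commons-io’s IOUtils; not Confluence’s actual implementation):

private InputStream spoolToDisk(InputStream databaseStream) throws IOException
{
    File spoolFile = File.createTempFile("attachment-", ".spool");
    spoolFile.deleteOnExit();
    OutputStream out = new FileOutputStream(spoolFile);
    try
    {
        // Copy everything while we still hold the database connection
        IOUtils.copy(databaseStream, out);
    }
    finally
    {
        IOUtils.closeQuietly(out);
        // Closing the database stream is what frees the connection
        IOUtils.closeQuietly(databaseStream);
    }
    // The caller can now stream from local disk at its leisure
    return new FileInputStream(spoolFile);
}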

That’s why the following code was flagged as a defect:

InputStream is = resourceReader.getStreamForReading();
if (LastModifiedHandler.checkRequest(httpServletRequest, httpServletResponse,
    resourceReader.getLastModificationDate()))
{
    IOUtils.closeQuietly(is);
    return null;
}
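
The fix is simply to defer acquiring the stream until we know it is actually needed, something like:

if (LastModifiedHandler.checkRequest(httpServletRequest, httpServletResponse,
    resourceReader.getLastModificationDate()))
{
    // Not modified: we can answer the request without ever opening a stream
    return null;
}
InputStream is = resourceReader.getStreamForReading();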

The lesson: sometimes you really can’t predict what kind of effort goes into constructing a live InputStream so you should only ever obtain one if you are certain you need it.

String.getBytes() is a dirty trap


Here’s a riddle for all the punters: What character encoding is used to convert a string into bytes when I call String.getBytes()? Is it:

  1. UTF-8 because it’s ubiquitous, capable and efficient?
  2. UTF-16 because it’s how Java encodes strings internally?
  3. UCS-2 because that’s how Java originally encoded strings internally?
  4. Something that is entirely dependent on the configuration of the platform that the runtime is on (and isn’t necessarily a Unicode encoding)?
  5. EBCDIC because I said there was a dirty trap and that would certainly be one?

I would anticipate the following set of reactions:

  • #1 because that’s a reasonable contract.
  • #2 because that’s a reasonable contract.
  • #3 because Sun probably wanted to preserve 100% backwards compatibility.
  • It must be #4 because that’s the way any answer list like this would be arranged for dramatic effect and #5 is obviously the obligatory joke answer.
  • #4 because it sounds like a trap.
  • #4 because I looked at the JavaDoc or already knew.
  • #5 because I’m incredibly cynical and/or didn’t realise you were trying to be funny.

It wouldn’t surprise me if a significant proportion of you assumed either UTF-8 or UTF-16; either would arguably be a better contract than what we have at the moment. The sad reality, however, is that the bytes returned by getBytes() depend on the configuration of the host operating system – a surprising result, I think, for an environment that strives to do things as consistently as possible across varied platforms.

This problem is particularly insidious in that, in most cases, everything will work fine, including the unit tests exercising anything that uses this method. Problems will only start to rear their ugly heads when getBytes() is used on a string that contains characters outside of the basic Latin script block.

Let’s have a look at some examples of different encodings of the string “latin script!”:

Encodings of the string “latin script!”

Encoding        Bytes
ISO-8859-1      6c 61 74 69 6e 20 73 63 72 69 70 74 21
windows-1252    6c 61 74 69 6e 20 73 63 72 69 70 74 21
MacRoman        6c 61 74 69 6e 20 73 63 72 69 70 74 21
US-ASCII        6c 61 74 69 6e 20 73 63 72 69 70 74 21
UTF-8           6c 61 74 69 6e 20 73 63 72 69 70 74 21
UTF-16          fe ff 00 6c 00 61 00 74 00 69 00 6e 00 20 00 73 00 63 00 72 00 69 00 70 00 74 00 21

As you can see, this string is encoded to exactly the same bytes by the majority of encoders; the sole exception is UTF-16, which uses two bytes per character (plus a leading fe ff byte order mark). The night that Ken Thompson sat down in a diner and designed the ASCII-compatible UTF-8 encoding on a placemat is legendary in computer science circles, and indeed many other encodings of Latin scripts were also designed to be compatible with ye olde 7-bit ASCII, that incredibly ubiquitous encoding of antiquity. The reason should be obvious: encoding designers still wanted the most commonly used script in software-land to remain compatible with the vast amount of ASCII-understanding software still in use. This is why we see results like these.

Now what happens if we ask these encoders to encode some characters from an extended Latin script, such as those used by just about every non-English European alphabet? For example, if we add diacritics to some of the characters to get the string “látin scrïpt!” and encode it, we get these results:

Encodings of the string “látin scrïpt!”

Encoding        Bytes
ISO-8859-1      6c e1 74 69 6e 20 73 63 72 ef 70 74 21
windows-1252    6c e1 74 69 6e 20 73 63 72 ef 70 74 21
MacRoman        6c 87 74 69 6e 20 73 63 72 95 70 74 21
US-ASCII        6c 3f 74 69 6e 20 73 63 72 3f 70 74 21
UTF-8           6c c3 a1 74 69 6e 20 73 63 72 c3 af 70 74 21

This case is more interesting because characters that are not supported by ASCII are mostly represented differently by the various encodings. When looking at these results you could be tempted into thinking that ISO-8859-1 and windows-1252 are the same encodings – but they aren’t; some characters are in fact represented differently, just none of the ones in this example.

You can also see that the ASCII encoder actually does return an encoding for this string. But you said the string contained characters not supported by ASCII! I did, and it does. The encoder will happily map any character it can’t encode to ASCII code 3f, the quizzical question mark. That’s right: the ASCII encoder will silently mangle your internationalised strings.

It turns out that this behaviour isn’t just limited to the ASCII encoder either. On my machine all of the encoders do this when used via String.getBytes(): all of the non-Unicode encodings will map a Greek sigma to the question mark, for example. Notice that I said on my machine; the truth is that the JavaDoc for the getBytes() method says that the result of passing in a string with unsupported characters is undefined. That leaves the door open for some implementations to return null, throw an unchecked exception or handle it in some other ghastly way. Nice!
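
If you want deterministic behaviour instead, the java.nio.charset API (available since Java 1.4) lets you state explicitly what should happen to unmappable characters. A quick sketch, using the same string as above:

// Unlike String.getBytes(), an encoder set to REPORT (which is in fact the
// default for a freshly created CharsetEncoder) refuses to mangle the string
// and throws a checked CharacterCodingException instead of substituting '?'
CharsetEncoder encoder = Charset.forName("US-ASCII").newEncoder();
encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
ByteBuffer bytes = encoder.encode(CharBuffer.wrap("látin scrïpt!")); // throws!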

Even if the system default encoding supports Unicode fully, you can still run into problems when transporting the encoded bytes from one VM to another, as is commonly done in networking protocols. The most frequently used way of converting an array of bytes to a String is the constructor String(byte[]). This is String.getBytes()’s evil brother, which will also use the platform default encoding to translate the provided bytes into characters. If the two VMs involved in marshalling and unmarshalling the encoded bytes use different default encodings, problems are more than likely to follow. You only have to imagine trying to transport the “bàsic latïn!” string from a VM which uses ISO-8859-1 by default to one that uses MacRoman: the string isn’t going to be the same when it is reconstituted on the other side.

The lesson here is to avoid String.getBytes() and its ugly sibling at all costs. Most of the time, getBytes("UTF-8") and new String(bytes, "UTF-8") are what you are looking for. This introduces the mildly annoying inconvenience of having to catch UnsupportedEncodingException, even though every JRE is required to support UTF-8. Maybe it’s time for the introduction of a PrettyMuchImpossibleException to wrap any exception raised while doing this.
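
For completeness, the safe round trip looks something like this:

try
{
    // Naming the encoding on both sides makes the result stable on any VM
    byte[] bytes = "látin scrïpt!".getBytes("UTF-8");
    String roundTripped = new String(bytes, "UTF-8");
    // roundTripped is guaranteed to equal the original string
}
catch (UnsupportedEncodingException e)
{
    // Pretty much impossible: UTF-8 support is mandatory for all JREs
    throw new RuntimeException(e);
}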

Fin

Phew, that was a fairly long blog post. I hope to do more bite-sized posts in the future; suggestions for future topics and other discussion are welcome!
