I have written before about the need for a standard binary compression technique for XML. It's an issue that divides the XML community somewhat. Some very well known names in the XML world, people I get along with very well socially, are strongly and publicly against the use of binary encodings for XML. To be honest, I think that they are wrong, albeit, in some cases, for the right reasons.
As I've mentioned from time to time, there is no such thing as best practices. There are only practices which are good practices in a particular context, in a particular set of circumstances. The problem I find, when people argue against binary encodings for XML, is that they are typically espousing “best practices”, without giving sufficient thought to what the appropriate contexts are.
So, let me start by putting things in context. The fact that XML's default encoding is text (in some character encoding) is a good thing in many contexts. It means that you can look at the contents of an XML file with just a simple text editor, and sometimes that is enough to be able to understand the contents of the file. You don't get that same convenience with binary encodings. Also, some people are concerned (and with good reason) about document archiving, where you need to be able to access the contents of electronic documents decades or centuries into the future. For that context, textual encoding makes sense, as binary encodings require you to have the software to decode them, and to have the operating system that runs the decoding software, and to have the hardware that runs the operating system, etc. You get the point. These are good reasons for encoding XML as text. Indeed, I would never advise someone to use a binary encoding of XML unless they had a specific need to do so.
When would you want to use a binary encoding of XML? Well, binary encoding compresses the XML document into a smaller size, sometimes just a few percent of its original size. Occasionally that might be useful for storage, if you deal with enormous data files (for example, large data sets for numerical simulations). More commonly, though, binary encodings are used when transmitting files across networks. Why? Well, size is an obvious reason, to reduce bandwidth. However, another reason is speed of parsing. In probably the majority of cases, the time taken to parse XML documents is not significant in the context of the whole business process they are used within. People often worry too much about the processing overhead of parsing XML, without looking at the bigger picture of their processes and how long they take from end to end. However, there are contexts in which the parsing time is genuinely significant.
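To give a rough feel for the size argument, here is a small Python sketch using plain gzip (not an XML-aware codec, and the data file is invented for illustration). Verbose, repetitive XML is exactly the kind of input that general-purpose compression handles well:

```python
import gzip

# An invented, highly repetitive XML data file, of the kind produced
# by numerical simulations or market data feeds.
rows = "".join(f"<row><x>{i}</x><y>{i * i}</y></row>" for i in range(1000))
doc = f"<?xml version='1.0'?><data>{rows}</data>".encode("utf-8")

# Even generic gzip removes most of the tag redundancy; a schema-aware
# binary XML encoding can typically do better still.
packed = gzip.compress(doc)
print(f"{len(doc)} bytes -> {len(packed)} bytes "
      f"({len(packed) / len(doc):.0%} of original)")
```

The exact ratio depends on the data, but for tag-heavy documents like this one the compressed form is a small fraction of the original.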
When you have large volumes of documents to process, you can sometimes manage that processing by spreading the work out over a “farm” of processors. Even if each document is processed relatively slowly, the overall throughput can be high. However, that only works if you aren't sensitive to the time taken to process each individual document. In the modern finance world, algorithmic (computer) share trading means that a difference of just 20ms (1/50 second) in processing a message can be the difference between getting the shares at the price you wanted, or missing out. This means that the time taken to process each individual message is important. You can make an XML message smaller by using gzip or similar, but you still have to unzip it into text and then parse it. With a good binary encoding, you should be able to decode the XML file directly as an event stream (like SAX), without ever having to reconstruct the XML document as text. That can make a very significant difference to the process in time-sensitive contexts like the one I've outlined.
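The gzip point above can be made concrete with a short Python sketch (the trade message is invented for illustration). Note the two distinct steps: the compressed bytes must first be expanded back into the full XML text, and only then can a textual parser turn that text into SAX events. A binary XML codec would decode straight to the equivalent event stream, skipping the text reconstruction entirely:

```python
import gzip
import xml.sax

# A small, invented trade message for illustration.
XML_MESSAGE = b"""<?xml version="1.0"?>
<order id="42"><symbol>ACME</symbol><qty>100</qty><price>12.34</price></order>"""

# Compressed for transmission (real messages are larger and compress well).
compressed = gzip.compress(XML_MESSAGE)

class OrderHandler(xml.sax.ContentHandler):
    """Collects start-element events, SAX-style."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

# Step 1: reconstruct the full XML text from the compressed bytes.
text = gzip.decompress(compressed)

# Step 2: run a textual parser over that text to get the event stream.
# A binary XML codec could hand the application these same events
# directly from the compressed form, with no intermediate text.
handler = OrderHandler()
xml.sax.parseString(text, handler)
print(handler.events)
# [('start', 'order'), ('start', 'symbol'), ('start', 'qty'), ('start', 'price')]
```

In a latency-sensitive pipeline it is the elimination of that intermediate textual form, not just the smaller wire size, that buys you the milliseconds.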
Arguably, in these days of XML messaging and Web Services, the majority of XML documents are short-lived. They don't need to be archived, and the lifetime of the messages is much shorter than the lifetime of the software that processes or encodes them. When this is the type of XML that you work with, there is no need to treat it like archival XML.
I started trying to push XML compression for these types of banking/financial contexts back in 2001, around the time of the MPEG-7 specification. MPEG-7 was aimed at digital video, but used compressed XML to provide metadata. It was a pretty good XML compression scheme, but it never took off, in spite of everyone repeatedly complaining about the size and verbosity of XML messages. Why not? In my mind, there were a couple of key reasons. The W3C didn't recommend any kind of compression at that time, and people take the W3C very seriously and religiously as a “brand”. Also, there wasn't a large choice of compression/decompression software, and consequently large companies were genuinely worried about being locked in to particular software tools that would play a fundamental part in their messaging systems (of course, they put up with this for relational databases, but that doesn't mean they want to repeat the same lock-in for every piece of software that they use). So they took the pain of textually encoded XML, or used alternative binary formats; bandwidth has not grown as quickly as some people thought it would, at least not compared to the growth in the volume of information distributed between financial organisations.
Indeed, although some people view this question as “should I binary encode XML or not?”, the real question for users with high-volume and low-latency requirements is “which binary encoding should I use?” The dominant financial protocol for pre-trade and trade activities is the FIX protocol, which is a binary protocol. There is also a matching XML variant of the messages, FIXml, but it's the binary protocol that is overwhelmingly used, because the XML variant mostly just makes the messages larger and slower to process. The FIXml schemas are generated from the same model (held in a relational database) that the binary message definitions are generated from.
At this point, some people would say “well, that's OK, XML isn't good for everything, sometimes you just need to use binary encodings, that doesn't mean that you need binary encoding for XML”. However, people often use XML because it is quicker to create Schemas than to create binary encoding formats (and prove that those binary formats are completely unambiguous in their encoding). XML is familiar to many developers now, and there is a whole ecosystem of great tools for XML. Why not make the most of those XML tools, even if you are working in a context where it makes sense to go binary for transmission? Why should all of the tools that you use with binary encodings be different to the tools you use for XML? It's not helpful to be forced to use completely different tool sets.
I've been thinking about all of this because last week, I was phoned up by John Schneider, the CTO of Agile Delta. The W3C has actually been working on binary XML for a couple of years now, in spite of snide remarks from some of the XML old guard. In particular, they have been measuring and comparing a number of alternative approaches to compression and binary encoding of XML. The tests measured not only the amount of compression, but also the time taken to decode and parse binary encoded XML documents. There are also performance results on the Agile Delta site. John contacted me because Agile Delta's “Efficient XML” approach has been chosen as the best all-round approach of those tested, and is now the focus of the W3C's efforts.
I'm really very pleased to see this work progressing. The W3C's support is very welcome, as I noted, because it's such a powerful brand in the Web/XML world. We don't have support for Efficient XML yet, to the best of my knowledge, from that other powerful brand, Apache. However, there are products available right now that implement Efficient XML, and it seems very likely that this number will grow over the next year or two because of the W3C's support, so that there will be sufficient and genuine choice between vendors.
If you have a context that could genuinely benefit from it, particularly a high-volume and low-latency context, you should do yourself a favour and check out Efficient XML.