I've written before [#1, #2] about the W3C's “Efficient XML” specification which defines a “binary” XML format that
compresses your XML (with or without a Schema, though it does a better job with one); and
can be parsed more quickly than the equivalent textual XML, i.e. the compressed XML is effectively pre-parsed (and pre-validated if a Schema was used).
Let's be clear, for those of you who don't like the idea of “binary XML”: I think 80% of the world is perfectly happy 80% of the time with XML in its normal textual format, and I'm not suggesting the 80% should change. For some XML users (e.g. in banking/finance, where I do most of my work for clients), textual XML is a barrier to meeting some of their current volume requirements and per-transaction processing time requirements. So some people need this, even if you yourself don't.
Anyway, I noticed during the week that numerous reference compressed files have been posted to the Web. While the files themselves are probably of most interest to people working on implementations of Efficient XML, the file sizes are, to my mind, of more general interest. Sadly, they didn't post a table of file sizes, so I downloaded the FpML (Financial Products Markup Language) files for IRD (Interest Rate Derivatives), which is an area of interest to me (there are lots of other examples, though, some of which may be more relevant to what you do).
Each of the FpML IRD example XML files was compressed multiple ways, with different compression options, using the Agile Delta implementation of Efficient XML. I looked at the best compression for each example, and found that the average was about 85%. The actual compression varied between about 70% and 90%, with a graph of compression versus file size suggesting that compression improves for larger files. The largest of the FpML examples was only about 20k, though, so you would really need larger files to test whether the compression genuinely improves with the uncompressed file size.
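For anyone who wants to repeat the arithmetic on a different set of examples, here is a minimal sketch in Python of how the figures above can be worked out. It assumes "compression" means the percentage reduction in size, i.e. 100 × (1 − compressed size / original size). The function names and the sample sizes are made up purely for illustration; they are not the actual FpML file sizes.

```python
from statistics import mean


def compression_percent(original_size: int, compressed_size: int) -> float:
    """Percentage reduction in size: 100 * (1 - compressed / original)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (1.0 - compressed_size / original_size)


def best_per_example(rows):
    """Pick the best (highest) compression percentage for each example.

    rows is an iterable of (example_name, original_bytes, compressed_bytes),
    with one row per compression option tried on that example.
    """
    best = {}
    for name, original, compressed in rows:
        pct = compression_percent(original, compressed)
        if name not in best or pct > best[name]:
            best[name] = pct
    return best


# Hypothetical sizes in bytes, for illustration only (not real measurements).
sample_rows = [
    ("example-a.xml", 10_000, 2_500),
    ("example-a.xml", 10_000, 1_500),
    ("example-b.xml", 20_000, 4_000),
    ("example-b.xml", 20_000, 2_000),
]

best = best_per_example(sample_rows)
average = mean(best.values())

for name, pct in sorted(best.items()):
    print(f"{name}: best compression {pct:.1f}%")
print(f"average of best results: {average:.1f}%")
```

To use it on real files you would replace sample_rows with the sizes of each original XML file and each of its compressed variants (os.path.getsize will give you those).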
The problem, at the moment, in trying to convince people to look at this technology is that Agile Delta's implementation seems to be the only one available. At least, I can't find another; if you have an implementation, or are planning one, please add a comment to this post with the details. I agree with the adage that an open standard with only a single implementation is no better than a proprietary standard, so I really hope we will see some other implementations before too long. Something from Apache/IBM/Microsoft/Sun would be nice, with smaller vendors keeping them honest. Here's hoping, anyway.