I spend much of my professional life working with XML. I often design XML structures that are used to represent financial information as it is transmitted between computers, but I also spend quite a bit of time looking at tools to support this usage of XML.
To be honest, I'm often a bit horrified by how people set up their software to work with XML. An all too common setup for a system to read incoming XML is:
- Compile the XML Schemas into Java or .NET using a Schema binding tool (problem #1 - the XML is your decoupling mechanism, to help isolate your systems and stop them from being too tightly coupled to each other, but when you compile a Schema, you tightly couple your code to the XML that is supposed to be your decoupling mechanism. If you use Schema compilers and every change to your XML's structure over time causes you headaches, you might need to reconsider your use of a Schema binding tool);
- Create a staging relational database, and write code to store the information from your Java/.NET objects into that database (problem #2 - it can be a lot of work to create a relational model of a hierarchical document, like an XML message. It is also a lot of work to write the code to do the storage. Do you really want to spend that time? Also, people often model all of the XML in the staging database, even if it won't all be used. That means the system ends up coupled to all changes in the XML, even those that would otherwise not have impacted the system - not a good plan.);
- Write some more code to copy the information from the staging database into a separate application database, often with a different structure (so the code that you have to write isn't trivial).
How would I do it differently? It is much less effort to:
- Store the XML in an XML database (i.e. in a semi-structured database). In particular, use an XML database that can store arbitrary XML, not one that requires you to have a Schema for your XML;
- Query information out of the XML database as your application requires it (typically using XQuery, although your XML database may also support JDBC/ODBC views of the information, to simplify integration with existing apps).
Notice the difference - no time spent writing staging database schemas, no time spent writing code to populate the staging database. You simply write the queries that you need to get the information out of the XML database, and you should pull the minimum amount of data possible when doing that so that you minimise the coupling between the application and the XML.
In fact, it's worth noting that using XQuery (or XPath) to pull the information from your XML database provides the same level of decoupling that SQL and views provide when working with a relational database. That use of a query to provide decoupling is so common now with relational databases that people think of it as always having been that way, but it wasn't. There were software tools in the 80s that wrote C++ code that was tightly bound to the database schema, and that code had all the same problems as we have now with code generated from XML Schemas. The relational database folks learned that you always need a SQL query to provide decoupling from the physical storage structure; the XML world has been a bit slower to realise the same thing, but it's an important lesson - use XQuery/XPath/XSLT to decouple yourself from the physical structure of your XML and you take a lot of pain out of your application development.
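To make the decoupling point concrete, here is a minimal sketch. It is not XQuery running against an XML database; it uses the limited XPath support in Python's standard-library ElementTree as a stand-in, and the payment message and its element names are invented purely for illustration. The idea is that a query which pulls only the values the application needs keeps working when the message gains new elements, where schema-bound generated classes would typically need regenerating.

```python
import xml.etree.ElementTree as ET

# Hypothetical message; the element names are made up for this example.
MESSAGE_V1 = """
<payment>
  <id>P-1001</id>
  <amount currency="GBP">250.00</amount>
  <payee><name>Acme Ltd</name></payee>
</payment>
""".strip()

# A later version of the same message with extra elements that this
# application does not care about.
MESSAGE_V2 = """
<payment>
  <id>P-1001</id>
  <channel>online</channel>
  <amount currency="GBP">250.00</amount>
  <payee><name>Acme Ltd</name><country>GB</country></payee>
  <audit><checkedBy>ops</checkedBy></audit>
</payment>
""".strip()


def extract_payment(xml_text):
    """Pull only the three values the application needs, by path."""
    root = ET.fromstring(xml_text)
    amount = root.find("./amount")
    return {
        "id": root.findtext("./id"),
        "amount": amount.text,
        "currency": amount.get("currency"),
    }


# The same query code handles both versions of the message unchanged.
print(extract_payment(MESSAGE_V1) == extract_payment(MESSAGE_V2))
```

The application is coupled only to the three paths it queries, so changes elsewhere in the message structure do not touch it.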
It can be hard to get people to move away from the relational database mindset, in spite of the extra costs of development. However, various people in the finance world seem to be looking seriously now at semi-structured databases (which includes XML databases), and the one that gets mentioned most often is MarkLogic Server. I should say that I know quite a few people now who work for MarkLogic (they've been hiring of late, and picking up some really top class people).
So, I've decided to set up a proper MarkLogic development environment for myself (I've only toyed with it in the past). You can download a development version from the MarkLogic developer site, where there is also a lot of developer documentation. It doesn't run on Macs yet, but it does run under Windows, Linux and Solaris. I couldn't get it to install under Windows 7 (Vista is what is on the supported list), so I installed it under Ubuntu. Ubuntu isn't officially supported but my friends at MarkLogic told me how to install it; if you want to know, let me know. It's pretty easy, and it's working well for me.
With MarkLogic Server, your installation can run multiple app servers (HTTP, WebDAV, or XDBC [direct code connection]). Each app server uses certain databases, and each database uses one or more 'forests' of documents. The documents are stored compressed on disk, so it doesn't waste space like some XML databases that store them on disk as uncompressed text. The documents are automatically indexed, but you can also manually adjust how they are indexed (like any database, there is an art to doing that well).
The default setup has a 'Docs' app server which uses a 'Documents' database, and these are what you should use in your initial experiments. The first thing you will want to do is load some documents and try out some queries. This can be a bit daunting, because it isn't as obvious as it should be how you do this. I believe MarkLogic are working on improving this, but it's also because MarkLogic Server is a secure, industrial-strength database (used by the U.S. security services among others), and so you have to learn to negotiate the permissions. In general, the quick answer is that for your initial learning with MarkLogic Server, use your admin login for everything (i.e. while you are just experimenting, and don't have anything too valuable stored in the database). I spent too much time trying to set up a non-admin user for development work, but it isn't a good way to get to know MarkLogic, so don't make my mistake.
Get started by reading the admin guide and creating a WebDAV server and an XDBC server for the 'Documents' database (it's pretty easy, so I won't describe it here). This allows tools to connect to the database. In particular, I have two working now:
- MarkLogic has an Eclipse plugin that sets it up for XQuery development (available from the MarkLogic developer site). It runs with the Galileo 3.5 release of Eclipse, not with the latest Helios release (trust me, I tried, it doesn't). This gives you syntax support for XQuery, including a knowledge of MarkLogic-specific XQuery functions (MarkLogic Server extends XQuery in the same way PL/SQL or Transact-SQL extend SQL - to turn it from a query language into a full programming language. However, you can also run pure standard XQuery 1.0 if you want). You can run your XQueries from within Eclipse just as you would run a piece of code;
- oXygen's XML editor, a favourite of mine, also does a great job with MarkLogic Server. It can validate your XQueries using MarkLogic's validator (so it picks up knowledge of MarkLogic-specific functions, etc.), and it can run those queries on MarkLogic Server and return the results, just as easily as Eclipse.
Either of these tools is a great way of getting up and running with MarkLogic Server. You can also install 'CQ' (http://developer.marklogic.com/code/cq) which is a neat little Web UI for running XQueries. It remembers your queries, which is nice, but it doesn't have the syntax checking and colouring that Eclipse and oXygen provide. CQ is well worth installing, but a tool like Eclipse or oXygen is really necessary to be productive (and that is what this is all about, after all, being productive in development, avoiding the effort and cost of unnecessary development).
To load files into a database, go to the admin console (port 8001) and look at the details for that database. There is a 'Load' tab that is the simplest way to get content in. However, it doesn't do what I want, which is to load content from a whole directory tree while filtering on file names. I am working on an XQuery for that, and I will post it later when it is working. Otherwise you can also use WebDAV to drag-and-drop files and directories into MarkLogic Server. I'm going with the XQuery because it will allow me to set things like permissions as the files are imported, and I think that will prove useful, if a bit more involved.
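This is not the XQuery I mentioned, but the selection step (walk a directory tree, keep only the files whose names match a filter) can be sketched in a few lines of Python using only the standard library. The function names and the '/import/' URI prefix are hypothetical, chosen just for this example, and the sketch deliberately stops before the actual upload to the server, which is not shown.

```python
import fnmatch
import os


def find_files(root_dir, pattern="*.xml"):
    """Yield paths under root_dir whose file name matches pattern."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)


def to_document_uri(root_dir, path, prefix="/import/"):
    """Map a file path to a document URI.

    The prefix is an invented convention for this sketch, not anything
    the database requires.
    """
    relative = os.path.relpath(path, root_dir)
    return prefix + relative.replace(os.sep, "/")


def plan_import(root_dir, pattern="*.xml"):
    """Return (file path, document URI) pairs for every matching file."""
    return [
        (path, to_document_uri(root_dir, path))
        for path in find_files(root_dir, pattern)
    ]
```

Calling plan_import on a directory gives the list of files to load and the URI each would be stored under, which can be checked by eye before anything is sent to the database.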
I will write more in the future. I have outlined what I'm doing now. Please feel free to post questions (I know I've skimmed over a lot of details). I will try to answer any questions, or I'll ask one of my MarkLogic friends if they have an answer.