In my previous blog post, I wrote about getting set up for doing development with MarkLogic Server, which is a database and app server for semi-structured data like XML. This time, I want to focus on getting content into MarkLogic Server. When you start developing with MarkLogic Server, or any similar database, you are likely to want to pre-load some information into it. In my case, I wanted to load FpML Schemas, validation rules and examples (FpML is the Financial Products Markup Language, http://www.fpml.org/). It's taken me longer than I expected to load these files the way that I want them in the database, as I get used to what approaches do and don't work with MarkLogic Server, so I want to discuss what I found by experience (and would welcome comments on anything I've done that could have been done more easily another way).
One thing about semi-structured databases is that when you load a block of information (e.g. a message or document), the content tends to stay together in the database (with some exceptions for huge documents). So, if you load an XML document into an XML database, you can extract the same document at a later time. This differs from what happens when you put semi-structured information into a relational database. In the case of FpML, if you use a tool to create an equivalent relational schema for you, you end up with a couple of thousand database tables, which is a bit of a nightmare for even experienced database developers to work with. Additionally, while the tools are able to "burst" your FpML message and populate the tables with the information, I'm not aware of any simple way to request that the database return to you all of the information from a particular message. That's sometimes an important thing to do: an FpML message typically represents the details of a trade of an OTC derivative, and there are real business reasons for wanting to retrieve all of those trade details together. The SQL required to do it is a significant and expensive piece of work to write by hand. Indeed, many teams don't bother; they store the original message separately from the database, because they know that the relational database is going to make hard work of it. While that kind of archiving is OK for ad-hoc auditing purposes, it doesn't help if one of your use cases is to present screens showing the details of particular trades. Where there are good reasons to keep particular sets of related data items together and not have them shredded into separate tables, a semi-structured database can save a lot of development effort.
Here is what you get with an FpML release (they are free to download, but you have to register). I'm taking FpML 4.7 (2nd Recommendation) as my example:
When unzipped, there are top-level 'documents', 'html', 'pdf' and 'xml' directories. We'll mainly be interested in the 'xml' directory. It contains the FpML XML Schemas, and its sub-directories contain FpML examples, including some invalid examples for negative testing. However, the other interesting directory for use with MarkLogic is 'html/validation-rules'. Here there are XQueries (*.xq) which implement extra validation rules for FpML, rules that XML Schemas can't check.
Before you import a large number of files into a database like MarkLogic, it's a good idea to think about how you want to organise things. One of my traditional criticisms of databases, and the way people use them, is that it can be too easy to "lose" documents or data, lose it in a sea of other information. With MarkLogic, documents are stored in 'Forests', and one or more forests can be part of a 'Database'. An app server uses particular databases. So app servers use databases, databases use forests, forests contain documents. However, there are some other things to think about too.
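To keep that chain of containment straight in my head, I find it helps to write it down as a tiny data model. The following is only a mental-model sketch in Python, not anything MarkLogic provides: the class names and the example names and URIs are all my own inventions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Forest:
    """A forest holds documents, represented here by their URIs only."""
    name: str
    document_uris: List[str] = field(default_factory=list)


@dataclass
class Database:
    """A database is made up of one or more forests."""
    name: str
    forests: List[Forest] = field(default_factory=list)

    def document_uris(self) -> List[str]:
        # A database sees every document held by any of its forests.
        return [uri for forest in self.forests for uri in forest.document_uris]


@dataclass
class AppServer:
    """An app server is configured to use particular databases."""
    name: str
    databases: List[Database] = field(default_factory=list)


# Hypothetical names, purely for illustration.
forest_a = Forest("example-forest-a", ["/docs/one.xml"])
forest_b = Forest("example-forest-b", ["/docs/two.xml"])
database = Database("example-db", [forest_a, forest_b])
server = AppServer("example-server", [database])

for db in server.databases:
    print(db.name, db.document_uris())
```

The point of the sketch is simply that a document belongs to exactly one forest, while a database presents the documents of all its forests as a single set.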
Every document in the database has a URI to identify it. These are sometimes proper URLs that start with 'http://...', but they can also be just paths like '/AAA/BBB/CCC' (so they aren't constrained to be proper URIs, in spite of what they are called in the documentation). Documents are organised in a hierarchy of directories, like a file system, but each document can also be part of any number of 'collections'. Collections allow you to group any set of files together, regardless of where they are located in the database. So, when importing files, it's worth spending a bit of time thinking about how you want the document URIs to be structured (i.e. how you want your database's directories to be structured), and also what collections you should create to group related files together (just to make it easy to find those files later, as a related group).
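One way to do that thinking is to write the plan down before loading anything. Here is a minimal Python sketch of such a plan: given a file's path relative to the unzipped release, it proposes a database URI and a list of collections. It does not talk to MarkLogic at all, and the base URI, the collection names and the example file name are assumptions of mine, not anything defined by FpML or MarkLogic.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Union

# Hypothetical naming scheme: adjust to taste.
BASE_URI = "/fpml/4-7/"
COLLECTION_BY_SUFFIX = {
    ".xsd": "fpml-schemas",
    ".xq": "fpml-validation-rules",
    ".xml": "fpml-examples",
}


def plan_document(relative_path: str) -> Dict[str, Union[str, List[str]]]:
    """Propose a database URI and collections for one file in the release."""
    path = PurePosixPath(relative_path)
    # The URI mirrors the directory layout under a common base.
    uri = BASE_URI + path.as_posix()
    # Every file joins a release-wide collection, plus one chosen by file type.
    collections = ["fpml-4-7"]
    by_type = COLLECTION_BY_SUFFIX.get(path.suffix)
    if by_type is not None:
        collections.append(by_type)
    return {"uri": uri, "collections": collections}


# "example-schema.xsd" is a made-up file name.
print(plan_document("xml/example-schema.xsd"))
```

The directory part of the URI answers "where does this live?", while the collections answer "what is this related to?", and a document can sit in several collections at once.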
Additionally, there are relationships between databases. For each database, you can specify which database contains the XML Schemas for validating the XML (it can be the same database, by the way). Also, for each app server you can specify a database for 'modules' (XQueries), although you can instead choose to store your XQueries in the file system where MarkLogic Server is installed. What this means is:
- you need to decide whether your XML will be stored in one or more forests, and in one or more databases;
- you need to decide whether your XML Schemas will be in the same database as your XML files, or a different one;
- you need to decide whether your modules (XQueries) will be stored in a database, or in the file system.
Let's look at doing the following:
- creating a new forest and database for the FpML 4.7 XML Schemas;
- importing the FpML 4.7 Schemas into that database.
(In later posts, we will look at importing the other files, and doing some validation.)
In the MarkLogic admin console (usually port 8001), go to the 'Forests' section:
Click on the 'Create' tab, and create a new forest called 'FpML-Schemas'. You only need to fill in the name (for the purposes of this exercise), then you can press 'ok'.
You will now have an empty forest for the FpML XML Schemas. Now go to the 'Databases' section of the console:
Click on the 'Create' tab, and create a new database called 'FpML-Schemas' (it's OK to have the same name as the forest):
After you press 'ok', you will find that you are prompted to select one or more forests for the database:
Click on the 'Database->Forests' link and select 'FpML-Schemas' as the forest for the database (remember, we gave them both the same name):
You now have a database with a forest, ready for loading your FpML XML Schemas. The quickest way to load the FpML XML Schemas is to go to the admin page for the 'FpML-Schemas' database, and click on the 'Load' tab. Put in the path for your FpML 4.7 'xml' directory, and use '*.xsd' as the filter. This will load all of the XML Schemas in the 'xml' directory (but not in any sub-directories).
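If you want to check in advance which files a non-recursive '*.xsd' filter would pick up, a few lines of Python will do it. This is just a sketch of the same kind of matching on the local file system, not a reproduction of what the admin console does internally; the function name is my own, and the demonstration builds a throwaway directory with made-up file names rather than a real FpML release.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def preview_load(directory: Union[str, Path], pattern: str = "*.xsd") -> List[str]:
    """List names of files directly inside `directory` that match `pattern`.

    A plain pattern passed to Path.glob does not descend into sub-directories.
    """
    root = Path(directory)
    return sorted(p.name for p in root.glob(pattern) if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Made-up files standing in for an unzipped 'xml' directory.
        (root / "example-schema.xsd").write_text("<xs:schema/>")
        (root / "notes.txt").write_text("not a schema")
        (root / "examples").mkdir()
        (root / "examples" / "nested.xsd").write_text("<xs:schema/>")
        # Only the top-level .xsd file is listed; the nested one is skipped.
        print(preview_load(root))
```

Pointing `preview_load` at your real 'xml' directory gives you a list to compare against the confirmation page before you commit to the load.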
When you press 'ok', you are shown a page with the files that will be loaded, and the database URI that will be assigned to each. You should find that the URIs are identical to the file paths. Press 'ok' to start loading the Schemas. In a few moments, the FpML XML Schemas are loaded. Press 'ok' again. The status page for the 'FpML-Schemas' database should now show that you have some documents in it:
How do you see which XML Schemas have been loaded? How do you control the URIs that are assigned to them when they are loaded in the database? How do you put them in collections? These are things we will look at in the next installment.