In my previous blog post, I wrote about getting
set up for doing development with MarkLogic Server, which is a
database and app server for semi-structured data like XML. This time, I
want to focus on getting content into MarkLogic Server. When you start
developing with MarkLogic Server, or any similar database, you are likely
to want to pre-load some information into it. In my case, I wanted to load
FpML Schemas, validation rules and examples (FpML is the Financial
Products Markup Language, http://www.fpml.org/). It's taken me
longer than I expected to load these files the way that I want them in the
database, as I get used to what approaches do and don't work with
MarkLogic Server, so I want to discuss what I found by experience (and
would welcome comments on anything I've done that could have been done
more easily another way).
One thing about semi-structured databases
is that when you load a block of information (e.g. a message or document),
the content tends to stay together in the database (with some exceptions
for huge documents). So, if you load an XML document into an XML database,
you can extract the same document at a later time. This differs from what
happens when you put semi-structured information into a relational
database. In the case of FpML, if you use a tool to create an equivalent
relational schema for you, you end up with a couple of thousand
database tables, which is a bit of a nightmare even for
experienced database developers to work with. Additionally, while the
tools are able to "burst" your FpML message and populate the tables with
the information, I'm not aware of any simple way to request that the
database return to you all of the information from a particular message.
That's sometimes an important thing to do — an FpML message typically
represents the details of a trade of an OTC derivative, and there are real
business reasons for wanting to retrieve all of those trade details
together. The SQL required to do it is a significant and expensive piece
of work to write by hand. Indeed, many teams don't bother: they store the
original message separately from the database, because they know that the
relational database is going to make hard work of it. While that kind of
archiving is OK for ad-hoc auditing purposes, it doesn't help if one of
your use cases is to be able to present screens showing the details for
particular trades. Where there are good reasons to keep particular sets of
related data items together and not have them shredded into separate
tables, a semi-structured database can save a lot of development
effort.
Here is what you get with an FpML release (releases are free to
download, but you have to register). I'm taking FpML 4.7 (2nd
recommendation) as my example.
When unzipped, there are top-level 'documents', 'html', 'pdf' and 'xml'
directories. We'll mainly be interested in the 'xml' directory. It contains
the FpML XML Schemas, and its sub-directories contain FpML examples,
including some invalid examples for negative testing. However, the other
interesting directory for use with MarkLogic is 'html/validation-rules'.
Here there are XQueries (*.xq) which implement extra validation rules for
FpML, rules that XML Schemas can't check.
Before you import a large number
of files into a database like MarkLogic, it's a good idea to think about
how you want to organise things. One of my traditional criticisms of
databases, and the way people use them, is that it can be too easy to
"lose" documents or data, lose it in a sea of other information. With
MarkLogic, documents are stored in 'Forests', and one or more forests can
be part of a 'Database'. An app server uses particular databases. So app
servers use databases, databases use forests, forests contain documents.
However, there are some other things to think about too.
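The containment chain just described can be pictured with a small model. This is a sketch in plain Python, not MarkLogic code: the class names, the app server name and the two document URIs are made up for illustration, and it simplifies an app server down to a single content database.

```python
from dataclasses import dataclass, field


@dataclass
class Forest:
    """A forest physically holds documents (represented here by their URIs)."""
    name: str
    document_uris: list = field(default_factory=list)


@dataclass
class Database:
    """A database is made up of one or more forests."""
    name: str
    forests: list = field(default_factory=list)

    def document_uris(self):
        """All document URIs across every forest attached to this database."""
        return [uri for forest in self.forests for uri in forest.document_uris]


@dataclass
class AppServer:
    """An app server points at a database (simplified to one here)."""
    name: str
    database: Database


# Hypothetical contents, just to show how the pieces relate.
forest = Forest("FpML-Schemas", ["/example/one.xsd", "/example/two.xsd"])
database = Database("FpML-Schemas", [forest])
server = AppServer("example-app-server", database)

print(server.database.document_uris())
```

The point of the model is only that a document is reached through its forest and database, never directly from the app server.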
Every
document in the database has a URI to identify it. These are sometimes
proper URLs that start with 'http://...', but they can also be just paths
like '/AAA/BBB/CCC' (so they aren't constrained to be proper URIs, in spite
of what they are called in the documentation).
Documents are organised in a hierarchy of directories, like a file system,
but each document can also be part of any number of 'collections'.
Collections allow you to group any set of files together, regardless of
where they are located in the database. So, when importing files, it's
worth spending a bit of time thinking about how you want the document URIs
to be structured (i.e. how you want your database's directories to be
structured), and also what collections you should create to group related
files together (just to make it easy to find those files later, as a
related group).
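As a planning aid, it can help to write the intended layout down before loading anything. The sketch below is plain Python rather than MarkLogic code, and everything in it is an assumption for illustration: the '/fpml/4-7/...' URI prefixes, the collection names, the local file paths and the `plan_document` helper are hypothetical choices, not something MarkLogic or FpML prescribes.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class PlannedDocument:
    """Where a local file should end up in the database (planning only)."""
    uri: str
    collections: set = field(default_factory=set)


def plan_document(local_path, uri_prefix, collections):
    """Map a local file to a database URI under uri_prefix, keeping only the file name."""
    name = PurePosixPath(local_path).name
    return PlannedDocument(uri=uri_prefix.rstrip("/") + "/" + name,
                           collections=set(collections))


def in_collection(docs, collection):
    """Return the sorted URIs of all planned documents that belong to a collection."""
    return sorted(d.uri for d in docs if collection in d.collections)


# Hypothetical layout: one directory per FpML version, one collection per kind of file.
docs = [
    plan_document("/downloads/fpml-4-7/xml/example-schema.xsd",
                  "/fpml/4-7/schemas/", {"fpml-4-7", "schemas"}),
    plan_document("/downloads/fpml-4-7/xml/fx/example-01.xml",
                  "/fpml/4-7/examples/fx/", {"fpml-4-7", "examples"}),
]

print(in_collection(docs, "fpml-4-7"))   # both documents, whatever their directory
print(in_collection(docs, "schemas"))    # only the schema
```

The directory part of the URI answers "where does this live?", while the collections answer "what is this related to?", and a document can sit in several collections at once.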
Additionally, there are relationships between
databases. For each database, you can specify which database contains the
XML Schemas for validating the XML (it can be the same database, by the
way). Additionally, for each app server you can specify a database for
'modules' (XQueries), although you can also choose to store your XQueries
in the file system where MarkLogic Server is installed. What this means
is:
- you need to decide whether your XML will be stored in one or more
forests, and in one or more databases;
- you need to decide whether your XML Schemas will be in the same
database as your XML files, or a different one;
- you need to decide whether your modules (XQueries) will be stored
in a database, or in the file system.
Let's look at doing the following:
- creating a new forest and database for the FpML 4.7 XML
Schemas;
- importing the FpML 4.7 Schemas into that database.
(In later posts, we will look at importing the other files, and
doing some validation.)
In the MarkLogic admin console (usually port 8001), go to the 'Forests'
section. Click on the 'Create' tab, and create a new forest called
'FpML-Schemas'. You only need to fill in the name (for the purposes of this
exercise), then you can press 'ok'.
You will now have an empty forest for the FpML XML Schemas. Now go to the
'Databases' section of the console. Click on the 'Create' tab, and create a
new database called 'FpML-Schemas' (it's OK to have the same name as the
forest).
After you press 'ok', you will find that you are prompted to select one or
more forests for the database. Click on the 'Database->Forests' link and
select 'FpML-Schemas' as the forest for the database (remember, we gave
them both the same name).
You now have a database with a forest, ready for loading your FpML XML
Schemas. The quickest way to load the FpML XML Schemas is to go to the
admin page for the 'FpML-Schemas' database, and click on the 'Load' tab.
Put in the path for your FpML 4.7 'xml' directory, and use '*.xsd' as the
filter. This will load all of the XML Schemas in the 'xml' directory (but
not in any sub-directories).
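If you want to preview which files a '*.xsd' filter would pick up before loading, a few lines of Python can apply the same pattern to a single directory without descending into sub-directories. This is only a sketch that imitates the behaviour described above; it does not talk to MarkLogic, and the helper name and sample file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def matching_files(directory, pattern="*.xsd"):
    """List files directly inside directory that match pattern (no sub-directories)."""
    return sorted(p for p in Path(directory).glob(pattern) if p.is_file())


# Build a throwaway directory tree to show the effect of the filter.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "top-level.xsd").write_text("<schema/>")
    (root / "notes.txt").write_text("not a schema")
    (root / "examples").mkdir()
    (root / "examples" / "nested.xsd").write_text("<schema/>")

    # Only the schema at the top level matches; the nested one is skipped.
    found = [p.name for p in matching_files(root)]
    print(found)
```

Pointing `matching_files` at your own unzipped 'xml' directory gives a quick list to compare against the confirmation page in the admin console.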
When you press 'ok', you are shown a page with the files that will be
loaded, and the database URI that will be assigned to each. You should find
that the URIs are identical to the file paths. Press 'ok' to start loading
the Schemas. In a few moments, the FpML XML Schemas are loaded. Press 'ok'
again. The status page for the 'FpML-Schemas' database should now show that
you have some documents in it.
How do you see which XML Schemas have been loaded?
How do you control the URIs that are assigned to them when they are loaded
in the database? How do you put them in collections? These are things we
will look at in the next installment.