“Data-centric” vs. “Document-centric” XML


I was in a meeting this week where the topic of “Data-centric” vs. “Document-centric” XML arose. These concepts aren’t immediately obvious, and it took me a reasonable amount of time to understand them. So here’s the deal…

Data-centric XML is that which has record structure as its focus. Data-centric XML serves a similar function to a database: a set of fields is predefined, and records (think records here, not documents) must conform to that structure. Data-centric metadata schemas define a set of features, expressed as metadata elements, that are deemed important to record for the type of resource the standard is intended to describe. It is truly meta-data: not a document, a resource, or a library holding, but a surrogate for that document/resource/holding, created for the purpose of discovery or access. Standards such as MODS and Dublin Core represent data-centric XML.
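
To make the database comparison concrete, here is a minimal sketch in Python using only the standard library. The `<records>` and `<record>` wrapper elements and the sample values are invented for illustration; only the `dc:` element names and namespace come from Dublin Core. The sketch also assumes each field appears at most once per record, which Dublin Core itself does not require.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper elements around real Dublin Core element names.
RECORDS = """\
<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:title>Leaves of Grass</dc:title>
    <dc:creator>Whitman, Walt</dc:creator>
    <dc:date>1855</dc:date>
  </record>
  <record>
    <dc:title>Moby-Dick</dc:title>
    <dc:creator>Melville, Herman</dc:creator>
    <dc:date>1851</dc:date>
  </record>
</records>
"""


def to_rows(xml_text):
    """Turn each <record> into a dict, much like a row in a database table."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("record"):
        row = {}
        for field in record:
            # ElementTree reports tags as "{namespace}name"; keep only the name.
            name = field.tag.split("}", 1)[-1]
            # Assumption: one value per field, so later repeats would overwrite.
            row[name] = (field.text or "").strip()
        rows.append(row)
    return rows


rows = to_rows(RECORDS)
for row in rows:
    print(row)
```

Every record has the same fields in the same places, which is why it maps so easily onto rows and columns.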

Document-centric XML is that which has the document (the text, something pre-existing with its own structure) as its focus. XML Schemas or DTDs for document-centric XML can be thought of more as markup languages than metadata formats. As such, they provide means to explicitly indicate when parts of an existing document have structure or meaning; for example, when a section of a document represents a paragraph, a stanza of a poem, a title, a name, or a geographic location. Document-centric XML does not have a predictable structure; rather, elements representing structure and document features appear as needed in a given document. Encoding using standards such as TEI and EAD produces document-centric XML.
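
By contrast, here is a small sketch of document-centric markup, again in Python with the standard library. The element names (`p`, `name`, `title`, `placeName`) are borrowed from TEI, but the fragment is an invented illustration with no TEI namespace or header, so it should not be read as valid TEI.

```python
import xml.etree.ElementTree as ET

# Invented sentence; the tags mark up features of a text that already exists.
PASSAGE = """\
<p>In 1855 <name type="person">Walt Whitman</name> published the first
edition of <title>Leaves of Grass</title> in
<placeName>Brooklyn</placeName>.</p>
"""

p = ET.fromstring(PASSAGE)

# Removing the markup leaves the original prose intact
# (whitespace is collapsed here for display).
plain_text = " ".join("".join(p.itertext()).split())

# The elements appear wherever the text happens to call for them.
marked = [(child.tag, child.text) for child in p]

print(plain_text)
print(marked)
```

The text would still make sense with every tag stripped out; the markup only makes explicit what a human reader would already recognize.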


"XML in a Nutshell" (Harold & Means; O'Reilly; 3d ed 2004) which I am slowly working through, refers to these concept as "Narrative-like XML documents" and "Record-like XML documents." [Note that they refer to both kinds of XML, um, 'files', as 'documents', instead of reserving 'document' for narrative-like ones as you do. Note also that an XML document/file doesn't neccesarily _need_ a schema/DTD at all, but will still be either narrative-like or record-like.] They're talking about the same idea.

Other examples of narrative-like XML applications are the OpenOffice document format and, of course, XHTML.

I'm not sure it's true that record-like XML is _necessarily_ metadata, necessarily a surrogate, like you say. Anything you might want to make a database record for, like you say, you might want a record-like XML document for. But not everything in a database is a surrogate for something else; a record can just be its own data too. No? Record-like XML can also be used for low-level communication-type applications, like SOAP (or like XML's use in AJAX generally), which aren't exactly 'metadata'.

Harold & Means make the point that the designers of XML originally envisioned it for use with narrative-like documents, but its use for record-like documents has perhaps become more popular (although, depending on how you measure 'popularity', XHTML might push narrative-like uses into 'most popular'). (I think this is in part their own prejudice too; the authors are clearly more excited about XML for narrative-like uses.)

"XML is first and foremost a document format. it was always intended for web pages, books, scholarly articles, poems, short stories, reference materials, tutorials, textbooks, legal pleadings, contracts, instruction sheets, and other documents that human beings would read. Its use as a syntax for computer data in applications such as order processing, object serialization, database exchange and backup, and electronic data interchange is mostly a happy accident."

Note that they (in 2004) don't even include 'metadata' as one of the "record-like" uses, instead focusing on more technical data-interchange type applications. Also note that "meant for human beings to read" vs. "meant as a computer data format" often go along with "narrative-like" vs. "record-like".

I think this is an important thing to keep in mind when designing and using an XML application, and if you aren't clear about which of these categories your XML application falls into, you may be headed for trouble. For instance, which is EAD? Meant for human beings to read, or meant as a computer data format? Narrative-like, or record-like?
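
One rough way to probe that question for a particular file, offered as a heuristic sketch and not a definition: count how many elements mix running text with child elements. Narrative-like XML tends to do this a lot, and record-like XML almost never does. The function name `mixed_content_ratio` and both sample strings are made up for illustration, and the sketch uses only Python's standard library.

```python
import xml.etree.ElementTree as ET


def mixed_content_ratio(xml_text):
    """Share of parent elements that hold text alongside child elements."""
    root = ET.fromstring(xml_text)
    parents = 0
    mixed = 0
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue  # leaf elements cannot have mixed content
        parents += 1
        # Text directly inside elem: before the first child, or after any child.
        texts = [elem.text] + [child.tail for child in children]
        if any(t and t.strip() for t in texts):
            mixed += 1
    return mixed / parents if parents else 0.0


# Invented samples: one record-like, one narrative-like.
record_like = "<order><id>17</id><qty>3</qty></order>"
narrative_like = "<p>Call me <name>Ishmael</name>. Some years ago...</p>"

print(mixed_content_ratio(record_like))     # 0.0
print(mixed_content_ratio(narrative_like))  # 1.0
```

A real collection would probably land somewhere between those two extremes, which is rather the point of the question.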

Jonathan wrote: "I'm not sure it's true that record-like XML is _necessarily_ metadata, necessarily a surrogate, like you say."

Ah, yes, you uncovered the bias that leaks in from my particular perspective on this issue. (I am a "Metadata Librarian" after all!) That's absolutely true: XML can encode data just as well as metadata, and does for many applications. I would say that in libraries, XML that's dealt with as XML rather than inside an application is more frequently for metadata (surrogates) than for data.

I'll also throw in here that the dividing line between data and metadata (and even metadata and meta-metadata) isn't always clear. How's that for uncertainty in an area where we'd really like to be certain and prescriptive?

"I'll also throw in here that the dividing line between data and metadata (and even metadata and meta-metadata) isn't always clear."

Very true. "Metadata" is, of course, _always_ data. That's why it's called "meta-" data. So all metadata is data, but probably not all data is metadata.

And whether XML is dealt with "as XML" rather than "inside an application" depends on how dirty you need to (or can) get your hands inside the 'black box', right? Whether there's an application to do what you want, or whether you need to get the toolbox out and try to put something together yourself. I suggest that libraries and librarians will increasingly have to get their hands dirty with technology, which is maybe my own particular bias (I'm a systems librarian). But the fact that we're talking about XML in the first place (and that this blog exists) seems to point in that direction too.

But I see what you're saying if you're saying that librarians and libraries are traditionally in the business of dealing with surrogates and metadata, and that's a lot of what we do in general. Certainly true. That's our expertise.
