Extensible Markup Language (XML)

The last technique presented here isn’t really a data modeling language at all. Rather, it is a way of representing data structure in text, using specially defined "tags" or labels to describe the structure of the text. The data being described could come from either an entity/relationship model or a database design.

The Extensible Markup Language (XML) is similar to the Hypertext Markup Language (HTML) that is used to describe pages on the World Wide Web. XML and HTML both derive from something called "Standard Generalized Markup Language", or SGML: HTML is an application of SGML, and XML is a simplified subset of it. SGML is a sophisticated tag language, which, "due to [its] complexity, and the complexity of the tools required," as the Object Management Group has so delicately put it, "has not achieved widespread uptake."[XML 1997]

In each case, a set of "tags" is inserted into a body of text. In the case of HTML, the tags are pre-defined to be interpreted by a standard piece of software called a browser. The browser then uses the tags to determine how various parts of the document should be displayed.

XML, on the other hand, allows tags to be defined by users, and is not concerned with display at all. Rather, the tags can be defined to describe a data structure, and data can be transmitted over the Internet in that structure.

Because tags are defined by users, there is no existing software that will automatically understand the tags. Software can read the definitions of tags and ensure that data transmitted using them conform to those definitions, but it cannot interpret the structure any further unless it is specifically written to do so.

This means that XML is most useful within a community that defines a common set of tags for its purposes. For example, the chemical industry has set up an XML-based Chemical Markup Language, and astronomers, mathematicians and the like have similarly defined sets of tags for describing things in their respective fields.

What is it?

Figure 9 shows an example of XML used to describe a data record that might be presented in a document.

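<?xml version="1.0"?>
<PURCHASE_ORDER>
  <ISSUED_TO_PARTY>
    <party_id>234553</party_id>
    <name>Acme Sporting Goods</name>
    <party_type>Organization</party_type>
    <surname></surname>
    <corporate_mission>Get America moving</corporate_mission>
  </ISSUED_TO_PARTY>
  <po_number>743453</po_number>
  <order_date>12 November, 1999</order_date>
  <LINE_ITEM>
    <price>64.75</price>
    <product_service_indicator>product</product_service_indicator>
    <PRODUCT>
      <product_code>X-23</product_code>
      <description>Nike sneakers</description>
      <unit_price>75.00</unit_price>
    </PRODUCT>
  </LINE_ITEM>
  <LINE_ITEM>
    <price>64.75</price>
    <product_service_indicator>service</product_service_indicator>
    <SERVICE>
      <service_id>x-87</service_id>
      <description>Walking the dog</description>
      <rate_per_hour>12.00</rate_per_hour>
    </SERVICE>
  </LINE_ITEM>
  <LINE_ITEM/>
</PURCHASE_ORDER>

Figure 9: An XML Document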

Note a few interesting things about this example.

First of all, as with HTML, each tag is surrounded by less than and greater than brackets (<>), and is usually followed by text. The text is in turn followed by an end tag, in the form </...>. A tag may have no content, in which case either the end tag follows immediately upon the tag (as in <surname></surname>), or the tag itself ends with a forward slash (as in <LINE_ITEM/>). Unlike with HTML, however, the end tag is always required.
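The tag rules just described can be tried out with a short Python sketch. It uses the standard library's xml.etree.ElementTree parser; the helper names is_empty and is_well_formed are made up for this illustration and are not part of XML or of the library.

```python
import xml.etree.ElementTree as ET

# Two spellings of an element with no content; the parser treats them alike.
long_form = ET.fromstring("<ISSUED_TO_PARTY><surname></surname></ISSUED_TO_PARTY>")
short_form = ET.fromstring("<ISSUED_TO_PARTY><surname/></ISSUED_TO_PARTY>")


def is_empty(element):
    """True when the element has neither text nor child elements."""
    return not (element.text or "").strip() and len(element) == 0


def is_well_formed(text):
    """True when the parser accepts the text, False when it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_empty(long_form.find("surname")), is_empty(short_form.find("surname")))
print(is_well_formed("<po_number>743453</po_number>"))  # end tag present
print(is_well_formed("<po_number>743453"))              # end tag missing
```

The last call is rejected because, unlike an HTML browser, an XML parser does not try to guess where a missing end tag belongs.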

A second thing to note is that, following the tag for purchase_order, a set of related tags follows, describing characteristics (columns, in this case) of purchase_order. In this particular case, the tag <PURCHASE_ORDER> has been defined such that it must contain exactly one tag for <ISSUED_TO_PARTY>, one for <po_number>, and so forth. You can’t see this from the example, but the tag <corporate_mission> is optional. The tag for line_item is also optional, and it may occur more than once.

Although it is optional, all XML documents should begin with <?xml version="1.0"?> (or whatever version number is appropriate). Note that "xml" in this declaration must be in lower case.

Note that the structure is hierarchical, so that an element can be under only one other element, and there can be only one hierarchy in a document.
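As a rough illustration of this single hierarchy, the Python sketch below builds a child-to-parent map for a cut-down purchase order and shows that text with two top-level elements is rejected. The helper names path_to and has_single_hierarchy are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

order = ET.fromstring(
    "<PURCHASE_ORDER>"
    "<ISSUED_TO_PARTY><name>Acme Sporting Goods</name></ISSUED_TO_PARTY>"
    "<po_number>743453</po_number>"
    "</PURCHASE_ORDER>"
)

# Map every element to the single element that directly contains it.
parent_of = {child: parent for parent in order.iter() for child in parent}


def path_to(element):
    """Tag names from the top of the hierarchy down to `element`."""
    tags = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        tags.append(element.tag)
    return "/".join(reversed(tags))


def has_single_hierarchy(text):
    """False when the parser rejects the text, as it does for two top-level elements."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(path_to(order.find("ISSUED_TO_PARTY/name")))
print(has_single_hierarchy("<PURCHASE_ORDER/><PURCHASE_ORDER/>"))
```

Because each element has exactly one parent, the path from the top of the document to any element is unambiguous.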

Comments are in the form <!-- comment text -->. Note that the double hyphens must be part of the comment. Note also that, unlike HTML, XML lets you use a comment to surround lines of code that you want to disable.
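The effect of disabling a line with a comment can be seen in a small Python sketch. It assumes the default behavior of the standard library's xml.etree.ElementTree parser, which discards comments while reading.

```python
import xml.etree.ElementTree as ET

text = """<PURCHASE_ORDER>
  <po_number>743453</po_number>
  <!-- <order_date>12 November, 1999</order_date> -->
</PURCHASE_ORDER>"""

order = ET.fromstring(text)

# Only the element outside the comment survives parsing.
visible_tags = [child.tag for child in order]
print(visible_tags)
```

The order_date element is still present in the text, but software reading the document never sees it.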

The meaning of a tag is defined in a document type definition (DTD). This is a body of code that defines tags through a set of elements. It is the DTD that allows you to specify a data structure. While an XML document contains data, the DTD contains the model of those data.

It is the DTD that is the analogy to the modeling techniques we have seen in this article.

Entities and Attributes

The DTD for the above example is shown in Figure 10.

<!DOCTYPE PURCHASE_ORDER [
<!ELEMENT PURCHASE_ORDER (ISSUED_TO_PARTY, po_number, order_date, LINE_ITEM*)>
<!ELEMENT ISSUED_TO_PARTY (party_id, name, party_type, surname?, corporate_mission?)>
<!ELEMENT party_id (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT party_type (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT corporate_mission (#PCDATA)>
<!ELEMENT po_number (#PCDATA)>
<!ELEMENT order_date (#PCDATA)>
<!ELEMENT LINE_ITEM (line_number, quantity, price, product_service_indicator, PRODUCT?, SERVICE?)>
<!ELEMENT line_number (#PCDATA)>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT product_service_indicator (#PCDATA)>
<!ELEMENT PRODUCT (product_code, description, unit_price)>
<!ELEMENT product_code (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT unit_price (#PCDATA)>
<!ELEMENT SERVICE (service_id, description, rate_per_hour)>
<!ELEMENT service_id (#PCDATA)>
<!ELEMENT rate_per_hour (#PCDATA)>
]>

Figure 10: An XML Document Type Definition

The DTD for an XML document can be either part of the document or in an external file. If it is external, the DOCTYPE statement still occurs in the document, with the argument "SYSTEM -filename-", where "-filename-" is the name of the file containing the DTD. For example, if the above DTD were in an external file called "xxx.dtd", the DOCTYPE statement would read:

<!DOCTYPE PURCHASE_ORDER SYSTEM "xxx.dtd">

The file xxx.dtd would then contain only the ELEMENT declarations themselves, without an enclosing DOCTYPE statement.

Note that the name specified in the DOCTYPE statement must be the same as the name of the highest level ELEMENT.
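A non-validating parser will not enforce this rule, so it can be worth checking by hand. The Python sketch below does so with the standard library's xml.dom.minidom; the function name doctype_matches_root and the cut-down two-element DTD are assumptions made to keep the example short.

```python
from xml.dom import minidom


def doctype_matches_root(text):
    """True when the DOCTYPE name equals the name of the top-level element."""
    document = minidom.parseString(text)
    doctype = document.doctype
    return doctype is not None and doctype.name == document.documentElement.tagName


# A deliberately reduced DTD, carried inside the document itself.
matching = """<!DOCTYPE PURCHASE_ORDER [
<!ELEMENT PURCHASE_ORDER (po_number)>
<!ELEMENT po_number (#PCDATA)>
]>
<PURCHASE_ORDER><po_number>743453</po_number></PURCHASE_ORDER>"""

# Same document, but the DOCTYPE now names a different element.
mismatched = matching.replace("<!DOCTYPE PURCHASE_ORDER", "<!DOCTYPE LINE_ITEM")

print(doctype_matches_root(matching))
print(doctype_matches_root(mismatched))
```

minidom reads the DOCTYPE statement but does not validate against it, which is why the mismatched document still parses and the comparison has to be made explicitly.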

Each element in the specification refers to a piece of information. XML doesn’t care whether it is an entity or an attribute in your data model. What it does care about in some cases is that the element may be defined by one or more predicates. A predicate is simply a piece of information about an element. This may be either an attribute or an entity in your data model. In the example above, PURCHASE_ORDER has as predicates ISSUED_TO_PARTY, po_number, order_date, and LINE_ITEM.

Cardinality/optionality

Relationships are represented by the attachment of predicates to elements. In the absence of any special characters, this means that there must be exactly one occurrence of each predicate for each occurrence of the parent element. If the predicate is followed by a "?", then the predicate is not required. If it is followed by a "*", it is not required, but if it occurs, it may have more than one occurrence. If it is followed by a "+", at least one occurrence is required and it may have more than one.

In the example in Figure 10, each purchase_order must have an issued_to_party, a "po_number" and an "order_date". In addition, a purchase_order may or may not have any line_items, but it could have more than one.

Each of the predicates is then defined in turn in one of the lines that follow. At the bottom of the tree in each case, "#PCDATA" ("parsed character data") means that the element will contain text that can be parsed by the software reading the document.
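The occurrence rules above can be sketched as a small Python check. This is not a DTD validator: the content model is copied by hand into a hypothetical list called PURCHASE_ORDER_MODEL, and the function cardinality_errors only counts occurrences, ignoring the order in which the predicates appear.

```python
import xml.etree.ElementTree as ET

# Occurrence indicator -> (minimum, maximum); None means no upper limit.
LIMITS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}

# Hand-written copy of the PURCHASE_ORDER content model (an assumption of
# this sketch; a real tool would read it from the DTD).
PURCHASE_ORDER_MODEL = [
    ("ISSUED_TO_PARTY", ""),
    ("po_number", ""),
    ("order_date", ""),
    ("LINE_ITEM", "*"),
]


def cardinality_errors(element, model):
    """Return one message per predicate whose count breaks its indicator."""
    errors = []
    for name, indicator in model:
        low, high = LIMITS[indicator]
        count = len(element.findall(name))
        if count < low or (high is not None and count > high):
            errors.append(f"{name}: found {count}, indicator '{indicator}'")
    return errors


good = ET.fromstring(
    "<PURCHASE_ORDER><ISSUED_TO_PARTY/><po_number>743453</po_number>"
    "<order_date>12 November, 1999</order_date></PURCHASE_ORDER>"
)
bad = ET.fromstring(
    "<PURCHASE_ORDER><po_number>743453</po_number>"
    "<po_number>743453</po_number></PURCHASE_ORDER>"
)

print(cardinality_errors(good, PURCHASE_ORDER_MODEL))
print(cardinality_errors(bad, PURCHASE_ORDER_MODEL))
```

The first document passes even though it has no line items, because "*" allows zero occurrences. The second fails on three counts: no issued_to_party, two po_numbers, and no order_date.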

Names

Names in XML may not have spaces. XML is case sensitive. DTD keywords such as DOCTYPE and ELEMENT are in all uppercase. The case of a tag name in an element definition must be the same as was used when the element appeared as a predicate, and the case of an element used in an XML document must be the same as in its DTD definition.
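Case sensitivity is easy to see with a parser. In the Python sketch below, which uses the standard library's xml.etree.ElementTree, the helper name parses is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

party = ET.fromstring(
    "<ISSUED_TO_PARTY><name>Acme Sporting Goods</name></ISSUED_TO_PARTY>"
)

# "name" and "NAME" are two different element names.
lower = party.find("name")
upper = party.find("NAME")


def parses(text):
    """True when the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(lower is not None, upper is None)
print(parses("<po_number>743453</po_number>"))  # same case in both tags
print(parses("<po_number>743453</PO_NUMBER>"))  # end tag in a different case
print(parses("<po number>743453</po number>"))  # space inside the name
```

The last two documents are rejected outright, so a mismatch in case or a space in a name is caught before any DTD is consulted.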

Note that there is nothing in XML to prevent you from specifying multi-valued attributes, but in the interest of coherence for the data structure, following the rules of normalization is strongly recommended. By convention in the above example, elements that would be entities in an entity/relationship model appear in upper case. Elements that would appear in that model as attributes are in lower case. Actual naming conventions will vary.

Unique identifiers

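Element declarations of the kind shown here give XML no way to recognize unique identifiers. (DTDs do provide an ID attribute type, whose values must be unique within a document, but it applies only to XML attributes attached to a tag, not to the content of elements such as party_id.)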
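Uniqueness of element content therefore has to be checked by software written for the purpose. The Python sketch below is one way this might look; the function name duplicate_values, the wrapper element ORDERS, and the second order number 743454 are all made up for this illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def duplicate_values(root, path):
    """Return, sorted, every text value that occurs more than once at `path`."""
    counts = Counter((el.text or "").strip() for el in root.iterfind(path))
    return sorted(value for value, n in counts.items() if n > 1)


# ORDERS is a hypothetical wrapper holding several purchase orders.
orders = ET.fromstring(
    "<ORDERS>"
    "<PURCHASE_ORDER><po_number>743453</po_number></PURCHASE_ORDER>"
    "<PURCHASE_ORDER><po_number>743454</po_number></PURCHASE_ORDER>"
    "<PURCHASE_ORDER><po_number>743453</po_number></PURCHASE_ORDER>"
    "</ORDERS>"
)

print(duplicate_values(orders, "PURCHASE_ORDER/po_number"))
```

The parser accepts this document without complaint; only the extra check reveals that one po_number appears twice.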

Sub-types

XML has no way to recognize sub-types and super-types. Note in the example above that the attributes of issued_to_party had to include both attributes of person and attributes of organization from our other models, with "party_type" indicating which case is involved. Similarly, the attribute "product_service_indicator" was included in line_item to determine whether a product or a service is involved. Software would be required to enforce this.
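The kind of software enforcement meant here might look like the following Python sketch. The function name subtype_errors and the rule it applies (exactly one of PRODUCT or SERVICE, agreeing with the indicator) are assumptions for illustration, not something the DTD states.

```python
import xml.etree.ElementTree as ET

# Which sub-element each indicator value is expected to go with.
EXPECTED = {"product": "PRODUCT", "service": "SERVICE"}


def subtype_errors(line_item):
    """Check that product_service_indicator agrees with the sub-element present."""
    indicator = (line_item.findtext("product_service_indicator") or "").strip()
    present = [tag for tag in ("PRODUCT", "SERVICE") if line_item.find(tag) is not None]
    expected = EXPECTED.get(indicator)
    if expected is None:
        return [f"unknown indicator {indicator!r}"]
    if present != [expected]:
        return [f"indicator {indicator!r} expects only {expected}, found {present}"]
    return []


consistent = ET.fromstring(
    "<LINE_ITEM><product_service_indicator>product</product_service_indicator>"
    "<PRODUCT><product_code>X-23</product_code></PRODUCT></LINE_ITEM>"
)
inconsistent = ET.fromstring(
    "<LINE_ITEM><product_service_indicator>service</product_service_indicator>"
    "<PRODUCT><product_code>X-23</product_code></PRODUCT></LINE_ITEM>"
)

print(subtype_errors(consistent))
print(subtype_errors(inconsistent))
```

Both line items satisfy the DTD, since PRODUCT and SERVICE are each simply optional there. Only the extra check notices that the second one claims to be a service while carrying product data.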

Constraints between relationships

XML has no way to describe constraints between relationships.

Comments

As noted above, XML isn’t really a data modeling language. It is not very sophisticated in its ability to represent the finer points of data structure. It shares the limitations of a relational database, for example, with no ability to recognize sub-types or constraints. It is being recognized, however, as a very powerful way to describe the essence of data structures, and to be used as a template for transmitting data from one place to another.

While the tag structure does seem to be a good vehicle for describing and communicating database structure, the requirement for discipline in the way we organize data is more present than ever. XML doesn’t care if we have repeating groups, monstrous data structures, or whatever. If we are to use XML to express a data structure, it is incumbent upon us to do as good a job with the tool as we can. (This is of course true of any modeling technique.)

Following in the tradition of the chemists and astronomers mentioned above, the Object Management Group (OMG) has settled on a set of XML tags they call the XML Metadata Interchange (XMI) as a way to describe in standard terms the structure of data about data ("metadata"). This is useful in communicating between CASE tools, and in describing a "metadata repository". Along the same lines, a group of companies is in the process of defining a Common Warehouse Metadata Interchange (CWMI) that comprises a subset of the XMI tags to support data warehouses.

This means that there are actually two ways that a database structure can be described in XML:

First, an application database can be described in the DTD of an XML document. In this case the operational data contained in the described database could be placed between sets of the described tags. The DTD could, for example, be generated by one CASE tool and read by another one as a way of communicating data structure from one to the other.

A second approach is to make the table and column definitions data that appear between tags of an XMI metamodel. This is a little more arcane, since the XMI metamodel is very abstract, but using the XMI metamodel allows for description of much more than tables and columns.
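To make the first of these two approaches concrete, the Python sketch below generates DTD text from a simple description of a data structure, roughly as a CASE tool might. The function name dtd_from_tables and the dictionary layout are assumptions of this sketch; the element names are those used in this article's example.

```python
def dtd_from_tables(root_name, tables):
    """Build DTD text from `tables`, which maps each parent element to its
    list of predicates (name plus optional ?, * or + indicator). Any name
    that is not itself a key of `tables` is declared as #PCDATA."""
    lines = [f"<!DOCTYPE {root_name} ["]
    for parent, children in tables.items():
        lines.append(f"<!ELEMENT {parent} ({', '.join(children)})>")
    declared = set()
    for children in tables.values():
        for child in children:
            name = child.rstrip("?*+")
            # Declare each leaf element once, even if several parents use it.
            if name not in tables and name not in declared:
                lines.append(f"<!ELEMENT {name} (#PCDATA)>")
                declared.add(name)
    lines.append("]>")
    return "\n".join(lines)


tables = {
    "PURCHASE_ORDER": ["ISSUED_TO_PARTY", "po_number", "order_date", "LINE_ITEM*"],
    "ISSUED_TO_PARTY": ["party_id", "name", "party_type", "surname?", "corporate_mission?"],
    "LINE_ITEM": ["line_number", "quantity", "price",
                  "product_service_indicator", "PRODUCT?", "SERVICE?"],
    "PRODUCT": ["product_code", "description", "unit_price"],
    "SERVICE": ["service_id", "description", "rate_per_hour"],
}

dtd_text = dtd_from_tables("PURCHASE_ORDER", tables)
print(dtd_text)
```

The generator does nothing to improve the design it is given. Repeating groups or a badly chosen structure in the input would be written out just as faithfully.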

Note, however, that the issue in defining a metadata repository or communicating between CASE tools is not the use of XML or any other particular language. The issue is the database structure and its semantics. The important question is not how a universal metadata repository will be represented. It could as easily be represented by a set of relational tables or an entity/relationship diagram. The questions are, what’s in it and what does it mean? XML by itself does not answer that question. Which objects are significant and should be described? That is the harder question. Having a new language for describing them doesn't seem to contribute to that conversation.

Indeed, in recognizing that XML is a good vehicle for describing database structure, the issue that seems most obvious is that this will put greater responsibility on data administrators to define data correctly. XML will not do that. XML will only record whatever data design (good or bad) human beings come up with.

As Clive Finkelstein has said, the advent of XML is going to make data modelers and designers even more important than they are now. "After fifteen years of obscurity, data modelers can finally become overnight successes." [Finkelstein, 1999]