A Comparison of Data Modeling Techniques

David C. Hay

[This is a revision of a paper by the same title written in 1995. In addition to stylistic updates, this paper replaces all the object modeling techniques with the UML – a new technique that is intended to replace at least all these.]

Peter Chen first introduced entity/relationship modeling in 1976 [Chen 1977]. It was a brilliant idea that has revolutionized the way we represent data. It was only a first version, however, and many people since then have tried to improve on it. As a result, a veritable plethora of data modeling techniques has been developed.

Things became more complicated in the late 1980s with the advent of a variation on this theme called "object modeling". The net effect was that there were now even more ways to model the structure of data. This was mitigated somewhat in the mid-1990s with the introduction of the UML, a modeling technique intended to replace at least all the "object modeling" ones. As will be seen in this article, it is not quite up to replacing other entity/relationship approaches, but it has had a dramatic effect on the object modeling world.

This article is intended to present the most important of these techniques and to provide a basis for comparing them with each other.

Regardless of the symbols used, data or object modeling is intended to do one thing: describe the things about which an organization wishes to collect data, along with the relationships among them. For this reason, all of the commonly used systems of notation fundamentally are convertible one to another. The major differences among them are aesthetic, although some make distinctions that others do not, and some do not have symbols to represent all situations.

This is true for object modeling notations as well as entity/relationship notations.

There are actually three levels of conventions to be defined in the data modeling arena. The first is syntactic, about the symbols to be used. These conventions are the primary focus of this article. The second is positional, about the organization of model diagrams: positional conventions dictate how entities are laid out. These will be discussed at the end of the article. The third is semantic, about how the meaning of a model may be conveyed: semantic conventions describe standard ways of representing common business situations. These are not discussed here, but you can find more information about them in books by David Hay [1996] and Martin Fowler [1997].

These three sets of conventions are, in principle, completely independent of each other. Given any of the syntactic conventions described here, you can follow any of the available positional or semantic conventions. In practice, however, promoters of each syntactic convention typically also promote particular positional conventions.

In evaluating syntactic conventions, it is important to remember that data modeling has two audiences. The first is the user community, which uses the models and their descriptions to verify that the analysts in fact understand their environment and their requirements. The second audience is the set of systems designers, who use the business rules implied by the models as the basis for their design of computer systems.

Different techniques are better for one audience or the other. Models used by analysts must be clear and easy to read. This often means that these models describe less than the full extent of detail available. First and foremost, they must be accessible to a non-technical viewer. Models for designers, on the other hand, must be as complete and rigorous as possible.

The evaluation, then, will be based both on the technical completeness of each technique and on its readability.

Technical completeness is assessed in terms of how each technique represents:

Entities and attributes
Relationships
Unique identifiers
Sub-types and super-types
Constraints between relationships
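
As a rough illustration of what these five criteria cover, the sketch below records each of them as plain Python data, independent of any diagram notation. It is not taken from any of the techniques compared here: the class names, the fields, and the small ordering example are all hypothetical, chosen only to show that entities and attributes, relationships, unique identifiers, sub-types, and a constraint between relationships can each be stated explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# All names below are hypothetical; this is a sketch, not a standard metamodel.
@dataclass
class Entity:
    name: str
    attributes: List[str] = field(default_factory=list)
    unique_identifier: List[str] = field(default_factory=list)  # subset of attributes
    supertype: Optional[str] = None  # name of the super-type entity, if any


@dataclass
class Relationship:
    name: str
    source: str      # entity the relationship is read from
    target: str      # entity at the other end
    mandatory: bool  # must every source occurrence take part?
    many: bool       # may one source occurrence relate to several targets?


@dataclass
class ExclusiveConstraint:
    # An occurrence takes part in exactly one of the named relationships.
    relationships: List[str]


@dataclass
class Model:
    entities: List[Entity]
    relationships: List[Relationship]
    constraints: List[ExclusiveConstraint] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a description of every dangling reference in the model."""
        entity_names = {e.name for e in self.entities}
        relationship_names = {r.name for r in self.relationships}
        found = []
        for e in self.entities:
            for attr in e.unique_identifier:
                if attr not in e.attributes:
                    found.append(f"{e.name}: identifier uses undeclared attribute {attr!r}")
            if e.supertype is not None and e.supertype not in entity_names:
                found.append(f"{e.name}: unknown super-type {e.supertype!r}")
        for r in self.relationships:
            for end in (r.source, r.target):
                if end not in entity_names:
                    found.append(f"{r.name}: unknown entity {end!r}")
        for c in self.constraints:
            for rel in c.relationships:
                if rel not in relationship_names:
                    found.append(f"constraint: unknown relationship {rel!r}")
        return found


# A small, made-up ordering model exercising all five criteria.
model = Model(
    entities=[
        Entity("Party", ["party_id", "name"], ["party_id"]),
        Entity("Person", ["birth_date"], supertype="Party"),
        Entity("Organization", ["purpose"], supertype="Party"),
        Entity("Order", ["order_number", "order_date"], ["order_number"]),
        Entity("Line Item", ["line_number", "quantity"], ["line_number"]),
        Entity("Product", ["product_code"], ["product_code"]),
        Entity("Service", ["service_code"], ["service_code"]),
    ],
    relationships=[
        Relationship("placed by", "Order", "Party", mandatory=True, many=False),
        Relationship("for product", "Line Item", "Product", mandatory=False, many=False),
        Relationship("for service", "Line Item", "Service", mandatory=False, many=False),
    ],
    constraints=[ExclusiveConstraint(["for product", "for service"])],
)

print(model.problems())  # an empty list means every reference resolves
```

The `problems` method checks only that names refer to things that exist. A notation that lacks a symbol for, say, the exclusive constraint would simply have no way to draw the last element of this structure, which is the kind of gap the completeness criteria are meant to expose.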

A technique’s readability is characterized by its graphic treatment of relationship lines and entity boxes, as well as its adherence to the general principles of good graphic design. Among the most important of these principles is that each symbol should have only one meaning, which applies wherever that symbol is used, and that each concept should be represented by only one symbol. Moreover, a diagram should not be cluttered with more symbols than are absolutely necessary, and the graphics in a diagram should be intuitively expressive of the concepts involved. [See Hay 98.]

Each technique has strengths and weaknesses in the way it addresses each audience. As it happens, most are oriented more toward designers than toward the user community. These techniques produce models that are very intricate and focus on making sure that all possible constraints are described. Alas, this is often at the expense of readability.

This document presents seven notation schemes. For comparison purposes, the same example model is presented using each technique. Note that the UML is billed as an "object modeling" technique, rather than as a data (entity/relationship) modeling technique, but as you will see, its structure is fundamentally the same. The comparison is in terms of each technique’s symbols for describing entities (or "object classes", for the UML), attributes, relationships (or object-oriented "associations"), unique identifiers, sub-types, and constraints between relationships.

At the end of the individual discussions is your author’s argument in favor of Mr. Barker’s approach for use in requirements analysis, along with his argument in favor of the UML to support design.

The following notations are presented here:

Peter Chen’s original entity/relationship diagrams

Information Engineering

Richard Barker’s notation (used by the Oracle Corporation)

IDEF1X

Object Role Modeling (ORM)

The Unified Modeling Language (UML)

Extensible Markup Language (XML)

Recommendations