Data Modeling Recommendations

Recommendations

Because the orientation and purposes of data modeling are very different when supporting analysis than when supporting design, no single modeling technique currently available is appropriate for both. The techniques with the best aesthetics describe fewer aspects of the problem than the more complete techniques, which in turn are much less accessible.

The one exception is Object Role Modeling, which is both rich in detail and relatively easy to read. Because this technique is radically different from the other modeling approaches, it has been less successful in gaining acceptance.

Among those using the more common entity/relationship view of the world, Richard Barker’s notation is clearly superior as a vehicle for discussing models with prospective system users, and the UML has advantages in supporting design – particularly object-oriented design.

For Analysis – Richard Barker’s Notation

There are several arguments in favor of Mr. Barker’s data modeling syntax for use in requirements analysis:

Aesthetic simplicity

This notation is the easiest to present to a user audience. It is the simplest and clearest among those that are as complete. By using fewer kinds of symbols, Barker’s technique keeps drawings relatively uncluttered, and fewer kinds of elements have to be understood. Simpler, less cluttered diagrams are more accessible to non-technical managers and other end-users.

It uses a line in two parts, each of which may be dashed or solid, to convey the entire set of optional or mandatory aspects of the relationship pair. The presence or absence of a crow’s foot is all that is necessary to represent the upper limit of a relationship. The single symbol of a split line which is either solid or dashed, plus the presence or absence of a crow’s foot, is aesthetically simpler than, say, James Martin's notation, which requires combinations of four separate symbols to convey the same information.

In Barker’s notation, the "dashedness" or solidness of a line (its most visible aesthetic quality) represents the optionality of the relationship, which is its most important characteristic to most users. IDEF1X, on the other hand, uses "dashedness" to represent whether a relationship is part of a unique identifier.

Other systems of notation add symbols unnecessarily. Chen’s notation uses different symbols for objects that are implementations of relationships and objects that are tangible entities, and it uses separate symbols for each attribute. IDEF1X distinguishes between "dependent" entities and "independent" ones, and it uses different symbols at the different ends of relationships. The UML designates certain kinds of relationships ("part of" and "member of") by either of two special symbols, depending on the referential integrity constraint in effect.

In each case, the additional symbols merely add to the complexity of a diagram and make it more impenetrable, without communicating anything that is not already contained in the simpler notation and names of Barker’s notation.

James Martin’s version of Information Engineering is the only notation other than Barker’s that represents sub-types inside super-types, thereby reinforcing the fact that each sub-type is a sub-set of its super-type, and saving diagram space in the process.

Also, other techniques introduce extra complexity by allowing relationship lines to meander all over the diagram. Barker’s notation calls for a specific approach to layout which keeps relationship lines short and straight.

Completeness

Most of the techniques show the same things that Barker’s notation does, although some are more complete than others. Each of them lacks something that Barker’s notation has.

Information Engineering does not show attributes; IDEF1X does not show constraints; only Mr. Martin’s version of Information Engineering shows sub-types within super-types. Mr. Chen’s notation, Information Engineering, and UML do not show unique identifiers. Only ORM has all of the same features that the Barker method has, but with its external attributes and sub-types it uses far too much space on the diagram.

In fairness, some of the techniques do things that Barker’s does not. IDEF1X, ORM, and the UML show non-exhaustive sub-types, where the sub-types do not represent all occurrences of the super-type. (Barker’s technique deals with this only indirectly — by defining a sub-type called other . . .). The UML also shows non-exclusive sub-types, where an occurrence of the super-type can be an occurrence of more than one sub-type. Information Engineering and the UML also show non-exclusive constraints between relationships, not available in Barker’s technique.

These are all useful things.
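
The distinction between exhaustive and non-exhaustive sub-types, and between exclusive and non-exclusive ones, can be made concrete with a small sketch. The Python below is illustrative only: the function name, the occurrence identifiers, and the sub-type names are assumptions introduced here and are not part of any of the notations discussed.

```python
def classify_subtypes(supertype, subtypes):
    """Return (exhaustive, exclusive) for the sub-types of one super-type.

    supertype: set of all occurrences of the super-type
    subtypes:  dict mapping sub-type name -> set of its occurrences
    """
    covered = set().union(*subtypes.values())
    # Exhaustive: every occurrence of the super-type is in some sub-type.
    exhaustive = covered == supertype
    # Exclusive: no occurrence is counted in more than one sub-type.
    exclusive = sum(len(members) for members in subtypes.values()) == len(covered)
    return exhaustive, exclusive


# Hypothetical occurrences of a PARTY super-type.
parties = {"p1", "p2", "p3", "p4"}
subtypes = {"person": {"p1", "p2"}, "organization": {"p3"}}
print(classify_subtypes(parties, subtypes))  # non-exhaustive: p4 is in no sub-type

# The indirect approach described above: add an explicit "other" sub-type
# that holds whatever the named sub-types leave out.
subtypes["other"] = parties - set().union(*subtypes.values())
print(classify_subtypes(parties, subtypes))  # now exhaustive and exclusive
```

A notation that supports non-exhaustive sub-types directly lets the first situation be drawn as it is; in Barker’s notation the "other" sub-type has to be added to the model itself.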

The addition of processing logic to data models in the manner of object-modeling techniques (including behavior in the model) is also a very powerful idea. Clearly provision for describing the behavior of an entity is something that could be added to Barker’s notation. Whether it is more appropriate to extend this notation, in the manner of the UML, or to use separate models, such as entity life histories and state/transition diagrams, remains to be seen.

Language

Barker’s notation requires the analyst to describe relationships succinctly and in clear, grammatically sound, easy-to-understand English. As mentioned above, where all the other techniques use verbs and verb phrases as relationship names, Barker’s notation uses prepositional phrases. This is more appropriate, since the preposition is the part of speech that describes relationships. Verbs describe not relationships but actions, which makes them more appropriate for function models than data models. To use a verb to describe a relationship is to say that the relationship is defined by actions taken on the two entities. It is better simply to describe the nature of the relationship itself.

Using verbs makes it impossible to construct a clean, natural English sentence that completely describes the relationship. "Each party sells in zero, one or more purchase orders," is not a sentence one would normally use in conversation.
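
By contrast, a relationship named with a prepositional phrase can be read mechanically as a sentence. The sketch below is only an illustration of that idea, assuming the usual reading of the notation in which a dashed half-line is read "may be", a solid half-line "must be", a crow’s foot "one or more", and its absence "one and only one". The class, its field names, and the phrases "the vendor in" and "to" are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipEnd:
    """One half of a relationship line, read from `subject` toward `target`."""
    subject: str        # entity the sentence starts from
    phrase: str         # prepositional phrase naming the relationship
    target: str         # singular name of the entity at the far end
    target_plural: str  # plural name, used when the upper limit is "many"
    optional: bool      # True = dashed half-line, False = solid half-line
    many: bool          # True = crow's foot at the far end


def sentence(end: RelationshipEnd) -> str:
    """Render one half of the relationship as an English sentence."""
    modality = "may be" if end.optional else "must be"
    if end.many:
        quantity, noun = "one or more", end.target_plural
    else:
        quantity, noun = "one and only one", end.target
    return f"Each {end.subject} {modality} {end.phrase} {quantity} {noun}."


# Hypothetical relationship between PARTY and PURCHASE ORDER.
party_side = RelationshipEnd("party", "the vendor in", "purchase order",
                             "purchase orders", optional=True, many=True)
order_side = RelationshipEnd("purchase order", "to", "party",
                             "parties", optional=False, many=False)

print(sentence(party_side))
print(sentence(order_side))
```

Because the verb "to be" is supplied by the template, the analyst only has to provide the prepositional phrase, and both directions of the relationship come out as ordinary sentences.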

Moreover, finding the right prepositional phrase to capture the precise meaning of the relationship is often more difficult than finding a verb that approximately gets the idea across. The requirement to use prepositions therefore adds a level of discipline to the analyst’s assignment. The analyst must understand the relationship very well to come up with exactly the right name for it.

(The Hitchhiker’s Guide to the Galaxy was reported to have once been sued for saying that "Ravenous Bugblatter Beasts often make a very good meal for visiting tourists," when it should have said that "Ravenous Bugblatter Beasts often make a very good meal of visiting tourists."^[Adams 1982] Using exactly the right word is important.)

Correctly naming relationships often reveals that in fact there is more than one.

This requirement for well-built relationship sentences, then, improves the precision of the resulting model. Mr. Barker’s naming conventions could be used with any of the modeling techniques, but the other techniques do not encourage analysts to do so.

For Design – The UML

While Mr. Barker’s notation is preferred as a requirements analysis tool, UML is more complete and detailed, and therefore the most suited to support design – particularly object-oriented design.

The UML’s method for annotating optionality and cardinality is much more expressive of different circumstances than that of any of the other techniques. It can specifically say that an occurrence of an entity is related to 1, 7-9, or 10 occurrences of another entity.

The UML can describe many more constraints between relationships than can other notations. With proper annotation, it can describe both exclusive and inclusive "or" relationships, or any other constraint that can be named.
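
To show what such a cardinality annotation can express, here is a minimal sketch of checking an occurrence count against one. It assumes the UML convention of writing a range as "7..9" and an unbounded upper limit as "*", and it accepts a comma-separated list of ranges so that the example from the text can be written as "1, 7..9, 10". The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_multiplicity(spec: str) -> List[Tuple[int, Optional[int]]]:
    """Parse e.g. "1, 7..9, 10" or "0..*" into (lower, upper) pairs.

    An upper bound of None means "unbounded" (written "*").
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if part == "*":
            ranges.append((0, None))
        elif ".." in part:
            low, high = part.split("..")
            upper = None if high.strip() == "*" else int(high)
            ranges.append((int(low), upper))
        else:
            ranges.append((int(part), int(part)))
    return ranges


def allows(spec: str, count: int) -> bool:
    """True if `count` related occurrences satisfy the multiplicity `spec`."""
    return any(
        low <= count and (high is None or count <= high)
        for low, high in parse_multiplicity(spec)
    )


print(allows("1, 7..9, 10", 8))  # True: 8 falls in the range 7..9
print(allows("1, 7..9, 10", 5))  # False: 5 is in none of the listed ranges
```

A crow’s-foot notation can only distinguish "zero or one", "exactly one", and "many"; a rule such as the one above has to be recorded outside the diagram.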

For business rules that are not simple relationships between two associations, UML introduces a small flag that can include text describing any business rule.

Attributes can be described in more detail than in other notations.

Overlapping and incomplete configurations of sub-types are allowed.

"Multiple inheritance", where a sub-type may belong to more than one super-type, is permitted, as are multiple type hierarchies. While these may not be desirable in analysis models, they could be useful as solutions to particular design problems.

In an object-oriented environment, the extra symbols address specific object-oriented situations.

Summary

The ideal CASE tool, then, will be one which supports Mr. Barker’s techniques for doing requirements analysis, then has the facilities for converting entity definitions into either 1) table definitions or 2) class definitions that can be used by C++ or a similar language. It would then have the ability to represent these design artifacts in the UML for further refinement.
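
The conversion step described here can be sketched in a few lines. The Python below is a hedged illustration, not a description of any existing CASE tool: the Entity and Attribute structures, the type names, and the sample entity are all assumptions. It takes one entity definition and emits either a table definition or a C++-style class definition as text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    sql_type: str              # type to use in the table definition
    cpp_type: str              # type to use in the class definition
    mandatory: bool = True
    identifying: bool = False  # part of the entity's unique identifier


@dataclass
class Entity:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


def to_table(entity: Entity) -> str:
    """Emit a CREATE TABLE statement for the entity."""
    columns = [
        f"    {a.name} {a.sql_type}{' NOT NULL' if a.mandatory else ''}"
        for a in entity.attributes
    ]
    key = [a.name for a in entity.attributes if a.identifying]
    if key:
        columns.append(f"    PRIMARY KEY ({', '.join(key)})")
    return f"CREATE TABLE {entity.name} (\n" + ",\n".join(columns) + "\n);"


def to_class(entity: Entity) -> str:
    """Emit a C++-style struct; optionality is not carried over in this sketch."""
    members = [f"    {a.cpp_type} {a.name};" for a in entity.attributes]
    return f"struct {entity.name} {{\n" + "\n".join(members) + "\n};"


# Hypothetical entity definition captured during requirements analysis.
purchase_order = Entity("purchase_order", [
    Attribute("po_number", "INTEGER", "int", identifying=True),
    Attribute("order_date", "DATE", "std::string"),
    Attribute("note", "VARCHAR(200)", "std::string", mandatory=False),
])

print(to_table(purchase_order))
print(to_class(purchase_order))
```

Both outputs are derived from the same analysis-level definition, which is the point of the recommendation: the requirements model stays in one notation, and the design artifacts are generated from it for refinement elsewhere.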