Advanced Data Model Patterns

David C. Hay

The book Data Model Patterns: Conventions of Thought [1] describes a set of standard data models that can be applied to standard business situations. These patterns, it turns out, occur on several levels. At the basic level are models of the things seen in business. The patterns in the book are a bit more abstract than conventionally seen, but they do describe things that are easily recognizable to anyone: people and organizations, products, contracts, and so forth.

There is a more abstract level of modeling, however, which is necessary when the things being modeled don't fall into these tidy categories. This level, also described in the book, is the subject of this paper.

The Basic Model

Before getting into the more exotic models, it is useful to be sure we understand the basic patterns that will apply to nearly all organizations. Each real organization will have variations on this model, but here you will find the elements that will be present in nearly every one. Figure 1, for example, shows that the entity PARTY encompasses PERSON and ORGANIZATION. That is, a PERSON and an ORGANIZATION are each things of significance, and if you want to refer to either, you can refer to a PARTY.

PARTIES may be related to each other, as shown by the entity PARTY RELATIONSHIP. This is simply the fact that one PARTY has a specified relationship with another, as in a reporting structure, employment, marriage, membership in a club, etc. That is, each PARTY RELATIONSHIP must be from one PARTY and to another PARTY.

A PARTY may have more than one address. Each address is shown in this model as a SITE, where each PARTY may be located via one or more PARTY PLACEMENTS in a SITE. Each SITE must be in one or more GEOGRAPHIC AREAS, such as a city or region.
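The entities described so far can be sketched in a few lines of Python. This is only an illustration of the pattern, not a prescribed implementation: the class and attribute names (`name`, `kind`, `address`) and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Party:
    """Super-type: anything we might refer to as a PARTY."""
    name: str


@dataclass
class Person(Party):
    pass


@dataclass
class Organization(Party):
    pass


@dataclass
class PartyRelationship:
    """The fact that one PARTY is related to another."""
    kind: str            # hypothetical attribute, e.g. "employment"
    from_party: Party
    to_party: Party


@dataclass
class GeographicArea:
    name: str


@dataclass
class Site:
    address: str
    areas: List[GeographicArea]   # each SITE is in one or more areas


@dataclass
class PartyPlacement:
    """The fact that a PARTY is located at a SITE."""
    party: Party
    site: Site


# Invented sample data.
acme = Organization("Acme Corp")
alice = Person("Alice")
employment = PartyRelationship("employment", from_party=alice, to_party=acme)
hq = Site("1 Main St", [GeographicArea("Houston")])
placements = [PartyPlacement(acme, hq), PartyPlacement(alice, hq)]
```

Because both `Person` and `Organization` are kinds of `Party`, the relationship and placement classes need no knowledge of which sub-type they are pointing at.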

Figure 1: People and Organizations

Figure 2 shows the "stuff" a company deals with. Here it is called PRODUCT TYPE and PRODUCT INSTANCE. It could be called "asset type" and "asset", "item type" and "item occurrence", or something similar. Note the distinction between PRODUCT INSTANCE, a physical example of the product, and PRODUCT TYPE, which is the definition of it, such as you would see in a catalogue. Each PRODUCT INSTANCE must be an example of one and only one PRODUCT TYPE, while each PRODUCT TYPE may be embodied in one or more PRODUCT INSTANCES.

A PRODUCT STRUCTURE ELEMENT is the fact that one PRODUCT TYPE may have another PRODUCT TYPE as a component. Each PRODUCT STRUCTURE ELEMENT, then, must be the use of one PRODUCT TYPE in another PRODUCT TYPE. Thus an assembly may have three sub-assemblies as components, and this would be represented by three PRODUCT STRUCTURE ELEMENT occurrences, where the assembly is "the assembly in" and each sub-assembly is "the component in" its PRODUCT STRUCTURE ELEMENT.
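The assembly example can be sketched as follows. The names `ProductType`, `ProductStructureElement`, and `components_of` are hypothetical, and the sample products are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductType:
    name: str


@dataclass(frozen=True)
class ProductStructureElement:
    """The use of one PRODUCT TYPE (component) in another (assembly)."""
    assembly: ProductType
    component: ProductType


def components_of(assembly, elements):
    """Return the component PRODUCT TYPES used directly in an assembly."""
    return [e.component for e in elements if e.assembly == assembly]


# One assembly with three sub-assemblies: three structure elements.
bicycle = ProductType("bicycle")
frame = ProductType("frame")
wheel_set = ProductType("wheel set")
drive_train = ProductType("drive train")

elements = [
    ProductStructureElement(bicycle, frame),
    ProductStructureElement(bicycle, wheel_set),
    ProductStructureElement(bicycle, drive_train),
]
```

Calling `components_of(bicycle, elements)` returns the three sub-assemblies; a sub-assembly with no structure elements of its own simply returns an empty list.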

Note that a PRODUCT INSTANCE may be either a DISCRETE ITEM which is kept track of individually, or an INVENTORY which is a collection of items. In either case, each product instance must be at a SITE.

Figure 2: Product Types

Figure 3 shows AGREEMENT, where an AGREEMENT is any formal relationship between two PARTIES. Typically, this is a purchase order or a sales order, but it may encompass other kinds of agreements as well. Invariably, our ORGANIZATION is one of the PARTIES – either the buyer in the AGREEMENT if it is a purchase order, or the seller in the AGREEMENT if it is a sales order.

Each AGREEMENT must be composed of one or more LINE ITEMS, where each line item is for a PRODUCT TYPE.

Figure 3: Agreements

ACTIVITIES are the things the organization does to carry out its business. This is shown in Figure 4. As with PRODUCT TYPES and PRODUCT INSTANCES, there is a distinction drawn between ACTIVITY TYPES (the definition of what is to be done) and ACTIVITIES (the actual doing of it). Attributes of an ACTIVITY TYPE include its description and a standard length of time it is expected to require, while attributes of ACTIVITY include the actual date it occurred and the actual time it took.

Actual ACTIVITIES consume people's time (recorded in TIMESHEET ENTRY), and other resources (recorded in RESOURCE USAGES). Each TIMESHEET ENTRY must be by a PERSON, and charged to an ACTIVITY. Each RESOURCE USAGE must be of either a PRODUCT INSTANCE or a PRODUCT TYPE, and must be charged to an ACTIVITY.
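The "either ... or" constraint on RESOURCE USAGE can be enforced when the record is created. The sketch below assumes the choice is exclusive (exactly one of the two), which is how the sentence above reads; the class and field names, and the use of plain strings to stand in for the related entities, are simplifications for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceUsage:
    """A use of a product, charged to an ACTIVITY."""
    activity: str                             # stand-in for an ACTIVITY
    product_instance: Optional[str] = None    # stand-in for a PRODUCT INSTANCE
    product_type: Optional[str] = None        # stand-in for a PRODUCT TYPE

    def __post_init__(self):
        # Assumption: exactly one of the two references must be present.
        if (self.product_instance is None) == (self.product_type is None):
            raise ValueError(
                "a resource usage must be of either a product instance "
                "or a product type, not both and not neither"
            )
```

A usage that names only a product type, or only a product instance, is accepted; one that names both or neither raises `ValueError`.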

ACTIVITIES may be grouped into WORK ORDERS for various purposes, but a common one is to produce a PRODUCT TYPE. The definition of the standard steps required to produce a PRODUCT TYPE is a set of ROUTING STEPS, where each ROUTING STEP must be the use of an ACTIVITY TYPE to make the PRODUCT TYPE.

Figure 4: Activities

Parameters

The above model is a good start, but it is not adequate to describe certain common situations. For example, there is a problem with PRODUCT TYPE and PRODUCT INSTANCE. For each of these to be an entity suggests that the attributes for all occurrences of each are the same. This simply is not true.

The attributes of a compressor are quite different from the attributes of a computer or a barrel of crude oil. We would like to have a single concept for "Product", but that concept has many different flavors.

We could define a sub-type for each PRODUCT TYPE, but new product types are being invented all the time, and the data management task would be impossible.

To address this, we introduce the entity PARAMETER, as shown in Figure 5. A PARAMETER is a characteristic that is used to define a PRODUCT TYPE.

A PARAMETER ASSIGNMENT is the fact that a particular PARAMETER is used to define a particular PRODUCT TYPE. To wit: each PARAMETER ASSIGNMENT must be from a PARAMETER to a PRODUCT TYPE. For example, the PARAMETER "capacity" might be used to describe a boiler, while the PARAMETER "interest rate" might be used to define a savings account.

(Yes, one of the advantages of this approach is that it works as well for banks as it does for nuclear power plants.)

Note that the PARAMETER may be expressed in a UNIT OF MEASURE. That boiler "capacity" for example, might be in "cubic feet". The UNIT OF MEASURE that is the term for a PARAMETER ASSIGNMENT can override the default UNIT OF MEASURE of the PARAMETER by itself. When "capacity" is applied to a disk drive, for example, the UNIT OF MEASURE would be "megabytes".
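The unit-of-measure override can be sketched like this. The class names, the `default_unit` and `unit` attributes, and the `effective_unit` property are assumptions chosen for the example; only the boiler and disk drive cases come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Parameter:
    name: str
    default_unit: Optional[str] = None   # default UNIT OF MEASURE


@dataclass
class ParameterAssignment:
    """The fact that a PARAMETER is used to define a PRODUCT TYPE."""
    parameter: Parameter
    product_type: str                    # stand-in for a PRODUCT TYPE
    unit: Optional[str] = None           # optional override

    @property
    def effective_unit(self):
        # The assignment's unit wins; otherwise use the parameter's default.
        if self.unit is not None:
            return self.unit
        return self.parameter.default_unit


capacity = Parameter("capacity", default_unit="cubic feet")
boiler = ParameterAssignment(capacity, "boiler")
disk = ParameterAssignment(capacity, "disk drive", unit="megabytes")
```

Here `boiler.effective_unit` falls back to "cubic feet", while `disk.effective_unit` is "megabytes".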

Figure 5: Parameter Assignments

Note that in Figure 6 three kinds of parameters are shown: A DISCRETE LIST is a PARAMETER that can take only one of a specified set of PARAMETER ALLOWABLE VALUES. For example, a "pharmacological category" for a pharmaceutical would have just such a list of legal values. A DERIVED PARAMETER is calculated from one or more other PARAMETERS and/or constants. This is by means of one or more PARAMETER DERIVATIONS, where each PARAMETER DERIVATION represents a formula of some kind. The formula, in turn, must be composed of one or more PARAMETER DERIVATION ELEMENTS, where each PARAMETER DERIVATION ELEMENT may be the use of another PARAMETER or the use of a constant.

Other PARAMETERS simply describe the PRODUCT TYPE. If numeric, these could be constrained by a "high value" and a "low value". Within these constraints, a PARAMETER ASSIGNMENT could have its own "high value" and "low value".
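The three kinds of parameter might be handled along the following lines. The text does not say what form a derivation formula takes, so this sketch assumes the simplest case, a product of its elements; the function names, the range check, and the sample values are all invented for illustration.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional


@dataclass
class DerivationElement:
    """Either the use of another PARAMETER or the use of a constant."""
    parameter: Optional[str] = None
    constant: Optional[float] = None


def evaluate(elements, values):
    """Assumed formula: multiply all elements of the derivation together."""
    return prod(
        values[e.parameter] if e.parameter is not None else e.constant
        for e in elements
    )


def is_allowed(value, allowable_values):
    """DISCRETE LIST: the value must be one of the allowable values."""
    return value in allowable_values


def in_range(value, low_value, high_value):
    """Other numeric parameters: constrained by a low and a high value."""
    return low_value <= value <= high_value


# Hypothetical derived parameter: half the product of length and width.
half_area = [
    DerivationElement(parameter="length"),
    DerivationElement(parameter="width"),
    DerivationElement(constant=0.5),
]
```

A real repository would also need to record the operators of each formula; that detail is left out here because the model as described does not specify it.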

Figure 6: Parameters

A set of PARAMETER ASSIGNMENTS defines the nature of a PRODUCT TYPE. Any PRODUCT INSTANCE that is an example of the PRODUCT TYPE is then evaluated with values for the PARAMETERS assigned to its associated PRODUCT TYPE.

Figure 7 shows this. Here a PARAMETER VALUE is the fact that a particular PRODUCT INSTANCE takes a specified "value" of a PARAMETER. Note that the arc here is less about the fact that some are of a PARAMETER and some are of a PARAMETER ASSIGNMENT, than it is about the fact that you can model it either way. If you specify that the value is of a PARAMETER ASSIGNMENT you are keeping PARAMETERS from being specified that were not previously assigned to PRODUCT TYPES. This is a partial business rule, although it still does not require (as a business rule should) that the PRODUCT TYPE and PARAMETER that the PARAMETER VALUE is for represent a legal combination as expressed by PARAMETER ASSIGNMENTS.

If the PRODUCT TYPE "Model 770 ThinkPad®", for example, had assigned to it the PARAMETER "processor speed", the corresponding PARAMETER VALUE for the particular one I am looking at could be "233" (unit of measure: MHz).
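The business rule that the model itself cannot express, namely that a PARAMETER VALUE is legal only if its PARAMETER has been assigned to the instance's PRODUCT TYPE, can be checked in code. This is a minimal sketch: the dictionary of assignments, the `record_value` function, and the instance identifier are assumptions for the example.

```python
# Legal (product type, parameter) combinations, i.e. the PARAMETER
# ASSIGNMENTS, each mapped to its unit of measure.
ASSIGNMENTS = {
    ("Model 770 ThinkPad", "processor speed"): "MHz",
}


def record_value(values, instance, product_type, parameter, value):
    """Append a PARAMETER VALUE, rejecting unassigned parameters."""
    key = (product_type, parameter)
    if key not in ASSIGNMENTS:
        raise ValueError(f"{parameter!r} is not assigned to {product_type!r}")
    entry = {
        "instance": instance,
        "parameter": parameter,
        "value": value,
        "unit": ASSIGNMENTS[key],
    }
    values.append(entry)
    return entry


values = []
record_value(values, "my laptop", "Model 770 ThinkPad", "processor speed", 233)
```

An attempt to record, say, an "interest rate" for the same laptop would raise `ValueError`, since no such assignment exists.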

Figure 7: Parameter Values

Your author discovered this structure when doing work for a bank. Sometime thereafter he was working for a lumber products company that needed a model for its laboratory. Fortunately he had been doing the bank work, so he was fully prepared, coming up with the following variation:

The laboratory does tests on product samples. In this case (unlike others I came across later), the company knows the product type it is trying to make. The tests are simply to determine whether the sample is or is not that product. For this reason, it is possible to ascertain what the expected characteristics are to be.

In Figure 8 it can be seen that each PRODUCT TYPE may be evaluated in terms of one or more EXPECTED OBSERVATIONS each of which is of a particular VARIABLE. That is, the PRODUCT TYPE is considered to be within specifications if the value of a variable is between a "high value" and "low value" specified in an EXPECTED OBSERVATION of that VARIABLE.

The laboratory process begins with a SAMPLE being taken from a PRODUCT INSTANCE (which is an example of the PRODUCT TYPE in question). This SAMPLE is then subject to one or more LABORATORY TESTS. Each LABORATORY TEST, in turn, is the source of one or more OBSERVATIONS – each on a VARIABLE.

If you rename VARIABLE to PARAMETER, EXPECTED OBSERVATION to PARAMETER ASSIGNMENT, and OBSERVATION to PARAMETER VALUE, and if you then collapse SAMPLE and LABORATORY TEST, you have the model shown above in Figure 7.
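The specification check at the heart of the laboratory model can be sketched as below. The class names follow the entities in Figure 8, but the `within_spec` function, its treatment of variables that have no expectation (they are ignored), and the sample figures are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExpectedObservation:
    product_type: str    # stand-in for a PRODUCT TYPE
    variable: str
    low_value: float
    high_value: float


@dataclass
class Observation:
    variable: str
    value: float


def within_spec(product_type, observations, expected):
    """True if every observation with an expectation falls inside its range."""
    limits = {
        e.variable: (e.low_value, e.high_value)
        for e in expected
        if e.product_type == product_type
    }
    return all(
        limits[o.variable][0] <= o.value <= limits[o.variable][1]
        for o in observations
        if o.variable in limits
    )


# Invented figures for a hypothetical lumber product.
expected = [ExpectedObservation("plywood", "moisture content", 6.0, 12.0)]
good_sample = [Observation("moisture content", 9.5)]
bad_sample = [Observation("moisture content", 15.0)]
```

With these figures, the first sample is within specification and the second is not.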

Figure 8: The Laboratory

Clinical Research

This parameterization idea got stretched even further when applied to the collection of data from clinical pharmaceutical trials.

Pharmaceutical research is an example of a particularly messy modeling problem: Clinical data are captured on "case report forms" (CRFs), which, depending on the study – indeed, depending on the part of the study – have a variable number of sections, where each section could have one or several numbers, pieces of text, or even drawings. There is no fundamental, underlying structure here. The only way to address the problem is to go up one level of abstraction.

Figure 9 shows how a clinical STUDY is defined to be composed of one or more VISIT SPECIFICATIONS which will be the basis for actual VISITS by PEOPLE. Each VISIT SPECIFICATION describes the information to be collected in the corresponding actual VISITS. This information is organized into STANDARD BLOCKS, such as "personal information", "hematological information", "cardio-vascular information", and so forth. Each STANDARD BLOCK is defined in terms of the BLOCK VARIABLES it is composed of, where a BLOCK VARIABLE is the use of a VARIABLE as part of a STANDARD BLOCK.

Each STANDARD BLOCK may be tailored (to some extent) to each study. Variations are embodied in a VISIT BLOCK, which is part of a VISIT SPECIFICATION. Each VISIT BLOCK then may have its own definitions of which VISIT BLOCK VARIABLES are part of it. (A business rule defined by the research company determines the extent to which a VISIT BLOCK must conform to the specifications of its corresponding STANDARD BLOCK.)

Once the CRFs have been defined as to what VISIT BLOCKS and VISIT BLOCK VARIABLES each VISIT SPECIFICATION contains, data may be collected. Each element on the CRF is an OBSERVATION, which is for an actual VISIT at a specific date by a PERSON. Each OBSERVATION may be either text or numeric.

Figure 9: Clinical Research

It may be argued that, while this is the most orderly way to capture all these data, it makes them a little difficult to get at. To correlate measurements of two variables it is necessary to construct a query that asks for all values of a particular variable and the circumstances of their collection, in conjunction with all values of another variable, where the circumstances of their collection are matched with the circumstances of the first. This is hard.

To address this, the pharmaceutical companies that have taken this approach have devised a table structure derived from this one. (This was the original "data mart" before that word became fashionable.) The idea is that what the statisticians want to see is all the data of a certain kind together.

It turns out that the "block" structure described above gives us the opportunity to "de-abstract" the data into something a little more manageable. It is possible to write a single utility program that takes the observation data and reorganizes it into a single table for each VISIT BLOCK, with the VARIABLES showing up as columns in this table.

This appears in Figure 10. Each table represents a VISIT BLOCK, and the columns allow statistical analysis of correlations between similar variables. Even correlations between variables in different tables are easier than they were in the original observations table.
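A utility of the kind described might look roughly like this. It is a sketch under stated assumptions: observations are represented as dictionaries with the hypothetical keys `visit_block`, `visit`, `variable`, and `value`, and each resulting "table" is a dictionary of rows keyed by visit.

```python
from collections import defaultdict


def de_abstract(observations):
    """Reorganize observation rows into one table per VISIT BLOCK.

    Each table maps a visit identifier to a row whose columns are the
    variables observed in that block during that visit.
    """
    tables = defaultdict(dict)
    for obs in observations:
        row = tables[obs["visit_block"]].setdefault(obs["visit"], {})
        row[obs["variable"]] = obs["value"]
    return dict(tables)


# Invented sample data: two visits, two blocks.
observations = [
    {"visit_block": "hematology", "visit": 1, "variable": "hemoglobin", "value": 13.9},
    {"visit_block": "hematology", "visit": 1, "variable": "hematocrit", "value": 41.0},
    {"visit_block": "hematology", "visit": 2, "variable": "hemoglobin", "value": 14.2},
    {"visit_block": "personal", "visit": 1, "variable": "weight", "value": 72.5},
]

tables = de_abstract(observations)
```

In the "hematology" table, the row for visit 1 now holds hemoglobin and hematocrit side by side, so the two can be compared without matching up the circumstances of separate observation rows. A variable that was not observed in a given visit is simply absent from that row.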

Figure 10: "De-Abstracted" Clinical Data

Mapping Legacy Systems

Data modeling is not done in a vacuum. It's often done in conjunction with a major project. These days, that project is as likely as not to build a "data warehouse" – a repository that is supposed to hold all of a company's data and make them available to management for inspection and analysis.

The problem with building a data warehouse is that, while a data model is valuable in defining its architecture, it doesn't help much in dealing with all those old "legacy" systems that are going to be the source of the data. The designers of those systems often were not very cooperative in clearly identifying exactly what each datum means and where it fits into the larger scheme of things.

The data model does help, in that it provides a road map of what kind of data have to be in there somewhere. What is needed next, though, is some sort of mapping from the columns and tables (fields and files?) of the old systems to the attributes and entities of the model.

In one sense, this is not a logical data modeling problem. After all, the legacy database designs are physical structures, not logical ones. The assignment, however, is to make these logical structures useful, and it is our job to do so.

So, it is necessary to look at the model of our "metadata repository" that is keeping the "model of our models". The legacy system consists (for the sake of argument – we will not get into more complex legacy systems) of TABLES, each of which is composed of one or more COLUMNS. Our model, on the other hand, is made up of