Data Modeling Problems and Constraints

Contact Martin Modell Table of Contents

As data modeling moved from being an implementation tool of the physical file designers to being a design tool of the systems analysts, systems designers and, more importantly, the user, many of the problems inherent in the process became apparent.

The meaning of the record data model as a representation of the physical file design was apparent. It represented a schematic of record interconnections that could readily be translated into code which instructed the DBMS how to construct the files. Data record content was determined by the analyst and designer and only they needed to understand the logic behind the placement of data in specific records. The model was a tool of physical file design and a method of communicating design decisions between the designer and the implementation team.

The use of the analysis data model during the design phase as a communication mechanism between designer and user was less effective. Its meaning was also less apparent than that of the record model. The analysis model attempted to address the notion that data from the user perspective were descriptors (attributes) of things (entities), and that by identifying those things, their attributes and their relationships one would be identifying the subjects of data required by the corporation. The activities to produce these new models were positioned at the beginning of the development life cycle before the beginning of the data element identification process and well before the steps of assembling those elements into records.

Until the introduction of these models, any data analysis activities that were performed attempted to develop the file and record designs by going from the detail (elements) to the general (records and files). The new models took a general-to-specific approach and added a new component to the model: the concept of data subjects. This allowed the designer to proceed with the design logically, beginning with identification of subjects, then developing conceptual data groupings, populating those groupings with data elements, and then taking the resultant model and translating it into physical records and files. The specific record layouts and file structures would be dependent upon the specific DBMS or data management technique chosen for implementation.
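
The top-down sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Subject` and `DataGrouping` classes, the `to_record_layouts` function and the customer example are hypothetical names introduced here, and the final translation step is a simple flattening rather than the rules of any particular DBMS.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataGrouping:
    """A conceptual grouping of data elements within a subject (hypothetical)."""
    name: str
    elements: list[str] = field(default_factory=list)


@dataclass
class Subject:
    """A data subject: something the business keeps data about (hypothetical)."""
    name: str
    groupings: list[DataGrouping] = field(default_factory=list)


def to_record_layouts(subject: Subject) -> dict[str, list[str]]:
    """Translate each grouping into a flat record layout (record name -> fields).

    A real translation would depend on the chosen DBMS; this only flattens.
    """
    return {
        f"{subject.name}_{g.name}".upper(): list(g.elements)
        for g in subject.groupings
    }


# Step 1: identify the subject.
customer = Subject("customer")

# Step 2: develop the conceptual groupings (still without data elements).
identity = DataGrouping("identity")
address = DataGrouping("address")
customer.groupings += [identity, address]

# Step 3: populate the groupings with data elements.
identity.elements += ["customer_number", "name"]
address.elements += ["street", "city", "postal_code"]

# Step 4: translate the resulting model into record layouts.
layouts = to_record_layouts(customer)
```

The point of the sketch is the order of the steps: the subject and its groupings exist before any data element is assigned, which is the reverse of the earlier detail-to-general practice.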

Areas of difficulty

The differences between the record models and the ER models stemmed from their content, from their perspective (what they were intended to represent), from their method of construction and from their use. Differences also arose from the terminology used by each. The following is a summary of these areas of difference.

Changes in perspective
The first area of difference arose from the change in perspective of the modelers. Prior to the analysis model, data file design tended to follow the pattern set by the forms that preceded them. That is, the files were designed to contain data obtained from earlier forms. These forms tended to be utilitarian and rather restrictive or limited in the amount, kind and variety of data they collected. They were also highly standardized. A given form was designed to collect a certain kind of data, for a certain purpose. Any data not anticipated in that form was not intended to be included. Where unanticipated data was added to existing forms, or where form fields designated for one kind of data were used for another kind, confusion occurred and miscommunication resulted.

The data collected by the forms, and later by the automated files that replaced them, did not allow for variation. Data was strictly formatted, strictly coded, and all data indicated on the form had to be collected. Because forms represented the data to be collected for a specific purpose, that data was codified for that purpose and rarely for any other. Customer information on an account opening form was rarely of much use to the marketing department, since the form was usually designed by the sales department, sometimes in consultation with the credit department, to determine the credit worthiness of the client and to determine what kind of account was to be opened for that customer.

Changes in content assumptions
The record models assumed that all data represented by the model had to be collected, although some variation was allowed. Another major assumption of these models was that each occurrence or entry in the file (each set of records) was relatively uniform. These types of models allowed for few variations in the kinds of subjects their data represented. They also assumed that for the most part all occurrences would be processed in a similar manner.

There was one additional assumption which was key to the record model. Each symbol on the model represented a collection of data elements (segment, row or record) and each line (where they were used) represented the physical address or pointers which connected the records to each other. The only additional information contained on these models (in most cases) represented whether a given record occurred once or more than once within a "logical" set of records.

Changes in terminology
Each of the many different types of data model, both record and conceptual, introduced terminology and definitions intended to explain the concepts inherent in these models, and in some respects to differentiate the model from all other models. Each model had proponents who explained, clarified and extended the model beyond its original scope. Many of these models also engendered large groups of practitioners who met regularly to exchange ideas and experience. From these meetings originated more new terminology and concepts. Because the models were used primarily for DBMS file construction, and because most companies tended to standardize on a single DBMS, practitioners tended to be well versed in the terms and concepts of their particular model but relatively ignorant of the terms and concepts of any other model.

As firms began to use more than one DBMS, the terminology became confusing, because the same words meant different things for each. Records, segments, rows or tuples were all terms used to represent how the DBMS "saw" the data, but in reality they all represented a set of data items from the designer's perspective, although not always from the programmer's or user's perspective.

This difference between what the designer built and what was presented to the programmer or user led to the rise of a distinction between the "physical" records and the "logical" records, between "physical" database and "logical" database.

The conceptual, analytical or logical models shifted the focus of the designers and analysts from the question "how should the data be grouped and stored?" to the question "what information does the business require and what data must be collected and maintained to provide that information?"

During design, the analyst and user attempt to solidify concepts and business rules into data requirements and to represent those data requirements in some form which could ultimately become the physical file design.

The analytic model became a tool for stimulating ideas, identifying requirements and organizing those requirements in a coherent fashion. This use of the model however introduced many new concepts and perspectives which were only vaguely understood, and directed the analysis into areas which were equally vague and ill-understood. Further confusion was introduced when designers attempted to translate concepts from the record models into the analytic models.

The analytic models, because they portrayed concepts, became vague and muddled. This was because in most cases the concepts themselves were vague and muddled. The intent of the model was no longer clear nor was it clear in many cases what the model was trying to communicate.

The primary form of the new, or analytic, models was the entity-relationship model. As originally proposed it was to be a link between the real world things that the user worked with and the record models that the automated system implementation teams worked with. It was intended to be DBMS independent, and focused on things about which data was collected rather than on how data about things was to be stored. These new models were part of a new approach to design, an approach from the user perspective. These models depicted business entities and their relationship with each other.

The concept was stark in its simplicity. All business data is collected and maintained about people, places, things or concepts of interest to the firm. Thus if we model those things we have a strong foundation for determining what data is needed about these things. This new approach however developed still another set of terms to describe its components and to explain its concepts. It was strongly grounded in the relational model, and incorporated many of the concepts of normalization associated with the relational model.

Although data is collected about people, places, things and concepts, files are created about groups or collections of people, places, things and concepts. Since it is cumbersome to repeat the phrase people, places, things or concepts each time when talking about these groups in general, data modelers use the generalized term entity. Thus an entity is any group of persons, places, things or concepts about which the firm must collect and maintain records (records in this sense are business records).

Many of the analysts and designers familiar with the record models saw the new models as a new way to portray the record relationships, rather than as a different kind of model. The developers of the CASE tools also used the ER model in a similar manner. Unlike the record model, where the symbols or model components represented a record, which was a representation of a group of data elements, the ER model symbols or components represented entities. The term entity as defined in the dictionary means the fact of being, existence. There is no representation, implied or otherwise, as to the subject of that existence. In other words, without further detailed explanation, the term entity means absolutely nothing.

Even if the ER models were limited to people, places and things, they would probably not have been easy to build and understand. The greatest difficulty, however, is engendered by the inclusion of concepts in the definition. By including concepts (and sometimes even events) within the definition of the term entity, we have in effect made an entity equate to anything we want it to be, since the span of the definition includes everything within our perception. Thus, not only can records become entities, but also roles, and even entities themselves can become part of what is being modeled by the ER models. The latter can be seen most clearly in data dictionary (also called repository or encyclopedia) models, when data processing personnel attempt to create entry types to document entities from their analytic models. The inclusion of concepts and events in the definition caused many modelers to find themselves in the curious position of having to discuss the "entity entity."

Intermediate versus final products
Another difference between the record model and the analytic model also caused confusion. The record model described a record structure. It was implicit in the model that each structure or schematic of records represented the data about a given subject and that each iteration of the structure, or each iteration of records, represented data about a given instance. The record model, however, was the final product; it did not decompose into any other type of model, nor did the records described decompose into different kinds of records.

The entity-relationship model is a decomposition model. That is, it is not intended to be used as a final form, but as an intermediate product. It was a transition product in much the same manner that the requirements documentation of the procedural analysis was not a final product but a reference point for the analysis phase and the starting point for the detail design phase.

Semantic versus structural models
The entity-relationship model however was also substantially different from all other models then in use in the analysis and development cycle. That difference arose from the fact that all other models were either schematic (such as the data record models) or strict decomposition models such as the functional and process decomposition models of the analysis and design processes. It was also different from the flow models as typified by the data flow diagrams used by structured analysis advocates. The difference was that the ER model was what was called a semantic model. Semantics is defined as pertaining to meaning, as in language. The semantic model attempted to translate the meaning of the words used in requirements statements into pictorial representations for analysis. Specifically the semantic modelers attempted to translate the business rules which governed how, when and why the objects of business interest (entities) related to each other.

Previous data models looked at data record relationships, and portrayed the paths that had to be established between one set of records and another. These paths were analogous to the indices and cross references established between sets of paper form files. They attempted to automate the means of associating data in one file with data in another file.

The new models were an attempt to step back from the record models and look not at the data about things, but at the things about which the data was collected, and the relationships that had to exist between them based upon the statements of business requirements.

Conceptual ambiguity
This step back forced both analyst and user to look at and clearly define those things, those entities. The development of the ER models was begun at a higher level and from a different perspective than most previous models. The higher level took it outside a specific business area, outside most operational areas and into the realm of the managerial, and sometimes even into the strategic levels of the firm. The focus shifted from ascertaining the data content of forms to trying to understand, or come to some common agreement about, some of the fundamental concepts of the firm - those of product, customer, location, and sometimes even function and process. The new models highlighted many areas of ambiguity, misunderstanding and sometimes disagreement, problems which were masked by lower level models because those models did not have to deal with them.

These models also highlighted hidden redundancy of concept and data by looking at the business rules which were the foundation of the business processing. In many cases these business rules were never clearly defined, or were defined in several different ways. The models also highlighted another and perhaps more disturbing problem, that of conceptual inaccuracy. Many employees, in most cases highly experienced employees, did not have an accurate understanding of those business rules, nor of the legislative and regulatory rules under which the firm must operate. In still other cases, the models highlighted objects of business interest about which data was not being collected, or for which data was being collected in an incorrect manner. These omissions could, if not detected, severely handicap data collection activities and impede management's visibility into the firm's operations.

The meaning of relationships
Another problem area could be viewed as technical rather than analytical. Record models, because they were built to support the construction of DBMS files, conformed to DBMS rules. One of those rules was that the DBMS could not support what were known as "many-to-many" relationships. A given record could be related to many other records, or to one other record, or even to no other records. However, the DBMS could not support the concept of many records being related to many other records. In building record models, great care was used to avoid or remove these many-to-many relationships. The ER model had no such restrictions, because the real world had no such restrictions.

Another relationship-related problem had to do with the number of relationships portrayed by the model. In the record model, the relationships that were incorporated into the model were record-to-record relationships and represented, for the most part, pointers or record address relationships. Although a given record could be related to multiple other records (particularly in the network models), there tended to be only a single, or at most two, relationships between a given pair of records. In the real world, and in the ER model which portrayed the real world, multiple, independent relationships can exist, and often do exist, between any two entities.

Again, in the record model a relationship means only one thing: two records are related. Because these relationships represent pointers or record addresses, there is no data associated with them. Only in the case where the designer must resolve (eliminate) many-to-many relationships is a data record created to take the place of the relationship, and even in those cases these created records are related to their parent records by pointers. In the ER model, relationships represent business rules, and even one-to-one relationships may have data associated with them.
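
The resolution of a many-to-many relationship described above can be sketched as follows. This is a minimal illustration, not a rendering of any particular DBMS: the `Record` and `Supply` classes, the `link` helper and the supplier and part data are all hypothetical names introduced here. Each parent record holds pointers only to intersection records, and each intersection record points back to exactly one supplier and one part.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A parent record: a key plus pointers to its child records (hypothetical)."""
    key: str
    children: list = field(default_factory=list, repr=False)


@dataclass
class Supply:
    """Intersection record created to take the place of the many-to-many link."""
    supplier: Record   # pointer to one parent
    part: Record       # pointer to the other parent
    unit_price: float  # data that belongs to the pairing, not to either parent


def link(supplier: Record, part: Record, unit_price: float) -> Supply:
    """Create the intersection record and chain it to both parents."""
    supply = Supply(supplier, part, unit_price)
    supplier.children.append(supply)
    part.children.append(supply)
    return supply


acme, boltco = Record("ACME"), Record("BOLTCO")
bolt, nut = Record("BOLT"), Record("NUT")

link(acme, bolt, 0.10)
link(acme, nut, 0.05)
link(boltco, bolt, 0.12)

# Each direction is now an ordinary one-to-many walk over pointers.
parts_from_acme = [s.part.key for s in acme.children]
suppliers_of_bolt = [s.supplier.key for s in bolt.children]
```

Note that the price lives on the intersection record. In an ER model the same fact would be stated as data associated with the relationship itself, without any record being implied.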

The concept of the relationship name is familiar to the network modeler (called set names in that model) but is alien to the hierarchic and relational modeler. This concept is integral to the ER model.
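
As a rough sketch of why relationship names matter, the fragment below records several independent relationships between the same pair of entities. The `Relationship` tuple and the employee and department examples are assumptions made for illustration only; they are not taken from any specific ER notation or tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Relationship(NamedTuple):
    """One named relationship between two entities (hypothetical structure)."""
    subject: str
    name: str
    obj: str


relationships = [
    Relationship("employee", "works in", "department"),
    Relationship("employee", "manages", "department"),
    Relationship("employee", "is billed to", "department"),
]

# Group the relationship names by the pair of entities they connect.
by_pair = defaultdict(list)
for rel in relationships:
    by_pair[(rel.subject, rel.obj)].append(rel.name)

names = by_pair[("employee", "department")]
```

Without the names, the three entries would collapse into a single undifferentiated line between employee and department, which is all a pointer-based record model needs but loses the three distinct business rules.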

The introduction of roles
As mentioned previously, the ER model is not, strictly speaking, a decomposition model. However, the level to level model creation is based upon the principles of decomposition. The process of decomposing entities when going from level to level usually involves the separation of role entities from the base entity. This separation causes another "entity" to be created. This new entity may be real or not, depending on how important the role is to the firm. If the new role entity were completely independent from the original entity there would probably be no difficulty; however, in most cases there is a strong overlap, complicated by the fact that the people, places and things represented by the entities play multiple simultaneous roles, and the firm must deal with them somehow. This is sometimes portrayed in ER models as the "is-a" relationship, as in the employee also is a customer. This treatment of the problem probably causes as much confusion as it resolves.

Part of this confusion stems from the fact that record models decompose according to different rules. When a record is decomposed two or more new records are produced and the old record disappears. There is no need to maintain the identity with the parent record. In the ER model there is a strong need to maintain this identity, and in fact the parent entity may and often does remain in the model along with its offspring.
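
One way to picture the role problem is the sketch below, in which role entities are separated from a base entity but keep a reference to it. The `Person`, `Employee` and `Customer` classes and their fields are hypothetical, chosen only to illustrate one individual playing two roles at once while the parent entity stays in the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Base entity: it remains in the model after the roles are split out."""
    person_id: int
    name: str


@dataclass(frozen=True)
class Employee:
    """Role entity: an employee 'is a' person."""
    person: Person
    badge: str


@dataclass(frozen=True)
class Customer:
    """Role entity: a customer 'is a' person."""
    person: Person
    account: str


pat = Person(1, "Pat")
roles = [Employee(pat, "E-100"), Customer(pat, "C-200")]

# The same individual appears behind both role entities at the same time.
roles_of_pat = sorted(type(r).__name__ for r in roles if r.person is pat)
```

Had this been a record decomposition, `Person` would have disappeared once `Employee` and `Customer` were produced. Here the identity has to be carried by every role, which is where the overlap and the confusion come from.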

The empty box
Designers and analysts familiar with the record models know that the box on the schematic represented some number of data elements, even if those data elements had not yet been assigned. The box on the ER model represents an idea or a concept, even when the label on the box is something familiar such as customer.

This notion of boxes in data models without data is disturbing, and many modelers and most CASE tools make the assumption that the box does in fact represent data elements. They also treat the relationships between those boxes more like the pointer indicators in the record model than as representations of a business rule.

The meaning of the term entity
The definition of an entity itself has caused several problems for modelers attempting to use the ER models. Among them are that in unqualified form, the term "entity":
- can be used as both a group and singular term, sometimes within the same context. Models must be consistent. The symbols represent concepts and the concepts must be clearly understood. If a box is labeled people, then it represents many people and must be treated as if it represents the group. If the box is labeled person, then it represents a single individual. Since the ER models are semantic models, each pair of entities and each relationship represent the basic components of a simple sentence: subject and object (the entities) and verb (the relationship). Just as we would not intentionally switch from singular to plural when speaking about the same thing, we should also not switch from singular to plural when discussing the entities or using the entities in the model.
- can be used to refer to the whole group and a portion of the same group, sometimes within the same context. This is particularly important when we deal with roles and decomposition of the entities. An entity called employee can also be split into temporary and regular. Since these are mutually exclusive groups (an employee cannot be both at the same time) one group has become two by adding a qualifier. However, when employee is split into manager and employee it is not so simple, because managers are also employees, and managers are managed by other managers. Thus portraying manager and employee in the same model is confusing at best.
- can be used as a general term when it is impractical or cumbersome to use the phrase "person, place, thing, or concept ...". The term entity is the most general term we can use when referring to something. It includes things, and even objects. In fact, it includes everything. When we refer to entities in our models we are in effect saying that the term can be replaced by any other term and still mean the same thing. In most cases, this is not true, since we are usually referring to something more specific.
- can be used to refer to many different levels of aggregation and to conceptually different components at the various design levels (conceptual, logical, physical). Employee as an entity in one model refers to all employees, whereas in another model we may be referring to a specific employee. This is of particular importance when we begin to discuss relationships. The reference 'some,' or 'all,' as part of the relationship statement must be included just as the terms 'sometimes' and 'always' must be included. Although it is rarely used, the term 'never' is also a possibility in a relationship statement. When multiple levels of model are produced, it is the designer's option to repeat the entities from one level to the next. In most cases this is not done, but the same names are carried forward when the entity does not change from level to level. An entity at the conceptual level - representing the perspective of the firm - may not be the same entity at the logical level - representing the perspective of a specific business area - even though the same name may be used. At the physical or record level the entity becomes a record which is dissimilar from the entities at each of the other levels, and yet they are all referred to as entities.
- has been borrowed and used by physical designers and by CASE tool vendors to refer to the records in the data base structure models. Even though entities represent people, places and things, and records are collections of data, the CASE vendors borrowed both the term and the form when they provided data modeling capabilities with their tools. Although it is possible to produce a standard ER model, when one looks at the definition of the entity it is replete with data elements as part of its description.
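
The sentence structure mentioned in the list above (entities as subject and object, the relationship as verb, qualified by 'some' or 'all' and by 'sometimes', 'always' or 'never') can be sketched as a small data structure. The `RelationshipStatement` class, its field names and the example sentences are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

QUANTIFIERS = {"some", "all"}
FREQUENCIES = {"sometimes", "always", "never"}


@dataclass(frozen=True)
class RelationshipStatement:
    """A relationship read as a simple sentence (hypothetical structure)."""
    quantifier: str  # 'some' or 'all'
    subject: str     # entity acting as the subject
    frequency: str   # 'sometimes', 'always' or 'never'
    verb: str        # the relationship
    obj: str         # entity acting as the object

    def __post_init__(self):
        # Reject statements that omit or misuse the required qualifiers.
        if self.quantifier not in QUANTIFIERS:
            raise ValueError(f"unknown quantifier: {self.quantifier!r}")
        if self.frequency not in FREQUENCIES:
            raise ValueError(f"unknown frequency: {self.frequency!r}")

    def sentence(self) -> str:
        return (f"{self.quantifier.capitalize()} {self.subject} "
                f"{self.frequency} {self.verb} {self.obj}.")


works_in = RelationshipStatement(
    "all", "employees", "always", "work in", "a department")
manages = RelationshipStatement(
    "some", "employees", "sometimes", "manage", "other employees")

first = works_in.sentence()
second = manages.sentence()
```

Keeping the subject consistently plural (or consistently singular) in every statement is the code-level counterpart of not switching between group and singular use of an entity within one model.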

Within the ER model the term attribute is used when discussing a property of the entity. The entity, strictly speaking, is nothing more than the label of something, devoid of properties or attributes. An attribute is a property, aspect, descriptor, identifier or characteristic of either an entity or a relationship. The entity is in fact the sum of its attributes. Entities are described in terms of their attributes. The term attribute has also been used as the name for data elements and data groups.

The difficulties with this definition are that these terms are not synonymous and thus the term "attribute":
- is used as both a group and singular form, sometimes within the same context.
- is used to refer to a group of data elements, a portion of a group of elements, or a single element, sometimes within the same context.
- is used to refer to many different levels of aggregation and to conceptually different components at the various design levels (conceptual, logical, physical).
- has been borrowed and used by physical designers and by CASE tool vendors to refer to the data elements in the data base structure models.
- is used to refer to the properties (descriptors) and the characteristics of entities and of data elements.

In some data models attributes have been elevated to the status of entities. Many models, and most CASE tools, have only two constructs: the entity and the attribute. In them, an attribute is a single data element, and an entity is anything that has more than one attribute.

Because of these differences in concept, different terms should be used. In later chapters, we will introduce some new terms (at least new to data modeling) and will suggest how they should be used, and why.

Data Analysis, Data Modeling and Classification
Written by Martin E. Modell
Copyright © 2007 Martin E. Modell
All rights reserved. Printed in the United States of America. Except as permitted under United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the author.