More on Data Organization

Atomization, Organization and Recombination

Information is broken down into small units within a database that implements the OCHRE ontology. The many different entities and relationships that scholars distinguish in the course of their research—and the divergent observations and interpretations of the same entities by different scholars—are all represented as separately addressable units of information with unique "keys." Indeed, scholars themselves are represented by distinct units of information in order to attribute observations and interpretations to particular people.

For example, an archaeological artifact is represented as a spatially situated unit of observation in a hierarchy with other such units, and its properties are represented by means of links to taxonomic units (variables and values) which are themselves organized in a taxonomic hierarchy that constrains the possible pairs of variables and values allowed in particular contexts. Likewise, a text that is an object of study in its own right is represented as a hierarchy of epigraphic units broken down to the level of individual graphic signs, and the text's epigraphic hierarchy is cross-linked with a separate hierarchy that represents the text's discourse units, down to the level of individual words and perhaps even morphemes.
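The artifact example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OCHRE implementation. The class names, key strings, and the helpers `add_property` and `path` are hypothetical. The rule that a value is allowed only when it sits directly beneath its variable in the taxonomy is a deliberate simplification of the context-dependent constraints described above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Unit:
    """A separately addressable unit of information with a unique key."""
    key: str
    label: str
    parent: Optional[str] = None  # key of the containing unit, if any


@dataclass
class SpatialUnit(Unit):
    """A unit of observation; its properties are links, not copied strings."""
    properties: list = field(default_factory=list)  # (variable_key, value_key)


# Hypothetical taxonomy: each value unit is a child of its variable unit.
taxonomy = {
    "var:material": Unit("var:material", "material"),
    "val:ceramic": Unit("val:ceramic", "ceramic", parent="var:material"),
    "val:bronze": Unit("val:bronze", "bronze", parent="var:material"),
    "var:form": Unit("var:form", "form"),
    "val:bowl": Unit("val:bowl", "bowl", parent="var:form"),
}

# Hypothetical spatial hierarchy: site > trench > artifact.
spatial = {
    "loc:site": SpatialUnit("loc:site", "Site"),
    "loc:trench1": SpatialUnit("loc:trench1", "Trench 1", parent="loc:site"),
    "loc:artifact7": SpatialUnit("loc:artifact7", "Artifact 7", parent="loc:trench1"),
}


def add_property(unit, variable_key, value_key):
    """Link a unit to a variable/value pair, if the taxonomy allows the pair."""
    if taxonomy[value_key].parent != variable_key:
        raise ValueError(f"{value_key} is not a value of {variable_key}")
    unit.properties.append((variable_key, value_key))


def path(units, key):
    """Walk parent links upward to reconstruct a unit's place in its hierarchy."""
    labels = []
    while key is not None:
        unit = units[key]
        labels.append(unit.label)
        key = unit.parent
    return " / ".join(reversed(labels))


artifact = spatial["loc:artifact7"]
add_property(artifact, "var:material", "val:ceramic")
print(path(spatial, artifact.key))
```

In this sketch the artifact stores only keys. Its location and its descriptive labels are both recovered by following links, which is the sense in which each unit remains separately addressable.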

This high degree of atomization prevents the inefficient and error-prone duplication of the same information at different places in the database. Moreover, it exposes in an explicit manner all the relationships and dependencies among separate units of information. As a result, it permits great flexibility in the use of the information. Atomized units of information can be dynamically recombined and presented in many different ways, depending on the needs of a particular user at a particular time.

Atomization and dynamic recombination of information have long been the hallmarks of well-designed relational database systems. The process of structuring relational table schemas and the linkages between them in a way that maximizes the benefits of atomization is called normalization. In the case of XML databases, which are well suited to implementing the OCHRE ontology, a similar process yields an atomized design that achieves the same purpose. Thus, the XML document schemas discussed below in Chapter 20 specify the structure of a normalized database—using this term, which is derived from relational database theory, in a nontechnical sense.

However, a high degree of atomization requires specialized software that can retrieve and recombine information from the database to meet particular needs. An OCHRE database is atomized to the point where each descriptive property that is used in the database is itself an independent, addressable unit of information that exists in only one place and is linked to all the entities that it describes. Thus, for example, the character string "diameter" that names a descriptive property exists in only one place in the database. Any change to this character string will be instantly available in any view of the data in which this property is used.
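The "diameter" example can be illustrated with a minimal Python sketch. It assumes a simple in-memory store. The key `prop:001`, the item names, the measurements, and the `render` function are all invented for illustration and do not reflect how OCHRE itself stores or displays data.

```python
# The property name is stored exactly once, under its own key.
properties = {"prop:001": {"name": "diameter"}}

# Observations refer to the property by key instead of repeating its name.
observations = [
    {"item": "Bowl A", "property": "prop:001", "value": "12 cm"},
    {"item": "Jar B", "property": "prop:001", "value": "30 cm"},
]


def render(observations, properties):
    """Follow each link at display time to build a readable line."""
    return [
        f'{obs["item"]}: {properties[obs["property"]]["name"]} = {obs["value"]}'
        for obs in observations
    ]


before = render(observations, properties)

# A single edit to the one stored string...
properties["prop:001"]["name"] = "rim diameter"

# ...appears in every view that is rendered afterward.
after = render(observations, properties)
```

Because no observation holds its own copy of the name, there is nothing to search and replace, and no view can fall out of step with another.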

This degree of atomization has many advantages with respect to maintaining the accuracy, integrity, and flexibility of the database. However, a specialized database client application, such as the Java user interface described in this book, is needed to follow the links and reconstitute a readable version of the information from the atomized units in response to an end user's actions.

In some cases, people may wish to search and display OCHRE data without the aid of a specialized database client application, relying only on a Web browser and a string-matching search engine. For this reason, it is possible to export from an OCHRE database a "denormalized" archive consisting of plain-text XML files, in which the information is much less atomized than it is in the database (see Chapter 18). To generate such an archive from the database, the software will follow the links among atomized units in order to reconstitute the information in a human-readable form that is easy to display but is characterized by a great deal of redundancy and a lack of explicit, machine-readable internal structure.
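The export step can be sketched roughly as follows, using Python's standard `xml.etree.ElementTree` module. The element names (`item`, `name`, `context`, `property`), the keys, and the sample data are assumptions made for this illustration. They are not the OCHRE export format or its document schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical atomized store: units link to each other and to shared labels.
units = {
    "u1": {"name": "Trench 1", "parent": None, "props": []},
    "u2": {"name": "Bowl A", "parent": "u1", "props": [("p1", "v1")]},
    "u3": {"name": "Bowl B", "parent": "u1", "props": [("p1", "v1")]},
}
variables = {"p1": "material"}  # each variable name is stored once
values = {"v1": "ceramic"}      # each value name is stored once


def context_path(key):
    """Follow parent links upward and spell the context out as plain text."""
    names = []
    while key is not None:
        names.append(units[key]["name"])
        key = units[key]["parent"]
    return "/".join(reversed(names))


def export_unit(key):
    """Build a denormalized XML record with every link resolved to text."""
    unit = units[key]
    item = ET.Element("item")
    ET.SubElement(item, "name").text = unit["name"]
    ET.SubElement(item, "context").text = context_path(key)
    for var_key, val_key in unit["props"]:
        prop = ET.SubElement(item, "property")
        prop.text = f"{variables[var_key]}: {values[val_key]}"
    return ET.tostring(item, encoding="unicode")


archive = {key: export_unit(key) for key in units}
```

In the resulting records the text "material: ceramic" is repeated in each bowl's record, and the keys that linked the units are gone. That is the trade-off described above: each record can be read and matched by a plain string search, at the cost of redundancy and the loss of machine-readable links.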