A Warehouse for "Big Data"

OCHRE as a Warehouse for "Big Data"

To capture a record of the past, OCHRE gathers into coherent structures many kinds of content including archaeological, lexical, textual, historical, and scientific. Cultural heritage data in all of its forms is supported: descriptions of artifacts and monuments, transcriptions and translations of ancient texts, living dictionaries of long-dead languages. This is supplemented with research data (e.g., field notes, text editions, interpretations, bibliography, high-quality images) recording what has been said about such objects of study over time, and by whom.

There are two main approaches to integrating data from disparate sources (in this case, from different research projects). Datasets are sometimes left in their original formats and combined on the fly, at the time a user submits a query. This approach is called "data mediation." The other approach is to combine the data ahead of time within a central repository or "data warehouse."
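The contrast between the two approaches can be illustrated with a minimal Python sketch. It is not OCHRE's actual implementation: the two sample projects, their record formats, and every function name below are hypothetical and chosen only for illustration. Mediation translates and combines the sources each time a query is run, while the warehouse does that work once and answers queries from a prebuilt index.

```python
# Hypothetical records from two projects, each in its own native format.
project_a = [{"name": "Bowl", "material": "ceramic"}]
project_b = [("Amphora", "ceramic"), ("Blade", "bronze")]


def adapt_a(rows):
    """Translate project A's dictionaries into the shared record shape."""
    return [{"label": r["name"], "material": r["material"]} for r in rows]


def adapt_b(rows):
    """Translate project B's tuples into the shared record shape."""
    return [{"label": label, "material": material} for label, material in rows]


# Each source is paired with the adapter that understands its format.
SOURCES = [(project_a, adapt_a), (project_b, adapt_b)]


def mediated_query(material):
    """Data mediation: translate and combine the sources on every query."""
    results = []
    for rows, adapt in SOURCES:
        results.extend(r for r in adapt(rows) if r["material"] == material)
    return results


def build_warehouse():
    """Data warehouse: translate and combine once, building an index."""
    index = {}
    for rows, adapt in SOURCES:
        for record in adapt(rows):
            index.setdefault(record["material"], []).append(record)
    return index


def warehouse_query(index, material):
    """Answer a query from the prebuilt index without touching the sources."""
    return index.get(material, [])
```

Both query functions return the same records for the same question. The difference lies in when the integration work happens and whether the original sources must remain reachable at query time.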

The Benefits of Centralization

OCHRE uses the second approach. It serves as a central repository that combines data from multiple sources. At present, it does this by means of a single database server, although the OCHRE database could easily be distributed across multiple servers, if necessary. To OCHRE users, the physical location of the data is not relevant. OCHRE will always appear as a single repository that integrates data from participating projects.

The choice to use a "data warehouse" approach instead of a "data mediation" approach reflects the research purpose OCHRE is intended to serve. Integrating cultural heritage data within a comprehensive, well-indexed database structure permits much faster and more sophisticated searching of the data. The process of importing a project's data into OCHRE also helps to ensure internal consistency and correctness.
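As a rough illustration of how an import step can enforce consistency before data reaches an indexed repository, the following sketch uses Python's standard sqlite3 module. The table layout, the field names, and the validation rules are assumptions made for this example and do not describe OCHRE's real schema or import process.

```python
import sqlite3


def open_repository():
    """Create an in-memory repository with an index on the searched column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        " project TEXT NOT NULL,"
        " item_id TEXT NOT NULL,"
        " label TEXT NOT NULL,"
        " PRIMARY KEY (project, item_id))"
    )
    conn.execute("CREATE INDEX idx_items_label ON items (label)")
    return conn


def import_project(conn, project, rows):
    """Validate a whole batch first, then load it in a single transaction."""
    seen = set()
    for row in rows:
        # Hypothetical rule: every record needs an identifier and a label.
        if not row.get("id") or not row.get("label"):
            raise ValueError(f"incomplete record in {project}: {row!r}")
        # Hypothetical rule: identifiers must be unique within a project.
        if row["id"] in seen:
            raise ValueError(f"duplicate id in {project}: {row['id']}")
        seen.add(row["id"])
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO items (project, item_id, label) VALUES (?, ?, ?)",
            [(project, r["id"], r["label"]) for r in rows],
        )


def find_by_label(conn, label):
    """Search across all imported projects using the indexed column."""
    cursor = conn.execute(
        "SELECT project, item_id FROM items WHERE label = ? ORDER BY project",
        (label,),
    )
    return cursor.fetchall()
```

Because a batch is checked before any row is inserted, an inconsistent dataset is rejected as a whole rather than partially loaded, and later searches run against data that has already passed the same checks.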

Moreover, unlike business users whose data changes minute by minute, academic researchers do not need on-the-fly mediation of rapidly changing data sources. Indeed, most cultural heritage projects do not have the technical and financial resources required to keep their data permanently accessible on the Internet and thus available for on-the-fly mediation, so some sort of central repository is desirable. For researchers, live updates from disparate local sources matter much less than careful integration within a comprehensive structure in which the data can be searched using powerful and reliable queries.