Archiving Data

Archiving data is an issue that every organisation needs to address, whether you need to store information for regulatory purposes in the near term or need access to organisational research materials for the next 20 years.

There are many methods of data archiving, and the decision to use any particular method depends on the type of information, the length of time the organisation needs to be able to access the data, and any specific organisational requirements.

Long term data archives

With long-term data archives, the priority should be to provide data in a format that is self-describing to users in the foreseeable future. In this case one of the most important issues is the data format. Whether you are archiving Oracle databases, Microsoft Office documents or Lotus SmartSuite documents, you should not expect this software to be usable in the future. More generally, you should not expect that the systems you are using today, and their respective data formats, will be those used in the distant future. Even today, organisations like NASA are struggling with the myriad data formats used in their different research groups, as well as the archaic information still being transmitted back to Earth from satellites that were launched 20 years ago. NASA has gone so far as to create an organisation whose goal is to maintain information indefinitely. This organisation continuously migrates data into new systems. While this approach provides the highest fidelity for their data, it is an enormous effort in both time and money. Additionally, data is still lost or left unmaintained, because they have only limited resources to update and advance their data. As a result, a great deal of their time is also spent determining which information it is acceptable to lose.

For organisations that cannot afford to expend these resources, the most robust way to keep their data accessible is to use the simplest data format possible, preferably one that is self-descriptive. Self-descriptive data structures reduce dependence on external systems that can become lost over time. If data structures require software or documentation to define how they can be recreated, this information may not stay with the actual data, or it may simply be destroyed. The simplest case is an employee leaving a company when no current employee has been given the knowledge needed to access their data. In that case the loss occurs immediately.

Our approach aims to be as robust as possible. Our goal is to ensure the highest degree of access to data, now and in the indefinite future. To that end, we believe a robust system is one that keeps existing information in multiple formats. The original copy of a document would be kept, as would a PDF version that gives the user a high-fidelity copy of the original. The third version of the document would be HTML-based text. We've chosen HTML in this instance because it provides users with a high degree of flexibility while still allowing them some direct categorization capability with their documents. We've also chosen HTML over other text data formats like XML because, for the foreseeable future, it allows a user to browse documents directly. Using HTML removes the dependence on the ability to execute other applications and still provides for hierarchical data access. With HTML metadata, external software can still be used to provide indexing and rapid access to information where necessary.
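As an illustration of the HTML copy described above, the following Python sketch writes a plain-text document into a standalone HTML page and stores descriptive fields in `<meta>` tags, then reads those fields back the way an external indexer might. It is a minimal sketch, not the system described in this article: the function names (`build_archive_html`, `read_metadata`) and the metadata fields in the usage note are hypothetical, and only the Python standard library is used.

```python
from html import escape
from html.parser import HTMLParser


def build_archive_html(title, body_text, metadata):
    """Return a standalone HTML page whose <meta> tags describe the document.

    `metadata` is a dict of hypothetical descriptive fields (author, date, ...).
    """
    meta_tags = "\n".join(
        f'    <meta name="{escape(name, quote=True)}" content="{escape(value, quote=True)}">'
        for name, value in sorted(metadata.items())
    )
    # Blank lines in the source text separate paragraphs.
    paragraphs = "\n".join(
        f"    <p>{escape(block.strip())}</p>"
        for block in body_text.split("\n\n")
        if block.strip()
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "  <head>\n"
        '    <meta charset="utf-8">\n'
        f"    <title>{escape(title)}</title>\n"
        f"{meta_tags}\n"
        "  </head>\n"
        "  <body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{paragraphs}\n"
        "  </body>\n"
        "</html>\n"
    )


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags, as an indexer might."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.found[attributes["name"]] = attributes["content"]


def read_metadata(html_text):
    """Return the descriptive fields stored in the page's <meta> tags."""
    parser = MetaExtractor()
    parser.feed(html_text)
    return parser.found
```

For example, `build_archive_html("Q3 Report", text, {"author": "R&D Team"})` produces a page that opens in any browser, and `read_metadata` recovers the `author` field from it without needing the application that created the original document.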

If an organisation chooses a knowledge-ware application as its data storage method, it is committing to maintain those systems, their software, and user expertise in them, which is a significant expense.

Domino based data storage

For organisations that use Domino as their current document warehouse, the application and its databases have a hierarchical structure inherent in their implementation. An additional advantage of working with Domino is that it provides a centralised system for managing existing documents, attachments, and embedded objects. With Domino, an organisation's databases can be extended to provide an automated system that archives documents without depending on every document's creator.

Regardless of a document's data format, the methods of archiving it can be the same. The staff at Kafka Adaptive have worked with IBM, helping them deliver data conversion applications that they use internally for SmartSuite documents. In that system, users were able to submit documents to a server for conversion and specify the target data format. The servers would convert the documents to the requested format and return them to the user.

A similar system can be developed to automate this process, converting Domino database documents and their attachments when necessary, and creating archives of the material for user access. This archival process can be driven by database administrators, who can specify how often it runs and control which documents may be archived. In addition, because of the extensive search capabilities of Domino, the archived data that is generated is indexed in greater detail than an organisation would find with other products. This search capability also makes it easier to reuse archived documents with minimal processing. Documents that have been archived can be brought back into Domino databases, where the system's search and indexing capabilities would provide a greater degree of usability than you might find in other systems.
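The administrator-driven controls described above can be sketched as a small policy object. This is a hypothetical illustration in Python, not Domino code and not the system described in this article: the `Document` and `ArchivePolicy` types, the `category` field, and the `min_age` setting are all assumptions made for the example, and no Domino API is called.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a database document."""
    doc_id: str
    category: str
    last_modified: datetime


@dataclass(frozen=True)
class ArchivePolicy:
    """Settings an administrator might control (all assumed for this sketch)."""
    run_every: timedelta           # how often the archival process runs
    allowed_categories: frozenset  # which kinds of document may be archived
    min_age: timedelta             # only archive documents untouched this long


def is_run_due(policy, last_run, now):
    """True if the process has never run or the configured interval has elapsed."""
    return last_run is None or now - last_run >= policy.run_every


def select_for_archive(documents, policy, now):
    """Return the documents the policy permits archiving, in input order."""
    return [
        doc
        for doc in documents
        if doc.category in policy.allowed_categories
        and now - doc.last_modified >= policy.min_age
    ]
```

A scheduler would call `is_run_due` on each tick and, when it returns true, pass the selected documents to whatever conversion step produces the archive copies.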

