Utah History Research Center Utah State Archives
 


Encoded Archival Description (EAD) Project

The Utah State Archives supports the ongoing efforts of the archival community in the creation and implementation of descriptive standards for finding aids. To that end, we began our own project in April 2000 to convert all of our existing record series inventories and agency histories to the format specified by the Library of Congress, in association with the Society of American Archivists' EAD Roundtable. This project was completed in August 2000, with a total of 614 series inventories and 125 agency histories converted to XML. New finding aids are being added to that number as records are processed.

Methodology

About eight members of the Archives staff took the SAA EAD training (using XMetaL) in mid-April 2000. Until then, no one on staff was familiar with XML or stylesheets. Two staff members had experience with the web and HTML, and one of them also had limited programming experience. We decided to divide the work as follows: the webmaster would create the stylesheets and convert all of the existing finding aids to EAD/XML; another member of the web staff would create the XMetaL template for our rendition of EAD and validate all legacy documents after they were marked up; and the remaining staff would concentrate on creating new finding aids in EAD.

Separating Each Series Inventory

All of our finding aids have been available on the web since 1996, contained in a Folio infobase, which is essentially one very large SGML document that is converted to HTML on the fly when a browser requests a portion of it. The first step in moving this information to XML was to move the data out of Folio and into separate documents, one for each finding aid, named by series number. The Folio software has a utility that automatically separates the data into discrete files, based upon the structure (similar to a table of contents) that infobase creators embed into it. This worked well for us, as the data were saved in Rich Text Format. All of the new documents did need to be renamed by hand, and each needed some minimal data cleanup.

About two-thirds of the series inventories included container lists. Most were very short, perhaps a table of ten rows, but some were very long, with the table extending 100 or more pages. For this project, the container lists were removed from the descriptive text and placed temporarily in a separate document.

The documents with the descriptive data were then copied to a separate folder and renamed with a .txt extension, though their contents were still Rich Text Format. When one was opened in Notepad, all of the formatting codes were visible. The codes made the documents look very messy, but they provided a way to run search-and-replace commands that swapped the RTF codes for XML markup. Best of all, the RTF codes were distinctive enough that a search and replace could tell where an opening <p> belonged and where a closing </p> belonged.
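The kind of replacement described above can be sketched in a few lines of Python. This is only an illustration, not the commands we actually ran: the RTF control words shown here (\pard at the start of a paragraph and \par at its end) are assumptions, since the exact coding that Folio exported is not recorded on this page, and the function name rtf_to_xml is hypothetical.

```python
import re

# Ordered (pattern, replacement) pairs.
# ASSUMPTION: "\pard" marks the start of a paragraph and "\par" its end.
# The real export may have used different or additional control words.
RTF_TO_XML = [
    (re.compile(r"\\pard\s?"), "<p>"),
    # \b keeps this pattern from matching the first four characters of "\pard".
    (re.compile(r"\\par\b\s?"), "</p>"),
]


def rtf_to_xml(text: str) -> str:
    """Replace the assumed RTF paragraph codes with XML paragraph tags."""
    for pattern, replacement in RTF_TO_XML:
        text = pattern.sub(replacement, text)
    return text


sample = r"\pard Minutes of the board.\par \pard Annual reports.\par"
print(rtf_to_xml(sample))
```

Because the opening and closing codes differ, each can be mapped to its own tag without having to track whether the replacement is "inside" a paragraph.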

Search and Replace

We chose Allaire's HomeSite to run the search-and-replace commands. HomeSite is an inexpensive HTML text editor that happens to come bundled with Macromedia's Dreamweaver, the software we use to develop our web pages. It was flexible enough to work easily with both text files and XML files, and a search-and-replace command could be applied to every document in a folder, which was a real time saver.
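For readers without HomeSite, a folder-wide search and replace can be approximated with a short script. The sketch below is a stand-in, not a description of HomeSite itself; the function name replace_in_folder, the file name, the cp1252 encoding, and the sample RTF codes are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def replace_in_folder(folder, replacements, pattern="*.txt", encoding="cp1252"):
    """Apply literal (old, new) replacements, in order, to every matching file.

    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding=encoding)
        updated = original
        for old, new in replacements:
            updated = updated.replace(old, new)
        if updated != original:
            path.write_text(updated, encoding=encoding)
            changed.append(path.name)
    return changed


# Demonstration in a throwaway folder (hypothetical file name and RTF codes).
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "00001.txt").write_text(r"\pard Minutes\par", encoding="cp1252")
    # Order matters for literal replacements: the longer code goes first,
    # otherwise "\par" would also match the start of "\pard".
    print(replace_in_folder(folder, [("\\pard ", "<p>"), ("\\par", "</p>")]))
```

Working on a copy of the folder is advisable, since the script overwrites files in place.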

HomeSite was also customizable with regard to tags it recognizes. The "snippet" feature was used to create the various EAD XML tags that could be wrapped around text with the click of the mouse.

Running the search-and-replace commands and adding the remaining XML tags by hand to the descriptive text of the 614 series inventories took three weeks. Once the documents were complete, each was validated against the EAD DTD with XMetaL, which took about a day. At the time of the project, the Archives had only one copy of XMetaL, and it was installed on a PC in a different building across town from the staff member doing the conversion. That was part of the reason HomeSite was chosen: it was already available.
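When the validating editor is not close at hand, a quick well-formedness check can catch mismatched or unclosed tags before the documents are sent for validation. The sketch below is not DTD validation (Python's standard library has no validating parser) and is not something we used in the project; the function name find_malformed and the document names are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_malformed(documents):
    """Return {name: parser error} for each document that is not well-formed XML.

    This checks well-formedness only. It does NOT validate against the EAD DTD.
    """
    problems = {}
    for name, text in documents.items():
        try:
            ET.fromstring(text)
        except ET.ParseError as err:
            problems[name] = str(err)
    return problems


# Hypothetical documents: the second one is missing a </unittitle> tag.
docs = {
    "00001.xml": "<ead><archdesc><did><unittitle>Minutes</unittitle></did></archdesc></ead>",
    "00002.xml": "<ead><archdesc><did><unittitle>Minutes</did></archdesc></ead>",
}
print(sorted(find_malformed(docs)))
```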

Stylesheets and Parsing to HTML

As each series inventory was completed, an XSL stylesheet was used to transform it to HTML with James Clark's XT, an XSLT processor. To create the stylesheet, we followed instructions from Elliotte Rusty Harold's book XML Bible, published by IDG Books in 2000. The stylesheet examples from the SAA's EAD Roundtable were also useful. The stylesheet itself was fairly simple, but it met our needs. We had one in place in about three to four days.
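Our stylesheet is not reproduced here. To give a sense of what such a transformation does, the sketch below performs a similar EAD-to-HTML conversion in Python with ElementTree rather than XSLT. The sample document is a simplified fragment (it would not validate against the EAD DTD), and the function name ead_to_html, the sample content, and the HTML layout are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified, hypothetical EAD fragment (no <eadheader>, so not DTD-valid).
EAD_SAMPLE = """<ead>
  <archdesc level="series">
    <did>
      <unitid>Series 00001</unitid>
      <unittitle>Board minutes</unittitle>
    </did>
    <scopecontent>
      <p>Minutes of regular meetings.</p>
      <p>Includes agendas.</p>
    </scopecontent>
  </archdesc>
</ead>"""


def ead_to_html(ead_xml: str) -> str:
    """Build a minimal HTML page from the series title and scope notes."""
    root = ET.fromstring(ead_xml)
    title = root.findtext("./archdesc/did/unittitle", default="")
    unitid = root.findtext("./archdesc/did/unitid", default="")
    paragraphs = [p.text or "" for p in root.findall("./archdesc/scopecontent/p")]
    lines = [
        "<html>",
        f"<head><title>{escape(title)}</title></head>",
        "<body>",
        f"<h1>{escape(unitid)}: {escape(title)}</h1>",
    ]
    lines += [f"<p>{escape(text)}</p>" for text in paragraphs]
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


print(ead_to_html(EAD_SAMPLE))
```

An XSLT stylesheet expresses the same mapping declaratively, with one template per EAD element, which is easier to maintain once the number of elements grows.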

The stylesheet was written so that the HTML included all of the formatting and comment fields necessary for our Dreamweaver software to think the HTML was created using a Dreamweaver template. This was done so that when we update the design of our web page, all we need to do is update the template, and all associated documents formed with that template will be changed accordingly.

Container Lists

The container lists, which up to then had been patiently waiting for attention in separate documents, were still a problem. They were formatted in tabbed columns, a quasi-table. The style they had been written in specified that the container numbers (box, folder, etc.) be listed once and implied dittoes fill up the rows and cells until the numbers changed. To some extent, this style was also used in the column that described the container contents. With EAD, we realized that all of the implied dittoes needed to be filled in with real data. That was done as each was converted to XML.
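Filling in the implied dittoes amounts to carrying the last stated value down a column until a new value appears. We did this by hand during conversion; the sketch below shows how the same step might be scripted. The column layout (box, folder, contents, date), the sample rows, and the function name fill_dittoes are assumptions.

```python
def fill_dittoes(rows, columns=(0, 1)):
    """Copy the last stated value into blank cells of the given columns.

    Only the listed columns are filled, so a cell that is blank on purpose
    in another column (for example a missing date) is left alone.
    """
    filled = []
    last_seen = {}
    for row in rows:
        row = list(row)
        for index in columns:
            if index >= len(row):
                continue
            if row[index].strip():
                last_seen[index] = row[index]
            elif index in last_seen:
                row[index] = last_seen[index]
        filled.append(row)
    return filled


# Hypothetical tab-delimited container list: box, folder, contents, date.
raw = "1\t1\tMinutes\t1950\n\t2\tMinutes\t1951\n\t\tAgendas\t1951\n2\t1\tReports\t1952"
rows = [line.split("\t") for line in raw.splitlines()]
for row in fill_dittoes(rows):
    print(row)
```

The contents column can be added to the columns argument for lists that use dittoes there as well, but that is worth checking list by list, since a blank description is not always an implied ditto.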

We discovered that the easiest way to convert our container lists to XML was to use WordPerfect. When tabbed columns are highlighted and "insert table" is clicked, the contents of those columns are placed into a WordPerfect table. We were careful to have only one tab between columns and to make the columns line up from one row to the next. Any deviation from this rule caused problems with the next step.

Then, with the data in a table, we told WordPerfect to select the table (using the Edit drop-down menu) and pressed the Delete key. The software then displays a dialog asking what exactly should be deleted. At the very bottom is an option to convert the contents of the table to a merge document, using the table header as field names. Choosing that option creates a data file that corresponds with a merge form. We ended up needing about 20 merge forms, each a variation on a theme of <container>, <unittitle>, and <unitdate> tags within <c01> and <did> tags. When the merge is run, the data from the table come out with the EAD tags wrapped around them. The resulting data were then cut and pasted inside the <dsc> tags in the XML documents with the series descriptions.
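The merge step works like a template: each row of the table supplies values for named fields, and the form supplies the tags. A rough Python equivalent of one such form is sketched below. It is not one of our actual merge forms; the field names (box, folder, title, date), the type attributes, and the function name merge are assumptions.

```python
from xml.sax.saxutils import escape

# One hypothetical "merge form": EAD tags with a named slot for each field.
MERGE_FORM = (
    "<c01><did>"
    '<container type="box">{box}</container>'
    '<container type="folder">{folder}</container>'
    "<unittitle>{title}</unittitle>"
    "<unitdate>{date}</unitdate>"
    "</did></c01>"
)


def merge(rows, form=MERGE_FORM):
    """Wrap each row's field values in the tags supplied by the form."""
    components = []
    for row in rows:
        # Escape &, < and > so the data cannot break the markup.
        safe = {field: escape(value) for field, value in row.items()}
        components.append(form.format(**safe))
    return "\n".join(components)


rows = [
    {"box": "1", "folder": "1", "title": "Minutes", "date": "1950"},
    {"box": "1", "folder": "2", "title": "Minutes & agendas", "date": "1951"},
]
print(merge(rows))
```

Each variation in the container lists (for example, lists with no folder column) would need its own form, which is consistent with the roughly 20 forms the project required.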

Although there were fewer container lists than series descriptions, this part of the project took longer, about four weeks. In contrast, all 125 agency histories took only about four days to complete. The EAD coding for those was rather minimal, since EAD does not yet support all of the note fields we currently use.

Perfect Printed Copies

Having all of our inventories and agency histories available to the public (and search engines) in HTML solved only one part of the problem that finding aids create for us. The other part was producing a printed copy that looks clean and professional, with table headers, page numbers, and series numbers printed on each page. This is the copy used in our Research Room by both staff and researchers. Previously, every finding aid had a master copy in WordPerfect, another copy for electronic distribution (without headers, footers, etc.), and several duplicate copies in people's personal directories. The hope with EAD is to make the XML copy the master copy and eliminate any other "master" or duplicate copy that may exist. All updates should be made only in XML using XMetaL, then run through a stylesheet.

XML, then, has to be all things to all people. How could a good Research Room copy be made? After some investigation, a new stylesheet was created using XSL formatting objects. (A formatting object stylesheet, incidentally, is a completely different animal from the HTML stylesheet. It took four solid days to get working and has continued to be tweaked ever since.) The formatting objects file was then converted to a PDF file using Apache's FOP software, which is still in development. One advantage of the PDF file is that our very large series inventories (those over 1 megabyte in size) can be updated in XML and made available on the web without having to create multiple HTML documents that divide the contents into pieces of reasonable size for download (50 KB per page being our state standard). The resulting PDF file is compressed, so downloading is much easier.
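To see why splitting the largest inventories into HTML pages is unattractive, it helps to estimate how many pages the 50 KB limit would require. The calculation below is a rough sketch only: it assumes 1 KB means 1,024 bytes, treats the HTML output as the same size as the source, and ignores the header and navigation markup each page would add. The function name pages_needed is hypothetical.

```python
import math

KB = 1024  # ASSUMPTION: binary kilobytes; the state standard may count differently


def pages_needed(document_bytes, page_limit_bytes=50 * KB):
    """Smallest number of pages that keeps every page at or under the limit."""
    return math.ceil(document_bytes / page_limit_bytes)


# A 1 megabyte inventory split into pages of at most 50 KB each.
print(pages_needed(1024 * KB))
```

Every one of those pages would have to be regenerated and relinked whenever the inventory changed, whereas the PDF is a single file.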

Conclusions

The process we chose for converting our finding aids worked well for us. Perhaps a better/faster way exists that we could have used, but this process was within the skills of existing staff. The largest of our finding aids (which continue to grow) are difficult to manage using XML. These may be better housed in a database and made available online through web reporting software. Still, creating an XML document once and then using a stylesheet to transform it into something else is very useful and has made the EAD project worthwhile.

This page was last updated March 18, 2002.
