Encoded Archival Description (EAD) Project
The Utah State Archives supports
the ongoing efforts of the archival community in the creation and implementation
of descriptive standards for finding aids. To that end, we began our own
project in April 2000 to convert all of our existing record series inventories
and agency histories to the format specified by the Library
of Congress, in association with the Society of American Archivists'
EAD Roundtable.
This project was completed in August 2000, with a total of 614 series
inventories and 125 agency histories
converted to XML. New finding aids are being added to that number as records
are processed.
Methodology
About eight members of the
Archives staff took the SAA EAD training (using XMetaL) in mid-April 2000.
Up until that time, no one was familiar with XML or stylesheets. Two staff
members had experience with the web and HTML, one of whom also had limited
programming experience. At that time the decision was made that the webmaster
would create the stylesheets and do all of the conversion of the existing
finding aids to EAD/XML; another member of the web staff would create
the template of our rendition of EAD to be used in XMetaL, and validate
all legacy documents after they were marked up; and the remaining staff
would concentrate on creating new finding aids in EAD.
Separating Each Series Inventory
All of our finding aids have
been available on the web since 1996, contained in a Folio infobase, which
essentially is one very large SGML document that is converted to HTML
on the fly when a portion of the infobase is called by a browser. The
first step in moving this information to XML was to move the data out of
Folio and into separate documents, one for each finding aid, named
by series number. The Folio software has a utility that automatically
separates the data into discrete files, based upon the structure (similar
to a table of contents) that infobase creators embed into it. This worked
well for us, as the data were saved in Rich Text Format. All of the new
documents did need to be renamed by hand. Some minimal data cleanup also
needed to be done in each.
About two-thirds of the series
inventories included container lists, most very short, maybe a table
of ten rows, but some very long, with the table extending 100 or more
pages. For purposes of this project, the container lists were removed
from the descriptive text and placed in a separate document temporarily.
The documents with the descriptive
data were then copied to a separate folder and renamed to have a .txt
extension, though they were still in Rich Text Format. When opened in
Notepad, all of the formatting coding was then visible. The coding made
the documents look very messy, but it provided a way to run search-and-replace
commands that would replace the coding with XML markup. The great part
was that the RTF coding was distinctive enough that the search and replace
function could distinguish a <p> from a </p>.
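The kind of replacement described above can be sketched in Python. This is only an illustration, not the procedure actually run in HomeSite: it assumes the paragraph breaks appear as the standard RTF control word \par, and the function name rtf_paragraphs_to_xml is hypothetical. Other RTF control words (fonts, \pard, and so on) are left untouched and would need their own replacements.

```python
import re
from xml.sax.saxutils import escape


def rtf_paragraphs_to_xml(rtf_text: str) -> str:
    """Wrap each run of text that ends in an RTF \\par marker in <p>...</p>.

    Assumption: paragraphs are delimited only by the control word \\par.
    The \\b keeps the pattern from matching longer control words such as \\pard.
    """
    chunks = re.split(r"\\par\b\s*", rtf_text)
    paragraphs = [chunk.strip() for chunk in chunks if chunk.strip()]
    # escape() turns &, < and > into entities so the output stays well-formed.
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


if __name__ == "__main__":
    sample = "First paragraph.\\par Second & last.\\par"
    print(rtf_paragraphs_to_xml(sample))
```

Because every paragraph is emitted as a complete pair of tags, an opening tag can never be written without its matching close.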
Search and Replace
To run the search and replace
commands, the software chosen was Allaire's HomeSite.
This is an inexpensive HTML text editor that happens to come bundled
with Macromedia's Dreamweaver,
the software we use to develop our web pages. HomeSite was flexible enough
to easily work with text files and XML files, and the search and
replace commands could apply to every document contained in a folder,
which was a real time saver.
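For readers without HomeSite, a folder-wide search and replace can be approximated with a short script. The sketch below is an assumption-laden stand-in, not a description of HomeSite's feature: the function name replace_in_folder, the "*.txt" pattern, and the UTF-8 encoding are all illustrative choices.

```python
from pathlib import Path


def replace_in_folder(folder, replacements, pattern="*.txt"):
    """Apply each (old, new) pair, in order, to every matching file in folder.

    Returns the number of files whose contents actually changed.
    Files are rewritten in place, so work on a copy of the folder.
    """
    changed = 0
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        updated = original
        for old, new in replacements:
            updated = updated.replace(old, new)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

The order of the replacement pairs matters when one search string contains another, so longer or more specific strings should come first.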
HomeSite was also customizable
with regard to tags it recognizes. The "snippet" feature was
used to create the various EAD XML tags that could be wrapped around text
with the click of the mouse.
Doing the search and replace
commands, and adding the extra necessary XML tags by hand to the descriptive
text of the 614 series inventories, took three weeks. After those were
completed, they were individually validated against the EAD DTD with XMetaL,
which took about a day. At the time of the project, the Archives had only
one copy of XMetaL, and it was installed in a different building across
town from the PC of the staff member doing the conversion. That was part
of the reason HomeSite was chosen: it was already available.
Stylesheets and Parsing to HTML
As each series inventory was
completed, an XSL stylesheet was used
to transform it to HTML with James Clark's XT
XSLT processor. To create the stylesheet, instructions were followed from
Elliotte Rusty Harold's book XML Bible, published by IDG
Books in 2000. The stylesheet examples from the SAA's EAD Roundtable were
also useful. The stylesheet itself was fairly simple, but met our
needs. We had one in place in about three to four days.
The stylesheet was written
so that the HTML included all of the formatting and comment fields necessary
for our Dreamweaver software to think the HTML was created using a Dreamweaver
template. This was done so that when we update the design of our web page,
all we need to do is update the template, and all associated documents
formed with that template will be changed accordingly.
Container Lists
The container lists, which
up to then had been patiently waiting for attention in separate documents,
were still a problem. They were formatted in tabbed columns, a quasi-table.
The style they had been written in specified that the container numbers
(box, folder, etc.) be listed once and implied dittoes fill up the rows
and cells until the numbers changed. To some extent, this style was also
used in the column that described the container contents. With EAD, we
realized that all of the implied dittoes needed to be filled in with real
data. That was done as each was converted to XML.
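Filling in implied dittoes is mechanical enough to sketch in code. The example below is not how the work was done here (it was done by hand during conversion); it assumes each row is a list of strings, that a blank cell means "same as the row above", and the name fill_implied_dittoes is hypothetical.

```python
def fill_implied_dittoes(rows, ditto_columns):
    """Return a copy of rows with blank cells in ditto_columns filled in.

    A blank cell takes the most recent non-blank value seen in the same
    column. Columns not listed in ditto_columns are left exactly as found.
    """
    filled = []
    previous = {}  # column index -> last explicit value seen
    for row in rows:
        new_row = list(row)
        for col in ditto_columns:
            if new_row[col].strip():
                previous[col] = new_row[col]
            elif col in previous:
                new_row[col] = previous[col]
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    # Columns: box, folder, description
    listing = [
        ["1", "1", "Minutes"],
        ["", "2", "Agendas"],
        ["", "", "Correspondence"],
        ["2", "1", "Reports"],
    ]
    for row in fill_implied_dittoes(listing, ditto_columns=[0, 1]):
        print(row)
```

Whether the description column should also be treated as a ditto column depends on the individual list, which is why the columns are passed in rather than fixed.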
We discovered that the easiest
way to convert our container lists to XML was to use WordPerfect. When
you highlight tabbed columns and then click on "insert table",
the contents of those columns are placed into a WordPerfect table. We
were careful to have only one tab between columns and for the columns
to all line up together from one row to the next. Any variation to this
rule caused problems with the next step.
Then, with the data in a table,
we told WordPerfect to select the table (use the Edit drop-down menu for
this), then hit the delete key. The software then pops up a menu challenging
you about what you actually want to delete. At the very bottom is a feature
to convert the contents of the table to a merge document, using the table
header as field names. Select that choice. This creates a data file that
corresponds with a merge form. We ended up needing about 20 merge forms,
each a variation on a theme of <container>, <unittitle>, and
<unitdate> tags within <c01> and <did> tags. When you
run the merge, the data from the table have the EAD tags perfectly wrapped
around them. The resulting data were then cut/pasted inside the <dsc>
tags in the XML documents with the series descriptions.
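What a merge form does to one table row can be illustrated with a small script. This is a sketch under stated assumptions rather than a reproduction of any of the roughly 20 WordPerfect merge forms: the column order (box, folder, title, date), the type attributes on <container>, and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def row_to_c01(box, folder, title, date=""):
    """Build one <c01> component from a single container-list row."""
    c01 = ET.Element("c01")
    did = ET.SubElement(c01, "did")
    # Assumed layout: one <container> per column, labeled by a type attribute.
    ET.SubElement(did, "container", type="box").text = box
    ET.SubElement(did, "container", type="folder").text = folder
    ET.SubElement(did, "unittitle").text = title
    if date:  # leave <unitdate> out entirely when the row has no date
        ET.SubElement(did, "unitdate").text = date
    return c01


def rows_to_dsc_fragment(rows):
    """Serialize rows as <c01> elements, one per line, for pasting into <dsc>."""
    return "\n".join(
        ET.tostring(row_to_c01(*row), encoding="unicode") for row in rows
    )


if __name__ == "__main__":
    print(rows_to_dsc_fragment([("1", "2", "Minutes & agendas", "1950")]))
```

Building the elements with a library rather than by string concatenation means characters such as & in a title are escaped automatically.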
Although there were fewer container
lists than series descriptions, this part of the project took longer,
about four weeks. In contrast, all 125 agency histories took about four
days to complete. The EAD coding for those was rather minimal
since EAD does not yet support all of the note fields we currently use.
Perfect Printed Copies
Having all of our inventories
and agency histories available to the public (and search engines) in HTML
solved only one part of the problem that finding aids create for us. The
other problem was the creation of a printed copy that looks clean and
professional, with table headers, page numbers, and series numbers printed
on each page. This is the copy used in our Research Room by both staff
and researchers. Previously, all of the finding aids had always had a
master copy in WordPerfect, and another copy used for electronic distribution
(without headers, footers, etc.), and several duplicate copies in people's
personal directories. The hope with EAD is to make the XML copy the master
copy and eliminate any other "master" or duplicate copy which
may exist. All updates should only be done in XML using XMetaL, then run
through a stylesheet.
XML, then, has to be all things to all people. How could
a good Research Room copy be made? After some investigation,
a new stylesheet was created
using XSL formatting objects. (A formatting object stylesheet,
incidentally, is a completely different animal from the HTML
stylesheet. It took four solid days to get working and has
continued to be tweaked ever since.)
The formatting objects file was then converted to a PDF file
using Apache's FOP
software, which is still in development. One of the advantages
of using the PDF file is that the very large series inventories
we have (those over 1 megabyte in size) can be updated in
XML and made available on the web without necessitating the
creation of multiple HTML documents that divide up the contents
into reasonably sized pieces for download (50 KB per page being our
state standard). The resulting PDF file is compressed, so
downloading is much easier.
Conclusions
The process we chose for converting
our finding aids worked well for us. Perhaps a better/faster way exists
that we could have used, but this process was within the skills of existing
staff. The largest of our finding aids (which continue to grow) are difficult
to manage using XML. These may be better housed in a database and made
available online through web reporting software. Still, creating an XML
document once and then using a stylesheet to transform it into something
else is very useful and has made the EAD project worthwhile.