By Nicholas Woodward, Sr. Software Engineer, Texas Digital Library
I’ll start with a little backstory.
Texas Digital Library was formed in 2005 among four ARL libraries in Texas in order to pool resources to build capacity for preserving, managing, and providing access to unique digital collections of enduring value. Among other things, we host institutional repositories, a consortial data repository, Open Access journals, and ETD management tools.
Mission statement of Texas Digital Library
TDL currently has 23 institutions as members, and we are adding a 24th member in Fall 2019. Seventeen of our members use our repository hosting service, which includes:
21 hosted DSpace installations
10 TB of content and approximately 195,000 items
TDL services: usage and stats
Our newest member using our repository hosting service is the University of North Texas Health Science Center (UNTHSC). There is no such thing as a “typical” repository migration, but UNTHSC’s onboarding and migration were more challenging than others because their content was stored in a bepress repository.
The case study below outlines TDL’s process for migrating UNTHSC’s content from bepress to a TDL-hosted DSpace. I originally presented this case study at the DSpace North American User Group hosted by the University of Minnesota Libraries in September 2019.
TDL Members at the DSpace North American User Group Meeting (left to right) Andy Weidner, University of Houston; Nicholas Woodward, Texas Digital Library; Colleen Lyon, UT Austin; James Creel, Texas A&M University; Taylor Davis-Van Atta, University of Houston; Edward Warga, Texas A&M – Corpus Christi
The University of North Texas Health Science Center (UNTHSC) approached the Texas Digital Library (TDL) in spring 2018 about migrating their scholarly repository, hosted on the Digital Commons platform from bepress, to a new DSpace 6.3 repository instance hosted by TDL.
UNTHSC and TDL agreed to collaboratively develop a workflow and timeline for the migration that incorporated existing DSpace tooling, custom code and shared documents.
In case you’re unfamiliar with bepress, here is a screenshot from UNTHSC’s former repository home page. In the image below, you can see the hierarchy of communities/collections that we knew we would need to recreate in DSpace.
Below is a typical item view page in bepress with the standard metadata and option to download. Notice below the Download link the count of the number of times the document has been downloaded.
At the outset of the project we worked with UNTHSC to develop a migration workflow that we could test from beginning to end. We began by working through each step of the process with a subset of the repository to create a minimum viable product, or MVP, of the migration. From there we could iteratively develop greater capabilities until every step of the migration worked for the entire repository.
Here’s what that workflow looked like:
Transfer digital objects along with their metadata
Generate communities and collections in DSpace
Create Simple Archive Format packages
Ingest the packages into DSpace
Customize the look-and-feel and configuration for UNTHSC
Step 1 | Transfer Digital Objects Along with Their Metadata
Step one involved working with UNTHSC’s bepress Archive:
Complete up-to-date backup of the repository
AWS S3 storage
Accessible with AWS credentials
Includes metadata-only items
Below is an example of the metadata files that are in the bepress Archive. One thing to notice is that there are no namespaces, so it is not validated against any schemas. The other thing, and you can’t see it in this graphic, is that the XML is encoded in ISO Latin 1.
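The encoding detail matters because DSpace expects UTF-8 metadata. Here is a minimal sketch of converting an ISO Latin 1 metadata file to UTF-8 in Ruby. The helper names and the sample string are hypothetical, and the sketch assumes the file declares its encoding in the XML declaration.

```ruby
# Re-label the raw bytes as ISO-8859-1, then transcode them to UTF-8.
def latin1_to_utf8(raw_bytes)
  raw_bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Update the XML declaration so later parsers do not decode as Latin 1 again.
def fix_xml_declaration(xml)
  xml.sub(/(encoding=["'])ISO-8859-1(["'])/i, '\1UTF-8\2')
end

# Hypothetical sample: 0xE9 is "e with acute accent" in ISO Latin 1.
sample = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><title>Caf\xE9</title>".b
puts fix_xml_declaration(latin1_to_utf8(sample))
```

Reading the file in binary mode first (for example with `File.binread`) avoids Ruby guessing the wrong encoding before the conversion runs.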
Step 2 | Generate Communities and Collections in DSpace
As with all good software projects, we began with a spreadsheet (see image below). The metadata in the bepress Archive links items to collections in a sort of roundabout fashion. Additionally, UNTHSC wanted to rearrange their repository, creating several new communities/collections and moving some existing items around.
It all starts with a spreadsheet.
Below is a closer view where you can see instances of what will become top-level communities, subcommunities, and collections in DSpace. We needed a way to mark the end nodes of the tree, meaning the collections. We settled on the pipe character, which would eventually serve a second purpose.
If you look after the pipe characters in the spreadsheet, you’ll see there are paths that correspond to the digital objects in the bepress Archive. In this way we could match item with collection, even in cases where the items would be moving to a different collection.
The image below shows the Ruby code that is used to transform the spreadsheet into an input XML file containing the communities and collections that DSpace uses to create them in the repository.
Ruby code to transform the spreadsheet into a hierarchy
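Since the original code is only visible as an image, here is a minimal sketch of the same idea rather than the actual TDL code. It assumes each spreadsheet row has already been read into an array of cells, with community names first and a final `Name|archive/path` cell for a collection; that row layout and all sample names are hypothetical. The output follows the `import_structure` format read by the DSpace structure-builder, with the archive path stored in the collection's `description` (short description) element.

```ruby
require 'rexml/document'

Node = Struct.new(:name, :children, :collections)

# Turn rows such as ["Community", "Subcommunity", "Collection|archive/path"]
# into a nested tree of communities and collections.
def build_tree(rows)
  root = Node.new('root', {}, [])
  rows.each do |cells|
    *community_names, leaf = cells
    node = community_names.reduce(root) do |parent, name|
      parent.children[name] ||= Node.new(name, {}, [])
    end
    name, archive_path = leaf.split('|', 2).map(&:strip)
    if archive_path # a pipe marks a collection
      node.collections << { name: name, archive_path: archive_path }
    else            # no pipe: a community that has no collection on this row
      node.children[name] ||= Node.new(name, {}, [])
    end
  end
  root
end

# Recursively append a community, its subcommunities, and its collections.
def append_community(parent_el, node)
  el = parent_el.add_element('community')
  el.add_element('name').text = node.name
  node.children.each_value { |child| append_community(el, child) }
  node.collections.each do |c|
    col = el.add_element('collection')
    col.add_element('name').text = c[:name]
    # Keep the bepress Archive path in the short description for later matching.
    col.add_element('description').text = c[:archive_path]
  end
end

def structure_xml(rows)
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new('1.0', 'UTF-8')
  root_el = doc.add_element('import_structure')
  build_tree(rows).children.each_value { |c| append_community(root_el, c) }
  out = +''
  doc.write(out) # no pretty-printing, so text values are written unchanged
  out
end

# Hypothetical rows standing in for the spreadsheet.
rows = [
  ['Example School', 'Example Department', 'Theses|example_school/theses'],
  ['Example School', 'Faculty Publications|example_school/facpubs']
]
puts structure_xml(rows)
```

The XML is written without indentation on purpose: pretty-printing with REXML adds whitespace around text nodes, which would change the stored archive paths.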
And below you can see the output of the Ruby code that the DSpace command line job uses as input. Two things to notice here: the DSpace job gives us the new handles of the collections, and we’ve stored the bepress Archive paths in the short description field of the collection.
DSpace command line
Step 3 | Create Simple Archive Format Packages
In order to import the digital objects from bepress into DSpace we quickly settled on the Simple Archive Format (SAF), a DSpace standard for representing items. An SAF package contains the metadata in a standardized format, the digital objects themselves, and a list of the collection(s) they are in.
DSpace Simple Archive Format
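To make the package layout concrete, here is a minimal sketch that writes one SAF item directory: `dublin_core.xml`, the bitstream files, a `contents` manifest, and a `collections` file with one handle per line. The method name, the `item_0001` naming, the sample file, and the handle are all hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Write a single SAF item directory and return its path.
def write_saf_item(package_dir, index, dc_xml:, files:, collection_handles:)
  item_dir = File.join(package_dir, format('item_%04d', index))
  FileUtils.mkdir_p(item_dir)
  File.write(File.join(item_dir, 'dublin_core.xml'), dc_xml)
  files.each { |src| FileUtils.cp(src, item_dir) }
  # `contents` lists one bitstream filename per line.
  File.write(File.join(item_dir, 'contents'),
             files.map { |src| File.basename(src) }.join("\n") + "\n")
  # `collections` lists one collection handle per line; the first is the owner.
  File.write(File.join(item_dir, 'collections'), collection_handles.join("\n") + "\n")
  item_dir
end

Dir.mktmpdir do |tmp|
  pdf = File.join(tmp, 'thesis.pdf')
  File.write(pdf, 'placeholder bytes') # stand-in for a real digital object
  item_dir = write_saf_item(
    File.join(tmp, 'saf'), 1,
    dc_xml: '<dublin_core schema="dc"/>',
    files: [pdf],
    collection_handles: ['123456789/3'] # hypothetical handle
  )
  puts Dir.children(item_dir).sort
end
```

Metadata in schemas other than Dublin Core goes in additional files named `metadata_[schema].xml`, for example `metadata_thesis.xml`, in the same item directory.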
Back to the metadata stored in the bepress Archive… one thing to notice is there are no namespaces, HTML entities have been converted, and most importantly, the encoding is specified as ISO Latin 1.
Metadata mapping: before
Thankfully we have some experience with mapping metadata to a range of standards and formats and metadata application profiles. We repurposed some existing code to map the bepress archive metadata into both unqualified Dublin Core and the thesis namespace for UNTHSC’s electronic theses and dissertations.
The end result is metadata in a format that DSpace is expecting for the Simple Archive Format packages.
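As an illustration of this kind of mapping, here is a minimal sketch that turns a source record into the `dcvalue` XML used by SAF, split by schema. The source element names in `FIELD_MAP` and the sample record are assumptions made for the example, not the actual bepress field names or the mapping TDL used.

```ruby
require 'rexml/document'

# Hypothetical mapping: source element name => [schema, element, qualifier].
FIELD_MAP = {
  'title'            => %w[dc title none],
  'abstract'         => %w[dc description none],
  'publication-date' => %w[dc date none],
  'degree-name'      => %w[thesis degree name]
}.freeze

# Returns a hash of schema name => XML string in the SAF metadata format.
def map_metadata(source_xml)
  source = REXML::Document.new(source_xml)
  outputs = Hash.new do |hash, schema|
    doc = REXML::Document.new
    doc.add_element('dublin_core', 'schema' => schema)
    hash[schema] = doc
  end
  FIELD_MAP.each do |name, (schema, element, qualifier)|
    REXML::XPath.each(source, "//#{name}") do |node|
      value = node.text.to_s.strip
      next if value.empty?
      dcvalue = outputs[schema].root.add_element(
        'dcvalue', 'element' => element, 'qualifier' => qualifier
      )
      dcvalue.text = value
    end
  end
  outputs.transform_values(&:to_s)
end

# SAF file name for a schema: dublin_core.xml for dc, metadata_<schema>.xml otherwise.
def saf_filename(schema)
  schema == 'dc' ? 'dublin_core.xml' : "metadata_#{schema}.xml"
end

SAMPLE = <<~XML
  <documents>
    <document>
      <title>Example thesis title</title>
      <abstract>Short abstract.</abstract>
      <publication-date>2015-05-01</publication-date>
      <degree-name>Master of Science</degree-name>
    </document>
  </documents>
XML

map_metadata(SAMPLE).each do |schema, xml|
  puts "#{saf_filename(schema)}: #{xml}"
end
```

Keeping the mapping in a single table makes it easy to review with the repository's owners and to extend when new fields turn up.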
Step 4 | Ingest SAF Packages into DSpace
By matching the bepress Archive path in an item's metadata with the short description of the corresponding collection, we can assign each item to its collection in the SAF package.
Adding items to a collection
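The matching step can be sketched as follows. The code reads the XML written by the structure-builder job, builds a lookup from each collection's short description (the bepress Archive path) to its new handle, and then finds the handle for an item path. The sample XML, the handles, and the assumption that an item's path begins with its collection's path are illustrative only.

```ruby
require 'rexml/document'

# Build { archive_path => handle } from the structure-builder output.
def handles_by_archive_path(structure_output_xml)
  doc = REXML::Document.new(structure_output_xml)
  REXML::XPath.match(doc, '//collection').each_with_object({}) do |col, map|
    desc = col.elements['description']
    path = desc ? desc.text.to_s.strip : ''
    map[path] = col.attributes['identifier'] unless path.empty?
  end
end

# Pick the collection whose archive path is the longest prefix of the item path.
def handle_for(item_path, handle_map)
  prefix = handle_map.keys
                     .select { |p| item_path == p || item_path.start_with?("#{p}/") }
                     .max_by(&:length)
  prefix && handle_map[prefix]
end

# Hypothetical output file content with placeholder handles.
OUTPUT_XML = <<~XML
  <imported_structure>
    <community identifier="123456789/1">
      <name>Example School</name>
      <collection identifier="123456789/3">
        <name>Theses</name>
        <description>example_school/theses</description>
      </collection>
    </community>
  </imported_structure>
XML

handle_map = handles_by_archive_path(OUTPUT_XML)
puts handle_for('example_school/theses/12', handle_map)
```

The returned handle is what goes into the item's `collections` file in its SAF package.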
The metadata stored in the bepress Archive is almost complete, but early on in the process we discovered that there were a handful of fields that weren’t present in S3 but were available via the OAI-PMH feed. These included the dc.type and dc.format fields. We were able to associate the items in the bepress Archive with their metadata records in OAI-PMH and add those fields to the final metadata.
Additionally, UNTHSC wanted to preserve the download statistics their items had accumulated in Digital Commons. This information is not stored in the bepress Archive or the OAI-PMH feed, but it is available via custom metadata exports in bepress. So again, we found a way to associate all of the download statistics with an item, and after UNTHSC determined they wanted to store that info in dc.provenance.legacyDownloads we added it to the mix:
OAI-PMH metadata feed | dc.type and dc.format fields
Custom metadata reports from bepress | Legacy download statistics – dc.provenance.legacyDownloads
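Combining the three sources amounts to a join on a shared key. Here is a minimal sketch that assumes every source can be keyed by the item's URL; that key, the data shapes, and the sample values are assumptions, since the actual association method is not described above.

```ruby
# archive_items:   [{ url: ..., metadata: { 'dc.title' => ... } }, ...]
# oai_records:     { url => { type: ..., format: ... } }
# download_counts: { url => Integer }
def merge_supplementary(archive_items, oai_records, download_counts)
  archive_items.map do |item|
    key = item[:url]
    extra = oai_records.fetch(key, {})
    count = download_counts[key]
    merged = item[:metadata].merge(
      'dc.type' => extra[:type],
      'dc.format' => extra[:format],
      'dc.provenance.legacyDownloads' => count && count.to_s
    ).compact # drop fields that had no value in the other sources
    item.merge(metadata: merged)
  end
end

# Hypothetical sample data.
items = [{ url: 'https://example.org/theses/1', metadata: { 'dc.title' => 'Example' } }]
oai = { 'https://example.org/theses/1' => { type: 'Thesis', format: 'application/pdf' } }
downloads = { 'https://example.org/theses/1' => 128 }

puts merge_supplementary(items, oai, downloads).first[:metadata]
```

Items with no match in a source simply keep their archive metadata, which makes gaps easy to spot and report.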
Step 5 | Customize Configuration for UNTHSC
In order to put it all together, we:
Utilized Ansible playbooks to launch DSpace 6.3 instance with Shibboleth authentication, ORCID integration, etc.
Executed a final sync process to get UNTHSC’s repository from bepress Archive
Generated communities/collections hierarchy from the spreadsheet
Built SAF packages for all items
Executed a DSpace import command line job to ingest all repository content, followed by Solr indexing and media filtering
Enabled GUI customization
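The command line portion of this list can be sketched as a short Ruby script that assembles the DSpace launcher commands in order. It only prints the commands rather than running them. The paths and the eperson address are hypothetical, and the exact options TDL used are not stated here, so treat the flags as a plausible DSpace 6 invocation, not a record of what was run.

```ruby
require 'shellwords'

# Build the DSpace commands for structure creation, import, and post-processing.
def migration_commands(dspace_bin:, eperson:, structure_in:, structure_out:, saf_dir:, mapfile:)
  [
    # Create communities and collections from the generated XML.
    [dspace_bin, 'structure-builder', '-f', structure_in, '-o', structure_out, '-e', eperson],
    # Ingest the SAF packages; each item's `collections` file names its targets.
    [dspace_bin, 'import', '-a', '-e', eperson, '-s', saf_dir, '-m', mapfile],
    # Update the Solr discovery index.
    [dspace_bin, 'index-discovery'],
    # Generate thumbnails and extracted text.
    [dspace_bin, 'filter-media']
  ]
end

commands = migration_commands(
  dspace_bin: '/dspace/bin/dspace',      # hypothetical install path
  eperson: 'admin@example.org',          # hypothetical administrator account
  structure_in: 'structure.xml',
  structure_out: 'structure_out.xml',
  saf_dir: 'saf',
  mapfile: 'import.map'
)
commands.each { |cmd| puts Shellwords.join(cmd) }
```

Keeping the commands as argument arrays avoids shell quoting problems if they are later passed to `system(*cmd)`.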
Ta da! Have a look at the final product: A TDL-hosted repository for UNTHSC.
UNTHSC Scholar: DSpace home page
UNTHSC Scholar: Communities
UNTHSC Scholar: item view
UNTHSC Scholar: legacy stats
SUCCESSES AND CHALLENGES
Our successes included:
Migrated ~3,700 items w/metadata and supporting objects
Developed workflow using existing DSpace tooling, spreadsheets, and custom Ruby to “glue” the steps together
Achieved repository alignment with an open, supportive consortium of Texas Digital Library
Our biggest challenges were:
Text encoding issues with the ISO Latin 1 format
New items and metadata edits were slow to appear in the bepress Archive
Metadata in the bepress Archive lacked key fields like type and format
UNTHSC’s new DSpace repository was launched on September 16. We finished DSpace services setup, including Google Analytics, RDF, etc., and completed the remaining GUI and configuration customization. We will soon publish our workflow Ruby code to Texas Digital Library’s GitHub repository.
Production release was October 21st. You can view the UNTHSC Scholar repository at https://unthsc-ir.tdl.org/.