
Minnesota Digital Library and HathiTrust Image Preservation Prototype Project Report


Contents

  • 0.1. Executive Summary
  • 0.2. Background
    • 0.2.1. Selected Acronyms
  • 0.3. Project Participants
  • 1. Execution
    • 1.1. Process
      • 1.1.1. Content Selection
      • 1.1.2. Extraction
      • 1.1.3. Reformatting
      • 1.1.4. Packaging
      • 1.1.5. Transfer
      • 1.1.6. Ingest
      • 1.1.7. Display & Retrieval
    • 1.2. Technology
      • 1.2.1. Stage One: Simple Contone
      • 1.2.2. Stage Two: Compound Objects
      • 1.2.3. Stage Three: Mixed JPEG Images from MHS
      • 1.2.4. Catalog metadata
    • 1.3. Costs
    • 1.4. Project Management
  • 2. Lessons
    • 2.1. Technical Issues
      • 2.1.1. Master of illusion
      • 2.1.2. Paper cuts can draw blood
      • 2.1.3. The missing fix
      • 2.1.4. Existing metadata is all over the map
      • 2.1.5. Assumed identities
      • 2.1.6. Revisiting code
      • 2.1.7. TMI
      • 2.1.8. Images are big
      • 2.1.9. Distribution sensitivity
      • 2.1.10. Metadata manipulation
      • 2.1.11. Projects do end
      • 2.1.12. No free lunch
    • 2.2. Archival Issues
      • 2.2.1. Format requirements
      • 2.2.2. Metadata requirements
        • 2.2.2.1. Descriptive Metadata
        • 2.2.2.2. Technical Metadata
      • 2.2.3. Rights management
        • 2.2.3.1. MN Collections
        • 2.2.3.2. HT philosophy
        • 2.2.3.3. Future considerations
      • 2.2.4. Overall Archival Findings
    • 2.3. Governance Issues
      • 2.3.1. Issues and Needs
  • 3. Alternatives
    • 3.1. Research institution-based external service providers
      • 3.1.1. Chronopolis (text by Katherine Skinner and David Minor)
    • 3.2. Open Source solutions for internal service hosting
      • 3.2.1. DAITSS (text by Katherine Skinner and Priscilla Caplan)
      • 3.2.2. LOCKSS
    • 3.3. Collaborative solutions for community-based preservation services
      • 3.3.1. MetaArchive Cooperative
  • 4. Appendices
    • 4.1. Governance Models for Collaborative Preservation
    • 4.2. MDL-HT Hot Spots
    • 4.3. MDL-HT Image Ingest Prototype Guidelines
    • 4.4. MDL-HT Specifications for Reflections Continuous Tone Images
    • 4.5. MDL-HT Specifications for Reflections Compound Objects
    • 4.6. MDL-HT Specifications for MHS objects
    • 4.7. Sample MDL METS file for HT


Executive Summary

Between September and December 2010, the Minnesota Digital Library (MDL) collaborated with HathiTrust (HT) to integrate content from Minnesota Reflections and the Minnesota Historical Society (MHS) into HathiTrust as a preservation archive. Before Christmas 2010, nearly 50,000 images from Minnesota Reflections were successfully transferred to HathiTrust, with another 8,000 MHS images due to be shipped soon after the holiday. HT plans to ingest these images and their metadata in early January.

While building this preservation archive, MDL and HT learned a number of lessons:

  • Many local institutions' master images do not meet HT's formatting standards, necessitating transformation and the inclusion of embedded metadata. Additionally, numerous master images lack essential fixity checks, making it challenging for potential participants to comply with HT's stringent format requirements.

• Items in the preservation archive require unique identifiers and that identifier namespace needs continual careful attention as new collections are added.

  • Mapping metadata from local systems is difficult to routinize and needs continuous attention for effective long-term preservation. Properly packaging items for handling and transfer can also be quite time-consuming, and the differing viewpoints of metadata developers and librarians further complicate the task.

  • The pilot project shed light on differences of intent surrounding metadata; some issues were resolved, while others require further review before a long-term program can be implemented effectively.

• A programmer would be required in any long-term effort to integrate new collections, building scripts to do metadata mapping and packaging of objects.

  • Establishing a trusting relationship with clearly defined responsibilities is essential for effective data transfer. This is particularly important since the Minnesota Digital Library (MDL) may receive more information than necessary for its archiving tasks.

  • The image data from Minnesota Reflections is significantly larger than the page images of books in the HT collections, posing challenges for data transfer and package ingestion.

  • Local institutions may be more sensitive about image dissemination than HT expects. Rights issues have posed key challenges for MDL and HT during this pilot project; the project partners currently hold different expectations and requirements regarding rights and display.

• Once descriptions are ingested into HT, only the catalog information can usually be changed.

• This model cost about $1 per image up front and $0.10 per image in ongoing maintenance.

Delays in packaging requirements pushed the ingest timeline to late December and early January, beyond the report's preparation period. The team will evaluate the lessons learned from the ingest process once it is completed. A key takeaway for the team is that projects must reach a conclusion, regardless of whether goals are met. Both MDL and HT staff intend to continue working on ingest and retrieval into January, even in the absence of the project manager or preservation consultant.

Preservation services span a range from basic bit-level maintenance to comprehensive preservation solutions that include a fully operational access-oriented content catalog, with HT positioned at the highest end of this spectrum. In this pilot project, MDL investigated the processes and workflows necessary for various MN institutions to engage in HT preservation services. As MDL and HT consider a potential long-term partnership, sharing the pilot's findings with representatives from diverse institutions could help align their needs and capabilities with the preservation service model.

MDL has successfully coordinated and hosted Minnesota Reflections, collaborating informally with various Minnesota-based institutions. Now, MDL aims to introduce a new program that will provide preservation services to enhance the safeguarding of cultural and historical resources.

The Sponsors Group of the Minnesota-based constituency project has determined that to provide effective preservation services, MDL must establish a more formal governance structure, policies, and documentation than what was used in the Minnesota Reflections project.

MDL-HT Image Preservation Prototype Report / 2010-12-31 / Executive Summary 2

Background

Selected Acronyms

ACHF: Arts and Cultural Heritage Fund, also called “Legacy” funds

AIP: Archive Information Package of the Open Archival Information System reference model

DIP: Dissemination Information Package of the Open Archival Information System reference model

JP2: JPEG2000, an image compression standard

MDL: Minnesota Digital Library Coalition

METS: Metadata Encoding and Transmission Standard, a kind of XML wrapper for all kinds of information about objects

PREMIS: PREservation Metadata: Implementation Strategies, a strategy for describing events in an object's history using XML; in our case included as part of the METS file

SIP: Submission Information Package of the Open Archival Information System reference model

TPT: Twin Cities Public Television


XML: Extensible Markup Language, a syntax for the exchange of structured data

XMP: Extensible Metadata Platform, for storing information relating to the contents of a file


Project Participants

The project was sponsored by John Butler of the University of Minnesota-Twin Cities, Bill DeJohn of Minitex, Bob Horton of the Minnesota Historical Society, and John Weise representing HathiTrust at the University of Michigan.

The report was prepared by project manager Eric Celeste, a consultant based in Saint Paul, Minnesota, alongside digital preservation consultant Katherine Skinner, who serves as the Executive Director of Educopia Institute.

Bill Tantzen, a software developer at the University of Minnesota Libraries, was the primary developer of this project, with Jason Roy, the Digital Content and Software Development Coordinator, serving as a key member of the team. Additionally, Greta Bahnemann contributed support to the development efforts.

As the project progressed, a number of staff from the University of Michigan Libraries joined the effort to represent the interests of HathiTrust. These included Aaron Elkiss, Shane Beers, Tim Prettyman, Jeremy York, and Chris Powell.

As we embarked on the Minnesota Historical Society stage of the project, Karen Lovaas and Jane Wong of MHS also joined the team.


Execution

Process

The project timeline was highly ambitious, aiming to condense the entire prototype effort into the final months of 2010. MDL sought to establish a workflow for transferring digital image data and its associated metadata from Minnesota Reflections to HathiTrust, to move a specific collection of images into the HT system, and to establish suitable display options for those images. The project was scheduled to run from September until the end of November, with December designated for progress reporting. Although the timeline slipped by a few weeks and not all goals were met, valuable lessons emerged from the experience.

The project team settled on a workflow with six stages of processing:

(1) extracting master images and metadata from current repositories;
(2) reformatting images to meet HathiTrust requirements;
(3) packaging binaries and their associated metadata for ingest;
(4) transferring the packages to HathiTrust;
(5) ingesting the packages at HT; and
(6) enabling display and retrieval of images from HathiTrust.

The first four steps were successfully completed prior to this report; the final two, ingesting the packages and enabling the display and retrieval of images from HathiTrust, remain unfinished.
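The six stages above can be sketched as a simple ordered pipeline. The stage names and the `run_pipeline` helper below are illustrative inventions, not the project's actual code; they only model the fact that the prototype completed the first four stages.

```python
# Illustrative sketch of the six-stage workflow; all names here are
# hypothetical, not MDL's actual scripts.

STAGES = [
    "extract",    # pull master images and metadata from the repository
    "reformat",   # convert images to HT-acceptable formats (e.g. TIFF -> JP2)
    "package",    # bundle binaries and METS metadata into a SIP
    "transfer",   # ship the SIP to HathiTrust
    "ingest",     # HT validates and ingests the package
    "display",    # images become retrievable from HT
]

def run_pipeline(item, completed_through="transfer"):
    """Run stages in order, stopping after `completed_through`
    (the prototype finished only the first four stages)."""
    done = []
    for stage in STAGES:
        done.append(stage)
        if stage == completed_through:
            break
    return done
```
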

MDL aimed to explore various content types during the prototype, ranging from simple continuous tone images to complex compound objects featuring a series of structurally related images, as well as images with text and associated optical character recognition (OCR) text. The demonstration content would include approximately 50,000 images from Minnesota Reflections and a 10,000-image subset from the MHS collection management system. Although newspaper data was initially considered, it was ultimately excluded due to time constraints. The project manager outlined three stages for the effort: transferring simple continuous tone images in stage one, moving on to compound objects in stage two, and finally transferring images from the Minnesota Historical Society's content management system in stage three. Each stage would provide valuable lessons for the developers and HT staff to enhance the subsequent phases.

After a brief period in September during which the project manager shared and revised the workplan and guidelines (available in an appendix) with the project team, work on stage one of the process began in October 2010. Rather than detailing the chronological sequence of events, this report focuses on how the workflow for each stage developed over time.

The initial phase of each stage involved precisely identifying the content designated for transfer. For the two Minnesota Reflections stages, this meant clarifying how the programmer would differentiate between simple continuous tone images and those associated with compound objects. For the MHS stage, the project manager dedicated time to assessing existing MHS content to pinpoint materials that could be used within the limited timeframe available.

Content selection for this project was, by design, narrow, drawing on well-defined and easily accessible materials. While the project team chose some of the least complex and most thoroughly described materials from MHS Collections Online, it is important to recognize that a broader preservation effort would involve a wider variety of diverse content not included in this specific project.

Before each project stage, the project manager created a detailed specification document to guide programmers and inform the HT team about expectations. These documents, samples of which are included in the appendices, outlined the definitions and development of identifiers and metadata, as well as the structure of HT packages. Each specification was based on a set of guidelines established by the team at the project's outset, emphasizing the need for pragmatic approaches to meet deadlines while recognizing that certain shortcuts acceptable for prototypes may need reevaluation for long-term projects.

Completing the prototype mission required both the master images from participating collections and enough descriptive metadata to serve the needs of the HT catalog. The first two stages involved images stored at the University of Minnesota Libraries, which made retrieval straightforward for the UMN-based MDL programmer. The Minnesota Historical Society (MHS) opted to deliver its images on a hard drive, providing all master images from its collection management system to simplify the process.

This yielded the approximately 8,000 images needed for the project. Participants' local presence facilitated the image gathering process, and MDL was entrusted to securely erase the hard disk after the project, ensuring that the remaining images would not be misused.

Extracting metadata from Minnesota Reflections posed challenges due to its reliance on the OAI-PMH protocol, which offers Dublin Core descriptive metadata. The unqualified Dublin Core provided by CONTENTdm lacked many nuances present in the original descriptions, complicating the metadata extraction process. Additionally, acquiring data for compound objects was problematic, as these were often excluded from the OAI harvest. Despite the project programmer and the MDL staff working within the same organization, close coordination was essential to successfully conduct the necessary harvests without disrupting Minnesota Reflections.

The Minnesota Historical Society (MHS) did not have accessible OAI Dublin Core data, leading staff to collaborate with the project manager to review their system. They concluded that the XML output from the cataloging side of their content management system (CMS) would yield the most effective descriptive data for mapping to Dublin Core. However, this mapping process resulted in the loss of some nuances in MHS data, akin to the challenges faced with Reflections data.

The Minnesota Reflections project faced challenges in assigning unique identifiers for over 3,000 items, as many had duplicate identifiers, and none existed for compound objects. Although the Minnesota Historical Society (MHS) did not have duplicates, they discovered two different identification schemes in use. Deciding on the appropriate identifier and creating new ones became crucial steps in the workflow, as these identifiers are essential for future ingestion and retrieval within the system.

The primary challenge for the preservation project was the inability to preserve existing master images as they were. The team discovered that HT mandated the conversion of continuous tone TIFF images to lossless JPEG2000 format. Additionally, bitonal TIFF and JPEG images needed updated XMP data embedded within the binary files during subsequent stages.


Technology

As it turned out, each stage of the project made different demands on the MDL programmer and he ended up developing separate, though related, approaches for each.

1.2.1 Stage One: Simple Contone

The process of creating the METS packages for the contone images can be broken down into three distinct steps.

The first step involved collecting metadata for CONTENTdm hosted content, utilizing data obtained through an OAI harvest. The programmer developed the process in Java, leveraging an OCLC library that supports OAI-PMH v2.0. The harvested results were stored in a MySQL database, structured as rows consisting of identifier, field, and value attributes. The fields included Dublin Core elements and a "local_identifier" field, which facilitated the retrieval of images linked to each asset.
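The identifier/field/value storage scheme can be sketched as follows. This is a simplified illustration using Python and sqlite3 in place of the project's Java and MySQL; the sample record and identifier are invented.

```python
# Sketch of flattening a harvested Dublin Core record into
# (identifier, field, value) rows, as described above.
import sqlite3
import xml.etree.ElementTree as ET

def store_record(conn, identifier, dc_xml):
    """Insert one row per Dublin Core element of the harvested record."""
    for elem in ET.fromstring(dc_xml):
        field = elem.tag.split("}")[-1]  # strip namespace: "title", "creator", ...
        conn.execute(
            "INSERT INTO metadata (identifier, field, value) VALUES (?, ?, ?)",
            (identifier, field, elem.text or ""),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (identifier TEXT, field TEXT, value TEXT)")

# Invented sample record for illustration only.
record = (
    '<dc xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Grain elevator, Duluth</dc:title>"
    "<dc:identifier>mhs01234</dc:identifier>"
    "</dc>"
)
store_record(conn, "umn:reflections:0001", record)
```
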

The second step involved preparing image derivatives by utilizing the Image::ExifTool libraries to extract metadata from TIFF master files. This extracted metadata was then transformed into XMP data, which was subsequently attached to the compressed JPEG2000 files generated using the free Kakadu demo programs.
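Step two can be sketched by assembling command lines for the two external tools mentioned above. This is a hedged approximation: the project used Perl's Image::ExifTool library rather than the exiftool command, and the exact Kakadu and exiftool flags should be verified against the installed versions.

```python
# Illustrative command-line assembly for the derivative step; paths
# and flag choices are assumptions, not the project's actual scripts.

def jp2_command(tiff_path, jp2_path):
    # Creversible=yes selects Kakadu's reversible (lossless) wavelet path
    return ["kdu_compress", "-i", tiff_path, "-o", jp2_path, "Creversible=yes"]

def xmp_copy_command(tiff_path, jp2_path):
    # exiftool -tagsFromFile copies metadata from the TIFF master into
    # the derivative file
    return ["exiftool", "-tagsFromFile", tiff_path, "-all:all", jp2_path]
```

In practice these lists would be passed to `subprocess.run` for each master image.
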

The final step involved generating a METS file using the metadata from the MySQL database created in the first step. This file, along with the compressed image derivative from the second step, was placed into a directory, which was then zipped to create the final package.
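The packaging step can be sketched as follows; the METS skeleton and the file naming are illustrative stand-ins, not HT's actual package specification.

```python
# Sketch of step three: write a METS file next to the image derivative
# and zip the directory into a package. Names are invented.
import os
import tempfile
import zipfile

METS_SKELETON = '<mets xmlns="http://www.loc.gov/METS/" OBJID="{objid}"></mets>'

def build_package(objid, image_bytes, out_dir):
    pkg_dir = os.path.join(out_dir, objid)
    os.makedirs(pkg_dir, exist_ok=True)
    with open(os.path.join(pkg_dir, objid + ".jp2"), "wb") as f:
        f.write(image_bytes)
    with open(os.path.join(pkg_dir, objid + ".mets.xml"), "w") as f:
        f.write(METS_SKELETON.format(objid=objid))
    zip_path = pkg_dir + ".zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in os.listdir(pkg_dir):
            zf.write(os.path.join(pkg_dir, name), arcname=name)
    return zip_path

path = build_package("mdl0001", b"jp2-bytes", tempfile.mkdtemp())
```
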

When collecting data from various sources, the first step may require modification, especially if the metadata is stored in formats like Excel spreadsheets or Access databases. Although the conversion of this data into MySQL tables and rows will vary, the subsequent steps in the process should remain consistent.

1.2.2 Stage Two: Compound Objects

The database for compound objects utilized a parent-child relationship among individual images. Alongside the OAI-harvested metadata, it was necessary to extract data from CONTENTdm that detailed the relationships between the records forming a compound object.


The CONTENTdm PHP API was used to expose data that OCLC does not make available through OAI-PMH.
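Once the parent-child relationships are extracted, assembling compound objects amounts to grouping child records under their parents. A minimal sketch, with an invented row format:

```python
# Group (child_id, parent_id) rows of the kind derived from the
# CONTENTdm relationship data into compound objects; the row shape
# is an illustrative assumption.

def group_compounds(rows):
    """rows: iterable of (child_id, parent_id); returns {parent: [children]}."""
    compounds = {}
    for child, parent in rows:
        compounds.setdefault(parent, []).append(child)
    return compounds

# Hypothetical pages belonging to one compound object.
pages = [("p1", "book42"), ("p2", "book42"), ("p3", "book42")]
```
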

For stage two, the code for step two was unchanged from stage one, while step three varied because the METS file was significantly different, representing a compound object rather than a single record.

1.2.3 Stage Three: Mixed JPEG Images from MHS

The MHS data utilized the same database structure, with changes only in the data import process. Metadata was provided in a single XML file from the MHS EMu system, which was then parsed to insert data using the same identifier, field, and value structure.
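The import step for stage three can be sketched as parsing the XML export into the same (identifier, field, value) triples. The element names below are invented for illustration, since the real EMu schema is not reproduced in this report.

```python
# Parse a simplified EMu-style XML export into (identifier, field, value)
# triples; element names ("record", "irn", "TitMainTitle") are assumptions.
import xml.etree.ElementTree as ET

def parse_export(xml_text):
    triples = []
    for rec in ET.fromstring(xml_text).findall("record"):
        identifier = rec.findtext("irn")  # EMu's internal record number
        for elem in rec:
            if elem.tag != "irn":
                triples.append((identifier, elem.tag, elem.text or ""))
    return triples

# Invented one-record sample export.
sample = (
    "<export>"
    "<record><irn>10023</irn><TitMainTitle>Ox cart</TitMainTitle></record>"
    "</export>"
)
```
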

Step 2 here differed in that there was no need to create image derivatives; the master images were already lossy JPEGs, and HT agreed with MDL to leave them as they were rather than converting them to JP2s. However, the same code base and procedure were used to create the required XMP data to attach to each JPEG.

Step 3 differed slightly from the code for the stage two compound objects because the compound objects from MHS were not composed of individual records associated in a parent-child manner. For MHS, compound records were simply records with multiple images.

Each collection followed a consistent three-step process, although the specifics varied due to differences among the collections. Despite these variations, the underlying template remained unchanged, with the programmer estimating that approximately 80% of the code was consistent across stages. As the number of collections increases and recognizable patterns emerge, the programmer anticipates that these patterns can be abstracted, potentially leading to a unified set of programs and scripts for the second and third steps.

1.2.4 Catalog metadata

In the process of preparing the SIPs for HT, the programmer was tasked with capturing all Dublin Core descriptive metadata from the METS files and creating a distinct catalog metadata file for HT. This catalog metadata serves as the essential foundation for accessing the material at HT and allows for future reloading if MDL identifies necessary corrections.
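Pulling the Dublin Core fields back out of a METS file to build the catalog record might look like the following sketch; the METS layout shown is a simplified stand-in for the project's real packages.

```python
# Extract Dublin Core elements from a METS document into a simple
# field -> values mapping; the sample METS is a minimal illustration.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def catalog_record(mets_xml):
    fields = {}
    for elem in ET.fromstring(mets_xml).iter():
        if elem.tag.startswith(DC):
            fields.setdefault(elem.tag[len(DC):], []).append(elem.text or "")
    return fields

mets = (
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<mets:dmdSec><dc:title>Street scene</dc:title>"
    "<dc:date>1910</dc:date></mets:dmdSec>"
    "</mets:mets>"
)
```
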


Costs

The project costs were notably reduced as HT did not impose any charges on UMN for their involvement in the prototype initiative. Apart from the consulting fees for the project manager and preservation consultant, which are not covered in this report, the expenses primarily included minor technology costs, a significant commitment of staff resources from MDL (entirely UMN personnel), and additional time contributed by MHS.

The technology expenses were minimal, primarily consisting of the price of a 2Gb hard disk and the shipping costs incurred for sending it to and from Michigan multiple times. Additionally, the workstations and local storage utilized were largely ancillary to ongoing activities at the University of Minnesota (UMN).

During the final stage of the project, MHS staff were minimally involved, with requirements intentionally kept straightforward. Nonetheless, two staff members attended weekly developer meetings in the project's last three weeks and collaborated with the project manager to review content at MHS and select records for the prototype. Ultimately, MHS staff were responsible for preparing both metadata and binary exports from EMu, enabling MDL to create SIPs for HT.

The project's largest expense, apart from HT's commitments, was the staff time dedicated by MDL. Three UMN staff members consistently participated, serving dual roles as information producers by exporting data from Minnesota Reflections and as information aggregators and transformers by preparing SIPs for HT.

The programmer spent full time on this effort. Given his salary scale and the occasional encroachment of other duties, this can be estimated at roughly $2,000 per week. The initial four weeks were focused on extracting data from Minnesota Reflections into a database sandbox for producing SIPs. The programmer's remaining time, extending into December and January as HT requirements became clearer, is estimated at another 14 weeks dedicated to the aggregator and transformer role.

The metadata assistant spent approximately 100 hours at roughly $20 per hour on data remediation related to the extraction of data from Minnesota Reflections. This included addressing 3,000 duplicate identifiers unintentionally added to Minnesota Reflections and managing data export options for all 120 collections within Reflections. All of this time can be categorized as producer time.

The manager at UMN dedicated approximately 40 hours to remediation tasks, which can be categorized as producer time. Additionally, he spent another 30 hours supervising the programmer and engaging in developer meetings in his roles as aggregator and transformer. For prior MDL projects, he billed his time at around $70 per hour.

Role                 Time as Producer   Cost as Producer   Time as Aggregator   Cost as Aggregator
Programmer           672h               $8,000             2,352h               $28,000
Metadata Assistant   100h               $2,000             -                    -
Manager              40h                $2,800             30h                  $2,100
Total                                   $12,800                                 $30,100

The cost to MDL for its role as a producer of content to be archived was roughly $13,000. Much of this effort would also prepare MDL for including other CONTENTdm sites in the archiving process. Shortcomings in existing metadata practices, such as duplicate identifiers and the absence of compound object identifiers, contributed to the overall costs of the project.

MDL incurred approximately $30,000 for its role as a data aggregator and transformer, developing SIPs that comply with HT requirements. This investment has provided MDL with valuable experience with CONTENTdm systems. However, it is advisable to anticipate an additional cost of around $10,000 in programming time for the integration of any new system into the preservation process.

Given the roughly 49,000 images transferred before Christmas, the $42,900 in staff expenses at UMN cost about $0.86 per image transferred.
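The arithmetic behind these figures can be checked directly; the breakdown below is reconstructed from the rates and durations quoted earlier in this section.

```python
# Reconstructed cost arithmetic for the MDL staff-time figures.
producer_cost = 8000 + 2000 + 2800    # programmer (4 wks x $2,000) +
                                      # metadata assistant (100h x $20) +
                                      # manager (40h x $70)
aggregator_cost = 28000 + 2100        # programmer (14 wks x $2,000) +
                                      # manager (30h x $70)
total = producer_cost + aggregator_cost   # $42,900
per_image = total / 49000                 # just under $0.90 per image
```
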


Project Management

Shortly after the project commenced, the manager at UMN established a Basecamp instance at https://biomed.basecamphq.com/projects/5528974 to centralize team activities. The project made extensive use of Basecamp, which provides a comprehensive record of the team's efforts and collaboration.

The project manager organized weekly meetings with key staff from Minnesota and Michigan, alongside the digital preservation consultant, to discuss the progress of each project stage, address procedural concerns, and align the practices of MDL and HT.

The project manager organized monthly meetings with project sponsors to address developer concerns, project progress, and governance issues related to a comprehensive digital preservation initiative. These meetings were inclusive, allowing participation from other MDL and project staff.

The tools and meetings effectively supported the project, providing HT and MHS staff who joined later with a strong foundation to review previous work and access the ongoing discussions developed by the team in earlier weeks.

Basecamp posed some challenges; in particular, how new participants subscribe to ongoing conversations may not be immediately clear. However, it ultimately served as a valuable centralized hub for the team, consolidating messages, files, examples, and meeting notes effectively.

The team plans to continue holding weekly meetings and utilizing the Basecamp repository, even after the project manager's departure from the prototype phase.


Lessons

Technical Issues

The team encountered numerous technical challenges while transforming images and creating packages, leading to an increase in HT staff involved in discussions from one to six. Although these issues were sufficiently resolved to advance the prototype, they warrant further evaluation before proceeding with a long-term digital preservation initiative.

2.1.1 Master of illusion

The prototype effort revealed a crucial insight: the concept of a "master image" is more of an illusion than a reality. This realization is unsettling, especially since the master image is central to the preservation initiative. It raises an important question: what exactly should be preserved if the master image is not the definitive reference?

The master images from Minnesota were not eligible for preservation in HT for two main reasons: certain formats were not accepted by HT, and all images lacked essential metadata required by HT. Although some concessions were made, such as HT accepting the JPEG format, MDL ultimately had to remediate each master image before submission to HT.

The Minnesota Reflections master images include both continuous tone (contone) TIFFs and binary (black and white) TIFFs. While the binary TIFFs were in a format acceptable to HT, HT does not accept contone TIFFs because their uncompressed nature takes up unnecessary space. Consequently, HT required the transformation of these contone TIFFs into "lossless" JPEG2000 (JP2) images, ensuring no degradation in image quality. This process required MDL to create new JP2 master images for sharing with HT that MDL itself did not originally need.

HT mandates that all images in the archive contain essential descriptive and technical metadata embedded as XMP metadata within the image file. While generating XMP metadata is relatively straightforward, incorporating it alters the original image file, resulting in a modified version. Here again, the notion of a "master image" gives way to the practical requirements of the preservation archive.

The technical requirements for transforming master images prior to packaging them for preservation archiving present a considerable challenge, making it difficult for many organizations to engage in such projects. Consequently, they often need to rely on intermediary services equipped to handle these complex tasks effectively.

2.1.2 Paper cuts can draw blood

While HT accepted the pragmatic approach laid out at the launch of the project, new HT staff raised concerns about previous decisions as the project advanced, leading to numerous iterations between MDL and HT. Although each individual issue seemed minor, the cumulative effect became overwhelming, ultimately causing significant delays and hindering overall progress.

Some of this delay may be reduced in the future as MDL staff internalize more of the HT requirements and gain fluency in XML, METS, and PREMIS standards. However, the delays in prototype development indicate that ongoing collaboration between HT and MDL staff will be essential for long-term success. As MDL incorporates new data sources into the preservation archive, significant adjustments to scripts, tools, and specifications will be necessary, as evidenced during the prototype phase. It is unlikely that this process will become significantly simpler in the future.

2.1.3 The missing fix

The project required fixity checks to ensure the accuracy of copies made from masters, but not all masters had the necessary checksums. A fixity check enables a computer program to verify that it has received an accurate copy of an item. Since no human would review each image, the project faced the risk of errors during the transfer of images between locations and hard drives; each copying action presents an opportunity for the output image to differ from the original input. A fixity check is performed by the computer through a calculation on the image file's contents. If this calculated result matches a pre-recorded value known as a "checksum," the software can confirm that the file has been preserved intact.

A notable percentage of MHS binaries lacked checksums, leading us to proceed without assurance of image integrity when sending to HT. While a long-term project might reject this compromise, it is important to recognize that some partners may lack the expertise to generate the required checksums.
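The fixity check described above reduces to recomputing a digest over the file's bytes and comparing it with the recorded checksum. MD5 is shown as a common choice; the report does not specify which algorithm HT requires.

```python
# Minimal fixity check: the recorded checksum is computed when the
# master is created, then recomputed after every copy or transfer.
import hashlib

def verify_fixity(data: bytes, recorded_checksum: str) -> bool:
    """True if the bytes still match the checksum recorded earlier."""
    return hashlib.md5(data).hexdigest() == recorded_checksum

master = b"tiff bytes"                       # stand-in for a master image
checksum = hashlib.md5(master).hexdigest()   # recorded at creation time
```
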

2.1.4 Existing metadata is all over the map

Systems lacking DC data necessitate significant focus on data mapping. The project manager collaborated with MHS staff to create a mapping specification, which needed updating each time MHS exported data due to minor adjustments in the XML format. This process demands a certain level of metadata expertise and the ability to make swift decisions, often involving compromises that may be challenging for traditional catalogers.

Even systems designed for efficient data sharing, such as OAI, demand considerable staff effort to prepare for specific projects. Programmers must create scripts to filter relevant records, and in some instances settings must be adjusted to generate distinct data sets through OAI, then promptly reverted to avoid disrupting other data harvesting processes.

Digital storage systems rely on unique identifiers for objects, a practice that predates digital technology with the use of call numbers. This project necessitated the creation of a unique "namespace" for identifiers assigned to HT, requiring collaboration with the partner institution to potentially modify identifiers to fulfill HT's specifications. Additionally, the quality of the identifiers provided by MDL was inadequate, as they lacked a check-digit mechanism, which is essential for ensuring accuracy and integrity.
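A check digit lets software detect mistyped or corrupted identifiers before they propagate. The weighted mod-11 scheme below is one common approach, shown purely for illustration; it is not a scheme mandated by HT or MDL:

```python
def add_check_digit(identifier):
    """Append a weighted mod-11 check digit computed over the digits of
    the identifier (one common scheme, illustrative only)."""
    digits = [int(c) for c in identifier if c.isdigit()]
    total = sum(weight * d for weight, d in enumerate(digits, start=2))
    check = total % 11
    return identifier + ("X" if check == 10 else str(check))

def is_valid(identifier_with_check):
    """Recompute the check digit and compare it with the last character."""
    body, check = identifier_with_check[:-1], identifier_with_check[-1]
    return add_check_digit(body)[-1] == check
```

Because the weights differ per position, the scheme catches both single-digit errors and adjacent transpositions, the two most common transcription mistakes.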

MDL-HT Image Preservation Prototype Report / 2010-12-31 / Technical Issues 24

Effectively managing institutional identifiers necessitates continuous attention and meticulous documentation. Our experience with the Minnesota Reflections dataset revealed that even straightforward tasks, like maintaining unique identifiers, were poorly managed, leading to thousands of duplicates. This issue has been addressed during the current project, highlighting the importance of a sustained commitment to preserving archives with enhanced diligence.

Archival Issues

During the pilot project, several archival issues emerged, prompting participants to adopt a pragmatic approach due to tight deadlines. Consequently, both MDL and HT made concessions to address these issues, resulting in temporary solutions that may not be sustainable in the long term. It is essential to revisit and thoroughly evaluate these decisions before MDL embarks on a full program.

This section addresses key archival issues that have emerged and emphasizes the need for additional efforts to resolve these challenges before launching a comprehensive preservation program in the future.

2.2.1 Format requirements

Before launching the pilot project, MDL and MHS planned to preserve their master images, primarily in TIFF format, using HT. At present, HT supports a limited selection of image formats, including JP2, JPEG, and binary-only TIFF.

Due to HT's exclusive acceptance of binary TIFFs, MDL and MHS have decided to convert their master TIFF images into JP2 format for ingestion and preservation. This conversion has prompted staff from both institutions to question the implications of archiving and preserving items that differ from their local master copies.

HT's restrictions around format type are intended to ensure that HT can properly manage and provide access to the materials it commits to preserve.

HT has plans to provide preservation for a wider range of format types in the future, including non-binary TIFFs, but is not yet prepared to do so.

MDL has converted its master images into the JP2 format to comply with HT's current format type restrictions. If MDL needs to recover the original master images from HT in the future, it will have to reverse this conversion. This additional step in SIP preparation introduces risk: undergoing two conversions, from TIFF to JP2 and back to TIFF, may compromise the integrity of the MDL content and make it difficult for MDL to keep track of the authenticity and accuracy of these original files over time.

Many of MDL's partners, along with various Minnesota institutions requiring digital preservation services, lack the necessary staff time and expertise for significant material changes, such as conversion operations for content ingestion. Without infrastructure from MDL to facilitate these conversions, potential participants may find the format restrictions too challenging to overcome at this time.

MDL is exploring a long-term partnership with HT, which necessitates further negotiations, particularly regarding HT's acceptance of a wider range of formats. If HT does not accommodate this, MDL must evaluate the format requirements of its members to ensure they are adequately addressed. Additionally, MDL should assess the practical and legal implications of managing file conversions for MN participants in this long-term project.

HT has documented its requirements in the HathiTrust Digital Object Guidelines and on the "Getting Content into HathiTrust" page of its website.

2.2.2 Metadata requirements

At the start of the project, the team familiarized themselves with the HT Guidelines and planned to convert existing Dublin Core metadata records into METS, enhancing them with PREMIS for descriptive metadata. MDL aimed to extract and embed technical metadata (XMP) within the objects and METS records, while MHS intended to map its descriptive metadata from the EMu content management system to Dublin Core, following a similar process to MDL's. The project team expected these tasks to be completed within the first month, starting with MDL collections before moving to MHS content. However, the metadata requirements set by HathiTrust proved to be more demanding than initially anticipated, and the metadata creation process has taken the majority of the project period to accomplish.

HT emphasizes the importance of consistency across SIPs from various Producers to streamline its processes and avoid the complexities of maintaining multiple metadata types. By minimizing the need to map SIP metadata to AIP metadata, HT aims to simplify its operations. Additionally, HT is focused on ensuring that the materials it preserves are accessible through its framework in a standardized manner. Specific metadata is essential for activating access features within HT's system, such as making items searchable, eligible for collection inclusion, and displayable. Without clear metadata indicating the nature of an item, such as whether it is a photo, HT's application cannot effectively provide the necessary search and viewing options.

MDL has raised concerns and sought revisions regarding certain recommendations and requirements due to its role as a content curator. This pilot project enables MDL to take custody of statewide content, emphasizing the importance of a robust preservation solution. To effectively manage this, MDL must create and sustain a data management workflow tailored to its internal needs, rather than being overly reliant on the state's current preservation solutions. The practices established by MDL should provide a stable foundation that can adapt to various preservation solutions, ensuring flexibility to meet the evolving needs of the state over time.

The varying viewpoints of MDL and HT on metadata have led to differing project intentions throughout the duration of the pilot project. While some of these differences have been addressed, others will need to be reconsidered before establishing a long-term program.

MDL has made substantial progress in aligning with HT's requirements by first assessing HT's needs alongside the current MDL metadata. In October, MDL created an initial mapping of its Dublin Core (DC) metadata to the Metadata Encoding and Transmission Standard (METS) and outlined a series of Preservation Metadata: Implementation Strategies (PREMIS) events to document three critical preservation actions: the digitization of the original item from institution X by MDL; the conversion of the item from TIFF to JP2 format; and the packaging of the JP2 along with its METS file for delivery to HT. Recording these events is crucial for establishing a clear chain of custody for the item, tracking its journey from creation to ingestion at HT.

The MDL-METS profile was assessed by HT and revised through several rounds of feedback. Central to the discussions and compromises was a differing viewpoint: HT aimed to standardize the METS SIP profile to closely resemble their AIP profile for improved record-keeping, while MDL sought to ensure that the METS SIP profile sufficiently addressed their local requirements.

MDL identified three key PREMIS events for recording: "capture" for the initial digitization, "conversion" for changing the master file from TIFF to JP2 for HT's SIP, and "packaging" for creating the zip file of JP2 and METS. HT requested the use of its existing AIP taxonomy to avoid conflicts, particularly with the term "dissemination," which was already in use. A compromise was reached, allowing MDL to define its own SIP metadata terms while ensuring they did not clash with HT's AIP taxonomy; for instance, "disseminate" was changed to "packaging." HT agreed to map the SIP metadata to the AIP metadata, resulting in the need for additional documentation but enabling MDL to choose relevant terms that can be used across various preservation services in the future.
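An individual PREMIS event of the kind agreed upon, e.g. "packaging", can be serialized as a small XML element. The sketch below uses the PREMIS v2 namespace and only a few of the elements a real MDL-METS profile would carry (a full event also requires an eventIdentifier, omitted here):

```python
import xml.etree.ElementTree as ET

# PREMIS version 2 namespace URI.
PREMIS_NS = "info:lc/xmlns/premis-v2"

def premis_event(event_type, date_time, detail):
    """Build a minimal PREMIS event element recording one preservation
    action (illustrative; not the full MDL-METS profile)."""
    ET.register_namespace("premis", PREMIS_NS)
    event = ET.Element("{%s}event" % PREMIS_NS)
    ET.SubElement(event, "{%s}eventType" % PREMIS_NS).text = event_type
    ET.SubElement(event, "{%s}eventDateTime" % PREMIS_NS).text = date_time
    ET.SubElement(event, "{%s}eventDetail" % PREMIS_NS).text = detail
    return event
```

Emitting one such element per action ("capture", "conversion", "packaging") is what produces the machine-readable chain of custody described above.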

Governance Issues

MDL has successfully coordinated and hosted the Minnesota Reflections program, collaborating informally with Minnesota-based institutions. As part of Minitex, which operates under the University of Minnesota Libraries, the Minnesota Reflections repository is available as a free service for institutions seeking to join a centrally hosted repository infrastructure. To participate, institutions respond to a Call for Proposals (CFP) to suggest content for digitization or contribution. By contributing, institutions agree to a document that permits the use of their project images for non-profit educational purposes. As outlined in the MDLC Policy on Digital Rights and Ownership, the contributor retains ownership of content, and the MDLC Reflections program provides access to this content "for non-commercial, personal, or research use only."

Minnesota Reflections operates as an informal program with a two-tier governance structure, consisting of Steering and Management Committees, alongside an Outreach Coordinator, Marian Rengel. The program relies on volunteer committee members, primarily from universities involved in digitization efforts, but lacks formal documentation outlining the roles, responsibilities, and membership criteria for these committees. Despite the absence of a structured governance framework, Minnesota Reflections has successfully flourished under this informal model.

MDL is launching a new program focused on preservation services for a Minnesota-based audience. During this pilot phase, MDL is assessing the governance structure and documentation required for sustainable long-term preservation efforts. The Sponsors Group has recognized the need to establish a more formal governance framework, policies, and documentation compared to the previous Minnesota Reflections project. One stakeholder noted that the current informal governance is adequate for their limited mission and annual grants, but as they expand their ambitions, a more structured approach is essential.

A more formal governance and reporting framework is needed to address the increasing demand for services and the need for a larger budget, along with its associated implications.

Initial discussions with MDL stakeholders have highlighted critical issues and requirements for the new program. Specifically, under the "services" category, internal deliberation is necessary to address questions related to the state's digital preservation needs, such as metadata, SIP preparation, accepted formats, and the distinction between open and restricted materials. Once the state's needs are clearly defined and a governance model is established, negotiations on smaller components with HT can proceed.

I Definition of mission and governance structure

B Who is the legal sponsor and host (Minitex?)

C What is the governance model?

2 Who does work? (including documentation)

3 If there is a board or working group, how are positions allocated?
a) Term length?
b) Rolling vs rotating?
c) Number of representatives?
d) Voting policies?

4 How are members’ voices represented?

A Will the solution be available to both closed and open access content (as institutions want to preserve content that they either cannot or do not want to make open access)?


B What is the relationship between the preservation service and the Minnesota Reflections repository?

C How much metadata will be required and who can invest the time to write the scripts to make this conform to a SIP spec (HathiTrust or otherwise)?

D What resolution of images will be displayed via the preservation solution?

E What are Minnesota’s statewide needs in terms of formats, and what formats will the solution address (HathiTrust or otherwise)?

F Define long-term rights of access. What happens if MDL requests to have an image or a collection removed from HT's access portal? What happens if MDL and HT part ways at some point in the future?

III Definition of roles and responsibilities of sponsor and participants or members

A What are the legal implications for the sponsor or host and for the participants or members?

B Possible movement to a membership model (as differentiated from the Reflections project's current participant model)

C Determine and document representation opportunities for participants or members in the governance

D Account for differences between participants or members (e.g., perhaps weighted votes to account for differences between the large institutions that run the service and the small institutions that contribute content for preservation)

E Account for the role that University of Minnesota will play as the conduit to HathiTrust if the state uses HathiTrust as its preservation solution

F Ensure that the governance structure will provide stability in the event of administrative changes at any of the participating or member institutions

G Define length of terms, withdrawal policies, replacement of committee members

IV Definition of financial structure and long-term plans


A Ensure that the financial structure will provide stability beyond grant funding and limited-term awards

B Consider potential charges to participants or members for ongoing services to ensure no interruption of preservation services

C Think through issues of free-ridership

D Think through the roles played by institutions and what happens if any of the participants that play central roles in SIP preparation, etc., were to drop out


Alternatives

Research institution-based external service providers

3.1.1 Chronopolis (text by Katherine Skinner and David Minor)

Established in 2007 with support from the Library of Congress's NDIIPP program, Chronopolis is a digital preservation data grid framework. It has been developed collaboratively by the San Diego Supercomputer Center (SDSC) at UC San Diego, the UC San Diego Libraries, the National Center for Atmospheric Research (NCAR), and the University of Maryland's Institute for Advanced Computer Studies (UMIACS). Chronopolis aims to enhance long-term preservation through cross-domain collection sharing. By utilizing high-speed educational and research networks along with significant investments in mass-scale storage infrastructure, the partnership harnesses the data storage capabilities of SDSC, NCAR, and UMIACS. The initiative focuses on creating a preservation data grid that prioritizes heterogeneous and highly redundant data storage systems.

Chronopolis members operate nodes with a minimum storage capacity of 100 TB for digital collections, utilizing a methodology that ensures at least three geographically distributed copies of data. This approach facilitates curatorial audit reporting and access for preservation clients. The core technology behind data management in Chronopolis is the Integrated Rule-Oriented Data System (iRODS), middleware that supports robust data management. Additionally, the partnership is dedicated to establishing best practices for the global preservation community, focusing on data packaging and transmission across diverse digital archive systems.

Chronopolis is now providing fee-based preservation services to organizations seeking a robust preservation environment. This service is readily accessible for institutions that require mature preservation solutions without the need to establish their own infrastructure.

To ingest data into Chronopolis, files must be transferred using either a hard drive or a network method, such as BagIt. The Producer is responsible for establishing the Submission Information Package (SIP) structure, while Chronopolis oversees the management and regular auditing of the data. Upon request, Chronopolis can provide preservation copies to the Producer, maintaining the original structure.
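A BagIt transfer of the kind mentioned above pairs a payload directory with a checksum manifest. The following is a simplified sketch of that layout (it omits tag manifests and bag-info metadata that a complete bag would include):

```python
import hashlib
import os

def make_bag(bag_dir, payload_files):
    """Lay out a minimal BagIt bag: a bag declaration, a data/ payload
    directory, and an MD5 payload manifest (simplified sketch)."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    manifest_lines = []
    for name, content in payload_files.items():
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append("%s  data/%s" % (digest, name))
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w") as f:
        f.write("\n".join(manifest_lines) + "\n")
```

Because the manifest travels with the payload, the receiving repository can verify every file on arrival without any side channel for checksums.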

Chronopolis is open to all file formats.


Pricing for the Chronopolis preservation service is available from David Minor, david@sdsc.edu.


Open Source solutions for internal service hosting

3.2.1 DAITSS (text by Katherine Skinner and Priscilla Caplan)

DAITSS2 is an open-source software solution set to launch in the first quarter of 2011, developed by the Florida Digital Archive (FDA) under Priscilla Caplan's leadership. Since 2006, this software has been used within the Florida university system, and the FDA aims to assist other states and collaboratives in implementing DAITSS in 2011.

The FDA currently funds the DAITSS program with two key positions: a full-time manager and an operations technician. The annual cost to maintain this program is estimated at under $130,000, and it has successfully preserved 63 TB of content to date.

The FDA operates as a "dark archive," lacking public access and serving solely for long-term preservation purposes. Participating university libraries within the state system use a variety of applications to manage their institutional repositories, digital libraries, and digital asset management systems, including ETD systems. When a resource is chosen for preservation, it must be submitted to the FDA in a specific format. The FDA ensures that a bit-wise identical copy of the original resource will be returned to the library upon request. Furthermore, if all files are in supported formats, the FDA provides a version that can be rendered with current tools at the time of the request.

To archive materials in the FDA, a library must negotiate a binding Agreement with FCLA, which defines the responsibilities, liabilities, and warranties of both parties. This Agreement is signed by the FCLA director and either the library director or university counsel on behalf of the university board of trustees, establishing the library as an affiliate of the FDA. The term "affiliate" highlights the partnership between the FDA and libraries, with library deans acting as the governing board of the FDA. Responsibility for long-term preservation is collaboratively shared between the FDA and its affiliates.


In this model of shared responsibility, the affiliate is responsible for

• selecting content to be archived;

• securing rights to archive and preserve the content;

• describing the content adequately for its own purposes;

• submitting packages in the format required by the FDA;

• maintaining local records of what it archived;

• withdrawing content that should no longer be archived;

• providing access to disseminated content.

The FDA is responsible for

• accounting for every package submitted with an Ingest or Rejection report;

• providing useful counts and reporting information on ingested materials;

• implementing preservation strategies as described in the FDA policy guide;

• preserving original files exactly as submitted, with demonstrated integrity, viability and authenticity;

• providing a renderable version of all supported formats;

• attempting to achieve and maintain certification as a trustworthy repository.

Brief description of technical achievements

The DAITSS application, developed specifically for the FDA, is locally written software that adheres to the OAIS reference model and employs active preservation strategies through format transformation. It aims to identify and describe all files, creating normalized or migrated versions for any supported format whenever feasible and beneficial.

Some of the hallmarks of DAITSS are:

• The application does preservation and nothing else, and as such must function as a "back end" to other systems for acquisition and user access.

• It depends heavily on well-known standards, including OAIS, METS and PREMIS.


• Archived content is preserved along with its complete metadata, while some of this metadata is also duplicated in a database for quick retrieval. This ensures that the archival store can be understood and accessed even without the specific application.

• All format-based processing, including migration and normalization, is done inside the system, which keeps a rigorous record of digital provenance.

• Format-based processing occurs during the ingestion phase and is an integral part of the "refresh" process, which ensures that packages are updated before dissemination so that they contain the most current information.

The LOCKSS (Lots of Copies Keep Stuff Safe) open source software, created by the Stanford University Libraries team in the late 1990s, is designed for content preservation. Today, there are 11 active Private LOCKSS Networks globally, dedicated to safeguarding digital materials.

A Private LOCKSS Network consists of a closed group of geographically distributed servers that utilize the open source LOCKSS software to connect with each other and the websites hosting preserved content. The software is format agnostic, allowing it to preserve and monitor a wide variety of content formats.

In a Private LOCKSS Network (PLN), all servers share equal rights and responsibilities, eliminating the need for a lead server with special privileges. Once operational, a server can function independently even if it loses connection with others, making the network highly resilient to failures. If one server fails, others can take over its duties, and if a server is compromised, any remaining server can assist in repairing its copies. This equality among servers allows for a truly distributed approach to maintaining the preservation network, highlighting the significant advantages of a decentralized preservation strategy.

The LOCKSS servers of a PLN perform an ongoing set of preservation-oriented functions:

• They ingest submitted content and store it on their local disks


• They conduct polls, comparing all cached copies of content to arrive at regular consensus network-wide on the authenticity and accuracy of content

• They repair any content that is deemed corrupt through the network polling process

• They retrieve content from its original source, provided it remains accessible, to identify any new or updated information, while also maintaining a record of any changes made alongside the original version

• They retrieve and provide a copy of the content to authorized recipients

• They provide information about their stored content for auditing and reporting purposes

• They migrate out-of-date file format types to specified formats, storing both the original and the migrated file in a versioned manner

PLNs vary in size, accommodating groups from seven to 24 distributed servers, and have successfully scaled both organizationally and technically; the largest PLN has a capacity of 270 TB. Operating primarily as "dark" archives, PLNs restrict content access to authorized representatives from the respective Producer sites.

PLNs such as COPPUL, ADPNet, and Synergies are cost-effective to operate, often relying on volunteer leadership rather than centralized staffing. Additionally, the essential technical infrastructure is efficiently managed by the LOCKSS team at Stanford University.

Various organizations are developing tools that enhance the curatorial and auditing capabilities of Private LOCKSS Networks. Notable examples include MetaArchive, which created the Conspectus in 2004 for metadata capture and content monitoring and is currently developing web services such as format validation for a 2011 release; Data-PASS, which is creating an audit framework (to launch in 2011); and LuKII (bridging the leading German open source data management package with the LOCKSS software).

Documentation on how to run a PLN (both in terms of governance and the technical structure) abounds; see, for example, A Guide to Distributed Digital Preservation (eds. Katherine Skinner and Matt Schultz; Educopia Institute: 2010) and a wide range of articles (for clusters, see Library Trends Volume 57, Number 3, Winter 2009; and Library Hi Tech Volume 28, issue 2, 2010).


Collaborative solutions for community-based preservation services

The MetaArchive Cooperative is an international preservation network formed by research institutions in 2004 under the Library of Congress's National Digital Information Infrastructure and Preservation Program (NDIIPP). This model emphasizes collaborative responsibility, shared expertise, and collective cyberinfrastructure, empowering libraries, archives, centers, and museums to achieve their preservation objectives as a distributed community. Members leverage their combined strengths to tackle preservation challenges by fostering knowledge and infrastructure within their local institutions, rather than outsourcing or centralizing operations.

MetaArchive offers open-source technologies and curation services that guarantee the long-term accessibility of authentic digital content. Its membership collectively preserves a diverse array of digital assets, including electronic theses and dissertations (ETDs), newspapers, journals, and various archival materials such as video, audio, and images. Additionally, it encompasses digital creations from fields like the digital humanities, social sciences, and sciences, including datasets, databases, and other essential resources.

MetaArchive is a cooperative membership organization with three membership categories: Sustaining Members, Preservation Members, and Collaborative Members. Each member runs a server for the MetaArchive network and prepares its own content for ingest in consultation with central staff.

• Sustaining Members play a crucial role in the cooperative by making significant financial and technical commitments, thus providing essential leadership in the field of distributed digital preservation. Each member is entitled to one voting representative on the MetaArchive Steering Committee, contributing valuable insights and guidance for the cooperative's future development, including the creation of innovative data curation and reporting tools. Sustaining Members pay an annual fee of $5,500.


• Preservation Members are institutions that preserve content in the Cooperative and support its overall infrastructure through running and maintaining a network server. Preservation Members pay an annual membership fee of

• Collaborative Members consist of institutions that operate a centralized repository, preserving shared content within the MetaArchive network. They contribute to the Cooperative's infrastructure by managing and maintaining one of the network's servers. Membership requires an annual fee of $2,500, plus an additional $100 for each participating institution.

So long as MDL runs a central server through which the state's collections are processed for ingest, MDL would qualify to be a Collaborative Member of the Cooperative.

The Cooperative is a streamlined organization dedicated to enhancing infrastructure within its member institutions rather than operating through a central service agency. The majority of its activities, such as SIP preparation, ingest processes, AIP monitoring, and the provision of DIPs when needed, are carried out by the member organizations themselves.

MetaArchive operates under the guidance of a Steering Committee, which includes one voting representative from each Sustaining Member and additional representatives from the Preservation and Collaborative Member categories. Leadership is established through nominations and a simple majority vote within the Committee. The Steering Committee oversees the Cooperative's activities via weekly phone calls and an annual meeting hosted at a member site.

MetaArchive’s work is also guided by three committees, the Content Committee, Preservation Committee, and Technical Committee

• The Content Committee plays a crucial role in organizing and documenting content selection practices and guidelines for MetaArchive SIP preparation. Additionally, it recommends the prioritization of new subject- and genre-based archives, such as the ETD Archive, Newspaper Archive, and Southern Digital Culture Archive, to enhance the preservation network.

• The Preservation Committee is responsible for researching, developing, documenting, and disseminating policies, procedures, and evaluative means for enhancing the MetaArchive Cooperative's practice of trustworthy distributed digital preservation.

• The Technical Committee is responsible for developing and maintaining technical specifications and coming to agreements on hardware, software, and networking protocols; overall server architecture; application development; and software maintenance.

The Cooperative is currently staffed by 3.5 positions: a Program Manager, a Collaborative Services Librarian, a Systems Administrator, and a Software Engineer. The Program Manager is responsible for the daily operations of the MetaArchive. The Collaborative Services Librarian provides training for new members and supports all members in managing network servers, preparing content for ingest, and monitoring it. The Systems Administrator oversees core network functions, monitors content, and trains members in server management. The Software Engineer assists members with SIP development, ensuring compliance with MetaArchive data management guidelines.

The Cooperative emphasizes the importance of local staff involvement in preservation activities to foster a knowledgeable preservation community. To achieve this, it utilizes a distributed staffing model where each member institution operates an autonomous server, preventing a single point of failure in the network. Members collaborate with central staff to prepare content for ingest and actively participate in Committee assignments, which are essential to the Cooperative's success both practically and philosophically.

MetaArchive's technical model emphasizes the importance of distributed digital preservation, asserting that research institutions must take charge of managing their digital collections. By adopting collaborative and distributed long-term preservation strategies, these institutions can reap significant benefits, enhancing the sustainability and accessibility of their digital assets.


Effective preservation strategies have historically relied on distributing content copies across secure, diverse locations over time. Digital content faces threats similar to those of previous eras, including natural disasters, intentional attacks, and accidental destruction. While there are unique challenges in managing digital content for long-term preservation, these can be effectively addressed within a distributed network environment.

Implementing a robust digital preservation strategy necessitates investing in a distributed array of servers that can efficiently manage and store digital collections. Individual research institutions typically lack the resources to operate multiple secure and geographically dispersed servers. However, MetaArchive provides a solution by allowing these institutions to leverage shared network capacity through collaborative community efforts. Our network is built on the open-source LOCKSS software developed at Stanford University Libraries, and we are enhancing it with additional data management tools to achieve comprehensive preservation goals.

Our distributed network safeguards each member's content by storing it at a minimum of six geographically diverse locations across two continents. These replicated copies serve not only as backups against potential data loss but are also routinely compared to maintain consistent data integrity across all versions at all times.

All content is ingested through HTTP, utilizing either an accessible URL (which may be open or secure) or by temporarily mounting it on a staging server for collections that are not consistently online. This ingest pathway is compatible with various repository infrastructures, including DSpace and ETD-db.

Appendices

Governance Models for Collaborative Preservation

State-based and collaborative preservation repositories are currently utilizing various governance models, with three notable examples relevant to Minnesota: the Florida Digital Archive managed by the Florida Center for Library Automation, the Alabama Digital Preservation Network (ADPNet), and the MetaArchive Cooperative. Each of these governance models offers insights that could inform the governance plan for the Prairie State Repository project.

1. FCLA-FDA (http://www.fcla.edu/digitalArchive/daInfo.htm)

The Florida Digital Archive aims to offer a cost-effective, long-term solution for preserving digital materials that support teaching, learning, scholarship, and research throughout Florida. It ensures that all files deposited by its affiliates remain accessible, unaltered, and readable over time. For materials requiring full preservation treatment, the archive commits to maintaining a usable version by utilizing the best available format migration tools.

Host: Florida Center for Library Automation, a system-wide center of the state universities attached to the University of Florida for administrative purposes

The FCLA Advisory Board comprises directors from 11 public university libraries, a representative from the state university system, a representative from the division of community colleges, and the state librarian of Florida. Additionally, the FDA staff includes an assistant director, technical staff members, and an administrative assistant, all contributing to the effective management and support of library services in Florida.

Public universities in the state university system, PALMM partners (institutions partnering with FL state university libraries), and others as approved on a case-by-case basis by the Board

Our services encompass the ingestion of Submission Information Packages (SIPs), secure storage and management of Archival Information Packages (AIPs), and comprehensive preservation through normalization and migration for supported formats. We also facilitate dissemination and withdrawal processes, which include updating information in previously submitted packages; this requires the withdrawal of existing AIPs and the submission of new SIPs for ingestion. Additionally, we provide detailed reporting to ensure transparency and accountability in our archival processes.

Shared between FDA and Affiliates

Affiliates must negotiate an agreement with the FDA, detailing contact persons and providing a comprehensive description of all materials to be archived, along with the desired preservation levels. They are responsible for selecting appropriate content for archiving and ensuring that sufficient descriptive metadata is maintained locally. Additionally, affiliates must secure all necessary rights to deposit materials and grant the FDA the required permissions.

Affiliates must also submit content in accordance with FDA SIP specifications. Each SIP must include a valid METS document as a descriptor, reference all files for archiving, and be organized in a single folder with a name limited to 32 characters. The SIP descriptor must match the package name, and SIPs should not be bundled, compressed, or digitally signed. Additionally, affiliates must maintain records of archived content, review reports, collaborate with FDA staff to address any issues, notify the FDA if archiving is no longer necessary, and request dissemination when required.

The FDA is responsible for ingesting and storing materials in accordance with the Affiliate’s Agreement, while restricting the authorization to submit, withdraw, disseminate, or receive reports to designated individuals. It must provide detailed information regarding ingestion or errors for every SIP and preserve original files exactly as submitted, ensuring their integrity, viability, and authenticity. Appropriate preservation strategies must be employed for supported file formats to maintain a renderable version. Upon request, the FDA provides Dissemination Information Packages (DIPs), which always include the original files and may also contain modified versions that have been normalized or migrated. Additionally, the FDA is required to generate appropriate reports for Affiliates and to achieve and maintain certification as a Trusted Digital Repository (TDR).

Actual costs, staffing (around $150K/year)

Currently, there is no fee for usage; however, this may change in the future. Any potential billing will be implemented with the approval of the FCLA Advisory Board, providing Affiliates with a notice period of 180 days or more.

2. MetaArchive Cooperative (http://metaarchive.org)

The MetaArchive Cooperative aims to enhance the understanding of distributed digital preservation techniques while establishing reliable and long-lasting "dark archives" of digital materials across various locations. These archives serve as a safeguard, enabling contributing organizations to restore their collections when needed.

Steering Committee: one rep from every Sustaining Member plus an elected Chair

Steering Committee meetings are also attended by one non-voting rep for each member category. Staff: Program Director, Collaborative Services Librarian, Systems Administrator, .5 FTE Software Engineer.


Any digital memory organization or collaborative; approved by the Steering Committee on a case-by-case basis

Our services offer comprehensive assistance to members in developing effective preservation policies and strategies. We provide expert advice and support during content preparation for ingest, ensuring thorough testing and quality assurance. Our team aids members in establishing their MetaArchive server and facilitates the ongoing ingest of new or modified content, all managed within a robust versioning system. We ensure secure storage and management of data across a distributed network, maintaining at least seven copies of each file in diverse geographical locations. Additionally, we handle full preservation processes, including normalization and migration for supported formats as needed, and manage dissemination and reporting tasks when required.

Shared between MetaArchive and Members

Members are required to maintain good standing by adhering to the Charter's governance processes and technical standards, while also covering their own costs associated with Cooperative participation, such as membership fees and travel expenses. They must implement copyright standards to comply with applicable laws and ensure that contributed content does not infringe on others' rights, holding the necessary licenses for use within a multi-site preservation strategy. Members agree to indemnify the Cooperative against any claims of infringement or related losses, waiving rights to recover costs associated with such claims. Additionally, they must remedy any material breaches of contract within 90 days, unless an extension is granted, and are responsible for establishing and maintaining a preservation site with a MetaArchive-LOCKSS cache, which should also be available for testing new software.

MetaArchive developments as needed; • Participate actively in the MetaArchive Preservation Network by contributing, ingesting, and monitoring content from the Cooperative;

To effectively contribute to the LOCKSS Alliance and enhance community strength, it is essential to maintain good-standing membership. Implementing LOCKSS software that meets all current and future Private LOCKSS Network requirements is crucial. Additionally, designing system features that comply with Cooperative security standards and ensuring content validation through integrity checks and metadata analysis is necessary. Participation in the Cooperative may also require the installation and maintenance of additional software, along with providing essential technical and administrative contact information to facilitate communication among participants.

MetaArchive offers essential services to its members, including content retrieval in the event of catastrophic loss, access to a web-based tool called Conspectus for documenting submitted collections, and distributed archiving across multiple preservation sites. Members benefit from detailed reports and statistical insights about their collections and the Preservation Network as a whole. The Cooperative provides opportunities for collaboration within working groups, discounted attendance at conferences and workshops, and access to experienced digital preservation professionals. Members can purchase additional storage and preservation services, including consulting and training. Technical support is available for the installation and maintenance of LOCKSS software and other enhancements, along with documentation of processes and technical standards. In catastrophic situations, members can request technical and financial assistance for restoring their preservation sites. Overall, MetaArchive ensures access to a wealth of technical knowledge and support to help members maintain compliance with preservation standards.

Actual costs, staffing (around $300K/year)

The membership structure includes three tiers: Sustaining Members at $5,500 annually plus a storage fee of $1 per GB, Preservation Members at $3,000 per year with the same storage fee, and Collaborative Members at $2,500 per year, an additional $100 per participating institution, along with a $1 per GB storage fee.
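The tiered fee arithmetic above can be expressed as a small calculator. This is our own illustration; the dollar figures are the ones quoted in this report, and the function name is hypothetical.

```python
def annual_fee(tier: str, storage_gb: int, institutions: int = 1) -> int:
    """Estimate a member's annual MetaArchive cost in USD (illustrative;
    figures as quoted in this report)."""
    base = {"sustaining": 5500, "preservation": 3000, "collaborative": 2500}[tier]
    fee = base + 1 * storage_gb        # $1/GB storage fee applies to all tiers
    if tier == "collaborative":
        fee += 100 * institutions      # plus $100 per participating institution
    return fee

# A Sustaining Member storing 500 GB: 5500 + 500 = 6000
print(annual_fee("sustaining", 500))
```

For example, a Collaborative group of three institutions storing 100 GB would pay $2,500 + $300 + $100 = $2,900 per year under this reading of the fee schedule.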

All members run one server for the network ($5K investment locally at the beginning of each 3-year membership term)

3. ADPNet (http://www.adpnet.org)


ADPNet is dedicated to establishing and maintaining a reliable, low-cost "dark archive" for the long-term preservation of digital resources created in Alabama. The organization aims to enhance awareness of distributed digital preservation techniques within the state and to develop a stable, geographically diverse dark archive of digital content. This archive will serve as a vital resource for restoring collections at member institutions when needed.

Host: Network of Alabama Academic Libraries (NAAL) at the Alabama Commission on Higher Education

Steering Committee: comprised of one voting rep appointed by each member. Term of service is 1 year; reps may be reappointed.

Staff: Chair (elected to 1-year terms), administrator, LOCKSS staff at Stanford

MDL-HT Hot Spots

This document identifies key "hot spots" for discussion among MDL and its partners. While the actual discussions are not detailed here, it provides an overview of each issue, its current status, and a concise note on its resolution.

If you think an issue should be included or updated, either leave a comment below, or edit the document yourself following the model set out in existing entries.

These issues are still under active discussion.

MDL members may feel betrayed if their digital masters, being transferred to HathiTrust for preservation, are shared publicly, while HathiTrust advocates for sharing public domain content. Therefore, it is essential to develop a technical and policy-based mechanism that restricts the distribution of MDL digital masters exclusively to MDL, ensuring members' trust and the integrity of their materials.

JTB inquires whether MDL will connect an open access policy for digital masters to participation in collaborative efforts, or if it pertains to open access for public domain materials Additionally, the question arises about whether initiatives like Minnesota Reflections could evolve into an open access repository.

Status: Discussed without resolution at 100922 and 101027 sponsors meetings.

{B} Restricted resolution for display of digital images.

MDL members have scanned images at high resolutions but plan to upload only lower resolution versions online. The question arises: would HathiTrust agree to restrict the resolution of derivative images shared on the web from MDL contributions?

Status: Discussed without resolution at 100922 sponsors meeting.

{C} Limiting contributed images to JPEG2000.

HathiTrust has exclusively gathered JPEG2000 images, intentionally excluding TIFF formats. However, MDL partners, especially MHS, possess master images in TIFF. The question arises: would MDL consider converting these TIFF images to JPEG2000 to enhance preservation efforts?

In the context of our work plan, we are considering the concept of "future-proof formats" for our project. Since we will be working with master images as they are, we cannot impose strict format requirements. Therefore, we need to explore the possibility of developing specific specifications for each content type to enhance sustainability, ensure smooth migration, and ultimately safeguard the longevity of our preservation archive.


EFC and JW, after consulting with JR and BT, have agreed to utilize only JPEG2000 for the prototype project. However, they recognized that this decision highlights a significant concern for future collaboration with HathiTrust, which is cautious about the formats it accepts. While HathiTrust's conservative approach is understandable, their refusal to ingest TIFF files raises concerns, especially for a project that aims to incorporate contributions from across Minnesota.

Our longstanding practice of utilizing TIFFs in our MDL is not rooted in deep philosophy; rather, we are exploring the broader HT community's perspective on adopting JP2 as a standard for image data Additionally, we are investigating the potential of on-the-fly TIFF derivation and the commitments we can confidently offer moving forward.

Status: Resolved for prototype, but should still be discussed w/r/t wider impact on sustainability of “live” project.

When some of the issues noted above are resolved, we’ll then move them to this section for the record.

The metadata sent to HT occasionally needs updates for accuracy, as demonstrated by the recent transfer of the JJHill collection to MHS, which necessitated changes in ownership metadata. To ensure the integrity of the information, a system should be established for regularly updating or reloading metadata into HT.

On 100927, a discussion with JW highlighted that HT allows for the reloading of metadata into the HT catalog, which is essential for search and display purposes visible to users. In addition, a static copy of the descriptive metadata is preserved with the ingest package at the time of ingestion and remains unchanged thereafter. Most users are unlikely to encounter this potentially outdated version of the metadata.

Status: Resolved. MDL can accept the potential of correcting catalog metadata as sufficient for our purposes.


MDL-HT Image Ingest Prototype Guidelines

This document outlines the interactions between the Minnesota Digital Library (MDL) and participants in the MDL-HT Image Ingest Prototype project, establishing a framework for the ongoing preservation of Minnesota's digital images through HathiTrust (HT). It provides an overview of the project's objectives and sets clear expectations for the extraction, packaging, transfer, ingestion, and display of content involved in the initiative.

Our objective is to establish a streamlined workflow for transferring digital image data and metadata from Minnesota to HathiTrust. We aim to showcase this process by successfully moving a specific collection of images into HathiTrust and collaborating with them to determine the optimal display format for these images. The University of Minnesota, an existing participant in HathiTrust, is facilitating this initiative.

To successfully complete the project, it is essential to focus on six key stages: first, extract master images and metadata from existing repositories; second, reformat these images for the preservation archive; third, package the binaries and their associated metadata for ingestion; fourth, transfer the packages to HathiTrust; fifth, ensure the packages are ingested by HathiTrust; and finally, facilitate the display of these images and the retrieval of master files through API calls.

We aim to create diverse content types, including straightforward continuous tone images, complex objects formed by a sequence of images with a specific structural relationship, and images that incorporate text along with text derived from optical character recognition (OCR).

The project operates under a stringent timeline, necessitating a pragmatic approach. We will utilize content from local systems without expecting binaries or metadata to conform to our standards. While we aim to adhere to the HathiTrust Guidelines for Digital Object Deposit, we acknowledge the challenges posed by new content types and less experienced partners. A key aspect of this prototype is identifying the minimum criteria required for successful participation in HathiTrust.

Our goal is to ensure that the products we create are both trustworthy and transparent. To achieve this, we are committed to thoroughly documenting our decision-making processes and the origins of our materials, which will be reflected in both external documentation and the metadata accompanying the project.

This project serves as a prototype aimed at developing and demonstrating a workflow for future sustained efforts, acknowledging that not every expected element will be fully executed. Due to our tight timeframe, we recognize the necessity of implementing certain shortcuts, which we will discuss and document. Additionally, we will provide recommendations for more effective long-term strategies for future workflow implementations.


We prioritize content that holds cultural heritage significance for Minnesota. Due to time constraints for in-depth evaluations, we acknowledge that the materials within MDL Reflections and the Minnesota Historical Society (MHS) content management systems possess this valuable heritage.

We prioritize professionally produced content that adheres to essential archival standards, ensuring that digital images closely reflect the original objects. This fidelity is represented through the use of the JPEG2000 format and high-quality scanning equipment. We acknowledge that both MDL and MHS meet these standards without the need for additional evaluation.

To ensure that the digital object transmitted to HT matches the one stored locally, it is essential to implement a validity check at the source. A local MD5 checksum for each binary object is a recommended method, although we are open to considering alternative validity check mechanisms.
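As one way to implement the source-side validity check described above, a contributor could compute an MD5 checksum for each binary before transfer. A minimal sketch; the function name is ours:

```python
import hashlib

def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file, reading in chunks so that
    large master images do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The same routine run at the receiving end lets both parties confirm the object arrived unaltered.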

[2.1.] We require that metadata be provided in XML format. We seek as much metadata as we can get from local systems, including any descriptive, technical, or administrative metadata present.

[2.2.] Metadata may be extracted either via a pull process, such as an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) harvest, or via a push mechanism, such as an XML report generated from a local system and sent to the project.
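The OAI-PMH pull in [2.2.] amounts to fetching a ListRecords URL and walking the returned XML. A minimal, hypothetical sketch; the base URL is a placeholder, not the actual Reflections endpoint, and the function names are ours:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url: str, set_spec: str = "") -> str:
    """Build an OAI-PMH ListRecords request URL for oai_dc metadata."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def dc_identifiers(oai_xml: str) -> list:
    """Pull every dc:identifier out of an OAI-PMH response document."""
    root = ET.fromstring(oai_xml)
    return [el.text for el in root.iter(DC_NS + "identifier")]
```

The response would be fetched with any HTTP client; resumption tokens for large result sets are omitted here for brevity.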

[2.3.] Metadata must include some reference to binary objects that identifies those objects uniquely.

[2.4.] Binary objects may be extracted via a pull process over the network or via a push process such as a hard disk sent to the project from a local system.

[2.5.] Binary objects must be named in such a way as to be uniquely identifiable and matched to their metadata.

[2.6.] Each item detailed in the metadata must have a distinct identifier, which may differ from the identifier used to connect binaries to the metadata.

[3.0.] Binary objects must be provided in JPEG2000 format. This means that MDL will have to reformat masters that are currently in TIFF format for this prototype.

[3.0.] This point has been resolved for the prototype, but the implications of JPEG2000-only remain severe for a longer-term effort. See “hot spot” issue {C} for further discussion.

[3.1.] Each image will be reformatted into JPEG2000 images suitable for HathiTrust.


[3.2.] Technical metadata will be created for each image and recorded in an XMP package that will be embedded in that same image.

[3.3.] A new MD5 checksum of the resulting JPEG2000 image will be generated.

MDL will create a unique HathiTrust identifier for each digital image or complex object, utilizing the "mdl" namespace designated by HathiTrust. This identifier will be generated in accordance with the HathiTrust ingest checklist, ensuring it can effectively retrieve specific objects from the HathiTrust system in the future.

MDL generates a Dublin Core (DC) metadata file for all ingested objects, which can be loaded into the HathiTrust catalog prior to individual object ingest. The DC records include essential metadata, ensuring a comprehensive cataloging process.

• [4.2.1.] Title: a brief, though possibly not unique, descriptor of the item;

• [4.2.2.] Other identifier: identifiers used by the local institutions to retrieve the item, may not be unique; also any existing OCLC numbers will be included here;

• [4.2.3.] Link: a URL that can be used to navigate to this item within the local system from which it was extracted;

All additional DC elements associated with an object will be passed through to HT. Furthermore, we will incorporate any technical metadata obtained from local sources or through probes of the binary objects and their embedded metadata, which will be provided as embedded XMP metadata in each JPEG2000 file as outlined in section [3.2.].

Each digital image will have a comprehensive METS file that consolidates all descriptive and administrative metadata. For compound objects consisting of multiple digital images, a single METS file will encompass the entire collection. This METS file will incorporate the Dublin Core (DC) metadata outlined in section [4.2.].

• [4.4.1.] HT identifier: built in [4.1.] uniquely identifying an object;

MDL-HT Specifications for Reflections Continuous Tone Images

This document outlines the specific packaging requirements for MDL Reflections continuous tone images related to the MDL-HT Image Prototype project. It should be used alongside the MDL-HT Image Ingest Prototype Guidelines, which define the interactions between MDL and HathiTrust. While maintaining the same numbering sequence as the guidelines, this document focuses on areas needing further clarification; any omitted numbers are likely covered in the guidelines.

MDL Reflections is a CONTENTdm system provided by OCLC, serving as the main catalog for the Minnesota Digital Library's statewide content. Although CONTENTdm contains descriptive and technical metadata for the collection's images, the actual master images are stored separately on a server managed by the University of Minnesota Libraries, in uncompressed TIFF format.

MDL Reflections features straightforward continuous tone images accompanied by individual descriptions, as well as intricate "compound objects" where a single description pertains to multiple images. Currently, the MDL-HT Image Ingest Prototype project focuses solely on transferring the simpler images.

To successfully complete the project, it is essential to focus on six key stages: first, extract master images and metadata from existing repositories; second, reformat the images to meet preservation archive standards; third, package these binaries along with their metadata for ingestion; fourth, transfer these packages to HathiTrust; fifth, ensure the successful ingestion of these packages by HathiTrust; and finally, facilitate the display of these images on HathiTrust and enable retrieval of the master files through API calls.

In this document, the term "we" specifically refers to MDL, primarily indicating the development team composed of Bill, Jason, and Eric, with Bill Tantzen being the primary representative. Thank you, Bill!

[1.2.] The masters we have are uncompressed TIFF images. They will need to be transformed into JPEG2000 (JP2) images for this project, which means they will no longer serve as the “master” images for MDL Reflections. This change is acceptable for the prototype phase. For more details, refer to the discussion of “hot spot” issue {C}.

The UMN Libraries systems office has created MD5 checksums for the TIFF master files to ensure data integrity. However, since we need to transcode these files to JP2 format, the existing checksums will not be applicable for this project. Therefore, we will need to generate new checksums for the JP2 versions.


[2.1.] Descriptive metadata is available in Dublin Core (DC) format from an OAI-PMH harvest. Technical metadata is found in EXIF data with each image. Some further technical metadata recorded in CONTENTdm is not exported via OAI-PMH and will be ignored.

MDL Reflections descriptive metadata is available in XML format through OAI-PMH harvesting, while EXIF data, though not in XML, can be easily accessed using tools like ImageMagick.

The "MDL identifier," found within the OAI data as a generic identifier in Dublin Core (DC), consists of a three-character lowercase code followed by a sequence of numerical digits.

[2.3.] Jason will provide Bill with a comprehensive list of the three character codes used in MDL identifiers.
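Given the identifier pattern described above (a three-character lowercase code followed by digits), a harvest script could validate identifiers before matching binaries to metadata. This check is our own illustration, not part of the spec, and it assumes the three characters are letters:

```python
import re

# Three lowercase letters followed by one or more digits, e.g. "umn123".
# Assumption: the three-character code consists of letters only.
MDL_ID_PATTERN = re.compile(r"^[a-z]{3}\d+$")

def is_valid_mdl_identifier(identifier: str) -> bool:
    """Return True when the string matches the MDL identifier shape."""
    return bool(MDL_ID_PATTERN.match(identifier))

print(is_valid_mdl_identifier("umn123"))   # True
print(is_valid_mdl_identifier("UMN123"))   # False: code must be lowercase
```

In practice the code portion would also be checked against the list of known three-character codes mentioned in [2.3.].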

[2.4.] The TIFF masters are available via on-campus file sharing from the store managed by the systems office They will be copied via file sharing.

[2.5.] The TIFF masters are named using the MDL identifiers.

[2.6.] Prefix the MDL identifier with “reflections.” to create the unique project identifier for each item.

[3.] Reformat

[3.1.] Convert TIFF masters into JPEG2000 via ImageMagick. Use the MDL identifier as the filename, for example “umn123.jp2”. (Confirm tool. What about Kakadu?)
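Since the tool choice was still open (ImageMagick vs. Kakadu), one way to keep the pipeline tool-agnostic is to isolate command construction from execution. A sketch assuming the standard ImageMagick `convert` CLI; the function name is ours, and the naming follows the “umn123” example:

```python
from pathlib import Path

def jp2_conversion_command(tiff_path: str) -> list:
    """Build the ImageMagick command that converts a TIFF master to
    JPEG2000, keeping the MDL identifier as the output filename."""
    src = Path(tiff_path)
    dst = src.with_suffix(".jp2")      # e.g. umn123.tif -> umn123.jp2
    return ["convert", str(src), str(dst)]

# The command would then be run with subprocess, e.g.:
#   subprocess.run(jp2_conversion_command("umn123.tif"), check=True)
```

Swapping in Kakadu would then mean replacing only this one function.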

Technical metadata for each image will be generated and stored in an XMP package, which will be embedded in the image using ImageMagick. This information will be documented for inclusion in a PREMIS event record, as outlined in section [4.2.5.].

[3.3.] A new MD5 checksum of the resulting JPEG2000 image will be generated via the unix command line md5 and saved for inclusion in the PREMIS event record; see [4.2.5.]. (Confirm tool.)

[4.1.] The “HathiTrust identifier” will be generated by adding the namespace “mdl.” as a prefix to the project identifier from [2.6.].
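The identifier construction in [2.6.] and [4.1.] chains two prefixes; a sketch with function names of our own choosing:

```python
def project_identifier(mdl_id: str) -> str:
    """[2.6.] Prefix the MDL identifier with "reflections."."""
    return "reflections." + mdl_id

def hathitrust_identifier(mdl_id: str) -> str:
    """[4.1.] Prefix the project identifier with the "mdl." namespace."""
    return "mdl." + project_identifier(mdl_id)

print(hathitrust_identifier("umn123"))   # mdl.reflections.umn123
```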

[4.2.] The DC metadata extracted via OAI-PMH will be passed through as the descriptive metadata for each item. This will include:

• [4.2.1.] The title is present as a <dc:title> element and will be left as-is.

• [4.2.2.] The identifiers are present as <dc:identifier> elements and will be left as-is.

• [4.2.3.] The link will be one of many <dc:identifier> elements.


HT must acknowledge that any element containing a URL should be transformed into an actionable link within the HT catalog. This requirement may apply specifically to any identifier that begins with the designated string.

• [4.2.4.] Other dc elements found in Reflections data include: description, date, source, format, subject, coverage, relation, publisher, and rights. All will be included as-is.

[4.4.] The METS file will be created using the MDL identifier as the file name, for example “umn123.xml”. This METS file will include:

• [4.4.1.] The HathiTrust identifier from [4.1.] will be used as the OBJID attribute value and as the ID attribute.

• [4.4.2.] A PREMIS event representing the initial ingest into MDL Reflections as a TIFF.

• [4.4.3.] A PREMIS event representing the conversion to JPEG2000.

• [4.4.4.] A PREMIS event representing the contribution of the item to HT.

• [4.4.5.] A section describing the structure of the contribution, which simply contains a pointer to the single image file.

• [4.4.6.] A section describing the single image file, including its type, creation date, size, and checksum.
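The METS contents listed above can be sketched as a minimal skeleton. This is our own illustration of the [4.4.1.], [4.4.5.], and [4.4.6.] portions only; the PREMIS event sections ([4.4.2.]–[4.4.4.]) are omitted, and attribute usage is simplified (a real METS FLocat uses xlink:href):

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS)

def mets_skeleton(ht_id: str, image_file: str, checksum: str) -> ET.Element:
    """Minimal METS layout: OBJID/ID [4.4.1.], a fileSec describing the
    single image [4.4.6.], and a structMap pointing at it [4.4.5.]."""
    mets = ET.Element(f"{{{METS}}}mets", {"OBJID": ht_id, "ID": ht_id})
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    grp = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    img = ET.SubElement(grp, f"{{{METS}}}file",
                        {"ID": "IMG1", "CHECKSUM": checksum,
                         "CHECKSUMTYPE": "MD5"})
    ET.SubElement(img, f"{{{METS}}}FLocat", {"href": image_file})
    struct = ET.SubElement(mets, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div")
    ET.SubElement(div, f"{{{METS}}}fptr", {"FILEID": "IMG1"})
    return mets
```

Serializing `mets_skeleton("mdl.reflections.umn123", "umn123.jp2", …)` gives the shape of the “umn123.xml” descriptor.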

To prepare files for submission, place the associated image and METS files, such as “umn123.jp2” and “umn123.xml,” into a single directory named “umn123.package”. Then, compress this directory into a single ZIP file.

Each ZIP file, named in the format “umn123.package.zip,” must exceed 128KB in size. If a ZIP file is smaller than this, the directory containing it should be saved and bundled with other undersized files into a single ZIP file.

[5.1.] Files will be transferred via FAT32 formatted hard drives with USB interfaces.

[6.2.] HT will notify Bill of the results of each ingest via email.


MDL-HT Specifications for Reflections Compound Objects

This document outlines the packaging requirements for MDL Reflections compound objects related to the MDL-HT Image Prototype project, complementing the MDL-HT Image Ingest Prototype Guidelines that define the MDL and HathiTrust interactions. It follows the same numbering sequence as the guidelines but focuses exclusively on areas needing additional detail. Any omitted numbers can be referenced in the guidelines.

MDL Reflections, a CONTENTdm system hosted by OCLC, serves as the primary catalog for the Minnesota Digital Library, showcasing content from across the state. While CONTENTdm contains descriptive and technical metadata for the collection's images, the actual master images are stored separately on a server managed by the University of Minnesota Libraries, in uncompressed TIFF format.

MDL Reflections features simple continuous tone images accompanied by individual descriptions, as well as complex “compound objects” where a single description pertains to multiple images. Having addressed the simple continuous tone images in the previous phase, the current focus of the MDL-HT Image Ingest Prototype project is to transfer only the “compound” images to HathiTrust.

To successfully complete the project, it is essential to focus on six key processing stages: first, extract master images and metadata from existing repositories; second, reformat the images to align with preservation archive standards; third, package these binaries along with their metadata for ingestion; fourth, transfer the packaged materials to HathiTrust; fifth, ensure the successful ingestion of these packages by HathiTrust; and finally, facilitate the display of these images and the retrieval of master files through API calls.

In this document, the term "we" specifically refers to MDL, particularly the development team consisting of Bill, Jason, and Eric, with Bill Tantzen being the primary representative.

The project requires converting our uncompressed TIFF master images to JPEG2000 (JP2), which means they will no longer be the original “master” images used by MDL Reflections. This transformation is permissible for the prototype, as discussed in the “hot spot” issue {C}.

The UMN Libraries systems office has created MD5 checksums for the TIFF master files to guard their integrity. Because we must transcode the files to JP2, those checksums will not apply to the transferred copies, so new checksums must be generated for the JP2 versions.
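A checksum pass over the transcoded files can be sketched in Python (the `jp2_out` directory name is hypothetical, and the project itself used the unix `md5` command rather than this script):

```python
import hashlib
from pathlib import Path

def md5_of_file(path):
    """Compute the MD5 hex digest of a file, reading in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Checksum every transcoded JP2 in a (hypothetical) output directory.
checksums = {p.name: md5_of_file(p) for p in Path("jp2_out").glob("*.jp2")}
```

The chunked read keeps memory flat even for very large masters.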


[2.1.] Descriptive metadata is available in Dublin Core (DC) format from an OAI-PMH harvest. Technical metadata is found in the EXIF data embedded in each image. Some further technical metadata recorded in CONTENTdm is not exported via OAI-PMH and will be ignored.

[2.2.] MDL Reflections descriptive metadata can be retrieved as XML via OAI-PMH harvesting. The EXIF data is not XML, but it is easily read with tools such as Kakadu.

[2.3.] The “MDL identifier” is present in the OAI data as a generic DC identifier. It consists of a three-character lowercase code followed by a string of digits.
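A validation sketch for that identifier shape (the regex encodes only what is stated above — three lowercase characters plus digits — any further CONTENTdm rules are not captured):

```python
import re

# Three-character lowercase code followed by digits, e.g. "umn123".
MDL_ID = re.compile(r"[a-z]{3}\d+")

def is_mdl_identifier(value):
    """True if `value` matches the MDL identifier shape described above."""
    return bool(MDL_ID.fullmatch(value))
```

Such a check is useful as a sanity gate when harvesting, since malformed identifiers would propagate into file names and HathiTrust identifiers downstream.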

[2.4.] The TIFF masters are available via on-campus file sharing from the store managed by the systems office. They will be copied via file sharing.

[2.5.] The TIFF masters are named using the MDL identifiers.

Each image has its own MDL identifier, but the multi-page object as a whole does not. To represent the object in [2.6.], we will use the MDL identifier of the first page.

[2.6.] Add “-all” to the end of the MDL identifier of the first page of the compound object to create a new MDL identifier for each compound object.

[2.6.] So if the first page had the MDL identifier “umn123”, then the MDL identifier for this compound object would be “umn123-all”.

[2.7.] Prefix the MDL identifier for the compound object with “reflections.” to create the unique project identifier for each compound object.

[2.7.] Continuing the example from [2.6.], the project identifier for the compound object would be “reflections.umn123-all”.

[3.1.] Continuous tone TIFF masters will be converted to JPEG2000 with the Kakadu software. Bi-tonal TIFF masters will remain TIFFs but will be given G4 (fax) compression, which may require reformatting with ImageMagick or a similar tool. Name the resulting files with the MDL identifier, for example “umn123.jp2”.
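A sketch of how those two conversion paths could be driven, building command lines for Kakadu's `kdu_compress` and ImageMagick's `convert`; the exact Kakadu options used by the project are not recorded here, so defaults are assumed:

```python
from pathlib import Path

def conversion_command(tiff_path, bitonal=False):
    """Return the command line for one master image (not executed here).

    Continuous-tone TIFFs go to JPEG2000 via Kakadu; bi-tonal TIFFs
    stay TIFF but are rewritten with G4 (fax) compression via ImageMagick.
    """
    src = Path(tiff_path)
    if bitonal:
        dst = src.with_name(src.stem + "-g4.tif")  # hypothetical naming
        return ["convert", str(src), "-compress", "Group4", str(dst)]
    dst = src.with_suffix(".jp2")
    return ["kdu_compress", "-i", str(src), "-o", str(dst)]

# To actually run one: subprocess.run(conversion_command("umn123.tif"), check=True)
```

The `-g4` suffix for bi-tonal output is an illustrative assumption; the spec keeps the MDL identifier as the file name.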

[3.2.] Technical metadata for each image will be generated and stored in an XMP package embedded in the image using Kakadu. Some details will also be saved for inclusion in a PREMIS event record.

[3.3.] A new MD5 checksum of the resulting JPEG2000 or TIFF image will be generated via the unix command-line md5 tool and saved for inclusion in a PREMIS event record; see [4.2.5.].


[4.1.] The “HathiTrust identifier” will be generated by adding the namespace “mdl.” as a prefix to the project identifier from [2.7.].

[4.1.] Continuing our example, the HathiTrust identifier for the compound object would be “mdl.reflections.umn123-all”.
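The identifier derivations in [2.6.], [2.7.], and [4.1.] chain together, and can be sketched as:

```python
def compound_object_ids(first_page_id):
    """Derive the identifier chain for a Reflections compound object
    from the MDL identifier of its first page."""
    mdl_id = first_page_id + "-all"        # [2.6.] compound-object MDL identifier
    project_id = "reflections." + mdl_id   # [2.7.] project identifier
    ht_id = "mdl." + project_id            # [4.1.] HathiTrust identifier
    return mdl_id, project_id, ht_id
```

For the running example, `compound_object_ids("umn123")` yields `("umn123-all", "reflections.umn123-all", "mdl.reflections.umn123-all")`.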

[4.2.] The DC metadata extracted via OAI-PMH will be passed through as the descriptive metadata for each compound object. This will include:

• [4.2.1.] The title, present as a dc:title element, will be left as-is.

• [4.2.2.] The identifiers, present as dc:identifier elements, will be left as-is.

• [4.2.3.] The link will be one of the many dc:identifier elements.

HT must recognize that a dc:identifier element containing a URL should be turned into an actionable link in the HT catalog, perhaps for identifiers that begin with a certain string.

• [4.2.4.] Other dc elements found in Reflections data include: description, date, source, format, subject, coverage, relation, publisher, and rights. All will be included as-is.

[4.3.] Technical metadata will be drawn from local sources or by probing the binary objects and their embedded metadata, and will be supplied as embedded XMP metadata in each case.

[4.4.] The METS file will be created using the MDL identifier of the compound object as the file name, for example “umn123-all.xml”. This METS file will include:

• [4.4.1.] The HathiTrust identifier from [4.1.] will be used as the OBJID attribute value and as the ID attribute.

• [4.4.2.] A PREMIS event representing the initial ingest into MDL Reflections as a TIFF.

• [4.4.3.] A PREMIS event representing the conversion to JPEG2000.

• [4.4.4.] A PREMIS event representing the contribution of the item to HT.

• [4.4.5.] A section describing the structure of the compound object, with pointers to the files involved. This includes pointers to the DC and OCR files described in [4.5.].

• [4.4.6.] A section describing each file, including its type, creation date, size, and checksum.
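A minimal sketch of the METS root, covering only what [4.4.1.] states (the HathiTrust identifier becomes both the OBJID and ID attributes); the PREMIS events, structMap, and file section of [4.4.2.]-[4.4.6.] would be appended to this root by the real packaging script:

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

def mets_skeleton(ht_identifier):
    """Build the bare METS root element with the HathiTrust identifier
    as OBJID and ID, per [4.4.1.], and return it serialized."""
    root = ET.Element("{%s}mets" % METS_NS,
                      {"OBJID": ht_identifier, "ID": ht_identifier})
    return ET.tostring(root, encoding="unicode")
```

For example, `mets_skeleton("mdl.reflections.umn123-all")` produces a one-element document carrying both identifier attributes.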

[4.5.] For each “page” of the compound object, two files will be created as needed:

• If a page has DC metadata, a metadata file named with the page's MDL identifier plus “.dc” (e.g., “umn123.dc”) will be created, containing an XML representation of the page's DC metadata.
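Writing such a per-page DC file could look like the following sketch; the `record` wrapper element is an assumption, since the spec only requires an XML representation of the page's DC metadata:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def page_dc_file(page_id, dc_fields, out_dir="."):
    """Write '<MDL identifier>.dc' holding the page's DC metadata.

    `dc_fields` maps DC element names to lists of values, since DC
    elements such as dc:identifier are repeatable.
    """
    root = ET.Element("record")  # wrapper element is an assumption
    for name, values in dc_fields.items():
        for value in values:
            ET.SubElement(root, "{%s}%s" % (DC_NS, name)).text = value
    path = Path(out_dir) / (page_id + ".dc")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path
```

For example, `page_dc_file("umn123", {"title": ["Front cover"]})` writes “umn123.dc” with a single dc:title element.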

MDL-HT Specifications for MHS objects

This document outlines the packaging requirements for Minnesota Historical Society objects in the MDL-HT Image Prototype project. It complements the MDL-HT Image Ingest Prototype Guidelines, which define the interactions between MDL and HathiTrust. It follows the same numbering sequence as the guidelines but covers only the areas needing additional detail; any omitted numbers are covered in the guidelines.

The Minnesota Historical Society (MHS) holds over 15,000 TIFF and JPEG images in its content management system (CMS), EMu. As a prototype, approximately 9,000 images that the CMS currently shares publicly as “Collections Online” will be preserved in HathiTrust (HT). Because some images are parts of compound objects, the actual number of catalog records is likely closer to 8,000. These catalog records will be converted into Dublin Core (DC) records.

Descriptive metadata will come from the catalog side of the EMu system. It will be exported as XML data and mapped to DC; the resulting DC records will be built into METS files and shipped to HathiTrust along with the binary images. See the separate “mapping” document that describes the path from EMu to DC in more detail.

Technical metadata is in EXIF, IPTC, and XMP data already embedded in the binary files. We can extract what we need for the HT XMP.

MHS Collections Online contains both simple continuous tone images with individual descriptions and complex “compound objects,” where a single description applies to multiple images. Both types of content will be included in this project.

The project involves six processing stages:

• Extract master images and metadata from the MHS EMu system.

• Reformat the images to meet preservation archive standards.

• Package the binaries with their metadata for ingest.

• Transfer the packages to HathiTrust.

• Ingest the packages at HathiTrust.

• Display the images at HathiTrust and retrieve master files via API calls.

In this document, "we" refers to MDL, particularly the development team consisting of Bill, Jason, and Eric, with Bill Tantzen being the primary representative. Thank you, Bill!


MHS provides both uncompressed TIFF and compressed JPEG images. The TIFFs must be converted to JPEG2000 (JP2) for this project, which means they will no longer be the original “master” images. This alteration is permissible for the prototype. HT has confirmed that the JPEG images can be sent in their compressed JPEG format.

MHS has created MD5 checksums for many of the TIFF and JPEG master files as a check on the integrity of their existing storage. Where a checksum is present in the record's “Multimedia” table, it will be used to verify the corresponding MHS master file and saved for a PREMIS fixity check event. Because we must transcode to JP2 and create new XMP information, the original checksums only verify that we received uncorrupted copies from MHS; new checksums must therefore be generated for both the JP2 and JPEG images.
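That receipt-verification step can be sketched as a small fixity check that yields a PREMIS-style outcome (the dictionary shape here is illustrative, not the project's actual PREMIS serialization):

```python
import hashlib

def verify_fixity(path, expected_md5):
    """Compare a received MHS master against the checksum shipped in the
    record's Multimedia table; return a PREMIS-style event outcome."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    actual = digest.hexdigest()
    return {"eventType": "fixity check",
            "eventOutcome": "pass" if actual == expected_md5.lower() else "fail",
            "actual": actual}
```

A "fail" outcome means the copy from MHS was corrupted in transit and the master should be re-requested before transcoding.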

[2.1.] Descriptive metadata is available in an EMu-specific XML format from MHS. Technical metadata is embedded in the IPTC, EXIF, and XMP data in each image. Some further technical metadata captured by the EMu multimedia system will not be used.

[2.2.] MHS descriptive metadata can be retrieved as XML via a manual export from EMu. The EXIF data is not XML, but it is easily read with Kakadu and other tools.

[2.3.] The “MHS CATIRN identifier” is available from the descriptive XML data as the “CatalogIrn” atom.
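Pulling that atom out of an exported record might look like the following; the element shape assumed here, `<atom name="CatalogIrn">1234</atom>`, is illustrative and should be matched to the actual EMu export:

```python
import xml.etree.ElementTree as ET

def catalog_irn(record_xml):
    """Return the CatalogIrn value from one exported EMu record,
    or None if the atom is absent. The <atom name="..."> shape is
    an assumption about the export format."""
    root = ET.fromstring(record_xml)
    for atom in root.iter("atom"):
        if atom.get("name") == "CatalogIrn":
            return atom.text.strip()
    return None
```

Stripping whitespace guards against pretty-printed exports where element text carries indentation.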

[2.4.] The TIFF and JPEG masters will be made available on a hard disk from MHS.

This disk will also hold additional images that must not be shared with HT or any other party. Once the needed masters have been extracted, the data on this disk must be destroyed.

[2.5.] The TIFF and JPEG masters are referred to by name in the “multimedia” table of the descriptive XML.

[2.6.] The “MHS identifier” will be the CatalogIrn for each MHS object.

[2.6.] The CatalogIrn is either a four or eight digit number, like “1234”.

[2.7.] Prefix the MHS identifier for the MHS object with “mhs.catirn.” to create the unique project identifier for each MHS object.

[2.7.] Continuing our example from [2.6.] the project identifier for the MHS object would be something like “mhs.catirn.1234”.

[3.1.] Continuous tone TIFF masters will be converted to JPEG2000 with Kakadu, while JPEG masters remain in JPEG format. The XMP data in the JPEG files should be checked and may need reformatting with ImageMagick or some other tool. Use the given filename for each of these, though the extension may have to change, for example “mh218s.jp2”.

[3.2.] Technical metadata for each image will be generated and embedded in an XMP package using Kakadu or ImageMagick. Some details will also be saved for inclusion in a PREMIS event record.

[3.3.] A new MD5 checksum of the resulting JPEG2000 or JPEG image will be generated via the unix command-line md5 tool and saved for inclusion in a PREMIS event record; see [4.4.].

[4.1.] The “HathiTrust identifier” will be generated by adding the namespace “mdl.” as a prefix to the project identifier from [2.7.].

[4.1.] Continuing our example, the HathiTrust identifier for the MHS object would be “mdl.mhs.catirn.1234”.

[4.2.] DC metadata will be prepared by mapping the descriptive XML from EMu and will become the descriptive metadata for each MHS object. This will include:

• [4.2.1.] The title as a dc:title.

• [4.2.2.] The identifiers present as dc:identifier.

• [4.2.3.] The link will be one of many dc:identifier elements.

HT must recognize that a dc:identifier element containing a URL should be turned into an actionable link in the HT catalog, perhaps for identifiers that begin with a certain string.

• [4.2.4.] Other dc elements found in Reflections data include: description, date, source, format, subject, coverage, relation, publisher, and rights.

[4.2.] See the separate “mapping” document for details on the mapping from EMu’s exported XML to the DC we want for this project.

Sample MDL METS file for HT


Minnesota Digital Library

Buhl; Chisholm; Coleraine; Ely; Eveleth; Gilbert; Hibbing; Mountain Iron; Soudan; Tower;

Mesabi Iron Range; Vermilion Iron Range

St Louis

Minnesota

United States

Oliver Iron Mining Company; United States Steel Corporation

1928 mapbook featuring both open pit and underground mining operations on the Mesabi and Vermilion Iron Ranges of Minnesota.

Atlases

http://cdm15160.contentdm.oclc.org/u?/irrc,2983

Iron Range Research Center, 1005 Discovery Drive, Chisholm, Minnesota 55719; http://mndiscoverycenter.com/research-center

The usage of this image is subject to U.S. and international copyright laws. For further details regarding this image, please contact the Iron Range Research Center in Chisholm, MN, or visit http://mndiscoverycenter.com/research-center/archive.

Oliver Iron Mining Company; United States Steel Corporation

Business and industry

Iron mines and mining

United States Steel; Adams Mine; Alpena Mine; Arcturus Mine; Auburn Mine; Burt Mine; Canisteo Mine; Carson Lake Mine; Chisholm Mine; Clark Mine; Culver Mine; Day Mine; Deacon Mine; Duncan Mine; Ely Mine; Fayal Mine; Fraser Mine; Glen Mine; Godfrey Mine; Hartley St Clair Mine; Hill Mine; Holman Mine; Hull Rust Mine; Judd Mine; Kerr Mine; Leonard Mine; Leonidas Mine; Lone Jack Mine; McEwan Mine; Minnewas Mine; Missabe Mt Mine; Monroe Mine; Morrison Mine; Mt Iron Mine; Myers Mine; North Star Mine; Ohio Mine; Ordean Mine; Palmer Mine; Philbin Mine; Pillsbury Mine; Pioneer Mine; Pool Mine; Prindle Mine; Rouchlaou Mine; Sauntry Mine; Seller Mine; Sharon Mine; Shaw Moose Mine; Shiras Mine; Sibley Mine; Soudan Mine; Spruce Mine; Stephens Mine; Sulivan Mine; Sweeney Mine; Walker Mine; Wellington Mine; Bovey

Oliver Iron Mining Company Mapbook

Iron Range Research Center


bc797078d8c6188255fac1e98b58ef83

http://cdm15160.contentdm.oclc.org/u?/irrc,2548

Iron Range Research Center, 1005 Discovery Drive, Chisholm, Minnesota 55719; http://mndiscoverycenter.com/research-center

The use of this image is subject to U.S. and international copyright laws. For more details regarding this image, please contact the Iron Range Research Center in Chisholm, MN, or visit http://mndiscoverycenter.com/research-center/archive.

Front cover

MDL

reflections.umn79102-all

UUID

C62D4AC6-13AB-11E0-8A1D-C740821A552F

capture

Initial capture of TIFF master

tool

Phase One

scanner

MARC21 Code

MnU

Executor

UUID

C62D5264-13AB-11E0-8A1D-C740821A552F

image compression


Convert TIFF master to compressed format

MARC21 Code

MnU

Executor

tool

kakadu/kdu_compress v6.4.1 software

tool

ImageMagick v6.6.3-1

software

UUID

C62D63C6-13AB-11E0-8A1D-C740821A552F

message digest calculation

Calculation of page-level md5 checksums

MARC21 Code

MnU

Executor

tool

perl v5.10.0/Digest::MD5 v2.51 software

UUID

C62D6A92-13AB-11E0-8A1D-C740821A552F

source mets creation

Creation of HathiTrust source METS file

MARC21 Code

MnU

Executor

tool

makemets_compound.pl v1.1 software

UUID

C62D738E-13AB-11E0-8A1D-C740821A552F
