The Knowledge Bank featured collection this month is the Jesse Owens Photographs collection. The collection currently contains 141 images – a sampling of the more than two thousand photographs from the life of Jesse Owens held at The Ohio State University Archives. The digital images that we archived in the Knowledge Bank all contain embedded descriptive metadata added by the University Archives.
The Jesse Owens images were batch loaded into the Knowledge Bank (a DSpace repository). Our routine process for batch loading involves creating a spreadsheet (.csv) containing the metadata and filename for each item. A stand-alone Java tool (SAFBuilder) transforms the metadata contained in the spreadsheet into dublin_core.xml files and builds the simple archive format directory (metadata + content files) required for the DSpace item importer.
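For context, the simple archive format that the DSpace item importer expects is one directory per item, each holding the content file(s), a contents manifest, and the dublin_core.xml that SAFBuilder generates from the spreadsheet. The layout below is illustrative (item and file names are invented):

```
archive/
  item_000/
    contents          # one line per content file, e.g. "image001.tif"
    dublin_core.xml   # e.g. <dcvalue element="title" qualifier="none">...</dcvalue>
    image001.tif
  item_001/
    ...
```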
Working with the Archives, I designed the Qualified Dublin Core (QDC) metadata for the Knowledge Bank image collection. My initial mock-up of the collection incorporated the descriptive metadata the Archives had added to the digital images using Adobe Photoshop. Although there was not a straight one-to-one relationship between the embedded metadata and the Knowledge Bank metadata, I certainly wanted to reuse the embedded metadata when building the batch load spreadsheet. One possibility for reusing the metadata would be to have a staff member or student assistant manually copy and paste the image metadata into the Knowledge Bank spreadsheet by following a mapping of the Photoshop fields and the spreadsheet columns (QDC fields). That approach, however, would be very time-consuming and inefficient. Instead, we used a workflow I developed for the automated reuse of the embedded descriptive metadata.
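As an illustration, such a field mapping might pair Photoshop File Info fields with QDC fields along these lines (hypothetical; the collection's actual mapping differs and has more fields):

```
Photoshop field        Knowledge Bank (QDC) field
Document Title     ->  dc.title
Description        ->  dc.description
Keywords           ->  dc.subject
Copyright Notice   ->  dc.rights
```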
In order to take advantage of our routine batch loading process, the embedded image metadata workflow exports the metadata for all of the images in the collection into a .csv file for use with the simple archive format packager SAFBuilder. The tool used by the workflow to extract the embedded metadata is ExifTool by Phil Harvey – a platform-independent Perl library plus a command-line application for reading, writing and editing meta information in a variety of files.
Running exiftool.exe from the command line, I am able to export the embedded metadata for all of the images to a .csv file.
exiftool -csv -r t/images > out.csv
The -csv option pre-extracts information from all input files, produces a sorted list of available tag names as the column headers, and organizes the information under each tag. A “SourceFile” column is also generated. The features of the -csv option make it great for extracting all information from multiple images. The -r option recurses through all images in a hierarchy of directories.
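An excerpt of such an export might look like the following (the file name and values are invented for illustration; note the alphabetized tag-name headers after SourceFile):

```
SourceFile,Description,Rights,Subject,Title
t/images/image001.tif,"A sample description","The Ohio State University","track, athletics","Sample title"
```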
Unwanted columns (data we would not be using for the Knowledge Bank) are deleted from the .csv output and column headers for the remaining data are renamed based on the QDC mapping for the collection. Final batch load preparation may include adding enhancements to values in exported fields, changing the character used for delimiting multiple values in a field, and adding fields not available in the embedded metadata.
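The cleanup step can be sketched in Python, though any spreadsheet tool or script works. The FIELD_MAP below is illustrative, not the collection's actual mapping, and the delimiter swap assumes ExifTool's default of joining multi-valued tags with ", " and SAFBuilder's convention of "||" between repeated values:

```python
import csv

# Hypothetical mapping from ExifTool tag names (column headers in the
# raw export) to the QDC column headers used in the batch load sheet.
FIELD_MAP = {
    "Title": "dc.title",
    "Description": "dc.description",
    "Subject": "dc.subject",
    "Rights": "dc.rights",
}

def clean_rows(rows, field_map=FIELD_MAP, multi_delim="||"):
    """Keep only the mapped columns, rename the headers, and swap the
    multi-value delimiter (ExifTool joins list values with ', ')."""
    cleaned = []
    for row in rows:
        new_row = {"filename": row["SourceFile"]}
        for tag, qdc in field_map.items():
            new_row[qdc] = row.get(tag, "").replace(", ", multi_delim)
        cleaned.append(new_row)
    return cleaned

def convert(in_path="out.csv", out_path="kb_metadata.csv"):
    """Read the raw ExifTool export and write the batch load spreadsheet."""
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = clean_rows(csv.DictReader(f))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["filename"] + list(FIELD_MAP.values()))
        writer.writeheader()
        writer.writerows(rows)
```

Enhancements to field values and columns not present in the embedded metadata would still be added after this pass.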
Note: rather than deleting unwanted data after the fact, a ‘targeted’ CSV export can be run in which only the desired tags are extracted:
exiftool -csv -title -rights -subject t/images > out.csv
This batch load workflow eliminates re-keying and takes advantage of the metadata created by our partners, allowing us to add new content to our institutional repository more efficiently.