The Library of Congress recently released 25 million metadata records for free bulk download at loc.gov/cds/products/marcDist.php. These MARC records form the foundation of library catalogs such as OskiCat, which for decades have enabled users to find and access books and other media. As the LOC describes the collection:
The data covers a wide range of Library items including books, serials, computer files, manuscripts, maps, music and visual materials. The free data sets cover more than 45 years, ranging from 1968, during the early years of MARC, to 2014. Each record provides standardized information about an item, including the title, author, publication date, subject headings, genre, related names, summary and other notes.
The data is available in UTF-8, MARC8, and XML formats, and is divided by media type, including books, computer files, maps, music, and more.
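For readers who want to see what working with the XML format might look like, here is a minimal Python sketch that pulls a record ID, author, and title out of MARCXML using only the standard library. It assumes the standard MARC 21 conventions (field 001 for the control number, 100 $a for the main personal author, 245 $a for the title) and the MARCXML "slim" namespace. The sample record and the function names are made up for illustration and are not taken from the LOC files.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARCXML "slim" schema.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}

# Hypothetical sample record, written for illustration only.
SAMPLE_XML = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">example-0001</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Doe, Jane.</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title /</subfield>
      <subfield code="c">Jane Doe.</subfield>
    </datafield>
  </record>
</collection>"""


def first_subfield(record, tag, code):
    """Return the text of the first matching subfield, or None if absent."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    node = record.find(path, MARC_NS)
    return node.text.strip() if node is not None and node.text else None


def summarize_records(xml_text):
    """Return one dict per record with its control number, author, and title."""
    root = ET.fromstring(xml_text)
    summaries = []
    for record in root.iter(f"{{{MARC_URI}}}record"):
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        summaries.append({
            "id": control.text.strip() if control is not None and control.text else None,
            "author": first_subfield(record, "100", "a"),
            "title": first_subfield(record, "245", "a"),
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_records(SAMPLE_XML):
        print(summary)
```

Because the actual download files are large, loading a whole file into memory this way is unlikely to be practical. A streaming approach such as `xml.etree.ElementTree.iterparse` would probably be a better fit for real use.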
We’ve added the resource to the public section of the Computational Text Analysis and Text Mining Guide, where you can find many other sources for large-scale text analysis projects. For details on accessing the data, see the LOC’s Getting Started (PDF).
Stacy Reardon, Literatures and Digital Humanities Librarian, sreardon [at] berkeley.edu
Cody Hennesy, E-Learning and Information Studies Librarian, chennesy [at] berkeley.edu