A collection of digitized texts marks the start of a research project — or does it?
For many social sciences and humanities researchers, creating searchable, editable, and machine-readable digital texts out of heaps of paper in archival boxes or from books painstakingly sourced from overlooked corners of the library can be a tedious, time-consuming process.
Scholars using traditional methodologies may find it advantageous to have a digital copy of their source material, if only to be able to more easily search through it. For anyone who wants to use computational methods and tools, converting print sources to digital text is a prerequisite. The process of converting an image of scanned text to digital text involves Optical Character Recognition (OCR) software. New developments in campus services are providing additional options for researchers who wish to prepare their texts this way.
What resources does UC Berkeley offer to convert scans to digital text?
- For basic needs, try the Library’s scanners.
- For documents with complex layouts or for additional language support, ABBYY FineReader with Berkeley’s OCR virtual desktop is a solution.
- Finally, Tesseract can handle large scale OCR projects.
Books and simple documents: library scanners with OCR software
All of the UC Berkeley libraries, including the Main (Gardner) Stacks, have at least one Scannx scanner station with built-in OCR software. This software automatically identifies and splits apart pages when you’re scanning a book, and it performs OCR on any text it can identify. You can save your results as a “Searchable PDF” (with embedded OCR output) or as a Microsoft Word document, or you can save page images as TIFF, JPEG, or PDF files (omitting digitized text). For book scanning or simple document scanning, the library scanners can take you from analog to digital in a single step.
Complex layouts or language support: ABBYY FineReader and Berkeley Research Computing’s OCR virtual desktop
If your source material has a complex layout (like irregular columns, embedded images, and/or tables that you want to continue to edit as tables) or uses a non-Latin alphabet, ABBYY FineReader OCR may get you better OCR results. FineReader supports Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese, and Thai, among other languages.
On campus, FineReader is available on computers in the D-Lab (350 Barrows). From off campus, the OCR virtual research desktop provided through Berkeley Research Computing’s AEoD service (Analytic Environments on Demand, pronounced “A-odd”) allows users to log into a virtual Windows environment from their own laptop or desktop computer anywhere there’s an internet connection. If you’re visiting an archive and aren’t sure that your image capture setup is getting good enough results to use as OCR input, you can log into the OCR virtual research desktop and try out a couple of samples, then refine your process as needed. You can also work on your OCR project from home, or on nights and weekends when campus buildings are closed. To use the OCR virtual research desktop, sign up for access at http://research-it.berkeley.edu/ocr.
FineReader is not generally recommended for very large numbers of PDFs because each conversion must be started by hand. However, if you don’t need to differentiate the origin of your various source PDFs (e.g., if your text analysis will treat all text as part of a single corpus, and it doesn’t matter which of the million PDFs any particular bit of text originally came from), you might be able to use FineReader by creating one or more “mega-PDFs” that combine tens or hundreds of source PDFs and letting it run over a long period of time. At a certain point, however, Tesseract might be a better choice.
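The “mega-PDF” approach described above starts with deciding which source files go into which combined file. The following is a minimal sketch of that planning step only. The function name `plan_batches` and the batch size of 100 are hypothetical choices for illustration, and the merging itself would still need a separate PDF tool.

```python
from pathlib import Path
from typing import Iterator, List


def plan_batches(pdf_paths: List[Path], batch_size: int = 100) -> Iterator[List[Path]]:
    """Yield consecutive groups of at most batch_size PDFs, in sorted order.

    Each group is a candidate for one combined "mega-PDF".
    The batch size is an illustrative assumption, not a FineReader limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ordered = sorted(pdf_paths)  # stable order makes batches reproducible
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```

For example, 250 source PDFs with the default batch size would produce three groups of 100, 100, and 50 files. Keeping a record of each group's contents is the only way to trace text back to a rough origin later.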
OCR at scale: Tesseract on the Savio high-performance compute cluster
If you have thousands, hundreds of thousands, or millions of PDFs to OCR, a high-powered, automated solution is usually best. One such option is the open source OCR engine Tesseract. Research IT has installed Tesseract in a container that you can use on the Savio high performance computing (HPC) cluster. For researchers who are less comfortable with the command line, there is also a Jupyter notebook available that provides the necessary commands and “human-readable” documentation, in a form that you can run on the cluster. Any tenure-track faculty member is eligible for a Faculty Computing Allowance for using Savio. For graduate students, talk to your advisor about signing up for an allowance and receiving access.
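To give a sense of what automated OCR looks like, here is a minimal sketch that runs the Tesseract command-line tool over a folder of page images. It assumes the PDFs have already been converted to TIFF page images (Tesseract reads images, not PDFs) and that the `tesseract` executable is on the path. The function names and folder layout are hypothetical, and this is not the Research IT notebook or the Savio container workflow.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> List[str]:
    """Build the command: tesseract <image> <output base> -l <language>.

    Tesseract appends ".txt" to the output base on its own.
    """
    return ["tesseract", str(image), str(out_dir / image.stem), "-l", lang]


def ocr_directory(image_dir: Path, out_dir: Path, lang: str = "eng") -> int:
    """Run Tesseract on every .tif file in image_dir; return how many were processed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for image in sorted(image_dir.glob("*.tif")):
        # check=True raises CalledProcessError if Tesseract reports a failure
        subprocess.run(build_tesseract_command(image, out_dir, lang), check=True)
        count += 1
    return count
```

Calling `ocr_directory(Path("scans"), Path("text"))` would write one text file per page image. Because each page is processed independently, the same loop can be split across many cluster jobs.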
No matter how large or small your OCR project is, UC Berkeley has the perfect tool for you in scanning equipment, ABBYY FineReader, or Tesseract. Happy converting!
Related Event: From Sources to Data: Using OCR in the Classroom
March 16, 2017
10:30am to 12:00pm
Open to: All faculty, graduate students, and staff
Quinn Dombrowski, Research IT quinnd [at] berkeley.edu
Text mining, the process of computationally analyzing large swaths of natural language texts, can illuminate patterns and trends in literature, journalism, and other forms of textual culture that are sometimes discernible only at scale, and it’s an important digital humanities method. If text mining interests you, then finding the right tool — whether you turn to an entry-level system like Voyant or master a programming language like Python — is only a part of the solution. Your analyses are only as strong as the texts you’re working with, after all, and finding authoritative text corpora can sometimes be difficult due to paywalls and licensing restrictions. The good news is the UC Berkeley Libraries offer a range of text corpora for you to analyze, and we can help you get your hands on things we don’t already have access to.
The first step in your exploration should be the library’s Text Mining Guide, which lists text corpora that are either publicly accessible (e.g., the Library of Congress’s Chronicling America newspaper collection) or are available to UCB faculty, students, and staff (e.g., JSTOR Data for Research). The content of these sources is available in a variety of formats: you may be able to download the texts in bulk, use an API, or make use of a content provider’s in-platform tools. In other cases (e.g., ProQuest Historical Newspapers), the library may be able to arrange access upon request. While the scope of the corpora we have access to is wide, we are particularly strong in newspaper collections, pre-20th century English literature collections, and scholarly texts.
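As an illustration of what “use an API” can mean for a public corpus, the sketch below builds a search URL for Chronicling America. The endpoint and the parameter names (`andtext`, `format`, `page`) are assumptions based on the search interface the Library of Congress has documented for that collection, so check the current documentation before relying on them. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed search endpoint for Chronicling America; verify before use.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(term: str, page: int = 1) -> str:
    """Return a URL requesting JSON search results for newspaper pages matching term."""
    params = {"andtext": term, "format": "json", "page": page}
    return f"{BASE_URL}?{urlencode(params)}"
```

The resulting URL can be opened in a browser or fetched with any HTTP client. This applies only to openly accessible collections, not to licensed library databases.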
What happens if the library doesn’t have what you need? We regularly facilitate the acquisition of text corpora upon request, and you can always email your subject librarian with specific requests or questions. The library will deal with licensing questions so you don’t have to, and we’ll work with you to figure out the best way to make the texts available for your work, often with the help of our friends in the D-Lab or Research IT. We also offer the Data Acquisition and Access Program to provide special funding for one-time data set purchases, including text corpora. Your requests and suggestions help the library develop our collection, making text mining easier for the next researcher who comes along.
- Unless explicitly stated, our contracts for most Library databases and library resources (e.g., Scopus, Project MUSE) don’t allow for bulk download. Please avoid web scraping licensed library resources on your own: content providers realize what is happening pretty quickly, and they react by shutting down access for our entire campus. Ask your subject librarian for help instead.
- Keep in mind that many of the vendors themselves are limited in how they can provide access to a particular resource, and in how much access they can provide, based on their own contractual agreements. It’s not uncommon for specific contemporary newspapers and journals to be unavailable for analysis at scale, even when library funding for access may be available.
- Library Text Mining Guide
- Library Data Acquisition and Access Program
- D-Lab Computational Text Analysis Working Group
- D-Lab Learn Python Working Group
Stacy Reardon and Cody Hennesy
Contact us at sreardon [at] berkeley.edu; chennesy [at] berkeley.edu
Need to take a poster to a conference? The Social Research Library will loan you a carrying tube that can hold up to a 37″ wide poster! Check one out for up to a month! Click here to view availability, or search OskiCat for “carrying tube” – or just drop by!
Head, Social Sciences Division
sedwards [at] library.berkeley.edu
Cross-posted from the UCB Library Scholarly Communications blog
You’ve worked painstakingly for years (we won’t let on how many) on your magnum opus: your dissertation—the scholarly key to completing your graduate degree, securing a possible first book deal, and making inroads toward faculty status somewhere. Then, as you are about to submit your pièce de résistance through ProQuest’s online administration system, you are confronted with the realization that—for students at many institutions—your dissertation is about to be made available open access online to readers all over the world (hurrah! and gulp).
Because your dissertation will be openly available online, there are many questions you need to address—both about what you put in your dissertation, and the choices you’ll need to make as you put it online. If you are a first-time author, facing these concerns can be daunting to say the least. And you definitely don’t want to be thinking about them for the first time when you are scrambling to submit your dissertation to ProQuest.
For instance, you’ll need to consider:
- Are you using materials created by other people in your dissertation? Perhaps you’re using photos, text excerpts, scientific drawings or diagrams? You might need the authors’ permission to include them.
- Are you including information about particular living individuals? You might need to consider their privacy rights (see, for instance, a discussion on p. 15 of a University of Michigan dissertation guide).
- If you own copyright in your dissertation (as most grad students in the UC campus system do), should you register your copyright?
- Do you need to embargo your dissertation for privacy, patent, or other concerns?
- Should you license your dissertation for greater use by others?
At UC Berkeley, we’ve created a workflow and guide for you to tackle these kinds of important copyright and other legal questions. Below, I’ve included highlights from the workflow, but there are plenty more best practices to draw upon in the guide. What follows are, of course, exactly that: best practices, and not legal advice. Your local scholarly communication officer or librarian (see this list for some resources around UC) can help you find additional information as you consider these issues for your own dissertation.
Rachael G. Samberg
Scholarly Communication Officer
Contact me at rsamberg [at] berkeley.edu.
Do you need to purchase data for your research or teaching? The Library can help!
The Data Acquisition & Access Program is focused on datasets that require license or user agreements to access. Made possible through a partnership between the Library and the D-Lab, this program provides up to $100,000 per year for the purchase of data that is of use to more than one research group or department.
There will be quarterly review cycles each year, and applications are accepted on a rolling basis. The next review date will be November 1st, 2016.
All UC Berkeley faculty, students, and academic researchers (or UCB librarians on their behalf) are eligible to apply; undergraduate requests must be supported by a faculty member.
To apply for funding, fill out the Purchase Request Form and select Data Set under “Type of material.”
Examples of data sets purchased so far:
- India – National Sample Survey – Consumer Expenditure Survey Series
- India Annual Survey of Industries, 2008-2013
- Linguistic Data consortia membership
- Complete Northern California Real Estate Foreclosure
- Amadeus Historical Data (database of comparable financial and business information on Europe’s biggest 510,000 public and private companies by assets; 43 countries are covered)
(Inspired by a post on the Research IT blog)
While doing an academic research project, you may encounter the need for a demographic or economic statistic (e.g., What is the current population of King City, CA? How did people get to work in 1960?). There are many sources of statistics out there, some reliable, and some – well, not so much. Sources may vary by location, time period, types of questions asked, etc. One of the most reliable sources of U.S. demographic statistics and data to become familiar with is the United States Census Bureau.
The work of the U.S. Census Bureau dates back to the founding of the country, though the Census Bureau wasn’t a permanent government office until the early 1900s. Its primary role is mandated by the Constitution: Article 1, Section 2 of the U.S. Constitution stipulates that an enumeration of the population be taken every ten years for apportionment and redistricting of the U.S. House of Representatives. This enumeration is the Decennial Census, which has taken place every 10 years from 1790 to the present.
Since the Census Bureau runs far too many data programs to cover in this short article, we’ll look at the two most widely used: the Decennial Census and the American Community Survey. The Decennial Census is what most people think of when they think about the U.S. Census: conducted every ten years, lots of details, etc. Up until the year 2000, the Decennial Census was the main source of detailed statistics. However, as of 2005, we now have the American Community Survey, which provides the same level of detailed information in 1-, 3-, and 5-year rolling averages. The shorter the timespan, the more current the information, but 1-year statistics are only available for areas with populations of 65,000 or larger. Over a longer timespan, the statistics are less current, but they are available for smaller populations as well.
The Census Bureau gathers a lot of information and makes it available in a number of ways. Over the last two decades, U.S. Census information has become more readily available online. Current information is available through the Census Bureau’s site or via American FactFinder. Other sources include the Library’s subscription to Social Explorer, which covers 1790 to the present and allows for the creation of maps. And if you like maps, try Policy Map. If you want to dig into the numerical data and not just the statistics, the Census Bureau provides that as well. To learn more about these sources and the Census, visit the Library’s Census Guide. The D-Lab also holds training sessions on the Census data.
One final point to consider when talking about the Census or any government program: funding. Many times in the recent past, the U.S. Congress has threatened to cut or even kill funding for the Census Bureau. In 2011, Congress succeeded when they cut the funding to the office that provided the Statistical Abstract of the United States, despite a public outcry against this cut. As the Census Bureau gears up for the 2020 Census, will there be cuts to the program? How might that affect your research or research in your field?
Government Information, Political Science and Public Policy Librarian
Contact me at jsilva [@] library.berkeley.edu
Every time you download a spreadsheet, use a piece of someone else’s code, share a video, or take photos for a project, you’re working with data. When you are producing, accessing, or sharing data in order to answer a research question, you’re working with research data, and Berkeley has a service that can help you.
Research Data Management at Berkeley is a service that supports researchers in every discipline as they find, generate, store, share, and archive their data. The program addresses current and emerging data management issues, compliance with policy requirements imposed by funders and by the University, and reduction of risk associated with the challenges of data stewardship.
In September 2015, the program launched the RDM Consulting Service, staffed by dedicated consultants with expertise in key aspects of managing research data. The RDM Consulting Service coordinates closely with consulting services in Research IT, the Library, and other researcher-facing support organizations on the campus. Contact a consultant at firstname.lastname@example.org.
The RDM program also developed an online resource guide. The Guide documents existing services, providing context and use cases from a research perspective. In the rapidly changing landscape of federal funding requirements, archiving tools, electronic lab notebooks, and data repositories, the Guide offers information that directly addresses the needs of researchers at Berkeley. The RDM Guide is available at researchdata.berkeley.edu.
Research Data Management Service Design Analyst
Contact me at wittenberg[@]berkeley.edu
Are you a humanist working with digital materials to do your research? Are you carrying out your research or presenting your results using digital methods and tools? Are you teaching using digital tools and content? If you answered yes to any of these questions, then your work might be considered digital humanities.
Digital humanities has been described as “dynamic dialogue between emerging technology and humanistic inquiry” (Varner, 2016). It is a term that is used to describe a domain within the humanities where researchers are doing most of their work using digital tools, content, and/or methods. Whether this work is partially or exclusively digital, this designation is a way to set these emerging practices apart from more traditional or “analog” ones, though there is no clear distinction.
The scope of digital humanities has been a hot topic in recent years, especially in relation to the library’s role in this new domain. What services does the library provide to digital humanists? What can the library do to support digital humanities on campus?
The Library has always provided services to researchers and will continue to provide those same services, as well as expand its offerings to encompass new forms of research, publication, and teaching. It is not a question of libraries supporting one or the other. Digital humanities is still evolving, and the Library is evolving right along with it, continuing to offer collections, research support, and instruction in both traditional ways and new ones as this “dynamic dialogue” expands.
The Library collects and creates digital resources at the same time that it continues to build its analog collections. Myriad databases, data sets, and other digital resources are available through the Library catalog and website. In addition, our digitized special collections are available through Calisphere, which provides access to digital images, texts, and recordings from California’s great libraries, archives, and museums.
While the library is busy collecting and organizing digital resources, reference librarians are ready and willing to provide you with research help. The expertise that librarians have in connecting researchers to materials, designing research, and providing instruction on how to evaluate and use new content and tools continues to grow and expand in this new environment.
In addition, the library provides instruction to help those new to the digital humanities learn about the tools and skills needed to do this work. Many librarians have partnered with the D-Lab in Barrows Hall on campus to provide instruction on citation management, metadata, and research data management. The D-Lab also offers training in various programming languages and data tools, as well as consulting on research design, data analysis, data management, and related techniques and technologies. Library trainings and events are generally posted to the library events calendar.
The Library also works closely with the Digital Humanities @ Berkeley group (a partnership between Research IT and the Office of the Dean of Arts and Humanities), which supports digital humanities events, trainings, course support, and graduate student and faculty projects. Their calendar lists talks, workshops, and other events designed to help move the DH community on campus forward.
Keeping the “dynamic dialogue” of digital humanities moving forward is a campus goal, and the relationship between digital humanities and the Library is an evolving one. We are hiring new librarians with digital humanities skills to further develop this relationship and expect to see more growth in the scope of the library’s involvement in digital humanities as the community on campus continues to expand.
Mary W. Elings
Head of Digital Collection, The Bancroft Library
Contact me at melings [at] berkeley.edu
You probably love maps (who doesn’t?!). They can be beautiful, visually compelling, interesting representations of the world. You might have one hanging on your wall, laugh over one showing how Oakland was the new Brooklyn even back in 1888, or exclaim over one (my new favorite) showing bear concentrations in Norway.
These same qualities that make us love maps are also why they can be excellent research tools. Even if you are not a geographer or urban planner you can use maps to provide context for a place you are describing, explore spatial relationships, or visualize your data in a way that highlights new patterns. For example, a music student used and made maps to trace the locations of 19th century Parisian opera goers!
In addition to the approximately half-a-million physical maps and air photos in the UC Berkeley Library’s collections, the library subscribes to several online databases that let you explore demographic data and create maps that you can share online or print. SimplyMap, Social Explorer, and Policy Map all cover the United States. If you are interested in China, the China Geo-Explorer II database has mappable census data.
There are also many resources freely available online. My standby is Old Maps Online, which pulls together scanned maps from institutions around the world into a single search screen. Just zoom in to your area of the world, adjust the time slider, and explore!
Contact Susan Powell, GIS & Map Librarian, at smpowell[at]berkeley.edu if you’d like to find out more about how you can use maps — both print and digital — in your research. Or stop by the Earth Sciences & Map Library to explore the collection and find out more!
There are now two official open access (OA) policies at Berkeley:
- Back in July 2013, the Academic Senate adopted the Open Access Policy for the Academic Senate of the University of California to ensure that research articles authored by faculty at all 10 UC campuses are made available free of cost to the general public and researchers worldwide.
- On October 23 of this year, UC issued a Presidential Open Access Policy expanding the reach of the Academic Senate policy by including all UC employees and encouraging them to freely share their research publications worldwide. Among those affected by the expanded policy are clinical faculty, lecturers, staff researchers, postdoctoral scholars, graduate students and librarians.
What does this mean for Graduate Students?
The Academic Senate policy was officially launched on November 17 with the implementation of “harvester” software that sends an automated email message to faculty listing eligible articles which they authored or co-authored; faculty are then prompted to verify (or reject) the articles and instructed on how to post their publications to eScholarship, UC’s OA publishing platform. If you, as a graduate student, co-authored a paper with an Academic Senate faculty member any time after July 2013, that article may be posted to eScholarship, which provides free global access to your publication. Wider dissemination of Berkeley research is not only a public good but also results in greater impact and recognition for researchers. Ask your faculty collaborators if they’ve posted their publications and, if not, offer to help them!
The Presidential Open Access Policy covers graduate student work if the student was an employee of the university at the time their article was published. Until eligible UC employees are folded into the harvester software used for Academic Senate faculty, you are encouraged to post your eligible articles using the eScholarship deposit mechanism. See Deposit your content in eScholarship for more details.
Keep in mind that your articles are automatically covered by the policy; you are not required to amend your author agreement and you do not need to pay any additional article processing charges.
Remind me: What is Open Access?
OA literature is free, digital, and available to anyone online. With OA literature, there is the potential for greater access, thus more readers and greater impact. There are two different approaches to open access: Gold and Green. Gold OA provides immediate access on the publisher’s website. In the Green OA model (also known as “self archiving”) authors continue to publish as they always have in all the same journals; once the article has been published in a traditional journal, the author then posts a “final author version” of the article to a repository. The UC Open Access Policy falls under the Green OA model.
For more information
- Open Access: UC Open Access Policy
- For individual questions, contact email@example.com.
- For in-person assistance come to a Library “upload-a-thon”
- Tuesdays and Wednesdays
- Library Data Lab, 189 Doe
- These drop-in sessions will run from November 17 to December 16 and from January 19 to February 24 (and beyond, if necessary).
Many subject special libraries are also offering “upload-a-thons” (see For more help).
Margaret Phillips, Education-Psychology Library
Contact me at mphillip [at] library.berkeley.edu