Posts by Author: asackmann

GitHub: Archiving and Repositories

GitHub has become ubiquitous in the coding world and, with the advent of data science and computation in a slew of other disciplines, researchers are increasingly turning to this version control repository and hosting service. Google uses it, Microsoft uses it, and it’s on the list of the top 100 most popular sites on Earth. As a librarian and a member of the Research Data Management team, I often get the question: “Can I archive my code in my GitHub repository?” From the research data management perspective, the answer is a little sticky.

The terms “archive” and “repository” mean something very different on GitHub than they do from a research data management perspective. For example, in GitHub, a repository “contains all of the project files…and stores each file’s revision history.” Archiving content on GitHub means that your repository will stay on GitHub until you choose to remove it (or until GitHub receives a DMCA takedown notice, or the content violates their guidelines or terms of service).

For librarians, research data managers, and many funders and publishers, archiving content in a repository means meeting more stringent requirements. For example, Dryad, a commonly known repository, requires those who wish to remove content to go through a lengthy process proving that the work has been infringed or is not in compliance with the law (read more about removing content from Dryad here). Most importantly, Dryad (like many other repositories) takes specific steps to preserve the research materials. For example:
* persistent identification
* fixity checks
* versioning
* multiple copies are kept in a variety of storage sites

A good repository provides persistent access to materials, enables discovery, and, while it cannot guarantee against data loss, takes multiple steps to prevent it.
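For the curious, a fixity check boils down to recording a checksum for each file and recomputing it later to confirm that nothing has changed. Below is a minimal Python sketch of that idea. It is not how Dryad or any other repository actually implements its checks; the function names and the choice of SHA-256 are my own assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(folder, manifest):
    """Return the files that are missing or whose checksum has changed."""
    current = build_manifest(folder)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

In practice you would save the manifest alongside your data when you deposit it (for example, as a small text file) and rerun the check on a schedule. An empty list from `check_fixity` means every recorded file is still present and unchanged.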

So, how can you continue to work efficiently through GitHub and adhere to good archival practices? GitHub links up with Zenodo, a repository based out of CERN. Data files are stored at CERN, with another site in Budapest. All data is backed up daily, with regular fixity and authenticity checks. Zenodo assigns a digital object identifier to your code, making it persistently identifiable and discoverable. Check out this guide on Making Your Code Citable for more information on linking your GitHub with Zenodo. Zenodo isn’t perfect and there are a few limitations, including a max file size of 50 GB. Read more about their policies here.

UC-Berkeley has its own institutional version of GitHub, which means that Berkeley development teams and individual contributors can now have private repositories (and private, shared repositories within the Berkeley domain). If you’d like access, please email github@berkeley.edu. Additionally, we have institutional subscriptions to Overleaf and ShareLaTeX, both of which integrate with GitHub.

Please contact researchdata@berkeley.edu if you’d like more information about archiving your code on GitHub.

Elsevier, Springer Nature, and AAAS: Publisher Research Data Policies

Ever since the Office of Science and Technology Policy introduced a policy addressing the public’s access to data, federal granting agencies, non-profit granting agencies (like the Gates Foundation), publishers, universities, and researchers have been adjusting to reflect changes in access to data at the national level. The policy requires federal agencies with over $100 million in annual research and development expenses to make research results public and provide a plan for doing so.


As a researcher, this is a difficult landscape to navigate for a number of reasons:
  • you may have entered into a research project mid-grant and are unaware of the data management plan that was included in the grant proposal
  • the data management plan that was included in the grant application is not being followed
  • you’re not sure how funder mandates line up with publisher requirements
  • the language that publishers use about data sharing or publishing isn’t straightforward
  • you know that you’re supposed to make your data public, but you don’t know where to do this or how to do this


There are a number of other obstacles that make data publishing difficult, but for today, let’s take a look at the data sharing policies of three publishers in the Engineering and Physical Sciences. Publishers will often use suggestive or idealistic language, but does that mean you’re off the hook for sharing? If your publisher requires that you make your data public, how do you comply with your funder data mandate and your publisher data policy?


Elsevier is a massive publisher that currently publishes over 49,000 journals in Health, Life Sciences, Physical Sciences and Engineering, and Social Sciences and Humanities. They also publish books and major reference works, and somewhat recently acquired Mendeley, the citation management software. Their most recent product, Mendeley Data, is a cloud-based repository for datasets. To sum it up: Elsevier is huge. They’ve divided their research data policy into two parts: Principles (the expectations, “shoulds,” and “needs” underpinning their research data policy) and Policy (what they actually do). Elsevier’s principles are idealistic and sound great, while their policies are merely suggestive.


For example, one of Elsevier’s Data Sharing Principles:
“Research data should be made available free of charge to all researchers wherever possible and with minimal reuse restrictions.”


Policy:
“We will encourage and support researchers and research institutions to share data where appropriate and at the earliest opportunity.”


In their Research data FAQ section they answer the question:
“Is it compulsory to share my research data?”
A: No.


They’ve taken an interesting approach that sets up researchers to share their data (if prepared to do so), without being prescriptive. Elsevier makes it easy to link to datasets in other repositories, and has even started their own repository with Mendeley Data (that’s another blog post for another day). Elsevier has also jumped into the data journal game, with their open access Data in Brief publication. Data publications are emerging as a way for researchers to write an additional article that provides an in-depth description of datasets behind research. This article format provides data, which is typically buried in supplementary material, another avenue for discovery.


Imagine what could happen to the world of data sharing if a research giant like Elsevier made their policies less like principles and required research data sharing instead of suggesting it.


Springer Nature was formed when Springer and the Nature Publishing Group announced a merger in January of 2015. The new publishing giant produces about 13% of the papers in the scholarly publishing market, still behind Elsevier (23%) (Scholarly Kitchen). About a year after the merger, the new publisher developed an approach to research data policies that would allow them to remain flexible across their wide range of journals.


Four different policy types:
  1. data sharing and data citation encouraged
  2. data sharing and evidence of data sharing encouraged
  3. data sharing encouraged and statements of data availability required
  4. data sharing, evidence of data sharing and peer review of data required


The Springer Nature approach allows for flexibility and takes into account the current practices of each discipline the publisher supports. However, prior to submission, you need to know which policy your Springer Nature journal follows (yet another argument for following good data management practices from the start). Let’s take a closer look at each policy.


  • Research Data Policy Type 1 is the most lenient: it encourages data citation and sharing. I like to think of policy 1 as “data sharing lite,” because Springer Nature provides you with information about how to share and cite data, but you don’t necessarily have to. A few titles that fit into this category are: Academic Questions, Accreditation and Quality Assurance, Aesthetic Plastic Surgery, Contemporary Islam, and Journal of Happiness Studies.
  • Research Data Policy Type 2 requires the authors to be more open with their relevant raw data by implying that the data will be available to any researcher who would like to reuse them for non-commercial purposes (barring confidentiality issues). This policy falls somewhere between “optional” and “mandatory.” The publisher is telling readers of its policy 2 journals that this data is freely available for them to reuse, thereby warning, or preparing, the authors that they may be asked for their data. The easiest way to handle requests like this is to make the data publicly available in a repository, with a citation and an assigned digital object identifier. A few examples of type 2 journals include: Agronomy for Sustainable Development, BioEnergy Research, Brain Imaging and Behavior, and Journal of Geovisualization and Spatial Analysis.
  • Research Data Policy Type 3 is geared specifically for journals that publish research on the life sciences. When an author submits to policy 3 journals, they are strongly encouraged to deposit data in repositories. It is implied that all raw data is freely available (again, barring confidentiality issues) to any researcher who requests it. For policies 1 and 2, authors may deposit data in general repositories. However, for policy 3, researchers must deposit specific types of data in a list of prescribed repositories. For example, DNA and RNA sequencing data must be deposited in the NCBI Trace Archive or the NCBI Sequence Read Archive (SRA). A few examples of type 3 journals include: Journal of Hematology and Oncology, Nature Cell Biology, and Nature Chemistry.
  • Research Data Policy Type 4 requires that all of the datasets supporting the paper’s conclusions be available to reviewers and readers. The datasets have to be available in repositories prior to the peer review process (or be made available in supplementary material), and publication is conditional upon the data being in the appropriate repository. Examples of type 4 journals include BMC Biology, Genome Biology, and Retrovirology.


AAAS, the American Association for the Advancement of Science, is much smaller in scope than Springer Nature and Elsevier. AAAS is both a professional society and a reputable publisher of six journals: Science; Science Translational Medicine; Science Signaling; Science Advances; Science Immunology; and Science Robotics. Unlike the other two publishers, AAAS can set tight and strict policies surrounding research data because they publish a small percentage of what the other two produce. Datasets must be deposited in approved repositories with an accession number prior to publication. AAAS encourages compliance with MIBBI (Minimum Information for Biological and Biomedical Investigations) guidelines. AAAS provides a list of approved repositories based on data type (similar to Springer Nature type 4). AAAS stipulates not only that data must be available, but also that all materials necessary to understand and assess the research must be made available. This includes code, patents, and even fossils or rare specimens. Please see AAAS’s publication policies for more information.


These publishers are ordered on a scale from “suggestive” and “encouraging” data policies (Elsevier) to strict mandates for sharing research materials (AAAS). Ultimately, you should prepare your data and supporting research materials, like code, from the beginning of a research project as if you were going to publish in a AAAS journal. There are more reasons to do so than following publisher data sharing mandates, which I’ll explore in future posts.

Virtual Reality for Cal Day

The Kresge Engineering Library will be one of the host sites for VR @ Berkeley, a student group that brings virtual reality to the campus community. By working with industry and UC-Berkeley researchers, VR @ Berkeley makes virtual reality an accessible experience. Each year, members of the group focus on a wide range of projects that explore the intersection between our physical realities and the virtual. Their work spans many applications, including changing the way we read and interact with textbooks, allowing medical workers in the field to communicate with doctors in a more intuitive manner, and a virtual experience of our iconic, 61-bell Campanile.

During Cal Day, the Kresge Engineering Library will be hosting Project Landships, a multiplayer tank combat simulator. Players can work together as a crew to aim, shoot, drive, and spot. The experience emulates a WWII Sherman Firefly Tank.

Check out other VR @ Berkeley Projects on Cal Day at the following locations:
1. Kresge Engineering Library
2. ESS Patio
3. Jacobs Hall
4. Sproul Plaza
5. The House (Bancroft)
6. Moffitt Library

Global Engineering Academic Challenge

It’s time again for the Global Engineering Academic Challenge! Starting today, Monday, October 10th, Elsevier will post a challenge question each Monday for the next 5 Mondays (5 questions total). Complete this interdisciplinary challenge with your instructors and peers by solving problem sets built around 5 transdisciplinary themes, including Future of Energy, Future of Making, and Future of Medicine.

Each week, the participant with the most points will receive a $100 Amazon gift card. The first place grand prize is an Apple iPad and the second place prize is a set of Sonos speakers.

Visit the Engineering Academic Challenge to begin!

DMPTool Updates for August 2016

The crew over at the University of California Curation Center (UC3) and the California Digital Library are working hard to continue to bring big updates to the DMPTool. First off, they’ve added new data management plan templates for the Department of Transportation and NASA. They’re busy working on adding DOD (Department of Defense) and NIJ (National Institute of Justice) templates, but if you’d like another template added, please let them know and send a message here.

Additionally, they’re moving forward with creating machine-actionable DMPs. This means that institutions will be able to better manage their data, DMPs will be data mineable, and researchers will be able to better discover data. Read more about the benefits of machine-actionable DMPs at the DMPTool blog.

New Resource: Corrosion Database

Springer Materials recently announced the launch of their new Corrosion Database. The Corrosion Database lives in Springer Materials and was compiled from various data and literature from the National Institute of Standards and Technology (NIST). The database contains over 24,000 unique records of corrosion rates/ratings and can be searched by material, environment, or both. Results are sorted by corrosion rating so that you can find the most (or least) resistant material for any given application. For example, the database provides data on how seawater corrodes 164 different types of steel, along with the rate of corrosion.

Users can also download citations from the database in .bib, .EndNote, or .ris file formats.

Visit the SpringerMaterials database to begin using the new Corrosion Database.

Data Visualization Workshop: Thursday, July 7th, 12:00 pm

A well-designed figure can have a huge impact on the communication of research results. This workshop will introduce key principles and resources for visualizing data:

  • Choosing when to use a visualization
  • Selecting the best visualization type for your data
  • Choosing design elements that increase clarity and impact
  • Avoiding visualization issues that obscure or distort data
  • Finding tools for generating visualizations

Date: Thursday, July 7

Time: 12:00 – 1:00

Location: Bioscience Library Training Room, 2101 VLSB (inside the library)

Add this workshop to your bCal

Presenters:

  • Anna Sackmann, Science Data and Engineering Librarian
  • Becky Miller, Environmental Sciences and Natural Resources Librarian
  • Elliott Smith, Emerging Technologies Librarian

Open to all; no registration is required. Please forward to interested colleagues.

Questions? Please contact esmith@berkeley.edu

Big Changes for the DMPTool, but first, a little downtime.

During the month of May, project developers for the DMPTool and DMPOnline (the UK’s version) began combining documentation to create the DMPRoadmap. Coming next year, the DMPTool and DMPOnline will merge into one Data Management Plan service that can be used internationally and that combines the best features of the current DMPTool and DMPOnline. You can follow their progress via their GitHub Repository: DMPRoadmap.

Stay tuned for updates. In the meantime, the DMPTool will experience brief downtime for mini-maintenance on Wednesday, June 8, 2016 from 4:00 – 4:30 (PST).

DMPTool Downtime Wednesday May 4th

The DMPTool will be unavailable on Wednesday, May 4th, 2016 from 3:00 – 4:00 (PST). During this period users will not be able to log in or access their work. We apologize for the inconvenience.

For questions about the DMPTool or other data management tools and services available to UC Berkeley researchers, please see our Research Data Management page or contact researchdata@berkeley.edu.

The Materials Project

The Materials Project provides open web-based access to computed information on known and predicted materials as well as powerful analysis tools to inspire and design novel materials. Through computational modeling and supercomputing, the Materials Project allows the user to assess how different atoms and molecules interact with each other. The Materials Explorer is the core tool, or app, through which users can query all of the data in the materials compound database through an interactive Periodic Table of Elements. With 66,140 computed compounds, users can explore a number of material properties including compound formation energy, stability, bandgap, density, volume, and more. This app, along with seven others (including the crystal toolkit, structure predictor, and the battery explorer), allows researchers to compute the properties of compounds before materials are synthesized in a lab, which saves money, time, and guesswork.

The Materials Project was founded by two current UC-Berkeley Materials Science and Engineering professors, Dr. Kristin Persson and Dr. Gerbrand Ceder. The Project is supported by the US Department of Energy, Lawrence Berkeley National Lab, MIT, and the Battery Materials Research Program. For more information on collaborators, visit About the Materials Project.
