Are original x-ray diffraction data available


Is it customary for investigators to publish the original x-ray diffraction data used in macromolecular structural determination? If not, why not; and if so, is there an online database where these data may be downloaded?


In many cases they are available. One of the founding principles of the Protein Data Bank (PDB) was to store not only the models (atomic positions and identities) of macromolecules and proteins, but also the originating X-ray data, more recently as structure factors.

If the question is 'why are they giving only the structure factors and not the original data they took', the answer is that such a task would require a lot of curation effort for very little scientific benefit. Scaling the individual sets of data from the detector used to be a bear of a task. The original myoglobin structure would entail scanned films in an ancient format from the 1950s. Nobody would be able to use that now without hacking the image format, if it wasn't on paper tape or cards. In fact, in that case the structure factors are not available. Scaling several data collections together was often done with custom tweaks into the 1970s; in the 1990s data collection became more routine, but several generations of X-ray detectors became popular and then faded from the market. Each had its own eccentricities and requirements for combining data sets from multiple reads.

The purpose of having structure factors available is to allow anyone to reconstruct the electron density and evaluate the interpretive act which is tracing a peptide through electron density. Since that format is mainly independent of the detector and has been fairly consistent over the years, it delivers a significant amount of scientific bang for the buck.

If you want raw image data or detector-proprietary data from before multiple data sets from different crystals were combined, you will have to contact the authors, who would probably have to sift through a sea of DVDs to get to it. In older cases it might be tapes.

As for structure factors, whose amplitudes are essentially the square roots of the combined and scaled intensity data, they are available and are part of every submission to the PDB:

Look on any X-ray Structure Page at RCSB. For example this one.

There is a box called "Experimental Details" and you can download the structure factors there by clicking a link.
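The same download can be scripted. The sketch below is only an illustration: the URL pattern `https://files.rcsb.org/download/<ID>-sf.cif` and the function names are assumptions that are not given in the text above, so check them against the RCSB download documentation before relying on them. The entry IDs are the cisplatin-lysozyme entries mentioned later on this page.

```python
from urllib.request import urlretrieve


def structure_factor_url(pdb_id):
    """Build a download URL for the structure factor file of one PDB entry.

    Assumption: structure factor files are served under
    https://files.rcsb.org/download/<ID>-sf.cif
    """
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}-sf.cif"


def download_structure_factors(pdb_id, destination):
    """Fetch the structure factor file to a local path (needs network access)."""
    urlretrieve(structure_factor_url(pdb_id), destination)


# Print the URLs only; call download_structure_factors() to actually fetch.
for entry in ("4g4a", "4xan", "4mwk"):
    print(structure_factor_url(entry))
```

Entries deposited without structure factors would presumably return an error for this URL, so a real script should handle failed requests.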

If you are looking for more than one at a time, there are bulk downloads available through their download page. Check the "Structure Factors" box. Raw intensity data should be available if you look around for it as well.
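The square-root relationship between scaled intensities and structure factor amplitudes mentioned above can be sketched in a few lines. This is a simplified illustration, not what any particular data reduction program does: the function name and the sample reflections are made up, the error estimate is plain first-order propagation, and non-positive intensities are simply set to zero, whereas real software treats weak and negative intensities statistically (for example with the French-Wilson procedure).

```python
import math


def intensity_to_amplitude(intensity, sigma_i):
    """Return (|F|, sigma_F) for one merged, scaled intensity.

    Simplification (assumption): a non-positive intensity gives |F| = 0
    and no error estimate, instead of a proper statistical treatment.
    """
    if intensity <= 0.0:
        return 0.0, None
    amplitude = math.sqrt(intensity)
    # First-order error propagation for F = sqrt(I): sigma_F = sigma_I / (2 F)
    sigma_f = sigma_i / (2.0 * amplitude)
    return amplitude, sigma_f


# Hypothetical reflections: (hkl, intensity, sigma of intensity)
reflections = [((1, 0, 0), 400.0, 20.0),
               ((0, 2, 1), 25.0, 5.0),
               ((3, 1, 1), -2.0, 4.0)]

for hkl, i_obs, sig_i in reflections:
    print(hkl, intensity_to_amplitude(i_obs, sig_i))
```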

Further Suggestion: I was thinking that if you look at the scaling software that comes with the X-ray detectors you might find some tutorials with unscaled raw data. I found one example at Marresearch - hen egg white lysozyme.


A public database of macromolecular diffraction experiments

Each dot forms from the constructive interference of X-rays passing through a crystal. The data can be used to examine the crystal's structure. Credit: M. Grabowski et al.

The reproducibility of published experimental results has recently attracted attention in many different scientific fields. The lack of availability of original primary scientific data represents a major factor contributing to reproducibility problems; however, the structural biology community has taken significant steps towards making experimental data available.

Macromolecular X-ray crystallography has led the way in requiring the public dissemination of atomic coordinates and a wealth of experimental data via the Protein Data Bank (PDB) and similar projects, making the field one of the most reproducible in the biological sciences.

The IUCr commissioned the Diffraction Data Deposition Working Group (DDDWG) in 2011 to examine the benefits and feasibility of archiving raw diffraction images in crystallography. The 2011-2014 DDDWG triennial report made several key recommendations regarding the preservation of raw diffraction data. However, there remains no mandate for public disclosure of the original diffraction data.

The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) is part of the Big Data to Knowledge programme of the National Institutes of Health and has been developed to archive raw data from diffraction experiments and, equally importantly, to provide related metadata. The database [Grabowski et al. (2016). Acta Cryst. D72, 1181-1193, DOI: 10.1107/S2059798316014716] contains, at the time of writing, 3070 macromolecular diffraction experiments (5983 datasets) and their corresponding partially curated metadata, accounting for around 3% of all depositions in the Protein Data Bank. The resource is accessible at http://www.proteindiffraction.org and can be searched using various criteria via a simple, streamlined interface. All data are available for unrestricted access and download. The resource serves as a proof of concept and demonstrates the feasibility of archiving raw diffraction data and associated metadata from X-ray crystallographic studies of biological macromolecules.

Talking to a reporter about the project, team leader Wladek Minor said, "There is so much research underway that it can't all be published, and often the results of unsuccessful studies don't appear in the literature. I think the key to success is to know about unsuccessful experiments, we want to know why they fail".

The goal of the project is to expand the IRRMC and include data sets that failed to yield X-ray structures. This could facilitate collaborative efforts to improve protein structure-determination methods and also ensure the availability of "orphan" data left behind by individual investigators and/or extinct structural genomics projects.


X-ray Powder Diffraction (XRD) Instrumentation - How Does It Work?

The geometry of an X-ray diffractometer is such that the sample rotates in the path of the collimated X-ray beam at an angle θ while the X-ray detector is mounted on an arm to collect the diffracted X-rays and rotates at an angle of 2θ. The instrument used to maintain the angle and rotate the sample is termed a goniometer. For typical powder patterns, data are collected at 2θ from 5° to 70°, angles that are preset in the X-ray scan.
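To give a feel for what that scan range covers, the detector angle 2θ can be converted to a lattice spacing d with Bragg's law, λ = 2d sin θ (first order). The sketch below assumes Cu Kα radiation with λ = 1.5406 Å, a common laboratory source; the text above does not state which wavelength is used, and the function name is illustrative.

```python
import math

# Assumed wavelength (Cu K-alpha1) in angstroms; not stated in the text.
CU_K_ALPHA_ANGSTROM = 1.5406


def d_spacing(two_theta_deg, wavelength=CU_K_ALPHA_ANGSTROM):
    """Lattice spacing d (angstroms) from the detector angle 2-theta (degrees).

    Uses first-order Bragg's law: wavelength = 2 * d * sin(theta).
    """
    theta = math.radians(two_theta_deg / 2.0)
    return wavelength / (2.0 * math.sin(theta))


for two_theta in (5.0, 70.0):
    print(f"2theta = {two_theta:4.1f} deg -> d = {d_spacing(two_theta):.2f} A")
```

Under these assumptions, the 5° to 70° range corresponds to spacings from roughly 17.7 Å down to about 1.3 Å.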


Recent years have seen a growth in interest in retaining raw diffraction data sets collected for the determination of crystal and molecular structures. This interest has arisen spontaneously within the crystallographic community on a number of fronts. For example, raw data sets are valuable for developing new methods of structure determination and for benchmarking of software algorithms (Terwilliger & Bricogne, 2014); they are sometimes important for validating the interpretation of structural features; and increasingly they repay closer study, whether for allowing data analysis at higher resolution than used in the original work, understanding the multiple lattices present in a crystal, or deducing details of correlated motions or disorder from the diffuse scattering that is largely ignored in determining Bragg peak positions and characteristics.

In parallel, the evolution of science policy in the wider world is prompting closer scrutiny of the whole practice of research data management, and there are a growing number of mandates to retain the raw data underpinning any experimental study and to make it available to other researchers. By early 2016, all UK scientific research councils had stated positions on data management, access and long-term curation (Digital Curation Centre, 2016; Research Councils UK, 2015). A useful summary of US Federal Funding Agency requirements for scientific data management is hosted by Northwestern University Library (2016). A noteworthy recent proposal calls for a European Open Science Cloud for Research (Jones, 2015).

Different communities have different ideas of what data they value most – and, indeed, of what constitutes `data'. The USA's National Science Foundation (NSF) makes this explicit in its published `Frequently Asked Questions' (National Science Foundation, 2010):

1. What constitutes `data' covered by a Data Management Plan?

What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models.

In consequence, there is great variety amongst different scientific disciplines in their approaches to data management and retention, and therefore in the availability of public repositories and in the software tools to manage deposition, access and reuse. Nevertheless, two themes recur in the various published mandates and best-practice guidelines: the importance of persistent identifiers for data sets, and the vital need to characterize them as fully as possible by appropriate metadata.

Crystallography is generally regarded as a science that has its house in good order regarding data management, validation, access and reuse. This is largely true so far as `derived' data (by which we mean atomic positional coordinates and displacement parameters resulting from structure determinations) and associated publications are concerned. It is more debatable where processed diffraction data are concerned – the post-experiment processed data (typically structure factors) that form the basis of the atomic and molecular structure determination and subsequent refinement leading to a structural model. Some journals require deposition of structure factors in support of any publication, and the Protein Data Bank (PDB; Berman et al., 2000) requires structure factors to be deposited along with the atomic coordinates. However, these are usually the final set of structure factors used in refinement, and may lack information discarded when merging symmetry-related diffraction peaks, or excluded for other reasons from early cycles of refinement. The PDB will accept unmerged processed intensity data, and there are community recommendations encouraging their deposition (International Structural Genomics Organization, 2001), but the practice is not yet universal in macromolecular crystallography. For small-unit-cell crystal structures, even journals that accept structure factors have not hitherto required unmerged intensities. However, there is growing recognition that they are important, both for further development of the checkCIF validation carried out during the peer review process, and indeed to encourage future researchers to revisit and re-evaluate the published results, perhaps when new ideas or tools become available (A. Linden, personal communication).

However, historically there has not been a tradition of retaining the raw X-ray diffraction images collected by electronic detectors, although centralized neutron facilities have long-standing traditions of raw data preservation. In recent years the practices nurtured by the neutron facilities have been spreading: each type of large-scale centralized instrumental facility (synchrotrons and latterly free-electron lasers, as well as neutron reactors) has begun to move towards raw data preservation. This trend has been encouraged by rapidly improving electronic data-handling procedures.

In 2011, the International Union of Crystallography (IUCr) established a Working Group to explore the merits and challenges of retaining the initial experimental data. This group, the Diffraction Data Deposition Working Group (DDDWG), has conducted a number of consultations, discussion meetings and workshops to explore the topic. A set of papers published in Acta Crystallographica Section D (Terwilliger, 2014) provided an overview of the reasons for archiving raw data in the field of macromolecular crystallography, models for doing so on a routine or large-scale basis, current practical initiatives, and the potential benefits for improving macromolecular structure models.

These papers also highlighted the importance of assigning persistent identifiers to data sets to facilitate their management and long-term curation, and to ensure that each data set was characterized by rich metadata, both to facilitate discovery and to allow effective scientific reuse (Guss & McMahon, 2014; Kroon-Batenburg & Helliwell, 2014).

In the remainder of this Introduction, we introduce a recent workshop that concentrated on metadata in crystallographic and related experiments; we review the arguments for depositing raw data as a routine practice; and we place these activities in the context of global science policy initiatives. The paper then looks in more detail at the current and evolving mechanisms for the deposition of raw experimental data (especially X-ray diffraction images); at detailed requirements for metadata that describe archived data sets, in order to ensure the reproducibility of the derived scientific results; and at the next steps forward.

1.2. Improving the metadata

To focus on the metadata issues, the DDDWG conducted a two-day workshop at Rovinj, Croatia, in August 2015. A complete record of the workshop is maintained online at http://www.iucr.org/resources/data/dddwg/rovinj-workshop and a number of articles arising from the meeting are in preparation. We detail here some specific outcomes from the workshop.

1.2.1. Efforts of the IUCr Commissions

The IUCr manages its scientific mission through a number of Commissions, each responsible for a particular topic area within crystallography. The DDDWG has requested each Commission to consider its own needs for defining metadata for raw experimental data within its field. Among those that have been most active in responding to this request are the Commission on XAFS (Ravel et al., 2012), the Commission on Small-Angle Scattering (Jacques et al., 2012), the Commission on High Pressure (Fig. 1) and the Commission on Biological Macromolecules (e.g. Gutmanas et al., 2013).


Figure 1
Montage of slides from Kamil Dziubek's presentation at the Rovinj workshop, illustrating aspects of diffraction experiments under high pressure and other non-ambient conditions that need to be well characterized and recorded. (Graphics courtesy of Ronald Miletich-Pawliczek, University of Vienna.)

The International Centre for Diffraction Data (ICDD; Pennsylvania, USA; http://www.icdd.com) has been active in the harnessing of raw powder diffraction data sets for some time and reported to us at ECM29 in Rovinj (August 2015) that they have now incorporated over 10� raw powder diffraction data sets into the Powder Diffraction File. They note that one-dimensional data sets are generally reasonably well characterized in terms of the experimental metadata catalogued in the powder CIF (pdCIF) dictionary (Toby, 2005), but that interpretation of two-dimensional diffraction images is hampered by a lack of consistency in reporting such characteristics as goniometer axes, detector dark current, distortion and other corrections (T. Fawcett, personal communication; see also Section 1.2.2). The Commission on Powder Diffraction is planning further work on neutron powder diffraction raw data and will liaise with the Commission on Neutron Scattering as appropriate. The Commission on Structural Chemistry has had enthusiastic participants in events convened by the DDDWG in Madrid, Bergen and Rovinj.

1.2.2. Characterizing X-ray diffraction images

The class of experimental data sets that most closely fits the original remit of the DDDWG is X-ray diffraction images collected from CCD or pixel detectors. A good catalogue of the metadata needed, in general, to interpret a raw image data file was given by Kroon-Batenburg & Helliwell (2014). Many of the individual items required are defined in the imgCIF dictionary (Bernstein, 2005), and there have been partial implementations of some of them in so-called `mini-CBF' headers of image files written by a number of commercial detector systems. However, this has not been done in a consistent way between vendors nor even across the entire product range of individual vendors. (CBF, the crystallographic binary file, and imgCIF, its pure ASCII counterpart, are equivalent implementations of the CIF ontology for diffraction images.)
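To make the consistency problem concrete, the following sketch parses a header written as `# Key value` comment lines, which is the general style of mini-CBF headers, and reports which items a reprocessing job might need are absent. Everything specific here is an assumption made for illustration: the sample header values are invented, the list of required keys is a hypothetical checklist rather than anything defined by imgCIF, and the function names are not taken from any existing library.

```python
import re

# Illustrative header text, loosely modelled on the '# Key value' comment
# lines of mini-CBF image headers; all values are invented.
SAMPLE_HEADER = """\
# Detector: EXAMPLE DETECTOR
# Pixel_size 172e-6 m x 172e-6 m
# Exposure_time 0.1000 s
# Wavelength 0.9795 A
# Detector_distance 0.2500 m
# Beam_xy (1231.50, 1263.50) pixels
# Start_angle 0.0000 deg.
# Angle_increment 0.1000 deg.
"""

# Hypothetical checklist of items needed to reprocess an image.
REQUIRED_KEYS = ("Wavelength", "Detector_distance", "Beam_xy",
                 "Start_angle", "Angle_increment")

NUMBER = re.compile(r"[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?")


def parse_header(text):
    """Map each '# Key value' line to {key: raw value string}."""
    metadata = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        body = line.lstrip("#").strip()
        if not body:
            continue
        # The key ends at the first colon or run of whitespace.
        parts = re.split(r"[:\s]+", body, maxsplit=1)
        metadata[parts[0]] = parts[1] if len(parts) > 1 else ""
    return metadata


def numeric_values(raw_value):
    """Extract every number from a raw value string, ignoring units."""
    return [float(match) for match in NUMBER.findall(raw_value)]


def missing_keys(metadata):
    """List the checklist items that the header does not provide."""
    return [key for key in REQUIRED_KEYS if key not in metadata]


header = parse_header(SAMPLE_HEADER)
print("wavelength:", numeric_values(header["Wavelength"]))
print("beam centre:", numeric_values(header["Beam_xy"]))
print("missing:", missing_keys(header))
```

A check of this kind only works if vendors agree on key names and units, which is the gap the paragraph above describes.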

Increasingly, images are being stored using the HDF5/NeXus data format (Könnecke et al., 2015), and although the physical format of the data file should not affect its ability to store specific structured information (Hester, 2016), some effort will be needed to ensure that the CIF and NeXus data representations are equally capable of storing the appropriate experimental metadata. Significant effort to achieve this at the technical level has already been invested following participation in an earlier workshop by representatives of COMCIFS (Committee for the Maintenance of the CIF Standard) and NIAC (NeXus International Advisory Committee), the bodies responsible for managing the CIF and NeXus data formats, respectively (Bernstein et al., 2013). Nevertheless, presentations at the Rovinj Workshop by Kroon-Batenburg (https://youtu.be/XXFDlNn21SY) and by Minor (https://youtu.be/eQbs9sB_pOM) emphasized that there is still a long way to go before the myriad different formats generated by commercial electronic position-sensitive detectors do contain the necessary common metadata to allow for easy interpretation and management (see further discussion in Section 3.2).

The arrival of the new Dectris Eiger pixel detector, with its colossal increase in diffraction image data rates, has highlighted the importance of efficient data format and metadata recording, not only for diffraction data processing on a synchrotron or X-ray laser beamline, but also for subsequent processing outside the facility, and ultimately for reprocessing/reanalysis from a raw data archive as may be needed. The various issues have been highlighted in detail in a discussion thread on the CCP4bb mailing list in early March 2016 (involving, amongst others, G. Winter, A. Förster, H. J. Bernstein, C. Vonrhein and G. Bricogne).

1.3. The case for raw data deposition

We summarize the case for routine storage and retrieval of raw data to emphasize its potential value to the community. At the same time we acknowledge the cost and other practical constraints of storing all collected data sets indefinitely, and we are unable to give a definitive indication of where the balance might lie between archiving and discarding raw data. However, we show in Section 1.4 that there are discernible trends towards storing more data sets than we might have expected in the early work of the DDDWG.

There is a broad philosophical view of the importance of access to raw diffraction data, namely that science requires the ability to conduct a comprehensive analysis through one's own eyes and not the lens of someone else. Raw diffraction images offer several opportunities for improved or novel science. They permit the analysis of data at higher resolution than used in the original work [allowing comparisons not only among data processing software (Tanley et al., 2013), but also in the effectiveness of structure determination and refinement with ever weaker data beyond normal limits]. Raw data sets can serve as benchmarks in developing improved methods of analysis. They allow checking of the interpretation of the symmetries of the crystals, and detailed analysis of diffraction from multiple lattices present in the crystals. More generally, they promote the study of the diffuse scattering that reflects correlated motions or disorder of atoms in the crystals, namely the `structural dynamics'.

The retention of raw data can be seen as complementing the extensive archives of derived data (i.e. cell parameters, molecular coordinates, anisotropic displacement parameters) and processed data (structure factors, Rietveld refinement profiles) in the crystallographic databases. The contributions of the former are very well understood: they form part of the scientific record, they lead to database-driven discovery, e.g. in understanding protein–ligand interactions, they lead to new pathways to synthesis, improvements in manufacturing and better understanding of energetics, and they have use in identification and indexing applications (e.g. in forensic science).

Until the advent of CIF and the automated structure validation checks with the checkCIF suite (Strickland et al., 2005) that it enabled, many structures were published which required subsequent correction. Often, the interpretation of the results produced molecular structures that were broadly correct, but overlooked higher lattice symmetries. Such examples were best detected and corrected through access to the deposited structure factors (well illustrated by Marsh et al., 2002).

So, broadly speaking, structure validation (the credibility of a structural model, both in its adherence to norms of geometric configuration and its derivation from X-ray diffraction images) can be carried out with reference to the derived data sets (the structural coordinates) and the structure factors alone, and this has been the practice in various crystallography journals for a considerable length of time. However, the availability of the raw data (i.e. original diffraction images) can enhance structure validation in the following ways:

(i) The structure can be re-refined, perhaps making use of diffraction peaks that were excluded because the processed diffraction data were truncated at an arbitrary resolution limit. Retention of the original data also permits re-evaluation of the space-group symmetry, which is normally settled upon during an early stage of conventional refinement.

(ii) Data reduction is often performed according to established protocols, but retention of the original images allows the opportunity to test those protocols, especially if there is any suspicion of systematic bias. Indeed, statistical analysis of a collection of stored raw images may allow the detection of systematic biases that are not at all apparent in individual experiments. Further, the availability of large collections of raw data sets allows periodic recalibration of solution methods and the development of new methods to tackle data sets that have previously been resistant to conventional solution.

(iii) Attention to diffuse scattering between the diffraction spots allows insight into correlated motions or disorder of atoms in crystals. This might involve quasicrystalline behaviour, determination of incommensurate modulation or multi-phase representation, macromolecular motions or conformational changes etc.
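Point (i) can be illustrated with a small sketch that shows how a resolution cutoff discards reflections that raw images would still contain. It uses the standard relation for an orthorhombic cell, 1/d² = h²/a² + k²/b² + l²/c²; the cell dimensions, the reflection list, the cutoff and the function names are all hypothetical values chosen for illustration.

```python
import math


def d_spacing(hkl, cell):
    """Resolution d (angstroms) of reflection hkl for an orthorhombic cell.

    Uses 1/d^2 = h^2/a^2 + k^2/b^2 + l^2/c^2, valid only when all cell
    angles are 90 degrees.
    """
    inv_d_squared = sum((index / edge) ** 2 for index, edge in zip(hkl, cell))
    return 1.0 / math.sqrt(inv_d_squared)


def split_by_resolution(reflections, cell, d_min):
    """Separate reflections kept (d >= d_min) from those a cutoff discards."""
    kept, excluded = [], []
    for hkl in reflections:
        (kept if d_spacing(hkl, cell) >= d_min else excluded).append(hkl)
    return kept, excluded


# Hypothetical orthorhombic cell (a, b, c in angstroms) and reflections.
cell = (30.0, 40.0, 50.0)
reflections = [(1, 0, 0), (0, 0, 25), (20, 0, 0), (0, 0, 40)]

kept, excluded = split_by_resolution(reflections, cell, d_min=1.8)
print("kept:", kept)
print("excluded by the 1.8 A cutoff:", excluded)
```

With the processed data alone, the excluded reflections are gone; with the raw images, the cutoff can be revisited and the data reprocessed to a different limit.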

Note that these benefits may not be apparent for every structure, and the cost–benefit calculus informing policies of routine deposition has still to be determined by the community and funding bodies (Guss & McMahon, 2014). It may be that there are different entry points where the potential benefits can be most readily realised, e.g. by making available the experimental data for `difficult structures' that have proved impossible to refine satisfactorily.

However, more-or-less routine deposition of primary data would help to improve the quality and reliability of the scientific record (Minor et al., 2016). It would allow closer scrutiny of scientific deductions by peer reviewers prior to publication; it would allow for revisiting and revising structural models already in the databases, as new techniques are developed – e.g. the notion of `continuous improvement of macromolecular structure models' (Terwilliger, 2012); it would allow reanalysis of a structure or series of structures independent of an author's interpretational bias (B. D. Bax, personal communication); and it would provide the experimental evidence needed to support any claims made by the publishing author. In this last role, it helps to guard against the use of the wrong data set, either through error or deliberate intention.

1.4. Deposition imperatives and opportunities

As previously mentioned, there have been developments since the DDDWG was established in the climate for data deposition and sharing, both in the wider scientific world and in the field of crystallography and related structural sciences. The benefits of open data (i.e. collecting research data arising from publicly funded scientific research and making it available for reuse without charge to the end user) have been reiterated in recent years in international, governmental and scientific policy discussions and practical initiatives. Among a few portal web sites of note are the United Nations data portal (UNdata: http://data.un.org), the US Government open data site (https://www.data.gov) and the federated `Global Science Gateway' (http://worldwidescience.org). Calls for implementation include `The Good Growth Plan', a collaboration for agricultural development involving the UK Open Data Institute (ODI; https://theodi.org) and Syngenta; the European Open Science Cloud (EOSC), a European Union strategy for linking research networks, data storage facilities and computing resources across the continent (Jones, 2015; Fig. 2); and an Open Data Accord (Science International, 2015) launched by the International Council for Science (ICSU), the InterAcademy Partnership (IAP), The World Academy of Sciences (TWAS) and the International Social Science Council (ISSC).


Figure 2
A graphic linking data publishing and management workflow to EU research infrastructural components. Part of a presentation introducing the European Open Science Cloud for Research (illustration courtesy of Natalia Manova for the European OpenAIRE project).

Although these various initiatives are very diverse in their objectives, collectively they are raising the perceived importance of data repositories to research funders, to researchers who are encouraged or in some cases mandated to deposit their data in robust and durable repositories, and to other researchers who are becoming increasingly aware of the availability of other data sets and their potential usefulness to their own work. A gradual change in cultural attitudes to research data is taking place.

Since the DDDWG was established in 2011, there have been a number of developments, some catalysed by these high-level initiatives, that have increased the options for deposition of diffraction images:

(i) The number and scope of university data repositories has expanded.

(ii) The European Synchrotron Radiation Facility (ESRF; Grenoble, France) has launched a Data Archive, in which every raw data set measured can be associated with a registered DOI.

(iii) The Zenodo science data archive, hosted on the extremely high capacity CERN storage system, has gathered momentum.

(iv) A repository for diffraction experiments used to determine protein structures has been established as part of the US National Institutes of Health's BD2K (Big Data to Knowledge) programme (Grabowski et al., 2016); it is run by Wladek Minor's group at the University of Virginia, USA (http://www.proteindiffraction.org/).

(v) The Structural Biology Data Grid (SBDG) has been established as a diffraction data publication and dissemination system for structural biology (Meyer et al., 2016).

(vi) The Protein Data Bank (PDB) now requests the DOI (digital object identifier) of the raw data, together with metadata describing the raw data, during a deposition (Fig. 3).


Figure 3
Online form allowing PDB depositors to link experimental data sets and their associated metadata with a deposited macromolecular structure.

(vii) IUCrData (an IUCr data service, initially handling derived data sets) has been launched.

Some of these are described in more detail in Section 2.2.

2. Mechanisms for raw diffraction data preservation

We review some of the de facto repositories that are currently hosting, and in many cases providing access to, experimental data sets in our domain.

2.1. Institutional data repositories. Case study: University of Manchester

The meticulous approach of the University of Manchester makes one of us (JRH) feel very fortunate to be working in this research environment. In researching the binding of the anti-cancer agent cisplatin to histidine [which has received intense interest; see, for example, Messori & Merlino (2016)], JRH's research group made the raw diffraction data open access at the University of Manchester institutional data repository. Fig. 4 illustrates the data access record within the Library system, while Fig. 5 illustrates the classification-level metadata required by such a repository. This type of institutional cataloguing and archive is increasingly characteristic of modern data archive initiatives. In addition, we have followed the standard community data deposition requirements of depositing coordinates and processed diffraction data at the Protein Data Bank. To permit the widest possible access to our work, we have also been able, via the EPSRC funding we have had, to publish the bulk of our articles reporting our results as `gold' open access (i.e. the full peer-reviewed articles of record can be accessed without a journal subscription) in Acta Crystallographica Sections D and F.


Figure 4
Manchester University Library access record for experimental data sets associated with a published research article. Links are provided to the published article in the `Related resources' column.

Figure 5
Classification-level metadata associated with experimental data sets archived at the University of Manchester Data Library. These identify the archived data sets and provide links to related resources.

In becoming pioneers of making both our raw diffraction data and our data and model interpretations fully open (Table 1), thus achieving a rare breadth and depth of openness within a focused research theme, our research has received a gratifying amount of detailed interest. There have been many downloads of these raw data, both from their original web location at Utrecht University and subsequently from the University of Manchester. The yearly download totals from Utrecht were 17 GB in 2012, 47 GB in 2013, 57.69 GB in 2014 and 31.47 GB in 2015; equivalent download information is not available from the University of Manchester. One such raw data download featured in a new publication (Shabalin et al., 2015), a wide-ranging critique of the whole field of cisplatin binding to various proteins. This article suggested improvements to three of our cisplatin–lysozyme models in the PDB via three of their own alternative interpretations; two of these involved use of our processed diffraction data held at the PDB (4xan and 4mwk) and one involved use of our raw data (4g4a in Table 1 and Fig. 4). We have accepted some of their recommendations and rejected others (Tanley et al., 2016). Some of these points of `data debate' also suggest a lack of mature community standards, even within one journal (Tanley et al., 2015), but they also show a way forward for discussions to be conducted, e.g. within IUCr journals. In other respects, the debate shows the benefits of the continuing pursuit of improved methods of analysis and a better understanding of the role of weak data in improving protein model refinements (Diederichs & Karplus, 2013), which we harnessed in detail in Tanley et al. (2016). Such improvements have arisen even in just the last few years, and illustrate the `young age' of macromolecular crystallography, a field that is still clearly maturing as a technique.

Table 1
A thematic raw data collection as an example: the suite of research studies, relating to platins binding to histidine, held at the University of Manchester Data Library

2.2. General data repositories for structural biology

The importance of data capture and archiving has been widely recognized around the world and several repositories are now available where nearly any researcher can, or will soon be able to, deposit their raw data and associated metadata for anyone in the world to view and download, subject of course to the natural constraints of file size and network bandwidth.

Two major publicly funded repositories are the Integrated Resource for Reproducibility in Macromolecular Crystallography (http://www.proteindiffraction.org) and the Zenodo repository (https://zenodo.org) for general scientific data. The former has been developed by the Minor group at the University of Virginia (http://olenka.med.virginia.edu/CrystUVa) and is supported by the US National Institutes of Health Big Data to Knowledge Initiative (https://datascience.nih.gov/bd2k). Zenodo has been developed by CERN (http://www.cern.ch) as part of the European Union OpenAIREplus initiative (http://www.openaire.eu).

Two additional private repositories are available for general use. The Harvard-based SBGrid organization (https://sbgrid.org) has developed a Structural Biology Data Grid (https://data.sbgrid.org) that can be used by any member of SBGrid to archive raw data and metadata. The ResearchGate scientific networking site (https://www.researchgate.net) allows researchers to share data (https://www.researchgate.net/blog/post/present-all-your-research-in-a-click).

2.2.1. The Integrated Resource for Reproducibility in Macromolecular Crystallography

The Integrated Resource for Reproducibility in Macromolecular Crystallography (Grabowski et al., 2016) is a protein diffraction database that addresses the need for archival of crystallographic raw images, as outlined in the discussion above and in the group of articles published recently in Acta Cryst. D (Terwilliger, 2014). This database currently includes over 2900 raw crystallographic data sets and associated metadata. Most of these are linked with a deposit in the Protein Data Bank (http://www.pdb.org; Berman et al., 2000) and many of them represent work from structural genomics projects (http://csgid.org, http://ssgcid.org, http://www.jcsg.org, http://mcsg.anl.gov, http://thesgc.org). The database is highly structured, with crystallographic metadata associated with each data set. A very useful feature of this service is that the web interface to the database shows a representative diffraction image from each data set, allowing a researcher to note quickly the characteristics of the diffraction from the crystals used in each data set, for example the order in the diffraction pattern, the presence of diffuse scattering and the extent of anisotropy in the diffraction pattern. The database can be searched by PDB ID, resolution of diffraction, the location where data were collected, authors and many other characteristics. It is planned for the database to be available for deposits and downloads by anyone. Every entry in the database has an assigned DOI that can be used to refer to the data and that provides a stable permanent link to the data, and the data deposited are not limited in file size. The metadata associated with the raw data are an integral part of the database, so that it may be practicable in the future to reprocess automatically much of the raw data in the database as new algorithms for data analysis become available (cf. Terwilliger & Bricogne, 2014).

2.2.2. Zenodo

The Zenodo archive is a general scientific archive developed by researchers at CERN as part of a European Union Framework 7 initiative. It provides a repository for scientific data sets in any field and has the unique feature that, as part of CERN, it has access to exceptional capacity for data storage and archiving. Though it is supported by the EU, researchers from anywhere in the world can archive their data and anyone can access the data. The Zenodo archive is designed to provide a resource for the many small scientific projects in the world that do not have an easy way to make their data available to the scientific community and, unlike the other databases discussed here, plans to charge a fee for larger-scale users. The archive currently has over 2500 data sets from all fields of science. Data sets can have multiple files, normally up to a total size limit of 50 GB; individual files can be a maximum of 2 GB in size. Each data set is assigned a DOI for permanent archiving and discovery, and is linked with metadata provided by the researcher.
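The size limits quoted above lend themselves to a simple pre-deposit check. The following is a minimal sketch only: the function name `check_deposit`, the file names and the use of decimal gigabytes are assumptions made for illustration, and the limits are those stated in the text rather than values taken from any repository interface.

```python
GB = 10**9  # decimal gigabyte; whether the quoted limits are decimal or binary is an assumption


def check_deposit(file_sizes, max_total=50 * GB, max_file=2 * GB):
    """Return a list of problems; an empty list means the data set fits the limits.

    file_sizes maps a file name to its size in bytes (hypothetical input layout).
    """
    problems = []
    for name, size in sorted(file_sizes.items()):
        if size > max_file:
            problems.append(f"{name}: {size / GB:.2f} GB exceeds the per-file limit")
    total = sum(file_sizes.values())
    if total > max_total:
        problems.append(f"total {total / GB:.2f} GB exceeds the data-set limit")
    return problems


if __name__ == "__main__":
    # A hypothetical data set with one oversized file
    print(check_deposit({"images_0001.tar": 1 * GB, "images_0002.tar": 3 * GB}))
```

A check of this kind makes it clear when a data set would have to be split into several archives before it could be deposited under such limits.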

2.2.3. Structural Biology Data Grid

The SBGrid organization provides access for researchers at many structural biology laboratories around the world to a packaged set of software that can be used in many areas of structural biology, including X-ray crystallography, cryo-electron microscopy, electron diffraction, small-angle scattering and other areas. SBGrid also provides access to cloud-based computing resources that carry out structural biology calculations. The Structural Biology Data Grid is a service recently started by SBGrid that allows any SBGrid researcher to archive raw data from any of the SBGrid structural biology areas. This database currently has over 240 data sets from 62 different institutions. The data can be viewed by anyone and crystallographic data sets can be downloaded by anyone, with cut-and-paste scripts for easy downloading of individual data sets. Each data entry has a unique DOI assigned, there are no limitations on file sizes, and metadata describing how to analyse the data are provided.

2.2.4. ResearchGate

ResearchGate is a commercial scientific social networking service that provides a simple mechanism for researchers to post their scientific papers and information about themselves, and for researchers to communicate about and discuss scientific topics. ResearchGate additionally allows researchers to archive scientific data sets for anyone to download. The data sets are assigned a DOI, and the size of individual files is limited.

2.3. Synchrotron, neutron and X-ray laser facility options

There are now several striking examples of current and evolving practice in data capture and management across a range of large-scale facilities accommodating a variety of techniques and sciences. Among those we are aware of are the Australian Synchrotron (Clayton, Victoria, Australia), the ESRF, the Institut Laue–Langevin (ILL, Grenoble, France), the Diamond Light Source (Didcot, UK) and the ISIS neutron source at the Rutherford Appleton Laboratory (Didcot, UK). The Australian Synchrotron has led the world's synchrotrons on data archival with its Store.Synchrotron data storage service for macromolecular crystallography (Meyer et al., 2014). As well as archiving diffraction image data, it supports users in their publications by linking to raw data sets via DOI registrations and, finally, releases data sets for public analysis, something that the ILL is also doing in the neutron community. There are also fine examples such as Diamond, which has so far retained all of its measured data. The ESRF has published a summary of its views on the era of Big Data at synchrotron radiation facilities in general and the challenges that the ESRF itself faces today (ESRF, 2013). In an encouraging recent statement, it has announced a proactive data archiving policy (Andy Götz and colleagues from ESRF, personal communication).

There are still very significant challenges of data management in home laboratories and for medium-scale service providers such as the UK National Crystallography Service (Southampton, UK). In all these places, all the data from an experiment must be handled in the context of resource management, provenance, validation and bulk storage, all of which require ever greater volumes of metadata that should conform to widely accepted standards.

2.4. The data deluge

One caveat that we apply to our encouraging survey of repository solutions is that, as technology advances, so the volume of data collected is increasing at a dramatic rate. Hence, while the entire download total from Utrecht University in 2015 was 31 GB, a single data set produced by an Eiger 16M detector currently operating on a synchrotron beamline could be over 70 GB. This suggests that centralized experimental facilities, with their large data storage capacities and gigabit internal networks, will continue to play an important role as first-choice repositories for quasi-routine retention of data sets. However, it may also become necessary to apply principles of `triage', either at the point of data collection or in subsequent long-term storage allocation. Such triage might either delete certain data sets or retain some subset, according to a variety of possible criteria. An initial suggestion for a set of such criteria was proposed in the DDDWG online forum in 2011 (http://forums.iucr.org/viewtopic.php?f=21&t=57) but has yet to be developed by the community.

3. Metadata for raw data requirements

3.1. A holistic metadata framework for crystallography

Crystallography and related structural sciences are fortunate in having a standardized approach to data characterization and management, known as the Crystallographic Information Framework (CIF; Hall & McMahon, 1995). This has two components: a standard file format and data model (Hall et al., 1991; Bernstein et al., 2016), which facilitate data exchange between software programs, structural databases and publishing systems; and a set of `dictionaries' that control the meaning of the tags associated with data values, and which can impose restrictions on data types and values where appropriate. These dictionaries collectively constitute the controlled vocabulary and associated definitions that represent the semantic meaning of a data file or stream, what is fashionably called the `ontology' of a particular scientific domain.

Each CIF dictionary contains definitions relevant to a particular field or topic area, such as small-unit-cell structures determined by single-crystal diffractometry (the so-called `core' dictionary), powder diffraction, biological macromolecular structures, modulated incommensurate structures, multipole electron density or diffraction images (Hall & McMahon, 2016 ). These compilations by topic take a comprehensive view of what may be termed `data'. Thus, the core dictionary contains items as diverse as a single atomic positional coordinate, the ambient temperature at the time the experiment was conducted, the convergence metrics of the least-squares refinement, the software used for generating molecular graphics, or the entire text of an associated scientific publication. That is, there is no differentiation between items that might normally be categorized as `raw', `processed' or `derived' data, or that might be characterized as `metadata'.
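To illustrate how such heterogeneous items can sit side by side as tag-value pairs, the sketch below reads a few single-line items from a CIF-like text. It is a deliberately minimal reader written for this illustration, not a conforming CIF parser: loops, multi-line text fields and the full quoting rules are ignored. The data names are taken from the core dictionary, but the values (including the software name) are invented.

```python
def parse_simple_cif(text):
    """Collect single-line `_tag value` pairs; loops and text fields are skipped."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # data_ block headers, comments, loop rows and blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a tag whose value sits on a following line is not handled here
        tag, value = parts
        # Data names are case-insensitive, so they are normalized to lower case
        items[tag.lower()] = value.strip().strip("'\"")
    return items


# Invented values attached to real core-dictionary data names
example = """\
data_example
_cell_length_a                 10.234
_diffrn_ambient_temperature    100
_computing_molecular_graphics  'hypothetical viewer 1.0'
"""

if __name__ == "__main__":
    for tag, value in parse_simple_cif(example).items():
        print(tag, "=", value)
```

In this toy file a unit-cell length, an experimental condition and a software citation are all stored in the same way, which is the lack of differentiation between `raw', `derived' and `metadata' items described above.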

The advantage of this lack of differentiation is that all the information needed to interpret, validate or reuse a data set can be stored in a single file and this can make it easier to collect and verify such information during the course of an experimental workflow. Fig. 6 illustrates how the CIF ontologies inform the `coherent information flow' at every stage of the information processing lifecycle in a typical structure determination experiment. In practice, not all real-world workflows use CIF as their actual mechanism for capturing data and metadata. For example, in large instrumental facilities, information about a particular experiment might be collected within a unified content management system developed by the facility to accommodate a wide range of different scientific experiments (Matthews et al. , 2010 ). Similarly, to manage the high-throughput data acquisition requirements of modern detectors, images may be generated as binary HDF5 files, or in proprietary formats.


Figure 6
A coherent information flow in crystallography. CIF ontologies characterize data at every stage of the information processing life cycle, from experimental apparatus to published paper and curated database deposit.

Nevertheless, all raw data sets and associated metadata can, in principle, be converted into CIF representations, which might be a practical benefit for archiving purposes (i.e. to use a single standard representation), or at the very least can demonstrate what important metadata are missing, by comparison with the comprehensive CIF dictionary compendia of what can and should be collected.

Various IUCr Commissions are continuing to compile metadata definitions relevant to their field of interest in the form of CIF dictionaries. In addition to those listed by Hall & McMahon (2016), a small-angle scattering dictionary (sasCIF) has recently been published (Kachala et al., 2016); work is well advanced by the IUCr Commission on Magnetic Structures to characterize magnetic structures and their underlying symmetries (magCIF); and the Commission on High Pressure has an active working group defining essential aspects of the experimental setup needed in non-ambient crystallography.

As mentioned before, the imgCIF dictionary describes an actual format for storing raw diffraction data. However, it also includes a rather complete set of data items that, if fully populated and used in conjunction with other items in the core or macromolecular CIF dictionaries, can fully describe the experimental apparatus and operating parameters, thus permitting a complete interpretation of archived images in this format. The imgCIF format itself is relatively little used, largely because of the speed requirements in modern detectors which require different data acquisition strategies. However, there is an ongoing effort to define metadata terms in the increasingly common NeXus format (Könnecke et al. , 2015 ) that are in concordance with the experimental metadata items defined in the imgCIF dictionary.

3.2. The diversity of instrumentation

In this section we examine the specifics of some of the problems encountered in practice with missing or poorly characterized metadata. The availability of metadata in image headers and their interpretation by software developers has been discussed previously (Tanley, Schreurs et al., 2013; Kroon-Batenburg & Helliwell, 2014). It can safely be concluded that metadata information is often lacking or is ambiguous, i.e. can be interpreted in different ways. Hardware manufacturers may use different words for the same physical parameter or its units, and it is left to the software developers to make correct use of the metadata information and fill in the missing parts, by acquired knowledge or by trial and error. We refer to the supporting information in the paper by Kroon-Batenburg & Helliwell (2014) for a discussion between Kay Diederichs, Toine Schreurs and Loes Kroon-Batenburg about φ scans around an axis not perpendicular to the X-ray beam on a fixed-χ goniometer. Though sufficient information was available in the header, the XDS software (Kabsch, 2010) ignored most of it and used knowledge of the (usual) instrumental set-up, which in this case did not suffice. Initially the raw data, which are now in the Manchester University Library archive, were stored on a website at Utrecht University (http://rawdata.chem.uu.nl), and we added a photograph of the experimental set-up as metadata to resolve the ambiguity of the goniometer, e.g. is the spindle axis pointing up or down?

We should distinguish between diffraction equipment designed to be used in combination with the manufacturer's software, which adequately handles metadata information, and assembled instruments like those on a synchrotron beamline. In the first case, taking the data to another place for use with third-party software may give rise to problems, as described by Tanley, Diederichs et al. (2013). The image headers at best contain the type of goniometer (e.g. `MACH3 with KAPPA' for Bruker Proteum) but rarely are the orientations and dependencies of the four axes given. In the second case, commercial detectors (e.g. the Pilatus from Dectris) are installed on a beamline and it is the beamline control software, in close interaction with the detector software, that is responsible for writing information in the image headers. In this mixed environment not all metadata are captured. Usually, but not always, the wavelength, detector-to-sample distance, pixel size and number of pixels in either direction, rotation start angle and increment, and exposure time are given.
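The list of items that are `usually, but not always' present suggests a simple completeness check on a parsed header. The sketch below assumes that the header has already been read into a dictionary; the key names in `REQUIRED` are hypothetical labels chosen for this example and do not correspond to any particular vendor format or dictionary.

```python
# Hypothetical key names for the items listed in the text above
REQUIRED = (
    "wavelength",
    "detector_distance",
    "pixel_size",
    "pixels_x",
    "pixels_y",
    "start_angle",
    "angle_increment",
    "exposure_time",
)


def missing_metadata(header):
    """Return the required keys that are absent or empty in a parsed header dict."""
    # A value of 0 (e.g. a start angle of 0.0) is valid and is not reported as missing
    return [key for key in REQUIRED if header.get(key) in (None, "")]


if __name__ == "__main__":
    header = {
        "wavelength": 0.9795,
        "detector_distance": 250.0,
        "pixel_size": 0.172,
        "pixels_x": 2463,
        "pixels_y": 2527,
        "start_angle": 0.0,
        "angle_increment": 0.1,
    }
    print(missing_metadata(header))
```

Such a check only establishes that a value is present. It cannot tell whether the value is correct or which convention it follows, and those are the harder problems discussed in the remainder of this section.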

The most common problems with metadata, however, are related to the orientations of the goniometer axes and rotation directions, and the definition of the faster and slower directions in pixel coordinates with respect to the laboratory axes and the origin of the pixel coordinates; especially disturbing is the absence of, or an incorrect, beam centre (see below). Table 2 gives the goniometer definitions known to the EVAL software (Schreurs et al., 2010) and shows their large variety.

Table 2
Implementation of goniometer types in EVAL (Schreurs et al., 2010)

An interesting tabulation of beamline settings for running autoPROC (Vonrhein et al., 2011) is given at the website http://www.globalphasing.com/autoproc/wiki. Values such as BeamCentreFrom = header:x,-y, ReversePhi = `yes' and TwoThetaAxis = `-1' are given in order to cope with problems similar to those mentioned above (Table 2). There are eight possible ways in which the pixel values in the image file relate to the physical detector face, and detector vendors use all eight possible conventions (Wladek Minor, private communication). A wrong beam centre can hamper the indexing step. One can estimate the beam centre by manual inspection, by calibration using powder diffraction, by taking a direct beam shot or by removing Bragg spots and using the solvent diffuse ring to find the beam centre (Vonrhein et al., 2011); otherwise one has to resort to trial and error. Fig. 7 shows the mini-CBF header that is used by Dectris for Pilatus detectors. Most of the information is present but some parameters are ambiguous: Beam_xy (see the discussion above); Oscillation_axis is given as `X' (what is the X direction?); and Polarization is 0.990 (which plane has the strong intensity?). We encountered an especially confusing situation where a Bruker fixed-χ goniometer was mounted with a 90° rotation on Argonne beamline 15ID-B, while the images were converted to the normal Bruker instrument orientation. The strong polarization direction therefore appeared to be along the oscillation axis, but it was not (Jozef Kožíšek, private communication); only the string TARGET SYNCHROTRON in the header warned us.
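The count of eight conventions follows from combining the four rotations by 90° with an optional mirror. The sketch below enumerates them for a small array stored as nested lists; it is an illustration of the geometry only, with function names chosen for this example, and makes no claim about how any processing package handles the problem.

```python
def rot90(img):
    """Rotate a row-major 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in img]


def eight_orientations(img):
    """Return the eight arrays reachable by 90-degree rotations and mirroring."""
    out = []
    current = [list(row) for row in img]
    for _ in range(4):
        out.append(current)        # rotation only
        out.append(flip(current))  # rotation followed by a mirror
        current = rot90(current)
    return out


if __name__ == "__main__":
    tiny = [[1, 2],
            [3, 4]]
    for arrangement in eight_orientations(tiny):
        print(arrangement)
```

When the header does not state the convention, a program can in principle try each arrangement in turn, which is the trial-and-error route mentioned above; a transformed beam centre would have to be carried along with each candidate.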

More a priori knowledge is often needed to interpret diffraction image data. For example, there are different conventions on how to record dead regions on the detector: strips between detector panels on Pilatus detectors are indicated by `-1', whereas in ADSC detector image files these are indicated by `0'. Data processing software has to interpret such pixel data correctly. Dark image and non-uniformity corrections may lead to negative intensities, and some detector read-out handlers use a so-called baseline offset: a fixed integer number is added to all pixel intensities to avoid having to store negative numbers. Removing the baseline offset is important in estimating the standard deviations of net Bragg reflection intensities and for measuring diffuse intensities between the Bragg peaks. Spatial distortion corrections are usually carried out and cannot be undone or corrected by processing software, but they affect standard deviations (Waterman & Evans, 2010) and this information should be conveyed in the metadata.
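The two conventions described above, a marker value for dead regions and a baseline offset, can be handled in one pass if both are known from the metadata. The following sketch assumes that they are supplied by the caller; the function name and the offset of 40 used in the example are hypothetical.

```python
def clean_pixels(raw, dead_marker, baseline_offset=0):
    """Return (values, mask) for a 2D list of stored pixel values.

    values holds offset-corrected counts, with None at dead pixels;
    mask holds True where a pixel is usable.
    """
    values, mask = [], []
    for row in raw:
        values.append(
            [None if p == dead_marker else p - baseline_offset for p in row]
        )
        mask.append([p != dead_marker for p in row])
    return values, mask


if __name__ == "__main__":
    # Dead regions marked by -1, no baseline offset
    print(clean_pixels([[-1, 5], [7, -1]], dead_marker=-1))
    # Dead regions marked by 0, with a hypothetical baseline offset of 40
    print(clean_pixels([[0, 40, 52]], dead_marker=0, baseline_offset=40))
```

A limitation of this sketch is that, when the marker is `0' and no offset is applied, a genuinely measured zero cannot be distinguished from a dead pixel by value alone, which is one more reason for the convention to be recorded explicitly in the metadata.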

Detector hardware is being developed for high-speed serial crystallography experiments at X-ray free-electron laser (XFEL) installations or high-flux synchrotron beamlines that require ultra-fast data acquisition. A container HDF5 format, often with a NeXus data format layer on top, is designed for flexible and efficient input/output (I/O) for such high volumes of data. New data processing software packages such as CrystFEL (White et al. , 2012 ), cctbx.xfel (Sauter et al. , 2013 ) and DIALS (Waterman et al. , 2013 ) for serial crystallography are under development and this provides the opportunity to address the metadata issues anew.

Dectris has installed the Eiger detector at several synchrotron beamlines. Metadata are contained in a separate file (master.h5) linking to the image data files. The NeXus data representation (Könnecke et al., 2015), like CIF, is very flexible and all metadata required can be captured by defining NeXus groups, fields and attributes. A good example of how consistent and comprehensive metadata can be stored in an imgCIF/CBF file is provided in Fig. 8 (Jörg Kaercher, Bruker AXS, private communication). In the proprietary Bruker .sfrm format the starting angles 2θ, ω, φ and χ are given (`ANGLES: .'). Their axis directions are not defined, whereas they are in the CBF format: the orientations and dependencies are given in the left-hand panel of Fig. 8(b). In .sfrm the rotation axis `AXIS: 2' indicates ω, and the starting angle and increment are found at `START:' and `INCREME:'; equivalent values are found in the CBF header at `_diffrn_scan_axis.displacement_angle' and `_diffrn_scan_axis.displacement_increment' (Fig. 8b, right-hand panel).

4. A concern and an action arising from the Rovinj Diffraction Data Deposition Workshop

A concern was voiced during open discussion at the workshop via the question ` Can we move away from the knowledge base in the various software packages, and make use of well developed metadata formats such as in CIF or NeXus? ', i.e. a standardized raw diffraction image data format would make life easier for software developers but would require coordination between detector manufacturers. This has led directly to renewed calls for a standardized image format of appeal across the whole community. In conjunction with this question, the DDDWG is working on defining minimum requirements for metadata. We acknowledge that there will continue to be a great diversity of image formats (not least because of the existing installed base of detectors and the legacy data sets that have been archived), and conversion utilities such as eiger2cbf (https://github.com/biochem-fan/eiger2cbf) will continue to be needed. Nevertheless, it is important that anyone seeking to develop further new formats should be acutely aware of the need for adequate metadata characterization and interoperability that we have described above, and such an awareness may temper the proliferation of more new formats without particular demonstrable value.

In a separate discussion it was agreed that there is a need for a set of criteria for capturing and validating the essential experimental metadata for reproducibility of scientific results from any given raw data set. The proposal referred to this as ` checkCIF for raw data' and a close collaboration on this matter has been established with the IUCr COMCIFS (chaired by James Hester, who also attended the Rovinj Workshop). To develop these ideas further, a workshop run by the DDDWG is to take place at the ACA 2017 Conference in New Orleans in May 2017.

5. Concluding remarks

In this topical review we have described the rapidly developing interest in, and storage options for, the preservation and reuse of raw data within the scientific domain supervised by the IUCr and its Commissions. We have highlighted the initiatives of science policy makers towards an `Open Science' model within which crystallographers will work in the future; this will bring new funding opportunities but also new codes of procedure within open science frameworks. Skills education and training for crystallographers and frank discussion will be needed. Overall, we now have the means and the organization for preservation of our raw data, but the need for careful thought about the metadata descriptors for each of the IUCr Commissions continues to be pressing. We note that the Commissions work within a diversity of instrumentation, and so a range of actions is required to improve on the current situation.

We have identified specifically the need to revisit the imperative for the community to adopt a standardized image format, and to agree at least a minimal set of essential metadata for reproducibility. The imgCIF dictionary (Hammersley et al. , 2005 ) is the natural starting point for the former, and the interaction between COMCIFS and NIAC (Könnecke et al. , 2015 ) demonstrates the feasibility of applying a common ontology across differing physical formats. There are also grounds for optimism that the idea of ` checkCIF for raw data' will appeal to both researchers and instrument vendors, given the enthusiastic representation of both at the Rovinj Workshop. As with all such initiatives, the rate of uptake will depend on drivers within the community. In the case of the original ` checkCIF ' for derived data, structural science journals (especially those of the IUCr) that demanded relevant metadata and consistency checking provided one such important driver. In the case of raw data, which underpins all subsequent scientific deductions and derivations, we are encouraged by the emerging policies on research data management that we have summarized in this article, and by the many archiving initiatives that have sprung up around X-ray diffraction images in the space of the past few years.

Acknowledgements

We are grateful to the IUCr for continuing support of DDDWG activities, including the Workshop in Rovinj that led to this and a number of other articles. We are very grateful to various research institutes and universities who sent their staff to take part in that Workshop. Support for technical services and associated staffing costs was contributed by Dectris, IUCr Journals, CODATA, the Cambridge Crystallographic Data Centre, Bruker, FIZ Karlsruhe/ICSD, Oxford Cryosystems and Wiley, to whom we are very grateful. We are also indebted to the Croatian Association of Crystallographers for their active help in securing the best possible Workshop to address this important topic.

References

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235.
Bernstein, H. J. (2005). Classification and Use of Image Data. International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, edited by S. R. Hall and B. McMahon, pp. 199 ff. Dordrecht: Springer.
Bernstein, H. J., Bollinger, J. C., Brown, I. D., Gražulis, S., Hester, J. R., McMahon, B., Spadaccini, N., Westbrook, J. D. & Westrip, S. P. (2016). J. Appl. Cryst. 49, 277.
Bernstein, H. J., Sloan, J. M., Winter, G., Richter, T. S., NIAC & COMCIFS (2013). Coping with BIG DATA Image Formats: Integration of CBF, NeXus and HDF5. American Crystallographic Association Meeting, July 2013, Honolulu, Hawaii, USA. Poster T-16.
Diederichs, K. & Karplus, P. A. (2013). Acta Cryst. D69, 1215.
Digital Curation Centre (2016). Overview of Funders' Data Policies. http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies.
ESRF (2013). ESRFnews, December ed., pp. 14 ff. ESRF, Grenoble, France.
Grabowski, M., Langner, K. M., Cymborowski, M., Porebski, P. J., Sroka, P., Zheng, H., Cooper, D. R., Zimmerman, M. D., Elsliger, M.-A., Burley, S. K. & Minor, W. (2016). Acta Cryst. D72, 1181.
Guss, J. M. & McMahon, B. (2014). Acta Cryst. D70, 2520.
Gutmanas, A., Oldfield, T. J., Patwardhan, A., Sen, S., Velankar, S. & Kleywegt, G. J. (2013). Acta Cryst. D69, 710.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). Acta Cryst. A47, 655.
Hall, S. R. & McMahon, B. (1995). Editors. International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data. Dordrecht: Springer.
Hall, S. R. & McMahon, B. (2016). Data Sci. J. 15, 3.
Hammersley, A. P., Bernstein, H. J. & Westbrook, J. D. (2005). Image Dictionary (imgCIF). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, edited by S. R. Hall and B. McMahon, pp. 444 ff. Dordrecht: Springer.
Hester, J. R. (2016). Data Sci. J. 15, 12.
International Structural Genomics Organization (2001). Report of Task Force on Numerical Criteria in Structural Genomics. http://www.isgo.org/organization/members07/010410.html.
Jacques, D. A., Guss, J. M., Svergun, D. I. & Trewhella, J. (2012). Acta Cryst. D68, 620.
Jones, B. (2015). Towards the European Open Science Cloud. http://doi.org/10.5281/zenodo.16001.
Kabsch, W. (2010). Acta Cryst. D66, 125.
Kachala, M., Westbrook, J. & Svergun, D. (2016). J. Appl. Cryst. 49, 302.
Könnecke, M. et al. (2015). J. Appl. Cryst. 48, 301.
Kroon-Batenburg, L. M. J. & Helliwell, J. R. (2014). Acta Cryst. D70, 2502.
Marsh, R. E., Kapon, M., Hu, S. & Herbstein, F. H. (2002). Acta Cryst. B58, 62.
Matthews, B., Sufi, S., Flannery, D., Lerusse, L., Griffin, T., Gleaves, M. & Kleese, K. (2010). Int. J. Digit. Curation, 5, 106.
Messori, L. & Merlino, A. (2016). Coord. Chem. Rev. 315, 67.
Meyer, G. R., Aragão, D., Mudie, N. J., Caradoc-Davies, T. T., McGowan, S., Bertling, P. J., Groenewegen, D., Quenette, S. M., Bond, C. S., Buckle, A. M. & Androulakis, S. (2014). Acta Cryst. D70, 2510.
Meyer, P. A. et al. (2016). Nat. Commun. 7, 10882.
Minor, W., Dauter, Z., Helliwell, J. R., Jaskolski, M. & Wlodawer, A. (2016). Structure, 24, 216.
National Science Foundation (2010). Data Management and Sharing Frequently Asked Questions (FAQs). http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp.
Northwestern University Library (2016). Data Management: Federal Funding Agency Requirements. http://libguides.northwestern.edu/datamanagement/federalfundingagency.
Ravel, B., Hester, J. R., Solé, V. A. & Newville, M. (2012). J. Synchrotron Rad. 19, 869.
Research Councils UK (2015). Guidance on Best Practice in the Management of Research Data. http://www.rcuk.ac.uk/documents/documents/rcukcommonprinciplesondatapolicy-pdf/.
Sauter, N. K., Hattne, J., Grosse-Kunstleve, R. W. & Echols, N. (2013). Acta Cryst. D69, 1274.
Schreurs, A. M. M., Xian, X. & Kroon-Batenburg, L. M. J. (2010). J. Appl. Cryst. 43, 70.
Science International (2015). Open Data in a Big Data World. Paris: International Council for Science (ICSU), International Social Science Council (ISSC), The World Academy of Sciences (TWAS), InterAcademy Partnership (IAP).
Shabalin, I., Dauter, Z., Jaskolski, M., Minor, W. & Wlodawer, A. (2015). Acta Cryst. D71, 1965.
Strickland, P. R., Hoyland, M. A. & McMahon, B. (2005). Small-Molecule Crystal Structure Publication Using CIF. International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, edited by S. R. Hall and B. McMahon, pp. 557 ff. Dordrecht: Springer.
Tanley, S. W. M., Diederichs, K., Kroon-Batenburg, L. M. J., Levy, C., Schreurs, A. M. M. & Helliwell, J. R. (2015). Acta Cryst. D71, 1982.
Tanley, S. W. M., Diederichs, K., Kroon-Batenburg, L. M. J., Schreurs, A. M. M. & Helliwell, J. R. (2013). J. Synchrotron Rad. 20, 880.
Tanley, S. W. M., Schreurs, A. M. M., Helliwell, J. R. & Kroon-Batenburg, L. M. J. (2013). J. Appl. Cryst. 46, 108.
Tanley, S. W. M., Schreurs, A. M. M., Kroon-Batenburg, L. M. J. & Helliwell, J. R. (2016). Acta Cryst. F72, 253.
Terwilliger, T. C. (2012). Continuous Improvement of Macromolecular Crystal Structures. ICSTI Insights: The Living Publication, pp. 16 ff. (http://www.icsti.org/IMG/pdf/Living_publication_Final-2.pdf). Paris: ICSTI.
Terwilliger, T. C. (2014). Acta Cryst. D70, 2500.
Terwilliger, T. C. & Bricogne, G. (2014). Acta Cryst. D70, 2533.
Toby, B. H. (2005). Classification and Use of Powder Diffraction Data. International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, edited by S. R. Hall and B. McMahon, pp. 117 ff. Dordrecht: Springer.
Vonrhein, C., Flensburg, C., Keller, P., Sharff, A., Smart, O., Paciorek, W., Womack, T. & Bricogne, G. (2011). Acta Cryst. D67, 293.
Waterman, D. & Evans, G. (2010). J. Appl. Cryst. 43 , 1356�. Web of Science CrossRef CAS IUCr Journals Google Scholar
Waterman, D. G., Winter, G., Parkhurst, J. M., Fuentes-Montero, L., Hattne, J., Brewster, A., Sauter, N. K. & Evans, G. (2013). CCP4 Newsl. Protein Crystallogr. 49 , 16󈝿. Google Scholar
White, T. A., Kirian, R. A., Martin, A. V., Aquila, A., Nass, K., Barty, A. & Chapman, H. N. (2012). J. Appl. Cryst. 45 , 335�. Web of Science CrossRef CAS IUCr Journals Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.


Title: Experiences with making diffraction image data available: what metadata do we need to archive?

A local raw ‘diffraction data images’ archive was made available and some data sets were retrieved and reprocessed, which led to analysis of the anomalous difference densities of two partially occupied Cl atoms in cisplatin as well as a re-evaluation of the resolution cutoff in these diffraction data. General questions on storing raw data are discussed. It is also demonstrated that often one needs unambiguous prior knowledge to read the (binary) detector format and the setup of goniometer geometries. Recently, the IUCr (International Union of Crystallography) initiated the formation of a Diffraction Data Deposition Working Group with the aim of developing standards for the representation of raw diffraction data associated with the publication of structural papers. Archiving of raw data serves several goals: to improve the record of science, to verify the reproducibility and to allow detailed checks of scientific data, safeguarding against fraud and to allow reanalysis with future improved techniques. A means of studying this issue is to submit exemplar publications with associated raw data and metadata. In a recent study of the binding of cisplatin and carboplatin to histidine in lysozyme crystals under several conditions, the possible effects of the equipment and X-ray diffraction data-processing software on the occupancies and B factors of the bound Pt compounds were compared. Initially, 35.3 GB of data were transferred from Manchester to Utrecht to be processed with EVAL. A detailed description and discussion of the availability of metadata was published in a paper that was linked to a local raw data archive at Utrecht University and also mirrored at the TARDIS raw diffraction data archive in Australia. By making these raw diffraction data sets available with the article, it is possible for the diffraction community to make their own evaluation. This led one of the authors of XDS (K. Diederichs) to re-integrate the data from crystals that supposedly solely contained bound carboplatin, resulting in the analysis of partially occupied chlorine anomalous electron densities near the Pt-binding sites and the use of several criteria to more carefully assess the diffraction resolution limit. General arguments for archiving raw data, the possibilities of doing so and the requirement of resources are discussed. The problems associated with a partially unknown experimental setup, which preferably should be available as metadata, are discussed. Current thoughts on data compression are summarized, which could be a solution especially for pixel-device data sets with fine slicing that may otherwise present an unmanageable amount of data.


A public database of macromolecular diffraction experiments

The reproducibility of published experimental results has recently attracted attention in many different scientific fields. The lack of availability of original primary scientific data is a major factor contributing to reproducibility problems; however, the structural biology community has taken significant steps towards making experimental data available.

Macromolecular X-ray crystallography has led the way in requiring the public dissemination of atomic coordinates and a wealth of experimental data via the Protein Data Bank (PDB) and similar projects, making the field one of the most reproducible in the biological sciences.

The IUCr commissioned the Diffraction Data Deposition Working Group (DDDWG) in 2011 to examine the benefits and feasibility of archiving raw diffraction images in crystallography. The 2011-2014 DDDWG triennial report made several key recommendations regarding the preservation of raw diffraction data. However, there remains no mandate for public disclosure of the original diffraction data.

The Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) is part of the Big Data to Knowledge programme of the National Institutes of Health and has been developed to archive raw data from diffraction experiments and, equally importantly, to provide related metadata. The database [Grabowski et al. (2016). Acta Cryst. D72, 1181-1193, doi:10.1107/S2059798316014716] contains at the time of writing 3070 macromolecular diffraction experiments (5983 datasets) and their corresponding partially curated metadata, accounting for around 3% of all depositions in the Protein Data Bank. The resource is accessible at http://www.proteindiffraction.org and can be searched using various criteria via a simple, streamlined interface. All data are available for unrestricted access and download. The resource serves as a proof of concept and demonstrates the feasibility of archiving raw diffraction data and associated metadata from X-ray crystallographic studies of biological macromolecules.

Talking to a reporter about the project, team leader Wladek Minor said, "There is so much research underway that it can't all be published, and often the results of unsuccessful studies don't appear in the literature. I think the key to success is to know about unsuccessful experiments, we want to know why they fail".

The goal of the project is to expand the IRRMC and include data sets that failed to yield X-ray structures. This could facilitate collaborative efforts to improve protein structure-determination methods and also ensure the availability of "orphan" data left behind by individual investigators and/or extinct structural genomics projects.




a Life Science, Diamond Light Source, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0DE, UK, and b Division of Structural Biology, University of Oxford, Wellcome Centre for Human Genetics, Oxford, Oxfordshire OX3 7BN, UK

Developing methods to determine high-resolution structures from micrometre- or even submicrometre-sized protein crystals has become increasingly important in recent years. This applies to both large protein complexes and membrane proteins, where protein production and the subsequent growth of large homogeneous crystals is often challenging, and to samples which yield only micro- or nanocrystals such as amyloid or viral polyhedrin proteins. The versatile macromolecular crystallography microfocus (VMXm) beamline at Diamond Light Source specializes in X-ray diffraction measurements from micro- and nanocrystals. Because of the possibility of measuring data from crystalline samples that approach the resolution limit of visible-light microscopy, the beamline design includes a scanning electron microscope (SEM) to visualize, locate and accurately centre crystals for X-ray diffraction experiments. To ensure that scanning electron microscopy is an appropriate method for sample visualization, tests were carried out to assess the effect of SEM radiation on diffraction quality. Cytoplasmic polyhedrosis virus polyhedrin protein crystals cryocooled on electron-microscopy grids were exposed to SEM radiation before X-ray diffraction data were collected. After processing the data with DIALS, no statistically significant difference in data quality was found between datasets collected from crystals exposed and not exposed to SEM radiation. This study supports the use of an SEM as a tool for the visualization of protein crystals and as an integrated visualization tool on the VMXm beamline.

1. Introduction

In the last decade, microfocus X-ray beamlines have facilitated advances in structural biology by providing increasingly small intense beams of X-rays. Crystal sizes on the order of tens of micrometres down to a few micrometres are now generally considered accessible, albeit challenging, targets for protein structural biology projects. Serial femtosecond crystallography X-ray free-electron laser (XFEL) approaches have also pushed this limit, using tens of thousands of microcrystals [for a review see Martin-Garcia et al. (2016)] and even nanocrystals (Gati et al., 2017) to determine high-resolution protein structures. Still, XFEL-based techniques have their challenges including the large number of crystals required, the inability to collect rotation data, and also the expense and limited availability of XFEL beam time. Synchrotron serial crystallography methods are also developing, but again often require reasonably large numbers of crystals (Ebrahim et al., 2019; Diederichs & Wang, 2017). Electron diffraction is another growing technique for structure determination from protein crystals that are a few hundred nanometres in size (Shi et al., 2013; Nannenga et al., 2014; Yonekura et al., 2015; Clabbers et al., 2017; Xu et al., 2018), with an upper limit to sample thickness of […] nm (Shi et al., 2013). Focused ion-beam milling promises a way to circumvent this thickness limit by selectively obliterating excess crystal sample to give a thin ([…] nm) lamella from which data can be collected (Duyvesteyn et al., 2018; Martynowycz et al., 2019). Still, cryoEM microscopes equipped with dedicated detectors and software for low-dose protein electron-diffraction studies are reasonably scarce.

The versatile macromolecular crystallography microfocus (VMXm) beamline at Diamond Light Source, part of the VMX beamline suite, aims to further increase the scope of crystal sizes available to synchrotron-based X-ray crystallography. VMXm is designed to enable the collection of rotation datasets from crystals measuring down to 0.5 µm in size, thereby reducing the sample material required for protein structure determination, compared with serial methods, by improving the quality of data recorded from each individual crystal. In addition, crystals measuring several micrometres or less may encounter a reduced rate of radiation damage during X-ray diffraction experiments by harnessing potential photoelectron escape effects (Nave & Hill, 2005). A discussion from Holton & Frankel (2010) suggested that it is possible, under ideal conditions, to determine a 2.0 Å resolution structure from a single spherical crystal of lysozyme protein with a diameter of ∼1.2 µm. This simulation ignored all contributions to background scatter arising from disordered solvent within the crystal. VMXm aims to close the gap between theory and what is currently possible in macromolecular crystallography using synchrotron X-rays. To date and to our knowledge, the smallest crystals measured using the rotation method at a synchrotron to yield a structure were reported by Ginn et al. (2015), where diffraction data from 768 crystals of ∼1.0 µm³ in size were recorded at Diamond beamline I24, analysed and merged to produce a dataset complete to 2.2 Å resolution.

The VMXm beamline optics will deliver a focused variable vertical X-ray beam size of 0.3–10 µm using a single custom-profiled fixed focal-length mirror (Laundy et al., 2016). Horizontal beam sizes of 0.5–5 µm are to be achieved using a two-stage demagnification scheme and a variable secondary source aperture. The horizontally deflecting double-crystal monochromator permits energies between 6 and 28 keV and, depending on the optical configuration, will deliver between 10¹¹ and 10¹² photons s⁻¹ to the sample when operating at 12 keV. Samples for VMXm will typically be prepared on electron-microscopy grids using techniques borrowed from cryoEM. To further improve the signal to noise of the diffracted X-rays, the sample environment will be held under a vacuum of ∼10⁻⁶ mbar. As of January 2020, the major construction of the beamline has been completed, with commissioning of its components ongoing.

Collecting rotation data, as opposed to single still images, from protein crystals measuring less than a micrometre poses many practical challenges beyond the obvious radiation-damage limitations, in particular, locating and centring crystals of this size to the X-ray beam. To enable rotation data collection from crystals in this size range, VMXm aims to produce both a beam position and a sample position, stable to within 50 nm. These design specifications impose high accuracy and resolution imaging of the sample position to ensure coincidence of the beam and sample. Therefore, to align and visualize micro- and nanocrystals, which could be below the resolving power of an optical-light microscope, a scanning electron microscope (SEM) has been incorporated into the VMXm endstation sample environment. Although other methods for visualizing and centring protein crystals have been explored elsewhere (for a review see Becker et al., 2017), the superior resolving quality of an SEM and the independence of SEM image quality from crystal space group, morphology, orientation and protein sequence formed the basis of this design decision. One consideration in using an SEM in this way, however, is the potential for damage to the samples resulting from electron interactions. In an analysis by Hattne et al. (2018), the global and site-specific radiation damage resulting from the use of a 200 keV electron beam suggested that an incident electron dose of ∼3 e⁻ Å⁻² resulted in the loss of high-resolution information (classed as reflections of 3 Å resolution and above). This is in line with previous analyses which have assessed electron-induced radiation damage of protein crystals (Chiu, 2006; Henderson, 1995).

CryoSEM applications for uncoated biological samples use excitation energies orders of magnitude lower than those in the transmission electron microscopy (TEM) methods described by Hattne et al. (2018). Instead of needing to penetrate through the entire crystal volume as in TEM-based experiments, the SEM beam needs only to interact with the surface layer of the crystal for image formation. Although there are few published data for SEM interaction volumes of protein crystals when using low (<5 keV) incident electron energies, a Kanaya–Okayama estimation of the interaction hemisphere of pure amorphous carbon is […] nm at 2 keV (Kanaya & Okayama, 1972). Monte Carlo simulations carried out by Barnett et al. suggest that the penetration depth of 2 keV electrons in water ice is […] nm, although further experiments by the same group suggest these simulations perhaps underestimate this depth (Barnett et al., 2012). Finally, simulations of the interaction of 2 keV electrons with graphene-coated chitin provided a maximum penetration depth of 140 nm (Park et al., 2016). Given these data, the interaction depth of a 2 keV electron within a protein crystal is likely to be of the order of 100 to 200 nm.
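As an illustration of the Kanaya–Okayama estimate mentioned above, the sketch below evaluates the commonly quoted form of the range expression, R (µm) = 0.0276 A E^1.67 / (Z^0.889 ρ), with E in keV and ρ in g cm⁻³. The function name is hypothetical, and the density of 2.0 g cm⁻³ for amorphous carbon is an assumption made here, not a value taken from the article; the result scales inversely with whatever density is chosen.

```python
def kanaya_okayama_range_um(energy_kev, atomic_weight, atomic_number, density_g_cm3):
    """Kanaya-Okayama electron range in micrometres (commonly quoted form).

    energy_kev      incident electron energy in keV
    atomic_weight   A, in g/mol
    atomic_number   Z
    density_g_cm3   rho, in g/cm^3
    """
    return (0.0276 * atomic_weight * energy_kev ** 1.67
            / (atomic_number ** 0.889 * density_g_cm3))


# Carbon: A = 12.011, Z = 6. The density is an ASSUMED value for amorphous carbon.
ASSUMED_CARBON_DENSITY = 2.0  # g/cm^3, assumption for illustration only

range_um = kanaya_okayama_range_um(2.0, 12.011, 6, ASSUMED_CARBON_DENSITY)
print(f"Estimated range at 2 keV: {range_um * 1000:.0f} nm")
```

With this assumed density the estimate comes out at roughly 0.1 µm, the same order of magnitude as the 100 to 200 nm interaction depth discussed in the text.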

In this study, polyhedra protein crystals from Lymantria dispar cytoplasmic polyhedrosis virus (CPV14) were imaged using an offline SEM, the column of which is to be integrated directly into the VMXm endstation to enable future visualization and centring of protein crystals. X-ray diffraction data were collected subsequently from these same SEM-imaged crystals. The aim was to identify whether collecting SEM images was detrimental to the diffraction quality of CPV14 crystals. This was carried out by assessing whether any significant difference was observable between diffraction data measured from crystal samples exposed to electrons versus those that were not. We demonstrate that low-dose SEM imaging is a viable method for accurately locating and aligning protein crystals without impacting the diffraction quality prior to X-ray data collection.

2. Materials and methods

2.1. Monte Carlo simulations

The program CASINO (Hovington et al., 1997; Drouin et al., 2007) was used to simulate the trajectory and penetration depth of 2 keV electrons in a protein crystal. A total of 200 electrons were simulated as a 10 nm beam. The protein crystal sample was described as 1000 nm thick with the formula C₁₂₈₄H₂₆₉₅N₃₅₁O₇₄₈S₁₂ and a density of 1.35 g cm⁻³. This stoichiometry emulates the chemical composition of crystals of CPV14 with 22% solvent content [PDB ID 5a96 (Ji et al., 2015)].

2.2. Protein preparation and crystallization

CPV14 polyhedra were expressed and purified as described previously (Hill et al., 1999; Anduleit et al., 2005; Ji et al., 2015). Purified cubic CPV14 crystals measured 2–4 µm in each dimension and were stored as a slurry in H₂O at 4°C.

2.3. Sample mounting

The CPV14 crystal slurry was diluted 1 in 12 into a solution of ethylene glycol to give a final ethylene glycol concentration of 50% (v/v). Ethylene glycol was added to allow for finer control over the subsequent blotting process and to ensure cryoprotection of the crystals.

Crystals were cryocooled on electron-microscopy grids in preparation for further analysis. Cu 200 mesh grids coated with Quantifoil R 2/2 carbon film (Quantifoil) or Cu 400 mesh H7 finder grids with holey carbon (Agar Scientific) were glow discharged before application of the sample. A 2 µl aliquot of 50% (v/v) ethylene glycol was applied to the Cu side of the grid, followed by application of 2 µl of the diluted crystal slurry onto the carbon film. The grid was then blotted for 3.0–5.5 s from the Cu side of the grid using a Leica EM GP (20°C, humidity 90%). Blotted grids were then plunge frozen in liquid ethane. Grids were stored under liquid nitrogen until required.

2.4. Sample treatment

The samples were divided into four treatment groups: untreated, SEM loaded, SEM unexposed and SEM exposed, the details of which are described in Sections 2.4.1–2.4.3. Tests to assess radiation damage as a result of electron-beam exposure were performed using a JEOL JSM-IT100 SEM equipped with a Quorum PP3000T cryostage and cryotransfer system. The PP3000T cryostage, preparation stage (prepstage) and anticontaminator were cooled to […]°C, […]°C and […]°C, respectively. A gold-coated copper Zeiss scanning TEM shuttle was used to hold the samples during these experiments.

2.4.1. Untreated

Untreated samples were plunge frozen in liquid ethane and stored in liquid nitrogen as detailed in Section 2.3.

2.4.2. SEM loaded

SEM-loaded samples were additionally transferred into the SEM using the cryotransfer system. Plunge-frozen samples were loaded into the shuttle under liquid nitrogen. The cryotransfer system was used to transfer the samples into the cooled preparation chamber of the SEM. The shuttle was placed on the prepstage for 30 s before transfer onto the SEM stage for 2 min. The shuttle was then retracted back onto the prepstage for a further 30 s before transfer out of vacuum into liquid nitrogen using the cryotransfer system. The sample was then removed from the shuttle and stored under liquid nitrogen.

2.4.3. SEM unexposed and SEM exposed

Crystals for the SEM-exposed and SEM-unexposed X-ray diffraction experiments were all on the same grid to control for inter-grid sample variation because of grid handling. These grids were treated in the same manner as SEM-loaded samples (see Section 2.4.2); however, instead of the 2 min incubation on the SEM stage, the grids were kept on this stage for ∼1.5 h whilst SEM exposures were carried out. SEM-exposed crystals were imaged at an accelerating voltage of 2 kV, a probe current of 40 (arbitrary units) and a working distance of 10 mm. To help with navigation around the grid and to assess grid quality, a global image of the grid was taken at 30× magnification using a 0.5 s acquisition time (total dose, 4.6 × 10⁻⁸ e⁻ Å⁻²). A single grid square was then used to optimize focus and astigmatism. The optimum parameters were those which provided the sharpest image as judged by eye. Image contrast and brightness were optimized using the autocontrast and brightness feature of the InTouchScope software package (JEOL). Images of individual grid squares containing crystals were taken at 1900× magnification using a 20 s acquisition time (7.6 × 10⁻³ e⁻ Å⁻²). Between 50 and 75 grid squares were imaged with these conditions; the crystals in these images formed the SEM-exposed population. The remainder of the grid was left unexposed to electrons. Crystals in these areas formed the SEM-unexposed population. A description of the electron-dose calculations for these images can be found in the Supporting information.

2.5. X-ray data collection

Electron-microscopy grids were mounted onto the beamline goniometer using a custom-made sample pin. The pin constituted a blood-vessel clip (product 14120, World Precision Instruments) on a standard magnetic pin base held in place with 3M Scotch-Weld Epoxy Adhesive 1838 [see Figs. S1(a)–S1(c) in the Supporting information]. Grids were transferred into the pin under liquid nitrogen and then capped [Figs. S1(d)–S1(f)]. The capped pin was mounted onto the goniometer by hand and the cap was rapidly removed such that the grid was quickly exposed to the cryostream before liquid nitrogen had drained from the cap.

Data were measured at Diamond Light Source beamlines I24 and I04. In all instances, data were collected as 5° wedges of contiguous data with an oscillation width of 0.1° and an exposure time of 0.05 s. Data from I24 were collected on a Dectris PILATUS3 6M detector using an X-ray beam size of 6 × 9 µm [full width at half-maximum (FWHM)] at 100% transmission and a wavelength of 0.9686 Å, producing a flux of 3.0 × 10¹² photons s⁻¹. Data from I04 were recorded using a Dectris PILATUS 6M-F detector with a beam size of 11 × 5 µm (FWHM) at 100% transmission and a wavelength of 0.9795 Å, producing a flux of 2.8 × 10¹¹ photons s⁻¹. For each of the four conditions, data were collected from at least three independently prepared grids. At least 100 crystals were analysed for each condition on each grid. For the SEM-exposed crystals, the electron-microscopy images were used in combination with the optical microscope views of the X-ray beamline sample position to identify the crystals that had been exposed to electrons.
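The collection parameters above fix how many images make up each wedge and how long each crystal sits in the beam. The short sketch below works through that arithmetic for both beamlines; the function name is hypothetical, and the photon total is simply flux multiplied by exposure time (incident photons in the full beam at 100% transmission), not an absorbed dose.

```python
def wedge_summary(wedge_deg, oscillation_deg, exposure_s, flux_photons_per_s):
    """Return (images per wedge, total exposure in s, incident photons per wedge)."""
    n_images = round(wedge_deg / oscillation_deg)
    total_exposure_s = n_images * exposure_s
    total_photons = flux_photons_per_s * total_exposure_s
    return n_images, total_exposure_s, total_photons


# Parameters quoted in the text: 5 degree wedges, 0.1 degree images, 0.05 s each.
for beamline, flux in (("I24", 3.0e12), ("I04", 2.8e11)):
    n_images, seconds, photons = wedge_summary(5.0, 0.1, 0.05, flux)
    print(f"{beamline}: {n_images} images, {seconds:.2f} s, {photons:.2e} photons")
```

Each wedge therefore comprises 50 images and 2.5 s of exposure; because the I24 flux is about ten times that of I04, the incident photon count per wedge differs by the same factor.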

2.6. Data processing and analysis

In order to assess potential differences in diffraction quality, data were processed using DIALS (Winter et al., 2018) and then analysed using BLEND (Foadi et al., 2013). The synthesis mode of BLEND was then used to scale and merge the data collected from each treatment from a single grid.

In order to look for differences in initial diffraction quality between SEM-exposed and SEM-unexposed treatments, all datasets collected from the same beamline that were successfully integrated using DIALS were scaled together using dials.scale. The program dials.cosym was used to ensure consistent indexing prior to scaling (Gildea & Winter, 2018). The scale factor and relative B factor for the first image of each dataset were then extracted using dials.python to execute a Python script developed in-house.

Three replicate grids produced three complete scaled-and-merged datasets each for all four treatment groups. The mean values of key crystallographic statistics across these three replicates were compared using a one-way analysis of variance (ANOVA) method. The mean values of key statistics for the SEM-exposed and SEM-unexposed treatments were additionally compared with each other using Student's t-tests. The distributions of scale factors and relative B factors for the initial images from each dataset for each of the treatment groups were compared using Kolmogorov–Smirnov (KS) tests. These statistical analyses were carried out using GraphPad Prism 8.0 (GraphPad Software, La Jolla, California, USA).
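The study ran these comparisons in GraphPad Prism. For readers who want to see what the three tests compute, the following is a minimal pure-Python sketch of the test statistics only (no p-values). The function names are hypothetical and the numbers in the usage lines are made-up placeholders, not values from the study.

```python
from math import sqrt
from statistics import mean, variance


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def student_t(a, b):
    """Student's t statistic for two independent samples with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    d = 0.0
    for x in sorted(a) + sorted(b):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Placeholder numbers for illustration only (NOT data from the study).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(one_way_anova_f(groups))
print(student_t(groups[0], groups[1]))
print(ks_statistic(groups[0], groups[2]))
```

Turning any of these statistics into a p-value requires the corresponding reference distribution (F, t or Kolmogorov), which a statistics package would supply.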

3. Results and discussion

3.1. Monte Carlo simulations

The average penetration depth of 2 keV electrons in a simulated CPV14 crystal was 70.0 ± 19.8 nm and the maximum penetration depth was 109.8 nm (Fig. S2). However, it should be noted that experiments by Barnett et al. (2012) – which assessed electron penetration depth within amorphous water-ice crystals – suggest that CASINO simulations may underestimate the penetration depth of electrons at these low accelerating voltages. Still, these simulations provide an estimate of the electron interaction volume for CPV14 protein crystals. On this basis, for a 2 µm CPV14 crystal (8 µm³), 2 keV electrons scanned across the entire surface of the crystal have the potential to penetrate, on average, ∼3.5% of the total diffracting volume. For a 0.5 µm (0.125 µm³) crystal, this increases to ∼14% of the total diffracting volume. This analysis does not, however, inform about the impact of electrons on diffraction quality.
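The volume fractions quoted above follow from simple geometry. Assuming a cubic crystal lying face-up with the beam scanned over its whole upper face, the affected slab has volume (edge² × depth), so the fraction of the crystal reached is depth divided by edge length. The sketch below, with a hypothetical function name, reproduces this using the 70 nm mean penetration depth from the simulation.

```python
def penetrated_fraction(edge_um, depth_nm):
    """Fraction of a cube's volume lying within depth_nm of one fully scanned face."""
    edge_nm = edge_um * 1000.0
    # Slab volume / cube volume = (edge^2 * depth) / edge^3 = depth / edge.
    return min(depth_nm / edge_nm, 1.0)


MEAN_DEPTH_NM = 70.0  # mean penetration depth reported for 2 keV electrons

for edge_um in (2.0, 0.5):
    fraction = penetrated_fraction(edge_um, MEAN_DEPTH_NM)
    print(f"{edge_um} um crystal: {fraction * 100:.1f}% of volume")
```

Because the fraction scales as 1/edge, halving the crystal size doubles the proportion of the diffracting volume that the imaging electrons can reach.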

3.2. Sample preparation and SEM exposures

Plunge freezing the CPV14 crystals in liquid ethane using a Leica EM GP provided a reproducible method with which to mount crystals on cryoEM grids. The cuboid morphology of the crystals resulted in a preferential orientation of the crystals on the grids. The crystals generally lay with their faces parallel to the carbon film on the grids; rarely did the crystals sit on an edge or vertex. Although not explored here, methods designed by Wennmacher et al. (2019) have been shown to successfully combat preferential orientation of crystals on electron-microscopy grids. These methods are likely to be of particular use in future cases involving crystals from low-symmetry space groups which exhibit preferential orientation. Significant manual handling was required to transfer the plunge-frozen grids in and out of the SEM and subsequently onto the X-ray beamline whilst maintaining the samples at cryotemperatures. The combination of mechanical handling and transfer of sample grids in and out of a 1 × 10⁻⁶ mbar vacuum may have induced variation in sample treatments and could account for differences in crystal properties other than those caused by electron-beam exposure. In order to control for this grid-to-grid variation in crystal characteristics – which could potentially mask the effects of exposure to the electron beam – the data for SEM-exposed and SEM-unexposed crystals were taken from a single grid. For these samples, part of the grid was exposed to electrons, with the crystals in this section making up the population of SEM-exposed crystals. The remainder of the grid was not exposed to electrons and crystals in this section made up the SEM-unexposed population.

3.3. Data collection

An example SEM image of the CPV14 crystals is shown in Fig. 1(a). The crystals in this image form part of the population of crystals which were exposed to electrons prior to X-ray data collection. In order to collect X-ray diffraction data from these SEM-exposed crystals, each crystal had to be located and identified on the X-ray beamline using the optical microscope on-axis-viewing system (OAV). This was achieved using ‘finder’ electron-microscopy grids (see Section 2.3) such that each individual grid square was easily identifiable and indexable under both the SEM and OAV magnification schemes. Fig. 1(b) depicts the corresponding OAV image for the crystals shown in the SEM image. The improvement in resolution when using an SEM is evident. It is also easier to identify the vitreous crystallization solution surrounding the individual crystals and the areas of vitreous crystallization solution close to the Cu grid bars.


Figure 1
CPV14 crystals imaged using electrons and visible-light microscopy. (a) An example cryoSEM image of CPV14 crystals taken at an accelerating voltage of 2 kV with a working distance of 10 mm and an electron dose of 7.6 × 10⁻³ e⁻ Å⁻². The crystals in this image formed part of the SEM-exposed treatment group. The maximum achievable resolution under these conditions with this microscope is ∼8 nm. (b) An image taken using the optical microscope OAV of the I24 beamline shows the corresponding grid square to that shown in panel (a). The maximum achievable resolution with this optical microscope is 0.7 µm. In panel (b), the red crosshair indicates the microfocus beam position on I24 prior to X-ray diffraction data collection from a single CPV14 crystal. The equivalent position in panel (a) is indicated by a dashed white circle. In both panels, the scale bar indicates 10 µm.

To overcome the preferential orientation of the crystals on the grids, a concerted effort was made to collect data using different starting angles with respect to the orientation of the grid for the 5° wedges. The grids limited the rotation angles from which data could be collected. With the grid perpendicular to the beam, ∼±60° of data could be collected from both the ‘front’ and ‘back’ of the grid giving a total accessible range of ∼240°. Despite this limitation it was still feasible to obtain complete data because of the high symmetry of CPV14 crystals (space group I23).

3.4. Data processing and analysis

DIALS was used to process the 5° wedges of data. Where data could be successfully integrated, the resultant .mtz files were fed into BLEND. All clusters from the analysis mode of BLEND were scaled and merged before a single dataset with optimal completeness was taken forward from crystals measured from each grid for further analysis. For each dataset, the high-resolution cutoff was chosen based on CC1/2 > 0.3 (Karplus & Diederichs, 2015), which sometimes required an additional run of the program AIMLESS within the BLEND pipeline. The results of this data-processing step are presented in Tables 1 and 2.
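To make the CC1/2 > 0.3 criterion concrete, the sketch below picks a high-resolution cutoff from a list of resolution shells. It is an illustration of the rule only, not the procedure implemented in BLEND or AIMLESS; the function name is hypothetical and the shell values are made-up placeholders rather than statistics from Tables 1 and 2.

```python
def resolution_cutoff(shells, cc_half_min=0.3):
    """Return the d_min (in angstroms) of the last shell with CC1/2 above the threshold.

    shells: (d_min, cc_half) pairs ordered from low to high resolution.
    Scanning stops at the first shell that fails, so a noisy shell beyond
    that point cannot extend the cutoff. Returns None if no shell passes.
    """
    cutoff = None
    for d_min, cc_half in shells:
        if cc_half > cc_half_min:
            cutoff = d_min
        else:
            break
    return cutoff


# Placeholder shells for illustration only (NOT values from the study).
example_shells = [(4.0, 0.99), (3.0, 0.95), (2.5, 0.80),
                  (2.2, 0.45), (2.0, 0.25), (1.9, 0.35)]
print(resolution_cutoff(example_shells))
```

Stopping at the first failing shell is a design choice made here for the sketch; it avoids accepting an isolated outer shell whose CC1/2 rises above the threshold by chance.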

Table 1
Data-processing statistics

Values for the outer shell are given in parentheses.

Table 2
Data-processing statistics

Values for the outer shell are given in parentheses.

The overall values for maximum resolution, Rp.i.m. and CC1/2 were plotted for data collected for all four treatment groups (Fig. 2). At least three complete datasets were collected for each of the treatment groups. In the case of the SEM-exposed and SEM-unexposed datasets, complete datasets were collected for both treatment groups from each of the three replicate grids, i.e. one SEM-exposed and one SEM-unexposed dataset per grid, providing a total of six datasets. The mean value for each of the above-listed statistics was then calculated for the replicates of each sample treatment, and these means were compared across all treatment groups using a one-way ANOVA. These analyses showed no statistically significant difference between the mean values of maximum resolution, Rp.i.m. or CC1/2 across any of the treatment groups. A further Student's t-test was used to compare the mean values of these statistics between the SEM-exposed and SEM-unexposed datasets. This analysis also showed no statistically significant difference (p > 0.05) in these crystallographic statistics between data collected from crystals pre-exposed to a 2 keV SEM beam (SEM exposed) and crystals that were not exposed (SEM unexposed).
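As a rough illustration of the one-way ANOVA comparison described above, the sketch below computes the F statistic from first principles using only the Python standard library. The group values are hypothetical placeholders, not the statistics reported in Tables 1 and 2, and the function name is an assumption made for this example. Converting F to a p value requires the F distribution with the returned degrees of freedom, for which a library routine such as scipy.stats.f_oneway would normally be used.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical maximum-resolution values (Å) for four treatment groups.
max_resolution = {
    "untreated": [1.9, 2.0, 2.1],
    "SEM loaded": [2.0, 2.1, 2.2],
    "SEM unexposed": [1.9, 2.1, 2.2],
    "SEM exposed": [2.0, 2.0, 2.2],
}
f_stat, df_b, df_w = one_way_anova_f(list(max_resolution.values()))
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

A small F value means the spread between group means is comparable to the spread within groups, which is the situation the text reports for all three statistics.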


Figure 2
Plots of key data-processing statistics for merged datasets from the four treatment groups: untreated (cyan), SEM loaded (green), SEM unexposed (blue) and SEM exposed (red). Plots of (a) maximum resolution, (b) Rp.i.m. and (c) CC1/2 show each dataset as a coloured circle; the black line indicates the mean value. For the SEM-unexposed and SEM-exposed samples, the numbers next to the circles indicate which of the three grids the data were collected from. The data from grids 1 and 2 were collected on I24, and the data from grid 3 were collected on I04.

To further investigate the potential damage to the crystals caused by pre-exposure to SEM radiation, the 1151 integrated datasets collected on I24 were all scaled together. This was achieved using dials.cosym (Gildea & Winter, 2018), to ensure a consistent indexing scheme, followed by dials.scale. To assess whether the sample treatments significantly altered the initial diffraction of the crystals, both the scale factor and the relative B factor for the initial diffraction pattern from each dataset were extracted. These values are plotted as histograms for each treatment group in Fig. 3.


Figure 3
Histograms showing the initial scale factors and relative B factors for datasets collected from crystals across different treatments. Scale factors (a)–(d) and relative B factors (e)–(h) for the first frame of each dataset collected from individual CPV14 crystals were extracted following a single scaling job of all 1151 datasets with DIALS. These factors were then plotted as histograms, where each histogram contains the distribution of either initial scale factor or B factor within a given treatment group. The treatment groups were: untreated [cyan, (a) and (e)], SEM loaded [green, (b) and (f)], SEM unexposed [blue, (c) and (g)] and SEM exposed [red, (d) and (h)].

A comparison of these distributions between treatment groups by way of a Kolmogorov–Smirnov (KS) test revealed that the distributions of both scale factor and B factor for the SEM-unexposed and SEM-exposed treatments were not significantly different from each other (scale factors: p > 0.05, D = 0.07175; B factors: p > 0.05, D = 0.07613, where D is the KS distance). This analysis implies that the pre-exposure of the crystals to the electron dose used here did not significantly alter the diffraction quality of these crystals. Further KS tests comparing the distributions of initial scale factor and B factor between the other treatment groups were also carried out. The distributions of the scale factors for untreated samples were significantly different from the distributions of both the SEM-loaded and SEM-unexposed samples (p < 0.0001 in both tests). These results suggest that the grid handling involved in putting the grids into and out of vacuum at cryogenic temperatures has an effect on the diffraction quality of the crystals. Furthermore, the distributions for the SEM-loaded samples were significantly different from those of the SEM-unexposed samples (p < 0.0001 in all tests). This suggests that the additional time spent on the SEM cryostage in the case of the SEM-unexposed samples affects the diffraction quality of the crystals. This could be related to the vacuum environment or the cooling of the samples whilst in the SEM, or a combination of the two. An analysis of the temperature of the SEM shuttle (data not shown) indicated that the shuttle is kept below the devitrification temperature during transfer and whilst on the SEM stage; however, the temperature of the grid itself could not be measured during transfer. Given that the grid relies on thermal contact with the shuttle for effective cooling, it cannot be ruled out that inefficient thermal contact and thus insufficient cooling contributed to these significant differences. This study highlights the importance of detailed characterization of cryogenic handling workflows when dealing with sensitive biological samples.
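The KS distance D quoted above is the largest vertical gap between the empirical cumulative distribution functions of the two samples. A minimal standard-library sketch of that calculation is given below; the function name and the sample values are hypothetical, and the associated p value (not computed here) would normally come from a library routine such as scipy.stats.ks_2samp.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance D between two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every data point from either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical initial scale factors for two treatment groups.
unexposed = [0.92, 0.97, 1.01, 1.04, 1.10]
exposed = [0.90, 0.98, 1.00, 1.05, 1.12]
print(ks_distance(unexposed, exposed))
```

A value of D close to 0, as reported for the SEM-exposed versus SEM-unexposed comparison, means the two cumulative distributions almost coincide; D = 1 would mean the samples do not overlap at all.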

It is important to note that CPV14 is a well diffracting sample and that other crystals, such as those formed from high-molecular-weight membrane proteins, might be more susceptible to radiation damage. On this point, Holton & Frankel (2010) provide a useful discussion that offers some insight into the potential relationship between CPV14 and other, potentially more disordered or radiation-sensitive proteins. Their discussion compares the test-protein case of lysozyme with a large (10 MDa) protein crystal with a Wilson B factor of 61 Å². The calculations within the article suggest that this larger protein, with a Wilson B factor three times that of the lysozyme crystal, requires a volume close to two orders of magnitude larger to produce the equivalent diffraction resolution and quality. This suggests that such a crystal is approximately two orders of magnitude more sensitive to X-ray dose than the lysozyme counterpart described in the article. The soluble nature of CPV14 and its molecular weight make it more comparable with the lysozyme example of Holton & Frankel (2010) than with the 10 MDa protein. It is possible, therefore, that a more disordered or radiation-sensitive protein, for instance a membrane protein, could be approximately two orders of magnitude more sensitive to radiation damage than CPV14. Considering this, we believe that the incident electron doses used here still place us well within the damage threshold for even the most sensitive crystals, especially since the low-energy electrons used are predicted to penetrate no more than 150 nm into the surface of the samples.

4. Conclusions

The analyses described here support the use of low-voltage SEM imaging as a method to visualize and locate micrometre-sized protein crystals prior to X-ray diffraction experiments. Using 2 keV electrons at the doses described, the results presented here indicate no significant difference between the quality of X-ray diffraction data from crystals that were exposed to the SEM beam and those that were not. This is in line with the literature, which states that doses of 3 e⁻ Å⁻² are required to cause a reduction in high-resolution reflections (described as reflections < 3 Å resolution) (Chiu, 2006; Henderson, 1995; Hattne et al., 2018). These experiments were carried out using electron doses that were several orders of magnitude lower than this 3 e⁻ Å⁻² threshold and electron energies that leave the bulk of the protein crystals unpenetrated. Indeed, the lack of statistically significant or measurable radiation damage to the SEM-exposed samples supports the use of such doses and electron energies for imaging. In conclusion, low-voltage SEM imaging is an appropriate method for the visualization and subsequent alignment of samples below the resolution of optical microscopy.

5. Related literature

The following reference is cited in the Supporting information for this article: Zheng et al. (2009).


Manual evaluation

Although the structural biology community has achieved a high level of automation in data collection, data processing and structure solution in recent years, the process of structure determination still requires interpretation by researchers. This especially applies to low-quality maps with poor fit between experimental data and structural models. Visual residue-by-residue inspection by an experienced structural biologist remains the best way to judge quality. We therefore select representative structures of each SARS-CoV-2 protein, as well as those of particular interest for drug development, for manual evaluation. Certain problems are surprisingly common, such as peptide bond flips (Fig. 1c,d), rotamer errors, occupancy problems (Fig. 1e) and misidentification of small molecules or ions, for example, water as magnesium and chloride as zinc. Of note, zinc plays an important role in many SARS-CoV-2 proteins. We found many zinc coordination sites to be mismodelled, with the zinc ion missing or pushed out of the density and/or erroneous disulfide bonds between the coordinating cysteine residues (Fig. 1a,b,h). In addition, many coronavirus proteins are glycosylated at surface asparagine residues, but glycan sugars were often flipped from their correct orientation around the N-glycosidic bond (Fig. 1f,g). This can be avoided by using tools such as Privateer (ref. 19) and the automated carbohydrate building tool in Coot (ref. 20). It is important to note that deviation from expected behavior is not always an error and can also be a functionally relevant feature, for example, the strained geometries often found at catalytic sites. However, such deviations must be strongly supported by the experimental data. Of the structures we checked manually, we were able to substantially improve 31 in terms of model quality, data quality, or both. Below we give two examples to illustrate the importance of carefully inspecting the experimental data and resulting models.

All pictures except i are screenshots from the Coot v0.9.9 prerelease. Residual density and reconstruction maps are in blue-gray, difference electron density in red and green. a, SARS-CoV-1 Nsp14–Nsp10 (PDB 5C8T) histidine zinc-coordination site (B603), with residual density contour level 0.445, root mean square deviation (r.m.s.d.) 0.150. b, Histidine from a has been swapped in ISOLDE (ref. 25), leading to tetrahedral coordination of Zn²⁺; refinement was then performed using PDB-REDO (ref. 11) with manual addition of links. c, Proline A505 is modelled as trans in the RdRp complex (PDB 7BV2, left), but the density indicates a cis main chain conformation, shown in d. d, The deposited PDB entry was updated after we contacted the original authors. e, High difference electron density at residue A165 in the SARS-CoV-2 main protease (PDB 5RFA) due to an occupancy of only 0.44 rather than 1.00 near the potential inhibitor (left). Residual map contour level 0.54, r.m.s.d. 0.319; difference density at contour level 0.35, r.m.s.d. 0.114. f, SARS-CoV-2 spike receptor-binding domain complexed with human ACE2 (PDB 6VW1). This N-linked glycan is flipped approximately 180° around the N-glycosidic bond. After we contacted the original authors, this entry was revised (shown in g). g, Correction improves the density fit of the sugar chain. Residual map at contour level 0.311, r.m.s.d. 0.265. h, Disulfide bond A226–A189 in papain-like protease (PDB 6W9C), with electron density at contour level 0.214, r.m.s.d. 0.136; the other two cysteine residues remain uncoordinated. While the density map does not indicate a zinc, it is a zinc finger domain; the other NCS copies include a coordinated zinc at this position. i, AUSPEX (ref. 8) plot of SARS-CoV main protease (PDB 2HOB); ice rings are reflected by a bias in the intensity distribution (red). j, Ramachandran plot of torsion angles in the peptide backbone for the SARS-CoV Nsp10–Nsp14 dynamic complex (PDB 5NFY). In principle, there should only be a few outliers (red), as most peptide bonds adhere to typical angular distributions. Picture: CSTF/insidecorona.net.

Papain-like protease

SARS-CoV-2 nonstructural protein 3 (Nsp3) contains a papain-like protease domain that is essential for infection because it cleaves the viral polypeptide. The first structure of the SARS-CoV-2 papain-like protease (PDB 6W9C) was released on 1 April 2020, only three months after the viral genome was reported (GenBank MN908947.2; ref. 21). The structure was immediately used in drug design efforts. The overall completeness of the measured data, however, was only 57%. Examination of the raw data, available from https://proteindiffraction.org/ (ref. 10), revealed strong radiation damage, exacerbated by a poor data collection strategy. This could not be deduced from the PDB deposition, underlining the importance of making raw data available.

The crystal has 3-fold non-crystallographic symmetry (NCS), with each papain-like protease domain monomer containing a functionally important Zn²⁺ ion bound by four cysteine residues with similar Cβ–Sγ–Zn angles and Zn–Sγ bond lengths. Because of radiation damage, the Zn–S sites have poor density. In one NCS copy, the site has been modelled as a disulfide bond and two free cysteine residues (Fig. 1h), while the other two NCS copies coordinate the zinc atom with strongly varying Cβ–Sγ–Zn angles and Zn–S bond lengths. We reprocessed the images using XDS (ref. 22), a program for processing single-crystal X-ray diffraction images. The STARANISO server was used to determine and apply an anisotropic limit for the diffraction data. This careful manual intervention improved the overall quality of the data and improved the resolution from 2.7 to 2.6 Å, but the revised overall ellipsoidal completeness was only 44.5%. Adding zinc atoms to all sites, restraining the bond lengths and angles to the expected values and using NCS restraints and an overall higher weighting for ideal geometry, together with remodelling the side chains and water molecules, improved the electron density maps and lowered the R values by 4%. This exemplifies the interconnection between data collection, data processing and model building: even if the data collection strategy is not ideal, taking the resulting problems into account during data processing and refinement can drastically improve the final model.

A structure of the C111S mutant of the papain-like protease domain (PDB 6WRH) was released one month later. In this structure, the zinc sites were clearly resolved in all subunits. In the meantime, however, PDB 6W9C had been widely used in in silico drug design. Of the more than 140 research teams in the JEDI COVID19 GrandChallenge, a competition to find potential COVID-19 drugs in silico, 20% used this model. The availability of a better structure one month earlier would have increased their chances of success and saved computing and person hours.

RNA polymerase complex

SARS-CoV-2 replicates its single-stranded RNA genome using a macromolecular complex of RNA-dependent RNA polymerase (Nsp12 RdRp), Nsp7 and Nsp8. Earlier cryo-EM structures of the SARS-CoV-1 homologues (PDB 6NUR, PDB 6NUS) include a disordered unmodelled loop followed by a visible but short and irregular helix and a flexible C terminus. Density for this helix was poorly resolved, but the model had valid geometry. Our analysis of one of the first structures of the equivalent SARS-CoV-2 complex (PDB 7BTF) revealed that the sequence in this C-terminal region (part of the RNA-binding groove) was misaligned by nine residues (Fig. 2). This error was present in all related SARS-CoV-1 and SARS-CoV-2 structures, probably because new structure determination typically starts from an earlier model when one is available.

a, Overview with missing loop shown as a dashed line (PDB 7BV2); map at 2.4σ. Right, details of the C-terminal helix at 5σ. b, Lower resolution map and model (PDB 6NUS). Judging the side chain fit is difficult. c, Higher resolution map and model (PDB 7BV2) as deposited; the side chain fit is suboptimal due to the register error. d, Amended model for PDB 7BV2; the side chains now fit the density. The register shift is indicated by the labelled Tyr915. Picture: CSTF/insidecorona.net.

A structure of the RdRp complex bound to the nucleotide analogue remdesivir (PDB 7BV2; ref. 23) was released soon after and provided the basis for rational design of related drug candidates (ref. 24). This structure also featured the nine-residue sequence misalignment. We rebuilt the structure using ISOLDE (ref. 25), CaBLAM (ref. 6) and visual inspection, correcting some flipped or cis versus trans peptides (Fig. 1c,d) and three RNA conformers near remdesivir, including a backward adenosine base. We were also able to add several residues and waters with good density and geometry. Remdesivir is covalently attached to the RNA, but it is only present in an estimated ≤50% of the measured molecules (ref. 12). This means that the active site is a mixture of at least two different states, so unsurprisingly, the modelled Mg²⁺ ions and pyrophosphate are poorly supported by the experimental density and local contacts. This is of concern for subsequent in silico docking and drug design, which often take all atoms in the deposited structure as a fixed framework to build into. The remodelled structures of the complex may offer a more solid basis for drug design, even if the ≤50% occupancy of the active site was not widely discussed (ref. 12). It is notable that despite the large register error and various smaller issues, by traditional "summary" metrics the model appeared extremely good, with no Ramachandran or rotamer outliers and a clash score of 2, highlighting that direct visual inspection must remain a key step in any modelling process.

Although the problems discussed above were present in the originally deposited structures, nearly all are now corrected. This was achieved at least in part because we made corrected models available on our website and contacted the original authors of these structures with detailed descriptions, supporting them to deposit revised versions to the wwPDB at their discretion.


Title: Raw diffraction data preservation and reuse: Overview, update on practicalities and metadata requirements

A topical review is presented of the rapidly developing interest in and storage options for the preservation and reuse of raw data within the scientific domain of the IUCr and its Commissions, each of which operates within a great diversity of instrumentation. A résumé is included of the case for raw diffraction data deposition. An overall context is set by highlighting the initiatives of science policy makers towards an 'Open Science' model within which crystallographers will increasingly work in the future; this will bring new funding opportunities but also new codes of procedure within open science frameworks. Skills education and training for crystallographers will need to be expanded. Overall, there are now the means and the organization for the preservation of raw crystallographic diffraction data via different types of archive, such as at universities, discipline-specific repositories (Integrated Resource for Reproducibility in Macromolecular Crystallography, Structural Biology Data Grid), general public data repositories (Zenodo, ResearchGate) and centralized neutron and X-ray facilities. Formulation of improved metadata descriptors for the raw data types of each of the IUCr Commissions is in progress; some detailed examples are provided. Lastly, a number of specific case studies are presented, including an example research thread that provides complete open access to raw data.


Footnotes

¶ To whom correspondence should be sent at the * address. E-mail: chris.jacobsen@stonybrook.edu.

Author contributions: D. Shapiro, T.B., V.E., M.H., C.J., J.K., E.L., and D. Sayre designed research; D. Shapiro, P.T., T.B., V.E., M.H., C.J., J.K., E.L., H.M., and A.M.N. performed research; D. Shapiro, P.T., V.E., C.J., E.L., H.M., and A.M.N. analyzed data; D. Shapiro, P.T., T.B., V.E., M.H., C.J., and E.L. contributed new reagents/analytic tools; and D. Shapiro, V.E., M.H., C.J., J.K., and D. Sayre wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: XDM, x-ray diffraction microscopy; CCD, charge-coupled device; STXM, scanning transmission x-ray microscope.