DIALS-West-1: Partnering Data Collection and Reduction in the Beamline Environment
July 27, 2012, Harvard Medical School, Boston, MA
Room TMEC 423 (Tosteson Medical Education Center, 260 Longwood Ave)
Workshop Organizer: Nick Sauter, LBNL

Meeting Summary
The aim was to explore the latest ideas and developments in beamline data processing. DIALS (Diffraction Integration for Advanced Light Sources) is a new software collaboration between the NIH-funded group led by Nick Sauter at Lawrence Berkeley National Laboratory (LBNL) and the BioStruct-X Work Package 6 consortium led by Gwyndaf Evans at Diamond Light Source. New data reduction approaches are needed to handle the high framing rates of pixel array detectors and to improve the outcome from marginal data. Both groups recognize the benefit of an open-source software development strategy like that used in the Computational Crystallography Toolbox (cctbx), originally introduced by LBNL. The ability to rapidly prototype new algorithms, together with existing tools for managing collaboration, makes cctbx a good starting point for the DIALS effort. We implemented test code for Bragg spot integration using 2D pixel summation (sketched below), and plan to work on 3D profile fitting for delivery within a year. High throughput will be achieved initially with multicore processing, and possibly with GPU computing in the future.
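To illustrate the 2D summation approach, here is a minimal NumPy sketch (not the actual DIALS test code; the mask convention and error propagation shown are assumptions of this example):

    import numpy as np

    def integrate_2d_summation(image, peak_mask, bg_mask):
        """Integrate one Bragg spot by 2D pixel summation.

        image     : 2D array of pixel counts covering the spot region
        peak_mask : boolean array, True for pixels assigned to the peak
        bg_mask   : boolean array, True for surrounding background pixels
        Returns (intensity, sigma) assuming Poisson counting statistics.
        """
        n_peak = peak_mask.sum()
        # Estimate the local background level per pixel from the surround.
        background = image[bg_mask].mean()
        # Summation integration: total counts minus expected background.
        raw_sum = image[peak_mask].sum()
        intensity = raw_sum - n_peak * background
        # Poisson variance of the summed counts, plus the variance of the
        # background estimate propagated over the peak pixels.
        var = raw_sum + n_peak**2 * image[bg_mask].var() / bg_mask.sum()
        return intensity, np.sqrt(var)

The planned 3D profile fitting would replace the simple sum with a fit of a learned spot profile across adjacent frames, gaining precision for weak reflections.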
Graeme Winter (Diamond Light Source) described a software approach that would allow light source facilities to implement code specializations in the field, e.g. to accommodate new hardware or algorithms not supported by the central code base. Such ideas help make the code "future proof", so Graeme's code for representing X-ray detectors (dxtbx) will be included in the initial release.
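The flavor of such field-installable specializations can be conveyed with a small plugin registry; the class and function names below are hypothetical illustrations, not the actual dxtbx API:

    # Hypothetical plugin registry in the spirit of dxtbx; names are
    # illustrative, not the real dxtbx interface.
    _FORMATS = []

    def register_format(cls):
        """Class decorator: add a detector-format reader to the registry."""
        _FORMATS.append(cls)
        return cls

    @register_format
    class FormatPilatusCBF:
        @staticmethod
        def understands(filename):
            # A real reader would inspect the file header; here we
            # simply test the extension.
            return filename.endswith(".cbf")

        def read(self, filename):
            raise NotImplementedError("decode CBF pixel data here")

    def find_reader(filename):
        """Return an instance of the first format that claims the file."""
        for cls in _FORMATS:
            if cls.understands(filename):
                return cls()
        raise ValueError("no registered format understands %s" % filename)

A facility could then drop a new format class into the registry to support a local detector, without touching the central code base.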
Xia2 is a data processing pipeline now supported at Diamond that could act as a conduit for delivering the new DIALS software to the beamline (users would run xia2, which would delegate work to DIALS). FastDP is another Diamond development; it runs XDS data reduction on up to 480 CPU cores, merging large (1800-frame) Pilatus datasets in 2 minutes. David Waterman (CCP4) sees DIALS as an opportunity to correct deficiencies of current integration programs. In particular, the error on integrated Bragg intensities is poorly modeled, and needs to be improved with new models that account for the physics of each detector type and X-ray source.
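One common form of such a correction, shown here purely as an illustration (it is the variance-inflation model used by programs such as XDS and Aimless, not necessarily what DIALS will adopt), rescales the counting-statistics estimate:

    def corrected_sigma(I, sigma, sdfac=1.0, sdb=0.0, sdadd=0.0):
        """Inflate a counting-statistics error estimate.

        sigma'^2 = sdfac^2 * (sigma^2 + sdb*I + (sdadd*I)^2)

        The three parameters are refined so that the observed scatter
        among symmetry-equivalent reflections matches the predicted
        errors; a physically motivated model would derive them from the
        detector and source characteristics instead.
        """
        return sdfac * (sigma**2 + sdb * I + (sdadd * I) ** 2) ** 0.5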
Jon Schuermann and David Neau of NE-CAT described Rapid Automated Processing of Data (RAPD), which performs massively parallel automatic analysis of all user datasets by pipelining standard software tools, with the final output presented in a Web interface. Results are available within minutes, fast enough to influence real-time user decisions. However, the same factors driving DIALS development (speed and the handling of problem datasets) also limit RAPD's performance at NE-CAT.
Chris Nielson (ADSC) speculated that the toolbox approach of DIALS may never take the place of standard data processing packages, but might instead become a state-of-the-art description of best practices.
The need for peer review of each new module was discussed, with a possible publication venue in the software section of an IUCr journal. Also, we can adopt the regular practice of documenting mathematical formulae alongside the source code (in LaTeX format), thus preserving the thought process behind each algorithm; an example follows. An open architecture will allow vendors and hardware developers to experiment with improvements in detectors and calibration without involving the authors of standard data processing programs.
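As a hypothetical illustration of the practice, the formula can live in the docstring in LaTeX form (here using the Sphinx math directive), where documentation tools can typeset it next to the implementation:

    def summation_intensity(counts, n_peak, background):
        r"""Summation-integrated intensity of one reflection.

        .. math::

           I = \sum_{p \in \mathrm{peak}} c_p - n_{\mathrm{peak}} \bar{b}

        where :math:`c_p` are the pixel counts, :math:`n_{\mathrm{peak}}`
        is the number of peak pixels, and :math:`\bar{b}` is the mean
        background per pixel estimated from the surrounding region.
        """
        return counts - n_peak * background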
Christian Brönnimann of Dectris Ltd. pointed out that the physical properties of upcoming PAD products (Pilatus3 and Eiger) will differ from the current Pilatus2 architecture and will require new physical models. For example, the Pilatus3 retrigger mode will handle count rates up to 10⁷ cps, an order of magnitude higher than current performance. The Eiger sensor will be much thicker (1 mm vs. 300 µm) and will require a different thickness correction. It will be essential to have a clear dividing line between data corrections made by the vendor's processing and those made within downstream software such as DIALS. Fast framing rates (up to 3 kHz, 5 GB/s for the Eiger) will require new file formats (beyond CBF) for fast storage. HDF5 is a possible candidate presently being studied.
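A sketch of what HDF5 storage of an image series could look like, using the h5py package (the dataset path, chunking, and compression choices here are assumptions, not a settled format):

    import h5py

    def write_image_stack(path, images):
        """Append-style storage of a detector image series in HDF5.

        images : iterable of 2D integer arrays, all the same shape
        """
        images = iter(images)
        first = next(images)
        with h5py.File(path, "w") as f:
            dset = f.create_dataset(
                "entry/data",                     # assumed dataset path
                shape=(1,) + first.shape,
                maxshape=(None,) + first.shape,   # growable frame axis
                chunks=(1,) + first.shape,        # one frame per chunk
                compression="gzip",
                dtype=first.dtype,
            )
            dset[0] = first
            for i, frame in enumerate(images, start=1):
                dset.resize(i + 1, axis=0)
                dset[i] = frame
            # Metadata travels in the same file; field name is assumed.
            dset.attrs["frame_rate_hz"] = 100.0

Chunking one frame per chunk lets a reader fetch any single image without decompressing its neighbors, which matters at these framing rates.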
Herbert Bernstein (Dowling College) described aspects of data management planned for NSLS-II. At raw data generation rates of 9 GB/s, it will be impossible to store all the data. However, we can be much smarter than simply reducing the data to Bragg intensities and discarding the originals; instead, we can apply lossy compression (up to 50:1), giving images that are nearly indistinguishable from the originals as far as data processing statistics are concerned. To guard against fraud and legal challenge, a small fraction (5%) of images would be retained with lossless compression to prove their equivalence to the lossy data. The current CBF library provides 4:1 lossless compression with the Hammersley byte-offset algorithm (sketched below); current work focuses on extending that algorithm to give 10:1 lossless compression, or adopting a different lossless algorithm (due to J. P. Abrahams). Also, an HDF5 backend is being added to CBFlib to support metadata access at gigabyte-per-second data rates.
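The byte-offset idea is simple enough to sketch. Each pixel is stored as a delta from its predecessor, in one signed byte when it fits, with escape codes to wider integers otherwise (a simplified encoder; the full CBF specification also defines a 64-bit escape):

    import struct

    def byte_offset_encode(pixels):
        """Simplified CBF-style byte-offset encoding of a pixel sequence.

        Each value is stored as a delta from its predecessor: one signed
        byte if the delta is in [-127, 127]; otherwise the escape byte
        0x80 followed by a little-endian signed 16-bit value, with a
        further 16-bit escape (0x8000) to a signed 32-bit value.
        """
        out = bytearray()
        prev = 0
        for value in pixels:
            delta = value - prev
            prev = value
            if -127 <= delta <= 127:
                out += struct.pack("<b", delta)
            elif -32767 <= delta <= 32767:
                out += b"\x80" + struct.pack("<h", delta)
            else:
                out += b"\x80" + struct.pack("<h", -32768) + struct.pack("<i", delta)
        return bytes(out)

The scheme compresses well because neighboring pixels in a diffraction image usually differ by only a few counts, so most deltas fit in a single byte.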