Partnering Data Collection and Reduction in the Beamline Environment
July 27, 2012, Harvard Medical School, Boston, MA
Room TMEC 423 (Tosteson Medical Education Center, 260 Longwood Ave)
Workshop Organizer: Nick Sauter, LBNL
Meeting Summary
The aim was to explore the latest ideas and developments in beamline data processing. DIALS (Diffraction Integration for Advanced Light Sources) is a new software collaboration between the NIH-funded group led by Nick Sauter at Berkeley Lab (LBNL) and the BioStruct-X Work Package 6 consortium led by Gwyndaf Evans at Diamond Light Source. New data reduction approaches are needed to handle high framing rates from pixel array detectors, and to improve the outcome from marginal data. Both groups recognize the benefit of an open-source software development strategy like that used in the Computational Crystallography Toolbox (cctbx), initially introduced by LBNL. Its support for rapid prototyping of new algorithms, together with its existing tools for managing collaboration, makes cctbx a good starting point for the DIALS effort. We implemented test code for Bragg spot integration using 2D pixel summation, and plan to work on 3D profile fitting for delivery within a year. High throughput will be achieved initially with multicore processing, and possibly with GPU computing in the future.
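To make the idea of 2D pixel summation concrete, the following is a minimal sketch and not the DIALS test code. It assumes a flat local background, Poisson counting statistics, and pixel masks supplied by the caller; the function name and arguments are illustrative.

```python
def integrate_spot(image, peak_mask, background_mask):
    """Summation integration of one Bragg spot on a single 2D image.

    image           -- list of rows of raw pixel counts
    peak_mask       -- set of (row, col) pixels assigned to the spot
    background_mask -- set of (row, col) pixels used to estimate background
    Returns (intensity, variance).
    """
    n_peak = len(peak_mask)
    n_bg = len(background_mask)
    if n_peak == 0 or n_bg == 0:
        raise ValueError("peak and background masks must be non-empty")

    peak_sum = sum(image[r][c] for r, c in peak_mask)
    bg_sum = sum(image[r][c] for r, c in background_mask)

    # Assumption: the background under the peak equals the mean of the
    # surrounding background pixels.
    bg_mean = bg_sum / n_bg
    intensity = peak_sum - n_peak * bg_mean

    # Assumption: Poisson statistics, so the variance of a sum of counts is
    # the sum itself; the background term is scaled by (n_peak / n_bg)^2.
    variance = peak_sum + (n_peak / n_bg) ** 2 * bg_sum
    return intensity, variance
```

A real detector would need further terms in the variance (for example gain or readout effects), which this sketch deliberately leaves out.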
Graeme Winter (Diamond Light Source) described a software approach that would allow light source facilities to implement code specializations in the field, e.g. to support new hardware or algorithms not supported by the central code base. Such ideas help make the code "future proof", so Graeme's code to represent X-ray detectors (dxtbx) will be included in the initial release. Xia2 is a data processing pipeline now supported at Diamond, which could act as a conduit for delivering the new DIALS software to the beamline (users would run xia2, which would delegate work to DIALS). FastDP is another Diamond development that runs XDS data reduction on up to 480 CPU cores, merging large (1800-frame) Pilatus datasets in 2 minutes. David Waterman (CCP4) sees DIALS as an opportunity to correct the deficiencies of current integration programs. In particular, the error on integrated Bragg intensities is poorly modeled, and the error estimates need to be replaced with new models that account for the physics of each detector type and X-ray source.
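One way a facility could specialize code in the field is a registry in which site-specific classes take precedence over general ones. The sketch below only illustrates that pattern; it is not the dxtbx API, and every class name, header key, and field in it is hypothetical.

```python
class FormatRegistry:
    """Keeps detector-format classes; later registrations take precedence."""

    def __init__(self):
        self._formats = []

    def register(self, cls):
        # Usable as a class decorator: records the class and returns it.
        self._formats.append(cls)
        return cls

    def find(self, header):
        # Search newest first, so a site-specific class registered after
        # the central ones is preferred when it understands the header.
        for cls in reversed(self._formats):
            if cls.understands(header):
                return cls
        raise LookupError("no registered format understands this header")


registry = FormatRegistry()


@registry.register
class GenericPilatus:
    """Hypothetical description shipped with the central code base."""

    @staticmethod
    def understands(header):
        return header.get("detector") == "PILATUS"

    @staticmethod
    def describe(header):
        return {"rotation_axis_sign": +1}


@registry.register
class BeamlineXPilatus(GenericPilatus):
    """Hypothetical site override added locally, without central changes."""

    @staticmethod
    def understands(header):
        return (GenericPilatus.understands(header)
                and header.get("site") == "beamline-x")

    @staticmethod
    def describe(header):
        # Assumed local quirk: the goniometer rotates the opposite way.
        return {"rotation_axis_sign": -1}
```

With this arrangement, `registry.find(header).describe(header)` picks up the local override only for images whose header identifies the site, and falls back to the central class otherwise.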
Jon Schuermann and David Neau of NE-CAT described Rapid Automated Processing of Data (RAPD), which performs massively parallel automatic analysis of all user datasets by pipelining standard software tools, with final output presented within a Web interface. Results are available in minutes, fast enough to influence real-time user decisions. However, the same factors driving DIALS development (speed and handling of problem datasets) also limit the performance at NE-CAT.
Chris Nielson (ADSC) speculated that the toolbox approach of DIALS may never take the place of standard data processing packages, but might become a state-of-the-art description of best practices. The need for peer review of each new module was discussed, with a possible publication venue in the software section of an IUCr journal. Also, we can adopt the regular practice of documenting mathematical formulae along with the source code (in LaTeX format), thus preserving the thought process behind the algorithm. An open architecture will allow vendors and hardware developers to experiment with improvements in detectors and calibration without involving the authors of standard data processing programs.
Christian Brönnimann of Dectris Ltd. pointed out that physical properties of subsequent PAD products (Pilatus3 and Eiger) will differ from the current Pilatus2 architecture and will require new physical models. For example, the Pilatus3 retrigger mode will handle count rates up to 10^7 cps, an order of magnitude higher than current performance. The Eiger sensor will be much thicker (1 mm vs. 300 µm) and require a different thickness correction. It will be essential to have a clear dividing line between data corrections made by the vendor process, and those made within downstream software such as DIALS. Fast framing rates (up to 3 kHz, 5 GB/s for the Eiger) will require new file formats (beyond CBF) for fast storage. HDF5 is a possible candidate presently being studied.
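One plausible ingredient of a thickness correction is the fraction of photons absorbed in the sensor, which by the Beer-Lambert law depends on the path length through the slab and therefore on both thickness and angle of incidence. The sketch below is an illustration under that assumption, not a vendor or DIALS formula; the attenuation coefficient is a caller-supplied parameter, and no real material value is implied.

```python
import math


def absorbed_fraction(mu_per_mm, thickness_mm, incidence_deg=0.0):
    """Fraction of incident photons absorbed in a flat sensor slab.

    mu_per_mm     -- linear attenuation coefficient of the sensor material
                     at the X-ray energy in use (must be supplied by caller)
    thickness_mm  -- sensor thickness
    incidence_deg -- angle between the X-ray and the sensor normal
    """
    if not 0.0 <= incidence_deg < 90.0:
        raise ValueError("incidence angle must be in [0, 90) degrees")
    # Oblique rays travel a longer path through the slab.
    path_mm = thickness_mm / math.cos(math.radians(incidence_deg))
    return 1.0 - math.exp(-mu_per_mm * path_mm)


def thickness_correction(mu_per_mm, thickness_mm, incidence_deg):
    """Factor scaling an oblique-incidence count to normal-incidence efficiency."""
    normal = absorbed_fraction(mu_per_mm, thickness_mm, 0.0)
    oblique = absorbed_fraction(mu_per_mm, thickness_mm, incidence_deg)
    return normal / oblique
```

Because the exponent scales with thickness, the same angle of incidence gives a different correction factor for a 1 mm sensor than for a 300 µm one, which is why a model tuned to one sensor cannot simply be reused for the other.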
Herbert Bernstein (Dowling College) described aspects of data management planned for NSLS-II. At raw data generation rates of 9 GB/s, it will be impossible to store all the data. However, we can be much smarter than simply reducing data to Bragg intensities and discarding the original; instead, we can do lossy (up to 50:1) compression giving images that are nearly indistinguishable from the original, as far as data processing statistics are concerned. To guard against fraud and legal challenge, a small fraction (5%) of images will be retained with lossless compression to prove their equivalence to lossy data. The current CBF library provides 4:1 lossless compression with the Hammersley byte-offset algorithm; current work focuses on extending that algorithm to give 10:1 lossless compression, or adopting a different lossless algorithm (JP Abrahams). Also, an HDF5 backend is being added to CBFlib to support metadata access in the gigabyte data rate context.
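The byte-offset idea can be sketched as follows: each pixel is stored as its difference from the previous pixel, using one byte when the difference is small and escaping to wider integers otherwise. This is a simplified illustration written from the general description of the scheme, not the CBFlib implementation; the function names are illustrative, and only the 8, 16 and 32 bit cases are handled.

```python
import struct


def byte_offset_encode(pixels):
    """Losslessly encode a sequence of integer pixel values as deltas."""
    out = bytearray()
    previous = 0
    for value in pixels:
        delta = value - previous
        if -127 <= delta <= 127:
            out += struct.pack("<b", delta)           # one signed byte
        elif -32767 <= delta <= 32767:
            out += b"\x80"                            # escape: 8 -> 16 bit
            out += struct.pack("<h", delta)
        elif -2147483647 <= delta <= 2147483647:
            out += b"\x80" + b"\x00\x80"              # escape: 16 -> 32 bit
            out += struct.pack("<i", delta)
        else:
            raise ValueError("delta too large for this sketch")
        previous = value
    return bytes(out)


def byte_offset_decode(data):
    """Invert byte_offset_encode, returning the list of pixel values."""
    pixels = []
    previous = 0
    pos = 0
    while pos < len(data):
        (delta,) = struct.unpack_from("<b", data, pos)
        pos += 1
        if delta == -128:                             # 0x80 escape marker
            (delta,) = struct.unpack_from("<h", data, pos)
            pos += 2
            if delta == -32768:                       # 0x8000 escape marker
                (delta,) = struct.unpack_from("<i", data, pos)
                pos += 4
        previous += delta
        pixels.append(previous)
    return pixels
```

When neighbouring pixels differ by small amounts, almost every pixel costs a single byte, so images stored as 32-bit integers shrink by roughly a factor of four, in line with the 4:1 figure quoted above.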