CCTBX tutorial Crystallographic computing school, Melk, 08/2019

Script reading a model file

This basic script illustrates how to read in a model file.

Files

Getting started:

Open the file tutorial_1_v0.py in your source code editor.
Most of this script is so called boilerplate code, i.e. code that in some form or shape is found in most Python scripts.

At the beginning of the script are import statements, which import the modules needed for a task.
The bottom lines represent best practice. They enable the script to be imported and used from other Python scripts.
The first two lines of the run() function are a minimalistic - but often sufficient - way to give users a hint how to use the script. It works both for someone reading the source code of the script, and a user running the script without arguments.

Task Execute the script by typing python tutorial_1_v0.py.

The output is:

Traceback (most recent call last):
  File "tutorial_1_v0.py", line 15, in 
    run(sys.argv[1:])
  File "tutorial_1_v0.py", line 8, in run
    raise RuntimeError("Please specify one pdb file name.")
RuntimeError: Please specify one pdb file name.

In addition to showing the error message, Python shows exactly where the error originates. This is often extremely helpful.

The important part of the script is in these two lines:

The first line creates the class for reading a model from a file or string. It is the main input method for both PDB and mmCIF files; it will automatically determine the actual format and return the appropriate data type.
The second line creates the model class, which serves as container for model information to be passed between library functions. For example, it selects, copies and updates model information, stores libraries, and ensures consistency between lower level objects.

Use print and help()

While analyzing a script, insert print() statements and run the script to find out more about the objects. It may also be useful to insert help(obj) to see the attributes and methods of obj, where obj can be any of the objects created in the script.
For example, let's try to use the show method of the model.composition() object, which produces output that is useful to give the user a quick overview of what is in the PDB file. Add the following lines to your script.

The help() method calls the built-in Python help system. It lets you read the docstring and get an idea of what attributes and methods a class might have. (Press q to exit the help mode)

Task Run the script by using the Python help method for the composition class.
Help on composition in module mmtbx.model.statistics object:

  class composition(__builtin__.object)
   |  Methods defined here:
   |
   |  __init__(self, pdb_hierarchy)
   |
   |  result(self)
   |
   |  show(self, log, prefix='')
   |
   |  ----------------------------------------------------------------------
   |  Data descriptors defined here:
   |
   |  __dict__
   |      dictionary for instance variables (if defined)
   |
   |  __weakref__
   |      list of weak references to the object (if defined)
  (END)

We can see from the help output that the composition class has two methods ('show' and 'result') and that the 'show' method requires setting the parameter 'log'. Set it to sys.stdout.

Task Use the method composition.show() and run the script with the file 1aba_pieces.pdb.

The output is:

 Number of:
   all atoms      : 48
   H or D atoms   : 0
   chains         : 2
   a.a. residues  : 5
   nucleotides    : 0
   water          : 4
   other (ligands): 0
 Ligands: None

Truncate to Poly-Ala - Basic

Files

The PDB hierarchy

Let's continue working with the model file 1aba_pieces.pdb. We will write a basic script to truncate all amino acids in the model to Poly-Ala. To achieve this, we will use the pdb_hierarchy object, which is a five-deep nested data structure:

model
  chain
    residue_group
      atom_group
        atom

The hierarchy object is obtained with the get_hierarchy method of the model class:

The model, chain, and atom levels of the hierarchy object are probably immediately obvious to someone familiar with the content of model files (such as a PDB file). Note that that is no 'residue type' in the data structure. Instead, there are the two types residue_group and atom_group. They are related to alternative conformations. If there are no alternative conformations in the model, all residue groups contain exactly one atom group, which contains all the atoms of a residue. A file with alternative conformations will lead to residue groups with multiple atom groups, one for each conformer. (Note: about a quarter of the files in the PDB database contain alternative conformations).

To truncate amino-acid residues to alanine, we need to know which residues are amino-acids, and the atom names. A more detailed presentation of the hierarchy object shows where we can find this information:

model(s)
  id
  chain(s)
    id
    residue_group(s)
      resid
      atom_group(s)
        altloc, resname
        atom(s)
          name
          segid
          element
          charge
          xyz
          occ
          b
          uij

(The list under each level is not exhaustive.)

Loop over the hierarchy

Before doing the truncation, let's see what is actually in the levels of the hierarchy. Loop through all the levels of the hierarchy and print out:

  • Chain ID
  • Residue name
  • Altloc
  • Atomname
The code below shows how to print the chain ID for each chain.

Output:

Loop over hierarchy
Chain:  A
Chain:  B
Task
Complete the loops through the hierarchy levels to print out atom names in the different atom_groups.
Loop over hierarchy:
Chain:  A
  Resnumber:     3
    Resname: LYS, Altloc:
       N
       CA
       C
       O
       CB
       CG
       CD
       CE
       NZ
  Resnumber:     4
    Resname: VAL, Altloc:
       N
       CA
       C
       O
       CB
       CG1
       CG2
  Resnumber:     5
    Resname: TYR, Altloc:
       N
       CA
       C
       O
       CB
       CG
       CD1
       CD2
       CE1
       CE2
       CZ
       OH
  Resnumber:     6
    Resname: GLY, Altloc:
       N
       CA
       C
       O
  Resnumber:     7
    Resname: TYR, Altloc:
       N
       CA
       C
       O
       CB
       CG
       CD1
       CD2
       CE1
       CE2
       CZ
       OH
Chain:  B
  Resnumber:    96
    Resname: HOH, Altloc:
       O
  Resnumber:   161
    Resname: HOH, Altloc:
       O
  Resnumber:   169
    Resname: HOH, Altloc:
       O
  Resnumber:   193
    Resname: HOH, Altloc:
       O

TIP: You can see the attributes and methods of each level by using the dir method to return a list of valid attributes of the object, for example, replace line 21 with print(dir(rg)). Don't get confused if there are many methods/attributes. Look out for something ressembling residue names, altlocs and atom names.

Let's look at Tyr 85, which has an alternate conformation for the side chain. This example highlights the usefulness of residue groups and atom groups. Tyr 85 has three atom groups, one for the altloc " " (blank), one for "A" and one for "B".

Resnumber:    85
    Resname: TYR, Altloc:
      N, CA, C, O, CB
    Resname: TYR, Altloc: A
      CG, CD1, CD2, CE1, CE2, CZ, OH
    Resname: TYR, Altloc: B
      CG, CD1, CD2, CE1, CE2, CZ, OH

Simple truncation

We have residue names and atom names now, but to perform the truncation, we need to know which residues are amino acids (because we only want to truncate amino acids), and which atom names we want to keep.
The iotbx.pdb module contains the sub-module amino_acid_codes. This sub-module contains two Python dictionaries, one of which is (shortened):

one_letter_given_three_letter = {
"ALA": "A",
"ARG": "R",
...
"TYR": "Y",
"VAL": "V"}

We don't need the one-letter codes, but we can re-use the keys of this dictionary to efficently decide if a residue name corresponds to an amino acid. Here are the relevant lines to add in your script.

For the atom names, we use a Python set. Here are the relevant lines to add in your script.

We use a Python set because it uses hashing techniques for element lookup when processing the "in" in the if statement. For a small list like here it doesn't really matter, but in Python it is so easy to use advanced hashing techniques, simply by converting the list of atom names to a set, there is no reason not to take advantage of them.

Task
Modify your script so that it only prints out atom names that are NOT in the poly-alanine set and wich are in amino acid residues.
Loop over hierarchy:
Chain:  A
  Resnumber:     3
    Resname: LYS, Altloc:
       CG
       CD
       CE
       NZ
  Resnumber:     4
    Resname: VAL, Altloc:
       CG1
       CG2
  Resnumber:     5
    Resname: TYR, Altloc:
       CG
       CD1
       CD2
       CE1
       CE2
       CZ
       OH
  Resnumber:     6
    Resname: GLY, Altloc:
  Resnumber:     7
    Resname: TYR, Altloc:
       CG
       CD1
       CD2
       CE1
       CE2
       CZ
       OH
Chain:  B
  Resnumber:    96
  Resnumber:   161
  Resnumber:   169
  Resnumber:   193

Now that we know which residues we want to truncate, and which atom names we want to keep, we just need one more line to remove the side chain atoms:

This removes the atom from the atom group (note that atom stands for the object).
Now let's save the modified hierarchy to a file. The method to write a PDB file from the hierarchy is write_pdb_file.

Task Modify the script to save the modified hierarchy in a PDB file. Use a molecular viewer to look at the output. Did it work correctly?

TIP: To use the method, you'll need to pass a file name. Use the help method to find out what the parameter name for the file name is.

Truncate to Poly-Ala - Improved

Files

Rare cases

For most practical purposes, the script from the previous task is completely sufficient. However, there are some files in the PDB for which this is not true. One example is the structure with the PDB ID 1ysl.
Run your script on the file resname_mix.pdb and look at the output (use a text editor or PyMol; the file cannot be opened in Coot).

The script did not remove the side chain atoms of the modified residue CSD. This residue has in fact an alternate conformation. Conformation "A" is the modified amino acid CSD, conformation "B" is a CYS.

HETATM 3907  N  ACSD B 111      25.006  36.731  16.222  0.50 18.83           N
HETATM 3908  CA ACSD B 111      25.536  35.903  15.152  0.50 19.90           C
HETATM 3909  CB ACSD B 111      25.931  36.658  13.876  0.50 21.09           C
HETATM 3910  SG ACSD B 111      25.414  38.295  13.671  0.50 26.29           S
HETATM 3911  C  ACSD B 111      26.713  35.054  15.562  0.50 19.23           C
HETATM 3912  O  ACSD B 111      27.472  34.533  14.697  0.50 20.10           O
HETATM 3913  OD1ACSD B 111      23.793  38.008  13.181  0.50 30.17           O
HETATM 3914  OD2ACSD B 111      25.111  39.102  15.048  0.50 26.06           O
ATOM   3915  N  BCYS B 111      24.996  36.697  16.246  0.50 13.39           N
ATOM   3916  CA BCYS B 111      25.522  35.913  15.123  0.50 16.53           C
ATOM   3917  C  BCYS B 111      26.790  35.104  15.498  0.50 15.20           C
ATOM   3918  O  BCYS B 111      27.342  34.391  14.660  0.50 16.26           O
ATOM   3919  CB BCYS B 111      25.840  36.879  13.947  0.50 20.05           C
ATOM   3920  SG BCYS B 111      24.645  38.257  14.039  0.50 29.86           S

Rare cases like this are the reason why the residue_group and atom_group levels are needed in the hierarchy. Here are two different residue names for the same member of a chain. Even though this sitution is rare, it is entirely plausible and valid, and a comprehensive PDB processing library has to be able to handle it.
Our script will only truncate the CYS residue, but it would be better if it also truncated the non-standard CSD residue in the A alternative conformation. Let's find out what it takes to achieve this.

One way would be to add the 3 letter code for all modified amino acids to the dictionary. This however relies on the completeness of the dictionary; what happens if the 3 letter code changes or if other entries are added?
Another possibility is to check if there is at least one amino acid in a residue group, and if so, apply the truncation to all residues in the group, even if they don't have a standard residue name. This means, for each residue group we have to loop over the atom groups twice, first to scan for at least one standard amino-acid residue name, and if there is one, a second time to do the truncation. This will double effort (and computing time), but it is required to accomodate all possible cases.

This is the part of the script we have to work on:

Task Modify the script so that it also truncates CSD.

TIP: Replace the if-statement in line 2 (above) with a function that returns True if the residue_group contains at least one amino-acid (and False otherwise).

Refining the script

The script now truncates the model and can take care of rare cases involving alternate conformations. However, it doesn't tell the user anything about the performed tasks. For example, it would be interesting to know how many atoms were deleted, how many residue are affected, and how many residue are unchanged.

To get the desired information, we need counters, and we need to initialize them before we enter the nested loops over the hierarchy:

When the loops over the hierarchy are finished, we print the counts:

Since we can now easily find out if no atoms were removed (e.g. because someone passed in a DNA model), we should take advantage of it and write the output PDB file only if there are changes. For example:

Since we can now easily find out if no atoms were removed (e.g. because someone passed in a DNA model), we should take advantage of it and write the output PDB file only if there are changes. For example:

Task
Modify the script:
  • Print the counts
  • Pass the crystal symmetry parameter to write_pdb_file (what changes if we do this?).
  • Write the output only if the input contains amino acid residues.
Number of amino acid residues: 3
Number of other residues: 0
Number of atoms removed: 11
Writing file:  resname_mix_truncated_to_ala.pdb

Miller arrays

Files

Reading an MTZ file

Miller arrays are containers for experimental (or calculated data). A miller array contains the crystal symmetry, an array of Miller indices (h,k,l), a boolean flag indicating anomalous pairs and a flex array containing data (X-ray amplitudes or intensities, experimental sigmas, etc.).
Note that the Miller arrays do not necessarily correspond to a single column of data in a reflection file. There are several major differences:

  • Friedel mates (F(+) and F(-) from an MTZ) file become a single array with both (h,k,l) and (-h,-k,-l) present as distinct items. (Note that one consequence of this behavior is that the number of reflections will appear to double-count acentric reflections for which both Friedel mates are present.)
  • For experimental data (I or F), the array also stores the corresponding sigmas. In combination with the treatment of anomalous data, this means that a single Miller array can represent the combination of columns I(+),SIGI(+),I(-),SIGI(-) from a file.
  • Weighted map coefficients such as FWT,DELFWT or 2FOFCWT,PH2FOFCWT are treated as a single array.

The code below (tutorial_4_v0.py) shows how to obtain a miller array from a reflection file:

Information about miller arrays

Let's first see what data are in the input file.

Task
Loop through the miller_arrays and print the name of the labels in the array. To obtain the labels, use ma.info().label_string().
FOBS,SIGFOBS
IOBS,SIGIOBS
DANO,SIGDANO
R-free-flags
F(+),SIGF(+),F(-),SIGF(-)
I(+),SIGI(+),I(-),SIGI(-)

Let's focus on the FOBS,SIGFOBS miller array. We know the data in this array are amplitudes. But what is the resolution, the completeness and the spacegroup? Use the following methods to learn more about this array:

  • d_max_min()
  • space_group_info()
  • f_obs.completeness()
  • size()
  • f_obs.show_summary() (This will print the output on the screen, so no need to use the python print() function)

Task
Print out some more information of the miller array FOBS,SIGFOBS.
Resolution limits:  (49.59133498239787, 5.5000670861140355)

Space group:  P 1 21 1

Completeness:  0.970842392819

Size:  17414

Miller array info: 4zyp.mtz:FOBS,SIGFOBS
Observation type: xray.amplitude
Type of data: double, size=17414
Type of sigmas: double, size=17414
Number of Miller indices: 17414
Anomalous flag: False
Unit cell: (114.38, 210.29, 118.2, 90, 100.46, 90)
Space group: P 1 21 1 (No. 4)

We saw in the first exercise that there is also an Rfree array in the MTZ file. Let's have a closer look at that array.

Task
Print out the result of the size() method for the Rfree miller array.
Size R-free array:  17937

The size of the miller arrays FOBS,SIGFOBS and R-free-flags is not the same. In other words, the number of hkl indices with corresponding data or R-free-flag differs in the two arrays. This situation occurs rather frequently, even with deposited data in the PDB. For example, the mismatch can occur when R-free-flags are copied from another data set.

Programs dealing with data has to address this mismatch. Let's say we want to calculate R-factors. If a reflection with an associated intensity does not have an R-free flag, it can't be assigned neither to the working set nor the the test set of reflections. It is therefore useful to obtain a common set of reflections. Let's try to obtain such a common set for out test data set.

It is instructive to determine how many reflections are mismatched. The miller class has the method match_indices, which will do all the work. Use dir() to see which methods are available for the result of match_indices. You can use the method as in this example:

Here, f_obs and r_free are miller array objects.
We will now figure out how many reflections are "single" in each of these two miller arrays, i.e. we determine how many reflections don't have an equivalent in the other array. The result object of match_indices has the methods pairs() and singles(), which are all we need to get this information.

Task
Determine how many reflections are common or single for the FOBS,SIGFOBS and R-free-flags array.
Common reflections:  17414

Single reflections in FOBS,SIGFOBS:  0

Single reflections in R-free-flags:  523

TIP:

  • Access the length of the arrays with size()
  • singles() needs an argument: 0 for the array on which match_indices is performed, and 1 for the array which is the argument of match_indices.

We just found out that the R-free-flags array has about 500 additional miller indices. Note that match_indices is a low level method and is typicially not used. It is nevertheless instructive for this example. Practically, to obtain a common set of reflections, the common_sets() method can be used:

Task
Use common_sets() to create two arrays with the same hkl indices. Use the size() method to double check the result.
After using common_sets:
Size FOBS,SIGFOBS:  17414
Size R-free-flags:  17414

R-factors

Files

Calculating R-factors

We have learned how to open a model and reflection files and how to obtain the corresponding CCTBX objects. Let's put this together to calcualte R-factors.
We'll have to make a script which opens a PDB file and a MTZ file. Using the scripts from the previous parts as template, write a basic script that obtains the PDB and mtz file names from the input arguments. Use the os module to get the extension of a file in order to decide if a file is a model file or a reflection file.

Task
Write a script that opens a model file (1aba_model.pdb) and a reflection file (1aba_reflections.mtz) and prints out the filenames. The goal here is to correctly assign the filetype to the input arguments of the script; it should work regardless of the order of the files.
Model file name:  1aba_model.pdb
Reflection file name:  1aba_reflections.mtz

TIP: filename, extension = os.path.splitext('/path/to/somefile.ext') yields '/path/to/somefile' for filename and '.ext' for extension. Use this to decide if a file is a pdb file or a mtz file.

This approach to decide about filetype is quite simplistic, but it should be sufficient for "personal" use, i.e. for scripts where we know the input format. For general use however, a more elaborate approach would be necessary. For example, the script will fail for compressed pdb files (model.pdb.gz), which can be opened with iotbx.pdb, for model files in CIF format or for reflection fils in other formats (we won't cover this here).

The next step is to create model objects and miller arrays.

Task
In your script, create:
  • a model object
  • a miller array for FOBS,SIGFOBS
  • a miller array for FreeR_flag

To calculate R-factors we need a model (coordinates, b-factors, atom-type, occupancy, etc.) to obtain calculated structure factors and data (structure factor amplitudes or intensities, R-free-flags for Rfree). The following lines of code yield the f_model object combining all these inputs:

You can see that neither the model object nor the pdb_hierarchy are passed to f_model. The reason is that both obects are too "large", containing way more infomation than necessary to calculate structure factors. For example the model object may carry restraint information, while the pdb_hierarchy is a complicated data structure conveying the hierarchical levels of the input model. To calculate structure factors, neither is needed. The minimal structural information is contained in the xray_structure object, which is obtained from model:

The update_all_scales() method is necessary to calculate scale factors and add a bulk solvent model. This procedure puts the calculated and observed structure factor amplitudes to a common scale and adjusts calculated low resolution structure factors.

The f_model object has the methods r_free() and r_work() to calculate R-factors.

Task
Create f_model and print R-factors.
Rwork:  0.180265172752

Rfree: 0.19096914483

TIP: Got an assertion error? Investigate the content of the miller arrays to find out if there is any mismatch.