This basic script illustrates how to read in a model file.
Open the file tutorial_1_v0.py
in your source code editor.
Most of this script is so called boilerplate code, i.e. code that in some form or shape is found in most Python scripts.
At the beginning of the script are import statements, which import the modules needed for a task.
The bottom lines represent best practice. They enable the script to be imported and used from other Python scripts.
The first two lines of the run() function are a minimalistic - but often sufficient - way to give users a hint how to use the script. It works both for someone reading the source code of the script, and a user running the script without arguments.
python tutorial_1_v0.py
.
The output is:
Traceback (most recent call last):
File "tutorial_1_v0.py", line 15, in
run(sys.argv[1:])
File "tutorial_1_v0.py", line 8, in run
raise RuntimeError("Please specify one pdb file name.")
RuntimeError: Please specify one pdb file name.
In addition to showing the error message, Python shows exactly where the error originates. This is often extremely helpful.
The important part of the script is in these two lines:
The first line creates the class for reading a model from a file or string. It is the main input method for both PDB and mmCIF files; it will automatically determine the actual format and return the appropriate data type.
The second line creates the model class, which serves as container for model information to be passed between library functions. For example, it selects, copies and updates model information, stores libraries, and ensures consistency between lower level objects.
While analyzing a script, insert print()
statements and run the script to find out more about the objects. It may also be useful to insert help(obj)
to see the attributes and methods of obj, where obj can be any of the objects created in the script.
For example, let's try to use the show
method of the model.composition()
object, which produces output that is useful to give the user a quick overview of what is in the PDB file. Add the following lines to your script.
The help()
method calls the built-in Python help system. It lets you read the docstring and get an idea of what attributes and methods a class might have. (Press q to exit the help mode)
help
method for the composition
class.
Help on composition in module mmtbx.model.statistics object:
class composition(__builtin__.object)
| Methods defined here:
|
| __init__(self, pdb_hierarchy)
|
| result(self)
|
| show(self, log, prefix='')
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
(END)
We can see from the help
output that the composition
class has two methods ('show' and 'result') and that the 'show' method requires setting the parameter 'log'. Set it to sys.stdout
.
composition.show()
and run the script with the file 1aba_pieces.pdb
.
The output is:
Number of:
all atoms : 48
H or D atoms : 0
chains : 2
a.a. residues : 5
nucleotides : 0
water : 4
other (ligands): 0
Ligands: None
Let's continue working with the model file 1aba_pieces.pdb
. We will write a basic script to truncate all amino acids in the model to Poly-Ala. To achieve this, we will use the pdb_hierarchy object, which is a five-deep nested data structure:
model
chain
residue_group
atom_group
atom
The hierarchy object is obtained with the get_hierarchy method of the model class:
The model, chain, and atom levels of the hierarchy object are probably immediately obvious to someone familiar with the content of model files (such as a PDB file). Note that that is no 'residue type' in the data structure. Instead, there are the two types residue_group and atom_group. They are related to alternative conformations. If there are no alternative conformations in the model, all residue groups contain exactly one atom group, which contains all the atoms of a residue. A file with alternative conformations will lead to residue groups with multiple atom groups, one for each conformer. (Note: about a quarter of the files in the PDB database contain alternative conformations).
To truncate amino-acid residues to alanine, we need to know which residues are amino-acids, and the atom names. A more detailed presentation of the hierarchy object shows where we can find this information:
model(s)
id
chain(s)
id
residue_group(s)
resid
atom_group(s)
altloc, resname
atom(s)
name
segid
element
charge
xyz
occ
b
uij
(The list under each level is not exhaustive.)
Before doing the truncation, let's see what is actually in the levels of the hierarchy. Loop through all the levels of the hierarchy and print out:
Output:
Loop over hierarchy
Chain: A
Chain: B
Loop over hierarchy:
Chain: A
Resnumber: 3
Resname: LYS, Altloc:
N
CA
C
O
CB
CG
CD
CE
NZ
Resnumber: 4
Resname: VAL, Altloc:
N
CA
C
O
CB
CG1
CG2
Resnumber: 5
Resname: TYR, Altloc:
N
CA
C
O
CB
CG
CD1
CD2
CE1
CE2
CZ
OH
Resnumber: 6
Resname: GLY, Altloc:
N
CA
C
O
Resnumber: 7
Resname: TYR, Altloc:
N
CA
C
O
CB
CG
CD1
CD2
CE1
CE2
CZ
OH
Chain: B
Resnumber: 96
Resname: HOH, Altloc:
O
Resnumber: 161
Resname: HOH, Altloc:
O
Resnumber: 169
Resname: HOH, Altloc:
O
Resnumber: 193
Resname: HOH, Altloc:
O
TIP: You can see the attributes and methods of each level by using the dir
method to return a list of valid attributes of the object, for example, replace line 21 with print(dir(rg))
. Don't get confused if there are many methods/attributes. Look out for something ressembling residue names, altlocs and atom names.
Let's look at Tyr 85, which has an alternate conformation for the side chain. This example highlights the usefulness of residue groups and atom groups. Tyr 85 has three atom groups, one for the altloc " " (blank), one for "A" and one for "B".
Resnumber: 85
Resname: TYR, Altloc:
N, CA, C, O, CB
Resname: TYR, Altloc: A
CG, CD1, CD2, CE1, CE2, CZ, OH
Resname: TYR, Altloc: B
CG, CD1, CD2, CE1, CE2, CZ, OH
We have residue names and atom names now, but to perform the truncation, we need to know which residues are amino acids (because we only want to truncate amino acids), and which atom names we want to keep.
The iotbx.pdb
module contains the sub-module amino_acid_codes
. This sub-module contains two Python dictionaries, one of which is (shortened):
one_letter_given_three_letter = {
"ALA": "A",
"ARG": "R",
...
"TYR": "Y",
"VAL": "V"}
We don't need the one-letter codes, but we can re-use the keys of this dictionary to efficently decide if a residue name corresponds to an amino acid. Here are the relevant lines to add in your script.
For the atom names, we use a Python set. Here are the relevant lines to add in your script.
We use a Python set because it uses hashing techniques for element lookup when processing the "in" in the if statement. For a small list like here it doesn't really matter, but in Python it is so easy to use advanced hashing techniques, simply by converting the list of atom names to a set, there is no reason not to take advantage of them.
Loop over hierarchy:
Chain: A
Resnumber: 3
Resname: LYS, Altloc:
CG
CD
CE
NZ
Resnumber: 4
Resname: VAL, Altloc:
CG1
CG2
Resnumber: 5
Resname: TYR, Altloc:
CG
CD1
CD2
CE1
CE2
CZ
OH
Resnumber: 6
Resname: GLY, Altloc:
Resnumber: 7
Resname: TYR, Altloc:
CG
CD1
CD2
CE1
CE2
CZ
OH
Chain: B
Resnumber: 96
Resnumber: 161
Resnumber: 169
Resnumber: 193
Now that we know which residues we want to truncate, and which atom names we want to keep, we just need one more line to remove the side chain atoms:
This removes the atom from the atom group (note that atom stands for the object).
Now let's save the modified hierarchy to a file. The method to write a PDB file from the hierarchy is write_pdb_file
.
TIP: To use the method, you'll need to pass a file name. Use the help
method to find out what the parameter name for the file name is.
For most practical purposes, the script from the previous task is completely sufficient. However, there are some files in the PDB for which this is not true. One example is the structure with the PDB ID 1ysl.
Run your script on the file resname_mix.pdb
and look at the output (use a text editor or PyMol; the file cannot be opened in Coot).
The script did not remove the side chain atoms of the modified residue CSD. This residue has in fact an alternate conformation. Conformation "A" is the modified amino acid CSD, conformation "B" is a CYS.
HETATM 3907 N ACSD B 111 25.006 36.731 16.222 0.50 18.83 N
HETATM 3908 CA ACSD B 111 25.536 35.903 15.152 0.50 19.90 C
HETATM 3909 CB ACSD B 111 25.931 36.658 13.876 0.50 21.09 C
HETATM 3910 SG ACSD B 111 25.414 38.295 13.671 0.50 26.29 S
HETATM 3911 C ACSD B 111 26.713 35.054 15.562 0.50 19.23 C
HETATM 3912 O ACSD B 111 27.472 34.533 14.697 0.50 20.10 O
HETATM 3913 OD1ACSD B 111 23.793 38.008 13.181 0.50 30.17 O
HETATM 3914 OD2ACSD B 111 25.111 39.102 15.048 0.50 26.06 O
ATOM 3915 N BCYS B 111 24.996 36.697 16.246 0.50 13.39 N
ATOM 3916 CA BCYS B 111 25.522 35.913 15.123 0.50 16.53 C
ATOM 3917 C BCYS B 111 26.790 35.104 15.498 0.50 15.20 C
ATOM 3918 O BCYS B 111 27.342 34.391 14.660 0.50 16.26 O
ATOM 3919 CB BCYS B 111 25.840 36.879 13.947 0.50 20.05 C
ATOM 3920 SG BCYS B 111 24.645 38.257 14.039 0.50 29.86 S
Rare cases like this are the reason why the residue_group and atom_group levels are needed in the hierarchy. Here are two different residue names for the same member of a chain. Even though this sitution is rare, it is entirely plausible and valid, and a comprehensive PDB processing library has to be able to handle it.
Our script will only truncate the CYS residue, but it would be better if it also truncated the non-standard CSD residue in the A alternative conformation. Let's find out what it takes to achieve this.
One way would be to add the 3 letter code for all modified amino acids to the dictionary. This however relies on the completeness of the dictionary; what happens if the 3 letter code changes or if other entries are added?
Another possibility is to check if there is at least one amino acid in a residue group, and if so, apply the truncation to all residues in the group, even if they don't have a standard residue name. This means, for each residue group we have to loop over the atom groups twice, first to scan for at least one standard amino-acid residue name, and if there is one, a second time to do the truncation. This will double effort (and computing time), but it is required to accomodate all possible cases.
This is the part of the script we have to work on:
Task Modify the script so that it also truncates CSD.TIP: Replace the if-statement in line 2 (above) with a function that returns True if the residue_group contains at least one amino-acid (and False otherwise).
The script now truncates the model and can take care of rare cases involving alternate conformations. However, it doesn't tell the user anything about the performed tasks. For example, it would be interesting to know how many atoms were deleted, how many residue are affected, and how many residue are unchanged.
To get the desired information, we need counters, and we need to initialize them before we enter the nested loops over the hierarchy:
When the loops over the hierarchy are finished, we print the counts:
Since we can now easily find out if no atoms were removed (e.g. because someone passed in a DNA model), we should take advantage of it and write the output PDB file only if there are changes. For example:
Since we can now easily find out if no atoms were removed (e.g. because someone passed in a DNA model), we should take advantage of it and write the output PDB file only if there are changes. For example:
write_pdb_file
(what changes if we do this?).Number of amino acid residues: 3
Number of other residues: 0
Number of atoms removed: 11
Writing file: resname_mix_truncated_to_ala.pdb
Miller arrays are containers for experimental (or calculated data). A miller array contains the crystal symmetry, an array of Miller indices (h,k,l), a boolean flag indicating anomalous pairs and a flex array containing data (X-ray amplitudes or intensities, experimental sigmas, etc.).
Note that the Miller arrays do not necessarily correspond to a single column of data in a reflection file. There are several major differences:
The code below (tutorial_4_v0.py
) shows how to obtain a miller array from a reflection file:
Let's first see what data are in the input file.
ma.info().label_string()
.
FOBS,SIGFOBS
IOBS,SIGIOBS
DANO,SIGDANO
R-free-flags
F(+),SIGF(+),F(-),SIGF(-)
I(+),SIGI(+),I(-),SIGI(-)
Let's focus on the FOBS,SIGFOBS miller array. We know the data in this array are amplitudes. But what is the resolution, the completeness and the spacegroup? Use the following methods to learn more about this array:
d_max_min()
space_group_info()
f_obs.completeness()
size()
f_obs.show_summary()
(This will print the output on the screen, so no need to use the python print() function)Resolution limits: (49.59133498239787, 5.5000670861140355)
Space group: P 1 21 1
Completeness: 0.970842392819
Size: 17414
Miller array info: 4zyp.mtz:FOBS,SIGFOBS
Observation type: xray.amplitude
Type of data: double, size=17414
Type of sigmas: double, size=17414
Number of Miller indices: 17414
Anomalous flag: False
Unit cell: (114.38, 210.29, 118.2, 90, 100.46, 90)
Space group: P 1 21 1 (No. 4)
We saw in the first exercise that there is also an Rfree array in the MTZ file. Let's have a closer look at that array.
size()
method for the Rfree miller array.
Size R-free array: 17937
The size of the miller arrays FOBS,SIGFOBS and R-free-flags is not the same. In other words, the number of hkl indices with corresponding data or R-free-flag differs in the two arrays. This situation occurs rather frequently, even with deposited data in the PDB. For example, the mismatch can occur when R-free-flags are copied from another data set.
Programs dealing with data has to address this mismatch. Let's say we want to calculate R-factors. If a reflection with an associated intensity does not have an R-free flag, it can't be assigned neither to the working set nor the the test set of reflections. It is therefore useful to obtain a common set of reflections. Let's try to obtain such a common set for out test data set.
It is instructive to determine how many reflections are mismatched. The miller class has the method match_indices
, which will do all the work. Use dir()
to see which methods are available for the result of match_indices. You can use the method as in this example:
Here, f_obs
and r_free
are miller array objects.
We will now figure out how many reflections are "single" in each of these two miller arrays, i.e. we determine how many reflections don't have an equivalent in the other array. The result object of match_indices
has the methods pairs()
and singles()
, which are all we need to get this information.
Common reflections: 17414
Single reflections in FOBS,SIGFOBS: 0
Single reflections in R-free-flags: 523
TIP:
size()
singles()
needs an argument: 0 for the array on which match_indices is performed, and 1 for the array which is the argument of match_indices.We just found out that the R-free-flags array has about 500 additional miller indices. Note that match_indices
is a low level method and is typicially not used. It is nevertheless instructive for this example. Practically, to obtain a common set of reflections, the common_sets()
method can be used:
common_sets()
to create two arrays with the same hkl indices. Use the size()
method to double check the result.
After using common_sets:
Size FOBS,SIGFOBS: 17414
Size R-free-flags: 17414
We have learned how to open a model and reflection files and how to obtain the corresponding CCTBX objects. Let's put this together to calcualte R-factors.
We'll have to make a script which opens a PDB file and a MTZ file. Using the scripts from the previous parts as template, write a basic script that obtains the PDB and mtz file names from the input arguments. Use the os
module to get the extension of a file in order to decide if a file is a model file or a reflection file.
1aba_model.pdb
) and a reflection file (1aba_reflections.mtz
) and prints out the filenames. The goal here is to correctly assign the filetype to the input arguments of the script; it should work regardless of the order of the files.
Model file name: 1aba_model.pdb
Reflection file name: 1aba_reflections.mtz
TIP: filename, extension = os.path.splitext('/path/to/somefile.ext')
yields '/path/to/somefile' for filename and '.ext' for extension. Use this to decide if a file is a pdb file or a mtz file.
This approach to decide about filetype is quite simplistic, but it should be sufficient for "personal" use, i.e. for scripts where we know the input format. For general use however, a more elaborate approach would be necessary. For example, the script will fail for compressed pdb files (model.pdb.gz), which can be opened with iotbx.pdb
, for model files in CIF format or for reflection fils in other formats (we won't cover this here).
The next step is to create model objects and miller arrays.
To calculate R-factors we need a model (coordinates, b-factors, atom-type, occupancy, etc.) to obtain calculated structure factors and data (structure factor amplitudes or intensities, R-free-flags for Rfree). The following lines of code yield the f_model
object combining all these inputs:
You can see that neither the model
object nor the pdb_hierarchy
are passed to f_model
. The reason is that both obects are too "large", containing way more infomation than necessary to calculate structure factors. For example the model
object may carry restraint information, while the pdb_hierarchy
is a complicated data structure conveying the hierarchical levels of the input model. To calculate structure factors, neither is needed. The minimal structural information is contained in the xray_structure
object, which is obtained from model:
The update_all_scales()
method is necessary to calculate scale factors and add a bulk solvent model. This procedure puts the calculated and observed structure factor amplitudes to a common scale and adjusts calculated low resolution structure factors.
The f_model
object has the methods r_free()
and r_work()
to calculate R-factors.
f_model
and print R-factors.
Rwork: 0.180265172752
Rfree: 0.19096914483
TIP: Got an assertion error? Investigate the content of the miller arrays to find out if there is any mismatch.