Learn about the PDB hierarchy, the cctbx object that represents the information stored in a model file. In particular, this object maintains the hierarchical view of the model.
Author: Ralf Grosse-Kunstleve
(This text is adapted from an article by Ralf Grosse-Kunstleve; the original can be found here.)
In macromolecular crystallography, the PDB format was the predominant working format for atomic parameters (coordinates, occupancies, displacement parameters, etc.). Many small-molecule programs also support this format. The PDB format specifications are available at http://www.wwpdb.org/. Technically, the format is very simple, therefore a vast number of parsers exist in scientific packages. The tools in
iotbx.pdb are more complex than many, but have been designed to accommodate all valid scenarios found in real-world situations. Many interaction with models is done through an object called the "PDB hierarchy" (
Because of certain limitations of the PDB format (which was designed in the
1970s), the MX community moved towards greater use of the mmCIF format for storing models.
iotbx.pdb supports reading and writing mmCIF through the
iotbx.cif parser, and the internal representation of the PDB
hierarchy is identical for both formats. Throughout the documentation we
refer to models as "PDB files", but in most cases an mmCIF will work equally
See here for a tutorial illustrating the use of the pdb_hierarchy.
xray_structure(not explained on this page). For this object, all that is not necessary to calculate SFs is dropped for performance reasons.
The evolution of the cctbx PDB handling tools has gone through several stages. A simple PDB parser implemented in Python was used in the beginning. Python's runtime performance is usually sufficient for interactive processing of PDB files, but can be limiting for
large files, or for repeatedly traversing the entire PDB database. This has
prompted us to implement a fast C++ parser that is described in
Grosse-Kunstleve et al. (2006). The current cctbx version provides comprehensive tools for reading, manipulating, and writing PDB files. These tools are available from both Python and C++, under the
iotbx.pdb module.See also http://cci.lbl.gov/hybrid_36/ which describes iotbx.pdb facilities for handling very large models.
Here is a simple example of reading in a single-model PDB (or mmCIF) model and traversing the hierarchy, deleting any zinc atoms:
from __future__ import absolute_import, division, print_function from iotbx.data_manager import DataManager import sys dm = DataManager() # Initialize the DataManager and call it dm dm.set_overwrite(True) # tell the DataManager to overwrite files with the same name model_filename = sys.argv[1:] # Name of model file model = dm.get_model(model_filename) # Deliver model object with model info m = model.deep_copy() # work on a copy of model object pdb_hierarchy = m.get_hierarchy() # Get hierarchy object for chain in pdb_hierarchy.only_model().chains(): for residue_group in chain.residue_groups(): for atom_group in residue_group.atom_groups(): for atom in atom_group.atoms(): if (atom.element.strip().upper() == "ZN"): atom_group.remove_atom(atom) if (atom_group.atoms_size() == 0): residue_group.remove_atom_group(atom_group) if (residue_group.atom_groups_size() == 0): chain.remove_residue_group(residue_group) with open("model_Zn_free.pdb", "w") as f: f.write(m.model_as_pdb())
This will change the hierarchy in place, i.e. we removed the ZN atoms from the hierarchy and cannot recover it.
Note that we obtained the
pdb_hierarchy() object via the
model() object and the DataManager:
pdb_hierarchy = model.get_hierarchy() # Get hierarchy object
It is possible to get a
pdb_hierarchy() object directly:
import iotbx.pdb pdb_hierarchy = iotbx.pdb.input(file_name=model_filename).construct_hierarchy()
However, it is recommended to proceed via the
model object. The lower-level approach may be nevertheless useful for quick scripts and for regression tests.
The hierarchy provides an atom selection syntax that allows us to perform the same operation with much less code than the nested loop above:
sel_cache = pdb_hierarchy.atom_selection_cache() non_zn_sel = sel_cache.selection("not (resname ZN)") hierarchy_new = pdb_hierarchy.select(non_zn_sel) # etc
This however creates a new
hierarchy object, rather than modifying the existing
structure in place. We can avoid using the
atom_selection_cache by doing the selection via the
non_zn_sel= model.selection("not (resname ZN)") hierarchy_new = model.select(non_zn_sel).get_hierarchy()
The simplicity of the PDB format is only superficial and, in the general case, stops after the initial parsing level. The structure of the PDB file implies a hierarchy of objects. A first approximation of this hierarchical view is:
This is only an approximation because of a feature that is easily overlooked: the "altloc", column 17 of PDB ATOM records, which specifies "alternative location" identifiers for atoms in alternative conformations.
About 90% of the development time invested into
iotbx.pdb was in some form related to alternative conformations. Our goal was to provide robust tools that work even for the most unusual (but valid) cases, since this is a vital characteristic of any automated system. The main difficulties encountered while pursuing this goal were:
These difficulties are best illustrated with examples. An old version of PDB entry 2IZQ (from Aug 2008) includes a chain with conformers that have different sequences, the residue with resseq 11 in chain A, atoms 220 through 283:
HEADER ANTIBIOTIC 26-JUL-06 2IZQ ATOM 220 N ATRP A 11 20.498 12.832 34.558 0.50 6.03 N ................ATRP A 11 ................................................... ATOM 243 HH2ATRP A 11 15.522 9.077 38.323 0.50 10.40 H ATOM 244 N CPHE A 11 20.226 13.044 34.556 0.15 6.35 N ................CPHE A 11 ................................................... ATOM 254 CZ CPHE A 11 16.789 9.396 34.594 0.15 10.98 C ATOM 255 N BTYR A 11 20.553 12.751 34.549 0.35 5.21 N ................BTYR A 11 ................................................... ATOM 261 CD1BTYR A 11 18.548 10.134 34.268 0.35 9.45 C ATOM 262 HB2CPHE A 11 21.221 10.536 34.146 0.15 7.21 H ATOM 263 CD2BTYR A 11 18.463 10.012 36.681 0.35 9.08 C ATOM 264 HB3CPHE A 11 21.198 10.093 35.647 0.15 7.21 H ATOM 265 CE1BTYR A 11 17.195 9.960 34.223 0.35 10.76 C ATOM 266 HD1CPHE A 11 19.394 9.937 32.837 0.15 10.53 H ATOM 267 CE2BTYR A 11 17.100 9.826 36.693 0.35 11.29 C ATOM 268 HD2CPHE A 11 18.873 10.410 36.828 0.15 9.24 H ATOM 269 CZ BTYR A 11 16.546 9.812 35.432 0.35 11.90 C ATOM 270 HE1CPHE A 11 17.206 9.172 32.650 0.15 12.52 H ATOM 271 OH BTYR A 11 15.178 9.650 35.313 0.35 19.29 O ATOM 272 HE2CPHE A 11 16.661 9.708 36.588 0.15 11.13 H ATOM 273 HZ CPHE A 11 15.908 9.110 34.509 0.15 13.18 H ATOM 274 H BTYR A 11 20.634 12.539 33.720 0.35 6.25 H ................BTYR A 11 ................................................... ATOM 282 HH BTYR A 11 14.978 9.587 34.520 0.35 28.94 H
Residue 11 has a Trp in conformation A, a Tyr in conformation B and a Phe in conformation C. The original atom numbering does not have gaps. To save space, we omitted blocks of atoms with constant resname and resseq+icode.
As of Jun 8 2010, there are 74 files with mixed residue names in the PDB, i.e. only about 0.1% of the files. However, these files are perfectly valid and a PDB processing library is suitable as a component of an automated system only if it handles them sensibly.
An old version of PDB entry 1ZEH (Aug 2008) includes a chain with consecutive duplicate resseq+icode, atoms 878 through 894:
HEADER HORMONE 01-MAY-98 1ZEH HETATM 878 C1 ACRS 5 12.880 14.021 1.197 0.50 33.23 C HETATM 879 C1 BCRS 5 12.880 14.007 1.210 0.50 34.27 C ................ACRS 5 ................................................... HETATM 892 O1 ACRS 5 11.973 14.116 2.233 0.50 34.24 O HETATM 893 O1 BCRS 5 11.973 14.107 2.248 0.50 35.28 O HETATM 894 O HOH 5 -0.924 19.122 -8.629 1.00 11.73 O HETATM 895 O HOH 6 -19.752 11.918 3.524 1.00 13.44 O HETATM 896 O HOH 7 -1.169 17.936 -6.103 1.00 12.89 O
To a human inspecting the old 1ZEH entry, it is of course immediately obvious that the water with resseq 5 is not related to the previous residue with resseq 5. However, arriving at this conclusion with an automatic procedure is not entirely straightforward. The human brings in the knowledge that water atoms without hydrogen are not covalently connected to other atoms. This is very detailed, specialized knowledge. Introducing such heuristics into an automatic procedure is likely to lead to surprises in some situations and is best avoided, if possible.
In the PDB archive, alternative conformers of a residue always appear consecutively. However, some programs produce files with conformers separated in this way (this file was provided to us by a user):
ATOM 1716 N ALEU 190 28.628 4.549 20.230 0.70 3.78 N ATOM 1717 CA ALEU 190 27.606 5.007 19.274 0.70 3.71 C ATOM 1718 CB ALEU 190 26.715 3.852 18.800 0.70 4.15 C ATOM 1719 CG ALEU 190 25.758 4.277 17.672 0.70 4.34 C ATOM 1829 N BLEU 190 28.428 4.746 20.343 0.30 5.13 N ATOM 1830 CA BLEU 190 27.378 5.229 19.418 0.30 4.89 C ATOM 1831 CB BLEU 190 26.539 4.062 18.892 0.30 4.88 C ATOM 1832 CG BLEU 190 25.427 4.359 17.878 0.30 5.95 C ATOM 1724 N ATHR 191 27.350 7.274 20.124 0.70 3.35 N ATOM 1725 CA ATHR 191 26.814 8.243 21.048 0.70 3.27 C ATOM 1726 CB ATHR 191 27.925 9.229 21.468 0.70 3.73 C ATOM 1727 OG1ATHR 191 28.519 9.718 20.259 0.70 5.22 O ATOM 1728 CG2ATHR 191 28.924 8.567 22.345 0.70 4.21 C ATOM 1729 C ATHR 191 25.587 8.983 20.559 0.70 3.53 C ATOM 1730 O ATHR 191 24.872 9.566 21.383 0.70 3.93 O ... residues 191 through 203 not shown ATOM 1828 O AGLY 203 8.948 14.861 23.401 0.70 5.84 O ATOM 1833 CD1BLEU 190 26.014 4.711 16.521 0.30 6.21 C ATOM 1835 C BLEU 190 26.506 6.219 20.135 0.30 4.99 C ATOM 1836 O BLEU 190 25.418 5.939 20.669 0.30 5.91 O ATOM 1721 CD2ALEU 190 24.674 3.225 17.536 0.70 5.31 C ATOM 1722 C ALEU 190 26.781 6.055 20.023 0.70 3.36 C ATOM 1723 O ALEU 190 25.693 5.796 20.563 0.70 3.68 O ATOM 8722 C DLEU 190 26.781 6.055 20.023 0.70 3.36 C ATOM 8723 O DLEU 190 25.693 5.796 20.563 0.70 3.68 O ATOM 9722 C CLEU 190 26.781 6.055 20.023 0.70 3.36 C ATOM 9723 O CLEU 190 25.693 5.796 20.563 0.70 3.68
In this file, conformers A and B of residue 190 appear consecutively, but conformers C and D appear only after conformers A and B of all residues 191 through 203. While this is not the most intuitive way of ordering the residues in a file, it is still considered valid because the original intention is clear. Since it was our goal to develop a comprehensive library suitable for automatically processing files produced by any popular program, we found it important to correctly handle non-consecutive conformers.
When developing the procedure capable of handling the variety of real-world situations shown above, we strived to keep the underlying rule-set as simple as possible and to avoid highly specific heuristics (e.g. "water is never covalently bound"). Complex rules imply complex implementations, are difficult to explain and understand, tend to lead to surprises, and are therefore likely to be rejected by the community. With this and the real-world situations in mind, we arrived at the following primary organization of the PDB hierarchy:
In this presentation the "(s)" indicates a list of objects of the given type. i.e. a hierarchy contains a list of models, each model has an "id" (a simple string) and holds a list of chains, etc.
Comparing with the "first approximation hierarchy" above, the residue type is replaced with two new types: residue_group and atom_group. These types had to be introduced to cover all the real-world cases shown in the previous section. The residue_group and atom_group types are new and unusual. Before we go into the details of these types, it will be helpful to consider the bigger picture by introducing the alternative secondary view of the PDB hierarchy:
This organization is probably more intuitive at first. The first two levels (model, chain) are exactly the same as in the primary hierarchy. Each chain holds a list of conformer objects, which are characterized by the altloc character from column 17 in the PDB ATOM records. A conformer is understood to be a complete copy of a chain, but usually two conformers share some or even most atoms. A residue is characterized by a unique resname and the resseq+icode.
The secondary view of the hierarchy evolved in the context of generating geometry restraints for refinement (e.g. bond, angle, and dihedral restraints), where this organization is most useful. It is also the organization introduced in Grosse-Kunstleve et al. (2006), where it was actually the primary organization. While working with the conformer-residue organization, we found that it is difficult to manipulate a hierarchy (e.g. add or delete atoms) in obvious ways. Finally, while developing the automatic generation of constrained occupancy groups for alternative conformations, the conformer-residue organization proved to be unworkable. The main difficulty is that the relative order of residues with alternative conformations is lost in the conformer-residue organization; it is only given indirectly by interleaved residues with shared atoms -- if they exist. As convenient as the conformer-residue organization is for the generation of restraints, it is a hindrance for other purposes.
The names for the new types in the primary hierarchy were chosen not to collide with the secondary view. We could have used "conformer" and "residue" again, just reversed, but there would be the big surprise that one residue has different resnames. To send the signal "this is not what you usually think of as a residue", we decided to use residue_group as the type name. A residue group holds a list of atom_group objects. All atoms in an atom group have the same resname and altloc. Therefore "resname_altloc_group" would have been another plausible name, but we favored atom_group since it is more concise and better conveys what is the main content.
When constructing the primary hierarchy given a PDB file, the processing algorithm has to detect models, chains, residue groups, atom groups and atoms. Most steps are fairly straightforward, but none of the steps is actually completely trivial. For example, what to do if TER or ENDMDL cards are missing? What if residue sequence numbers are not consecutive? The ribosome community widely uses segid instead of chain id (even though the segid column is officially deprecated by the PDB). What if a file contains both chain id and segid? Documenting all our answers in full detail is beyond the scope of this article (and would be more distracting than helpful anyway because the source code is openly available). Therefore we concentrate on the most important rules for the detection of residue groups and atom groups.
Then central conflict we had to resolve was:
This lead to a two stage procedure. In the first stage, a residue group is given by a block of consecutive ATOM or HETATM records with identical resseq+icode columns (SIGATM, ANISOU, and SIGUIJ records may be interleaved), e.g.:
ATOM 234 H ATRP A 11 20.540 12.567 33.741 0.50 7.24 H ATOM 235 HA ATRP A 11 20.771 12.306 36.485 0.50 6.28 H ATOM 244 N CPHE A 11 20.226 13.044 34.556 0.15 6.35 N ATOM 245 CA CPHE A 11 20.950 12.135 35.430 0.15 5.92
However, there is an important pre-condition: unless a sub-block of ATOM records with identical resname+resseq+icode columns contains a main-conformer atom (almost "blank altloc"), e.g.:
HEADER ANTIBIOTIC RESISTANCE 07-MAY-97 1AJQ CRYST1 52.120 65.080 76.300 100.20 111.44 105.81 P 1 1 HETATM 6097 CA CA 1 5.676 34.115 52.446 1.00 18.50 CA HETATM 6100 C2 SPA 1 11.860 36.159 33.853 1.00 14.30 C HETATM 6107 C6 SPA 1 13.085 36.522 34.644 1.00 17.34 C
In this case the sub-block is assigned to a separate residue group.
Within each residue group, all atoms are grouped by altloc+resname and assigned to atom groups. The order of the atoms in a residue group does not affect the assignment to atom groups.
After all atom records are assigned to residue groups and atom groups,
are merged in the second processing stage. While two residue groups are merged, atom groups with identical altloc+resname from the two sources are also merged.
Note that the grouping steps in this process may change the order of the atoms. However, our implementation preserves the relative order of the atoms as much as possible. I.e. in the hierarchy, resseq+icode appear in the original "first seen" order, and similarly for altloc+resname within a residue group.
The secondary view of the hierarchy is constructed trivially from the primary hierarchy objects. For each chain, the complete list of altloc characters is determined in a first pass. In a second pass, a conformer object is created for each altloc, and a second loop over the chain assigns the atoms to each conformer. Main-conformer atoms are assigned to all conformers, atoms in alternative conformations only to the corresponding conformer. Because of the difficulties alluded to earlier, the conformer and residue objects in the secondary view are "read-only". I.e. all manipulations such as addition or removal of residues, have to be performed on the primary hierarchy. Re-constructing the secondary view after the primary hierarchy has been changed is very fast (fractions of seconds even for the largest files, e.g. 0.22 s for PDB entry 1HTQ with almost one million atoms).
iotbx/examples/pdb_hierarchy.py script shows how to construct the primary
hierarchy from a PDB file, and how to obtain the secondary view.
A PDB hierarchy can be represented as a PDB-formatted file (as illustrated so far here), or as an mmCIF-formatted file. Both of these file formats can be converted back into PDB hierarchies as well.
Although all three of these representations can be interconverted, they do not contain exactly the same information and it can be useful to know some of the similarities and differences.
The mmCIF format supersedes the PDB format and is the official format for macromolecular structures deposited in the Protein Data Bank. The relationships between PDB and mmCIF formats are specified in PDB to PDBx/mmCIF Data Item Correspondences.
The second level in a PDB hierarchy is a "chain" (see the Real-world PDB files section above). A chain is normally a segment of protein or nucleic acid, or a metal atom, or a group of water molecules, or a ligand. It can actually be any combination of these as well, however.
When working with a hierarchy in cctbx, the chain is a basic unit (the expression "for chain in model.chains():" appears almost 200 times in cctbx.) Every chain has an ID, but importantly, these chain ID values do not have to be unique. Multiple chains can have the same ID. This allows grouping of chains that are related. For example, protein chain A may be next to ligand GOL that part of another chain that has the same chain ID of A. The atom selection syntax (see Atom selections below) allows selecting all the atoms that are part of any chain that has the ID of A. This is done with the selection, "chain A".
One of the important differences between PDB and mmCIF formats is the use of the chain ID. In a PDB formatted file, atoms in a polymer such as a protein molecule are grouped by residue and are listed sequentially, one atom on each line. Each line includes the chain ID. At the end of the polymer, a special line that starts with TER is present, indicating the end of a protein or nucleic acid polymer. Here is an example of part of such a PDB file. It contains two protein chains (just one residue in each chain), and two Au (gold) atoms and four waters (all the coordinates are zeros in this example):
If this file is read in and used to create a PDB hierarchy, the hierarchy will have two protein chains, both with chain ID of A. It will have one additional chain, also with a chain ID of A, containing the gold atoms and waters. In an mmCIF file, there is no "chain" and no "chain ID" and there are no TER lines. There are instead two labels that convey similar information. One is called auth_asym_id. This label generally corresponds to the chain ID, but does not have to. The second label is label_asym_id. This label is unique (no duplicates) and refers to one protein or nucleic acid polymer, or one ligand, or one metal, or one group of waters.
REMARK PDB FORMATTED FILE WITH TWO PROTEIN CHAINS AND ONE CHAIN WITH AU AND WAT ATOM 1 N ALA A 102 0.000 0.000 0.000 1.00 10.00 N ATOM 2 CA ALA A 102 0.000 0.000 0.000 1.00 10.00 C ATOM 3 C ALA A 102 0.000 0.000 0.000 1.00 10.00 C ATOM 4 O ALA A 102 0.000 0.000 0.000 1.00 10.00 O TER ATOM 5 N GLY A 103 0.000 0.000 0.000 1.00 10.00 N ATOM 6 CA GLY A 103 0.000 0.000 0.000 1.00 10.00 C ATOM 7 C GLY A 103 0.000 0.000 0.000 1.00 10.00 C ATOM 8 O GLY A 103 0.000 0.000 0.000 1.00 10.00 O TER HETATM 9 AU AU A 104 0.000 0.000 0.000 1.00 10.00 AU HETATM 10 AU AU A 105 0.000 0.000 0.000 1.00 10.00 AU HETATM 11 HOH WAT A 106 0.000 0.000 0.000 1.00 10.00 O HETATM 12 HOH WAT A 107 0.000 0.000 0.000 1.00 10.00 O HETATM 13 HOH WAT A 108 0.000 0.000 0.000 1.00 10.00 O HETATM 14 HOH WAT A 109 0.000 0.000 0.000 1.00 10.00 O END
Here is part of the mmCIF representation of this same structure. The "loop_" line and the lines following it that start with an underscore describe the items of data that are going to follow in the lines starting with "ATOM" (for protein/nucleic acid atoms) or "HETATM" (for ligands, metals, waters). The auth_asym_id is the A in "ALA A 102" and the label_asym_id is the A in "N ? A ? 1 1" at the end of the line. Notice that the auth_asym_id is A for all the atoms (as the chain ID was in the PDB format), and the label_asym_id is A for the first protein chain, B for the second, then C and D for the gold atoms and E for all four waters:
# PART OF MMCIF FILE WITH TWO PROTEIN CHAINS AND ONE CHAIN WITH AU AND WAT loop_ _atom_site.group_PDB _atom_site.id _atom_site.label_atom_id _atom_site.label_alt_id _atom_site.label_comp_id _atom_site.auth_asym_id _atom_site.auth_seq_id _atom_site.pdbx_PDB_ins_code _atom_site.Cartn_x _atom_site.Cartn_y _atom_site.Cartn_z _atom_site.occupancy _atom_site.B_iso_or_equiv _atom_site.type_symbol _atom_site.pdbx_formal_charge _atom_site.label_asym_id _atom_site.label_entity_id _atom_site.label_seq_id _atom_site.pdbx_PDB_model_num ATOM 1 N . ALA A 102 ? 0.00000 0.00000 0.00000 1.000 10.00000 N ? A ? 1 1 ATOM 2 CA . ALA A 102 ? 0.00000 0.00000 0.00000 1.000 10.00000 C ? A ? 1 1 ATOM 3 C . ALA A 102 ? 0.00000 0.00000 0.00000 1.000 10.00000 C ? A ? 1 1 ATOM 4 O . ALA A 102 ? 0.00000 0.00000 0.00000 1.000 10.00000 O ? A ? 1 1 ATOM 5 N . GLY A 103 ? 0.00000 0.00000 0.00000 1.000 10.00000 N ? B ? 1 1 ATOM 6 CA . GLY A 103 ? 0.00000 0.00000 0.00000 1.000 10.00000 C ? B ? 1 1 ATOM 7 C . GLY A 103 ? 0.00000 0.00000 0.00000 1.000 10.00000 C ? B ? 1 1 ATOM 8 O . GLY A 103 ? 0.00000 0.00000 0.00000 1.000 10.00000 O ? B ? 1 1 HETATM 9 AU . AU A 104 ? 0.00000 0.00000 0.00000 1.000 10.00000 AU ? C ? . 1 HETATM 10 AU . AU A 105 ? 0.00000 0.00000 0.00000 1.000 10.00000 AU ? D ? . 1 HETATM 11 HOH . WAT A 106 ? 0.00000 0.00000 0.00000 1.000 10.00000 O ? E ? . 1 HETATM 12 HOH . WAT A 107 ? 0.00000 0.00000 0.00000 1.000 10.00000 O ? E ? . 1 HETATM 13 HOH . WAT A 108 ? 0.00000 0.00000 0.00000 1.000 10.00000 O ? E ? . 1 HETATM 14 HOH . WAT A 109 ? 0.00000 0.00000 0.00000 1.000 10.00000 O ? E ? . 1
In cctbx, the mmCIF and PDB formatted files shown above are interconvertible with the hierarchy. You can read in either of these and obtain the same hierarchy, and if you write out the hierarchy in either format you will get back the text shown above.
iotbx.pdb module uses the following nomenclature to describe various
aspects of alternative conformations:
HEADER HYDROLASE 12-MAR-04 1SO7 ATOM 2460 CG1 VAL A 325 -23.284 97.713 15.815 0.66 21.74 C ATOM 2461 CG1AVAL A 325 -23.010 97.616 18.295 0.66 22.88 C ATOM 2462 CG2BVAL A 325 -24.819 96.373 17.146 0.66 22.57 C
Residues are classified as follows:
The tools in iotbx.pdb are designed to be as tolerant as reasonably possible when processing input PDB files. E.g., as of Jun 8 2010, all 65802 files in the PDB archive can be processed without generating exceptions. However, to assist users and developers in creating PDB files without ambiguities, the hierarchy object provides methods for flagging likely problems as errors and warnings. It is up to the application how to react to the diagnostics.
The phenix.pdb.hierarchy command can be used to quickly obtain a summary of the hierarchy in a PDB file and some diagnostics. This example command highlights all main features:
Output: file pdb1jxw.ent.gz total number of: models: 1 chains: 2 alt. conf.: 5 residues: 49 atoms: 786 anisou: 0 number of atom element+charge types: 5 histogram of atom element+charge frequency: " H " 383 " C " 261 " O " 77 " N " 59 " S " 6 residue name classes: "common_amino_acid" 48 "other" 1 number of chain ids: 2 histogram of chain id frequency: " " 1 "A" 1 number of alt. conf. ids: 3 histogram of alt. conf. id frequency: "A" 2 "B" 2 "C" 1 residue alt. conf. situations: pure main conf.: 32 pure alt. conf.: 3 proper alt. conf.: 14 improper alt. conf.: 0 chains with mix of proper and improper alt. conf.: 0 number of residue names: 16 histogram of residue name frequency: "CYS" 6 "THR" 6 "ALA" 5 "ILE" 5 "PRO" 5 "GLY" 4 "ASN" 3 "SER" 3 "ARG" 2 "LEU" 2 "TYR" 2 "VAL" 2 "ASP" 1 "EOH" 1 other "GLU" 1 "PHE" 1 ### WARNING: consecutive residue_groups with same resid ### number of consecutive residue groups with same resid: 2 residue group: "ATOM 378 N PRO A 22 .*. N " ... 12 atoms not shown "ATOM 391 HD3APRO A 22 .*. H " next residue group: "ATOM 392 CB BSER A 22 .*. C " ... 6 atoms not shown "ATOM 399 HB3CSER A 22 .*. H " ------------------------------------------ residue group: "ATOM 432 N LEU A 25 .*. N " ... 18 atoms not shown "ATOM 439 CD1CLEU A 25 .*. C " next residue group: "ATOM 452 CG1BILE A 25 .*. C " ... 13 atoms not shown "ATOM 466 HD13CILE A 25 .*. H "
The diagnostics are mainly intended for PDB working files, but we have tested the iotbx.pdb module by processing the entire PDB archive. (This took about 3700 CPU seconds, but using 40 CPUs the last job finished after only 277 seconds. Disk and network I/O is rate-limiting in this case, not the performance of the PDB handling library.) The results are summarized in the following table:
Total number of .ent files: 65802 (2010 Jun 8) 56873 WARNING: duplicate chain id 3 WARNING: consecutive residue_groups with same resid 2 ERROR: duplicate atom labels 1 ERROR: improper alt. conf. 0 ERROR: duplicate model id 0 ERROR: residue group with multiple resnames using same altloc
About 86% of the PDB entries re-use the same chain id for multiple chains (which are either separated by other chains or TER cards). In many cases, the re-used chain id is the blank character, which is clearly a minor issue. However, we flag this situation to assist people in producing new PDB files with unambiguous chain ids.
The next item in the list, "consecutive residue_groups with same resid" (where resid=resseq+icode), was mentioned before. This situation is best avoided to minimize the chances of mis-interpreting the PDB files. "duplicate atom labels" pose a serious practical problem (therefore flagged as "ERROR") since it is impossible to uniquely select atoms with duplicate labels, for example via an atom selection syntax as used in many programs (CNS, PyMOL, PHENIX, VMD, etc.) or via PDB records in the connectivity annotation section (LINK, SSBOND, CISPEP). Atom serial numbers are not suitable for this purpose since many programs do not preserve them.
Only one PDB entry (1JRT) includes "improper alt. conf." as introduced in the previous section.
There are no PDB entries with two other potential problems diagnosed by the iotbx.pdb module. MODEL ids are unique throughout the PDB archive (as an aside: and all MODEL records have a matching ENDMDL record). Finally, in all 74 files with mixed residue names (i.e. conformers with different sequences), there is exactly one residue name for a given altloc.
The PDB hierarchy enables fast selection lookup based on a simple syntax similar to the one used in CNS or PyMOL. Many programs in Phenix make heavy use of these selections to handle user input, but they are equally important internally, where they simplify the task of pulling specific atoms out of the hierarchy.
The syntax allows for selection of atomic properties such as atom name,
residue number, chain ID, or B-factor, which can be combined by the use of
simple boolean statements (
not). The class
iotbx.pdb.atom_selection.cache implements these selections, and
is created directly by the hierarchy object:
m = model.deep_copy() pdb_hierarchy = m.get_hierarchy() pdb_atoms = pdb_hierarchy.atoms() xray_structure = m.get_xray_structure() sel_cache = pdb_hierarchy.atom_selection_cache() c_alpha_sel = sel_cache.selection("name ca") # XXX not case sensitive! c_alpha_atoms = pdb_atoms.select(c_alpha_sel) c_alpha_xray_structure = xray_structure.select(c_alpha_sel) c_alpha_hierarchy = pdb_hierarchy.select(c_alpha_sel)
These selection operations work equivalently to the
scitbx.array_family selections described elsewhere, and the
iotbx.pdb.atom_selection.cache object also provides an analogous
iselection that returns the corresponding
size_t array of
indices (equivalent to calling
c_alpha_sel.iselection() in the example
cctbx.xray.structure classes are not themselves arrays, but since
they store comparable data (and in the latter case, an actual array internally)
they are also implemented with the
Other practical examples of the syntax:All atoms
All atoms with H in the name (* is a wildcard character)
Atoms names with * (backslash disables wildcard function)
Atom names with spaces
Atom names with primes don't necessarily have to be quoted
name 'O 1'
Boolean "and", "or", and "not"
resname ALA and (name ca or name c or name n or name o) chain a and not altid b resid 120 and icode c and model 2 segid a and element c and charge 2+ and anisou
Note that the
not operators have equal priority, so parentheses may be required to indicate which clauses go together.
The first example above selects for all atoms with the specified names in alanine residues, but if the parentheses are omitted:
it will instead select C-alpha atoms in ALA, plus all atoms named C, N, or O regardless of residue name. A single residue by number
resname ALA and name ca or name c or name n or name o
Note that if there are several chains containing residue number 188, all of them will be selected. To be more specific and select residue 188 in particular chain:
chain A and resid 188
this will select residue 188 only in chain A.A range of residues
This selects all residue numbers falling within this range, independent of their order in the PDB file. The numbering must be ascending; ``resseq 10:2`` will not work.
resid combines the residue number (
resseq) with the insertion code (
icode); thus the following selections are identical
resid 188 resseq 188 and icode ' '
as are these:
resid 27C resseq 27 and icode 'C'
Specifying ordered ranges of residues
For some purposes, such as specifying TLS groups, it may be impractical to specify a numerical range of
resseq attributes. This is especially the
case when the residue numbering is not continuous and ascending, which is
found in some PDB entries, or when the insertion code is non-blank. An
alternative is to define a series of residues by their combined number and
insertion code, using the
chain A and resid 1000 through 17J
In this case, the residues will be examined in the order that they appear in the PDB file, and every atom falling between 1000 and 17J in chain A will be included. This is the only situation where the ordering of atoms is important in atom selections.
B-factors and occupancies
You may also select atoms based on their numerical properties:
bfactor > 80 bfactor = 0 occupancy < 1 occupancy = 0
A few additional keywords are supported:
element H water pepnames hetero
These examples specify hydrogen atoms, water molecules (detected by residue name), common amino acid residues, and atoms labeled as HETATM, respectively.Selecting no atoms
In addition to the string-based selection mechanism, it is also possible to derive selections from any object in the PDB hierarchy:
m = model.deep_copy() pdb_hierarchy = m.get_hierarchy() for chain in pdb_hierarchy.only_model().chains(): chain_atoms = chain.atoms() chain_selection = chain_atoms.extract_i_seq()
This is especially useful when referencing a string-based selection, as it allows us to compare any hierarchy object to the selection:
selection = pdb_hierarchy.atom_selection_cache().selection("hetatm") for chain in pdb_hierarchy.only_model().chains(): for residue_group in chain.residue_groups(): residue_isel = residue_group.atoms().extract_i_seq() if (selection.select(residue_isel).all_eq(True)): #do_something_with_heteroatom_residue(residue_group) pass
One word of caution: this assumes that the
i_seq attribute of each atom
represents its position in the hierarchy (and corresponding array returned by
i_seq will be initialized by default when
constructing the hierarchy, but modifications to the hierarchy will not be
propagated to the individual atoms. You can reset the
i_seq at any time:
pdb_atoms = pdb_hierarchy.atoms() pdb_atoms.reset_i_seq()