Compare secondary structure annotations numerically

There are several methods to determine the secondary structure of macromolecules. This script shows how to automatically determine secondary structure using several methods with cctbx tools and how to numerically compare the result.
Author: Pavel V. Afonine

Code

The following script reads in a model file and numerically compares the secondary structure annotation:

from __future__ import absolute_import, division, print_function
import sys
import mmtbx.secondary_structure
from scitbx.array_family import flex
from libtbx.utils import null_out
from iotbx.data_manager import DataManager

def match_score(x,y):
  assert x.size() == y.size()
  match_cntr = 0
  for x_,y_ in zip(x,y):
    if(x_==y_): match_cntr+=1
  return match_cntr/x.size()

def get_ss(hierarchy,
           sec_str_from_pdb_file=None,
           method="ksdssp",
           use_recs=False):
  if(use_recs): params = None
  else:
    params = mmtbx.secondary_structure.manager.get_default_ss_params()
    params.secondary_structure.protein.search_method=method
    params = params.secondary_structure
  ssm = mmtbx.secondary_structure.manager(
    pdb_hierarchy         = hierarchy,
    sec_str_from_pdb_file = sec_str_from_pdb_file,
    params                = params,
    log                   = null_out())
  alpha = ssm.helix_selection()
  beta  = ssm.beta_selection()
  assert alpha.size() == beta.size() == hierarchy.atoms().size()
  annotation_vector = flex.double(hierarchy.atoms().size(), 0)
  annotation_vector.set_selected(alpha, 1)
  annotation_vector.set_selected(beta, 2)
  return annotation_vector

def run(args):
  dm = DataManager()                    #   Initialize the DataManager and call it dm
  dm.set_overwrite(True)                #   tell the DataManager to overwrite files with the same name
  model_filename = args[0]              #   Name of model file
  model = dm.get_model(model_filename)  #   Deliver model object with model info
  pdb_hierarchy = model.get_hierarchy() #   Get hierarchy object
  sec_str_from_pdb_file = model.get_ss_annotation()
  # get secodary structure annotation vector from HELIX/SHEET records (file header)
  print('Running secondary structure annotation...')
  v1 = get_ss(
    hierarchy             = pdb_hierarchy,
    sec_str_from_pdb_file = sec_str_from_pdb_file)
  # get secodary structure annotation vector from method CA atoms
  v2 = get_ss(hierarchy = pdb_hierarchy, method = "from_ca")
  # secodary structure annotation vector from KSDSSP
  v3 = get_ss(hierarchy = pdb_hierarchy, method = "ksdssp")
  #
  print()
  print("CC REMARK vs from_ca:", flex.linear_correlation(x = v1, y = v2).coefficient())
  print("CC REMARK vs ksdssp:", flex.linear_correlation(x = v1, y = v3).coefficient())
  print("CC from_ca vs ksdssp:", flex.linear_correlation(x = v3, y = v2).coefficient())
  print()
  print("match REMARK vs from_ca:", match_score(x = v1, y = v2))
  print("match REMARK vs ksdssp:", match_score(x = v1, y = v3))
  print("match from_ca vs ksdssp:", match_score(x = v3, y = v2))

if __name__ == '__main__':
  run(args=sys.argv[1:])

Overview

The script consists of several parts:

Imports: At the beginning of the script are import statements. They are used to import the modules needed for a task.

Main function: The run() function is used to read the input file, to call the helper functions and to print the result to standard output.

Helper functions: match_score() and get_ss(): To avoid duplicating code, some part of the calculations are put into functions. The get_ss() function runs automatic secondary structure annotation, then returns a 1dimensional array representing the annotation. This vector is a linear array of length equal to the number of atoms in the model. It is initialized with zeros. Using the information from the SS annotation, the array elements are set to a non-zero number for helices and to a different number for sheets.

The bottom lines represent best practice. They enable the script to be imported and used from other Python scripts.

Imports

The script requires the following imports:

run() function

In the beginning of the run() function, we read in a PDB file using the DataManager and create a model object as well as a hierarchy object:

Model files from the PDB usually have records with secondary structure annotation. Let's get the records from model:

Then we get the annotation vectors from the helper function get_ss(), using three different methods:

v1 represents the annotation vector based on the records in the model file. v2 is obtained from the annotation method "from_ca", which uses CA positions to find out if a residue belongs to secondary structure element. The annotation vector v3 relies on the the KSDSSP method.

The last lines print the results. With the annotation vectors, we can calculatie correlation coefficients using the method flex.linear_correlation(x = v1, y = v2).coefficient(). We can also calculate the percentage of similarity of the vectors, which is done in the match_score() function.

Helper function get_ss()

In the get_ss() function, we calculate the annotation vectors. To do this, we first obtain the default parameters for the secondary structure module:

The params object is a container for the parameters used in the mmtbx.secondary_structure.manager. As we want to get the annotation via different methods, we change the parameter params.secondary_structure.protein.search_method.

Then we get the secondary structure manager object:

We get the selections for helices and sheets with the helix_selection() and beta_selection() methods:

These selections are flex bool arrays. The length of the array corresponds to the number of atoms. We then make an annotation vector that has the same length. Its values are 1 if the atom is part of a helix, 2 if the atom is part of a sheet and 0 if the atom is not part of a secondary structure element.

We can then use this vector to compare the annotations numerically, as done with flex.linear_correlation(x = v1, y = v2).coefficient() and the match_score() function.

Helper function match_score()

The match_score() function calculates how many atoms have the same annotation and then divides the result by the number of atoms. This gives therefore a percentage of how many atoms of the model have the same secondary structure annotation.

Note the assertion in the first line, which makes sure that the arrays x and y have the same length.

Example output

You can run the script via the command line:

python script.py my_model.pdb

For example, if you run the script with model 1us0, the output will be the following (you can conveniently fetch a PDB model with "phenix.fetch_pdb 1us0"):

CC REMARK vs from_ca: 0.803945216619
CC REMARK vs ksdssp: 0.869800857769
CC from_ca vs ksdssp: 0.727252951902

match REMARK vs from_ca: 0.91220440337
match REMARK vs ksdssp: 0.916009785268
match from_ca vs ksdssp: 0.864908942647

You can get some more information about the annotation if you remove the parameter log = null_out() in the generation of the SS manager:

(The default for log is the standard output.)

Then we'll get the additional output:

    Secondary structure from input PDB file:
      14 helices and 2 sheets defined
      46.5% alpha, 8.6% beta
      0 base pairs and 0 stacking pairs defined.
  running find_ss_from_ca liberal...
    Secondary structure from input PDB file:
      12 helices and 3 sheets defined
      47.5% alpha, 13.7% beta
      0 base pairs and 0 stacking pairs defined.
  running ksdssp...
    Secondary structure from input PDB file:
      14 helices and 3 sheets defined
      39.5% alpha, 7.6% beta
      0 base pairs and 0 stacking pairs defined.