l to Ralf 042910: #My original idea was to create some code that could automatically write restraints for crystallographic refinement. I thught that basepairs could be easily identified just by measuring the distances between th bases of pairs involved. You provided a script that would give me a list of atom to atom distances to use for this search, I call these the "entries". I soon realized, however, that it was much more difficult than just identifying these contacts. There are about 30 different types of basepairs and all of them have to be identified. The first part of the program sorts out the output lines from your script into a master list that contains one sublist/basepair type. Initially, an cutoff of 3.0A is used. Distance information about pairs of atoms that are consistent with the basepair geometry are stored in these sublists. The distance information can be consistent with several different basepair geometries but this is sorted out later. To do this the program finds all the entries for a particular basepair geometry that involve the same pair of bases. A minimum of 2 different entries are required to define a "candidate basepair" and if 3 entries are found that involve the same 2 bases, the basepair is considered legitimate wit high confidence. -By using a small molecule, like tRNA, I soon realized that there was an importan problem that would prevent the program from identifying real basepairs, i.e that the distances between basepair-forming atoms of stacking bases are impossible to weed out by this method because they are in the same range a the real basepair interactions. This is particularly important in the case of basepairs with only 2 bond or 3-bond basepairs for which only 2 bonds are identified at the current cutoff but it may also affect the assignement of 3-bond basepairs with all bonds identified. In fact, I have seen some such basepairs being mistakenly identified between stacking bases. To minimize this problem created by the"stacking noise", I decided to first find out wich bases are involved in more of than one possible basepair. In the program I called this "the list of bases involved in multiple interactions" but this name is not totally accurate and I was trying to avoid it towards the end. Regardless of the name, this part of the program identifies all this noise and prevents all these bases from entering the pool of basepair candidates in the initial assignment. 3-bond basepairs with all bonds identified are allowed to go to the next step, though, beacause the identificaton of 3 bonds raises the confidence over the assignment. Actually, if you don´t do allow 3-bond basepairs with all bonds identified at the initial stages of identification you may not detect a single basepair and you need a few basepairs identified with confidence to prime the program. Once the program has identified some basepairs with certan confidence, these baseairs are removed from the the list of groups bases involved in "multiple interactions" and new basepairs are identified from that list by 2 different criteria: 1) by "elimination", if after the removal of basepairs from the list a single pair of bases is left within a group, they are likely to be a real basepair. 2) by what I call the "continuity of helicity criterion" by which a new basepair could be identifiedif it is contiguous to an existing basepair. #Another complication that I found early on is that the distance between atoms involved in basepairs deviate from the ideal distances by quite a bit. Therefore, it is impossible to use a single cutoff to identify all basepairs. If the cutoff distance is low, 3.0A, you will beable to identify a few basepairs with high confidence but you will miss a lot of basepairs with bonds longer than cutoff. If you choose a very high cutoff distance, e.g. 3.8A you will definitely find all the entries related o basepairs but you will not be able to weed out the ones that do not link basepairs. After playing with the program, I realized that I could get very accurate predictions by starting with a low distance cutoff,followd by new cycles dentification and weeding out perfomed at higher cutoffs. #Once I reached this point the program was already pretty complicated but it did what I was asking it to do, to identify the all the basepairs in a tRNA molecule. However, i had some indication then and more later on with larger molecules, that there were some problems. One of this problems was the the program would identify some FALSE basepairs due to the aforementioned noise from stacking bases. These arise in the form of two basepairs that share one base. To sort them out I created a routine in which these base-sharing basepair candidates are evaluated in terms of 3 criteria: 1) whether they can be use to continue a pre-existing helical segment, 2) the cutoff at which they were assigned, and 3) the number of bonds involved. This routine works pretty well but, when tested with 16S rRAN, I found that it cannot accurately predict one helix with "loose secondary structure". Again, stacking bases and real basepair bases cannot be told apart in this region. I must say, however that the program can find the true basepairs but they are in the list of FALSE basepairs. In summary, the program outputs a list of what it thinks are the true basepairs and they almost all are. The program also outputs a list of what it thinks are te FALSE basepairs and, again most of them are. Inspection of this list, however, may reveal places where basepair identification is problematic. If it is determined that the program cannot do any better than this, the user shoud inspect the output and search for possible sites where there are problems. Should any be detected, theuser should have the choice to manually enter pre-assigned basepairs as arguments to the program. I haven´t done this yet. #This failure to sort out FLASE and TRUE basepairs in some regions made me think about using the P-to-P distance to increase the accuracy of basepair prdiction. I haven´t been able to test it yet to completion but I have implemented code lines to deal with entries carrying these distances. To do this, I had to increase the cutoff distance in your original script to 22A. Now, this makes me think that the version of the program that I have submitted may have this 22A cutoff in your part of the script. This makes the program way too slow even with the tRNA but it is the only way so far to get the P-P lines. Sorry, I forgot to mention that. If you reduce the cutoff to 5 it is much faster but the program will probably crash because the P lines are missing. I had a previous version without the P-P identifying code lines that I meant to upload, but I forgot. Anyway, after adding the P-P detection capability, I also added some simple statistic calculations to get an idea of how much the P-P distance varies between the GOOD basepairs and the FALSE basepairs in the problematic helix. From these calculations, I think that it could work but I haven´t done any more work since submission. #I think that the program could be used to get a first approximation to the quality of the RNA helical parameters in atomic structures. I think that it could go a lot further than that towards a detailed account of secondary and tertiary structure parameters for nucleic acids. In fact, once I implement the P-P istance as a resolving factor for coflictive basepairs, I was thinking that the next step would be to attempt the detection of tertiary contacts.