# Copyright (c) 2013-2020, SIB - Swiss Institute of Bioinformatics and
#                          Biozentrum - University of Basel
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


You can get rotamer libraries from several sources.

#############################
# The Dunbrack 2010 Library #
#############################

You can apply for a licence here: http://dunbrack.fccc.edu/bbdep2010/
The library is provided in several formats. The ones we can read are
SimpleOpt1 and SimpleOpt2.
Adapt the input path in CreateDunbrackLibrary.py and run the script to read
the provided file and dump it as a binary that can be read by ProMod3.


#################################
# The ProMod3 Rotamer Libraries #
#################################

ProMod3 comes with basic functionality to generate backbone dependent
and backbone independent rotamer libraries. Both types of rotamer libraries are
available in default binary versions and can directly be loaded by the user.
The following sections guide you through the process of building such libraries. 
Feel free to adapt any step to get your custom library.


####################
# FETCH STRUCTURES #
####################

No matter whether you want to generate a backbone dependent or a backbone
independent library, you need data:

ost fetch_data.py

This script uses a PISCES file downloaded from 
http://dunbrack.fccc.edu/Guoli/pisces_download.php
and stores the following information for each residue in a csv file:

 * residue name
 * phi / psi backbone dihedral angles in radians
 * sidechain dihedral angles in radians
 * the configuration of sidechain dihedral angles,
   one of [GAUCHE_MINUS, GAUCHE_PLUS, TRANS, NON_ROTAMERIC, INVALID]

Residues from which you'll get data: 
ARG, ASN, ASP, GLN, GLU, LYS, SER, CYH, CYD, CYS, MET, TRP, TYR, THR, VAL, ILE, 
LEU, CPR, TPR, PRO, HIS, PHE

Special cases: 

  CYH, CYD, CYS:

  CYS contains the data for all cysteines. CYD only contains data from
  cysteines that form a disulfide bond. Criterion: the SG atom of another
  cysteine is within 2.5A of its own SG atom. Cysteines for which this
  criterion is not fulfilled are added to CYH.

  CPR, TPR, PRO:

  PRO contains the data for all prolines. Prolines with a trans-omega 
  backbone dihedral are added to TPR and prolines with a cis-omega
  backbone dihedral are added to CPR.
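
The disulfide criterion can be sketched as follows. This is only an
illustration of the 2.5A rule stated above, not the code in fetch_data.py;
the function name classify_cysteines and the plain coordinate tuples are
assumptions made for this example.

```python
import math

DISULFIDE_CUTOFF = 2.5  # Angstrom, the SG-SG criterion stated above


def classify_cysteines(sg_positions):
    """Label each cysteine as CYD (disulfide bonded) or CYH (free).

    sg_positions: list of (x, y, z) SG coordinates, one per cysteine.
    Hypothetical helper, not part of fetch_data.py.
    """
    labels = []
    for i, pos_i in enumerate(sg_positions):
        # A cysteine counts as bonded if any OTHER SG atom is within the cutoff.
        bonded = any(
            math.dist(pos_i, pos_j) <= DISULFIDE_CUTOFF
            for j, pos_j in enumerate(sg_positions)
            if j != i
        )
        labels.append("CYD" if bonded else "CYH")
    return labels


sg = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(classify_cysteines(sg))
```

Every cysteine would additionally be added to CYS, independent of the label.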

The behaviour of the script can be controlled by setting the documented 
variables at the top of the script. 
Data is only dumped for residues where everything is available, 
e.g. terminal residues without valid phi/psi or residues with missing
sidechain atoms are skipped. 

####################################################
# OPTION ONE: BACKBONE INDEPENDENT ROTAMER LIBRARY #
####################################################

ost do_lib.py <data> <out_file_name>

Loads the csv generated by fetch_data.py, determines a set of rotamers for each 
type of amino acid and dumps a full backbone independent library to disk.

The general approach for each amino acid is to look at each possible 
combination of rotameric configurations and obtain a rotamer by simply 
calculating the mean and standard deviation of all datapoints with
this particular configuration. The probability of that rotamer is the number 
of data points with this configuration divided by the total number of 
datapoints for this amino acid.
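
A minimal sketch of this counting approach is given below. It is not the code
in do_lib.py: the function names and the input layout are made up for this
example, and the use of a circular mean (so that angles near +pi and -pi
average correctly) is an assumption.

```python
import math
from collections import defaultdict


def circular_mean(angles):
    """Mean direction of angles in radians."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)


def angular_diff(a, b):
    """Signed difference a - b wrapped into [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def build_rotamers(datapoints):
    """Return {configuration: (probability, means, stds)} for ONE amino acid.

    datapoints: list of (configuration, chi_angles), where configuration is
    a hashable tuple such as ("TRANS", "GAUCHE_PLUS") and chi_angles is a
    tuple of dihedral angles in radians (hypothetical input layout).
    """
    groups = defaultdict(list)
    for config, chis in datapoints:
        groups[config].append(chis)

    total = len(datapoints)
    rotamers = {}
    for config, members in groups.items():
        means, stds = [], []
        # zip(*members) iterates over chi1 values, then chi2 values, ...
        for chi_values in zip(*members):
            m = circular_mean(chi_values)
            var = sum(angular_diff(v, m) ** 2 for v in chi_values) / len(chi_values)
            means.append(m)
            stds.append(math.sqrt(var))
        # probability = datapoints with this configuration / all datapoints
        rotamers[config] = (len(members) / total, means, stds)
    return rotamers
```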

The situation becomes more difficult for the so-called non-rotameric dihedral
angles.
Non-rotameric dihedral angles are:  ASN(chi2), ASP(chi2), PHE(chi2), TYR(chi2), 
HIS(chi2), TRP(chi2), GLN(chi3) and GLU(chi3).

The approach used here is exactly the same as described for option two
(backbone dependent rotamer library), with the exception that there is no
weighting: each datapoint contributes equally to the mean and the standard
deviation.

##################################################
# OPTION TWO: BACKBONE DEPENDENT ROTAMER LIBRARY #
##################################################

ost do_bb_dep_lib.py <data> <aa_name>

This script uses the csv generated by fetch_data.py (provided as data parameter) 
and the name of any amino acid (provided as aa_name parameter) to generate 
a backbone dependent rotamer library for that amino acid and dumps it as 
<aa_name>.txt.

Some theory (You don't have to read this if you believe we're doing 
             awesome science...):

In order to get weighted distributions etc. we rely on the Mises distribution,
which can be considered a circular gaussian:
p(x,y) = 1.0/(4*pi*pi*i0(kappa)*i0(kappa)) * exp(kappa*(cos(x-x0)+cos(y-y0)))
where i0() is the modified Bessel function of the first kind of order zero and
(x0, y0) is the mean. kappa is a measure of concentration.
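
The density can be evaluated with a few lines of Python. This is a sketch, not
the code used in do_bb_dep_lib.py. It uses cos() for both angles, which is
what the stated normalization and a mean at (x0, y0) imply. The function
names are made up for this example, and the power series for i0() is only
meant for moderate kappa values.

```python
import math


def bessel_i0(x, terms=60):
    """Modified Bessel function of the first kind, order zero (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        # term_k = ((x/2)^k / k!)^2, built up from the previous term
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def mises_density(x, y, x0, y0, kappa):
    """Product of two Mises densities with a shared kappa, angles in radians."""
    norm = 4.0 * math.pi ** 2 * bessel_i0(kappa) ** 2
    return math.exp(kappa * (math.cos(x - x0) + math.cos(y - y0))) / norm
```

For kappa = 0 the density is flat with value 1/(4*pi*pi). For growing kappa
it concentrates around (x0, y0).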

This distribution is used to determine weights that are used in:
  * calculating the probability of a rotamer configuration given the backbone 
    dihedral angles
  * calculating sidechain dihedral angles given the backbone dihedral angles

Every datapoint contributes to a final value with a weight. These weights depend
on the distance to the query point. In case of many data close to the query point,
only close data points should contribute. If the region around the query data 
point is only sparsely populated, data points that are a bit further away should 
also contribute. We achieve this by using position dependent kappa values for 
the underlying Mises distribution.

For large kappas, the Mises distribution is basically a periodic normal
distribution with the relation: sigma*sigma = 1.0/kappa
=> r = sqrt(1.0/kappa) corresponds to the radius of a circle around the mean
that covers approximately 67% of the density.
kappa can be estimated such that this "circle" includes exactly N data points.
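
One way to sketch this estimate: take the distance to the N-th closest data
point as radius r and set kappa = 1.0/(r*r). The function below is a
hypothetical illustration. In particular, measuring the distance as the
Euclidean norm of the wrapped phi/psi differences is an assumption and may
differ from what do_bb_dep_lib.py does.

```python
import math


def angular_diff(a, b):
    """Signed difference a - b wrapped into [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def estimate_kappa(query, data, n_points):
    """Choose kappa so that r = sqrt(1/kappa) reaches the n-th closest point.

    query: (phi, psi) of the bin center, data: list of (phi, psi) tuples,
    all angles in radians.
    """
    if n_points < 1 or n_points > len(data):
        raise ValueError("n_points must be between 1 and len(data)")
    q_phi, q_psi = query
    distances = sorted(
        math.hypot(angular_diff(phi, q_phi), angular_diff(psi, q_psi))
        for phi, psi in data
    )
    radius = distances[n_points - 1]
    if radius == 0.0:
        raise ValueError("radius is zero, kappa would be infinite")
    return 1.0 / (radius * radius)
```

Densely populated regions give a small radius and therefore a large kappa,
sparsely populated regions give a small kappa and a broad distribution.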

We sample the backbone dihedral space with n_bins x n_bins bins, where n_bins 
can be set on top of the script. For each bin, we estimate two things:

  * Estimate the probability of every possible rotamer configuration
  * Estimate mean and standard deviation of every sidechain dihedral angle

Step one: Estimation of Rotamer Probabilities

For every bin, a kappa is estimated as outlined above (the whole circle 
story with N). For amino acids with many dihedral angles, N should be larger 
to guarantee enough sampling. N can be controlled on top of the script by
adjusting the p_threshold_67 parameter.

The probability for each rotamer configuration is then estimated with a 
histogram approach where every possible configuration has one bin. 
Every datapoint adds a value to its corresponding bin. 
This value is the evaluation of the Mises distribution for that data point 
with the query bin as mean and the previously determined kappa. 
Subsequent normalization leads to the final probabilities.
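
The weighted histogram can be sketched like this. The function name and the
data layout are assumptions made for this example. Since the result is
normalized anyway, the constant prefactor of the Mises distribution is left
out.

```python
import math
from collections import defaultdict


def rotamer_probabilities(query, data, kappa):
    """Estimate configuration probabilities for one phi/psi bin.

    query: (phi, psi) of the bin center in radians.
    data: list of (phi, psi, configuration) tuples (hypothetical layout).
    """
    q_phi, q_psi = query
    histogram = defaultdict(float)
    for phi, psi, config in data:
        # Unnormalized Mises weight. Subtracting 2.0 in the exponent scales
        # all weights by the same factor and keeps them in (0, 1].
        exponent = kappa * (math.cos(phi - q_phi) + math.cos(psi - q_psi) - 2.0)
        histogram[config] += math.exp(exponent)
    total = sum(histogram.values())
    return {config: value / total for config, value in histogram.items()}


data = [(0.0, 0.0, "TRANS"), (0.1, 0.0, "GAUCHE_PLUS"), (3.0, 3.0, "GAUCHE_MINUS")]
print(rotamer_probabilities((0.0, 0.0), data, kappa=10.0))
```

With a large kappa, the distant GAUCHE_MINUS data point hardly contributes to
the bin at (0, 0).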

Step two: Estimation of mean and standard deviations of sidechain dihedral
angles for each rotamer configuration:

This is a tougher business, as we have to deal with rotameric dihedral angles
and non rotameric dihedral angles...
Non rotameric dihedral angles are:  ASN(chi2), ASP(chi2), PHE(chi2), TYR(chi2), 
HIS(chi2), TRP(chi2), GLN(chi3) and GLU(chi3) and will be discussed later.

First of all, we again determine our query bin dependent kappa but this time
N only relates to the data points with the same configuration. N can be 
controlled by adjusting the chi_threshold_67 parameter on top of the script.
And again, we determine a weight for each datapoint with the same configuration
based on the Mises distribution. All those datapoints then contribute to a
weighted mean and weighted standard deviation as estimated in the 
GetChiStats function of the do_bb_dep_lib.py script.
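
For illustration, a weighted mean and standard deviation of angles could be
computed as sketched below. This is NOT the GetChiStats function from
do_bb_dep_lib.py. The name weighted_chi_stats and the use of a circular mean
are assumptions made for this example.

```python
import math


def weighted_chi_stats(chis, weights):
    """Weighted circular mean and standard deviation of angles in radians."""
    w_sum = sum(weights)
    if w_sum <= 0.0:
        raise ValueError("weights must sum to a positive value")
    s = sum(w * math.sin(chi) for chi, w in zip(chis, weights))
    c = sum(w * math.cos(chi) for chi, w in zip(chis, weights))
    mean = math.atan2(s, c)
    var = 0.0
    for chi, w in zip(chis, weights):
        # deviation from the mean, wrapped into [-pi, pi)
        diff = (chi - mean + math.pi) % (2.0 * math.pi) - math.pi
        var += w * diff * diff
    return mean, math.sqrt(var / w_sum)
```

The weights would be the Mises based values of the datapoints that share the
rotameric configuration.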

As a last step we have to determine the mean and std values for non rotameric
sidechain dihedral angles. For this we first estimate a full probability density
function (pdf) of chi (the non-rotameric sidechain dihedral angle). 
The pdf is based on a histogram, where we simply add up the Mises based weights 
of all data points with the same rotameric configuration at the according 
histogram position. 
This histogram is then smoothed with a circular gaussian filter with std
non_rotameric_smooth_std and finally normalized. non_rotameric_smooth_std 
can be set on top of the script.
The full non-rotameric chi angle is then sampled in a binned manner, where the 
center of the first bin is determined by the mode of the initial pdf. 
The number of bins can be controlled by the parameter non_rotameric_sampling 
on top of the script.
For each bin we estimate a probability, a mean and a standard deviation
based on the underlying distribution. See the GetNonRotChiStats function
for more details. Given these values we can then determine all 
possible rotamers.
E.g. ASN has one rotameric dihedral angle => 3 possible rotameric 
configurations. Let's say the trans configuration has probability p.
We then do the described sampling of the non-rotameric chi2 angle.
With default parametrization, this gives a total of 12 possible chi2 
configurations, each with a probability p_chi2_n, a mean m_chi2_n and a standard 
deviation std_chi2_n. The probability of a certain rotamer n is then p*p_chi2_n.
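
The non-rotameric procedure (smooth the histogram, sample it in bins starting
at the mode, combine with the probability of the rotameric configuration) can
be sketched as follows. This is not the GetNonRotChiStats function. All names
are made up for this example, the smoothing width is given in histogram bins,
and the period of the angle is passed in explicitly. These are assumptions
that may not match the script.

```python
import math


def smooth_circular(histogram, std_bins):
    """Smooth a periodic histogram with a Gaussian kernel and normalize it."""
    n = len(histogram)
    # min(o, n - o) is the circular distance in bins
    kernel = [math.exp(-0.5 * (min(o, n - o) / std_bins) ** 2) for o in range(n)]
    smoothed = [
        sum(histogram[(i - o) % n] * kernel[o] for o in range(n))
        for i in range(n)
    ]
    total = sum(smoothed)
    return [v / total for v in smoothed]


def sample_non_rotameric(pdf, n_samples, period=2.0 * math.pi):
    """Split a periodic pdf into n_samples bins, the first centered on the mode.

    Returns one (probability, mean, std) tuple per bin, angles in radians.
    Entry i of pdf is assumed to sit at angle i * period / len(pdf).
    """
    n = len(pdf)
    if n % n_samples != 0:
        raise ValueError("len(pdf) must be a multiple of n_samples")
    width = n // n_samples
    step = period / n
    mode = max(range(n), key=lambda i: pdf[i])
    start = mode - width // 2
    result = []
    for k in range(n_samples):
        # unwrapped indices, so that a bin crossing the period stays contiguous
        idx = [start + k * width + j for j in range(width)]
        p = sum(pdf[i % n] for i in idx)
        if p > 0.0:
            mean = sum(pdf[i % n] * i * step for i in idx) / p
            var = sum(pdf[i % n] * (i * step - mean) ** 2 for i in idx) / p
        else:
            # empty bin: fall back to the bin center
            mean = (start + k * width + 0.5 * (width - 1)) * step
            var = 0.0
        result.append((p, mean % period, math.sqrt(var)))
    return result


def expand_rotamers(p_config, chi_samples):
    """Rotamer probability = p (rotameric configuration) * p_chi_n."""
    return [(p_config * p, mean, std) for p, mean, std in chi_samples]


histogram = [0.0, 1.0, 4.0, 1.0] + [0.0] * 8  # toy weighted histogram
pdf = smooth_circular(histogram, std_bins=1.0)
for rotamer in expand_rotamers(0.4, sample_non_rotameric(pdf, 4)):
    print(rotamer)
```

The probabilities of the rotamers generated for one rotameric configuration
sum up to the probability p of that configuration.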

###############################
# DO THIS FOR ALL AMINO ACIDS #
###############################

The file rotamer_lib_creation.cmd contains all commands required to generate the 
rotamers for all amino acids. With a working ProMod3 installation and the csv 
file generated at the very beginning, you can do:
source rotamer_lib_creation.cmd

##############
# READ IT IN #
##############

You should now have a file for every amino acid. You can concatenate them 
with the bash script concatenate_library.sh.
The resulting file can be loaded with ProMod3 and converted to a binary version:

from promod3 import sidechain
lib = sidechain.ReadDunbrackFile("bb_dep_lib.txt")

And then dump it in binary format:
lib.Save("my_awesome_rotamer_library.dat")

This file can be loaded again with:
lib = sidechain.BBDepRotamerLib.Load("my_awesome_rotamer_library.dat")
