MolPrint (aka MolPrint 2D) descriptors[1,2] are a particular type of circular fingerprint that employs Sybyl MOL2 atom types. More specifically, they are based on counts of MOL2 atom types around each heavy atom of the molecule. In contrast to structural keys (such as MACCS keys), they do not draw features from a limited set of structural fragments. Rather, they enumerate all atom environments present in a molecule. MolPrint 2D descriptors are similar to SciTegic's (Pipeline Pilot) extended-connectivity fingerprints (ECFP), but MolPrint 2D features are not hashed.[5] The implementation of MolPrint 2D used in OCHEM uses the atom types literally as they appear in a MOL2 file, i.e., an aromatic carbon is encoded as "C.ar", an sp2-hybridized oxygen atom as "O.2", etc.
For each heavy atom, all neighboring atoms at a given number of bonds away are tallied and encoded as a string. Such a string always starts with the heavy atom C at the center of the feature, followed by triples of the form D-T-N, where D is the distance in bonds from the central atom (D in {1, 2, …}), T is the atom type (a valid Sybyl MOL2 atom type), and N is the number of atoms of type T found at distance D from the central atom C. The central atom and all triples are separated by semicolons. Overall, this results in feature strings of the form: C;D-T-N;D-T-N;D-T-N;… In practice, it was found that values for D up to 3 should be considered for descriptor generation, with D=2 the most commonly employed. The higher the value for D, the more specific the features become by nature of their construction.
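The C;D-T-N;… layout can be made concrete with a small parser. This is only a sketch: the names `Triple` and `parse_feature` are hypothetical and not part of OCHEM or MolPrint, and it assumes atom types never contain a semicolon. The input string is the worked example from this page.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    distance: int   # D: number of bonds from the central atom
    atom_type: str  # T: Sybyl MOL2 atom type, e.g. "C.ar"
    count: int      # N: atoms of type T found at distance D


def parse_feature(feature: str) -> Tuple[str, List[Triple]]:
    """Split a feature string 'C;D-T-N;D-T-N;' into its central atom and triples."""
    # The trailing semicolon yields an empty field, which is dropped here.
    fields = [f for f in feature.strip().split(";") if f]
    if not fields:
        raise ValueError("empty feature string")
    central = fields[0]
    triples = []
    for field in fields[1:]:
        # D is everything before the first hyphen, N everything after the last.
        d, rest = field.split("-", 1)
        t, n = rest.rsplit("-", 1)
        triples.append(Triple(int(d), t, int(n)))
    return central, triples


central, triples = parse_feature("C.ar;1-C.ar-2;1-C.co2-1;2-C.ar-2;2-O.co2-2;")
print(central)
for triple in triples:
    print(triple)
```

Splitting on the first and last hyphen, rather than on every hyphen, keeps the parser tolerant of an atom type that might itself contain a hyphen.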
A feature that would be generated for the atom marked in bold in this figure (the central atom for this feature) would be described as follows:
Central atom: C.ar
Distance of one bond from C: two times C.ar => 1-C.ar-2; one time C.co2 => 1-C.co2-1;
Distance of two bonds from C: two times O.co2 => 2-O.co2-2; two times C.ar => 2-C.ar-2;
The final feature for the above example would be the concatenation of the central atom and all the triples: C.ar;1-C.ar-2;1-C.co2-1;2-C.ar-2;2-O.co2-2;
For each distance D, the triples are ordered alphabetically, so 1-C.ar-2 would come before 1-F-2 but after 1-Br-1. In the example above, 2-C.ar-2 comes before 2-O.co2-2.
This procedure is repeated for every heavy atom in the molecule.
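The whole procedure can be sketched as a breadth-first search from each heavy atom. This is a minimal illustration, not the OCHEM implementation. Several things are assumptions: the function name `atom_environments`, the input format (a list of MOL2 atom types plus an adjacency map over heavy atoms only), and the use of plain string sorting for the alphabetical order. The test molecule is a benzoate anion. It was chosen because it is consistent with the triples in the worked example; the original figure is not available here.

```python
from collections import Counter, deque
from typing import Dict, List


def atom_environments(types: List[str],
                      bonds: Dict[int, List[int]],
                      max_depth: int = 2) -> List[str]:
    """Return one feature string per heavy atom, in atom-index order."""
    features = []
    for start in range(len(types)):
        # Breadth-first search gives the bond distance to every atom
        # within max_depth bonds of the central atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            if dist[atom] == max_depth:
                continue  # do not expand beyond the outermost layer
            for nbr in bonds.get(atom, []):
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        parts = [types[start]]
        for d in range(1, max_depth + 1):
            counts = Counter(types[a] for a, k in dist.items() if k == d)
            # Within each distance, triples are sorted by atom type.
            parts += [f"{d}-{t}-{n}" for t, n in sorted(counts.items())]
        features.append(";".join(parts) + ";")
    return features


# Heavy atoms of benzoate: ring carbons 0-5, carboxylate carbon 6, oxygens 7-8.
types = ["C.ar"] * 6 + ["C.co2", "O.co2", "O.co2"]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7), (6, 8)]
bonds = {i: [] for i in range(len(types))}
for a, b in edges:
    bonds[a].append(b)
    bonds[b].append(a)

features = atom_environments(types, bonds)
print(features[0])  # C.ar;1-C.ar-2;1-C.co2-1;2-C.ar-2;2-O.co2-2;
```

Atom 0 is the ring carbon bearing the carboxylate group, so its feature reproduces the string derived in the worked example.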
Because the descriptors are binary, MolPrint descriptors are more amenable to certain types of modeling methods (such as Bayes or k-NN methods) than to, for example, neural network models. The resulting models are relatively easy to interpret, since every feature corresponds roughly to a functional group (though without explicit information about the bond order between atoms).
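Since each feature is either present or absent, a molecule can be represented as a set of feature strings. The sketch below compares two such sets with the Tanimoto coefficient, a common similarity measure for binary fingerprints and a natural distance for k-NN methods. The choice of Tanimoto is an assumption; this page does not say which measure OCHEM uses. The two feature sets are toy data invented for illustration, truncated at D=1.

```python
from typing import Set


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto coefficient: shared features divided by distinct features."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical feature sets for two molecules (illustrative only).
mol_a = {"C.ar;1-C.ar-2;", "C.ar;1-C.ar-2;1-C.co2-1;", "O.co2;1-C.co2-1;"}
mol_b = {"C.ar;1-C.ar-2;", "C.ar;1-C.ar-2;1-C.3-1;"}

print(tanimoto(mol_a, mol_b))  # 1 shared feature out of 4 distinct: 0.25
```

Using sets makes the binary nature explicit: a feature that occurs for several atoms of the same molecule is counted only once.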
MolPrint descriptors have been used successfully in virtual screening[3] and ligand-target prediction[4], where they have been shown to capture a large amount of the information relating molecular structure to bioactivity against a protein target.
[1] A. Bender, H.Y. Mussa, R.C. Glen and S. Reiling. Molecular similarity searching using atom environments, information-based feature selection, and a naive bayesian classifier. Journal of Chemical Information and Computer Sciences, 2004, 44, 170-178. - http://dx.doi.org/10.1021/ci034207y
[2] A. Bender, H.Y. Mussa, R.C. Glen and S. Reiling. Similarity searching of chemical databases using atom environment descriptors: evaluation of performance. Journal of Chemical Information and Computer Sciences, 2004, 44, 1708-1718. - http://dx.doi.org/10.1021/ci0498719
[3] R.C. Glen, A. Bender, C.H. Arnby, L. Carlsson, S. Boyer and J. Smith. Circular fingerprints: Flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs 2006, 9, 199-204. - http://www.biomedcentral.com/content/pdf/cd-653859.pdf
[4] Nidhi, M. Glick, J. W. Davies and J. L. Jenkins. Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases. J. Chem. Inf. Model., 2006, 46, 1124–1133. - http://pubs.acs.org/doi/abs/10.1021/ci060003g
[5] Rogers and Hahn. Extended-Connectivity Fingerprints. J Chem Inf Model. 2010 May 24;50(5):742-54.