Mutagenicity (Ames test)

Dataset profile

The Ames mutagenicity data set was published in [Sushko et al. Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set] [Hansen et al. Benchmark data set for in silico prediction of Ames mutagenicity. J. Chem. Inf. Model., 2010, 50 (12), pp 2094–2111].

The Ames test relies on the determination of the mutagenic effect of a given compound on histidine-dependent strains of Salmonella typhimurium. Thus, the measurable mutagenic ability of a compound may signal its potential carcinogenicity. The Ames test can be used with different bacterial strains and can be performed with or without metabolic activation using liver cells. For this study, all such diverse data were pooled together. Under that approach, a molecule is considered active if it demonstrates mutagenic activity for at least one strain.
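The pooling rule described above can be sketched as follows. This is only an illustration of the "active in at least one strain" rule; the record layout, the compound identifiers, and the function name `pool_ames_labels` are assumptions made for the example and are not taken from the data set itself.

```python
from collections import defaultdict


def pool_ames_labels(records):
    """Collapse per-assay outcomes into one label per compound.

    `records` is an iterable of
    (compound_id, strain, with_activation, is_mutagenic) tuples.
    This layout is hypothetical and chosen only for the example.
    """
    pooled = defaultdict(bool)  # every compound starts as inactive (False)
    for compound_id, _strain, _with_activation, is_mutagenic in records:
        # One positive result, in any strain and under any activation
        # condition, is enough to mark the compound as active.
        pooled[compound_id] = pooled[compound_id] or bool(is_mutagenic)
    return dict(pooled)


# Illustrative records only; these are not entries from the benchmark set.
records = [
    ("cpd-1", "TA98", False, False),
    ("cpd-1", "TA100", True, True),
    ("cpd-2", "TA98", False, False),
    ("cpd-2", "TA100", False, False),
]
labels = pool_ames_labels(records)
print(labels)
```

Here `cpd-1` is labeled active because one of its two assays is positive, while `cpd-2` stays inactive.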

Thus, considering that the benchmark set molecules were tested with different strains, there may be a significant variance in results. Moreover, different authors used different thresholds to decide whether a given molecule is active or not. As shown in the Results and Discussion section, we estimated the intra- and interlaboratory accuracies of measurements in the Ames mutagenicity data set to be 94% and 90%, respectively.

The initial data set was randomly divided into training and external test sets. The training set contained 4361 compounds, including 2344 (54%) mutagens and 2017 (46%) nonmutagens. The external test set contained 2181 compounds (1/3 of the initial set), including 1172 (54%) mutagens and 1009 (46%) nonmutagens. These data sets were used for the 2009 Ames mutagenicity challenge, where the external test set was given to the participants for “blind predictions”.
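As a rough illustration of a random 2/3 to 1/3 division, the sketch below splits 6542 placeholder indices (4361 + 2181, the set sizes quoted above) and recomputes the reported class percentages. The helper `random_split`, the seed, and the shuffling procedure are assumptions; the source does not describe how the original split was generated.

```python
import random


def random_split(items, test_fraction, seed=0):
    """Shuffle `items` and hold out `test_fraction` of them as a test set.

    Hypothetical helper; the seed is arbitrary and only makes the
    example reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


n_total = 4361 + 2181  # training + external test set sizes from the text
train, test = random_split(range(n_total), test_fraction=1 / 3)
print(len(train), len(test))

# Class balance recomputed from the counts given in the text.
train_mutagen_pct = round(100 * 2344 / 4361)
test_mutagen_pct = round(100 * 1172 / 2181)
print(train_mutagen_pct, test_mutagen_pct)
```

Both subsets have the same rounded share of mutagens, which is consistent with the percentages quoted for the two sets.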

Data preprocessing

All chemical 3D structures were cleaned using the OCHEM cleaning protocol. The standardization was performed in OCHEM.
All salt counterions were removed and the resulting ions were neutralized.

Descriptors
This model was built using EState descriptors (electrotopological EState indices) according to the OCHEM implementation.

Validation
The model was built using 5-fold cross-validation together with an external validation set.
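For readers unfamiliar with the procedure, a minimal sketch of how 5-fold cross-validation partitions a training set is given below. It covers only the index bookkeeping. The function `k_fold_indices`, the seed, and the interleaved fold assignment are assumptions for illustration and are not the OCHEM implementation.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Hypothetical helper: each sample lands in exactly one validation fold,
    and the other k - 1 folds form the matching training part.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 4361 is the training set size quoted in the data set profile.
splits = list(k_fold_indices(4361, k=5))
for training, validation in splits:
    print(len(training), len(validation))
```

Each compound is predicted once by a model that did not see it during training, which is what makes the cross-validated statistics comparable to those of the external set.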

Statistical parameters

Prediction accuracy
The basic prediction accuracy parameters according to the 5-fold cross-validation procedure are:

Data Set	Records	Accuracy	Balanced accuracy	MCC	AUC
Training set	4359	77.7% ± 0.6	77.5% ± 0.6	0.55 ± 0.01	0.854 ± 0.01
Test set	2181	79.6% ± 0.8	79.5% ± 0.9	0.59 ± 0.02	0.875 ± 0.01
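As a reminder of how three of the reported statistics are defined, the sketch below computes accuracy, balanced accuracy, and the Matthews correlation coefficient (MCC) from a binary confusion matrix. The counts in the example are invented for illustration and are not the confusion matrix of this model. AUC is left out because it requires the continuous prediction scores rather than the counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, balanced accuracy and MCC for a binary classifier."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)  # fraction of mutagens recognised
    specificity = tn / (tn + fp)  # fraction of nonmutagens recognised
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # MCC is conventionally set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Invented counts, used only to exercise the formulas.
metrics = classification_metrics(tp=80, tn=70, fp=30, fn=20)
print(metrics)
```

Balanced accuracy averages the per-class recognition rates, so it stays informative when the classes are unequal in size. For a nearly balanced set such as this one it is expected to be close to the plain accuracy, as in the table.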


Applicability domain

The prediction accuracy is estimated using the PROB-STD distance to model and sliding-window-based accuracy averaging. A detailed technical description of this methodology can be found in the doctoral thesis [Sushko. Applicability Domain of QSAR models. Doctoral thesis. 2011. Technical University of Munich.]. The thesis can be downloaded at http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:bvb:91-diss-20110301-1004002-1-2
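The general idea of sliding-window accuracy averaging can be sketched as below: predictions are sorted by their distance to model, and the local accuracy is the fraction of correct predictions inside a window centred on each one. This is a simplified sketch under stated assumptions. It does not compute PROB-STD itself, the distance values are taken as given, and the function name, window size, and edge handling are hypothetical. The procedure actually used is the one described in the cited thesis.

```python
def sliding_window_accuracy(distances, correct, window=3):
    """Estimate accuracy as a function of distance to model (DM).

    `distances` holds one precomputed DM value per prediction and
    `correct` holds True/False for whether that prediction was right.
    Returns (dm, local_accuracy) pairs ordered by increasing DM.
    Windows are truncated at both ends of the sorted list.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    sorted_dm = [distances[i] for i in order]
    sorted_ok = [1 if correct[i] else 0 for i in order]
    half = window // 2
    curve = []
    for pos in range(len(order)):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        curve.append((sorted_dm[pos], sum(sorted_ok[lo:hi]) / (hi - lo)))
    return curve


# Toy values: predictions closer to the model tend to be correct.
distances = [0.05, 0.40, 0.10, 0.30, 0.20]
correct = [True, False, True, False, True]
curve = sliding_window_accuracy(distances, correct, window=3)
for dm, acc in curve:
    print(dm, acc)
```

A new compound with a given distance to model can then be assigned the local accuracy observed at a similar distance, which is how a distance value is turned into an expected prediction accuracy.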