An experimental measurement can be marked as an “error”. Such records are highlighted with a red background and indicate a possible problem. The system allows users to manually mark a record as an error if they believe there is a mistake. In this case, the user should provide an explanation of the problem in the comment or discussion field related to this record. The OCHEM system can also automatically mark records as erroneous if they do not comply with the system rules. Namely, a record is automatically marked as an error if:
Another quality indicator is the “to be verified” flag. This flag signals that the record has been introduced from a referencing article (e.g., a benchmarking or methodological article) and should be verified against the original publication. This flag can be set either manually or automatically by the system (e.g., in case of batch data upload; see the “Batch upload” section for details).
To ensure data consistency, it is essential to avoid redundancy in the database. Thus, there is a need for strict rules for the definition of duplicates. In OCHEM, two experimental records of a physicochemical or biological property are considered to be duplicates if they are obtained for the same compound under the same conditions, have the same measured value (with a precision of up to 3 significant digits) and are published in the same article. We refer to these records as strong duplicates, as opposed to weak duplicates, for which only part of the information is the same. The OCHEM database does not forbid strong duplicates completely, but forces all the duplicates (except for the record introduced first) to be explicitly marked as errors. This ensures that there are no strong duplicates among the valid (i.e., non-error) records.
The uniqueness of chemical compounds is controlled by special molecular hashes, referred to as InChI-Keys [19]. Namely, for the determination of duplicated experimental measurements, two chemical structures are considered the same if they have identical InChI-Keys.
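The strong-duplicate rule can be illustrated with a minimal Python sketch. It is not the OCHEM implementation: the `Record` class, its field names, and the helper functions are hypothetical, and the sketch assumes that records arrive in the order in which they were introduced and that the property name is part of the comparison.

```python
from dataclasses import dataclass
from math import floor, log10


def round_sig(value: float, digits: int = 3) -> float:
    """Round a value to the given number of significant digits."""
    if value == 0:
        return 0.0
    return round(value, digits - 1 - floor(log10(abs(value))))


@dataclass
class Record:
    # Hypothetical record layout; field names are assumptions for illustration.
    inchi_key: str      # compound identity (two structures match if keys match)
    prop: str           # measured property, e.g. "logP"
    value: float        # measured value
    conditions: tuple   # experimental conditions as a hashable tuple
    article_id: str     # publication the value was taken from
    is_error: bool = False


def duplicate_key(rec: Record) -> tuple:
    """Same compound, property, conditions, value (3 sig. digits) and article."""
    return (rec.inchi_key, rec.prop, rec.conditions,
            round_sig(rec.value), rec.article_id)


def mark_strong_duplicates(records: list) -> list:
    """Flag every strong duplicate except the first-introduced record."""
    seen = set()
    for rec in records:  # assumed to be sorted by introduction order
        key = duplicate_key(rec)
        if key in seen:
            rec.is_error = True
        else:
            seen.add(key)
    return records
```

With this sketch, two records for the same compound and article with values 1.2341 and 1.2344 collapse to the same key (both round to 1.23), so the second one is flagged, while the same value from a different article is left untouched.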
OCHEM allows weak duplicates (for example, completely identical experimental values published in different articles) and provides facilities to find them. Moreover, the modeling process always automatically ensures that all training-set records of the same compound appear in only one fold of the N-fold cross-validation.
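The fold constraint described above can be sketched as a compound-grouped split. The function below is a hypothetical illustration, not OCHEM's actual procedure: it assumes each training record carries an InChI-Key and uses a simple greedy balancing strategy (largest compound groups first, each into the currently smallest fold). The idea is similar to what scikit-learn offers as `GroupKFold`.

```python
from collections import Counter


def grouped_folds(inchi_keys: list, n_folds: int) -> list:
    """Return one fold index per record; records of one compound share a fold."""
    counts = Counter(inchi_keys)
    fold_sizes = [0] * n_folds
    fold_of_compound = {}
    # Place the largest compound groups first, always into the smallest fold.
    # Ties are broken by key so the assignment is deterministic.
    for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of_compound[key] = target
        fold_sizes[target] += count
    return [fold_of_compound[key] for key in inchi_keys]
```

Because the fold is chosen per compound rather than per record, weak duplicates of a compound can never end up in both the training and the validation part of the same cross-validation round.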
Each record has a colored dot indicating the origin of the data. Green dots indicate “original records” from publications with a description of experimental protocols; these are usually the publications where the property was originally measured (original data). Users can verify experimental conditions and experiments by reading these articles. These are the most reliable records in the database. The weak duplicates of original records have magenta dots. The other records have red dots and originate from articles that re-use the original data but for which the original records are not stored. These are frequently methodological QSAR/QSPR studies. The original records can be easily filtered out by checking a corresponding box in the compound property browser. Another filter, “primary records”, eliminates all weak duplicates except the record with the earliest publication date.
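The effect of the “primary records” filter can be sketched as follows. This is only an illustration under stated assumptions: the text does not define exactly which fields must match for two records to count as weak duplicates, so the grouping key used here (compound, property, and value) is an assumption, as are the dictionary field names.

```python
from datetime import date


def primary_records(records: list) -> list:
    """Keep, per group of weak duplicates, the earliest-published record."""
    earliest = {}
    for rec in records:
        # Assumed weak-duplicate key: same compound, property and value.
        key = (rec["inchi_key"], rec["property"], rec["value"])
        best = earliest.get(key)
        if best is None or rec["published"] < best["published"]:
            earliest[key] = rec
    return list(earliest.values())


# Usage example with made-up records.
example = [
    {"inchi_key": "KEY-A", "property": "logP", "value": 1.23,
     "article": "review-2005", "published": date(2005, 6, 1)},
    {"inchi_key": "KEY-A", "property": "logP", "value": 1.23,
     "article": "original-1998", "published": date(1998, 3, 15)},
    {"inchi_key": "KEY-B", "property": "logP", "value": 2.50,
     "article": "original-2001", "published": date(2001, 1, 10)},
]
kept = primary_records(example)
```

Here the 2005 record for `KEY-A` is dropped in favor of the 1998 one, and the single `KEY-B` record is kept unchanged.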