RDW - WIKI:WikiProject Chemicals/Chembox validation

Revision as of 00:11, 10 November 2019 by https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Chemicals/Data>Cherkash (Problems found when validating the Excel file)


WikiProject Chemicals and WikiProject Pharmacology are validating the content of the {{Chembox}} and {{Drugbox}} infoboxes. Values in the infobox are compared with values reported in the literature, and when the values match, the current revision is stored in the index for chembox or the index for drugbox, respectively. This is typically done for values that are 'immutable' (e.g., the boiling point of a chemical compound: the boiling point of water under standard conditions is 99.98°C, and there is no plausible reason to suspect it will change).

At the moment, we are verifying the CAS Registry Number ('CASNo' in {{Chembox}}, 'CAS_number' in {{Drugbox}}), ChemSpider ID (ChemSpiderID), Unique Ingredient Identifier (UNII), InChI, KEGG, and ChEMBL by comparison with the data on http://commonchemistry.org (the CAS website), http://www.chemspider.com and http://fdasis.nlm.nih.gov/srs/srs.jsp (for the UNII), as well as with lists that were supplied to us (CAS number, ChemSpiderID, InChI, UNII, ChEMBL and ChEBI) or downloaded from these websites (KEGG, DrugBank). In the meantime, we are trying to add, update and/or check a number of other identifiers (InChI, InChIKey) by comparison with the data on the ChemSpider website, http://www.chemspider.com.

CheMoBot follows changes to these articles and is set up to update the infoboxes. When it detects a change to a value, it changes parameters in the infobox accordingly. The template uses these parameters to show the status of the fields in the box.

Boxes in which the verified values are the same as the values in the verified revision are tagged with a tick at the bottom, and boxes where some of these values have changed are tagged with a cross. The individual identifiers are tagged with a tick or a cross as well. If a box contains changes to these verified fields, it is also categorized in Category:Chemboxes which contain changes to verified fields. Boxes that contain changes to other important fields are categorized in Category:Chemboxes which contain changes to watched fields. For an example, see this vandalism, quickly flagged by CheMoBot.
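The tagging rule described above can be sketched roughly as follows. This is only an illustration, not CheMoBot's actual code: the function name, the dictionary layout, the parameter names other than 'CASNo', and the contents of WATCHED_FIELDS are all assumptions made for the example.

```python
# Sketch only: names and data layout are assumptions, not CheMoBot's code.
VERIFIED_FIELDS = {"CASNo", "ChemSpiderID", "UNII", "InChI", "KEGG", "ChEMBL"}
WATCHED_FIELDS = {"Formula", "MolarMass"}  # hypothetical "other important" fields

CAT_VERIFIED = "Category:Chemboxes which contain changes to verified fields"
CAT_WATCHED = "Category:Chemboxes which contain changes to watched fields"


def tag_box(current, verified):
    """Compare current infobox values with those of the verified revision.

    Returns (mark for the whole box, mark per verified field, categories).
    """
    field_marks = {}
    for name in sorted(VERIFIED_FIELDS):
        if name in verified:  # only fields that were actually verified
            same = current.get(name) == verified[name]
            field_marks[name] = "tick" if same else "cross"

    any_cross = "cross" in field_marks.values()
    categories = []
    if any_cross:
        categories.append(CAT_VERIFIED)
    if any(current.get(f) != verified.get(f) for f in WATCHED_FIELDS):
        categories.append(CAT_WATCHED)

    return ("cross" if any_cross else "tick"), field_marks, categories
```

For instance, a box whose CASNo has been altered relative to the verified revision would get a cross on the box and on CASNo, and would land in the "verified fields" category.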

If you encounter a page with a {{Chembox}} or {{Drugbox}} that shows a cross, please check whether the current value is wrong (in which case it can simply be changed back to the value in the verified revision; the bot will do the rest) or whether there is a mistake in the verified revision (if so, the index may need an update; if you need help with that, please ask the appropriate wikiproject).

Verification – tagging references

CheMoBot adds a template to a _Ref parameter (e.g., for CASNo, CASNo_Ref will be filled with Template:Tlx) when the bot finds the field correct. The first parameter of the template is 'correct' or 'changed', and the box shows a tick or a cross on CASNo accordingly. The second parameter records where the value was verified. As we are at the moment verifying all fields against the CAS commonchemistry.org site, the bot replaces XXX with 'CAS' (i.e., Template:Tlx). When you verify the CASNo against another source, please adapt this parameter accordingly; the bot will try to retain this field throughout. Once there are significantly more verifications against sources other than commonchemistry.org, I will instruct the bot to fill the field by default with Template:Tlx or something similar.
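As a rough illustration of the two-parameter scheme just described, the value of a _Ref parameter could be assembled like this. The template name used here (REF_TEMPLATE) is a hypothetical placeholder, since the real name is not shown on this page, and the helper functions are assumptions for the example rather than the bot's own code.

```python
REF_TEMPLATE = "verifiedref"  # hypothetical placeholder, not the real template name


def ref_value(matches_verified, source="CAS"):
    """Build the wikitext for a _Ref parameter.

    First template parameter: 'correct' or 'changed'.
    Second template parameter: where the value was verified.
    """
    status = "correct" if matches_verified else "changed"
    return "{{" + f"{REF_TEMPLATE}|{status}|{source}" + "}}"


def set_ref(params, field, matches_verified, source="CAS"):
    """Return a copy of the infobox parameters with <field>_Ref filled in."""
    updated = dict(params)
    updated[f"{field}_Ref"] = ref_value(matches_verified, source)
    return updated
```

Passing a different source string is how a verification against a place other than commonchemistry.org would be recorded in this sketch.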

Method of work

Our approach is to start by checking that the CAS registry number and the structure match with the name. This will be used as a foundation upon which we can build a broader validation effort. Once we have the structure verified, we have the formula, and hence the molar mass, and we can also generate other machine representations such as SMILES, InChI and InChIKey.
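Before comparing a CAS registry number with the structure and name, it can help to rule out simple typing errors. CAS numbers end in a check digit: the remaining digits are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the last digit. The function name below is our own. Note that this only catches malformed numbers; it says nothing about whether the number belongs to the compound in the article.

```python
import re


def cas_check_digit_ok(cas):
    """Return True if a CAS number is well-formed and its check digit is valid."""
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if m is None:
        return False
    body = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(m.group(3))


print(cas_check_digit_ok("60-33-3"))    # True
print(cas_check_digit_ok("7732-18-4"))  # False (the valid number is 7732-18-5)
```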

First 1000

After our IRC meeting on January 13, 2009, we used an Excel file to validate the first 1000 entries from the CAS XML file. This is available to project members here, on the password-protected site. Meanwhile, User:Physchim62 validated the inorganics separately, and these can be found in the CAVer file.

The work

We are now beginning to work through the list of "problem articles" found by User:Beetstra, and listed at User:Beetstra/CASFoundCorrect. A description of the process will be added soon.

Notes

  • Different CAS numbers are used for each form of a substance. For example, something simple like alanine will have one CAS# for the D form, another for L, another for "unspecified" and a fourth one for racemic. There would be another four CAS#s for the hydrochloride, four for the (1:1) sulfate, four for the (2:1) sulfate, etc. It is very important that we match the CAS# of the correct form to our Chemboxes!
  • Be aware that CAS uses an unusual system for representing some formulae, which may seem "wrong" to us. Salts such as sodium nitrate are described as HNO3·Na, and organic salts follow a similar system. Do not use such formulae on WP; they are not "wrong", though, since they are merely a representation, not a formal structure. This also results in an incorrect MolarMass in the FW section of the SDF file for salts.
  • For complex chiral structures, such as bleomycin, which may be drawn very differently in WP than in Common Chemistry, I found it best to assign R/S for each center and compare that way. (And yes, Farseer drew bleomycin perfectly!)
  • The CAS No. in a Chembox will receive a green tick (check mark) once Template:Tl is added. This does not happen yet in the Drugbox (there is no change at present), but we hope to enable a similar system there too, if WP:PHARM is in agreement.
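To illustrate the note on CAS-style salt formulae, here is a small sketch of how summing the atoms of HNO3·Na gives a higher molar mass than the conventional formula NaNO3, the difference being the one extra hydrogen. The parser is deliberately minimal (no parentheses, no leading multipliers) and the atomic masses are rounded standard values; it is our own illustration, not how the SDF file was generated.

```python
import re

# Rounded standard atomic masses in g/mol (only the elements needed here).
ATOMIC_MASS = {"H": 1.008, "N": 14.007, "O": 15.999, "Na": 22.990}


def molar_mass(formula):
    """Sum atomic masses for a simple formula; '·' separates components."""
    total = 0.0
    for part in formula.split("·"):
        for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", part):
            total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


print(f"{molar_mass('NaNO3'):.3f}")    # 84.994
print(f"{molar_mass('HNO3·Na'):.3f}")  # 86.002
```

When checking MolarMass for a salt, the value should therefore be compared against the conventional formula rather than the CAS-style representation.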

Fields to check/upload

Chemboxes

Check structure, CAS no., Formula, MolarMass.

Notes:

  • 1. The bot divides the fields into two sets, watched and unwatched. All changes are reported, but the watched fields are the ones we really want to take care of: they contain hard, verifiable data that are very unlikely to change (such as the boiling point of water, the CAS number of benzene, or the number of carbons in glucose). N.B. the list of 'watched' fields may need to be updated.
  • 2. The bot regards an empty field as 'unknown'. It will report changes to this field, but will assign a lower 'warning level' to it.
  • 3. Text between <!-- and --> is a 'comment': it is saved and appears in the edit box, but does not produce visible output.
  • 4. When a 'better' version of a page comes up, change the revid on the index page. If there are two revids for the same page, the bot uses the one closest to the bottom of the index page (the page is parsed top to bottom, and later duplicates replace earlier values).
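The notes above on warning levels and on duplicate revids can be sketched as follows. Everything specific here is an assumption made for illustration: the index format (one "Page title|revid" entry per line), the level names, the reading of "empty" as an empty previous value, and the contents of WATCHED. The bot's real formats may differ.

```python
WATCHED = {"CASNo", "Formula", "MolarMass"}  # hypothetical watched-field list


def warning_level(field, old, new):
    """Classify a change to one infobox field (level names are made up)."""
    if old == new:
        return "none"
    if field not in WATCHED:
        return "report"      # all changes are reported
    if not old.strip():
        return "low"         # empty means 'unknown': lower warning level
    return "high"


def build_index(index_text):
    """Parse the index top to bottom; a later duplicate replaces an earlier one."""
    index = {}
    for line in index_text.splitlines():
        title, sep, revid = line.strip().rpartition("|")
        if not sep or not revid.strip().isdigit():
            continue  # skip blank or malformed lines
        index[title.strip()] = int(revid)
    return index
```

Because a dictionary assignment overwrites the previous value for the same key, the entry closest to the bottom of the index wins without any special handling.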

The workers

Please sign up to work on some of the articles listed at User:Beetstra/CASFoundCorrect. More information later.

The software

Problems found when validating the Excel file

Please note any "to be checked" entries here.

1–100

101–200

201–300

  • Kanamycin One chiral center seems to not match CAS. Are there multiple forms of this? Structure says Kanamycin A.
    • Yes, there are multiple forms (A, B, C, D, X) and several derivatives, but the difference is in the side chains. Fvasconcellos (t·c) 11:48, 10 February 2009 (UTC)
  • Tocopherol One chiral center seems not to match, multiple forms? a-tocopherol, CAS just says tocopherol.
    • There are multiple isomers. File:RRR alpha-tocopherol.png shows the most common isomer. Tim Vickers (talk) 04:28, 10 September 2009 (UTC)
  • Acetylcholine Parent ion, infobox not chembox.
  • Linoleic_acid WP says cis, cis, CAS says trans trans 'linoelaidic acid', the whole world says linoleic acid is 60-33-3 including the spreadsheet and sigma.
    • 60-33-3 appears to refer to all-cis. Fvasconcellos (t·c) 11:52, 10 February 2009 (UTC)
      • This is very strange, it is trans,trans in the union file and cis,cis in the wikichem file (I have been using the union file to verify CAS numbers). I need to look into this. Ambix (talk) 12:47, 12 February 2009 (UTC)
  • Glucose 1-phosphate One chiral center is not specified (should be up to match CAS). (probably a result of copying glucose skeleton, in which this atom is not chiral?).
    • See anomer. It is likely that both forms (alpha and beta-glucopyranoside) are described by this CAS number. --Tweenk (talk) 21:41, 15 November 2009 (UTC)
  • Streptomycin 57-92-1 Seems to be mirror image of WP structure.
  • Tubocurarine 57-94-3 and 57-95-4 Structure is messed up in union file. I can't make sense of it.

301–400

401–500

  • Cholecalciferol: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be R here, and I think it is)
  • Vitamin B12: The structure diagram does not adequately specify the stereochemistry of the Corrin ring
  • Ellman's reagent: no chembox, and needs text cleanup
    • I added a chembox. -- Ed (Edgar181) 19:12, 11 February 2009 (UTC)
  • Sanger's reagent: no chembox
  • Asparagine: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be S here, and is)
  • Histidine: Structure needs to show stereochemistry
  • Medroxyprogesterone acetate: Redirects to Medroxyprogesterone
  • Veratridine: still to be verified, the structure displays badly in ChemFileBrowser
  • Sodium lactate: old-style chembox; note that CASRN is for unspecified stereochemistry
  • Valine: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be S here, and is)
  • Threonine: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,3R)
  • Endrin: The structure diagram appears to show the endo-isomer whereas the CASRN is for the exo-isomer (or vice versa, I never was very good at this particular bit of nomenclature! in any case, it's not the same compound!) We should recheck with Dieldrin (CASRN [60-57-1]) as well. Neither compound has the stereochemistry correctly specified.
    • I've rechecked Dieldrin, adding the implicit hydrogens to the WP structure and drawing in chemsketch, I also copied the CAS structure exactly and had the program assign stereo labels. They match, which leads me to think my initial verify is OK. It maybe should be noted that while the carbon skeletons look to be the same projection, WP is from above and CAS (turns out to be) from below. If you are still unhappy could you describe your assignment in more detail? I'll try the chemsketch method with Endrin and hopefully we can compare notes Ambix (talk) 23:27, 6 February 2009 (UTC)
    • I have checked Endrin with the same process and it does not match. There is an older version of this image Endrin.png and this does match. Given the difficulties of transposing a 3D structure to more conventional form it would probably be better to have a more conventional structure as well for compounds like this but I would suggest we avoid removing 3D structures providing it is possible to validate them. I will investigate further.
      • I suggest that for our validated structure on such compounds, we should explicitly show the stereochemistry of each chiral centre, which is not the case at present on Endrin and Dieldrin (even if a knowledgeable chemist can figure out what it must be from the diagram). That doesn't necessarily mean changing the structures in the chemboxes (our images for inorganics don't always give a clear idea of the structure), but we should insist on the chembox information being correct and not-misleading, and that the full details be available in the article (maybe in a separate image). Physchim62 (talk) 23:23, 9 February 2009 (UTC)
  • Dichlorodiphenyldichloroethylene: short-form chembox
  • Trypan blue: Structure diagram shows free acid whereas CASRN is for tetrasodium salt
  • Isoleucine: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,3S)
  • Ethambutol: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,2'S)
  • Arginine: Structure needs to show stereochemistry
  • Ethylene: old-style chembox
  • Missing articles: 3,5-Dimethylpyrazole, O-Methylhydroxylamine, Tetraethylammonium iodide, Ethyl 3-bromopyruvate, 1-Methyl-3-nitro-1-nitrosoguanidine, Mercaptosuccinic acid, p-Toluenesulfonamide, 4-Chlorobenzoic acid, N,N'-Diphenyl-1,4-phenylenediamine
  • Ions: Acetate, Bicarbonate

501–600

  • Trimethylaluminium: WP shows the dimer, CAS the monomer. Is this significant? Will CAS have a dimer listed?
  • Camphor: Both the WP page and the CAS number are for unspecified stereoisomers. However, if we follow the "naturally occurring" rule, should the WP page be changed to the natural isomer and the unspecified CAS number be relegated to an 'other'?

601–700

701–800

801–900

901–1000

Inorganics

The 677 "inorganics" (neutral compounds without C–C or C–H bonds) have now all been checked. 496 entries gave a perfect match, 74 entries had some sort of problem in the article (often minor and already fixed) and 100 entries had no appropriate corresponding article on Wikipedia. A full report will be available in due course.

Elements and ions

These will require special treatment: please contact Physchim62 for more details.