I thought I’d take you on a little journey though the process of making a chemistry dictionary from scratch. It’s a pretty good story, I think.
It started with searching online for chemistry glossaries. I figured that would be a good repository of chemistry-related words. And, as a bonus, their definitions could give me even more words to add to my list. I copied the six or seven glossaries I liked into one long Word document. It was a messy document. Random lists of words. Some with definitions, some without. Mainly words that are recognized by the standard dictionary, but many words that were prime candidates for my list.
(rest of the story below the jump. It gets a bit technical in the middle… but keep with it for the cool pictures (click the images for larger versions) and a Christmas idea for your favorite ACS copy editor!)
It was relatively easy to take this long document and convert it into a list of words that Word considered ‘misspelled.’ I scrolled through the document looking for red underlined words. I selected and deleted all the words in between the ‘misspelled’ words, so all I had at the end was a list of ‘misspelled’ words.
From here, it gets a bit tricky. Many of the glossaries contained duplicate words. Enter Excel. I piped the list over to Excel and executed two of Excel’s built-in functions – one that I already knew about and one that was new to me. I sorted the list alphabetically, which is something I’ve done numerous times in other circumstances. Then I clicked on ‘remove duplicates.’ I’ve never seen this feature before, so I don’t know if it’s new to the Office 2007 suite or not… but it’s awesome. One clicked removed all the duplicate entries in my list.
After that, I just had to comb the list for obvious actual misspellings and version 1.0 was done. Lots of chemistry words in lots of parts of speech. It took about a month to make a list of about 8,000 words.
But there was one gaping hole I knew I wasn’t filling. Iterations. Most glossaries aren’t going to include iterations of compounds. I’m an organic chemist, so the first thing I thought of was that I needed a list of iterative substituent changes to common chemical backbones. I needed a list that included methylbenzene, ethylbenzene, dimethylbenzene, fluorobenzene, etc. I needed a list that included a number of substituents on a number of backbones. I envisioned an array of chemicals created from two lists: imagine substituents across the top of a grid and backbones along the side of a grid. I needed to merge the grid such that each substituent was appended to each backbone… but I didn’t want to do it by hand.
Now, I consider myself fairly adept at navigating Excel and getting it to do what I want it to do. But I’m nothing compared to my dad. So I asked him if he knew how to fill my array quickly and easily. Of course he did. I needed the concatenate function. The concatenate function allows you to pull text from any number of cells and join them together. Exactly what I needed (thanks, Dad). I wrote the formula for the top left cell, and it concatenated the first iteration perfectly. All I had to do was copy and paste the formula to the rest of the grid. 10,000 more chemicals were systematically created in less than a second. Awesome.
Now I just needed to convert the table to a list… I wrote a few lines of code in a Macro to help me out with this one. I knew how many columns I had, so I opened a for…next loop and had it copy the contents of the top left cell to the bottom of the growing vertical list, then move to the cell to the right and repeat. I knew how many rows I had, so I nested the first for…next loop inside a second for…next loop. When the inner loop had finished copying the contents of the first row to the growing list, the outer loop just moved the focus to the next row and the inner loop started all over again. In a single afternoon, I grew the dictionary from 8,000 words to 18,000 words.
Shortly after the release of the first version of the dictionary, David Bradley at Sciencebase introduced me to Antony Williams of ChemSpider. Completely coincidentally, we happen to live in the same area and were able to meet one afternoon. He agreed to help the effort and release a list of the 1.3 million identifiers in the ChemSpider database (Thanks, Tony).
Integrating the list meant sifting through 1.3 million entries that all read something like “(4aS,8aS)-2-[3-((R)-5-benzyl-3-cyclohex-2-enyl-2-oxo-pyrrolidin-1-yl)-2-hydroxypropyl]-decahydro-isoquinoline-3-carboxylic acid tert-butylamide.” First order of business was to remove all extraneous characters and break the individual entries into several fragment strings. The above would need the numbers, dashes, brackets, and parentheses removed and would be transformed into thirteen unique strings: benzyl, cyclohex, enyl, oxo, etc. I needed to do this 1.3 million times… then remove all the duplicates.
This is probably my favorite part of the story. I wrote a macro in Word to do most of the work for me. It consisted of a series of Find…Replace operations. The first would find all hyphens and replace them with a Paragraph Mark. That is, it would stop the entry at the point of the hyphen and continue the entry on the next line with the hyphen removed. I used several Find…Replace operations to find all the hyphens, parentheses, numbers, etc. and replace them with Paragraph Marks.
That did two things, one good, one bad. It removed all the extraneous characters, but now I had all these words with lots of empty lines between them. Imagine: for an entry like 3-methyl-2-cyclohexene-1-ol, the macro would find the hyphens and continue the entry on the next line. It would find the numbers and continue the entry on the next line, too. So between methyl and cyclohexene there were three empty lines. I ended the macro with a final series of Find…Replace operations. Every time it found multiple Paragraph Mark in a row (that is, whenever there was more than one line between entries) it would replace it with a single Paragraph Mark. This truncated the list and removed all the extra lines between words.
When Word was finished extracting just the useful identifiers from a small piece, I piped the list over to Excel. I used the same two Sort and Remove Duplicate functions and my list would slowly turn from 1.3 million long and complex entries into a smaller list of single entries.
Thanks to this amazing automation, I sifted the 1.3 million identifiers in five days. I couldn’t do all 1.3 million at once, I had to do it in pieces. But 1.3 million entries like “(4aS,8aS)-2-[3-((R)-5-benzyl-3-cyclohex-2-enyl-2-oxo-pyrrolidin-1-yl)-2-hydroxypropyl]-decahydro-isoquinoline-3-carboxylic acid tert-butylamide” were turned into a list of more than 100,000 unique fragments, after the duplicates were removed.
One other cool tech note that deserves mention. I had this list of >100,000 entries. Littered throughout this list were words that were already in the first dictionary or the built-in dictionary. I wanted to sort those out to save some work in the next section. But it wasn’t going to be as trivial as it was when I removed correctly-spelled words from the glossary list. I wanted a macro to go cell by cell and determine if the word is correctly spelled or not. If it was, I wanted it to return a “true” value in the cell next to it, and a “false” if it wasn’t in a dictionary already. I didn’t know how to do that, and unfortunately neither did my dad. But we brainstormed for a while and found a macro someone else had written to do exactly what I needed done. I applied it to my list then sorted the second column alphabetically. This clumped all the words already in a dictionary together, and I was able to delete them as one unit. **End of TechTalk. (I promise)**
Next, I needed to individually read through the remaining >100,000 entries to remove obvious mistakes, foreign words, nonsense entries, etc. That was the hard part. Wow, it was tedious. No macro for this one… On a good day, I could get through about 5,000 entries. Most days were not good days. It took the better part of 10 months to finally look at each and every one of the more than 100,000 entries and decide if it should stay or go. In the end, around 20,000 were manually deleted from that list.
To finish up, I needed to combine the old and new lists. I also added a small, non-exhaustive list of American and European chemists’ names to the mix for fun. To test my dictionary, I copied the html versions of the recent JACS, JOC, and Org Lett ASAPs into Word and turned the dictionary on. Any words that were still showing up as misspelled, I reviewed and added to the dictionary. (Download the dictionary here)
One of the biggest things I noticed during this vetting was the use of plurals in scientific writing. Cyclopentenone is an actual compound and is in the dictionary. If your research requires you to make a family of cyclopentenones, then the plural was probably not in the dictionary (it is now, though). Although, that raises an interesting question: can you pluralize compounds like that? Or is it more correct to say that a library of cyclopentenone derivatives was made? Same thing with families of natural products. Are they members of the brevetoxins? Or are they more correctly members of the brevetoxin family of natural products? I’m not sure I know the answer to that one. One thing I do know is that I did not include plurals of elements. Your 13C NMR doesn’t tell you that you have 2 carbonyl carbons. It tells you that you have 2 carbonyl carbon atoms.
For fun, I thought I’d leave you with some simple analysis I did on the final list of words (again, with the help of awesome Excel formula operations). I’ve also included a before and after picture to show the utility of the dictionary.
• After being combined with the previous list (as well as a list of common reagents and chemists), the list is 104,502 entries long.
• The longest entry is 56 letters long (acetyltyrosylseryltyrosylphenylalanylprolylserylvalinate)
• The longest non-protein entry is 50 letters long (methylsulfinylphenyloxooxazolidinylmethylacetamide)
• 14,109 entries start with the letter ‘d’. Second place is ‘p’ with 9,971. Last place is ‘j’ with 73.
• My favorite entry is tetraazaadamantane because it contains the string ‘aazaa’, which is not a mistake or misspelling.
• The final list is 430 pages in Microsoft Word, one entry per line.
For the Before and After pictures, I copied the html versions of the ASAPs from an ACS journal into Word and took a screencap with and without the dictionary active. Notice how the After picture makes it clear that the technical were spelled correctly, but processes was NOT! What’s even more startling is that the typo makes it all the way to the published PDF version of the ASAP. Anyone at ACS reading this? This would make a great last-minute Christmas present for your favorite ACS copy editor. I noticed these kinds of actual typos happening an uncomfortable number of times.