Chemistry Blog

«

»

Dec 19

On making a chemistry dictionary (Or: Why Excel and Macros are awesome)

by azmanam | Categories: Uncategorized | (38446 Views)

I thought I’d take you on a little journey though the process of making a chemistry dictionary from scratch.  It’s a pretty good story, I think.

It started with searching online for chemistry glossaries.  I figured that would be a good repository of chemistry-related words.  And, as a bonus, their definitions could give me even more words to add to my list.  I copied the six or seven glossaries I liked into one long Word document.  It was a messy document.  Random lists of words.  Some with definitions, some without.  Mainly words that are recognized by the standard dictionary, but many words that were prime candidates for my list.

(rest of the story below the jump. It gets a bit technical in the middle… but keep with it for the cool pictures (click the images for larger versions) and a Christmas idea for your favorite ACS copy editor!)

It was relatively easy to take this long document and convert it into a list of words that Word considered ‘misspelled.’  I scrolled through the document looking for red underlined words.  I selected and deleted all the words in between the ‘misspelled’ words, so all I had at the end was a list of ‘misspelled’ words.

From here, it gets a bit tricky.  Many of the glossaries contained duplicate words.  Enter Excel.  I piped the list over to Excel and executed two of Excel’s built-in functions – one that I already knew about and one that was new to me.  I sorted the list alphabetically, which is something I’ve done numerous times in other circumstances.  Then I clicked on ‘remove duplicates.’  I’ve never seen this feature before, so I don’t know if it’s new to the Office 2007 suite or not… but it’s awesome.  One clicked removed all the duplicate entries in my list.

After that, I just had to comb the list for obvious actual misspellings and version 1.0 was done.  Lots of chemistry words in lots of parts of speech.  It took about a month to make a list of about 8,000 words.

But there was one gaping hole I knew I wasn’t filling.  Iterations.  Most glossaries aren’t going to include iterations of compounds.  I’m an organic chemist, so the first thing I thought of was that I needed a list of iterative substituent changes to common chemical backbones.  I needed a list that included methylbenzene, ethylbenzene, dimethylbenzene, fluorobenzene, etc.  I needed a list that included a number of substituents on a number of backbones.  I envisioned an array of chemicals created from two lists:  imagine substituents across the top of a grid and backbones along the side of a grid.  I needed to merge the grid such that each substituent was appended to each backbone… but I didn’t want to do it by hand.

Now, I consider myself fairly adept at navigating Excel and getting it to do what I want it to do.  But I’m nothing compared to my dad.  So I asked him if he knew how to fill my array quickly and easily.  Of course he did.  I needed the concatenate function.  The concatenate function allows you to pull text from any number of cells and join them together.  Exactly what I needed (thanks, Dad).  I wrote the formula for the top left cell, and it concatenated the first iteration perfectly.  All I had to do was copy and paste the formula to the rest of the grid.  10,000 more chemicals were systematically created in less than a second.  Awesome.

Now I just needed to convert the table to a list…  I wrote a few lines of code in a Macro to help me out with this one.  I knew how many columns I had, so I opened a for…next loop and had it copy the contents of the top left cell to the bottom of the growing vertical list, then move to the cell to the right and repeat.  I knew how many rows I had, so I nested the first for…next loop inside a second for…next loop.  When the inner loop had finished copying the contents of the first row to the growing list, the outer loop just moved the focus to the next row and the inner loop started all over again.  In a single afternoon, I grew the dictionary from 8,000 words to 18,000 words.

At this point the dictionary was released on Sciencebase.  A commenter suggested I protect it with a Creative Commons license, and I did.  It’s protected under the Attribution license.

Shortly after the release of the first version of the dictionary, David Bradley at Sciencebase introduced me to Antony Williams of ChemSpider.  Completely coincidentally, we happen to live in the same area and were able to meet one afternoon.  He agreed to help the effort and release a list of the 1.3 million identifiers in the ChemSpider database (Thanks, Tony).

Integrating the list meant sifting through 1.3 million entries that all read something like “(4aS,8aS)-2-[3-((R)-5-benzyl-3-cyclohex-2-enyl-2-oxo-pyrrolidin-1-yl)-2-hydroxypropyl]-decahydro-isoquinoline-3-carboxylic acid tert-butylamide.”  First order of business was to remove all extraneous characters and break the individual entries into several fragment strings.  The above would need the numbers, dashes, brackets, and parentheses removed and would be transformed into thirteen unique strings: benzyl, cyclohex, enyl, oxo, etc.  I needed to do this 1.3 million times… then remove all the duplicates.

This is probably my favorite part of the story.  I wrote a macro in Word to do most of the work for me.  It consisted of a series of Find…Replace operations.  The first would find all hyphens and replace them with a Paragraph Mark.  That is, it would stop the entry at the point of the hyphen and continue the entry on the next line with the hyphen removed.  I used several Find…Replace operations to find all the hyphens, parentheses, numbers, etc. and replace them with Paragraph Marks.

That did two things, one good, one bad.  It removed all the extraneous characters, but now I had all these words with lots of empty lines between them.  Imagine: for an entry like 3-methyl-2-cyclohexene-1-ol, the macro would find the hyphens and continue the entry on the next line.  It would find the numbers and continue the entry on the next line, too.  So between methyl and cyclohexene there were three empty lines.  I ended the macro with a final series of Find…Replace operations.  Every time it found multiple Paragraph Mark in a row (that is, whenever there was more than one line between entries) it would replace it with a single Paragraph Mark.  This truncated the list and removed all the extra lines between words.

When Word was finished extracting just the useful identifiers from a small piece, I piped the list over to Excel.  I used the same two Sort and Remove Duplicate functions and my list would slowly turn from 1.3 million long and complex entries into a smaller list of single entries.

Thanks to this amazing automation, I sifted the 1.3 million identifiers in five days.  I couldn’t do all 1.3 million at once, I had to do it in pieces.  But 1.3 million entries like “(4aS,8aS)-2-[3-((R)-5-benzyl-3-cyclohex-2-enyl-2-oxo-pyrrolidin-1-yl)-2-hydroxypropyl]-decahydro-isoquinoline-3-carboxylic acid tert-butylamide” were turned into a list of more than 100,000 unique fragments, after the duplicates were removed.

One other cool tech note that deserves mention.  I had this list of >100,000 entries.  Littered throughout this list were words that were already in the first dictionary or the built-in dictionary.  I wanted to sort those out to save some work in the next section.  But it wasn’t going to be as trivial as it was when I removed correctly-spelled words from the glossary list.  I wanted a macro to go cell by cell and determine if the word is correctly spelled or not.  If it was, I wanted it to return a “true” value in the cell next to it, and a “false” if it wasn’t in a dictionary already.  I didn’t know how to do that, and unfortunately neither did my dad.  But we brainstormed for a while and found a macro someone else had written to do exactly what I needed done.  I applied it to my list then sorted the second column alphabetically.  This clumped all the words already in a dictionary together, and I was able to delete them as one unit.  **End of TechTalk.  (I promise)**

Next, I needed to individually read through the remaining >100,000 entries to remove obvious mistakes, foreign words, nonsense entries, etc.  That was the hard part.  Wow, it was tedious.  No macro for this one…  On a good day, I could get through about 5,000 entries.  Most days were not good days.  It took the better part of 10 months to finally look at each and every one of the more than 100,000 entries and decide if it should stay or go.  In the end, around 20,000 were manually deleted from that list.

To finish up, I needed to combine the old and new lists.  I also added a small, non-exhaustive list of American and European chemists’ names to the mix for fun.  To test my dictionary, I copied the html versions of the recent JACS, JOC, and Org Lett ASAPs into Word and turned the dictionary on.  Any words that were still showing up as misspelled, I reviewed and added to the dictionary. (Download the dictionary here)

One of the biggest things I noticed during this vetting was the use of plurals in scientific writing.  Cyclopentenone is an actual compound and is in the dictionary.  If your research requires you to make a family of cyclopentenones, then the plural was probably not in the dictionary (it is now, though).  Although, that raises an interesting question: can you pluralize compounds like that?  Or is it more correct to say that a library of cyclopentenone derivatives was made?  Same thing with families of natural products.  Are they members of the brevetoxins?  Or are they more correctly members of the brevetoxin family of natural products?  I’m not sure I know the answer to that one.  One thing I do know is that I did not include plurals of elements.  Your 13C NMR doesn’t tell you that you have 2 carbonyl carbons.  It tells you that you have 2 carbonyl carbon atoms.

For fun, I thought I’d leave you with some simple analysis I did on the final list of words (again, with the help of awesome Excel formula operations).  I’ve also included a before and after picture to show the utility of the dictionary.

•    After being combined with the previous list (as well as a list of common reagents and chemists), the list is 104,502 entries long.
•    The longest entry is 56 letters long (acetyltyrosylseryltyrosylphenylalanylprolylserylvalinate)
•    The longest non-protein entry is 50 letters long (methylsulfinylphenyloxooxazolidinylmethylacetamide)
•    14,109 entries start with the letter ‘d’.  Second place is ‘p’ with 9,971.  Last place is ‘j’ with 73.
•    My favorite entry is tetraazaadamantane because it contains the string ‘aazaa’, which is not a mistake or misspelling.
•    The final list is 430 pages in Microsoft Word, one entry per line.

For the Before and After pictures, I copied the html versions of the ASAPs from an ACS journal into Word and took a screencap with and without the dictionary active. Notice how the After picture makes it clear that the technical were spelled correctly, but processes was NOT! What’s even more startling is that the typo makes it all the way to the published PDF version of the ASAP. Anyone at ACS reading this? This would make a great last-minute Christmas present for your favorite ACS copy editor. I noticed these kinds of actual typos happening an uncomfortable number of times.

BEFORE:

AFTER

PDF

8 comments

2 pings

Skip to comment form

  1. Macman104

    So many kinds of cool, thanks!

  2. ChemE

    Don’t stop the tech talk, that’s a pretty awesome use of macros. I found it pretty amusing how one of your shortest paragraphs about manually spell checking words corresponded to the longest time period.

    1. mitch

      Only if he compiled it from LOLCODE.

      1. ChemE

        Perhaps I can haz say, macros in Excel.

  3. baoilleach

    This is going to be really useful. Of course, we add “Add to Dictionary” various words over time, but when you switch to a new computer it’s back to square one. I’m impressed to see how far you got with Excel and Macros, but you might find it easier if you learn a scripting language like Python.

    1. azmanam

      Of course, we add “Add to Dictionary” various words over time, but when you switch to a new computer it’s back to square one.

      It doesn’t have to be. See my comment in the other post (here). If you remember before you switch computers (or if it’s backed up before a crash…) you should be able to move your added words to a new machine.

      And there’s no doubt I could have done some of these operations faster. There are a couple of languages I wish I knew better. All in all, though, I’m pretty happy with the way things turned out.

    2. mitch

      I’ve written Excel macros to run several instruments in my lab. I have them collect my data, and even ftp it to a server for safe keeping. They analyze my data and draw pretty pictures for me. I have it communicate to devices by printer ports and even through the USB. And the language and syntax is so easy! I’ve never met a language that can do so much for me. I actually feel sorry for people that use C or python to perform similar tasks.

      1. azmanam

        I also like the ‘record macro’ button in Excel.

        I just do what I want it to do with my mouse… and it codes it for me! no errors or infinite loops or any other silly mistakes! I don’t even have to know the VBA language for what I need to code.

  1. ChemSpider Blog » Blog Archive » A Chemical Dictionary from Adam Azman with Help from ChemSpider

    [...] has explained in detail how he did the work. I encourage you to read his post to fully understand the nature of the work and how much [...]

  2. Chemistry Blog » Blog Archive » Can chemical compounds be pluralized?

    [...] wanted to extract a paragraph from an earlier post and open it up for further discussion.  When I was creating the chemistry dictionary file, one of [...]

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Powered by sweet Captcha