Nicole Powell

The Application of Machine Learning in Analyzing Organic Compounds from NMR Spectral Data

April 10, 2021

Name: Nicole Powell
Major: Computer Science
Minors: Math and Chemistry
Advisors: Dr. Sofia Visa and Dr. Heather Guarnera

Nuclear magnetic resonance (NMR) is used in organic chemistry to identify unknown organic compounds. The data obtained from an NMR spectrometer are typically shown in the form of a spectrum, which is then analyzed by an analytical chemist. Analyzing a spectrum, especially one of a large and complex molecule, is a long and tedious process. In this project, Python is used to implement hierarchical clustering on NMR data obtained from an NMR spectrometer at the College of Wooster to explore its application in NMR analysis. MATLAB is used to build a decision tree from the same data, whose accuracy is compared to that of the hierarchical clustering. The decision tree is also examined to gain information about how to better automate the analysis process. Once feature extraction has been performed, these clustering and classification methods are used to identify major functional groups within a compound from its spectral data; the compounds are then grouped via hierarchical clustering or classified with a decision tree. These processes provide insight into how to identify unknown organic molecules in a faster and more accurate manner, a much-needed improvement in organic chemistry experimental research. It was found that decision trees are a much more accurate machine learning method for classifying the organic compounds based on their present functional groups.
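
For readers curious what these two steps look like in code, here is a minimal sketch in Python (using scipy for the clustering and, purely for illustration, scikit-learn in place of the MATLAB decision tree; the feature values and labels below are hypothetical, not the thesis data):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical feature matrix: one row per compound, one column per
    # spectral region (summed peak intensities after discretization)
    X = np.array([
        [0.0, 1.2, 0.3, 0.0, 2.1, 0.0, 0.4],
        [0.1, 0.9, 0.0, 1.5, 0.0, 0.2, 0.0],
        [0.0, 1.1, 0.4, 0.1, 1.9, 0.0, 0.5],
    ])
    y = ["ketone", "alcohol", "ketone"]  # made-up functional-group labels

    # Unsupervised: hierarchical (agglomerative) clustering on Euclidean distance
    Z = linkage(X, method="single", metric="euclidean")
    print(fcluster(Z, t=2, criterion="maxclust"))

    # Supervised: a decision tree trained on the same feature vectors
    tree = DecisionTreeClassifier().fit(X, y)
    print(tree.predict(X))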

Nicole will be online to field comments on April 16:
10am-noon EDT (Asia: late evening, PST: 6-8am, Africa/Europe: early evening)

56 thoughts on “The Application of Machine Learning in Analyzing Organic Compounds from NMR Spectral Data”

  1. Nicole- This looks like a cool project. Thank you for sharing it here. Congratulations on a job well done!

  2. Computationally predicting the structure of a molecule from NMR data would be a huge benefit. Software like ChemDraw does the reverse, i.e. predicting the spectrum from a given chemical structure. How similar is spectral prediction to your project? Thanks for a great presentation.

    1. That’s a great question, and spectral prediction was actually one of the inspirations for this project. If we could get from molecule to spectrum, why not the other way around?
      I think the main difference is that, given a molecule, it is known in what region each peak will appear, as well as the splitting pattern (with HNMR). But when given a spectrum, there are many dependencies that have to be considered for each peak. For example, two singlets could be mistaken for a doublet, or vice versa. Lab practices also have to be taken into account, like the solvent used and the possibility of impurities.

  3. Congratulations, Nicole. It’s great to see how you combined your major and minors in this project, without needing a chemistry co-advisor.

    1. Thank you! It was fulfilling to combine a lot of what I’ve learned these past four years in one project. And while I didn’t have a chemistry co-advisor, I have Prof. Arnholt to thank for providing me the NMR samples and answering some of my questions about the instrument we have at the college.

  4. Congratulations, Nicole, on your interdisciplinary I.S. work. I enjoyed our weekly meetings and learning some Chemistry along the way; now I know that the benzene ring is important in classifying organic compounds.

    1. Thank you Dr. Visa! I loved working with you, and will miss our weekly meetings. I’m glad you learned a bit of chemistry through my project!

  5. Congratulations, Nicole! Excellent work on applying machine learning to chemistry. What was the most surprising result you had/found?

    1. Thank you! While I was learning about the machine learning methods I used in my project, the most surprising thing I learned was how simple some of the commands are, knowing how complex the math behind them is. I was also surprised at how accurate and concise I was able to make the decision tree. Results-wise, I was surprised that the presence of a peak in the aldehyde section of the HNMR spectrum could classify one of the classes so well (attribute x2 in the decision tree).
      Thank you for taking the time to read my entire thesis and ask such good questions!

  6. Fascinating project Nicole! Since it sounds like some of this analysis is usually done via manual adjustment, your work here has important implications. While the rate of picking the exact compounds hovered around 50%, even having the pool of potential compounds narrowed down seems like it would be very useful, given how many compounds exist and the rate of new compound discovery. I think you are on to a great idea here!

    Great job!

    1. Thank you so much! I would be interested to see how that percentage changes (or doesn’t change) if the number of available compounds were increased. Thanks for taking the time to watch my presentation!

  7. What a fascinating project! Am I understanding correctly that the comparison of spectra was performed strictly on the number and position (chemical shift) of peaks? Or is there a way to take integration and splitting patterns into account as well?

    [Side note: As a teacher of organic chemistry, I’m glad to hear that I won’t be replaced by a robot…yet]

    1. Thank you! The comparison was based on the position and intensity of the peaks. I did not calculate the integration exactly, but I used a rough estimate, giving peaks with higher integration a larger weight. I could not think of a good way to take splitting patterns into account, but I would be very interested to see how doing so would affect the results. Hopefully much more accurate predictions could be made!

      Just think of how much more content you could cover if it was not necessary to spend so much time teaching NMR analysis!

      In all seriousness, thank you for taking the time to watch my presentation, and for teaching me the basics of organic chemistry which gave me the idea for this project.

  8. Nicole, this is really fascinating, and really well presented. I’m very proud of you. Congratulations!

  9. Congratulations on an awesome project! What was your favorite part of the process?

  10. Interesting project Nicole! It is great to see you combine your academic interests. Has this project influenced your long-term goals?

    1. Thank you! I didn’t have specific long-term goals beforehand, but this definitely opened my eyes to more interdisciplinary and chemistry-related computer science job opportunities.

  11. Well done Nic! Congrats. I’m proud of you.
    Is this topic something you are hoping to continue to study?

    1. Thank you! I’m not sure about this topic specifically, but I’d like to stick with a combination of chemistry and computer science, or a similar combination.

  12. Congrats Nic, you’ve done some amazing work here! You did a great job explaining this project to someone who understands neither computer science nor chemistry.

  13. Nicole this is so cool! Congrats on all your hard work. You mention that there is not necessarily an objectively correct way to group all molecules. How difficult would it be to pivot your program’s process to adapt to new group definitions?

    1. Thank you! Good question. The program itself would actually not need much of a change, but the analysis afterward would have to be redone, which takes longer than I hoped. But it would be interesting to see which group definitions are most similar to the resulting groups from hierarchical clustering.

  14. Hi Nicole! Great job! This presentation was very easy to follow even though the subject matter is quite difficult (NMR was not my strong suit in organic chemistry). You used two different forms of machine learning that could be broken down into supervised and unsupervised learning. Are there certain benefits to using both, or are there scenarios where one would be better than the other?

    1. Thanks, Julia! Good question. There are definitely certain situations where one is more suitable than the other (and situations where you would use reinforcement learning, a third form of machine learning). In this project I had a hard time deciding whether supervised or unsupervised learning would be better, which is part of why I tried both.
      If I were to try to fully automate the process, going from spectrum all the way to compound name like I mentioned in the future work section, I would use supervised learning, where the compound names are the class labels. But this would require a gigantic dataset with multiple instances of each compound, which I’m not sure exists.

  15. Awesome job Nic, really cool work and a great presentation. It made me want to learn more 🙂
    Can you talk a little bit more about the notion of distance you used for your hierarchical clustering algorithm? How did you calculate whether one compound is “close to” another?

    1. Thanks Jas!
      Good question. The measurement used for the distance between compounds specifically was Euclidean distance.
      For the distance between clusters in hierarchical clustering, different linkage types were used, each of which defines the distance between clusters differently. For example, single linkage defines the distance between two clusters as the distance between their two closest data points, where one point is in one cluster and the other point is in the other (using Euclidean distance between the points).
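
      For anyone who wants to play with this, a tiny sketch of how the different linkage definitions compare in scipy (the feature vectors are made up):

        import numpy as np
        from scipy.cluster.hierarchy import linkage

        # Three made-up compound feature vectors
        X = np.array([[0.0, 1.0], [0.2, 1.1], [3.0, 0.5]])

        # Single linkage merges clusters by their closest pair of points;
        # complete uses the farthest pair, average the mean pairwise distance
        for method in ("single", "complete", "average"):
            Z = linkage(X, method=method, metric="euclidean")
            print(method, Z[:, 2])  # merge distances under each definition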

  16. Great job Nicole! The presentation was really engaging and your results were super interesting. Here is a question: when talking about the decision tree, you mentioned that it likely overfit the data. What does that mean exactly, and why is this a bad thing?

    1. Thanks Erica! Good question, I did not explain that in the presentation.
      When a decision tree overfits a set of data, it means that it classifies that dataset very well, at the expense of classifying other data well. If you were to imagine that one of the spectra classified had an extra peak because of an impurity, the decision tree might make a rule that says “if this peak exists, it is [some compound, or in some group]”, since the tree does not know this peak is from an impurity. That rule might be the only way this specific compound gets classified correctly, but very few other compounds would be classified the same way, even ones that are otherwise very similar.
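
      A quick synthetic demonstration of the idea (not my thesis data; label noise plays the role of the impurity here): an unconstrained tree memorizes the noise in the training set, while a depth-limited one generalizes better.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 7))    # stand-in "discretized spectra"
        y = (X[:, 2] > 0).astype(int)    # the true rule uses one feature
        flip = rng.random(200) < 0.15
        y[flip] = 1 - y[flip]            # label noise, like an impurity peak

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        pruned = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

        # Expect the deep tree to score ~1.0 on training but worse on test data
        print("deep:  ", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
        print("pruned:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))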

  17. Very impressive! It seems that this work as it continues could really lead to important practical applications. Congratulations on a huge achievement!

  18. Hi Nicole! Awesome job on your interdisciplinary project! It definitely would have helped to have a program like this when taking organic chemistry. One question: what exactly is discretization? You mentioned it a few times and I was a little bit confused. How does this impact your project?

    1. Thanks, Lydia! Yes, the many many NMR analysis problems we worked on in O-chem class are what gave me the idea for this project.
      That’s a good question. Discretization is the process of converting a continuous dataset into a discrete one. So in this project, I took the continuous NMR spectra and chose seven sections (for HNMR; 14 for CNMR) to be the discrete x-values. For each of the sections, I calculated the sum of the y-values in the spectrum to get a single y-value for the discretized form. That way, each “spectrum” became a set of 7 (or 14 for CNMR) x,y data points.
      Discretization makes the data much easier to work with, and reduces the computational power and time required to perform the machine learning methods.
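
      To make that concrete, here is a minimal sketch of the idea (the toy spectrum and region boundaries below are made up, not the ones used in the project):

        import numpy as np

        # Toy HNMR spectrum: chemical-shift axis (ppm) and intensities
        ppm = np.linspace(0, 12, 1000)
        intensity = (np.exp(-((ppm - 2.1) ** 2) / 0.01)
                     + np.exp(-((ppm - 7.3) ** 2) / 0.01))

        # Split the axis into 7 regions and sum the intensity inside each,
        # reducing a ~1000-point spectrum to a 7-value feature vector
        edges = np.linspace(0, 12, 8)  # hypothetical region boundaries
        features = np.array([
            intensity[(ppm >= lo) & (ppm < hi)].sum()
            for lo, hi in zip(edges[:-1], edges[1:])
        ])
        print(features)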

  19. This was such an interesting presentation to watch! I found it fascinating how you took the tedium of performing NMR analyses and merged it with a computational process to make it easier for chemists. I was curious about the ML methods you used. Did you consider other machine learning methods for this study? What did these two methods present to the study that made you choose them above others?

    1. Thank you!
      Part of this project was dedicated to my learning about machine learning methods, since I had very little experience beforehand. A few methods were suggested to me by my advisor, and I chose these two because I could understand them, could apply them to the data I had access to in the time I had to work on my I.S., and thought they could provide some valuable information. Looking back, I believe the decision tree provided more useful results, and if given more time (and maybe more data as well) I would focus more on that method than on hierarchical clustering. But I would be interested to explore other machine learning methods as well.

  20. Congratulations and great job! I know nothing about chemistry, or how computers work. However, I am intrigued by true crime. While maybe not completely comparable, I can see how this can advance forensics and further enhance the solving of mysteries.

    1. Thank you! Yes, automating compound identification could have a great impact on crime solving!

  21. Hello Nicole, Excellent work and presentation. As an *old* Wooster science major (physics ’73), I am so impressed with the resources now available for the sciences at the college, and you have used them to apply the scientific method in a very interesting way that combines your two disciplines. Well done – best of luck!

  22. Great job, Nic! Though I am but a lowly English major, I know the impacts of your study are far-reaching and important.
    I hope you didn’t hate the process quite as much as you seemed to. 😉
    I knew you could do it <3

    1. Thank you!! The process wasn’t too bad, just a lot of work. I’m glad to be done and see the results of all my work!

  23. Congratulations Nic!
    It’s interesting to see how you used all your areas of study for this research. As one of the comments stated above, it would be fascinating to see your findings in relation to forensics. Best of luck in all of your future endeavors!

  24. Great job, Nicole! Brahm and I watched it and were very interested in your processing of the data. The only suggestion for future presentations would be to use a larger typeface on your slides for older people. Not me, of course, but others! Congrats! Proud of you!

    1. Thanks Uncle Tom, that means a lot! I’m glad you and Brahm found it interesting. I’m sorry some of the slides were hard to read! I’ll keep that in mind for next time.

  25. So very proud of you and I know that you will continue to excel in life. Congratulations on this accomplishment and I can’t wait to see what the future holds for you.

  26. Congrats on an awesome project! This program would’ve been really helpful in organic 😉 You did it, congrats on finIShing!!
