Large language models are helping scientists to converse with artificial intelligence and even to generate potential drug targets.
Much of the world has been transfixed in recent months by the appearance of text generation engines such as OpenAI’s ChatGPT, artificial intelligence (AI) algorithms capable of producing text that seems as if it were written by a human. While tech companies like Microsoft and Google are focused on using such engines to improve search, and others worry they could cause a rash of plagiarized essays, fake news and bad poetry, biotechs are looking to these algorithms to bolster their businesses by contributing to drug discovery in a variety of ways.
Biotechs that already rely on AI in their search for new drugs can turn to text generation as a simple, intuitive way to interact with some of their other AI and machine learning tools. Andrew Beam, an epidemiologist at the Harvard T.H. Chan School of Public Health and head of machine learning at Generate Biomedicines, calls ChatGPT “a really interesting interface” that lets users work with other forms of AI more easily than their current interfaces allow.
For example, Insilico Medicine of New York and Hong Kong, a company set up to search for potential drug targets with its AI-driven platform, is now using ChatGPT as a new way to interact with its target discovery platform, augmenting the relationships and integration provided by knowledge graphs, which were previously the main method for integrating data. Petrina Kamya, a computational chemist who is head of AI platforms and president at Insilico Medicine in Montreal, says they can talk to their own discovery system thanks to ChatGPT: “Instead of clicking and clicking and clicking, you just ask a question and it composes this text that you read and you understand.”
Beyond using chatbots to help produce written materials, such as papers, patents or grant applications, companies can repurpose them for drug discovery, as a sort of advanced search engine geared to biological science. “We can have a more specific, for example, Bio ChatGPT or Med ChatGPT,” says Lurong Pan, a computational chemist at the University of Alabama, Birmingham and founder and CEO of Ainnocence, a biotech with a platform to aid drug discovery. “It may change the way people are searching.” For instance, Google and DeepMind earlier this year released Med-PaLM, a chatbot designed to provide answers to medical questions.
All these chatbots are based on large language models (LLMs), algorithms trained on millions of examples of text collected from the internet. LLMs are one type of generative AI, a class of algorithms capable of creating data that did not previously exist. For text, LLMs learn the statistical relationships between words. Then, given a prompt such as a question, they generate text one word at a time, predicting which word is most likely to follow the words that came before. The results seem remarkably natural, though the chatbots often make statements at odds with reality, essentially ‘hallucinating’ facts. ChatGPT is based on an LLM called Generative Pre-trained Transformer, Med-PaLM draws on Google’s Pathways Language Model, and Bard, a more generalized chatbot that Google is incorporating into its search engine, relies on Language Model for Dialogue Applications (LaMDA).
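The idea of learning statistical relationships between words and then predicting the next one can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or any production LLM works: real models use neural networks and condition on long stretches of preceding text, whereas this toy looks only at the single previous word and simply counts what followed it. The three training sentences and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def generate(follows, prompt, max_words=5):
    """Extend the prompt by repeatedly appending the most frequent follower."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word was never seen with a follower
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Invented training text (an assumption, not data from any real model).
corpus = [
    "models predict the next word",
    "models predict the next token",
    "models predict the next word",
]
model = train_bigram(corpus)
print(generate(model, "models", max_words=4))
```

Because the toy always picks the most frequent follower, it reproduces whatever pattern dominates its training text, whether or not that pattern is true. That is a small-scale view of why fluent output and factual accuracy are separate questions.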
These LLMs are already proving useful for drug hunters, says Kamya. Previously, users of Insilico’s platform were able to look at a knowledge graph, a visual representation of the genes linked to a particular disease and the substances known to interact with those genes. That was useful information, but the way researchers worked with it was limited. Now, with the addition of a chat function, Kamya says the data have become much more accessible. “Being able to have a conversation with the tool is very empowering. It makes it more interesting and more fun if you’re able to query our biomedical knowledge graphs in the way you want to,” she says.
If a scientist wants to investigate psoriasis, for example, the chat function can look at the knowledge graph for that disease. It will deliver a text description that includes the major signaling pathways and genes involved in psoriasis and the compounds known to interact with them. The user can then ask any question — for example, “How many genes are in this graph?” — and get an instant response, or look for associations between genes and specific diseases, such as sarcoma. The Insilico platform, called PandaOmics, will show that the top target gene for sarcoma is PLK1. The user could interrogate further, requesting links to specific pathways — for instance, apoptosis — and get an immediate answer.
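A knowledge graph of this kind can be pictured as a set of labeled links between genes, diseases, pathways and compounds. The sketch below is a hypothetical illustration, not PandaOmics or its data model: the class, the relation names and every entry except the PLK1 and sarcoma link mentioned above are placeholders. It shows how a question such as “How many genes are in this graph?” reduces to a simple lookup once the links are stored in a structured form.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    """A toy graph in which each edge is a (subject, relation, object) triple."""
    edges: set = field(default_factory=set)

    def add(self, subject, relation, obj):
        self.edges.add((subject, relation, obj))

    def objects(self, subject, relation):
        """Everything that `subject` points to through `relation`."""
        return sorted(o for s, r, o in self.edges if s == subject and r == relation)

    def subjects(self, relation, obj):
        """Everything that points to `obj` through `relation`."""
        return sorted(s for s, r, o in self.edges if r == relation and o == obj)

    def genes(self):
        """All genes that are linked to at least one disease."""
        return sorted({s for s, r, _ in self.edges if r == "associated_with"})

kg = KnowledgeGraph()
kg.add("PLK1", "associated_with", "sarcoma")        # link named in the article
kg.add("GENE_A", "associated_with", "psoriasis")    # placeholder entry
kg.add("GENE_B", "associated_with", "psoriasis")    # placeholder entry
kg.add("GENE_A", "in_pathway", "apoptosis")         # placeholder entry
kg.add("COMPOUND_X", "interacts_with", "GENE_A")    # placeholder entry

print("How many genes are in this graph?", len(kg.genes()))
print("Genes linked to sarcoma:", kg.subjects("associated_with", "sarcoma"))
print("Pathways for GENE_A:", kg.objects("GENE_A", "in_pathway"))
```

In a chat interface, the language model's job would be to turn the typed question into a lookup like these and to phrase the result as readable text.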
ChatGPT produces the conversational output. Insilico then validates what comes out in the chat with additional predictive AI programs trained on its own data, collected over many years. As a result, “Our output is extremely accurate,” says Alex Zhavoronkov, founder and CEO of the company. Zhavoronkov, not a native English speaker, also uses ChatGPT to help him improve his grammar when writing papers, and he stirred controversy recently by listing ChatGPT as a co-author of a journal article.
Scientists also find LLMs helpful for linking data and representing it in different ways. Exscientia, a pharmatech based in Oxford, UK, has been experimenting with LLMs to translate ordinary English statements into carefully structured, mechanistic assertions that help generate its knowledge graphs, says Garry Pairaudeau, the company’s chief technology officer.
LLMs are still evolving, with developers adding features at a furious pace. The ChatGPT released in November was based on OpenAI’s GPT version 3.5. An update, GPT-4, was released mid-March and vastly outperforms its predecessor. In late March, ChatGPT added a so-called retrieval plug-in that could prove particularly useful to drug discovery. The module allows the software to search personal or company documents. Dan Neil, chief technology officer at BenevolentAI, an AI-driven biotech in London, is excited about it as a way to customize the chat function on the basis of the company’s own data. “If you had a specialized assay that you wrote up and described in internal company documents, you can say, ‘Hey, looking over these results that we’ve gotten internally, how does this update your thinking? Can you find or imagine other new approaches in life sciences that actually leverage this information that we found?’” he says.
Despite their name, language models need not be trained on English or other human languages. The same techniques of deriving statistical associations can be applied to the ‘language’ of DNA or protein sequences. Then, instead of a new sentence, they can generate new proteins that might make good drug targets. “It’s the same idea,” Beam says, “but we’re showing it biological data instead of text from the Internet.”
Some people worry that training AI systems to design molecules with a high likelihood of hitting their targets requires large volumes of data, hand-labeled by humans. Such collections are not always forthcoming, because the companies that regularly produce this information are not always keen to share it. But the same methods that allow ChatGPT to write sentences could potentially provide the perfect solution for new molecule design, Pan says. A language model supplied with abundant unlabeled data (such as the nearly 250 million protein sequences contained in the UniProt database) could derive the right relationships between molecular building blocks on its own.
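To make the analogy with text concrete, the sketch below treats each amino-acid letter as a ‘word’ and learns, from unlabeled sequences alone, which residue tends to follow which. It then samples a new sequence from those statistics. This is a minimal first-order Markov chain standing in for a far larger neural language model, and the three training strings are arbitrary letters made up for the example, not real proteins or UniProt entries.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # markers for the beginning and end of a sequence

def train(sequences):
    """Count residue-to-residue transitions; no labels are needed."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = START + seq + END
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, rng, max_len=30):
    """Generate a new sequence by sampling each residue given the previous one."""
    residues = []
    current = START
    while len(residues) < max_len:
        options = counts[current]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        residues.append(nxt)
        current = nxt
    return "".join(residues)

# Made-up strings over the amino-acid alphabet (an assumption for illustration).
training = ["MKTAYIAK", "MKTLLVAG", "MATAYLLK"]
model = train(training)
print(sample(model, random.Random(0)))
```

The output is a string the model has not necessarily seen, assembled only from transitions that occur in the training set. Whether such a sequence would fold, function or make a useful target is a separate question that this sketch does not address.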
Bioxcel Therapeutics, a company that uses AI to identify drugs for repurposing that were shelved in phase 2 or 3 trials, or even after approval, is considering LLMs to pick out potential winners from the different databases. But LLMs will only prove valuable, says Frank Yocca, a neuroscientist and the company’s CSO, if they fit into Bioxcel’s suite of AI tools. “Right now it’s not very accurate in terms of what you get back,” he cautions. “But we’re in the beginning stages of this.”
One way to ensure results are accurate and avoid AI hallucinations is what Neil calls ‘evidence surfacing’. His company has added an algorithm that, when an LLM produces what it purports to be a fact, provides citations and references to back that up. The system uses semantic search, a way of assessing the meaning of words, to extract sentences from papers and biology texts that support an assertion. It selects a few relevant sentences from the millions of documents at its disposal and presents them to a human expert, who can then look at this small subset of data to judge whether the purported fact seems true.
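The retrieval step in such a pipeline can be sketched in a few lines. This is not BenevolentAI's system: genuine semantic search compares learned embeddings that capture meaning, whereas the stand-in below scores sentences by simple word overlap (cosine similarity of word counts). The function names, the claim and the candidate sentences are all placeholders invented for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a sentence into word counts (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def surface_evidence(claim, sentences, top_k=2):
    """Return the top_k sentences most similar to the claim, best first."""
    claim_vec = vectorize(claim)
    scored = [(cosine(claim_vec, vectorize(s)), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored[:top_k] if score > 0]

# Placeholder corpus; none of these sentences comes from a real paper.
corpus = [
    "Gene A is overexpressed in disease X samples.",
    "Compound Y inhibits gene A in cell assays.",
    "The weather in Paris was mild this spring.",
]
claim = "Gene A is a target in disease X."
for sentence in surface_evidence(claim, corpus):
    print(sentence)
```

The point of the design is that the algorithm only narrows millions of sentences to a handful. The judgment about whether they actually support the claim stays with the human expert.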
Yocca says people can be seduced by the latest technology and lose sight of whether it really helps them achieve their goals. “You can get sort of consumed by just getting the machine to do what you wanted to do and not necessarily give you a functional answer,” he says. “We try to avoid that.”
Not everyone is hopping aboard the ChatGPT bandwagon. “Basically we already have all the tools to generate what we want and we are already exploring a lot of information, and we are not trying to expand more for now,” says Joao Magalhaes, head of immunology research at Enterome in Paris. For one thing, he worries that providing patient information to train the LLM might compromise privacy.
He’s not averse to adopting new AI techniques, though. For instance, the company uses AlphaFold, an AI system developed by DeepMind that looks at amino acid sequences and uses those to predict the three-dimensional structures of proteins, including many that had previously been unknown. “It was a huge improvement for us,” Magalhaes says. He will be keeping an eye on ChatGPT, and if it looks like it might be useful, the company will consider adopting it.
Beam points out that other types of generative AI, such as diffusion models that can create images out of random noise, have already made their way into biology. Because those models can create new images of protein structures, they “are arguably a more direct line to drug discovery and drug development,” Beam says.
If nothing else, he says, the rise of ChatGPT has created widespread awareness of the potential of generative AI and encouraged biotechs to take a closer look. “What ChatGPT has made everyone realize is the power of generative models,” Beam says.