Stability AI backs effort to bring machine learning into biomed

Stability AI, the venture-backed startup behind the text-to-image AI system Stable Diffusion, is funding a wide-ranging effort to apply AI to the frontiers of biotechnology. Named OpenBioML, the venture’s first projects will focus on machine learning-based approaches to DNA sequencing, protein folding and computational biochemistry.

The company’s founders describe OpenBioML as an “open research lab” that aims to explore the intersection of AI and biology in an environment where students, professionals and researchers can participate and collaborate, according to Stability AI CEO Emad Mostaque.

“OpenBioML is one of the independent research communities that Stability supports,” Mostaque told TechCrunch in an email interview. “Stability seeks to advance and democratize AI, and through OpenBioML we see an opportunity to advance cutting-edge science, healthcare and medicine.”

Given the controversy surrounding Stable Diffusion — Stability AI’s system that generates art from textual descriptions, similar to OpenAI’s DALL-E 2 — one might understandably be nervous about the startup’s first venture into healthcare. Stability AI has taken a laissez-faire approach to governance, allowing developers to use the system as they wish, including for celebrity deepfakes and pornography.

Putting aside Stability AI’s ethically dubious uses so far, machine learning in medicine is a minefield. While the technology has been successfully applied to diagnose conditions such as skin and eye diseases, among others, research shows that algorithms can develop biases that lead to worse care for some patients. An April 2021 study, for example, found that statistical models used to predict suicide risk in mental health patients performed well for white and Asian patients but poorly for Black patients.

Sensibly, OpenBioML is starting on safer territory. Its first projects are:

  • BioLM, which seeks to apply natural language processing (NLP) techniques to the fields of computational biology and chemistry
  • DNA-Diffusion, which aims to develop AI that can generate DNA sequences from text prompts
  • LibreFold, which seeks to increase access to AI protein structure prediction systems similar to DeepMind’s AlphaFold 2

Each project is led by independent researchers, but Stability AI provides support in the form of access to its AWS-hosted cluster of over 5,000 Nvidia A100 GPUs to train the AI systems. According to Niccolò Zanichelli, a computer science student at the University of Parma and one of OpenBioML’s lead researchers, that’s enough compute and storage to eventually train up to 10 different AlphaFold 2-like systems in parallel.

“A lot of research in computational biology already leads to open source publications. However, much of it happens at the individual lab level and is therefore typically limited by insufficient computing resources,” Zanichelli told TechCrunch via email. “We want to change that by fostering large-scale collaborations and, thanks to the support of Stability AI, backing those collaborations with resources that only the largest industrial labs have access to.”

Generating DNA sequences

Of OpenBioML’s current projects, DNA-Diffusion — led by pathology professor Luca Pinello’s lab at Massachusetts General Hospital and Harvard Medical School — is perhaps the most ambitious. The goal is to use generative AI systems to learn and apply the rules of “regulatory” DNA sequences, or segments of nucleic acid molecules that influence the expression of specific genes in an organism. Many diseases and disorders result from misregulated genes, but science has yet to discover a reliable process for identifying — much less changing — these regulatory sequences.

DNA-Diffusion proposes to use a type of AI system known as a diffusion model to generate cell type-specific regulatory DNA sequences. Diffusion models — which underlie image generators such as Stable Diffusion and OpenAI’s DALL-E 2 — create new data (e.g., DNA sequences) by learning how to destroy and rebuild many existing data samples. As they are fed more samples, the models get better at recovering the data they previously destroyed, which lets them generate novel examples.
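The destroy-and-rebuild idea can be sketched in a few lines. The toy below (an illustration only, not the DNA-Diffusion project’s actual code; the function names and noise schedule are hypothetical) shows the “destroy” half: a DNA sequence is one-hot encoded and progressively corrupted with Gaussian noise. A trained denoiser network — omitted here — would learn to run this process in reverse to generate new sequences.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    return np.eye(4)[[BASES.index(b) for b in seq]]

def forward_diffuse(x0, t, betas, rng):
    """Sample a noised version of x0 at step t, using the closed-form
    q(x_t | x_0) of the diffusion forward process."""
    a_bar = np.cumprod(1.0 - betas)[t]        # cumulative signal fraction
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.2, 100)           # linear noise schedule
x0 = one_hot("ACGTACGT")

x_early = forward_diffuse(x0, t=5, betas=betas, rng=rng)
x_late = forward_diffuse(x0, t=99, betas=betas, rng=rng)

# Early steps stay close to the clean sequence; late steps are nearly pure noise.
print(np.abs(x_early - x0).mean() < np.abs(x_late - x0).mean())  # prints True
```

Generation then runs the chain backwards: starting from pure noise, a model iteratively predicts and removes the noise, optionally conditioned on a text prompt describing the desired cell type.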


Image Credits: OpenBioML

“Diffusion has achieved widespread success in multimodal generative models and is now beginning to be applied in computational biology, for example to generate new protein structures,” Zanichelli said. “With DNA-Diffusion, we are now exploring its application to genome sequences.”

If all goes according to plan, the DNA-Diffusion project will produce a diffusion model that can generate regulatory DNA sequences from textual instructions such as “a sequence that will activate a gene to its maximum level of expression in cell type X” and “a sequence that activates a gene in the liver and heart but not in the brain.” Such a model could also help interpret the components of regulatory sequences, Zanichelli says — improving the scientific community’s understanding of the role regulatory sequences play in various diseases.

It’s worth noting that this is largely theoretical. Although preliminary studies applying diffusion to protein folding appear promising, it’s still very early, Zanichelli admits — hence the push to get the broader AI community on board.

Predicting protein structures

OpenBioML’s LibreFold, though smaller in scope, is more likely to produce immediate results. The project seeks a better understanding of the machine learning systems that predict protein structures, as well as ways to improve them.

As my colleague Devin Coldewey covered in his piece on DeepMind’s work on AlphaFold 2, AI systems that accurately predict protein shape are relatively new on the scene but transformative in terms of their potential. Proteins consist of sequences of amino acids that fold into shapes to perform various tasks in living organisms. The process of determining what shape an amino acid sequence will produce was once a difficult, error-prone endeavor. AI systems like AlphaFold 2 have changed this; thanks to them, over 98% of the protein structures in the human body are known to science today, as well as hundreds of thousands of other structures in organisms such as E. coli and yeast.

However, few groups have the engineering expertise and resources needed to develop this kind of AI. DeepMind spent days training AlphaFold 2 on tensor processing units (TPUs), Google’s expensive AI accelerator hardware. And amino acid sequence training datasets are often proprietary or released under non-commercial licenses.

Proteins fold into their three-dimensional structure. Image Credits: Christoph Burgstedt/Science Photo Library/Getty Images

“It’s unfortunate because if you look at what the community has been able to build on the AlphaFold 2 checkpoint that DeepMind released, it’s just amazing,” Zanichelli said, referring to the trained AlphaFold 2 model that DeepMind released last year. “For example, just days after the release, Seoul National University professor Minkyung Baek announced a trick on Twitter that allowed the model to predict quaternary structures — something few, if any, expected the model to be capable of. There are many more examples of this kind, so who knows what the wider scientific community could build if it had the ability to train entirely new methods for protein structure prediction like AlphaFold?”

Building on the work of RoseTTAFold and OpenFold, two ongoing community efforts to replicate AlphaFold 2, LibreFold will facilitate “large-scale” experiments with different protein structure prediction systems. Led by researchers from University College London, Harvard and Stockholm, LibreFold’s focus will be gaining a better understanding of what these systems can achieve and why, according to Zanichelli.

“LibreFold is, at its core, a project for the community, by the community. The same applies to the release of model benchmarks and datasets: it may take us only a month or two to start releasing the first results, or it may take significantly longer,” he said. “However, my intuition is that the former is more likely.”

Applying NLP to biochemistry

On a longer time horizon is OpenBioML’s BioLM project, which has the vaguer mission of “applying language modeling techniques derived from NLP to biochemical sequences.” In collaboration with EleutherAI, a research group that has released several open source text generation models, BioLM hopes to train and publish new “biochemical language models” for a range of tasks, including generating protein sequences.

Zanichelli points to Salesforce’s ProGen as an example of the kind of work BioLM could undertake. ProGen treats amino acid sequences as words in a sentence. Trained on a dataset of more than 280 million protein sequences and associated metadata, the model predicts the next set of amino acids from the previous ones, much like a language model predicts the end of a sentence from its beginning.
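The next-token idea behind such models can be illustrated with a deliberately tiny stand-in. The sketch below (hypothetical, for illustration only — real systems like ProGen use billion-parameter transformers, not bigram counts, and the toy “training set” here is made up) treats each amino acid as a token and samples a sequence one residue at a time:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
IDX = {a: i for i, a in enumerate(AMINO_ACIDS)}

def train_bigram(sequences):
    """Count residue-to-residue transitions and normalize into probabilities."""
    counts = np.ones((20, 20))                # add-one (Laplace) smoothing
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[IDX[a], IDX[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def generate(probs, start, length, rng):
    """Sample a sequence autoregressively: each residue is drawn from the
    distribution conditioned on the one before it."""
    seq = [start]
    for _ in range(length - 1):
        nxt = rng.choice(20, p=probs[IDX[seq[-1]]])
        seq.append(AMINO_ACIDS[nxt])
    return "".join(seq)

probs = train_bigram(["MKTAYIAKQR", "MKLVINGKTL"])  # invented toy sequences
print(generate(probs, start="M", length=8, rng=np.random.default_rng(0)))
```

A real protein language model conditions each prediction on the entire preceding sequence (and often metadata such as protein family), but the generation loop is conceptually the same.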

Nvidia earlier this year released a language model, MegaMolBART, that was trained on a dataset of millions of molecules to search for potential drug targets and predict chemical reactions. Meta also recently trained an NLP model called ESM-2 on protein sequences, an approach the company claims allowed it to predict the structures of more than 600 million proteins in just two weeks.


Protein structures predicted by the Meta system. Image Credits: Meta

Looking ahead

While OpenBioML’s interests are broad (and expanding), Mostaque says they are united by a desire to “maximize the positive potential of machine learning and AI in biology,” following in the tradition of open research in science and medicine.

“We aim to enable researchers to gain more control over their experimental pipeline for active learning or model validation purposes,” Mostaque continued. “We also aim to push the state of the art with increasingly general biotechnological models, as opposed to the specialized architectures and learning objectives that currently characterize most computational biology.”

But — as you’d expect from a VC-backed startup that recently raised over $100 million — Stability AI doesn’t see OpenBioML as a purely philanthropic effort. Mostaque says the company is open to exploring the commercialization of technology from OpenBioML “when it’s advanced enough and safe enough and when the time is right.”
