Oppenomics
Biology is messy; that messiness captures the beauty and complexity of our world, including the human body and all its idiosyncrasies. This complexity complicates everything from basic knowledge of how routine functions like immunity work to elucidating targets for treating disease. It is underscored by the fact that only about 12 percent of drugs entering clinical trials are ultimately approved by the FDA, while recent estimates of the average R&D cost per new drug range from less than $1 billion to more than $2 billion. This has led many pharma companies to pivot to data-driven drug discovery. Since the completion of the Human Genome Project in 2003, we have witnessed an explosion of data fueled by the 'omics' revolution: genomics, proteomics, and metabolomics have churned out vast amounts of data that, for the first time, give us unparalleled insight into the inner workings of human biology. We have entered an age of big biology data driven by technological improvements such as high-throughput sequencing, mass spectrometry, and advanced computational analysis. However, as Uncle Ben told Peter Parker in Spider-Man, with great data comes great responsibility. That responsibility is to use this data ethically to derive novel insights that lead to new therapies. This has proven easier said than done, as we have seen repeatedly over the past few years: much-vaunted 'AI-enabled' drug discovery has produced little progress in getting new therapies to patients. Traditional and AI-driven drug discovery share one problem that we aim to solve: a data integration problem. By effectively integrating the ever-growing body of big biology data, we will derive novel insights that drive the discovery of new therapeutic targets for many diseases, including cardiovascular disease, whose burden has steadily increased over the past decade. This will, in turn, spur new treatments that improve overall quality of life and bring long-awaited relief to patients.
The solution uses large language models (LLMs) to integrate genomics, proteomics, and metabolomics data. Many publicly available databases hold large volumes of omics data, and cross-integration of this data will provide unparalleled insights, for example, connecting the upregulation of specific metabolites to the upregulation of particular proteins in certain disease states. Specifically, the focus is to turn these databases into knowledge graphs before using LLMs to integrate the data with its metadata, which, when available, would include the published article, thereby incorporating both structured and unstructured data and making the models more robust. Using knowledge graphs would also help reduce the hallucinations common with LLMs. Using OpenAI's API, the system will initially rely on a multi-agent LLM setup in which the underlying model, GPT-3.5 Turbo, is fine-tuned on publicly available omics data. To reduce costs over time, locally trained base models can be fine-tuned, leading to a 'mixture of experts' system; because these models run on local GPUs, we can tweak them as needed. A minimal sketch of how knowledge-graph triples could be turned into fine-tuning examples follows.
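As a rough illustration only: the snippet below assumes a small list of (subject, relation, object) triples extracted from an omics knowledge graph and converts them into the JSONL chat format that OpenAI's fine-tuning endpoint for GPT-3.5 Turbo expects. The triples, prompts, and file name are hypothetical placeholders, not finalized design decisions.

```python
import json

# Hypothetical triples extracted from an omics knowledge graph.
# (subject, relation, object) -- placeholder values for illustration only.
triples = [
    ("Metabolite:lactate", "ELEVATED_IN", "Disease:heart_failure"),
    ("Protein:LDHA", "PRODUCES", "Metabolite:lactate"),
]

def triple_to_example(subject, relation, obj):
    """Turn one triple into a chat-format fine-tuning example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You answer questions about omics relationships."},
            {"role": "user",
             "content": f"What is the relationship between {subject} and {obj}?"},
            {"role": "assistant",
             "content": f"{subject} {relation} {obj}."},
        ]
    }

# Write the examples to a JSONL file (one JSON object per line), the format
# expected when uploading training data for fine-tuning.
with open("omics_finetune.jsonl", "w") as f:
    for s, r, o in triples:
        f.write(json.dumps(triple_to_example(s, r, o)) + "\n")
```

In practice, the training examples would be richer, drawing on dataset metadata and excerpts from the associated publications, before the JSONL file is uploaded through the OpenAI fine-tuning API.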
Cardiovascular disease remains the leading cause of death, ahead of better-funded research areas such as cancer. It is also well documented that drugs used in heart disease, such as beta-blockers or angiotensin-converting enzyme (ACE) inhibitors, perform differently across racial groups and are often less effective as monotherapies in some populations. The aging boomer population will add to the growing burden of cardiovascular morbidity, since the risk of these disease states increases with age. By integrating the available data, we hope to discover new therapeutic targets for multiple diseases, including cardiovascular disease. The data will also give us insight into possible racial and genetic differences and whether particular targets might be more beneficial to underserved communities, such as African American communities, whose data were largely unavailable and who were not suitably represented in clinical trials for many currently approved medications; such targets may therefore have been missed in earlier drug discovery campaigns. Overall, this would significantly impact therapeutic target identification, which would profoundly affect patients everywhere.
Currently, we have a team of one and many, the many referring to the LLM agents and the mixture of experts. I have a background in pharmacy and am completing my Ph.D. in computational chemistry in a molecular biology and pharmacokinetics-focused program, which places me at the intersection of several disciplines that drive current drug discovery. I have experience working with large language models and other artificial intelligence (AI) models and algorithms, as well as with large datasets, including drug databases, protein data banks, and omics data. I have worked as a pharmacist in hospitals, community pharmacies, and primary healthcare settings, and my path into a drug discovery program, with an internship in the discovery group at Merck along the way, has equipped me with a distinctive perspective. Spending time at the back end that fuels the discovery of drugs, and having served as a pharmacist at the front end interacting directly with patients, has made this personal for me and has crystallized the problems patients face. These are serious, essential problems plaguing not just patients but their families and humanity at large, which we aim to address by discovering new targets. As a Black immigrant in the United States, seeing how underserved populations can respond differently to perceived standard therapy further inspires my pursuit of novel therapies. Getting my Ph.D. at Washington State University has also served me well: we have strong proteomics and metabolomics programs, from which I hope to bring in consultants if needed, and speaking with the people who produce this data has helped inform this solution. I also consistently engage with professors who work with underserved communities, and as a member of the local Black community who has helped with various efforts, including medical outreach, this idea has partly materialized from, and will continue to be refined through, such engagements.
- Collecting, analyzing, curating, and making sense of big data to ensure high-quality inputs, outputs, and insights.
- Creating models and systems that process massive data sets to identify specific targets for precision drugs and treatments.
- Concept: An idea for building a product, service, or business model that is being explored for implementation
- Business Model (e.g. product-market fit, strategy & development)
- Financial (e.g. accounting practices, pitching to investors)
- Human Capital (e.g. sourcing talent, board development)
- Technology (e.g. software or hardware, web development/design)
This solution is innovative, as LLMs have taken the world by storm over the past few months. As has been extensively documented over the years, big companies, such as the big pharma companies, respond slowly to disruptive changes like this. This affords lean, bootstrapped startups like the one I am proposing the opportunity to come in and take advantage of that lag. Combining knowledge graphs and databases in this way is not yet a common use of LLMs, and most organizations that have revealed how they use LLMs are still stuck at feeding vast amounts of text data to these models. There is also still a lack of confidence in these models for drug discovery due to the hallucination problem. What I have in mind would force a rethink of how LLMs are used in the discovery and validation space. Agile, lean startups like the one I am proposing will revolutionize how we use LLMs to discover and, in the future, validate therapeutic targets. I strongly believe this venture can capture a large share of the available market, as we are only beginning to discover what LLMs are capable of.
The authors of the UN Sustainable Development Goals would agree that this solution cuts to the heart of the problem addressed in Goal 3, which involves ensuring healthy lives and well-being for people of all ages. For many of the sub-targets, such as those dealing with the reduction of communicable and non-communicable diseases, the solution proposed here will provide new therapeutic targets, which will inevitably lead to more therapeutic options. The approach is easily scalable, as the targets it finds can be licensed to major pharmaceutical companies to continue the development of therapeutics. Overall, this solution can also inspire more companies to enter the space and create more therapeutic options, undoubtedly accelerating efforts to improve people's health.
The core of the solution depends on large language models, which can process and generate human-readable text. Their primary strength lies in their ability to understand, generate, and complete text in coherent and contextually relevant ways, and this capacity to summarize with context makes the models robust enough to surface connections that may not be obvious at first given the size of the datasets. Applying hard cutoffs to prune experimentally generated datasets is a common way to make them tractable for visualization and inference, but this can discard important data. By instead turning the data into knowledge graphs, utilizing libraries like PyG and graph database management systems like Neo4j, we can quickly convert all of this information into graphs that can then be used to fine-tune LLMs; a minimal sketch of graph loading follows.
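As a sketch only: the snippet below uses the neo4j Python driver to load a few hypothetical protein-metabolite-disease relationships into a Neo4j graph. The node labels, relationship types, connection URI, and credentials are illustrative assumptions, not a finalized schema.

```python
from neo4j import GraphDatabase

# Placeholder connection details; a real deployment would use its own
# URI and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical edges: (protein, relationship type, metabolite, disease context).
edges = [
    ("LDHA", "PRODUCES", "lactate", "heart failure"),
    ("PCSK9", "REGULATES", "LDL cholesterol", "atherosclerosis"),
]

def load_edge(tx, protein, rel, metabolite, disease):
    # MERGE keeps the load idempotent: re-running does not duplicate nodes
    # or relationships. Relationship types cannot be passed as Cypher
    # parameters, hence the string concatenation for `rel`.
    tx.run(
        "MERGE (p:Protein {name: $protein}) "
        "MERGE (m:Metabolite {name: $metabolite}) "
        "MERGE (d:Disease {name: $disease}) "
        "MERGE (p)-[:" + rel + "]->(m) "
        "MERGE (m)-[:OBSERVED_IN]->(d)",
        protein=protein, metabolite=metabolite, disease=disease,
    )

with driver.session() as session:
    for protein, rel, metabolite, disease in edges:
        session.execute_write(load_edge, protein, rel, metabolite, disease)

driver.close()
```

From a graph built this way, subgraphs or triples can be exported either for graph learning with PyG or as text for LLM fine-tuning.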
Initially, the plan is to utilize the PRoteomics IDEntifications Database (PRIDE), the standard, carefully curated repository for depositing proteomics datasets. It holds over 12,000 datasets for humans and over 27,000 datasets overall. Add-ons such as Reactome links and biological and experimental metadata make it the ideal starting database. Other databases of interest include the Small Molecule Pathway Database (SMPDB), which is integrated with the Human Metabolome Database (HMDB). Additionally, if needed, we can obtain commercially available datasets to bolster the public ones. A hedged sketch of programmatic access to PRIDE appears below.
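For illustration, assuming PRIDE's public REST archive API, a keyword search for cardiovascular proteomics projects might look like the following; the endpoint, parameter names, and response structure are assumptions and should be verified against the current PRIDE documentation.

```python
import requests

# Assumed PRIDE Archive REST endpoint; verify against the current
# PRIDE documentation before relying on it.
PRIDE_SEARCH_URL = "https://www.ebi.ac.uk/pride/ws/archive/v2/search/projects"

def search_pride(keyword, page_size=20):
    """Return a page of PRIDE project records matching a keyword."""
    resp = requests.get(
        PRIDE_SEARCH_URL,
        params={"keyword": keyword, "pageSize": page_size},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = search_pride("cardiovascular")
    # The response structure is assumed; inspect it before parsing further.
    print(str(results)[:200])
```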
AI safety and the ethical use of AI in this solution are things I take seriously, as further integration of AI into everyday life can inadvertently increase potential risks. The datasets, such as those in the PRIDE database, are already de-identified and publicly available, reducing the risk of compromising private information. However, if we add datasets from commercial sources, we will ensure the data was obtained ethically and with privacy in mind.
Where I can, I will ensure the datasets being used are balanced, reflecting the diversity of the patients for whom we are screening these new therapeutic targets. When this solution scales up, as I believe it will, we may start approaching institutions and research clinics for more data, which we will treat with privacy, safety, and fairness as paramount. We will also ensure there is always a human in the loop so that we are not overly reliant on AI for all decisions.
In one year, the expectation is to have the model up and running and to validate at least two major targets, either as monotherapy targets or as targets used in combination with other known targets to improve therapeutic outcomes or reduce side effects.
In 5 years, this solution should scale into a vertically integrated company where we control the end-to-end process of prediction and validation of targets. I believe vertical integration is the best way to ensure quality, reliability, and low costs. Also, part of the five-year impact is to have at least four drugs that bind to our validated targets in clinical trials.
- Not registered as any organization
This is a concept at this point, with a staff of one. However, if needed during the process, there are plans to engage at least one consultant.
I have been working on graphs and graph-based representations of different kinds of data for over a year. With LLMs, I have been building on top of local LLMs and OpenAI's API for over five months. I have been working on efficiently representing omics data for three months and have high hopes that this will revolutionize how we organize and use such data.
While we have a team of one, I am a Black immigrant from Nigeria in West Africa who understands how diversity can help in idea generation and implementation. Diversity strengthens teams by bringing together people with different backgrounds and experiences, and we will ensure that the team makeup cuts across races and genders. When this solution scales, the hiring ethos will keep diversity, equity, and inclusion central to hiring. Examples include recruiting visits to historically Black colleges and universities and conferences like SACNAS, which will ensure we provide enough opportunities for minorities while balancing that with hiring from elite universities. Balance is the key; we will make it a melting pot for all genders and races.
While this solution is still in its infancy, initial thoughts on an operational model and plan include:
1. Delivering impact: LLMs are currently driving the search for artificial general intelligence (AGI), a testament to their ability to derive new insights from data. Turning them loose on this data will drive the discovery of new targets, which will have an outsized impact on the discovery of new therapeutics.
2. Partner/User engagement: Over the years, working closely with the omics groups at WSU and while interning at Merck in Cambridge and Boston, I have made many connections with policymakers in these organizations who I believe would be incredibly interested in a resource like this. The plan is to hold demo days that showcase the model's capabilities and show how this solution could address the priority disease conditions they are interested in.
3. Tools and resources: The tools I need are readily available, and OpenAI and Lambda Cloud are reliable providers of the required compute resources. Barring funding issues, there are no anticipated problems with access to the tools needed to build out this solution.
A report from Trinity Life Sciences estimates that 90 percent of large pharmaceutical companies initiated AI/ML projects in 2020, and since 2015, almost 100 partnerships have been identified between AI vendors and big pharma companies. This number is set to rise in the coming years with the massive adoption of LLMs and generative AI. The plan is to show proof of concept and sign research alliances/partnerships with pharma companies to identify new therapeutic targets for diseases of interest. This model has proven successful for other AI companies, and the innovative techniques proposed here will put this solution ahead of many that are already available.
Grants from foundations such as the Bill & Melinda Gates Foundation and the Chan Zuckerberg Initiative are also an option; the plan is to apply for these as soon as the solution gains traction.
When this solution scales, the plan is to raise investment capital to support its long-term success.
This is currently at the concept stage with a team of one, so there are no current operating costs. The projected operational costs for this year are approximately $295k: $45k for API costs and renting GPUs for local training, and $200k for relocation and compensation, as I intend to move to New York when selected to work full-time on this solution.
For funding, we will require around $50k to start. For GPT-3.5 Turbo, training costs about $0.008 per 1K tokens and usage input about $0.012 per 1K tokens, and I estimate a maximum of $15-20k until we can get the models up and running.
Once the solution is optimized, we hope to rent H100s on Lambda Cloud at $2.59/hr/GPU to run the local models, which amounts to $22,377.60 for one GPU running around the clock, 30 days a month, for 12 months; the arithmetic is sketched below.
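To make the budget arithmetic explicit, here is a small calculation reproducing the figures above. The fine-tuning token volumes are hypothetical placeholders used only to show how the $15-20k estimate scales; they are not committed usage numbers.

```python
# GPU rental: one H100 on Lambda Cloud, running around the clock.
gpu_rate_per_hr = 2.59          # $/hr/GPU (quoted rate)
gpu_annual = gpu_rate_per_hr * 24 * 30 * 12
print(f"GPU rental, 12 months: ${gpu_annual:,.2f}")   # $22,377.60

# GPT-3.5 Turbo fine-tuning estimate (rates quoted above).
train_rate = 0.008 / 1000       # $ per training token
input_rate = 0.012 / 1000       # $ per input token at usage time

# Hypothetical volumes, for illustration only.
training_tokens = 1_000_000_000      # 1B tokens of graph-derived text
usage_input_tokens = 500_000_000     # 0.5B tokens of queries
estimate = training_tokens * train_rate + usage_input_tokens * input_rate
print(f"Illustrative fine-tuning + usage estimate: ${estimate:,.0f}")
```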
The remaining money will be allocated to miscellaneous expenses, including attending important conferences related to building this solution and, depending on the workstations available at CURE, procuring a Linux workstation.
I am excited about everything the CURE residency brings to the table and believe it will all be very important in crystallizing this vision and building this solution. However, I am most excited about the following:
1. Seed funding: This will be very important for getting the concept off the ground and moving it from idea to actual product. It is not just that the funding will support this solution; it also signals belief in the proposed approach.
2. Mentorship: As a first-time entrepreneur, having people who have worked in and have experience in this space will be invaluable. Such guidance will help in navigating the obstacles of building out this solution.
3. Networking opportunities: Networking with people working on the same issues and problems will significantly boost brainstorming. I also hope to be inspired by listening to their stories, which are many and varied.