Oppenomics
Biology is messy; that messiness captures the beauty and complexity of our world, including the human body and all its idiosyncrasies. This complexity complicates everything from basic knowledge of how routine functions like immunity work to elucidating targets for treating disease. It is underscored by the fact that only about 12 percent of drugs entering clinical trials are ultimately approved by the FDA, while recent estimates of the average R&D cost per new drug range from less than $1 billion to more than $2 billion. This has led many pharma companies to pivot to data-driven drug discovery. Since the completion of the Human Genome Project in 2003, we have witnessed an explosion of data fueled by the 'omics' revolution: genomics, proteomics, and metabolomics have churned out vast amounts of data that, for the first time, give us unparalleled insight into the inner workings of human biology. We have entered an age of big biology data driven by technological improvements such as high-throughput sequencing, mass spectrometry, and advanced computational analysis. However, as Uncle Ben told Peter Parker in Spider-Man, with great data comes great responsibility. That responsibility is to use this data ethically to derive novel insights that lead to new therapies. This has proven easier said than done, as we have seen repeatedly over the past few years: much-vaunted 'AI-enabled' drug discovery has produced little progress in getting new therapies to patients. Traditional and AI-driven drug discovery share one problem that we aim to solve: a data integration problem. By effectively integrating the ever-growing body of big biology data, we will derive novel insights that drive the discovery of new therapeutic targets for many diseases, including cardiovascular disease, whose burden has steadily increased over the past decade. This will, in turn, spur new treatments that improve overall quality of life and bring long-awaited relief to patients.
The solution uses large language models (LLMs) to integrate genomics, proteomics, and metabolomics data. Many publicly available databases hold large volumes of omics data, and cross-integration of this data will provide unparalleled insights, for example, connecting the upregulation of specific metabolites to the upregulation of particular proteins in certain disease states. Specifically, the focus is to turn these databases into knowledge graphs before using LLMs to integrate the data with its metadata, which, when available, would include the published article, thereby incorporating both structured and unstructured data and making the models more robust. Using knowledge graphs would also help reduce the hallucinations common with LLMs. Using OpenAI's API, the system will initially rely on a multi-agent LLM setup in which the underlying model, GPT-3.5 Turbo, is fine-tuned on publicly available omics data. To reduce costs over time, locally trained base models can be fine-tuned, leading to a 'mixture of experts' system; because these models run on local GPUs, we can tweak them as needed. A minimal sketch of how knowledge-graph triples could be turned into fine-tuning examples follows.
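As a rough illustration only: the snippet below assumes a small list of (subject, relation, object) triples extracted from an omics knowledge graph and converts them into the JSONL chat format that OpenAI's fine-tuning endpoint for GPT-3.5 Turbo expects. The triples, prompts, and file name are hypothetical placeholders, not finalized design decisions.

```python
import json

# Hypothetical triples extracted from an omics knowledge graph.
# (subject, relation, object) -- placeholder values for illustration only.
triples = [
    ("Metabolite:lactate", "ELEVATED_IN", "Disease:heart_failure"),
    ("Protein:LDHA", "PRODUCES", "Metabolite:lactate"),
]

def triple_to_example(subject, relation, obj):
    """Turn one triple into a chat-format fine-tuning example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You answer questions about omics relationships."},
            {"role": "user",
             "content": f"What is the relationship between {subject} and {obj}?"},
            {"role": "assistant",
             "content": f"{subject} {relation} {obj}."},
        ]
    }

# Write the examples to a JSONL file (one JSON object per line), the format
# expected when uploading training data for fine-tuning.
with open("omics_finetune.jsonl", "w") as f:
    for s, r, o in triples:
        f.write(json.dumps(triple_to_example(s, r, o)) + "\n")
```

In practice, the training examples would be richer, drawing on dataset metadata and excerpts from the associated publications, before the JSONL file is uploaded through the OpenAI fine-tuning API.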
Cardiovascular disease remains the leading cause of death, ahead of better-funded research areas such as cancer. It is also well documented that drugs used in heart disease, such as beta-blockers or angiotensin-converting enzyme (ACE) inhibitors, perform differently across racial groups and are often less effective as monotherapies in some populations. The aging boomer population will add to the growing burden of cardiovascular morbidity, since the risk of these disease states increases with age. By integrating the available data, we hope to discover new therapeutic targets for multiple diseases, including cardiovascular disease. The data will also give us insight into possible racial and genetic differences and whether particular targets might be more beneficial to underserved communities, such as African American communities, whose data were largely unavailable and who were not suitably represented in clinical trials for many currently approved medications; such targets may therefore have been missed in earlier drug discovery campaigns. Overall, this would significantly impact therapeutic target identification, which would profoundly affect patients everywhere.
Currently, we have a team of one and many, the many referring to the LLM agents and the mixture of experts. I have a background in pharmacy and am completing my Ph.D. in computational chemistry in a molecular biology and pharmacokinetics-focused program, which places me at the intersection of several disciplines that drive current drug discovery. I have experience working with large language models and other artificial intelligence (AI) models and algorithms, as well as with large datasets, including drug databases, protein data banks, and omics data. I have worked as a pharmacist in hospitals, community pharmacies, and primary healthcare settings, and my path into a drug discovery program, with an internship in the discovery group at Merck along the way, has equipped me with a distinctive perspective. Spending time at the back end that fuels the discovery of drugs, and having served as a pharmacist at the front end interacting directly with patients, has made this personal for me and has crystallized the problems patients face. These are serious, essential problems plaguing not just patients but their families and humanity at large, which we aim to address by discovering new targets. As a Black immigrant in the United States, seeing how underserved populations can respond differently to perceived standard therapy further inspires my pursuit of novel therapies. Getting my Ph.D. at Washington State University has also served me well: we have strong proteomics and metabolomics programs, from which I hope to bring in consultants if needed, and speaking with the people who produce this data has helped inform this solution. I also consistently engage with professors who work with underserved communities, and as a member of the local Black community who has helped with various efforts, including medical outreach, this idea has partly materialized from, and will continue to be refined through, such engagements.
- Collecting, analyzing, curating, and making sense of big data to ensure high-quality inputs, outputs, and insights.
- Creating models and systems that process massive data sets to identify specific targets for precision drugs and treatments.
- Concept: An idea for building a product, service, or business model that is being explored for implementation
- Business Model (e.g. product-market fit, strategy & development)
- Financial (e.g. accounting practices, pitching to investors)
- Human Capital (e.g. sourcing talent, board development)
- Technology (e.g. software or hardware, web development/design)
This solution is innovative, as LLMs have taken the world by storm over the past few months. As has been extensively documented over the years, big companies, such as the big pharma companies, respond slowly to disruptive changes like this. This affords lean, bootstrapped startups like the one I am proposing the opportunity to come in and take advantage of that lag. Combining knowledge graphs and databases in this way is not yet a common use of LLMs, and most organizations that have revealed how they use LLMs are still stuck at feeding vast amounts of text data to these models. There is also still a lack of confidence in these models for drug discovery due to the hallucination problem. What I have in mind would force a rethink of how LLMs are used in the discovery and validation space. Agile, lean startups like the one I am proposing will revolutionize how we use LLMs to discover and, in the future, validate therapeutic targets. I strongly believe this venture can capture a large share of the available market, as we are only beginning to discover what LLMs are capable of.
The authors of the UN Sustainable Development Goals would agree that this solution cuts to the heart of the problem addressed in Goal 3, which involves ensuring healthy lives and well-being for people of all ages. For many of the sub-targets, such as those dealing with the reduction of communicable and non-communicable diseases, the solution proposed here will provide new therapeutic targets, which will inevitably lead to more therapeutic options. The approach is easily scalable, as the targets it finds can be licensed to major pharmaceutical companies to continue the development of therapeutics. Overall, this solution can also inspire more companies to enter the space and create more therapeutic options, undoubtedly accelerating efforts to improve people's health.
The core of the solution depends on large language models, which can process and generate human-readable text. Their primary strength lies in their ability to understand, generate, and complete text in coherent and contextually relevant ways, and this capacity to summarize with context makes the models robust enough to surface connections that may not be obvious at first given the size of the datasets. Applying hard cutoffs to prune experimentally generated datasets is a common way to make them tractable for visualization and inference, but this can discard important data. By instead turning the data into knowledge graphs, utilizing libraries like PyG and graph database management systems like Neo4j, we can quickly convert all of this information into graphs that can then be used to fine-tune LLMs; a minimal sketch of graph loading follows.
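As a sketch only: the snippet below uses the neo4j Python driver to load a few hypothetical protein-metabolite-disease relationships into a Neo4j graph. The node labels, relationship types, connection URI, and credentials are illustrative assumptions, not a finalized schema.

```python
from neo4j import GraphDatabase

# Placeholder connection details; a real deployment would use its own
# URI and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical edges: (protein, relationship type, metabolite, disease context).
edges = [
    ("LDHA", "PRODUCES", "lactate", "heart failure"),
    ("PCSK9", "REGULATES", "LDL cholesterol", "atherosclerosis"),
]

def load_edge(tx, protein, rel, metabolite, disease):
    # MERGE keeps the load idempotent: re-running does not duplicate nodes
    # or relationships. Relationship types cannot be passed as Cypher
    # parameters, hence the string concatenation for `rel`.
    tx.run(
        "MERGE (p:Protein {name: $protein}) "
        "MERGE (m:Metabolite {name: $metabolite}) "
        "MERGE (d:Disease {name: $disease}) "
        "MERGE (p)-[:" + rel + "]->(m) "
        "MERGE (m)-[:OBSERVED_IN]->(d)",
        protein=protein, metabolite=metabolite, disease=disease,
    )

with driver.session() as session:
    for protein, rel, metabolite, disease in edges:
        session.execute_write(load_edge, protein, rel, metabolite, disease)

driver.close()
```

From a graph built this way, subgraphs or triples can be exported either for graph learning with PyG or as text for LLM fine-tuning.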
Initially, the plan is to utilize the PRoteomics IDEntifications Database (PRIDE), the standard, carefully curated repository for depositing proteomics datasets. It holds over 12,000 datasets for humans and over 27,000 datasets overall. Add-ons such as Reactome links and biological and experimental metadata make it the ideal starting database. Other databases of interest include the Small Molecule Pathway Database (SMPDB), which is integrated with the Human Metabolome Database (HMDB). Additionally, if needed, we can obtain commercially available datasets to bolster the public ones. A hedged sketch of programmatic access to PRIDE appears below.
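For illustration, assuming PRIDE's public REST archive API, a keyword search for cardiovascular proteomics projects might look like the following; the endpoint, parameter names, and response structure are assumptions and should be verified against the current PRIDE documentation.

```python
import requests

# Assumed PRIDE Archive REST endpoint; verify against the current
# PRIDE documentation before relying on it.
PRIDE_SEARCH_URL = "https://www.ebi.ac.uk/pride/ws/archive/v2/search/projects"

def search_pride(keyword, page_size=20):
    """Return a page of PRIDE project records matching a keyword."""
    resp = requests.get(
        PRIDE_SEARCH_URL,
        params={"keyword": keyword, "pageSize": page_size},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    results = search_pride("cardiovascular")
    # The response structure is assumed; inspect it before parsing further.
    print(str(results)[:200])
```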
AI safety and the ethical use of AI in this solution are things I take seriously, as further integration of AI into everyday life can inadvertently increase potential risks. The datasets, such as those in the PRIDE database, are already de-identified and publicly available, reducing the risk of compromising private information. However, if we add datasets from commercial sources, we will ensure the data was obtained ethically and with privacy in mind.
Where I can, I will ensure the datasets being used are balanced, reflecting the diversity of the patients for whom we are screening these new therapeutic targets. When this solution scales up, as I believe it will, we may start approaching institutions and research clinics for more data, which we will treat with privacy, safety, and fairness as paramount. We will also ensure there is always a human in the loop so that we are not overly reliant on AI for all decisions.
In one year, the expectation is to have the model up and running and to validate at least two major targets, either as monotherapy targets or as targets used in combination with other known targets to improve therapeutic outcomes or reduce side effects.
In 5 years, this solution should scale into a vertically integrated company where we control the end-to-end process of prediction and validation of targets. I believe vertical integration is the best way to ensure quality, reliability, and low costs. Also, part of the five-year impact is to have at least four drugs that bind to our validated targets in clinical trials.
- Not registered as any organization
This is a concept at this point, with a staff of one. However, if needed during the process, there are plans to engage at least one consultant.
I have been working on graphs and graph-based representations of different kinds of data for over a year. With LLMs, I have been building on top of local LLMs and OpenAI's API for over five months. I have been working on efficiently representing omics data for three months and have high hopes that this will revolutionize how we organize and use such data.
While we have a team of one, I am a Black immigrant from Nigeria in West Africa who understands how diversity can help in idea generation and implementation. Diversity strengthens teams by bringing together people with different backgrounds and experiences, and we will ensure that the team makeup cuts across races and genders. When this solution scales, the hiring ethos will keep diversity, equity, and inclusion central to hiring. Examples include recruiting visits to historically Black colleges and universities and conferences like SACNAS, which will ensure we provide enough opportunities for minorities while balancing that with hiring from elite universities. Balance is the key; we will make it a melting pot for all genders and races.
While this solution is still in its infancy, initial thoughts on an operational model and plan include:
1. Delivering impact: LLMs are currently driving the search for artificial general intelligence (AGI), a testament to their ability to derive new insights from data. Turning them loose on this data will drive the discovery of new targets, which will have an outsized impact on the discovery of new therapeutics.
2. Partner/User engagement: Over the years, working closely with the omics groups at WSU and while interning at Merck in Cambridge and Boston, I have made many connections with policymakers in these organizations who I believe would be incredibly interested in a resource like this. The plan is to hold demo days that showcase the model's capabilities and show how this solution could address the priority disease conditions they are interested in.
3. Tools and resources: The tools I need are readily available, and OpenAI and Lambda Cloud are reliable providers of the required compute resources. Barring funding issues, there are no anticipated problems with access to the tools needed to build out this solution.
A report from Trinity Life Sciences estimates that 90 percent of large pharmaceutical companies initiated AI/ML projects in 2020, and since 2015, almost 100 partnerships have been identified between AI vendors and big pharma companies. This number is set to rise in the coming years with the massive adoption of LLMs and generative AI. The plan is to show proof of concept and sign research alliances/partnerships with pharma companies to identify new therapeutic targets for diseases of interest. This model has proven successful for other AI companies, and the innovative techniques proposed here will put this solution ahead of many that are already available.
Grants from foundations such as the Bill & Melinda Gates Foundation and the Chan Zuckerberg Initiative are also an option; the plan is to apply for these as soon as the solution gains traction.
When this solution scales, the plan is to raise investment capital to support its long-term success.
This is currently at the concept stage with a team of one, so there are no current operating costs. The projected operational costs for this year are approximately $295k: $45k for API costs and renting GPUs for local training, and $200k for relocation and compensation, as I intend to move to New York when selected to work full-time on this solution.
For funding, we will require around $50k to start. For GPT-3.5 Turbo, training costs about $0.008 per 1K tokens and usage input about $0.012 per 1K tokens, and I estimate a maximum of $15-20k until we can get the models up and running.
Once the solution is optimized, we hope to rent H100s on Lambda Cloud at $2.59/hr/GPU to run the local models, which amounts to $22,377.60 for one GPU running around the clock, 30 days a month, for 12 months; the arithmetic is sketched below.
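To make the budget arithmetic explicit, here is a small calculation reproducing the figures above. The fine-tuning token volumes are hypothetical placeholders used only to show how the $15-20k estimate scales; they are not committed usage numbers.

```python
# GPU rental: one H100 on Lambda Cloud, running around the clock.
gpu_rate_per_hr = 2.59          # $/hr/GPU (quoted rate)
gpu_annual = gpu_rate_per_hr * 24 * 30 * 12
print(f"GPU rental, 12 months: ${gpu_annual:,.2f}")   # $22,377.60

# GPT-3.5 Turbo fine-tuning estimate (rates quoted above).
train_rate = 0.008 / 1000       # $ per training token
input_rate = 0.012 / 1000       # $ per input token at usage time

# Hypothetical volumes, for illustration only.
training_tokens = 1_000_000_000      # 1B tokens of graph-derived text
usage_input_tokens = 500_000_000     # 0.5B tokens of queries
estimate = training_tokens * train_rate + usage_input_tokens * input_rate
print(f"Illustrative fine-tuning + usage estimate: ${estimate:,.0f}")
```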
The remaining money will be allocated to miscellaneous expenses, including attending important conferences related to building this solution and, depending on the workstations available at CURE, procuring a Linux workstation.
I am excited about everything the CURE residency brings to the table and believe it will all be very important in crystallizing this vision and building this solution. However, I am most excited about the following:
1. Seed funding: This will be very important for getting the concept off the ground and moving it from idea to actual product. It is not just that the funding will support this solution; it also signals belief in the proposed approach.
2. Mentorship: As a first-time entrepreneur, having people who have worked in and have experience in this space will be invaluable. Such guidance will help in navigating the obstacles of building out this solution.
3. Networking opportunities: Networking with people working on the same issues and problems will significantly boost brainstorming. I also hope to be inspired by listening to their stories, which are many and varied.