Serafim Batzoglou is Chief Data Officer at Seer. Prior to joining Seer, Serafim served as Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. Prior to Insitro, he served as VP of Applied and Computational Biology at Illumina, leading research and technology development of AI and molecular assays for making genomic data more interpretable in human health.
What initially attracted you to the field of genomics?
I became interested in the field of computational biology at the start of my PhD in computer science at MIT, when I took a class on the topic taught by Bonnie Berger, who became my PhD advisor, and David Gifford. The Human Genome Project was picking up pace during my PhD. Eric Lander, who was heading the Genome Center at MIT, became my PhD co-advisor and involved me in the project. Motivated by the Human Genome Project, I worked on whole-genome assembly and comparative genomics of human and mouse DNA.
I then moved to Stanford University as faculty in the Computer Science department, where I spent 15 years and was privileged to have advised about 30 extremely talented PhD students and many postdoctoral researchers and undergraduates. My group's focus has been the application of algorithms, machine learning and the building of software tools for the analysis of large-scale genomic and biomolecular data. I left Stanford in 2016 to lead a research and technology development group at Illumina. Since then, I have enjoyed leading R&D teams in industry. I find that teamwork, the business aspect, and a more direct impact on society are characteristic of industry compared to academia. I have worked at innovative companies throughout my career: DNAnexus, which I co-founded in 2009, Illumina, Insitro and now Seer. Computation and machine learning are essential across the technology chain in biotech, from technology development, to data acquisition, to biological data interpretation and translation to human health.
Over the last 20 years, sequencing the human genome has become vastly cheaper and faster. This has led to dramatic growth in the genome sequencing market and broader adoption across the life sciences industry. We are now at the cusp of having population-scale genomic, multi-omic and phenotypic data of sufficient size to meaningfully revolutionize healthcare, including prevention, diagnosis, treatment and drug discovery. We can increasingly discover the molecular underpinnings of disease for individuals through computational analysis of genomic data, and patients have the chance to receive treatments that are personalized and targeted, especially in the areas of cancer and rare genetic disease. Beyond the obvious uses in medicine, machine learning coupled with genomic information allows us to gain insights into other areas of our lives, such as our genealogy and nutrition. The next several years will see adoption of personalized, data-driven healthcare, first for select groups of people, such as rare disease patients, and increasingly for the broad public.
Prior to your current role you were Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. What were some of your key takeaways from that period on how machine learning can be used to accelerate drug discovery?
The traditional drug discovery and development "trial-and-error" paradigm is plagued with inefficiencies and extremely lengthy timelines. For one drug to get to market, it can take upwards of $1 billion and over a decade. By incorporating machine learning into these efforts, we can dramatically reduce costs and timelines at several steps along the way. One step is target identification, where a gene or set of genes that modulate a disease phenotype or revert a diseased cellular state to a healthier state can be identified through large-scale genetic and chemical perturbations, and phenotypic readouts such as imaging and functional genomics. Another step is compound identification and optimization, where a small molecule or other modality can be designed by machine learning-driven in silico prediction as well as in vitro screening, and moreover desired properties of a drug such as solubility, permeability, specificity and non-toxicity can be optimized. The hardest, as well as the most important, aspect is perhaps translation to humans. Here, choosing the right model (induced pluripotent stem cell-derived lines, versus primary patient cell lines and tissue samples, versus animal models) for the right disease poses an incredibly important set of tradeoffs that ultimately determine the ability of the resulting data plus machine learning to translate to patients.
Seer Bio is pioneering new ways to decode the secrets of the proteome to improve human health. For readers who are unfamiliar with this term, what is the proteome?
The proteome is the changing set of proteins produced or modified by an organism over time and in response to environment, nutrition and health state. Proteomics is the study of the proteome within a given cell type or tissue sample. The genome of a human or other organism is static: with the important exception of somatic mutations, the genome at birth is the genome one has for their entire life, copied exactly in each cell of their body. The proteome is dynamic and changes on time spans of years, days and even minutes. As such, proteomes are vastly closer to phenotype, and ultimately to health status, than genomes are, and consequently much more informative for monitoring health and understanding disease.
At Seer, we have developed a new way to access the proteome that provides deeper insights into proteins and proteoforms in complex samples such as plasma, which is a highly accessible sample that unfortunately has to date posed a great challenge for conventional mass spectrometry proteomics.
What is Seer's Proteograph™ platform and how does it offer a new view of the proteome?
Seer's Proteograph platform leverages a library of proprietary engineered nanoparticles, powered by a simple, rapid and automated workflow, enabling deep and scalable interrogation of the proteome.
The Proteograph platform shines in interrogating plasma and other complex samples that exhibit large dynamic range (many orders of magnitude difference in the abundance of various proteins in the sample), where conventional mass spectrometry methods are unable to detect the low-abundance part of the proteome. Seer's nanoparticles are engineered with tunable physicochemical properties that gather proteins across the dynamic range in an unbiased manner. In typical plasma samples, our technology enables detection of 5x to 8x more proteins than when processing neat plasma without the Proteograph. As a result, from sample prep to instrumentation to data analysis, our Proteograph Product Suite helps scientists find proteome disease signatures that might otherwise be undetectable. We like to say that at Seer, we are opening up a new gateway to the proteome.
Additionally, we are enabling scientists to easily perform large-scale proteogenomic studies. Proteogenomics combines genomic data with proteomic data to identify and quantify protein variants, link genomic variants with protein abundance levels, and ultimately link the genome and the proteome to phenotype and disease, beginning to disentangle the causal and downstream genetic pathways associated with disease.
Can you discuss some of the machine learning technology currently used at Seer Bio?
Seer is leveraging machine learning at every step, from technology development to downstream data analysis. These steps include: (1) design of our proprietary nanoparticles, where machine learning helps us determine which physicochemical properties and combinations of nanoparticles will work with specific product lines and assays; (2) detection and quantification of peptides, proteins, variants and proteoforms from the readout data produced by the MS instruments; (3) downstream proteomic and proteogenomic analyses in large-scale population cohorts.
Last year, we published a paper in Advanced Materials combining proteomics methods, nanoengineering and machine learning to improve our understanding of the mechanisms of protein corona formation. This paper uncovered nano-bio interactions and is informing Seer in the creation of improved future nanoparticles and products.
Beyond nanoparticle development, we have been developing novel algorithms to identify variant peptides and post-translational modifications (PTMs). We recently developed a method for detection of protein quantitative trait loci (pQTLs) that is robust to protein variants, which are a known confounder for affinity-based proteomics. We are extending this work to identify these peptides directly from the raw spectra using deep learning-based de novo sequencing methods, allowing search without inflating the size of spectral libraries.
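Seer's variant-robust method itself is not spelled out here, but the core of any pQTL analysis is an association test between genotype and protein abundance. As a minimal sketch of that baseline (with simulated data and arbitrary effect sizes, not Seer's algorithm):

```python
# Minimal sketch of a pQTL association test: regress protein abundance on
# genotype dosage (0/1/2 copies of the alternate allele). Simulated data;
# a variant-robust method like the one described above is more involved.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_samples = 500
dosage = rng.integers(0, 3, size=n_samples)          # genotypes at one locus
# Simulated log-intensity of one protein: a true genetic effect plus noise.
abundance = 0.4 * dosage + rng.normal(0.0, 1.0, size=n_samples)

# Ordinary least squares: the slope is the effect size per allele, and the
# p-value tests the null hypothesis of no genotype-abundance association.
result = stats.linregress(dosage, abundance)
print(f"beta = {result.slope:.3f}, p = {result.pvalue:.2e}")
```

In practice this test is repeated across millions of variant-protein pairs with covariate adjustment and multiple-testing correction; robustness to protein variants addresses the case where the variant changes the peptide being measured rather than the protein's abundance.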
Our team is also developing methods that enable scientists without deep expertise in machine learning to optimally tune and utilize machine learning models in their discovery work. This is achieved via a Seer ML framework based on AutoML tooling, which allows efficient hyperparameter tuning via Bayesian optimization.
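Seer's framework is internal, but the general pattern of Bayesian hyperparameter tuning that such AutoML tooling automates can be illustrated with the open-source Optuna library; the model and search space below are arbitrary examples, not Seer's:

```python
# Sketch of Bayesian-style hyperparameter tuning with Optuna: the sampler
# builds a probabilistic model of past trials to propose promising settings.
# The classifier, dataset and search ranges are illustrative only.
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial: optuna.Trial) -> float:
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 6),
    }
    model = GradientBoostingClassifier(**params, random_state=0)
    # Cross-validated accuracy is the metric the optimizer maximizes.
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print("best params:", study.best_params)
```

The appeal for non-experts is that the user supplies only a search space and a metric; the optimizer decides which configurations to try next, typically converging in far fewer trials than grid search.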
Finally, we are developing methods to reduce batch effects and increase the quantitative accuracy of the mass spec readout, by modeling the measured quantitative values to maximize expected metrics such as the correlation of intensity values across peptides within a protein group.
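As a toy illustration of the batch-effect problem (using simple per-batch median centering, a much cruder technique than the correlation-maximizing modeling described above), here is a sketch with hypothetical data:

```python
# Toy batch-effect correction: remove per-batch median shifts from
# log-transformed peptide intensities. Real models, including the
# correlation-maximizing approach described above, are more sophisticated.
import numpy as np
import pandas as pd

# Hypothetical long-format table: one row per (batch, peptide) measurement.
df = pd.DataFrame({
    "batch":     ["A", "A", "A", "B", "B", "B"],
    "peptide":   ["p1", "p2", "p3", "p1", "p2", "p3"],
    "intensity": [1000.0, 2200.0, 150.0, 1800.0, 4100.0, 300.0],
})

df["log_intensity"] = np.log2(df["intensity"])
# Subtract each batch's median, then add back the global median so the
# corrected values remain on the original log2 scale.
global_median = df["log_intensity"].median()
df["corrected"] = (
    df["log_intensity"]
    - df.groupby("batch")["log_intensity"].transform("median")
    + global_median
)
print(df[["batch", "peptide", "corrected"]])
```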
Hallucinations are a common issue with LLMs. What are some of the solutions to prevent or mitigate them?
LLMs are generative methods that are given a large corpus and trained to generate similar text. They capture the underlying statistical properties of the text they are trained on, from simple local properties such as how often certain combinations of words (or tokens) are found together, to higher-level properties that emulate understanding of context and meaning.
However, LLMs are not primarily trained to be correct. Reinforcement learning with human feedback (RLHF) and other techniques help train them toward desirable properties, including correctness, but are not fully successful. Given a prompt, LLMs will generate text that most closely resembles the statistical properties of the training data. Often, this text will be correct. For example, if asked "when was Alexander the Great born," the correct answer is 356 BC (or BCE), and an LLM is likely to give that answer because within the training data Alexander the Great's birth frequently appears with this value. However, when asked "when was Empress Reginella born," a fictional character not present in the training corpus, the LLM is likely to hallucinate and invent a story of her birth. Similarly, when asked a question for which the LLM cannot retrieve a correct answer (either because a correct answer does not exist, or for other statistical reasons), it is likely to hallucinate and answer as if it knows. Such hallucinations are an obvious problem for serious applications, such as "how can such and such cancer be treated."
There are no perfect solutions yet for hallucinations; they are endemic to the design of LLMs. One partial solution is proper prompting, such as asking the LLM to "think carefully, step by step," and so on. This reduces the LLM's likelihood of concocting stories. A more sophisticated approach under development is the use of knowledge graphs. Knowledge graphs provide structured data: entities in a knowledge graph are connected to other entities in a predefined, logical manner. Constructing a knowledge graph for a given domain is of course a challenging task, but doable with a combination of automated and statistical methods and curation. With a built-in knowledge graph, LLMs can cross-check the statements they generate against the structured set of known facts, and can be constrained not to generate a statement that contradicts or is not supported by the knowledge graph.
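As a toy sketch of the cross-checking idea (the triples, and the assumption that candidate statements have already been extracted from generated text, are hypothetical), a statement can be reduced to (subject, predicate, object) triples and validated against the graph before being asserted:

```python
# Toy sketch of constraining LLM output with a knowledge graph: facts are
# (subject, predicate, object) triples; a candidate statement is accepted
# only if supported, and rejected if it directly contradicts a known fact.
KNOWLEDGE_GRAPH = {
    ("Alexander the Great", "born_in", "356 BC"),
    ("Alexander the Great", "tutored_by", "Aristotle"),
}

def check_triple(subject: str, predicate: str, obj: str) -> str:
    if (subject, predicate, obj) in KNOWLEDGE_GRAPH:
        return "supported"
    # Same subject and predicate but a different object: contradiction.
    if any(s == subject and p == predicate for s, p, _ in KNOWLEDGE_GRAPH):
        return "contradicted"
    return "unsupported"  # the graph is silent; flag rather than assert

# Candidate triples as they might be extracted from generated text.
print(check_triple("Alexander the Great", "born_in", "356 BC"))  # supported
print(check_triple("Alexander the Great", "born_in", "410 BC"))  # contradicted
print(check_triple("Empress Reginella", "born_in", "200 AD"))    # unsupported
```

The hard parts in practice are the extraction of reliable triples from free text and the coverage of the graph itself; an "unsupported" verdict signals only that the graph cannot confirm the claim, not that the claim is false.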
Because of the fundamental issue of hallucinations, and arguably because of their lack of sufficient reasoning and judgment abilities, LLMs today are powerful for retrieving, connecting and distilling information, but cannot substitute for human experts in serious applications such as medical diagnosis or legal advice. Still, they can tremendously enhance the efficiency and capability of human experts in these domains.
Can you share your vision for a future where biology is steered by data rather than hypotheses?
The traditional hypothesis-driven approach, in which researchers find patterns, develop hypotheses, perform experiments or studies to test them, and then refine theories based on the data, is being supplanted by a new paradigm based on data-driven modeling.
In this emerging paradigm, researchers start with hypothesis-free, large-scale data generation. Then, they train a machine learning model such as an LLM with the objective of accurately reconstructing occluded data, or of strong regression or classification performance on various downstream tasks. Once the machine learning model can accurately predict the data, and achieves fidelity comparable to the similarity between experimental replicates, researchers can interrogate the model to extract insight about the biological system and discern the underlying biological principles.
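A minimal sketch of the "reconstruct the occluded data" objective (on a toy random matrix, not any real biological dataset) looks like the following; biomolecular foundation models apply the same self-supervised pattern at vastly larger scale:

```python
# Minimal self-supervised "reconstruct the occluded data" loop: randomly mask
# entries of a toy data matrix and train a small network to recover them.
# A toy illustration of the paradigm, not a real biomolecular model.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_samples, n_features = 256, 32
data = torch.randn(n_samples, n_features)   # stand-in for omics measurements

model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_features)
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    mask = torch.rand_like(data) < 0.3      # occlude ~30% of entries
    corrupted = data.masked_fill(mask, 0.0)
    pred = model(corrupted)
    # Compute loss only on occluded entries: the model must infer them
    # from the remaining context rather than copy them through.
    loss = ((pred - data)[mask] ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final masked-reconstruction MSE: {loss.item():.4f}")
```

When reconstruction error on held-out data approaches the noise floor set by experimental replicates, the model has, in effect, internalized the dependency structure of the system, and probing it becomes a way to generate biological hypotheses.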
LLMs are proving to be especially good at modeling biomolecular data, and are poised to fuel a shift from hypothesis-driven to data-driven biological discovery. This shift will become increasingly pronounced over the next 10 years and will allow accurate modeling of biomolecular systems at a granularity that goes well beyond human capacity.
What is the potential impact for disease diagnosis and drug discovery?
I believe LLMs and generative AI will lead to significant changes in the life sciences industry. One area that will benefit greatly from LLMs is clinical diagnosis, specifically for rare, difficult-to-diagnose diseases and cancer subtypes. There are tremendous amounts of comprehensive patient information that we can tap into (genomic profiles, treatment responses, medical records and family history) to drive accurate and timely diagnosis. If we can find a way to compile all these data so that they are easily accessible, and not siloed by individual health organizations, we can dramatically improve diagnostic precision. This is not to imply that machine learning models, including LLMs, will be able to operate autonomously in diagnosis. Due to their technical limitations, in the foreseeable future they will not be autonomous, but will instead augment human experts. They will be powerful tools that help the physician provide exquisitely informed assessments and diagnoses in a fraction of the time needed to date, and properly document and communicate those diagnoses to the patient as well as to the entire network of health providers connected through the machine learning system.
The industry is already leveraging machine learning for drug discovery and development, touting its potential to reduce costs and timelines compared to the traditional paradigm. LLMs further add to the available toolbox, and provide excellent frameworks for modeling large-scale biomolecular data including genomes, proteomes, functional genomic and epigenomic data, single-cell data, and more. In the foreseeable future, foundation LLMs will undoubtedly connect across all these data modalities and across large cohorts of individuals whose genomic, proteomic and health information has been collected. Such LLMs will aid in the generation of promising drug targets, identify likely pockets of activity of proteins associated with biological function and disease, and suggest pathways and more complex cellular functions that can be modulated in a specific way with small molecules or other drug modalities. We can also tap into LLMs to identify drug responders and non-responders based on genetic susceptibility, or to repurpose drugs for other disease indications. Many of the existing innovative AI-based drug discovery companies are undoubtedly already starting to think and develop in this direction, and we should expect to see the formation of additional companies as well as public efforts aimed at the deployment of LLMs in human health and drug discovery.
Thank you for the detailed interview. Readers who wish to learn more should visit Seer.