Genomics, PRS, genetic ancestry, race & ethics — Anna Lewis

Over the past two decades, the cost of sequencing a whole human genome has fallen ten million-fold. However, our understanding of the genetic code has not advanced at the same pace. Alongside the scientific struggle to interpret the Book of Life, we must also confront the ethical and social issues raised by ongoing research.

Anna Lewis, a researcher at the Edmond & Lily Safra Center for Ethics at Harvard, guides us through some of the most complex issues in this field in this week’s episode. She discusses the use of polygenic risk scores, and how a poor choice of scientific framework can hinder medical advances, or even worse, lend support to racist ideologies.

In 2003, the Human Genome Project announced it had sequenced the 3.2 billion base pairs that constitute the human genetic code, more precisely the genetic code of a few humans, on just one-half of each chromosome. This project cost around $3 billion and took thirteen years. In 2023, it takes less than 24 hours and under $300 to sequence a whole genome. The learning rates for renewable technology and computing (Moore’s law) pale in comparison, making genomics a leading candidate for the technology that has seen the steepest price declines in history.
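To put the comparison with Moore's law in numbers, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above ($3 billion in 2003, $300 in 2023) and assumes the common formulation of Moore's law as a doubling every two years; both are approximations rather than precise measurements.

```python
import math

# Figures quoted in the text (approximate)
hgp_cost_usd = 3e9      # Human Genome Project, completed 2003
cost_2023_usd = 300     # upper bound quoted for 2023
years = 2023 - 2003

# Overall price decline for sequencing a genome
fold_decline = hgp_cost_usd / cost_2023_usd

# Implied time for the cost to halve, if the decline were steady
halving_time_years = years / math.log2(fold_decline)

# Assumption: Moore's law taken as one doubling every two years
moore_fold = 2 ** (years / 2)

print(f"Sequencing: {fold_decline:,.0f}-fold cheaper")
print(f"Implied halving time: {halving_time_years * 12:.1f} months")
print(f"Moore's law over the same period: {moore_fold:,.0f}-fold")
```

On these assumptions the cost of sequencing halved in under a year on average, while two decades of Moore's law gives an improvement of roughly a thousand-fold.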

Proteomics and gene editing have also witnessed revolutionary advances in recent years:

  • Protein folding: Google’s DeepMind unveiled AlphaFold in 2021, a deep learning model that can predict the shapes formed by proteins with speed and accuracy, thus enhancing our ability to understand their functions.
  • CRISPR: a technology that adopts the mechanism by which bacteria splice in DNA from their predators (phages) to remember them. CRISPR allows for low-cost copy-paste editing of DNA.

Despite these breakthroughs, Anna points out that many problems remain only partially understood. One of the deeper reasons for this is the highly complex interactions between genes. Prior to the sequencing of the human genome, scientists anticipated they would find 80,000 – 140,000 genes. Instead, depending on our understanding of the term ‘gene’, they found 20,000 – 30,000. This initial overestimate reflected an under-appreciation of the intricacies of how genes coordinate their functions.

Pathologies such as Huntington’s disease and hemophilia can be traced very clearly to single genes. In many cases, however, health outcomes depend on multiple genes, and single genes that lead to disease may not always be expressed (“switched on”). Furthermore, non-human DNA can play a role in health. Epigenetics and hologenomics are two fast-developing fields that study, respectively, the mechanisms driving the expression of human genes and the importance to health of non-human cells such as those in the gut microbiome.

We have transcribed The Book of Life, but we do not yet know how to read it. Not fully.

As a consequence, progress in medical technology has not kept pace with the improvements in gene sequencing. It should be conceded, however, that this lag is also due to the essential regulatory hurdles involved in bringing therapies safely to market. Undoubtedly, a similarly rigorous framework for AI would have prevented Large Language Models (LLMs) such as ChatGPT from emerging as early as they did.

Nonetheless, there is cause for optimism in the treatment of monogenic diseases such as Huntington’s, hemophilia, and Duchenne muscular dystrophy. But for the many human traits, whether pathogenic or not, that are polygenic (and thus depend on the expression of multiple genes), progress is slower. Considering that around 12,000 genetic variants influence height, we can appreciate the difficulty of linking traits back to the genome.

One of the tools developed to cut through this complexity is the Polygenic Risk Score (PRS). By examining a population, researchers apply statistical methods to understand how the frequency of occurrence of a trait (for example, type 2 diabetes) correlates with the genetic variants observed. The resulting model is used to predict the probability of a trait from an individual’s genome, with varying results. One of the key factors in the efficacy of such scores is whether the genome for which risk is being calculated is similar to those sampled in the population. But what does that mean?
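Before turning to that question, it may help to see how simple the score itself is. In its most common form a PRS is a weighted sum: each variant's effect size, estimated from the sampled population, is multiplied by the number of copies of the effect allele the individual carries. The sketch below is a minimal illustration of that idea only. The variant names and weights are invented for the example, and treating a missing variant as zero copies is a simplification.

```python
from typing import Dict


def polygenic_risk_score(genotype: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of effect-allele counts (0, 1 or 2 copies per variant)."""
    # Simplification: variants missing from the genotype count as 0 copies.
    return sum(beta * genotype.get(variant, 0) for variant, beta in weights.items())


# Hypothetical per-variant effect sizes, as if estimated from a study population
weights = {"variant_A": 0.12, "variant_B": -0.05, "variant_C": 0.30}

# Hypothetical individual: copies of the effect allele carried at each variant
person = {"variant_A": 2, "variant_B": 1, "variant_C": 0}

score = polygenic_risk_score(person, weights)
print(f"PRS = {score:.2f}")
```

The weights are where the sampled population enters. If they were estimated in people whose genomes differ systematically from the individual's, the same arithmetic can give a much less reliable prediction.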

Population, Anna argues, is a term with a dangerously confusing array of meanings. The population of New York is a number, but it also represents a group of people who do not necessarily share anything meaningful for the purposes of genomics. In the past, PRSs have relied on concepts of race for creating a sample set and understanding the portability of results from that set to other individuals. However, while race remains an important and valid variable for studies in sociology and economics—for example, in understanding and correcting for the historical and ongoing consequences of discrimination—it lacks a scientific basis. Race is a social construct subject to shifts and changes; who is considered “white,” for instance, has more to do with power structures than phenotype. A century ago, Southern European immigrants to the USA were not counted as such.

Recently, there has been a move away from race to the notion of genetic ancestry. However, Anna and her colleagues have observed that unless properly defined, this risks being no more than race renamed. In particular, if genetic ancestry is understood as simple continental groupings (East Asian origin, and so forth), it serves neither science nor society. On the one hand, it fails scientifically because these groupings do not accurately represent genomically salient features. For example, a continental African grouping contains as much diversity as all other groups combined. On the other hand, by suggesting that they have scientific meaning, the use of these groupings may inadvertently propagate racist ideologies.

It is crucial to develop Polygenic Risk Scores correctly: done well, they allow medical advice and interventions to be targeted more effectively. Anna and her colleagues argue that the most appropriate way to understand genetic ancestry is through the Ancestral Recombination Graph (ARG). Unlike continental ancestry groupings, which are flat in time and wide in extent, the ARG traces how each part of an individual’s genome branches out among ancestors at different places and points back in time.
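To make the idea concrete, here is a toy sketch of how such a graph can be represented. It is not taken from the paper. Each edge says that a child inherited a stretch of genome from a parent, so different stretches of the same genome can lead back to different ancestors. All node names and coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    left: int     # start of the inherited segment (inclusive)
    right: int    # end of the inherited segment (exclusive)
    parent: str
    child: str


def lineage(edges: List[Edge], sample: str, position: int) -> List[str]:
    """Follow one genomic position back through its chain of ancestors."""
    path = [sample]
    node = sample
    while True:
        # Find the edge through which `node` inherited this position, if any
        parent = next(
            (e.parent for e in edges
             if e.child == node and e.left <= position < e.right),
            None,
        )
        if parent is None:
            return path
        path.append(parent)
        node = parent


# Toy genome of length 100 with a single recombination at position 60
edges = [
    Edge(0, 60, "ancestor_m1", "sample"),
    Edge(60, 100, "ancestor_m2", "sample"),
    Edge(0, 100, "ancestor_a1", "ancestor_m1"),
    Edge(0, 100, "ancestor_a2", "ancestor_m2"),
]

left_path = lineage(edges, "sample", 10)
right_path = lineage(edges, "sample", 80)
print(left_path)
print(right_path)
```

In this toy graph the two halves of the sample's genome have entirely different ancestries, which is exactly the kind of detail a single continental label cannot express.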

The ARG is much more specific than continental groupings. However, because genes tend to be passed down in chunks rather than individually, it offers sufficient coarse-graining to highlight where variations are significant and to apply metrics for grouping people such that highly predictive PRSs can be calculated. The ARG is a precisely defined and scientifically meaningful object, lacking the connotations of race. As Anna and her colleagues put it in their recent paper, this gets genetic ancestry right for science and society. So far, the research community has yet to combine ARGs and PRSs; Anna and her colleagues’ paper is a rallying call to do so.

To make the most of this window of opportunity to move away from race as a biological variable,
we would urge the adoption of a multidimensional and continuous conceptualization of ancestry,
free wherever possible of population categories, and not relying on continental labels that bear
striking resemblance to prior racist groups.

Anna Lewis et al, Getting genetic ancestry right for science and society. Science. 2022 Apr 15

Anna and I studied together at Oxford, starting our degrees in 2003, just as the human genome was first sequenced. It has been a pleasure to follow her career since then, through a PhD in systems biology, medtech startups, and back into academia. More technological and medical breakthroughs are on the horizon—perhaps polygenic CRISPR edits will allow us to influence intelligence, strength, and tendency towards violence. The work that Anna and others are doing in the ELSI field (Ethical, Legal & Social Implications) is crucial to the choices we will make as a species.

We are learning how we can use our tools; we need to be mindful of how we should use them.


Poetry, Constraints, DNA & The Xenotext — Christian Bök

Poetry is a game that can be played in many ways. Perhaps the most traditional and popular is “emotion recollected in tranquility” as Wordsworth termed it, whereby the poet’s feelings are carefully expressed. Christian Bök is a bestselling poet who plays a very different game. To use his turn of phrase, he fights with icosahedra, not swords.

Figure: the four games of poetry that Christian outlines. The two dimensions are self-expression (seeming to say something) and self-consciousness (meaning to say something).

Christian imposes tight sets of constraints on his writing. His bestseller Eunoia, for example, uses univocalic lipograms — each chapter can only make use of a single vowel. This produces such gems as “Writing is inhibiting. Sighing, I sit, scribbling in ink this pidgin script.” But that’s not all. That would be too easy:

Eunoia abides by many subsidiary rules. All chapters
must allude to the art of writing. All chapters must
describe a culinary banquet, a prurient debauch, a
pastoral tableau and a nautical voyage. All sentences must
accent internal rhyme through the use of syntactical
parallelism. The text must exhaust the lexicon for each
vowel, citing at least 98% of the available repertoire
(although a few words do go unused, despite efforts
to include them: parallax, belvedere, gingivitis,
monochord and tumulus). The text must minimize
repetition of substantive vocabulary (so that, ideally, no word
appears more than once). The letter Y is suppressed.

— Eunoia, Christian Bök
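The two simplest of these rules, the single permitted vowel and the suppressed Y, are easy to check mechanically. The sketch below is only an illustration of the constraint and says nothing about how Christian actually worked. The function name is made up for the example.

```python
def is_univocalic(text: str, vowel: str, suppressed: str = "y") -> bool:
    """True if `text` uses no vowel other than `vowel` and no suppressed letters."""
    forbidden = (set("aeiou") - {vowel.lower()}) | set(suppressed)
    letters = {ch for ch in text.lower() if ch.isalpha()}
    return not (letters & forbidden)


line = "Writing is inhibiting. Sighing, I sit, scribbling in ink this pidgin script."

print(is_univocalic(line, "i"))
print(is_univocalic("Poetry is a game", "i"))
```

Checking a line is trivial. Writing whole chapters that pass the check, while also satisfying the rules on theme, rhyme and vocabulary, is the hard part.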

For the past two decades, Christian has pursued an even more ambitious project, The Xenotext. This project involves enciphering an “alien text” within the DNA of a resilient bacterium, Deinococcus radiodurans. One goal of The Xenotext is to create a text that could outlast human civilization. To add to the genomic challenge, Christian has set a remarkable rule: the symbols of the text should be interpretable in two different ways, resulting in two poems that are encoded within the same string.

Christian combines scientific techniques, trial and error, and computer programming to construct his poems, adhering to the rules he has established within his own poetic universe. Furthermore, he transforms art back into science by employing gene-editing to inscribe his poetic creation into the “book of life,” the DNA of a living organism.

Instead of looking back and inwards (the ideal of “emotion recollected in tranquility”), Christian looks outwards and to the future, fusing science and art to produce uncanny and unforgettable (and perhaps ineradicable) verse.
