5 Comments

1) You may be interested in the “nucleotide dependency” preprint, which has some interesting ideas on how to go beyond log-likelihood (LL) for variant interpretation with DNA LMs: https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1

2) The fact that GPN-MSA, a much, much smaller model that mixes a short-context DNA LM with evolutionary conservation from an MSA as input, comes kinda close to Evo 2 for both coding and noncoding suggests that, for variant interpretation, a model as large as Evo 2 probably isn’t necessary in the long run.

3) There have been multiple updates on HARs, such as the 312 reported in https://www.science.org/doi/10.1126/science.abm1696. There are also elements such as HAQERs, which are previously neutrally evolving regions that show accelerated evolution in humans: https://pubmed.ncbi.nlm.nih.gov/36423581/.

4) What are these models even learning when genomes like the human genome are nearly 50% repeats, only a minority of which are functional?


Wow, this was an amazing piece. Didn't know that I enjoy Socratic dialogue essays as well.


Same


Nice post. Part 2 commentary on wet-lab validation?


Proteins are encoded in DNA. The challenges/problem sets of the DNA space are supersets containing all the challenges in the protein space, with an extra layer of regulatory complexity to tackle.
