1) You may be interested in the “nucleotide dependency” preprint, which has some interesting ideas on how to go beyond LL for variant interpretation with DNA LMs: https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1
2) GPN-MSA, a much, much smaller model that combines a short-context DNA LM with evolutionary conservation from an MSA as input, comes kinda close to Evo 2 for both coding and noncoding variants. That suggests a model as large as Evo 2 probably isn’t necessary for variant interpretation in the long run.
3) There have been multiple updates on HARs, such as the 312 reported in https://www.science.org/doi/10.1126/science.abm1696. There are also elements such as HAQERs, which are previously neutrally evolving regions that show accelerated evolution in humans: https://pubmed.ncbi.nlm.nih.gov/36423581/.
4) What are these models even learning when genomes like the human genome are nearly 50% repeats, only a minority of which are functional?
Wow, this was an amazing piece. Didn't know that I enjoy Socratic dialogue essays as well.
Same
Nice post. Part 2 commentary on wet-lab validation?
Proteins are encoded in DNA. The challenges/problem sets of the DNA space are supersets containing all the challenges in the protein space, with an extra layer of regulatory complexity to tackle.