HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena

Home Page: https://arxiv.org/abs/2306.15794


Some doubts about downstream tasks

zhguo9 opened this issue · comments

Hello, first of all, thank you for your open-source contributions and the detailed README file!

I'm an undergraduate student who has just started exploring deep learning. My research focus is on DNA tokenization, but there are very few datasets available. My idea is to use prompt learning to tackle this task.

Given the limited research in this area, I've come across the possibility of using DNA tokenization as a downstream task for your model. However, after carefully reading your paper and the repository's README file, especially the "More advanced stuff below" section, I find it challenging to understand all the content due to my limited expertise. I'm still unsure whether DNA tokenization can be used as a downstream task for your model. Is this possible?

I would greatly appreciate it if you could provide some advice or guidance on this.

I understand that this may not be within your obligations, so if you're too busy to respond, please feel free to close this issue. Thank you for taking the time to consider my request!

Not sure what you mean by DNA tokenization as a task. We just treat every character as a token (i.e., the smallest unit of data fed into the model) here, so there's nothing to learn.
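To make "every character is a token" concrete, here is a minimal sketch of character-level tokenization. The vocabulary and function name are illustrative assumptions, not the repo's actual tokenizer:

```python
# Character-level tokenization: each nucleotide character is its own token.
# The vocabulary below is an illustrative assumption (A/C/G/T plus N for
# unknown bases), not HyenaDNA's actual tokenizer.
vocab = {ch: i for i, ch in enumerate("ACGTN")}

def tokenize(seq):
    """Map each character of a DNA sequence to an integer token id."""
    return [vocab[ch] for ch in seq.upper()]

print(tokenize("AGCT"))  # → [0, 2, 1, 3]
```

There is nothing learned here: the mapping is fixed, and the sequence length in tokens equals the sequence length in characters.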

Sorry for my unclear expression!

"DNA tokenization" means segmenting a DNA sequence into words. For example, in English we segment "howareyou" into "how", "are", "you". In terms of DNA, we might segment "AGCTAGCT" into "AGC" and "TAGCT", two words.

I want to break long DNA sequences into meaningful words, in order to uncover the secrets of the non-coding regions of DNA.

I already understand what you mean by "treat every character as a token", but I am still not sure whether your model can fit this specific task.

People usually use byte pair encoding tokenizers to learn meaningful aggregations of characters (in natural language). It's based on the frequency of the subwords, though, not semantics.

I don't know how you would do that here. I'm guessing there are DNA motif finding algorithms, but I wouldn't know where to begin, sorry.
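For reference, byte pair encoding as described above can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair into a new subword token. This is a minimal illustrative implementation, not the tokenizer any particular library uses; the tie-breaking rule (lexicographic) is an assumption made for determinism:

```python
from collections import Counter

def bpe_merges(sequences, num_merges):
    """Learn byte-pair-encoding merges over character-level sequences.

    Each sequence starts as a list of single characters; at every step
    the most frequent adjacent token pair is merged into one subword.
    Purely frequency-based -- no biological semantics involved.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Most frequent pair wins; ties broken lexicographically.
        a, b = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(a + b)
        # Replace every occurrence of the pair with the merged token.
        for toks in corpus:
            i, out = 0, []
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            toks[:] = out
    return merges, corpus

merges, tokenized = bpe_merges(["AGCTAGCT", "AGCAGC"], num_merges=3)
print(merges)     # → ['GC', 'AGC', 'AGCT']
print(tokenized)  # → [['AGCT', 'AGCT'], ['AGC', 'AGC']]
```

Note how the learned "words" are just frequent substrings of this tiny corpus; whether they correspond to anything biologically meaningful (e.g. motifs) is exactly the open question.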

OK, thanks for your time and patience! I will explore this repo and other methods further.

Thanks again!