jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Home Page: https://doi.org/10.1093/bioinformatics/btab083

Adding additional information for a classification task

danarte opened this issue

Hello,
I'm wondering what the best way is to add extra information to each sequence for a classification task, for example genome location or annotation information.

This repository is great. I was able to adapt my data (a set of sequences) and got much better classification performance than with other models, but I believe I can improve the classification further (and perhaps publish the results if they are good enough) if I can feed additional data to the classifier.

What would be the best way to build such a model? Simply append the data to the sequence (the data comes in different formats: IDs, ints, floats, characters...)? Write a custom task? Train a classifier and then wrap it inside a bigger model that also takes the additional information? Some other method?

I'm not very familiar with the huggingface framework, so I feel a bit lost looking for a straightforward solution of the kind I would use with other, less complex models.

PS - It would be great, and a guaranteed publication, if I could use the visualization/importance feature shown in your example while including the additional data.
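
To make the "wrap it inside a bigger model" option from the question concrete, here is a minimal sketch of the data side only: it turns mixed-format metadata (a chromosome ID, an integer position, a text annotation) into a numeric vector and concatenates it with a per-sequence embedding, so that a downstream classifier sees both. This is an illustration under assumptions, not something the repository provides: `encode_metadata`, `fuse`, `chrom_sizes`, and `annotation_vocab` are hypothetical names, and the sequence embedding is assumed to already be available as a list of floats (for example a pooled output taken from the fine-tuned model).

```python
def encode_metadata(chrom, position, chrom_sizes, annotation, annotation_vocab):
    """Encode genome location and one annotation label as a flat list of floats.

    All argument names are hypothetical and chosen only for this sketch.
    """
    # One-hot encode the chromosome ID: it is categorical, so no ordering is implied.
    chrom_names = sorted(chrom_sizes)
    chrom_onehot = [1.0 if name == chrom else 0.0 for name in chrom_names]

    # Scale the integer position to [0, 1] by the chromosome length,
    # so its magnitude is comparable to the other features.
    rel_pos = position / chrom_sizes[chrom]

    # One-hot encode the annotation label against a fixed vocabulary.
    annot_onehot = [1.0 if label == annotation else 0.0 for label in annotation_vocab]

    return chrom_onehot + [rel_pos] + annot_onehot


def fuse(sequence_embedding, metadata_vector):
    """Concatenate the sequence embedding with the metadata features."""
    return list(sequence_embedding) + list(metadata_vector)


# Toy values, made up for the example.
chrom_sizes = {"chr1": 1000, "chr2": 500}
annotation_vocab = ["exon", "intron"]

meta = encode_metadata("chr2", 250, chrom_sizes, "exon", annotation_vocab)
features = fuse([0.1, 0.2], meta)  # [0.1, 0.2] stands in for a real embedding
print(features)
```

The fused vector could then be passed to any small classifier. Keeping the metadata out of the token sequence means the pre-trained tokenizer and vocabulary stay untouched. Whether this beats appending the information to the sequence itself would have to be checked on the actual data.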