Low busco scores for transcripts called on large Cupressus genome (10Gb) with helixer #126

Open
Bios4Biol opened this issue Apr 19, 2024 · 2 comments

Comments

@Bios4Biol

Hi,

We have tried Helixer on the Cupressus sempervirens genome assembly
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_028749045.1/
but obtained poor BUSCO results on the 38652 called transcripts.

We tried both the standard plant parameters and
--subsequence-length 1069200 --overlap-offset 534600 --overlap-core-length 801900
but in both cases the BUSCO scores were low:

~ C:22.9%[S:19.5%,D:3.4%],F:18.5%,M:58.6%,n:1614
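
For anyone comparing runs, here is a minimal sketch of how a BUSCO short-summary string like the one above can be parsed and sanity-checked. The function name `parse_busco` and the regular expression are illustrative assumptions written for this exact one-line format; they are not part of BUSCO or Helixer.

```python
import re

# One-line BUSCO summary copied from the report above (without the leading "~").
BUSCO_LINE = "C:22.9%[S:19.5%,D:3.4%],F:18.5%,M:58.6%,n:1614"

# Hypothetical helper: matches only the compact one-line summary format.
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco(summary: str) -> dict:
    """Return the percentages as floats and the lineage size n as an int."""
    match = _BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO one-line summary: {summary!r}")
    fields = match.groupdict()
    scores = {key: float(fields[key]) for key in ("C", "S", "D", "F", "M")}
    scores["n"] = int(fields["n"])
    return scores

scores = parse_busco(BUSCO_LINE)

# Internal consistency: C = S + D, and C + F + M covers all n BUSCOs.
# The tolerance allows for rounding to one decimal place.
assert abs(scores["S"] + scores["D"] - scores["C"]) < 0.15
assert abs(scores["C"] + scores["F"] + scores["M"] - 100.0) < 0.15

# Approximate number of BUSCOs reported as missing.
missing_count = round(scores["M"] / 100.0 * scores["n"])
print(scores, "approx. missing:", missing_count)
```

The same function can be applied to the summary from each run, which makes it easy to see whether a parameter change moved the complete (C) or missing (M) fraction at all.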

What should we modify to improve results?
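
As a side note on the custom values quoted above, the sketch below only does arithmetic on the three numbers from the command line. It shows how the offset and core length relate to the subsequence length in this run, and makes no claim about what Helixer itself requires of them. The variable names are illustrative.

```python
from fractions import Fraction

# Values copied from the command-line options in the report above.
subsequence_length = 1_069_200
overlap_offset = 534_600
overlap_core_length = 801_900

# Express offset and core length as exact fractions of the subsequence length.
offset_ratio = Fraction(overlap_offset, subsequence_length)
core_ratio = Fraction(overlap_core_length, subsequence_length)

print("overlap-offset / subsequence-length      =", offset_ratio)
print("overlap-core-length / subsequence-length =", core_ratio)
```

Here the offset is exactly half of the subsequence length and the core length is exactly three quarters of it.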

@alisandra
Collaborator

Hi Bios4Biol,

Sorry to hear that. Admittedly Cupressus is phylogenetically far from most of our training and testing genomes. The existing trained models may simply not perform well in this case.

Thanks for letting us know. That's something we can work on in the future; however, prompt improvements to the available models are unlikely.

If you have better annotations available (the preferable option, either for this species or perhaps for a few closely related ones), or simply want to try starting from the pseudolabels (riskier given the low BUSCO score), fine-tuning the plant models is likely to help.
https://github.com/weberlab-hhu/Helixer/blob/main/docs/fine_tuning.md

@alisandra
Collaborator

Oh, and fine-tuning is only recommended if you're determined or feeling adventurous. It's experimental.
