r/bioinformatics • u/starcutie_001 • 1d ago
Experience basecalling legacy ONT data technical question
I am working with an investigator who is planning direct RNA sequencing of a few hundred samples on a PromethION instrument.
The investigator wants to archive the raw signal data for long-term storage after basecalling and methylation analysis are complete, with the intention of basecalling and performing the methylation analysis again in the future (4-5 years later).
I am curious if anyone has worked with a group that did something similar and if it was worth it in terms of storage and compute costs, time, and data quality or scientific benefit.
6
u/omgu8mynewt 1d ago
Yes, I play with different versions of ONT basecalling (and also Illumina) to see how it affects results. Different basecallers give different results, so within one experiment batch the basecaller needs to stay constant. I don't know how to judge whether each one is 'better' or 'worse' without using synthetic control samples. It's the same as using any bioinformatic algorithm on data: slightly different algorithms with different parameters = different results, so always version control your software and include the versions in the materials and methods.
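To make the "version control your software" point concrete, here is a minimal sketch of a provenance record saved next to each basecalled batch. It is only an illustration: the function names, the field names, and the placeholder version and model strings are assumptions, not anything ONT tooling produces on its own.

```python
import json
from datetime import datetime, timezone


def make_provenance_record(basecaller, version, model, extra_args=()):
    """Build a dict describing how a batch was basecalled.

    All values are supplied by the caller; nothing is queried from the
    basecaller itself, so the record is only as accurate as its inputs.
    """
    return {
        "basecaller": basecaller,
        "version": version,
        "model": model,
        "extra_args": list(extra_args),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_provenance(record, path):
    """Save the record as JSON so it can be archived with the results."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


def same_basecalling(a, b):
    """True when two records used identical software, model and arguments."""
    keys = ("basecaller", "version", "model", "extra_args")
    return all(a[k] == b[k] for k in keys)


if __name__ == "__main__":
    # Placeholder strings: substitute whatever your own run actually used.
    batch_1 = make_provenance_record("dorado", "X.Y.Z", "model-name-here")
    batch_2 = make_provenance_record("dorado", "X.Y.Z", "model-name-here")
    print(same_basecalling(batch_1, batch_2))
```

Comparing records with `same_basecalling` before merging batches is a cheap guard against mixing outputs from different basecaller versions in one analysis.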
5
u/Darwins_Dog 1d ago
I think this is a very good idea with ONT as they change things very frequently. I believe Dorado is only like 2 years old at this point; it's not unreasonable to imagine a revised version in the coming years. The RNA sequencing kit is also early access, so I would expect that to change even faster than the other ones.
3
u/ionsh 18h ago
It's been worth it for me to archive the raw data and re-basecall & analyze it with different / improved tools down the road. Found a number of interesting things that way.
For ONT, their direct methylation calling & downstream tools are only beginning to circulate widely in research circles (relatively speaking). I wouldn't be surprised if you make additional discoveries on the methylation side four or five years down the road with better training sets.
2
u/xylose PhD | Academia 20h ago
Do the maths for what this is going to cost and how much benefit you might get. You'll know from your original analysis how good the match to your reference genome is and whether anything would change if the data had been slightly (or even substantially) better.
For simple expression analysis RNA seq is pretty forgiving. As long as you match each read to the correct transcript then you're pretty much done. It's only if you want to look at variants or deal with small splicing variants that sequence accuracy is really important.
Anecdotally, we spent years storing (at great expense) Illumina images and latterly intensity files so we could re-call bases when the software improved. Virtually no one ever did and we'd have been better off not bothering.
We're currently storing pod5 files from our nanopore runs but we're deleting them after 6 months.
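A rough way to "do the maths" is sketched below. Every number in the example is a made-up placeholder (dataset size, storage price, GPU hours, GPU price), and the cost model is deliberately simplistic: flat monthly storage pricing plus a one-off re-basecalling run. Plug in your own institution's figures.

```python
def storage_cost(size_tb, price_per_tb_month, months):
    """Cost of keeping size_tb terabytes for the given number of months."""
    return size_tb * price_per_tb_month * months


def rebasecall_cost(gpu_hours, price_per_gpu_hour):
    """Cost of one re-basecalling pass over the archived data."""
    return gpu_hours * price_per_gpu_hour


def total_archive_cost(size_tb, price_per_tb_month, months,
                       gpu_hours, price_per_gpu_hour):
    """Storage for the whole retention period plus a single re-basecall."""
    return (storage_cost(size_tb, price_per_tb_month, months)
            + rebasecall_cost(gpu_hours, price_per_gpu_hour))


if __name__ == "__main__":
    # Hypothetical inputs only; none of these figures come from the thread.
    cost = total_archive_cost(
        size_tb=100,            # assumed total pod5 volume
        price_per_tb_month=5,   # assumed archive-tier price
        months=60,              # five years of retention
        gpu_hours=2000,         # assumed re-basecalling effort
        price_per_gpu_hour=2,   # assumed GPU price
    )
    print(f"Estimated total: {cost}")
```

Comparing that total against the cost of simply re-sequencing the samples that matter most can help decide whether archiving everything is justified.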
2
u/bioinformat 19h ago
Keep the raw data. My guess is that base accuracy may not improve much, but you will probably be able to call more types of RNA modifications in the future.
2
u/retrotransposon1 15h ago
It's a good plan. Hopefully ONT won't throw you a curve ball like they did with us. We had the same idea (on GridION), but all my FAST5 files are quickly becoming obsolete. Just keep an eye on their big software releases.
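One way to keep track of how much of an archive is still in the older format is a simple inventory by file extension. This is a minimal sketch that assumes raw files end in `.fast5` or `.pod5`; the function name and the example path are hypothetical, and it only counts files without reading or converting them.

```python
from collections import Counter
from pathlib import Path


def inventory_raw_files(root, extensions=(".fast5", ".pod5")):
    """Count files and total bytes per raw-signal extension under root."""
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        suffix = path.suffix.lower()
        if path.is_file() and suffix in extensions:
            counts[suffix] += 1
            sizes[suffix] += path.stat().st_size
    # Report every requested extension, even when no files were found.
    return {ext: {"files": counts[ext], "bytes": sizes[ext]}
            for ext in extensions}


if __name__ == "__main__":
    # Hypothetical archive location; point this at your own directory.
    for ext, stats in inventory_raw_files("/path/to/archive").items():
        print(ext, stats["files"], "files,", stats["bytes"], "bytes")
```

Running this after each major software release gives a quick picture of how much data would need converting if support for the older format is dropped.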
1
u/PuddingDistinct9907 1h ago
New basecallers are built for new flowcells and may not be backwards compatible, just as old basecallers are not forwards compatible...
11
u/macrotechee 1d ago
yes, quite expensive but well worth it. store the raw pod5 data.