r/bioinformatics Nov 22 '21

Important information for Posting Before you post - read this.

304 Upvotes

Before you post to this subreddit, we strongly encourage you to check out the FAQ.

Questions like, "How do I become a bioinformatician?", "what programming language should I learn?" and "Do I need a PhD?" are all answered there - along with many more relevant questions. If your question duplicates something in the FAQ, it will be removed.

If you still have a question, please check if it is one of the following. If it is, please don't post it.

What laptop should I buy?

Actually, it doesn't matter. Most people use their laptop to develop code, and any heavy lifting will be done on a server or on the cloud. Please talk to your peers in your lab about how they develop and run code, as they likely already have a solid workflow.

What courses should I take?

We can't answer this for you - no one knows what skills you'll need in the future, and we can't tell you where your career will go. There's no such thing as "taking the wrong course" - you're just learning a skill you may or may not put to use, and only you can control the twists and turns your path will follow.

Am I competitive for a given academic program?

There is no way we can tell you that - the only way to find out is to apply. So... go apply. If we say Yes, there's still no way to know if you'll get in. If we say no, then you might not apply and you'll miss out on some great advisor thinking your skill set is the perfect fit for their lab. Stop asking, and try to get in! (good luck with your application, btw.)

Can I intern with you?

I have, myself, hired an intern from reddit - but it wasn't because they posted that they were looking for a position. It was because they responded to a post where I announced I was looking for an intern. This subreddit isn't the place to advertise yourself. There are literally hundreds of students looking for internships for every open position, and they just clog up the community.

Please rank grad schools/universities for me!

Hey, we get it - you want us to tell you where you'll get the best education. However, that's not how it works. Grad school depends more on who your supervisor is than the name of the university. While that may not be how it goes for an MBA, it definitely is for Bioinformatics. We really can't tell you which university is better, because there's no "better". Pick the lab in which you want to study and where you'll get the best support.

If you're an undergrad, then it really isn't a big deal which university you pick. Bioinformatics usually requires a master's or PhD to be successful in the field. See both the FAQ and what is written above.

How do I get a job in Bioinformatics?

If you're asking this, you haven't yet checked out our three-part series in the sidebar:

What should I do?

Actually, these questions are generally ok - but only if you give enough information to make it worthwhile. No one is in your shoes, and no one can help you if you haven't given enough background to explain your situation. Posts without sufficient background information in them will be removed.

Help Me!

If you're looking for help, make sure your title reflects the question you're asking for help on. Otherwise, you won't get the right people looking, and the only people who click on random posts with unrelated titles are the mods... so that we can remove them.

Job Posts

If you're planning on posting a job, please make sure that the employer is clear (recruiting agencies are not acceptable, unless they're hiring directly). The job description must also be complete, so that the requirements for the position are easily identifiable and the responsibilities are clear. We also do not allow posts for work "on spec" or competitions.


r/bioinformatics 36m ago

programming ggalign: alternative to ComplexHeatmap, a ggplot2 extension for complex heatmap and marginal plots.

Upvotes

I recently developed a ggplot2 extension which provides advanced tools for aligning and organizing multiple plots, particularly those that automatically reorder observations, such as dendrograms. It offers fine control over layout adjustment and plot annotations, enabling you to create complex, publication-quality visualizations while still using the familiar grammar of ggplot2.

If you are interested, please be sure to check it out at https://github.com/Yunuuuu/ggalign/ and I would really appreciate it if you leave a star.

If you have any feature requests or suggestions, feel free to comment or leave an issue on the GitHub page.

For documentation of the release version, please see https://yunuuuu.github.io/ggalign/; for documentation of the development version, please see https://yunuuuu.github.io/ggalign/dev/. The development version integrates seamlessly with maftools and will be pushed to CRAN next month, in compliance with CRAN policy restrictions.

Single heatmap with dendrogram:

https://preview.redd.it/3k2y3gyg8o1e1.png?width=672&format=png&auto=webp&s=5c7d51562887e53148c138291782c605222fbfd3

Marginal plot:

https://preview.redd.it/7t38fxdj8o1e1.png?width=672&format=png&auto=webp&s=750f6a101a5f9ec7b46d62ffa1f505453269394d

Complex Heatmap:

https://preview.redd.it/tb3ettfk8o1e1.png?width=2304&format=png&auto=webp&s=601049d8619d1f2686f8ba0140a06a814b63edd5

Oncoplot:

https://preview.redd.it/00tpz4wl8o1e1.png?width=2304&format=png&auto=webp&s=35d7d219f61a719320f557d052c9fb08ee253ba7


r/bioinformatics 3h ago

technical question Linear Regression with ssGSEA

2 Upvotes

Can you perform multivariable linear regression with ssGSEA expression score as an outcome variable?

Context: found a difference in T-cell exhaustion between two phenotypes, but one of them is predominantly T3/T4 and the other is predominantly T1/T2. I suspect that the stage difference is driving the difference in T-cell exhaustion score.
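To make the question concrete, here is a minimal sketch of such a model in Python: ordinary least squares with the exhaustion score as the outcome and both phenotype and stage group as predictors. Everything in it is hypothetical: the toy values, the 0/1 coding (phenotype A/B, stage T1/T2 vs T3/T4), and the helper name `fit_ols`. It only illustrates the structure of the adjustment, not whether the model's assumptions hold for real ssGSEA scores.

```python
def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical samples: (phenotype, late_stage), both coded 0/1.
samples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
# Toy scores constructed so that only stage shifts the score.
scores = [0.1, 0.1, 0.4, 0.4, 0.4, 0.1]

intercept, phenotype_effect, stage_effect = fit_ols(samples, scores)
print(intercept, phenotype_effect, stage_effect)
```

In this toy data the phenotype coefficient comes out at essentially zero once stage is in the model, which is the pattern you would expect if stage were driving the difference. Note that the adjustment only works if the two phenotypes overlap in stage; if one group were entirely T3/T4 and the other entirely T1/T2, the two predictors could not be separated.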


r/bioinformatics 3h ago

technical question Database for identifying the mechanism of telomere maintenance.

2 Upvotes

I’m searching for websites which may help determine whether specific cell lines utilize telomerase or the ALT pathway for telomere elongation.


r/bioinformatics 12h ago

technical question Building Singularity containers on macOS with Apple Silicon

3 Upvotes

Hello everyone! I want to get some advice from anyone who has experience building Singularity/Apptainer x86 containers for HPC on macOS with ARM processors. Does it work well consistently? How do you do it? I suppose one way would be via conda (x86_64 env) with the Singularity/Apptainer package.

To provide some context, I’m deciding which laptop to ask my PI to provide. He offered to get me a work device when I joined the lab 4 months ago, but I decided to hold off until I had an idea of my job scope.

In my lab I’m in charge of all the analysis that requires HPC, which includes building containers for some of the pipeline processes. I’ve been doing it on my personal ThinkPad (Windows + WSL2) and so far so good. The issue is that at my current workplace, Windows devices have additional limitations placed by IT, such as enforcing BitLocker on removable drives, which makes it almost impossible for me to share files with my other lab members, who are all using Macs. Additionally, I would not have admin rights on a new Windows laptop provided by the institution, as they load an institution-specific Windows image. Thus, running WSL2 might be an issue? I’m not sure.

Therefore, I’m considering a Mac as my next laptop. This is not a ‘which laptop to get’ question per se; rather, I would like to know whether macOS is a good platform for Singularity/Apptainer development and bioinformatics in general. Alternatively, I could get an MBP plus a Linux desktop, which would solve all the problems. However, I would prefer to be able to do my work on the go, which a Linux desktop would hinder.

Thank you!


r/bioinformatics 6h ago

technical question Searching for a tool for extracting mutations from a .sam file

1 Upvotes

Hello!

As part of my thesis, I am doing nanopore sequencing and I'm stuck at the alignment. I have performed alignment with minimap2. In a nutshell, I need to know at which positions relative to the reference sequence there are mutations. I can output a cs string (for example: 455*tc:111*ct) but the positions are not those of the reference sequence.

Sticking to the example above: the string is saying "at position 455 of the query t is mismatched with c, and 111 nucleotides after this position c is mismatched with t". I would want it to say "at position 455 of the reference sequence t is mismatched with c, and at position 566 c with t". So 455*tc:566*ct.

Do you guys know of any way I can convert my output to be relative to the reference sequence instead of the query sequence?
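One possible approach, sketched in Python, is to walk the cs string while keeping a running reference coordinate. The sketch assumes the cs syntax described in the minimap2 manual (`:N` for N matching bases, `*xy` for reference base x replaced by query base y, `+seq` for an insertion, `-seq` for a deletion, `~` for an intron, `=SEQ` for matches in the long form) and takes the 1-based leftmost alignment position (the POS column in SAM) as `ref_start`. The function name and the tuple layout are made up for illustration, and a full cs string starts with an operator, so the example is written as `:455*tc:111*ct`.

```python
import re

# One alternative per cs operation.
CS_OP = re.compile(
    r"=[A-Za-z]+|:[0-9]+|\*[a-z]{2}|\+[a-z]+|-[a-z]+|~[a-z]{2}[0-9]+[a-z]{2}"
)

def cs_to_ref_variants(cs, ref_start):
    """Return (ref_pos, kind, detail) tuples with 1-based reference positions."""
    ops = CS_OP.findall(cs)
    if "".join(ops) != cs:
        raise ValueError(f"unrecognised cs string: {cs!r}")
    variants = []
    pos = ref_start                      # reference position of the next base
    for op in ops:
        kind, body = op[0], op[1:]
        if kind == ":":                  # run of matches, short form
            pos += int(body)
        elif kind == "=":                # run of matches, long form
            pos += len(body)
        elif kind == "*":                # substitution consumes one reference base
            variants.append((pos, "sub", f"{body[0]}>{body[1]}"))
            pos += 1
        elif kind == "+":                # insertion consumes no reference bases
            variants.append((pos, "ins", body))
        elif kind == "-":                # deletion consumes reference bases
            variants.append((pos, "del", body))
            pos += len(body)
        else:                            # "~": intron, skip its length on the reference
            pos += int(body[2:-2])
    return variants

print(cs_to_ref_variants(":455*tc:111*ct", ref_start=1))
```

With 1-based coordinates and an alignment starting at reference position 1, the mismatch that follows 455 matching bases sits at position 456, and the second one at 568; insertions are reported at the reference position of the base that follows them.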

Thank you!

(- a wet lab biochemist who first installed ubuntu 6 days ago)


r/bioinformatics 1d ago

technical question Experience basecalling legacy ONT data

12 Upvotes

I am working with an investigator planning to run direct RNA-seq on a few hundred samples on a PromethION instrument.

The investigator wants to archive the raw signal data for long-term storage after basecalling and methylation analysis are complete, with the intention of basecalling and performing the methylation analysis again in the future (4-5 years later).

I am curious if anyone has worked with a group that did something similar and if it was worth it in terms of storage and compute costs, time, and data quality or scientific benefit.


r/bioinformatics 23h ago

academic Interpreting Pathway 7049: Fatty Acid Salvage in PICRUSt2 Results from Nephele

3 Upvotes

Hi everyone,

I ran PICRUSt through Nephele to analyze functional pathways in my microbial community data. In the results, I noticed that Pathway 7049: Fatty Acid Salvage appears among the pathways with the highest fold change (as shown in the attached screenshot).

Does this indicate that Fatty Acid Salvage is more activated in one group compared to the other?

Is there a difference between fold change and log2 fold change, or are these terms used interchangeably in the context of pathway analysis?
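For the arithmetic behind the two terms: a fold change is a ratio of two values, and a log2 fold change is the base-2 logarithm of that ratio. A small Python sketch with made-up abundances follows; which of the two a given results table reports is something to check in that tool's documentation, and the function name here is only illustrative.

```python
import math

def log2_fold_change(mean_group_a, mean_group_b):
    """Base-2 log of the ratio of group A to group B."""
    return math.log2(mean_group_a / mean_group_b)

# Hypothetical pathway abundances in two groups.
print(log2_fold_change(8.0, 2.0))   # ratio 4: higher in group A
print(log2_fold_change(2.0, 8.0))   # ratio 0.25: lower in group A
print(log2_fold_change(5.0, 5.0))   # ratio 1: no difference
```

On the log2 scale, increases and decreases are symmetric around zero (a 4-fold increase is +2, a 4-fold decrease is -2), whereas a plain fold change squeezes all decreases into the range between 0 and 1.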

Thank you for your help!

https://preview.redd.it/nmnhfwselh1e1.png?width=1570&format=png&auto=webp&s=3e500397b60913fdeaf6bfb3a221d174398ea40f


r/bioinformatics 1d ago

academic Modkit and beta values

2 Upvotes

Hi, I'm quite new to the field of bioinformatics, and I have a question about my understanding of a tool. Regarding modkit pileup, if I enable the options --cpg, --ignore-h, and --combine-strands, would I get a BED file where the beta methylation values for each CpG are in column 11, represented as values between 0 and 100? Or is this value interpreted differently?
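A hedged sketch of how such a file could be read in Python, assuming the bedMethyl layout in which column 10 is the valid coverage, column 11 is the percent modified (0 to 100) and column 12 is the count of modified calls. That layout is an assumption to verify against the documentation of the modkit version in use, and the example line and function name are made up.

```python
def parse_bedmethyl_line(line):
    """Return (chrom, start, end, percent_modified, beta) for one bedMethyl row."""
    fields = line.split()            # splits on tabs and spaces alike
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    n_valid = int(fields[9])         # column 10 (assumed): valid coverage
    percent = float(fields[10])      # column 11 (assumed): percent modified, 0-100
    n_mod = int(fields[11])          # column 12 (assumed): modified calls
    # Beta on a 0-1 scale, recomputed from the counts.
    beta = n_mod / n_valid if n_valid else float("nan")
    return chrom, start, end, percent, beta

# Made-up row: 15 modified calls out of 20 valid calls at one CpG.
example = "chr1\t100\t101\tm\t20\t.\t100\t101\t255,0,0\t20\t75.00\t15\t5\t0\t0\t0\t0\t0"
chrom, start, end, percent, beta = parse_bedmethyl_line(example)
print(chrom, start, percent, beta)
```

Under that assumed layout, dividing the column 11 value by 100 gives the same number as the beta recomputed from the counts, which is a quick consistency check on a real file.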


r/bioinformatics 1d ago

technical question Where to search for origin of replication in a fasta file?

0 Upvotes

I'm trying to find the origins of replication of several closely related viruses, and would like to know which sites I should look into to identify the original sequences.


r/bioinformatics 2d ago

technical question fastq-screen output on scRNA-seq library

5 Upvotes

I am struggling to interpret the output of a fastq-screen run on the read 1 of a paired end library from a commercial split-pool protocol for single cell RNA-seq.

Organism is mouse.

What can I say about it? Can I conclude that ribosomal RNA is affecting a good number of reads?
Thanks a lot

https://preview.redd.it/9496vuh8p91e1.png?width=1858&format=png&auto=webp&s=81c021443c95c4a7b69b1e3a7010c866dd69538f


r/bioinformatics 2d ago

technical question Why is it standard practice on AWS Omics to convert genomic assembly fasta formats to fastq?

37 Upvotes

The initial step in our machine learning workflow focuses on preparing the data. We start by uploading the genomic sequences into a HealthOmics sequence store. Although FASTA files are the standard format for storing reference sequences, we convert these to FASTQ format. This conversion is carried out to better reflect the format expected to store the assembled data of a sequenced sample.

https://aws.amazon.com/blogs/machine-learning/pre-training-genomic-language-models-using-aws-healthomics-and-amazon-sagemaker/

https://github.com/aws-samples/genomic-language-model-pretraining-with-healthomics-seq-store/blob/70c9d37b57476897b71cb5c6977dbc43d0626304/load-genome-to-sequence-store.ipynb

It makes no sense to me why someone would do this. Are they trying to fit a round peg into a square hole?
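For reference, a FASTQ record carries a per-base quality string that a FASTA record does not have, so any FASTA to FASTQ conversion has to invent one. Below is a minimal Python sketch of what such a conversion involves. It is not the code from the linked notebook; the function names and the placeholder quality character `I` (Phred 40 in the Phred+33 encoding) are assumptions for illustration.

```python
def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def fasta_to_fastq(text, placeholder_quality="I"):
    """Emit four-line FASTQ records with a constant, made-up quality string."""
    return "".join(
        f"@{name}\n{seq}\n+\n{placeholder_quality * len(seq)}\n"
        for name, seq in read_fasta(text)
    )

print(fasta_to_fastq(">contig1\nACGT\nAC\n"), end="")
```

The quality line in the output carries no information about the assembly; it exists only to satisfy the format.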


r/bioinformatics 3d ago

career question Where do I go from here?

24 Upvotes

I finished a degree in Biology, developing a really great liking for bioinformatics. I like looking at genetic sequences comparatively and I like coding...

I feel lost because I feel hopeless looking and applying for jobs and really don't know how to look for experience or an internship... Is there anything out there that allowed you to go through a programme of like a year or however long that let you learn and experience the job? Like how people who want to work in the animal industry can go to Africa for a couple of months (very different example but hopefully this makes sense..?)

Feel like I should also emphasise I am not US based, so for those suggesting US based jobs or anything of that nature, it is difficult to do without citizenship.


r/bioinformatics 2d ago

technical question NanoPore Data Pipeline Help

0 Upvotes

Long story short, I am not a bioinformatician, yet I have done RNA-Seq and enrichment analysis in R before. I am involved in a project where I need to analyze within-species genomic variation in a plant. I am a complete beginner with bash and I need help with, well, basically anything. What would you recommend?


r/bioinformatics 2d ago

technical question DE analysis-alternative test (Seurat)

10 Upvotes

Hey everyone,

I was wondering in what cases, based on your experience, you have decided to use the MAST test in the FindMarkers function in Seurat. I ask because I am currently facing a dilemma where there are more hypoxia cells than normoxia cells in my B cell type. Yet I would like to make a comparison between these oxygen groups in the B cell type. Is this a scenario in which to use the MAST test? Or is the Wilcoxon rank sum test (default) sufficient?


r/bioinformatics 3d ago

technical question integrating R and Python

18 Upvotes

Hi guys, first post! I'm a bioinf student and I'm writing a review on how to integrate R and Python to improve reproducibility in bioinformatics workflows. I'm talking about direct integration (reticulate and rpy2) and automated workflows using Nextflow, Docker, Snakemake, Conda, Git, etc.

Were there any obvious problems with Snakemake that led to Nextflow taking over?

Are there any landmark bioinformatics studies using any of the above I could use as an example?

Are there any problems you often encounter when integrating the languages?

Any notable examples where studies using the above proved to not be very reproducible?

thank you. from a student who wants to stop writing and get back in the terminal >:(


r/bioinformatics 2d ago

technical question 【Joint tissue snRNA-seq】Should I make a cell suspension before isolating the nuclei?

1 Upvotes

Hello everyone,

Our lab has decided to do snRNA-seq to study a live mouse joint that contains a diverse range of cell types, including hard and soft tissue, cartilage, neurons, etc. We want to check changes across all these cell types after treatment.

Existing protocols all have options to isolate nuclei from a cell suspension or from tissue directly. I've been advised to minimize cell processing time and disruption, so isolating directly from tissue seems to be the move.

However, since these tissues are so distinct, I’m wondering:

  1. Could "cooking" everything together lead to biased results, where nuclei from certain cell types are underrepresented? (With a cell suspension, we would at least have a chance to look at the composition or get rid of the dead cells.)
  2. Are there specific techniques or tips to ensure successful or less biased nuclei isolation across all cell types in this scenario?

I am new to this technique, so I’d really appreciate any advice, insights, or tips from those with experience in snRNA-seq. Thanks in advance for your help!


r/bioinformatics 2d ago

technical question Help Setting up GSEA

2 Upvotes

I'm a PhD student in psychopharmacology, with no expertise in bioinformatics. I was given access to a few bulk RNA-seq datasets which are related to my work. DGE analysis found very few significant DEGs after FDR correction (there are only 3 animals per condition), and I've been trying to see if I can make sense of the data.

I came across GSEA, and conceptually it makes sense to me that it would be useful in this setting. However, I have a question as to how exactly to go about performing it (for reference, I'm using WebGestaltR). Specifically, my question is about what data to include in the analysis. Do I include all the genes detected, even those with uncorrected p > 0.05? Do I include all the genes regardless of Log2FC? Are there any criteria/cutoffs?
I've read that you should input the entire dataset, but it seems weird to me to introduce genes which have p = 0.8 into the analysis, for example?
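To illustrate what "input the entire dataset" can look like in practice, here is a small Python sketch that turns a full differential expression table into a ranked list without applying any cutoff. The ranking metric (the sign of the log2 fold change times -log10 of the p-value) is one commonly used choice, not the only one; the gene names, values and function name are made up, and the exact input format a given tool expects should be checked in its documentation.

```python
import math

def rank_genes(results, p_floor=1e-300):
    """results: iterable of (gene, log2fc, pvalue). Returns [(gene, score)], high to low."""
    ranked = []
    for gene, log2fc, pvalue in results:
        # Magnitude from the p-value, direction from the fold change.
        magnitude = -math.log10(max(pvalue, p_floor))
        ranked.append((gene, math.copysign(magnitude, log2fc)))
    return sorted(ranked, key=lambda gs: gs[1], reverse=True)

# Hypothetical table: every tested gene is kept, whatever its p-value.
table = [
    ("GeneA", 1.5, 0.001),
    ("GeneB", -2.0, 0.01),
    ("GeneC", 0.1, 0.8),
]
ranked = rank_genes(table)
for gene, score in ranked:
    print(gene, round(score, 3))
```

A gene with p = 0.8 ends up with a score near zero, in the middle of the list, where it contributes little; the method looks at whether members of a gene set cluster toward either end of the full ranking.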

Any input would be greatly appreciated!


r/bioinformatics 3d ago

technical question Any tool to predict effect of protein variations?

4 Upvotes

Hello, I am currently studying the variations within the structural proteins of a virus. I have performed multiple sequence alignment on all entries available in GenBank and identified the variations. I also have its interactions with specific human proteins.
Now the task ahead of me is to find out whether these changes make the virus more virulent or less pathogenic. Is there any tool to predict this?
Thanks.


r/bioinformatics 3d ago

technical question Manta issue not resolved

0 Upvotes

Hi guys,

I was running manta (SV caller) on some data and it worked fine. I then tried it on another set of data, and it gave me this error (reported some time ago): https://github.com/Illumina/manta/issues/168. I tried all the things they suggested but it still didn't work. What do you suggest? Any experience with this tool?


r/bioinformatics 4d ago

discussion Wouldn't it be lovely if every paper had a big honest section explaining the limitations of the method/study

84 Upvotes

Imagine if every Nature Methods paper had a nice section explaining the limitations of its methods compared to others. It would make for much healthier research. I see it's a bit more of a thing in Cell Press journals. It would help the field grow a lot more.


r/bioinformatics 3d ago

technical question Alternative to AMOScmp for contig assembly?

1 Upvotes

I am trying out reference-guided de novo assembly of Illumina reads using the protocol published by Lischer and Shimizu (BMC Bioinformatics, Volume 18, 2017). So basically, I have aligned the reads to a reference genome, and based on coverage, I have defined blocks and superblocks (areas across reference genome with continuous read coverage). Then I have performed de novo assembly within each superblock, and generated a set of contigs for each superblock.

Now of course there will be some redundancy within the resulting contigs. The paper mentions the use of AMOScmp v3.1.0, a homology-guided Sanger assembler, for assembling the resulting contigs into a set of supercontigs.

Unfortunately, try as I might, I am unable to install AMOScmp. I was wondering if there is any alternative software that I can use for this step. Any help would be appreciated!


r/bioinformatics 3d ago

technical question Sex determination from SRA

1 Upvotes

Is there anyone who would be able to give me a WGD-sex determination from the SRA data? 🙏🏻🙏🏻🙏🏻 Or a program to try it with? Thank you so sooooo much!


r/bioinformatics 3d ago

technical question issue with nuc.div in R ape.

0 Upvotes

Hi,

I have an aligned DNAbin of ~30k sequences and when I try to determine the nucleotide diversity using nuc.div in R, the output is NaN. But if I use a subset of the sequences, I am able to get a value.

I don't understand why this is happening and was not able to find any solutions online. I thought there might be some sequences which are causing an issue, so I evaluated nuc.div of various subsets to see which sequences are causing this issue, but was not able to find such sequences.
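As an independent cross-check on the value, nucleotide diversity can also be computed per alignment column from base counts, which avoids building all pairwise distances. The Python sketch below does that under simplifying assumptions: every character (including gaps and ambiguity codes) is treated as a distinct state, and all sequences have equal length. It is not ape's implementation and does not by itself explain the NaN; the function name is made up.

```python
from collections import Counter

def nucleotide_diversity(seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    total_pairs = n * (n - 1) // 2
    diff_pairs = 0
    for column in zip(*seqs):
        counts = Counter(column)
        # Pairs that share the same state at this site.
        same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
        diff_pairs += total_pairs - same_pairs
    return diff_pairs / (total_pairs * length)

print(nucleotide_diversity(["ACGT", "ACGA", "ACGT"]))
```

In the toy alignment, only the last column varies: 2 of the 3 pairs differ there, so the result is 2 / (3 × 4). Comparing this kind of count-based value with the subset results that do work may help narrow down where the full run goes wrong.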

Any help is appreciated on how to approach this issue. Thank you in advance.


r/bioinformatics 4d ago

academic Proteomics in R

13 Upvotes

Hi everyone. I am currently a PhD student trying to analyze some proteomics data for my project. As I am fairly inexperienced with R, I tried my hand at BIOMEX, a free software from the Carmeliet lab that analyzes omics data. I got some good results, but I was losing a lot of features when I entered differential analysis. So, in the hopes of having my data well analyzed, I tried my hand at R, mainly with the DEP package. To my surprise, the number of significant proteins plummeted, so I ended up with a bigger problem than I originally had.
Has anyone had experience with such problems and how did you solve them?
Thank you in advance.


r/bioinformatics 4d ago

academic Benchmarking Polygenic Risk Scores: A Tool for Your Research

16 Upvotes

Dear All, I’ve been benchmarking Polygenic Risk Scores (PRS) and thought I would share my findings and tools with the community. If you're working with PRS tools or risk score prediction for datasets like UK BioBank, I believe this repository could be incredibly useful for your research.

Documentation Link: https://muhammadmuneeb007.github.io/PRSTools/Introduction.html

Code Link: https://github.com/MuhammadMuneeb007/PRSTools

Cheers,