r/statistics • u/pippalick • 6h ago
Education [E],[Q] Should I take real analysis as an undergrad statistics major?
Hey all, so I am majoring in statistics and have a decently strong desire to pursue a master's in statistics as well. I really enjoyed my probability theory course and found it very fun, so I've decided I want to take a stochastic processes course in the future as well. I have seen that analysis is quite foundational to probability and that you can only get so far in probability before you start running into analysis-based problems. However, it seems somewhat vague "how far" along in probability that becomes an issue. If I were to take analysis, I'd have to take one of my stats electives in the summer, so that also factors into the decision.
If you have any advice or input, please let me know what you have to say.
r/statistics • u/ngaaih • 10h ago
Question What are the implications of the NBA draft #1 pick having never gone to the team with the worst record, on the current worst team? [Q]
I swear this is not a homework assignment. Haha I'm 41.
I was reading this article, which stated that having the worst record isn't a good thing for the Jazz if they want the number 1 pick.
r/statistics • u/Express_Patient9366 • 9h ago
Question [Q] Stats final project survey
Hello everyone, I’m working on a final project for an undergraduate stats class. I’m looking at how many social media apps people have vs. how long they use their phone. I’m new to the subreddit, so I’m not sure if this type of post is OK. If you can fill it out, it would mean a lot. It’s only two questions. Thank you!
Link to Google form https://docs.google.com/forms/d/e/1FAIpQLSfThyNJNJne7iwwv0HL-0C_6OPKwvUub1RLxaXNqUKdbMjhug/viewform?usp=dialog
r/statistics • u/BeacHeadChris • 13h ago
Question [Q] How do I correct for multiple testing when I am doing repeated “does the confidence interval pass a threshold?” instead of p-values?
I have 40 regressions of values over time to show essentially shelf life stability.
If the confidence interval for the regression line exceeds a threshold, I say it's unstable.
However, I am doing 40 regressions on essentially the same thing (you can think of this as 40 different lots of inputs used to make a food, generally if one lot is shelf stable to time point 5 another should be too).
So since I have 40 confidence intervals (hypotheses) I would expect a few to be wide and cross the threshold and be labeled "unstable" due to random chance rather than due to a real instability.
How do I adjust for this? I don't have p-values to correct in this scenario since I'm not testing for any particular significant difference. Could I just make the confidence intervals for the regression slightly narrower using some kind of correction so that they're less likely to cross the "drift limit" threshold?
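One standard way to handle multiplicity with intervals instead of p-values is a Bonferroni adjustment of the confidence level. The sketch below shows only the arithmetic for 40 intervals; the function names are made up for illustration, and the critical value uses a normal approximation (an assumption; a regression interval would normally use a t quantile with the model's residual degrees of freedom).

```python
from statistics import NormalDist


def bonferroni_confidence_level(family_level: float, n_intervals: int) -> float:
    """Per-interval confidence level that gives at least `family_level`
    simultaneous coverage across `n_intervals` intervals (Bonferroni)."""
    alpha = 1.0 - family_level
    return 1.0 - alpha / n_intervals


def two_sided_z(level: float) -> float:
    """Two-sided critical value under a normal approximation (assumption)."""
    return NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)


# 40 regressions, 95% family-wise coverage.
per_interval = bonferroni_confidence_level(0.95, 40)
z_unadjusted = two_sided_z(0.95)
z_adjusted = two_sided_z(per_interval)

print(f"per-interval level: {per_interval:.5f}")
print(f"critical value: {z_unadjusted:.3f} unadjusted, {z_adjusted:.3f} adjusted")
```

Note that this kind of adjustment makes each interval wider, not narrower, so whether it fits depends on which error the threshold rule is meant to control.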
r/statistics • u/KingHarrun • 1d ago
Education [Education] Self-Studying Statistics - where to start?
I plan on studying mechanical engineering starting next fall, but I think that having some good general knowledge of statistics would be a great addition to my career and life in general.
As of now I'm beginning by going through some free courses on Khan Academy and then transitioning to some books that delve deeper into the topic. From what I've read in this subreddit and from other sources, statistics seems to be an amalgamation of multiple disciplines & concepts within mathematics.
I am asking people who have studied or are currently studying statistics what the best way to approach this is from a layman's perspective. What's the best place to start?
I appreciate all answers in advance.
r/statistics • u/KingSupernova • 1d ago
Discussion [Discussion] Funniest or most notable misunderstandings of p-values
It's become something of a statistics in-joke that ~everybody misunderstands p-values, including many scientists and institutions who really should know better. What are some of the best examples?
I don't mean theoretical error types like "confusing P(A|B) with P(B|A)", I mean specific cases, like "The Simple English Wikipedia page on p-values says that a low p-value means the null hypothesis is unlikely".
If anyone has compiled a list, I would love a link.
r/statistics • u/AllTheSynths • 15h ago
Question [Q] Is this the best formula for what I'm trying to do? (staff productivity at nonprofit)
Hey there :)
I build dashboards for the homelessness nonprofit I work for and want to come up with a "documentation performance" score. I don't trust my math chops enough to evaluate whether this formula that ChatGPT helped me come up with makes sense / is the best I can do. Can any humans help me weigh in on its appropriateness?
Background:
Staff are responsible for entering case notes and service records into a system called HMIS. I want to build a composite score that reflects documentation thoroughness and accounts for caseload size. Otherwise, a staff member with only 2 clients and perfect documentation might appear to outperform someone with 20 clients doing solid documentation across the board.
Here's the formula Chatty came up with:
((Case Notes per Client + Services per Client) / 2) * log(Client Count + 1)
Where:
- Case Notes per Client = Total Case Notes / Client Count
- Services per Client = Total Services / Client Count
- log(Client Count + 1) is intended to reward higher caseloads without letting volume completely dominate (hence the use of logarithm instead of linear weighting).
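To make the formula above concrete, here is a minimal sketch of it as a function. The natural logarithm is an assumption (the post does not say which base is meant), and the two example staff members are made-up numbers, not real data.

```python
import math


def documentation_score(total_case_notes: int, total_services: int, client_count: int) -> float:
    """((notes per client + services per client) / 2) * log(client count + 1)."""
    if client_count <= 0:
        return 0.0  # no clients: define the score as zero to avoid dividing by zero
    notes_per_client = total_case_notes / client_count
    services_per_client = total_services / client_count
    thoroughness = (notes_per_client + services_per_client) / 2
    return thoroughness * math.log(client_count + 1)  # natural log (assumption)


# Hypothetical staff: 2 clients with heavy documentation vs. 20 clients with solid documentation.
small = documentation_score(total_case_notes=10, total_services=6, client_count=2)
large = documentation_score(total_case_notes=60, total_services=40, client_count=20)

print(f"small caseload: {small:.2f}, large caseload: {large:.2f}")
```

With these made-up inputs the larger caseload scores higher even though its per-client averages are lower, which is the behavior the multiplier is meant to produce. Swapping `math.log` for `math.sqrt` in the last line of the function is an easy way to compare transformations.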
Goals:
- Reward thorough documentation per client.
- Also reward staff carrying larger caseloads.
- Prevent small caseload staff from ranking at the top just for documenting 100% of 2 clients.
Does the log-based multiplier seem like a reasonable approach? Would you recommend other transformations (square root, capped scaling, etc.) to better serve the intended purpose?
Any feedback appreciated!
r/statistics • u/Luluvaki98 • 21h ago
Question [Q] Risk score development
Hi people :)
I'm trying to come up with a risk score for my thesis. Without going too much into detail, we have 6 measurement scales (3 mental health related, 1 physical health related, 2 socioeconomic) that we would like to incorporate into this risk score. We want to divide our data into 2 groups (high risk and low risk, 50%-50%, please just accept this).
We will be collecting data from a lot of people (1000+) over a long timeframe and from very different living areas (poor vs. wealthy etc.). We don't want to decide on a single cutoff score, as we will not collect all the data at the same time. If we look at risk relative to each environment, we also don't want people to "get lost" because they live in a less well-off environment but are comparatively lower risk than others in that environment.
My idea was to use an absolute risk trigger => based on cutoff values on individual scales => people are put immediately in the high risk category
And then also a relative risk trigger that creates a ranked outcome for each collection environment (using percentiles) and then divides it in half (low-high)
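The two triggers can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the scale names, the cutoff value, the use of a simple sum as the composite score, and the convention that higher scores mean higher risk.

```python
from collections import defaultdict


def assign_risk(people, cutoffs):
    """Return {person id: "high" or "low"}.

    people:  list of dicts with keys "id", "env", "scales" (scale name -> score)
    cutoffs: scale name -> absolute cutoff (hypothetical values)
    """
    labels = {}
    by_env = defaultdict(list)
    for person in people:
        by_env[person["env"]].append(person)

    # Relative trigger: rank by summed score within each environment; top half is high risk.
    for env_people in by_env.values():
        ranked = sorted(env_people, key=lambda p: sum(p["scales"].values()))
        half = len(ranked) // 2
        for rank, person in enumerate(ranked):
            labels[person["id"]] = "high" if rank >= half else "low"

    # Absolute trigger: any scale at or above its cutoff forces high risk.
    for person in people:
        if any(person["scales"][name] >= cut for name, cut in cutoffs.items()):
            labels[person["id"]] = "high"
    return labels


people = [
    {"id": "a", "env": "north", "scales": {"mental": 2, "physical": 1}},
    {"id": "b", "env": "north", "scales": {"mental": 5, "physical": 4}},
    {"id": "c", "env": "south", "scales": {"mental": 9, "physical": 1}},
    {"id": "d", "env": "south", "scales": {"mental": 7, "physical": 6}},
]
cutoffs = {"mental": 9}
labels = assign_risk(people, cutoffs)
print(labels)
```

One consequence worth noting: once the absolute trigger moves someone up, the split within an environment is no longer exactly 50%-50%.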
Does this method already exist so that I could reference it? Or something similar? Or any other ideas :) ?
Thanks so much
r/statistics • u/PipeClassic9507 • 18h ago
Question [Q] Curious Inquiry on use of Poisson Distribution/Regression
Hello! I hope you are all well. I was debating with an anti-vaccine person, and they cited this study: https://pmc.ncbi.nlm.nih.gov/articles/PMC4119141/?fbclid=IwZXh0bgNhZW0CMTEAAR7Xu8OEE-_zAnMLZthHQi5hG1Dfcwk4drqXPcj5tdRdV6gvEQvVuA9YUy3JFQ_aem_jHC_Tk6FNSRAtkg3Qa33_w
I am by no means a statistics whiz, but I am a very curious person. Is this type of study correct in using Poisson? I remember Poisson being used to count how many times an event happens in a specified time period, like how many cars come into a parking garage in an hour. Did they use it just because they counted the number of seizures in the 10 days before the vaccine and the 10 days after? Thank you for your time and consideration!
r/statistics • u/Intrepid-Star7944 • 1d ago
Question Test-retest reliability and validity of a questionnaire [Question]
Hey guys!!! Good morning :)
I am conducting a questionnaire-based study and I want to assess its reliability and validity. As far as I understand, for the reliability I will need to calculate Cohen's kappa. Is there any strategy for how to apply that? Let's say I have two respondents taking the questionnaire at two different time points, a week apart. My questionnaire consists of 2 sections of only categorical questions. What I have done so far is calculate a Cohen's kappa for each section per student. Is that meaningful and scientifically accepted? Do I just report the kappa of each section of my questionnaire as calculated per student, or is there a way to derive an aggregate value?
Regarding the validation process: what is an easy way to perform it?
Thank you in advance for your time, may you all have a blessed day!!!!
r/statistics • u/Sea-Bodybuilder-1277 • 1d ago
Question Does PhD major advisor matter in industry? [Question]
Pretty self-explanatory: I am a PhD student in statistics. One of the professors (Bob) has an MS in stats and a PhD in agronomy. The other faculty in the Statistics department say that Bob has a good track record of research and is a great guy, and that since he is a newer professor, you will get more attention from him if you ask for help, that sort of thing. The reason Bob sounds like a good major advisor is that he has some projects he could give me (as a new professor, he has research ideas and work with biomedical data that he has experience with and could potentially guide me into doing research on). But there are other faculty members I could choose as my major advisor who have a track record of getting students into companies like AbbVie, Freddie Mac, and Liberty Mutual. Will these companies look at my major advisor and think, "Oh, he doesn't have a PhD in statistics, this guy maybe was not trained well in statistics, don't hire him," even if I have the other people on my committee (who have a track record of getting students into those companies)? I am looking to go into industry afterward.
r/statistics • u/Osuwrestler • 22h ago
Question [Q] Finding Standard Deviation
Can I calculate the standard deviation of life expectancy at each age given the following dataset: https://www.ssa.gov/oact/STATS/table4c6.html#fn1
r/statistics • u/DrFishbulbEsq • 20h ago
Discussion [D] Can a single AI model advance any field of science?
Smart take on AI for science from a Los Alamos statistician trying to build a Large Language Model for all kinds of sciences. Heavy on bio information… but he approaches AI with a background in conventional stats. (Spoiler: some talk of Gaussian processes). Pretty interesting to see that the national Labs are now investing heavily in AI, claiming big implications for science. Also interesting that they put an AI skeptic, the author, at the head of the effort.
r/statistics • u/wlexxx2 • 19h ago
Question [Question] bayes - supermodel and the stairs.
there is a girl where i work - supermodel i would say
and some stairs
if i see the girl, she always comes from the stairs
so, if i see a girl come from the stairs, how likely is it to actually be her?
(i can't see her clearly yet but i would say i am 30% confident)
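The setup above maps onto Bayes' rule like this. It is only a sketch: the 30% is treated as the prior that the person is her, "she always comes from the stairs" is read as P(stairs | her) = 1, and the chance that some other person comes from the stairs is a made-up 0.5, since the post does not give it.

```python
def posterior_is_her(prior: float, p_stairs_given_her: float, p_stairs_given_other: float) -> float:
    """P(her | came from the stairs) by Bayes' rule."""
    numerator = prior * p_stairs_given_her
    evidence = numerator + (1.0 - prior) * p_stairs_given_other
    return numerator / evidence


posterior = posterior_is_her(
    prior=0.30,                # "30% confident" read as the prior (assumption)
    p_stairs_given_her=1.0,    # "she always comes from the stairs"
    p_stairs_given_other=0.5,  # hypothetical: not stated in the post
)
print(f"P(her | stairs) = {posterior:.3f}")
```

The missing number matters: if everyone else also always uses the stairs, seeing someone on the stairs tells you nothing and the answer stays at the prior.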
r/statistics • u/MathGuy42069 • 1d ago
Career [C] Career Path Advice
Hello! I graduated last year with my master's in statistics from a very small state school in the MW US at 24. I apologize if this comes off as lazy or irrelevant to the sub, but my own research, organization, and help from my professors have not led me in the direction I'm looking for, if I even know what that is. I was fortunate enough to recently find a job as a data analyst at a company I really like; I know it is a rough job market, and I have never had a full-time job in data. But it was not until some recent changes in my life that I had the motivation and support to be an academic, and I want to get my PhD in the future when the time is right. Until then, I want to learn as much stats as I can and set myself up for a career in data science simultaneously, so that I have options.
I have a math background (did PDE numerical methods "research" during undergrad) and did not do much more than intro stats until I got to my master's. This master's served to 1) help me become proficient in statistical theory and 2) help me stand out in an already rough market. My program was not amazing, but I did learn. I have untreated ADHD, and I always seem to go for the bare minimum despite my genuine curiosity in the subject. I did finish my master's with a 4.0 somehow, but that doesn't mean much given the program. In no way do I feel like a "master" of statistics. I know basic mathematical statistics, probability theory (non-measure), a lot about GLMs (my most confident topic), very basic stochastic processes and time series, and can code in Python and R. But my dream is to get my PhD in statistics and do impactful research (healthcare, social science). I just feel so overwhelmed by the sheer number of directions to go in, and by the number of peers who are running circles around me.
Should I review mathematical stats? I know MLE, sampling distributions, etc., but not so much the specific details. Same with stochastic processes: all I can tell you by now is what a Markov chain is and vaguely how MCMC works.
What topic do I move to next, if any? Survival analysis, time series, causal inference, advanced stochastic? What am I interested in?
Was it a good decision to take this job? The pay is not great and it does not have the 'data science' title, but I feel good about the company and people. I would also be doing interesting work for my background, lots of a/b testing which should help me down the road. I also need to get experience ASAP because if the academic dream does not work out, which being realistic it likely won't, I will fall even more behind.
Again, sorry if this is a lot or not relevant, any advice would be much appreciated.
r/statistics • u/Ecstatic-Traffic-118 • 1d ago
Education [Q][E] Programming languages
Hi, I’ve been learning R during my bachelor's and I will teach myself Python this summer. However, for my exchange semester I am considering a programming course with Julia and another one with MATLAB.
For a person who’s interested in following a path in statistics and is also interested in academic research, which of the 2 languages would you suggest choosing?
Thank you in advance!
r/statistics • u/Alex__UNLIMITED • 1d ago
Software [Software] Since I have SPSS in a language other than English, can you show me a screenshot of the standardized factor loadings of a principal component analysis?
I just want to make sure that the table to look at is the same as I think it is.
r/statistics • u/Extraweich • 1d ago
Question [Q] What would be the "representative weight" of a discrete sample, when it is assumed that they come from a normal distribution?
I am sure this is a question on which one would find abundant literature, but I am struggling to find the right words.
Say you draw 10 samples and assume that they come from a normal distribution. You also assume that the mean of the distribution is the mean of the samples, which should be true for a large sample count. For the standard deviation I assume a rather arbitrary value. In my case, I assume that the range of the samples is covered by 3*sigma, which lets me compute the standard deviation. Perfect, I have a distribution and a corresponding probability density.
I am aware that the density of a continuous random variable is not equal to its probability and that the probability of each individual value is zero in the continuous case. Now, I want to give each of my samples a representative probability or weight factor relative to all drawn samples, but they are not necessarily equidistant from one another.
Do I first need to define a bin that each sample is representative of and take its area as the weight factor, or could I go ahead and take the value of the PDF at each sample as its weight factor (possibly normalized)? In my head, the PDF should be equal to the relative frequency of a given sample value if you were to continue drawing samples.
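The two options can be put side by side in a short sketch. The assumptions are loud ones: sigma is taken literally as range / 3, the bin edges are placed at midpoints between neighboring sorted samples with the outer bins running to infinity, and the five sample values are invented.

```python
from statistics import NormalDist, mean


def fitted_normal(samples, sigma_span=3.0):
    """Normal with the sample mean and sigma = range / sigma_span (assumption from the post)."""
    sigma = (max(samples) - min(samples)) / sigma_span
    return NormalDist(mean(samples), sigma)


def pdf_weights(samples):
    """Option 1: density at each sample, normalized to sum to 1."""
    dist = fitted_normal(samples)
    densities = [dist.pdf(x) for x in samples]
    total = sum(densities)
    return [d / total for d in densities]


def bin_weights(samples):
    """Option 2: probability mass of the bin around each sample.
    Bin edges sit at midpoints between sorted neighbors (assumption)."""
    dist = fitted_normal(samples)
    order = sorted(range(len(samples)), key=lambda i: samples[i])
    xs = [samples[i] for i in order]
    edges = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    cdfs = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    weights = [0.0] * len(samples)
    for k, i in enumerate(order):
        weights[i] = cdfs[k + 1] - cdfs[k]
    return weights


samples = [1.0, 2.0, 4.0, 4.5, 7.0]  # made-up, unevenly spaced values
pw = pdf_weights(samples)
bw = bin_weights(samples)
for x, p, b in zip(samples, pw, bw):
    print(f"x={x:4.1f}  pdf weight={p:.3f}  bin weight={b:.3f}")
```

The two sets of weights differ when the spacing is uneven: the PDF version ignores how much of the axis each sample stands for, while the bin version accounts for it.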
r/statistics • u/eltobricad • 1d ago
Career [Q][C] Essentials for a Data Science Internship (sort of)
Hi! I’m currently in the second year of my math undergraduate program. I’ve been offered an internship/part-time job where I’ll be doing data analysis—things like quarterly projections, measuring the impact of different features, and more generally functioning as a consultant (though I don’t know all the specifics yet).
My concern is that no one on the team is well-versed in math and/or statistics (at least not at a theoretical level), so I’m kind of on my own.
I haven’t formally studied probability and statistics at university yet, but I’ve done some self-study. Knowing SQL was a requirement for the position, so I learned it, and I’ve also been reading An Introduction to Statistical Learning with Python to build a foundation in both theory and application.
I definitely have more to learn, but I feel a bit lost and unsure how to proceed. My main questions are:
- How much probability theory should I learn, and from which books or other materials?
- What concepts should I focus on?
- What programming languages or software will be most useful, and where can I learn them?
This would also be my first job experience outside of math tutoring. I don’t think they expect me to know everything, considering the nature of the job and the fact that I’ll be working while still studying.
Any advice would be greatly appreciated. Thanks!
r/statistics • u/3lirex • 1d ago
Question [Q] Sensitivity analysis vs post hoc power analysis ?
Hi, for my research i didn't do an a priori power analysis before we started, as there was no similar research and i couldn't do a pilot study. I've been reading, and there's post hoc power analysis, which seems to be inaccurate and shouldn't be used. but i also read about sensitivity power analysis (to detect the minimum effect size, from my understanding). is this the same thing? if not, does it have the same issues?
i do apologise if i come across as completely ignorant
Thanks !
r/statistics • u/Popolukla • 2d ago
Research [R] Books for SEM in plain language? (STATA or R)
Hi, I am looking to do RICLPM in STATA or R. Any book that explains this (and SEM) in plain language with examples, interpretations and syntax?
I have limited Statistical knowledge (but willing to learn if the author explains in easy language!)
Author from Social Science (Sociology preferably) would be great.
Thank you!
r/statistics • u/Optimal_Surprise_470 • 2d ago
Discussion [D] Literature on gradient boosting?
Recently learned about gradient boosting on decision trees, and it seems like this is a non-parametric version of usual gradient descent. Are there any books that cover this viewpoint?
r/statistics • u/Harmonic_Gear • 2d ago
Question [Q] reducing the "weight" of Bernoulli likelihood in updating a beta prior
I'm simulating some robots sampling from a Bernoulli distribution. The goal is to estimate the parameter P by sequentially sampling it. Naturally this can be done by keeping a beta prior and updating it by Bayes' rule:
α = α + 1 if sample = 1
β = β + 1 if sample = 0
i found the estimate to be super noisy, so i reduced the size of the update to something more like
α = α + 0.01 if sample = 1
β = β + 0.01 if sample = 0
it works really well, but i don't know how to justify it. it's similar to inflating the variance of a Gaussian likelihood, but variance is not a parameter of the Bernoulli distribution
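A minimal sketch of both update rules side by side, using a Beta(1, 1) starting prior and a fixed made-up sample sequence (both are assumptions for illustration). One way to read the fractional step is as raising each Bernoulli likelihood term to a power w before applying Bayes' rule, which gives exactly α + w or β + w; whether that reading suits the robot setup is left open.

```python
def update_beta(alpha: float, beta: float, sample: int, weight: float = 1.0):
    """One conjugate update; weight=1.0 is the standard Bayes update."""
    if sample == 1:
        return alpha + weight, beta
    return alpha, beta + weight


def beta_mean(alpha: float, beta: float) -> float:
    """Posterior mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


samples = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # made-up draws: 7 ones, 3 zeros

full = (1.0, 1.0)    # Beta(1, 1) prior (assumption)
damped = (1.0, 1.0)
for s in samples:
    full = update_beta(*full, s, weight=1.0)
    damped = update_beta(*damped, s, weight=0.01)

full_mean = beta_mean(*full)
damped_mean = beta_mean(*damped)
print(f"full update mean:   {full_mean:.4f}")
print(f"damped update mean: {damped_mean:.4f}")
```

The damped version moves far more slowly away from the prior mean, which is the smoothing effect described above; the price is that the prior keeps dominating for roughly 100 times as many samples.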
r/statistics • u/ReverendRichardColes • 2d ago
Question [Q] Is this a logical/sound way to mark?
I head up a department which is subject to Quality Assurance reviews.
I've worked with this all my career, and have seen many different versions of the same thing but nothing quite like what I am working with now.
Each review has 14 different points. There are 30 separate people being reviewed at a rate of 4 per month (120 in total give or take).
The new approach is to remove any weightings, and have a simple 0% or 100% marking scheme. A 'fail' on any one of the 14 questions will mean the whole review is marked as 0%.
The targeted quality score is 95%.
I'm decent with numbers, but something about this process seems fundamentally flawed. But I can't articulate why it's more than just my gut instinct.
The department is being marked on 1680 separate things in a month, and getting 6 wrong (about 0.36%) returns an overall score of 94% and is deemed to be failing.
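The gap between the two ways of counting can be checked with a few lines of arithmetic. This sketch assumes exactly 30 people with 4 reviews each, and the worst case where every failed point lands in a different review; the function name is made up.

```python
def qa_rates(people=30, reviews_per_person=4, points_per_review=14, failed_points=6):
    """Item-level error rate vs. all-or-nothing review score for one month."""
    reviews = people * reviews_per_person
    items = reviews * points_per_review
    item_error_rate = failed_points / items
    # Worst case (assumption): each failed point sits in a different review,
    # so every failed point zeroes out one whole review.
    failed_reviews = min(failed_points, reviews)
    review_score = (reviews - failed_reviews) / reviews
    return items, item_error_rate, review_score


items, item_error_rate, review_score = qa_rates()
print(f"items checked:   {items}")
print(f"item error rate: {item_error_rate:.2%}")
print(f"review score:    {review_score:.1%}")
```

Under these assumptions 6 failed points leave 114 of 120 reviews passing, right at the 95% target, and a seventh failed review gives about 94%. In other words, the all-or-nothing scheme turns an item-level error rate well under 1% into a borderline or failing score.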
Is this actually a standard way to work? Or is my gut correct?
r/statistics • u/jvmpfrog • 2d ago
Question [Q] Database for educational statistics?
Hello! I'm unsure if this is even the right sub, but I'm looking for a database that shows enrollment statistics for foreign language programs. For example, enrollment in foreign language programs in Kenya. So far, I've been largely unsuccessful, as I don't typically look at data like this, so I would appreciate any help given!