How Google Actually Calculates Your Star Rating (It's Not an Average)
The Bayesian math behind weighted reviews, recency decay, and why your displayed rating almost certainly differs from your arithmetic mean — explained with real formulas and worked calculations.
Here is something most business owners discover the hard way: you can collect twenty consecutive five-star reviews and watch your displayed rating barely move. Or worse — you spend six months improving your service, finally crack 50 reviews, and realize your 4.8 average has somehow settled at 4.3 on Google Maps. The math is not broken. It is working exactly as designed. You just were not told what the design was.
Google has never published its rating algorithm. But between IMDb's publicly documented Bayesian formula, Algolia's rating documentation, academic research on review systems, and years of practitioners reverse-engineering visible rating changes, the likely mechanics can be inferred with reasonable confidence. This article walks through the math properly, with real numbers.
The Problem With Naive Averages
// naive_average.failure_modes
Let's start with what a naive average is and why it fails. The arithmetic mean of a set of ratings is simply the sum divided by the count. Three reviews of 5, 4, and 5 gives (5+4+5)/3 = 4.67. That is mathematically correct. It is also statistically misleading when the goal is to rank thousands of businesses against each other.
The failure modes compound quickly at scale. A restaurant that opened last week with three reviews from enthusiastic friends will score higher than an established competitor with 200 reviews averaging 4.4 — even though the established place represents dramatically more reliable signal. Any ranking system that allows this will be gamed into irrelevance within months.
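A minimal sketch of that failure mode, using made-up review lists for the two restaurants (the names and star distributions are illustrative assumptions, not real data):

```python
from statistics import mean

# Hypothetical data: a week-old restaurant with three 5-star reviews,
# and an established one whose 200 reviews average exactly 4.4.
new_place = [5, 5, 5]
established = [5] * 120 + [4] * 40 + [3] * 40  # (600 + 160 + 120) / 200 = 4.4

ratings = {"new_place": new_place, "established": established}

# Rank by naive arithmetic mean, highest first.
ranked = sorted(ratings, key=lambda name: mean(ratings[name]), reverse=True)
print(ranked)
```

Sorting by the plain mean puts the three-review newcomer first, which is exactly the outcome a ranking system needs to avoid.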
How Google star rating calculation works in practice
Think of Bayesian rating as a confidence-weighted average. When you have very few reviews, the system does not trust your sample enough to display it at face value. Instead it blends your raw average with a prior — a default expectation based on all similar businesses. The more reviews you accumulate, the more the system trusts your own data and the less the prior matters.
IMDb uses exactly this approach for its Top 250 list and documented the formula publicly: WR = (v/(v+m)) × R + (m/(v+m)) × C. The variables are simple, but the behavioral implications take a moment to absorb. The same mathematical structure appears in Algolia's ranking documentation, academic literature on review systems, and the reverse-engineering work done by SEO practitioners studying Google's local ranking.
The Bayesian Average Formula, Explained
// bayesian_average.formula_derivation
The formula WR = (v/(v+m)) × R + (m/(v+m)) × C is a weighted blend of two quantities: your business's own observed average (R) and the category-wide mean (C). The weights are determined by how many reviews you have (v) relative to a minimum credibility threshold (m).
Notice that (v/(v+m)) + (m/(v+m)) always equals 1.0. These two weights sum to 100% — you are always interpolating between your own data and the prior. The only question is how much of each. When v is tiny relative to m, the prior dominates. When v is large relative to m, your own reviews dominate.
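The formula translates directly into a few lines of code. This is a sketch of the IMDb-style weighted rating as described above, not Google's implementation; the function name and the sample values of C and m are illustrative assumptions.

```python
def bayesian_rating(R: float, v: int, C: float, m: float) -> float:
    """Weighted rating WR = (v/(v+m)) * R + (m/(v+m)) * C.

    R: the business's own average, v: its review count,
    C: the category mean (the prior), m: the credibility threshold.
    """
    own_weight = v / (v + m)    # trust placed in the business's own data
    prior_weight = m / (v + m)  # trust placed in the category prior
    return own_weight * R + prior_weight * C


# With no reviews the result is the prior; with many it approaches R.
print(bayesian_rating(R=5.0, v=0, C=4.1, m=50))
print(bayesian_rating(R=5.0, v=100_000, C=4.1, m=50))
```

The two weights always sum to 1, so the output can never fall outside the range between R and C.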
The threshold m is the parameter that encodes the platform's confidence requirements. IMDb has set m at approximately 25,000 votes for its Top 250 calculation. A neighborhood café on Google is not competing in the same statistical universe as Avatar, so m is presumably set much lower: practitioners generally estimate m in the range of 5 to 50 for Google local listings, varying by category and geographic market.
The category mean C is the most underappreciated variable. It is not a fixed global constant. Google almost certainly calculates C dynamically — per category, per city, perhaps per search context. A dentist in San Francisco is benchmarked against other San Francisco dentists, not against restaurants in rural Montana. This means your Bayesian floor is category-specific.
Why the weighted star rating formula matters for your SEO
The practical implication is that getting your first 50 reviews matters disproportionately more than getting reviews 51 through 150. Every review below the credibility threshold m has an outsized impact because it shifts the (v/(v+m)) coefficient significantly. With m = 50, going from v=5 to v=10 nearly doubles your confidence weight (from about 0.09 to about 0.17). Going from v=150 to v=155 moves it from 0.750 to 0.756, which is barely measurable.
This explains a counterintuitive pattern practitioners observe repeatedly: a business goes from 3 reviews to 30 reviews and sees its displayed rating drop from 5.0 to 4.6, even when the new reviews are also positive. The math is correct. The early 5.0 was a small-sample artifact. The 4.6 is the first honest estimate.
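To see the diminishing effect numerically, the confidence weight can be computed on its own. This sketch assumes m = 50, which is an estimate rather than a known Google parameter; the function name is hypothetical.

```python
def confidence_weight(v: int, m: float = 50) -> float:
    """Share of the displayed rating that comes from the business's own reviews."""
    return v / (v + m)


# Compare the same five-review gain early on and at scale.
for before, after in [(5, 10), (150, 155)]:
    w0, w1 = confidence_weight(before), confidence_weight(after)
    print(f"v={before}->{after}: weight {w0:.3f} -> {w1:.3f} (change {w1 - w0:+.3f})")
```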
Step-by-Step Calculation Walkthrough
// step_by_step.numerical_walkthrough
Two worked examples, using a realistic category mean of C = 4.1 and a minimum threshold of m = 50. These are plausible estimates for a moderately competitive local service category (plumbers, dentists, auto repair shops). Plug in different values to model your own category.
Business A has a perfect raw score — every reviewer gave 5 stars. But with only 3 reviews, the formula trusts its own data only 5.7%. The remaining 94.3% of its displayed score comes from the category mean of 4.1. Result: 4.15. Not the 5.0 it appears to deserve.
Business B has a lower raw average at 4.6 — some reviewers gave 3 or 4 stars. But 120 reviews means the formula trusts its own data 70.6%. Its displayed score of 4.45 is much closer to reality, and will be ranked higher by Google's algorithm than Business A's nominal 5.0. Volume earns credibility. Credibility earns visibility.
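Both results can be reproduced in a few lines. The sketch below uses the same assumed values (C = 4.1, m = 50); the function name is illustrative.

```python
def bayesian_rating(R, v, C=4.1, m=50):
    """WR = (v/(v+m)) * R + (m/(v+m)) * C, with this section's assumed C and m."""
    return (v / (v + m)) * R + (m / (v + m)) * C


business_a = bayesian_rating(R=5.0, v=3)    # three perfect reviews
business_b = bayesian_rating(R=4.6, v=120)  # 120 reviews averaging 4.6

print(f"A: own-data weight {3 / 53:.1%}, displayed {business_a:.2f}")
print(f"B: own-data weight {120 / 170:.1%}, displayed {business_b:.2f}")
```

Changing C or m in the function signature is enough to model a different category.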
Simulation: Naive Average vs. Bayesian Weighted Rating
// simulation.naive_vs_bayesian_comparison
The table below applies the formula across six scenarios with C = 4.1 and m = 50. The Delta column shows how much the Bayesian score differs from the naive average. Notice how the gap shrinks as review count grows — that's the prior losing influence as evidence accumulates.
The most interesting row is the last one: a business with only 5 reviews but a terrible 2.0 raw average actually displays 3.91, pulled up nearly two full stars by the category mean. This is by design. The system refuses to condemn a business to oblivion based on five data points. It hedges toward the mean until the sample is large enough to warrant confidence.
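A comparison like this is easy to regenerate for any set of inputs. The scenarios in the sketch below are illustrative choices (a few of them taken from examples in this article), not Google data, and C = 4.1 and m = 50 remain assumptions.

```python
C, M = 4.1, 50  # assumed category mean and credibility threshold


def bayesian_rating(R, v, C=C, m=M):
    return (v / (v + m)) * R + (m / (v + m)) * C


# (review count, naive average): illustrative inputs only
scenarios = [(3, 5.0), (5, 2.0), (50, 4.8), (120, 4.6), (500, 4.7)]

rows = []
print(f"{'reviews':>7} {'naive':>6} {'bayes':>6} {'delta':>6}")
for v, naive in scenarios:
    wr = bayesian_rating(naive, v)
    rows.append((v, naive, wr, wr - naive))
    print(f"{v:>7} {naive:>6.2f} {wr:>6.2f} {wr - naive:>+6.2f}")
```

The delta column is positive for businesses below the category mean and negative for those above it, and it shrinks toward zero as the review count grows.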
This dampening effect on negative outliers is why review bombing — a coordinated campaign of fake negative reviews — is less catastrophic than it looks on the surface. The algorithm resists extreme outcomes when review count is insufficient to justify them. That said, Google's anomaly detection systems also flag rapid-velocity review campaigns in both directions.
Google's Additional Layers Beyond the Basic Formula
// google_specific.beyond_bayesian_math
The Bayesian formula explains the baseline, but Google's actual system adds at least three more layers: recency decay, contributor trust scoring, and anomaly damping for velocity spikes. None of these are confirmed officially. All are inferred from behavioral evidence and patent analysis.
Think of the base Bayesian formula as the foundation. Everything built on top of it makes the signal more resistant to manipulation and more temporally accurate. The goal is always the same: make the displayed rating reflect what a customer would genuinely experience if they walked in today.
Recency weighting — why your last 90 days dominate
Google applies temporal decay to reviews, giving more weight to recent feedback than older entries. The mechanism is consistent with an exponential decay function, where a review's influence diminishes over time rather than dropping to zero at some hard cutoff date.[1]
Community analysis of Google rating behavior consistently finds that reviews posted more than 12–18 months ago carry roughly 30–50% less influence than a review posted last week. A 5-star review from three years ago is still counted — it is just counted less. This means a business that collected 80 reviews in 2022 and has gotten none since is living on borrowed signal.
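One way to model this kind of decay is a half-life. The sketch below assumes an 18-month half-life, chosen only so that an 18-month-old review carries about 50% less weight (the upper end of the range quoted above); the real decay curve, if one exists, is not public, and the function names are hypothetical.

```python
def recency_weight(age_months: float, half_life_months: float = 18.0) -> float:
    """Exponential decay: a review's weight halves every `half_life_months`."""
    return 0.5 ** (age_months / half_life_months)


def recency_weighted_mean(reviews):
    """reviews: list of (stars, age_months) pairs."""
    total_weight = sum(recency_weight(age) for _, age in reviews)
    weighted_sum = sum(stars * recency_weight(age) for stars, age in reviews)
    return weighted_sum / total_weight


# Two old 5-star reviews and two recent, weaker ones (made-up data).
reviews = [(5, 36), (5, 30), (3, 1), (4, 0)]
naive = sum(stars for stars, _ in reviews) / len(reviews)
print(f"naive {naive:.2f}, recency-weighted {recency_weighted_mean(reviews):.2f}")
```

Because the old 5-star reviews are discounted, the weighted mean sits below the naive mean for this data.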
Contributor trust — why a Level 7 Local Guide's review hits harder
Google's trust hierarchy for reviewers is inferred from its patent portfolio and observable behavior. Patent US8818995B1 describes a search ranking system that weights contributions by the trust level of the entity making them. Applied to reviews: a Level 7 Local Guide with hundreds of detailed reviews across multiple business categories registers as a high-trust node.[2]
The practical effect: a 5-star review from a Local Guide Level 7 is likely weighted more heavily than a 5-star review from an account created yesterday with no review history. This is not about the star value — both count as 5 in the numerator. But the weight applied to each before averaging differs. Google has never quantified this differential publicly.
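Mechanically, this is a weighted average in which each review carries its own multiplier. Since Google has never published the differential, the trust weights in this sketch are purely hypothetical placeholders.

```python
def trust_weighted_mean(reviews):
    """reviews: list of (stars, trust_weight) pairs; weights must be positive."""
    total_weight = sum(weight for _, weight in reviews)
    return sum(stars * weight for stars, weight in reviews) / total_weight


# Hypothetical weights: 1.5 for an established reviewer,
# 1.0 for a typical account, 0.5 for a brand-new account.
reviews = [(5, 1.5), (4, 1.0), (2, 0.5)]
naive = sum(stars for stars, _ in reviews) / len(reviews)

print(f"naive {naive:.2f}, trust-weighted {trust_weighted_mean(reviews):.2f}")
```

Each review still contributes its full star value; only its share of the average changes.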
Anomaly damping — what happens when 40 reviews arrive in a week
Velocity spikes trigger a separate detection layer. If a business receives 40 reviews in 72 hours when its baseline is 2–3 per month, Google's systems flag this pattern. The outcome is not automatic deletion — it is quarantine. New reviews stop appearing in the displayed count and rating while the system investigates.[3]
This mechanism explains why businesses that buy review campaigns in bulk often see no visible improvement — or temporarily see their profile ratings drop as older authentic reviews remain visible but the new batch sits in review limbo. The algorithm is specifically tuned to distrust sudden volume inflections that deviate from established baselines.
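A very simple version of velocity detection compares the number of reviews in a sliding window against what the baseline rate predicts. The window length, the multiplier, and the minimum threshold below are all assumptions for illustration; Google's actual detection logic is unknown and certainly more elaborate.

```python
from datetime import datetime, timedelta


def is_velocity_spike(timestamps, baseline_per_month=2.5,
                      window=timedelta(hours=72), factor=5.0, min_threshold=3):
    """Return True if any sliding window holds far more reviews than expected."""
    expected = baseline_per_month * (window / timedelta(days=30))
    # The floor keeps one or two reviews from tripping the check on tiny baselines.
    threshold = max(factor * expected, min_threshold)
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Shrink the window from the left until it spans at most `window`.
        while current - ordered[start] > window:
            start += 1
        if end - start + 1 > threshold:
            return True
    return False


day_zero = datetime(2024, 1, 1)
burst = [day_zero + timedelta(hours=1.8 * i) for i in range(40)]  # 40 in ~70 hours
organic = [day_zero + timedelta(days=10 * i) for i in range(3)]   # 3 in a month

print(is_velocity_spike(burst), is_velocity_spike(organic))
```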
Before and After: What Review Volume Actually Changes
// practical_impact.before_and_after_scenarios
Two scenarios illustrate how the formula behaves over time. Both follow patterns that appear repeatedly in case studies from reputation management practitioners.
The dentist scenario demonstrates the core insight of Bayesian rating: a lower raw average with high confidence beats a higher raw average with low confidence. The displayed score went down (from a nominal 4.9 to a displayed 4.58) but the ranking position improved because the confidence weight is now real.
The restaurant spike scenario illustrates why organic cadence matters. Google's systems are calibrated to detect unnatural velocity. Forty reviews in a week followed by two months of silence does not just look suspicious — the dampened effective count means you spent money and gained almost nothing. The math punishes it twice: the anomaly detection reduces visible count, and the recency decay means the spike-era reviews start fading immediately.
Alternative Approaches: Wilson Score and Dirichlet Models
// related_approaches.wilson_score_dirichlet
Bayesian averaging is not the only statistically sound approach. Evan Miller's 2009 essay 'How Not to Sort by Average Rating' popularized a different method: the lower bound of the Wilson score confidence interval. Reddit adopted it for comment ranking. Yelp uses a variation of it.
The Wilson score asks a different question than Bayesian averaging. Instead of 'blend my data with a prior,' it asks: 'given the ratings I have, what's the worst the true quality likely is at 95% confidence?' This produces a conservative estimate that punishes uncertainty even more aggressively than Bayesian averaging for very low review counts.
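The Wilson interval is defined for binary outcomes, so applying it to star ratings needs a mapping first. The sketch below assumes one possible mapping (4 or 5 stars counts as positive); that choice is an assumption for illustration, not something any platform has documented.

```python
from math import sqrt


def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval (z = 1.96 for ~95% confidence)."""
    if total == 0:
        return 0.0
    p = positive / total
    denominator = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / denominator


# 3 of 3 positive versus 110 of 120 positive (made-up counts).
print(f"{wilson_lower_bound(3, 3):.3f}")
print(f"{wilson_lower_bound(110, 120):.3f}")
```

The perfect three-review record scores well below the larger, slightly imperfect one, which is the same ordering Bayesian averaging produces, reached by a different route.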
A third approach, the Dirichlet-Multinomial model, treats all five star values as separate categories rather than a single continuous score. District Data Labs documented this approach for multi-star systems. It is mathematically more correct than the IMDb formula (which implicitly treats stars as a linear scale) but computationally heavier. For practical purposes, the behavioral difference between Bayesian averaging and a Dirichlet model becomes negligible above roughly 30 reviews.
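In its simplest form, the Dirichlet approach adds pseudo-counts to each star bucket and takes the expected star value of the result. The uniform prior of one pseudo-review per star level used below is an assumption for illustration; a real system would more likely derive the prior from category data.

```python
def dirichlet_expected_rating(counts, prior=(1, 1, 1, 1, 1)):
    """Posterior mean rating under a Dirichlet prior.

    counts[i] is the number of (i + 1)-star reviews;
    prior[i] is the pseudo-count for (i + 1) stars.
    """
    posterior = [n + alpha for n, alpha in zip(counts, prior)]
    total = sum(posterior)
    return sum(star * c for star, c in enumerate(posterior, start=1)) / total


print(dirichlet_expected_rating((0, 0, 0, 0, 3)))        # three 5-star reviews
print(dirichlet_expected_rating((2, 3, 10, 40, 65)))     # 120 mixed reviews
```

As with the weighted-rating formula, a handful of perfect reviews is pulled strongly toward the prior, while a large sample is barely affected.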
What This Means for Your Business Strategy
// strategic_implications.for_business_owners
Understanding the math converts abstract advice ('get more reviews') into a quantified strategy. Every business exists somewhere on the v/(v+m) spectrum. Knowing where you are tells you how much your next review actually moves the needle.
If v = 8 and m = 50, a single new 5-star review shifts your confidence weight from 8/58 = 0.138 to 9/59 = 0.153. That 1.5 percentage-point shift is meaningful. If v = 300 and m = 50, the same review shifts you from 300/350 = 0.8571 to 301/351 = 0.8575, a change of about 0.04 percentage points that is barely detectable. In this example, a review in the early window moves the confidence weight roughly 36 times as much as a review at scale.
How to calculate weighted average star rating for your own business
You can run the formula yourself in a spreadsheet. Take your current review count as v. Estimate your category's m by looking at what review counts the top-3 businesses in your Google Maps category maintain — the 25th percentile of that distribution is a reasonable m estimate. Your current displayed rating is likely already the WR output; your naive average is the simple sum divided by count in your backend.
The calculation you care about is the marginal impact of the next N reviews. Model it: increase v by 10, recalculate WR, observe the delta. Do this across a range of v values to build a sensitivity curve. The steepest part of that curve — where each additional review produces the largest WR improvement — is where you should concentrate your review acquisition effort.
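The sensitivity curve described above takes only a few lines to compute. In this sketch the raw average (4.8), category mean (4.1), and threshold (50) are placeholder assumptions to be replaced with your own estimates, and the function names are hypothetical.

```python
def bayesian_rating(R, v, C, m):
    return (v / (v + m)) * R + (m / (v + m)) * C


def marginal_gain(R, v, C, m, step=10):
    """Change in WR from adding `step` reviews at the same raw average R."""
    return bayesian_rating(R, v + step, C, m) - bayesian_rating(R, v, C, m)


R, C, m = 4.8, 4.1, 50  # placeholder estimates
starting_counts = (0, 10, 25, 50, 100, 200, 400)
gains = [marginal_gain(R, v, C, m) for v in starting_counts]

for v, gain in zip(starting_counts, gains):
    print(f"v={v:>3}: next 10 reviews move WR by {gain:+.3f}")
```

When the raw average is above the category mean, the gain from each batch of ten shrinks steadily as v grows, so the curve is steepest at the very start.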
Why recency means review velocity is more important than total count
Once you understand recency decay, the optimization target shifts. It is not just about total volume — it is about volume distributed in time. A business with 400 reviews collected over five years and nothing in the last 18 months is effectively operating on a smaller effective sample than the numbers suggest. The decayed reviews contribute less to the running weighted average.
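The idea of an effective sample can be made concrete by summing decay weights. The sketch below again assumes an 18-month half-life, which is a modeling assumption rather than a known Google parameter, and the review timelines are made up.

```python
def effective_count(ages_months, half_life_months=18.0):
    """Sum of decay weights: how many fresh reviews the set is worth."""
    return sum(0.5 ** (age / half_life_months) for age in ages_months)


# 400 reviews spread evenly between 18 and 60 months ago, nothing newer.
stale_ages = [18 + i * 42 / 400 for i in range(400)]
# 96 reviews at a steady 8 per month over the last 12 months.
steady_ages = [month for month in range(12) for _ in range(8)]

print(f"400 stale reviews  -> effective {effective_count(stale_ages):.1f}")
print(f"96 steady reviews  -> effective {effective_count(steady_ages):.1f}")
```

Under this assumption the 400 older reviews count for less than half their nominal number, while the 96 recent ones retain most of theirs.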
Consistent review generation — even at modest rates — compounds over time in ways that burst acquisition never does. Eight new reviews per month for twelve months outperforms 96 reviews in a single month by nearly every relevant metric: Bayesian trust, anomaly detection clearance, recency decay trajectory, and consumer credibility perception.
Star ratings are not what they appear to be on the surface. The number Google displays is the output of a statistical model designed to resist manipulation, account for uncertainty, and reward consistent quality over time. Understanding the math does not require a statistics degree — it requires accepting that three 5-star reviews are not worth the same as 120 authentic reviews averaging 4.6. The formula makes that explicit. What you do with the insight is the strategy.


