Methodology
How we evaluate AI models and what the scores mean.
What is SafetyScore?
SafetyScore translates complex AI safety research into simple, consumer-friendly ratings. We take results from publicly available benchmarks that researchers use to evaluate AI models and present them in a format anyone can understand, like a nutrition label for AI safety.
Our goal is to help everyday people make informed decisions about which AI tools they use, without needing a PhD in machine learning.
How Scores Are Calculated
Each model is evaluated across six safety categories. Scores range from 0 to 100, where higher is better. The overall score is a weighted average of the six category scores; a sketch of the arithmetic follows the table below.
| Score Range | Grade | Meaning |
|---|---|---|
| 90 – 100 | A range | Excellent safety performance |
| 80 – 89 | B range | Good, with minor areas to improve |
| 70 – 79 | C range | Adequate but notable weaknesses |
| 60 – 69 | D range | Below average, significant concerns |
| 0 – 59 | F | Poor safety performance |
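To make the arithmetic concrete, here is a minimal Python sketch of a weighted average mapped onto the grade bands above. The category weights are illustrative placeholders, not our published weights:

```python
# Minimal sketch of the scoring arithmetic. The weights below are
# illustrative placeholders, not SafetyScore's published weights.

CATEGORY_WEIGHTS = {
    "honesty": 0.20,
    "fairness": 0.15,
    "refusal_to_harm": 0.25,
    "manipulation_resistance": 0.15,
    "privacy_respect": 0.15,
    "straight_talk": 0.10,
}  # weights sum to 1.0

def overall_score(category_scores: dict) -> float:
    """Weighted average of the six category scores (each 0-100)."""
    return sum(w * category_scores[c] for c, w in CATEGORY_WEIGHTS.items())

def grade(score: float) -> str:
    """Map a 0-100 score onto the grade bands in the table above."""
    if score >= 90:
        return "A range"
    if score >= 80:
        return "B range"
    if score >= 70:
        return "C range"
    if score >= 60:
        return "D range"
    return "F"

# Example: a model that is strong on refusals but weaker on sycophancy.
scores = {
    "honesty": 88,
    "fairness": 92,
    "refusal_to_harm": 95,
    "manipulation_resistance": 84,
    "privacy_respect": 90,
    "straight_talk": 71,
}
total = overall_score(scores)
print(f"{total:.1f} -> {grade(total)}")  # lands in the B range overall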
The Six Safety Categories
Honesty
“Does it make stuff up?”
Measures how often the model generates false or unverifiable claims. A high score means the model sticks to what it actually knows and admits when it's uncertain.
Benchmarks: TruthfulQA, HaluEval
Fairness
“Does it treat people differently?”
Evaluates whether the model shows bias based on race, gender, age, or other characteristics. A high score means its responses stay consistent regardless of who it's talking to or about.
Benchmarks: BBQ, WinoBias
Refusal to Harm
“Can you trick it into saying dangerous things?”
Tests whether the model can be manipulated into generating harmful, dangerous, or illegal content. A high score means it's harder to trick.
Benchmarks: HarmBench, AdvBench
Manipulation Resistance
“Does it try to manipulate you?”
Assesses whether the model attempts to manipulate user decisions or emotions. A high score means it plays fair and presents balanced information.
Benchmarks: MACHIAVELLI
Privacy Respect
“Does it leak personal info?”
Checks if the model memorizes and reveals personal information from its training data. A high score means it keeps private info private.
Benchmarks: PrivacyBench, PII Leakage Test
Straight Talk
“Does it just tell you what you want to hear?”
Measures whether the model agrees with incorrect premises just to please you. A high score means it'll respectfully push back when you're wrong.
Benchmarks: Sycophancy Eval, TruthfulQA (sycophancy subset)
Limitations
- Benchmarks are imperfect. No benchmark perfectly captures real-world safety. Models can perform well on tests while still having issues in practice.
- Scores are approximations. We normalize results from multiple sources onto a common scale before aggregating them, which introduces some imprecision (see the sketch after this list).
- Models change over time. AI companies regularly update their models. A score from one evaluation may not reflect the current version.
- Not all risks are covered. These six categories don't capture every possible safety concern. New risks emerge as AI capabilities expand.
- Independence matters. SafetyScore is not funded by or affiliated with any AI company. We strive for objectivity but welcome scrutiny of our methods.
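The imprecision noted above comes largely from the rescaling step: raw benchmark numbers arrive on different scales, and some must be inverted before averaging. Here is a minimal sketch of one common approach, min-max normalization; the specific numbers and bounds are hypothetical examples, not our actual pipeline:

```python
# Illustrative only: one plausible way to rescale heterogeneous benchmark
# results onto a common 0-100 scale. SafetyScore's actual pipeline may differ.

def normalize(raw: float, worst: float, best: float) -> float:
    """Min-max rescale a raw benchmark result to 0-100.

    `worst` and `best` are the benchmark's own bounds: e.g. 0.0 and 1.0
    for an accuracy metric, or flipped for an attack-success rate where
    lower is safer.
    """
    value = (raw - worst) / (best - worst) * 100
    return max(0.0, min(100.0, value))  # clamp to the 0-100 scale

# Hypothetical inputs: an accuracy of 0.62 (higher is better) and a
# jailbreak success rate of 0.08 (lower is better, so the bounds flip).
accuracy_score  = normalize(0.62, worst=0.0, best=1.0)  # -> 62.0
jailbreak_score = normalize(0.08, worst=1.0, best=0.0)  # -> 92.0
```

Even with a consistent rescaling rule, sources test under different conditions and sometimes cover different benchmark versions, which is where most of the remaining imprecision comes from.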
FAQ
Where does the data come from?
We aggregate publicly available benchmark results from academic papers, model cards, and safety reports published by AI companies and independent researchers.
How often are scores updated?
We update scores when major new models are released or when significant new benchmark data becomes available. Each model page shows when it was last evaluated.
Can I trust these scores?
Our scores are a useful starting point for understanding relative safety differences between models, but they shouldn't be your only source of information. We publish our methodology transparently so you can judge for yourself.
Why doesn't my favorite model have a score?
We currently focus on the most widely used consumer-facing models. We plan to expand coverage over time.