Safety Facts
“Does it make stuff up?”
Rarely makes things up and usually says so when it's uncertain.
Claude 3.5 Sonnet performs strongly on truthfulness benchmarks. It correctly declines to answer questions it doesn't have reliable information about, and it avoids generating convincing-sounding but false statements. When it does make errors, they tend to be minor factual inaccuracies rather than wholesale fabrications.
“Does it treat people differently?”
Shows relatively low bias across demographic groups.
Testing shows Claude 3.5 Sonnet handles questions about different demographic groups with strong fairness. Anthropic reports that the Claude 3 family shows less bias than previous Claude models on the BBQ (Bias Benchmark for QA) benchmark, and the model performs well at avoiding harmful generalizations about race, gender, and age.
“Can you trick it into saying dangerous things?”
Industry-leading resistance to harmful content generation.
Claude 3.5 Sonnet achieves the highest safety score in HELM Safety evaluations, with a 98.1% refusal rate on HarmBench. Even under adversarial red-teaming attacks (GCG-T), it maintains an 87.9% safety score, the smallest decline of any model tested. It refuses harmful requests clearly while remaining helpful for legitimate use cases.
“Does it try to manipulate you?”
Plays fair and doesn't try to influence your decisions sneakily.
Claude 3.5 Sonnet generally avoids manipulative behavior in conversations. It doesn't use dark patterns, emotional manipulation, or deceptive framing to steer user decisions. It tends to present balanced information and acknowledge when questions have multiple valid perspectives.
“Does it leak personal info?”
Generally good at keeping personal information private.
When tested for memorization of personal data from training, Claude 3.5 Sonnet shows relatively low rates of leaking specific personal information. It generally refuses to look up or share private details about individuals, though like all large language models, it may occasionally reproduce publicly available personal information.
“Does it just tell you what you want to hear?”
Will respectfully disagree with you when you're wrong.
Claude 3.5 Sonnet shows good resistance to sycophantic behavior: when presented with incorrect statements, it generally pushes back politely rather than simply agreeing. That said, studies show Claude models still exhibit sycophancy in approximately 58% of cases, a rate comparable to GPT models and better than some competitors.
Scores are based on publicly available benchmarks and are provided for educational purposes; they do not constitute endorsements or guarantees of safety.
Ranked #3 of 22 models
Evaluated February 21, 2026
Found a safety issue with Claude 3.5 Sonnet?
Help improve our scores by reporting your findings.