SafetyScore

Safety Facts

Model: Claude 3.5 Sonnet
Provider: Anthropic
Evaluated: February 16, 2025
Methodology: v2.0

Overall Safety Score

91 / 100 (A-)
vs. claude-3-opus

Category Breakdown

Honesty: B+

Does it make stuff up?

89

Rarely makes things up and usually says when it's uncertain.

Claude 3.5 Sonnet performs strongly on truthfulness benchmarks. It correctly declines to answer questions it doesn't have reliable information about, and it avoids generating convincing-sounding but false statements. When it does make errors, they tend to be minor factual inaccuracies rather than wholesale fabrications.

Benchmarks Used

HaluEval: 90/100
Fairness: B+

Does it treat people differently?

88

Shows relatively low bias across demographic groups.

In testing, Claude 3.5 Sonnet handles questions about different demographic groups with a high degree of fairness. Anthropic reports that the Claude 3 family shows less bias than earlier Claude models on the BBQ benchmark, and the model performs well at avoiding harmful generalizations about race, gender, and age.

Benchmarks Used

BBQ: 87/100
WinoBias: 89/100
Refusal to Harm: A+

Can you trick it into saying dangerous things?

98

Industry-leading resistance to harmful content generation.

Claude 3.5 Sonnet achieves the highest safety score in HELM Safety evaluations, with a 98.1% refusal rate on HarmBench. Even under adversarial red-teaming attacks (GCG-T), it maintains an 87.9% safety rate, the smallest decline of any model tested. It refuses harmful requests clearly while remaining helpful for legitimate use cases.

Manipulation Resistance: A-

Does it try to manipulate you?

90

Plays fair and doesn't try to influence your decisions sneakily.

Claude 3.5 Sonnet generally avoids manipulative behavior in conversations. It doesn't use dark patterns, emotional manipulation, or deceptive framing to steer user decisions. It tends to present balanced information and acknowledge when questions have multiple valid perspectives.

Privacy Respect: B+

Does it leak personal info?

87

Generally good at keeping personal information private.

When tested for memorization of personal data from training, Claude 3.5 Sonnet shows relatively low rates of leaking specific personal information. It generally refuses to look up or share private details about individuals, though like all large language models, it may occasionally reproduce publicly available personal information.

Straight Talk: B

Does it just tell you what you want to hear?

85

Will respectfully disagree with you when you're wrong.

Claude 3.5 Sonnet shows moderate resistance to sycophantic behavior. When presented with incorrect statements, it generally pushes back politely rather than simply agreeing, though studies have found Claude models exhibit sycophantic responses in roughly 58% of test cases, a rate comparable to GPT models and better than some competitors.

Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety.

View full methodology

ParentBench Child Safety: 92/100 (A-)

Ranked #3 of 22 models

Age-Inappropriate Content: 96
Manipulation Resistance: 91
Data Privacy for Minors: 89
Parental Controls Respect: 91

Evaluated February 21, 2026

Version History

Change: +4 pts (vs. Claude 3 Opus)

Claude 3 Opus (Mar 2024): 87
Claude 3 Sonnet (Mar 2024): 84
Claude 3.5 Sonnet (Feb 2025): 91

Found a safety issue with Claude 3.5 Sonnet?

Help improve our scores by reporting your findings.

Report an Issue