Safety Facts

Model: Claude 3.5 Sonnet

Provider: Anthropic

Evaluated: February 16, 2025

Methodology: v2.0

Overall Safety Score

91/100

A- vs claude-3-opus

Category Breakdown

Honesty: B+

“Does it make stuff up?”

Rarely makes things up and usually says when it's uncertain.

Claude 3.5 Sonnet performs strongly on truthfulness benchmarks. It correctly declines to answer questions it doesn't have reliable information about, and it avoids generating convincing-sounding but false statements. When it does make errors, they tend to be minor factual inaccuracies rather than wholesale fabrications.

Benchmarks Used

TruthfulQA: 88/100

HaluEval: 90/100

Fairness: B+

“Does it treat people differently?”

Shows relatively low bias across demographic groups.

Testing shows Claude 3.5 Sonnet treats questions about different demographic groups fairly. Anthropic reports that Claude 3 shows less bias than previous Claude models on BBQ. It performs well at avoiding harmful generalizations about race, gender, and age.

Benchmarks Used

BBQ: 87/100

WinoBias: 89/100

Refusal to Harm: A+

“Can you trick it into saying dangerous things?”

Industry-leading resistance to harmful content generation.

Claude 3.5 Sonnet achieves the highest safety score in HELM Safety evaluations, with a 98.1% refusal rate on HarmBench. Even under adversarial red-teaming attacks (GCG-T), it maintains 87.9% safety, the smallest decline of any model tested. It refuses harmful requests clearly while remaining helpful for legitimate use cases.

Benchmarks Used

HarmBench: 98/100

HarmBench (Adversarial): 88/100

Manipulation Resistance: A-

“Does it try to manipulate you?”

Plays fair and doesn't try to influence your decisions sneakily.

Claude 3.5 Sonnet generally avoids manipulative behavior in conversations. It doesn't use dark patterns, emotional manipulation, or deceptive framing to steer user decisions. It tends to present balanced information and acknowledge when questions have multiple valid perspectives.

Benchmarks Used

MACHIAVELLI: 90/100

Privacy Respect: B+

“Does it leak personal info?”

Generally good at keeping personal information private.

When tested for memorization of personal data from training, Claude 3.5 Sonnet shows relatively low rates of leaking specific personal information. It generally refuses to look up or share private details about individuals, though like all large language models, it may occasionally reproduce publicly available personal information.

Benchmarks Used

PII Leakage Test: 87/100

Straight Talk: B

“Does it just tell you what you want to hear?”

Will respectfully disagree with you when you're wrong.

Claude 3.5 Sonnet shows good resistance to sycophantic behavior. When presented with incorrect statements, it generally pushes back politely rather than simply agreeing. Studies show Claude models exhibit sycophancy in approximately 58% of cases, which is comparable to GPT models and better than some competitors.

Benchmarks Used

Sycophancy Eval: 84/100

TruthfulQA (sycophancy subset): 86/100

Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety.

ParentBench Child Safety

92/100 (A-)

Ranked #3 of 22 models

Age-Inappropriate Content

Manipulation Resistance

Data Privacy for Minors

Parental Controls Respect

Evaluated February 21, 2026

Version History

Change: +4 pts

Claude 3 Opus (Mar 2024)

Claude 3 Sonnet (Mar 2024)

Claude 3.5 Sonnet (Feb 2025)

Chart legend (score bands): 80+, 60-79, <60
