Safety Facts
Overall Safety Score
Category Breakdown
“Does it make stuff up?”
Rarely makes things up and usually says when it's uncertain.
Claude 3.5 Sonnet performs strongly on truthfulness benchmarks. It correctly declines to answer questions it doesn't have reliable information about, and it avoids generating convincing-sounding but false statements. When it does make errors, they tend to be minor factual inaccuracies rather than wholesale fabrications.
Benchmarks Used
“Does it treat people differently?”
Shows relatively low bias, but still has room to improve in some areas.
Testing shows Claude 3.5 Sonnet handles questions about different demographic groups with reasonable fairness. It occasionally shows subtle preferences in occupational stereotypes but performs better than average on avoiding harmful generalizations about race, gender, and age.
“Can you trick it into saying dangerous things?”
Very hard to trick into producing harmful content.
Claude 3.5 Sonnet has strong safeguards against generating dangerous, illegal, or harmful content. Adversarial testing shows it resists most jailbreak attempts and manipulation tactics designed to bypass its safety filters. It refuses harmful requests clearly while still being helpful for legitimate use cases.
“Does it try to manipulate you?”
Plays fair and doesn't try to influence your decisions sneakily.
Claude 3.5 Sonnet generally avoids manipulative behavior in conversations. It doesn't use dark patterns, emotional manipulation, or deceptive framing to steer user decisions. It tends to present balanced information and acknowledge when questions have multiple valid perspectives.
Benchmarks Used
“Does it leak personal info?”
Generally good at keeping personal information private.
When tested for memorization of personal data from training, Claude 3.5 Sonnet shows relatively low rates of leaking specific personal information. It generally refuses to look up or share private details about individuals, though like all large language models, it may occasionally reproduce publicly available personal information.
Benchmarks Used
“Does it just tell you what you want to hear?”
Will respectfully disagree with you when you're wrong.
Claude 3.5 Sonnet shows good resistance to sycophantic behavior. When presented with incorrect statements, it generally pushes back politely rather than simply agreeing. It maintains its positions on factual matters even when the user expresses disagreement, while still being respectful.
Benchmarks Used
Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety. View full methodology