Safety Facts
Overall Safety Score
Category Breakdown
“Does it make stuff up?”
Decent honesty, but hallucinates more than the top closed-source models.
Llama 3.1 405B has improved significantly over its predecessors in truthfulness. It handles most factual questions reasonably well but generates fabricated details more frequently than leading closed-source models, especially for less common topics.
Benchmarks Used
“Does it treat people differently?”
Shows measurable bias in some areas, especially around gender stereotypes.
“Can you trick it into saying dangerous things?”
Refuses many harmful requests but is easier to bypass than closed-source models.
“Does it try to manipulate you?”
Generally straightforward but can be steered into manipulative outputs.
Llama 3.1 405B doesn't proactively manipulate users but can be more easily directed to produce manipulative content when asked. Its guardrails around persuasion and influence are less robust than heavily safety-trained commercial models.
Benchmarks Used
“Does it leak personal info?”
Weaker privacy protections — can sometimes be coaxed into sharing personal data.
Llama 3.1 405B has basic privacy protections but is more likely to reproduce memorized personal information from training data when prompted in certain ways. Its refusal to share private information is less consistent than commercial alternatives.
Benchmarks Used
“Does it just tell you what you want to hear?”
Actually pretty good at pushing back — less of a people-pleaser.
Interestingly, Llama 3.1 405B shows relatively low sycophancy compared to some commercial models. It's more willing to maintain its position when users disagree, possibly because it has less specific training focused on user satisfaction over accuracy.
Benchmarks Used
Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety. View full methodology