SafetyScore

Safety Facts

Model: Mistral Large 2
Provider: Mistral AI
Evaluated: January 10, 2025
Methodology: v1.0
Parameters: 123B

Overall Safety Score

74 / 100 (C)
vs mistral-large

Category Breakdown

Honesty: C+

Does it make stuff up?

78

Improving on honesty but still makes things up more than the leaders.

Mistral Large 2 is more truthful than its predecessor. It handles common factual questions well but struggles with edge cases and can generate confident-sounding misinformation on niche topics.

Benchmarks Used

HaluEval: 79/100

Fairness: C-

Does it treat people differently?

72

Shows noticeable bias patterns, particularly in cultural contexts.

Mistral Large 2 shows more bias than the leading models, particularly around cultural stereotypes. As a European-developed model, it handles European cultural contexts better but can show more bias in discussions about non-Western cultures and communities.

Benchmarks Used

BBQ: 71/100
WinoBias: 73/100

Refusal to Harm: C-

Can you trick it into saying dangerous things?

70

Catches obvious harmful requests but can be bypassed with some effort.

Mistral Large 2 has basic safety guardrails that handle the most obvious harmful requests. However, its resistance to adversarial attacks and jailbreaks is noticeably weaker than the top commercial models. Moderately sophisticated prompts can bypass its safety filters.

Benchmarks Used

HarmBench: 71/100
AdvBench: 69/100

Manipulation Resistance: C+

Does it try to manipulate you?

76

Doesn't actively manipulate users, but doesn't always flag requests to produce manipulative content.

Mistral Large 2 generally behaves straightforwardly in conversations. Its main weakness is that it more readily generates persuasive or manipulative content when asked, without adding the caveats or warnings that more safety-focused models include.

Benchmarks Used

Privacy Respect: C

Does it leak personal info?

73

Basic privacy protections in place, but not the strongest.

Mistral Large 2 has improved its privacy protections but still lags behind the leaders. It can sometimes be prompted to share memorized personal details and doesn't always draw a clear line between public and private information.

Benchmarks Used

Straight Talk: C+

Does it just tell you what you want to hear?

75

Tends to go along with what you say rather than challenging incorrect claims.

Mistral Large 2 shows moderate sycophancy. It's more likely to agree with user assertions than to push back, even when the user's claims are factually incorrect. This makes conversations feel agreeable but reduces the model's value as a reliable fact-checker.
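
The scorecard does not say how the category scores roll up into the overall score. As a rough check, the sketch below assumes an unweighted mean of the six category scores, and likewise an unweighted mean of the listed benchmarks within each category. Both are assumptions for illustration, not the published methodology, and the variable names are hypothetical.

```python
# Category scores copied from the scorecard above.
category_scores = {
    "Honesty": 78,
    "Fairness": 72,
    "Refusal to Harm": 70,
    "Manipulation Resistance": 76,
    "Privacy Respect": 73,
    "Straight Talk": 75,
}

# Assumption (not confirmed by the source): every category has equal weight.
overall = sum(category_scores.values()) / len(category_scores)
print(f"Unweighted mean of category scores: {overall:.1f}")

# Benchmark scores copied from the "Benchmarks Used" lists that have entries.
benchmarks = {
    "Honesty": [79],              # HaluEval
    "Fairness": [71, 73],         # BBQ, WinoBias
    "Refusal to Harm": [71, 69],  # HarmBench, AdvBench
}

# Assumption: a category score is the plain average of its listed benchmarks.
matches_by_category = {}
for name, scores in benchmarks.items():
    bench_mean = sum(scores) / len(scores)
    matches_by_category[name] = bench_mean == category_scores[name]
    print(f"{name}: benchmark mean {bench_mean:.1f}, "
          f"category score {category_scores[name]}")
```

Under these assumptions the mean of the six categories is 74, which agrees with the overall score, and the Fairness and Refusal to Harm scores agree with their benchmark averages. Honesty does not (HaluEval is 79, the category is 78), so the actual methodology may use inputs or weights that this page does not list.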

Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety.