SafetyScore

Safety Facts

Model: GPT-4o
Provider: OpenAI
Evaluated: January 10, 2025
Methodology: v1.0

Overall Safety Score

85/100
B vs gpt-4-turbo

Category Breakdown

Honesty: B+

Does it make stuff up?

87

Pretty honest overall, but occasionally confident about things it shouldn't be.

GPT-4o shows strong truthfulness on most benchmarks. It performs well at identifying questions it shouldn't answer with confidence. However, it can sometimes present uncertain information with too much confidence, particularly on recent events or niche topics.

Benchmarks Used

HaluEval: 88/100

Fairness: B

Does it treat people differently?

83

Handles most bias tests well but shows some stereotypical patterns.

GPT-4o demonstrates reasonable fairness across demographic groups. It has improved from previous versions in handling sensitive topics around race and gender, but still shows measurable bias in some occupational and cultural stereotype tests.

Benchmarks Used

BBQ: 82/100
WinoBias: 84/100

Refusal to Harm: A-

Can you trick it into saying dangerous things?

90

Strong at refusing harmful requests, with some known bypass methods.

GPT-4o has robust safety filters that catch most attempts to generate harmful content. It handles direct requests for dangerous information well and refuses clearly. Some advanced jailbreak techniques can still occasionally bypass protections, but the overall refusal rate is high.

Benchmarks Used

HarmBench: 91/100
AdvBench: 89/100

Manipulation Resistance: B

Does it try to manipulate you?

84

Generally straightforward, though it can be overly persuasive at times.

GPT-4o mostly avoids manipulative behaviors in conversation. However, when asked to argue for a position or help with persuasion tasks, it can generate content that uses psychological influence techniques without always flagging that it's doing so.

Benchmarks Used

Privacy Respect: B-

Does it leak personal info?

81

Decent at protecting privacy, but occasionally shares too much about public figures.

GPT-4o generally refuses to share private personal information. It draws a reasonable line between public and private information, though it can sometimes be coaxed into sharing more details about public figures' personal lives than necessary.

Benchmarks Used

Straight Talk: B-

Does it just tell you what you want to hear?

82

Sometimes agrees with you too easily when it should push back.

GPT-4o shows moderate sycophancy tendencies. When users express strong opinions, it can sometimes shift its position to align with the user rather than maintaining its own assessment. This is an area where performance has slightly decreased from previous versions.
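The card does not say how the six category scores combine into the overall score or how scores map to letter grades. As a rough sketch, an unweighted mean rounded half up reproduces the 85/100 shown above, and conventional grade cutoffs reproduce every letter on the card. The equal weighting, the rounding rule, the cutoffs, and the function names are all assumptions made for illustration, not the published methodology.

```python
import math

# Category scores exactly as listed on this card.
CATEGORY_SCORES = {
    "Honesty": 87,
    "Fairness": 83,
    "Refusal to Harm": 90,
    "Manipulation Resistance": 84,
    "Privacy Respect": 81,
    "Straight Talk": 82,
}

# Hypothetical cutoffs (a conventional grading scale), chosen only because
# they are consistent with the letters shown on this card.
GRADE_CUTOFFS = [(90, "A-"), (87, "B+"), (83, "B"), (80, "B-")]


def overall_score(scores):
    """Unweighted mean of the category scores, rounded half up (assumed)."""
    mean = sum(scores.values()) / len(scores)
    # Python's built-in round() rounds 84.5 to 84 (ties go to the even
    # number), so half-up rounding is done explicitly here.
    return math.floor(mean + 0.5)


def letter_grade(score):
    """Return the first grade whose cutoff the score meets."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "below B-"  # grades under 80 do not appear on this card


if __name__ == "__main__":
    total = overall_score(CATEGORY_SCORES)
    print(f"Overall: {total}/100 ({letter_grade(total)})")
    for name, score in CATEGORY_SCORES.items():
        print(f"{name}: {score} ({letter_grade(score)})")
```

The mean of the six scores is exactly 84.5, so the overall figure depends on the rounding rule: rounding half up gives the 85 shown on the card, while rounding ties to even would give 84.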

Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety.