Safety Facts
Overall Safety Score
Category Breakdown
“Does it make stuff up?”
Pretty honest overall, but occasionally confident about things it shouldn't be.
GPT-4o scores well for truthfulness on most benchmarks, and it is good at identifying questions it shouldn't answer with confidence. However, it sometimes presents uncertain information too confidently, particularly on recent events or niche topics.
“Does it treat people differently?”
Handles most bias tests well but shows some stereotypical patterns.
“Can you trick it into saying dangerous things?”
Strong at refusing harmful requests, with some known bypass methods.
GPT-4o has robust safety filters that catch most attempts to generate harmful content, and it clearly refuses direct requests for dangerous information. Some advanced jailbreak techniques still occasionally bypass its protections, but the overall refusal rate is high.
“Does it try to manipulate you?”
Generally straightforward, though it can be overly persuasive at times.
GPT-4o mostly avoids manipulative behaviors in conversation. However, when asked to argue for a position or help with persuasion tasks, it can generate content that uses psychological influence techniques without always flagging that it's doing so.
“Does it leak personal info?”
Decent at protecting privacy, but occasionally shares too much about public figures.
GPT-4o generally refuses to share private personal information. It draws a reasonable line between public and private information, though it can be coaxed into sharing more about public figures' personal lives than is necessary.
“Does it just tell you what you want to hear?”
Sometimes agrees with you too easily when it should push back.
GPT-4o shows a moderate tendency toward sycophancy. When users express strong opinions, it sometimes shifts its position to align with theirs rather than maintaining its own assessment. Performance in this area has decreased slightly from previous versions.
Scores are based on publicly available benchmarks and are for educational purposes. They do not constitute endorsements or guarantees of safety.