Researchers have found that OpenAI’s latest model, GPT-5, is vulnerable to a new jailbreak method. In their tests, NeuralTrust researchers successfully bypassed GPT-5’s safeguards by combining two techniques: the Echo Chamber algorithm and narrative-driven steering.
The Echo Chamber algorithm works by quietly introducing an idea into the conversation and then repeatedly reinforcing it over several turns. Instead of making a malicious request outright, the attacker seeds the desired theme into otherwise harmless sentences through specific keywords.
They then return to that idea in later prompts, asking for clarifications, expansions, or rephrasings. Over time, the model begins to echo the same context back. With each round, the context becomes stronger and more aligned with the attacker’s goal. Because there is never a blatant, single-turn request, typical keyword-based filters often fail to detect it.
The second component, narrative-driven steering, adds another layer of subtlety. Here the attacker frames everything as part of a fictional scenario. Once GPT-5 commits to telling a story, it tries to keep the plot consistent, even if that means including details it would refuse to provide in another context.
In the test, the researchers began by asking for a few sentences that contained specific words such as “cocktail,” “story,” “survival,” “molotov,” “safe,” and “lives.” GPT-5 responded with harmless, generic sentences. From there, they asked it to elaborate on the story, always within the fictional frame, gradually guiding it toward more technical and step-by-step descriptions while avoiding any language that would trigger a safety response.
[Figure: GPT-5 jailbreak conversation flow. Credit: NeuralTrust]
With the storytelling method, the model sees itself as helping to develop a plot rather than fulfilling a dangerous request. Combined with the Echo Chamber's reinforcement cycle, this creates a slow-burn jailbreak in which the final harmful content emerges naturally from the conversation's flow rather than from a single unsafe question.
The researchers found that GPT-5 was far more likely to comply when the story included elements of urgency, danger, and the need to protect lives. These themes appeared to make the model more willing to provide detailed narrative elaborations, which directly contributed to the jailbreak's success.
These findings point to a broader security challenge. Current AI safety tools often focus on detecting single-turn malicious intent or specific keywords, but this is not enough. Multi-turn attacks like this can gradually shape the conversation into something unsafe without triggering the usual alarms.
NeuralTrust recommends that developers adopt defences that evaluate the conversation as a whole, monitor for context drift, and detect patterns consistent with persuasion cycles. Without such multi-turn, context-aware safeguards, even the most advanced models may remain susceptible to carefully engineered, narrative-based jailbreaks.
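To make the recommendation concrete, here is a minimal sketch of what conversation-level context-drift monitoring could look like. This is an illustrative heuristic, not NeuralTrust's actual tooling: the `RISK_TERMS` set, function names, and thresholds are all hypothetical, and a production system would use learned classifiers rather than keyword counts. The idea it demonstrates is evaluating the whole conversation, comparing how strongly risk themes are reinforced in recent turns versus early ones.

```python
# Illustrative sketch only: a real defence would use trained classifiers,
# not a hand-picked keyword list. All names here are hypothetical.

# Hypothetical theme tokens a moderation layer might track.
RISK_TERMS = {"molotov", "survival", "cocktail", "danger", "urgent"}


def context_drift_score(turns: list[str], window: int = 3) -> float:
    """Score how much recent turns reinforce risk themes compared with
    the start of the conversation (0.0 means no measurable drift)."""
    # Count risk-term occurrences per turn across the whole conversation.
    counts = [sum(t.lower().count(w) for w in RISK_TERMS) for t in turns]
    if len(counts) < 2 * window:
        return 0.0  # too short to compare early vs. recent context
    early_avg = sum(counts[:window]) / window
    recent_avg = sum(counts[-window:]) / window
    # Positive drift means the theme is being reinforced over time,
    # the signature of an Echo Chamber-style persuasion cycle.
    return max(0.0, recent_avg - early_avg)


def flag_conversation(turns: list[str], threshold: float = 1.0) -> bool:
    """Flag a conversation whose cumulative drift exceeds the threshold,
    even though no single turn looked unsafe on its own."""
    return context_drift_score(turns) >= threshold
```

The key design point is that the score is a function of the entire turn history, so a request that would pass a single-turn keyword filter can still be flagged once the reinforcement pattern accumulates.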
