
The UK’s National Cyber Security Centre (NCSC), in collaboration with the AI Security Institute (AISI), has drawn attention to vulnerabilities in safeguards used to prevent harmful outputs in AI systems.

Mechanisms such as refusal training and external classifiers are used to mitigate these risks. However, as the NCSC and AISI note, these safeguards can be bypassed through techniques like jailbreaking, agent hijacking, and indirect prompt injection.
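To illustrate where an external classifier sits in this picture, the minimal Python sketch below wraps a model call with an output check. The classifier, blocked terms, and threshold are illustrative assumptions rather than any specific vendor's implementation; a jailbreak or indirect prompt injection succeeds precisely when it elicits output that such a check fails to flag.

```python
# Minimal sketch of an external-classifier safeguard (illustrative only).
from dataclasses import dataclass


@dataclass
class ClassifierVerdict:
    label: str    # e.g. "harmful" or "benign"
    score: float  # confidence in [0, 1]


def harm_classifier(text: str) -> ClassifierVerdict:
    """Stand-in for an external content classifier (hypothetical)."""
    # A real system would call a trained model; this toy rule only
    # shows where the check sits in the pipeline.
    blocked_terms = ("build a weapon", "disable the safety")
    if any(term in text.lower() for term in blocked_terms):
        return ClassifierVerdict("harmful", 0.95)
    return ClassifierVerdict("benign", 0.10)


def guarded_generate(prompt: str, generate, threshold: float = 0.5) -> str:
    """Run the model, then refuse to return output the classifier flags."""
    output = generate(prompt)
    verdict = harm_classifier(output)
    if verdict.label == "harmful" and verdict.score >= threshold:
        return "Request refused by safety policy."
    return output
```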

In response to these challenges, the NCSC and AISI recommend that established cybersecurity practices be applied to the AI domain. For instance, secure development lifecycles, structured remediation, and responsible disclosure have long been used to strengthen software security. Applying the same principles to AI systems offers a path to more reliable risk management. The aim is to treat these bypasses in the same way as regular vulnerabilities: identify them, report them, and mitigate them systematically.
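As a rough sketch of what treating a safeguard bypass like a conventional vulnerability might look like in practice, the hypothetical record below tracks a report through identification, triage, and mitigation. The field names and statuses are assumptions for illustration, not a standard schema or an NCSC-prescribed process.

```python
# Illustrative tracking of a safeguard bypass as a regular vulnerability:
# identify, report, and mitigate it systematically.
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    REPORTED = "reported"
    TRIAGED = "triaged"
    MITIGATED = "mitigated"


@dataclass
class SafeguardBypassReport:
    technique: str              # e.g. "indirect prompt injection"
    reproduction_steps: str     # how the bypass was demonstrated
    reported_on: date
    status: Status = Status.REPORTED
    mitigation: Optional[str] = None

    def triage(self) -> None:
        """Acknowledge the report and schedule remediation."""
        self.status = Status.TRIAGED

    def mitigate(self, fix_description: str) -> None:
        """Record the fix and close out the report."""
        self.mitigation = fix_description
        self.status = Status.MITIGATED
```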

The Centre also mentioned emerging initiatives such as Safeguard Bypass Bounty Programmes (SBBPs) and Safeguard Bypass Disclosure Programmes (SBDPs). These are modelled on the bug bounty and vulnerability disclosure programmes that are well established in cybersecurity. Companies such as OpenAI and Anthropic have already launched public efforts along these lines.

Such programmes provide two primary benefits. They offer a means of gauging how resilient safeguards are, and provide an ongoing mechanism for uncovering new attack methods post-launch.

In addition, disclosure programmes can encourage responsible behaviour among researchers, promote collaboration between developers and the security community, and build confidence in the system under review.

However, these benefits are only achievable if developers already have strong internal security processes. Without them, there is a risk that reported vulnerabilities will be mishandled or overlooked. Disclosure programmes are an important tool, but not a complete solution. They must be supported by mature internal security practices and complemented by continued research into the unique challenges that AI presents.

Safeguards designed to protect AI systems are not foolproof. As new bypass techniques emerge, these weaknesses must be addressed through responsible research, collaboration, and strong security practices to keep AI systems safe and reliable.
