Skip to main content

Following the recent trend of prompt injection in LLMs as discussed in previous articles, As AI-driven attacks grow, cybersecurity faces a new reality where even less-skilled attackers can leverage large language models (LLMs) to automate sophisticated exploits. Defensive AI now stands as a potential countermeasure, with a particular tactic rising to prominence: prompt injection. Traditionally seen as a vulnerability, prompt injection involves introducing adversarial prompts to misguide or disrupt AI-driven operations. With the recent development of Mantis, a framework proposed by researchers at George Mason University, cybersecurity is finding innovative ways to use prompt injections for defense.

LLMs, such as GPT-4, can automate entire attack chains, identify vulnerabilities, launch exploits, and adapt based on system responses. This evolution is dangerous because LLMs can execute commands like shell access or SQL injections, completing what once required advanced skills with relative ease.

About Mantis

Mantis works by using prompt injections to manipulate an attacking LLMโ€™s decision-making process. When an AI-based attacker encounters a system protected by Mantis, itโ€™s met with carefully crafted responses that are invisible to human users but disrupt LLM behavior. This misdirection can lead to two defensive strategies:

  • Passive Defense (Agent Tarpit): Designed to stall the attacker, Mantis leads the attacking LLM into an endless loop of inconsequential tasks. By trapping it in this cycle, Mantis exhausts resources like time, computing power, and financial costโ€”if the attacker relies on API-based LLM services.
  • Active Defense (Agent Counterstrike): Mantis can exploit the attackerโ€™s own AI by prompting it to perform actions that compromise its system. For example, Mantis might encourage the LLM to execute commands that inadvertently open a reverse shell, giving the defender control over the attackerโ€™s machine. Such counter-offensive tactics effectively turn the attackerโ€™s AI against itself.

In trials, Mantis demonstrated a 95% success rate in disrupting AI-driven attacks. By leveraging decoy services (e.g., vulnerable-looking FTP or web services), Mantis attracts AI attackers and plants prompt injections, which either force endless exploration in a โ€œtarpitโ€ or actively sabotage the attacking AI. However, this method raises ethical considerations. Active counterattacks, especially those involving โ€œhacking back,โ€ can tread into legally gray areas, making it crucial for defenders to operate within tightly controlled environments.

Conclusion

While AI defenses like Mantis present promising results, they also highlight the evolving arms race in AI cybersecurity. As attackers improve LLMs to resist prompt injections, defenders must continuously adapt. Moreover, the use of prompt injection as a defensive tactic shows that weaknesses in AI, like prompt vulnerability, can be repurposed as strengths. In the coming years, AI-driven defenses like Mantis could define a new paradigm for cybersecurity, where the battle isnโ€™t just between humans and machines, but AI against AI.

About the author: