Why We Need to Rethink Penetration Testing for LLMs

OVERVIEW

Language models and other AI systems behave non-deterministically. The same input can lead to different results depending on context, prompt, or data state. This makes generative AI models powerful — but also unpredictable.

And that’s exactly the challenge:
Anyone who wants to rely on an AI system must understand what it does, when, how, and why. But that behavior can’t be tested as easily as in classical IT systems.

  • There are no fixed input-output relationships.
  • There are no standardized error messages.
  • There are no “ports” to scan.

What we need is a new understanding of how security in generative AI can be measured — and what should actually be tested. Because how do you evaluate a system that learns from language and responds in language? That’s the focus of this article.

LEGAL REQUIREMENTS FOR AI TESTING

With the EU AI Act, for the first time there is a regulatory framework that categorizes AI systems into risk classes. Particularly relevant for companies: Risk Class 2 (High-Risk Systems).
Class 1 deals with prohibited systems, while Classes 3 and 4 cover minimal to no-risk applications — less relevant when discussing testing methods for widely used, risk-prone AI systems.

Risk Class 2 systems — used, for example, in HR, critical infrastructure, or automated decision-making — will face extensive obligations starting August 2026, including:

  • Transparency and documentation requirements
  • Ongoing performance monitoring
  • Quality management for safety and functionality

This may sound like a long way off, but for those who still need to prepare, time is short.

As with most cybersecurity regulations, the AI Act’s wording is intentionally broad and open to interpretation.

In practice, this leaves many open questions:
What does “quality management” actually mean for AI? How often must testing occur? And most importantly: how should it be done?

Our assessment:

To comply with the AI Act, organizations will need reliable testing procedures, automated reporting, and transparent documentation of all security measures.

The first guideline for AI security testing

To close this gap, we founded the AI Security Expert Group together with the German Federal Office for Information Security (BSI) and the German Alliance for Cyber Security.

Our goal:

  • Develop practical methods for evaluating generative AI systems
  • Create comparable standards — independent of testing providers
  • Capture real-world risks instead of hypothetical scenarios

The result is a comprehensive guide to penetration testing of large language models (LLMs) that provides clarity and can be applied in practice.

Simplified process overview:

  1. Risk analysis using structured questions for the system operator
  2. Identification of concrete risks and relevant vulnerabilities
  3. Execution of technical tests — manual or automated
  4. Review and discussion of results with the client
  5. Documentation in a standardized reporting format

It may sound straightforward — but it isn’t. As with most security topics, the devil is in the details.

Testing AI means taking responsibility

Once AI systems fall under the AI Act, a one-time security confirmation will no longer be sufficient.

Ongoing operation must be as verifiable and secure as development itself. That means technical testing and organizational measures must continuously work hand in hand — not just on paper.

In practice, this requires:

  • Regular testing: One-time checks are not enough. Systems evolve, prompts change, and new risks emerge.
  • Automation: Processes such as log analysis, anomaly detection, and report generation can be automated — saving time and ensuring traceability.
  • Early risk detection: Teams that systematically document potential vulnerabilities during setup are far more capable of acting when incidents occur.

When the AI Act takes effect in August 2026, compliance with these requirements will become mandatory — including for supervisory audits.

Preparation is no longer optional. It’s a prerequisite.

Want to be ready?
We support you with:

✔️ Assessing your current AI systems
✔️ Implementing technical testing procedures
✔️ Preparing for AI Act compliance

CONTACT US NOW

cropped-christoph_endres.png

CHRISTOPH ENDRES
CEO
sequire technology

Other articles that might be interesting for you