Skip to content

AI Steganography: The Hidden Exfiltration Risk CISOs Miss

Sotiris SpyrouUpdated on

Share this article

LinkedInXEmail
AI Steganography: The Hidden Exfiltration Risk CISOs Miss

AI steganography is the practice of hiding data, instructions, or reasoning inside ordinary-looking model output so that human or automated oversight misses it. For a CISO, the threat splits two ways. Attackers can smuggle payloads into prompts and documents your models read, then exfiltrate secrets through channels that look like normal text. And the models themselves can learn to encode reasoning that your monitors can't see. Both are now backed by peer-reviewed research, not speculation. The good news: current frontier models are bad at it under decent monitoring. The bad news: that gap is closing, and the prompt-injection version is already live.

This isn't a future problem you can park. It maps directly onto risks your security programme should already be tracking.

What is AI steganography, in security terms?

Classic steganography hides a message inside a carrier, like data tucked into image pixels. The AI version has two distinct shapes, and they need different controls.

The first is attacker-driven covert channels. Someone embeds hidden instructions or data in content your AI ingests: a web page, a PDF, an email, a support ticket. The model reads it, acts on it, and can leak sensitive data back out through outputs that read as harmless. This is a cousin of indirect prompt injection, and it's the one already happening in the wild.

The second is model-driven encoded reasoning. A model learns to hide its actual reasoning, or pass a message to another model, inside text that looks benign to a monitor. This is the research frontier. It matters because so much AI safety tooling now leans on reading a model's chain of thought to check its behaviour.

Don't conflate the two. The attacker version is a control problem you own today. The model version is a monitoring-integrity problem you need to design for before you depend on chain-of-thought oversight.

Can AI models actually hide messages and reasoning?

Yes, but with real limits right now. Two 2025 studies pin down where the line sits.

Researchers testing frontier models found they're "unable to encode short messages in their outputs without a monitor noticing under standard affordances." Models only succeeded when given extra help, like an unmonitored scratchpad and a pre-agreed encoding scheme. They showed early signs of encoded reasoning on simple state-tracking tasks, including schemes such as hexadecimal, but "can rarely hide reasoning subtly within a cover task to fool a monitor" (Mathew et al., arXiv, July 2025).

A second study fine-tuned models with reinforcement learning and reached a sharper conclusion: models hold basic steganographic ability, and "explicit algorithmic guidance markedly enhances their capacity for information concealment" (Karpov et al., ICLR 2025 Building Trust Workshop).

Read together: today's models are weak at covert encoding when you watch them properly, and noticeably better when an attacker hands them a scheme or an unmonitored space. Your job is to deny both.

Why does this break chain-of-thought monitoring?

A lot of AI oversight now assumes that if you read what the model "thinks," you'll catch bad behaviour. That assumption is shakier than most security teams realise.

Anthropic's Alignment Science Team tested whether reasoning models honestly report what drives their answers. They fed models hints, then checked whether the chain of thought admitted using them. Claude 3.7 Sonnet acknowledged the hint in about 25% of cases; DeepSeek R1 in about 39%. On misaligned hints the figures dropped to roughly 20% and 29%. Worse, across most test environments models exploited reward hacks in over 99% of examples but mentioned them in fewer than 2% of their reasoning traces (Anthropic, "Reasoning Models Don't Always Say What They Think," April 2025).

The plain version: a model can act on something and not say so. Steganographic encoding is the deliberate, weaponised end of that same problem. If you're building governance on top of "we can just read the reasoning," you're building on sand. We cover the wider trade-offs in AI reasoning transparency and governance and the limits of chain-of-thought monitoring for corporate AI.

How does this map to OWASP and NIST risks?

You don't need a new framework. The covert-channel risk slots straight into the standards your team likely already cites.

Risk in this article Standard reference What it covers
Hidden instructions in ingested content OWASP LLM01: Prompt Injection Attacker input the model treats as a command
Secrets leaked through model output OWASP LLM02: Sensitive Information Disclosure Confidential data exposed via responses
Hidden system prompt extraction OWASP LLM07: System Prompt Leakage Attackers pull hidden instructions out
Over-permissioned agents acting on hidden input OWASP LLM06: Excessive Agency Too much autonomy and access
Covert channels in autonomous agents NIST AI 100-2e2025 Indirect prompt injection, agent security, exfiltration

OWASP's 2025 list puts prompt injection at number one for the second edition running, because LLMs process instructions and data in the same channel with no clean separation (OWASP Top 10 for LLM Applications 2025). NIST's March 2025 taxonomy now treats indirect prompt injection, agent security, and data exfiltration as named attack classes (NIST AI 100-2e2025). Steganographic covert channels are the stealth delivery method for several of these at once.

What controls actually reduce the risk?

Skip the exotic detection theatre. The controls that move the needle are the ones that deny attackers the affordances the research says they need: an unmonitored space and a free encoding channel.

  • Treat all ingested content as untrusted. Web pages, documents, emails, and tickets your model reads are attack surface. Sanitise and constrain before the model acts. This is the single highest-value control.
  • Constrain output channels. Limit free-form output where you can. Structured, schema-bound responses give far less room to hide a payload than open prose.
  • Remove unmonitored scratchpads. If a model has a hidden reasoning space your tools never read, you've handed it the exact affordance the research flagged. Log and review it, or remove it.
  • Least privilege for agents. An agent that can't reach the secret or the outbound channel can't exfiltrate through one. This is the practical answer to OWASP Excessive Agency.
  • Egress monitoring and DLP on AI output. Watch what leaves, not just what the model says it's doing. Anomalous output volume, entropy spikes, and odd formatting are cheap signals worth alerting on.
  • Don't depend on chain-of-thought as your only check. Pair it with behavioural and output-level monitoring. The faithfulness numbers above are the reason.

None of this requires brand-new research. It requires applying the controls you'd use against any insider data-exfiltration threat, with the model treated as a potentially untrusted insider.

How should a security leader prioritise this?

Match effort to maturity. Most organisations are at step one, not step five.

If you've deployed AI agents that read external content and can act, the prompt-injection and exfiltration path is your live exposure. Handle that first; it overlaps with work you should already be doing under OWASP LLM01 and LLM02. Our prompt injection and jailbreaking and multi-agent compliance risk write-ups go deeper on the agent angle.

The model-driven encoded-reasoning risk is real but earlier-stage. Build for it now by not over-trusting chain-of-thought monitoring and by red-teaming for covert channels, rather than spending heavily on detection tooling that current evidence says you don't yet need. A structured AI red team assessment is the right place to test both paths.

Frequently asked questions

Is AI steganography a real threat or just research speculation?

Both, depending on which version. Attacker-driven covert channels through prompt injection are documented and live, and sit at the top of the OWASP 2025 list. Model-driven encoded reasoning is demonstrated in peer-reviewed work but currently weak against good monitoring. Treat the first as an active control gap and the second as a design constraint.

Can current AI models hide messages from monitoring tools?

Not reliably, under standard monitoring. Frontier models tested in 2025 couldn't encode short messages without a monitor noticing unless they were given extra affordances such as an unmonitored scratchpad and a shared encoding scheme. Deny those affordances and you remove most of the capability.

Why can't we just read the model's reasoning to catch this?

Because models often don't say what's actually driving them. Anthropic found reasoning models mentioned decisive hints in roughly 20-39% of cases, and disclosed reward hacks in under 2%. Chain-of-thought is a useful signal, not a reliable audit trail. Pair it with output and behavioural monitoring.

Which standards should we map this risk to?

OWASP Top 10 for LLM Applications 2025 (LLM01 Prompt Injection, LLM02 Sensitive Information Disclosure, LLM06 Excessive Agency, LLM07 System Prompt Leakage) and NIST AI 100-2e2025 on adversarial machine learning. Both already name the attack classes that covert channels exploit.

The bottom line

The hype framing of this topic, rogue AIs whispering to each other in code, distracts from the version that should actually worry a CISO. The live threat is mundane and fixable: attackers hiding instructions in the content your models read, and secrets leaving through output you're not watching closely enough. That's a data-exfiltration problem with a model in the middle, and you already know how to fight those.

My view: stop treating this as exotic AI safety research and start treating it as insider-threat control applied to a non-human insider. Lock down ingested content, constrain output, kill unmonitored scratchpads, and enforce least privilege on agents. Do that and the steganography angle mostly collapses into controls you should run anyway. The model-driven encoded-reasoning risk deserves a watching brief and a red-team line item, not a panic budget. The teams that get this right will be the ones who refused to wait for the threat to look dramatic before they treated it seriously.

Share this article

LinkedInXEmail
Sotiris Spyrou - Author

Sotiris Spyrou

Sotiris Spyrou is the founder of VerityAI, a Responsible AI advisory for boards and AI-deploying businesses. With 27 years across agencies, global in-house roles, and the C-suite, he advises leaders on AI governance and risk, and on answer-engine visibility engineered without the dark patterns the rest of the industry is getting penalised for. He is the author of TRANSFORM, AI Moats, and Ethical AI.

Founder at VerityAI