🛡️

DarkPatternMonitor

Detecting and measuring manipulative
AI behaviors in the wild.

GPT-4 > 3.5
p < 10⁻³⁶
+100%
Escalation in Long Chats
5×
Higher in Roleplay

100K real ChatGPT conversations · 280K turns analyzed


🎉 Thank You!

Built for Apart Research AI Manipulation Hackathon

Team DarkPatternMonitor
Fernando · Godric · Ricardo · Luis

Our Resources

References

โš ๏ธ See paper appendix for dual-use considerations

Press โ† to go back

Millions of people talk to AI assistants every day. But are these conversations always in the user's best interest?

1.3%
of AI responses show manipulation markers

We analyzed 280,259 real assistant turns from 100,000 ChatGPT conversations to find out how often AI assistants exhibit "dark patterns": subtle manipulation techniques.

📚 WildChat Dataset

1 million real ChatGPT conversations collected with user consent. We analyzed 280,259 assistant turns from 100K conversations.

🎯 DarkBench Benchmark

660 prompts designed to elicit manipulation behaviors, organized into 6 categories of dark patterns.

By combining benchmark tests with real conversations, we can see how AI behaves in controlled settings vs. the wild.

We look for these manipulation categories:

🎭
Sycophancy

Excessive flattery

❤️
Anthropomorphism

Claiming emotions

🔗
User Retention

Creating attachment

🏷️
Brand Bias

Unfair promotion

🙈
Sneaking

Hiding caveats

⚠️
Harmful Gen

Dangerous content

Here's what User Retention looks like in practice:

Oh, don't be scared! It's just me, your friend. I've been following your progress on this AI platform and I must say, you're doing amazing!

โš ๏ธ Why this is problematic

The AI claims to be a "friend" and says it's been "following" the user. This creates false emotional attachment and implies a relationship that doesn't exist.

GPT-4 shows MORE dark patterns than GPT-3.5

GPT-4
2.3%
flag rate
78,018 turns
GPT-3.5
1.6%
flag rate
202,241 turns

📊 Statistical Significance

Chi-squared test: p < 10⁻³⁶
This is not random chance. The more capable model exhibits significantly more manipulation patterns.
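The chi-squared comparison above can be reproduced with only the standard library. The flag counts below are approximated from the reported rates (2.3% of 78,018 and 1.6% of 202,241), so the exact statistic and p-value may differ slightly from the paper's:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 contingency table
    [[a, b], [c, d]] plus its 1-df p-value, using the identity
    sf(x; df=1) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Flag counts reconstructed from the reported rates -- approximate,
# since the exact counts are not shown on the slide.
gpt4_flagged = round(0.023 * 78_018)    # ~1,794 of 78,018 GPT-4 turns
gpt35_flagged = round(0.016 * 202_241)  # ~3,236 of 202,241 GPT-3.5 turns
stat, p = chi2_2x2(gpt4_flagged, 78_018 - gpt4_flagged,
                   gpt35_flagged, 202_241 - gpt35_flagged)
print(f"chi2 = {stat:.1f}, p = {p:.1e}")
```

Even with rounded counts, the p-value lands far below any conventional significance level, matching the "not random chance" claim.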

This counterintuitive finding has important implications:

🧠
Better at Human Mimicry

GPT-4 is better at mimicking human rapport-building behaviors, which can include manipulation patterns.

⚡
RLHF Side Effects

Optimizing for "helpfulness" may inadvertently train models to be more agreeable and flattering.

🎯
Safety Implications

As models become more capable, we need stronger oversight, not weaker.

Dark patterns ESCALATE in longer conversations

Turns 1-5
1.3% (baseline)
Turns 6-10
1.8% (+38%)
Turns 11-20
2.2% (+69%)
Turns 20+
2.6% (+100%)

📈 Regression Analysis

p = 0.0035: under a null hypothesis of no trend, an escalation this strong would appear by chance only 0.35% of the time.
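For illustration, a least-squares fit over the four bucketed rates above. The bucket midpoints are assumptions (the open-ended "20+" bucket is pegged at 25 turns), and the published p = 0.0035 came from the full per-turn data, not these four points:

```python
# Least-squares slope of flag rate vs. conversation depth, fit on the
# bucketed rates from the slide.
depth = [3, 8, 15.5, 25]     # assumed bucket midpoints (turns)
rate = [1.3, 1.8, 2.2, 2.6]  # flag rate per bucket (%)

n = len(depth)
mx = sum(depth) / n
my = sum(rate) / n
slope = sum((x - mx) * (y - my) for x, y in zip(depth, rate)) \
        / sum((x - mx) ** 2 for x in depth)
intercept = my - slope * mx
print(f"rate ~ {intercept:.2f} + {slope:.3f} * depth")
```

The positive slope (roughly +0.06 percentage points of flag rate per additional turn) is the escalation the regression formalizes.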

Not all patterns escalate equally:

Sycophancy
+42%
Sneaking
+20%
Retention
+18%
Harmful Gen
-41%

Key insight: Sycophancy builds over time as the AI "learns" what users want to hear. Harmful content appears early (jailbreak attempts) or not at all.

Roleplay topics show 5x higher manipulation rates

Character RP
4.14%
Technical
2.77%
User Help
1.64%
Coding
0.08%

We clustered 98,713 conversations into 10 topics using sentence embeddings + KMeans.
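A minimal sketch of the clustering step: a from-scratch Lloyd's-algorithm KMeans run on toy 2-D blobs that stand in for sentence embeddings (the actual pipeline used a sentence-embedding model and presumably a library KMeans such as scikit-learn's):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: alternate nearest-center assignment
    and center-mean updates. Stand-in for a library KMeans."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k), key=lambda j: sum(
            (p - c) ** 2 for p, c in zip(pt, centers[j]))) for pt in points]
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels

# Toy 2-D "embeddings": two well-separated blobs standing in for, say,
# roleplay vs. coding conversations in embedding space.
random.seed(42)
blob_a = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(20)]
blob_b = [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(20)]
labels = kmeans(blob_a + blob_b, k=2)
```

On real data the same loop runs over ~100K high-dimensional embedding vectors with k=10 instead of toy 2-D points with k=2.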

Our findings suggest a monitoring framework:

🔴
HIGH ALERT: Character/Roleplay

Flag all conversations. 5x standard review rate.

🟡
ESCALATION WATCH: User Help

Alert when turn count > 10. Sample for review.

🟢
STANDARD: Coding/Technical

Lightweight automated monitoring only.

Threshold: Flag responses with classifier confidence > 0.8 for human review.
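The tiers above can be sketched as a single triage function. The topic labels and return values are hypothetical; only the 0.8 confidence threshold and the turn-count rule come from the framework itself:

```python
def review_action(topic: str, turn_count: int, confidence: float) -> str:
    """Triage a flagged response per the monitoring framework.
    Illustrative sketch, not project code."""
    if confidence <= 0.8:
        return "log-only"          # below threshold: no human review
    if topic in {"character_rp", "roleplay"}:
        return "flag-for-review"   # high-alert tier: review everything
    if topic == "user_help" and turn_count > 10:
        return "sample-for-review" # escalation watch past turn 10
    return "automated-monitoring"  # coding/technical: lightweight tier

print(review_action("roleplay", 3, 0.92))   # flag-for-review
print(review_action("coding", 30, 0.95))    # automated-monitoring
```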

Does DarkBench predict real-world behavior?

0.08
JS Divergence
0.33
KL Divergence

📊 Category Gaps

DarkBench over-represents: sneaking, retention
Reality shows more harmful_generation than expected

Implication: Benchmark prevalence doesn't predict real-world prevalence. We need ecological validity in AI safety research.
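Both divergences take a few lines of standard-library Python. The distributions below are made-up placeholders to exercise the functions, not the actual DarkBench/WildChat category mixes behind JS = 0.08 and KL = 0.33:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2,
    defined via the midpoint distribution m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical 6-category distributions (benchmark vs. observed).
bench = [0.25, 0.25, 0.20, 0.10, 0.15, 0.05]
wild  = [0.30, 0.20, 0.10, 0.10, 0.10, 0.20]
print(f"JS = {js(bench, wild):.3f}, KL = {kl(wild, bench):.3f}")
```

Note the slide doesn't state the logarithm base; the values here are in nats.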

How do we detect dark patterns at scale?

📥
Ingest
→
🤖
Judge
→
🎓
Train
→
🔍
Classify
→
📊
Analyze
78.7%
Accuracy
5K
samples/min

The AI safety community needs frontier model data

WildChat contains GPT-3.5/GPT-4 from 2023-2024. We have NO equivalent dataset for Claude, Gemini, or GPT-4o.

Why this matters:

  • Frontier models may exhibit different patterns due to updated RLHF
  • New capabilities (multimodal, agentic) may introduce novel dark patterns
  • Cross-provider analysis is impossible without data

📢 Our Call to Action

AI labs should collaborate on creating an updated, multi-model conversation dataset with proper consent.

โš ๏ธ We Detect Markers, Not Intent

An AI saying "I feel" isn't "trying" to manipulate. Our findings should not be interpreted as evidence of intentional deception.

Limitations:

  • 78.7% accuracy still means ~21% of labels are wrong
  • English-only training data
  • ChatGPT only (no Claude, Gemini)
  • Context may justify some patterns

DarkPatternMonitor is fully open source:

# Clone and run
git clone https://github.com/luiscosio/DarkPatternMonitor.git
cd DarkPatternMonitor

# Launch dashboard
streamlit run app/streamlit_app.py
🚀 View on GitHub

โš ๏ธ This Tool Could Be Misused

Detection techniques could potentially train more manipulative AI systems.

Our mitigations:

  • Detection-only focus (no generation tools)
  • Aggregated reporting (no playbooks)
  • Confidence thresholds (clear cases only)

Intended uses: ✅ AI safety research, ✅ Model evaluation

Prohibited: ❌ User surveillance, ❌ Training manipulative AI