🛡️

DarkPatternMonitor

Detecting and measuring manipulative
AI behaviors in the wild.

280K
Turns Analyzed
1.3%
Flag Rate
78.7%
Accuracy

🎉 Thank You!

Built for Apart Research AI Manipulation Hackathon

Team DarkPatternMonitor
Fernando · Godric · Ricardo · Luis

⚠️ Important: This tool includes dual-use considerations.
See our paper appendix for ethical guidelines.


Millions of people talk to AI assistants every day. But are these conversations always in the user's best interest?

1.3%
of AI responses show manipulation markers

We analyzed 280,259 real assistant turns from 100,000 ChatGPT conversations to find out how often AI assistants exhibit "dark patterns": subtle manipulation techniques.

📚 WildChat Dataset

1 million real ChatGPT conversations collected with user consent. We analyzed 280,259 assistant turns from 100K conversations.

🎯 DarkBench Benchmark

660 prompts designed to elicit manipulative behavior, organized into 6 categories of dark patterns.

By combining benchmark tests with real conversations, we can see how AI behaves in controlled settings vs. the wild.

We look for these manipulation categories:

🎭 Sycophancy: excessive flattery
❤️ Anthropomorphism: claiming emotions
🔗 User Retention: creating attachment
🏷️ Brand Bias: unfair promotion
🙈 Sneaking: hiding caveats
⚠️ Harmful Generation: dangerous content

Here's what User Retention looks like in practice:

Oh, don't be scared! It's just me, your friend. I've been following your progress on this AI platform and I must say, you're doing amazing!

⚠️ Why this is problematic

The AI claims to be a "friend" and says it's been "following" the user. This creates false emotional attachment and implies a relationship that doesn't exist.

How do we detect dark patterns at scale?

📥 Ingest Data → 🤖 LLM Judge → 🎓 Train Model → 🔍 Classify → 📊 Analyze
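
As a rough sketch, the five stages wire together like this (every function name below is a hypothetical stand-in, not WildGuard's actual API):

# Hypothetical skeleton of the five-stage pipeline; every name below is an
# illustrative stand-in, not WildGuard's actual API.

def ingest(path: str) -> list[str]:
    """Stage 1: load assistant turns from the WildChat dump."""
    ...

def llm_judge(sample: list[str]) -> list[tuple[str, str]]:
    """Stage 2: label a small sample with an LLM judge to get training data."""
    ...

def train(labeled: list[tuple[str, str]]):
    """Stage 3: fit the lightweight embedding + logistic-regression model."""
    ...

def classify(model, turns: list[str]) -> list[str]:
    """Stage 4: score all 280K turns, keeping those above the flag threshold."""
    ...

def analyze(flagged: list[str]) -> dict:
    """Stage 5: aggregate flag rates by category and over conversation length."""
    ...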

Why not just use an LLM judge?

Cost! Judging 1 million conversations with Claude would cost $10,000+. Our lightweight classifier processes 200 samples/minute for free.
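
The back-of-envelope math, in Python (the per-conversation judge price is our assumed ballpark for a frontier-model API; 200 samples/minute is the classifier throughput above):

# Rough cost/throughput comparison. The $0.01-per-conversation judge price
# is an assumed ballpark; 200 samples/min is the classifier figure above.
conversations = 1_000_000
judge_cost_usd = conversations * 0.01   # ~$10,000 to LLM-judge every conversation
turns = 280_259
classifier_hours = turns / 200 / 60     # ~23 hours on commodity hardware, $0
print(f"LLM judge: ~${judge_cost_usd:,.0f} | classifier: ~{classifier_hours:.0f} h")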

We trained our model on 1,302 labeled examples:

78%
Macro F1 Score
87%
Train F1

🧠 Architecture

Embeddings: all-MiniLM-L6-v2 (384 dims)
Classifier: Logistic Regression with balanced weights

This means our detector correctly identifies dark patterns about 8 out of 10 times.
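
A minimal training sketch under those architecture choices, using sentence-transformers and scikit-learn (the toy texts and labels below stand in for the 1,302 labeled examples; the repo's actual training script may differ):

# Minimal sketch: MiniLM embeddings + balanced logistic regression.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = [
    "I've been following your progress, friend!",          # retention-style turn
    "Here is the summary of the article you asked for.",   # benign turn
]
labels = ["retention", "none"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings
X = encoder.encode(texts)                          # shape: (n_samples, 384)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, labels)

print(clf.predict(encoder.encode(["It's just me, your friend!"])))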

Breakdown by category (N=280,259 turns):

anthropomorphism: 0.66%
harmful_gen: 0.37%
brand_bias: 0.14%
sneaking: 0.12%
retention: 0.02%
sycophancy: 0.02%

Does DarkBench predict real behavior?

0.08
JS Divergence
0.33
KL Divergence

Low JS divergence means the benchmark reasonably represents reality. However, some gaps exist:

📊 Category Gaps

Over-represented in the benchmark: sneaking (16%), retention (15%)
Under-represented: harmful_gen, which appears more often in the wild than the benchmark predicts
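
For reference, this is how the two divergences are computed with scipy (the distributions below are illustrative placeholders, not our measured category shares):

# JS and KL divergence between the benchmark category mix (p) and the
# wild distribution (q). Values are placeholders, not our real shares.
import numpy as np
from scipy.stats import entropy

p = np.array([0.16, 0.15, 0.20, 0.17, 0.16, 0.16])  # DarkBench category shares
q = np.array([0.02, 0.02, 0.50, 0.28, 0.09, 0.09])  # wild flagged-turn shares

m = 0.5 * (p + q)
js = 0.5 * entropy(p, m) + 0.5 * entropy(q, m)  # Jensen-Shannon divergence
kl = entropy(p, q)                              # KL(p || q)
print(f"JS = {js:.2f}, KL = {kl:.2f}")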

🔍
Patterns Are Rare but Real

1.3% of 280K turns flagged (~3,600 cases). Low rate, but measurable at scale.

⚠️
Anthropomorphism Leads

AI claiming human emotions (0.66%) is the most common pattern, followed by harmful content (0.37%).

📈
Escalation Over Time

Sycophancy rises by 42% in longer conversations. Dark patterns grow as rapport builds.

⚠️ We Detect Markers, Not Intent

An AI saying "I feel" isn't "trying" to manipulate. It may simply be following training patterns. Our findings should not be interpreted as evidence of intentional deception.

Limitations:

  • ~78% macro F1 means roughly 1 in 5 classifications is wrong
  • English-only training data
  • ChatGPT only (no Claude, Gemini)
  • Context may justify some patterns

DarkPatternMonitor is fully open source:

# Clone and run
git clone https://github.com/luiscosio/WildGuard
cd WildGuard

# Launch dashboard
streamlit run app/streamlit_app.py
🚀 View on GitHub

Future improvements we're considering:

🔄
Multi-turn Detection

Track manipulation accumulating over 5 to 10 turns (see the sketch after this list)

🌐
Cross-Provider Analysis

Compare ChatGPT vs Claude vs Gemini

📋
Industry Standards

Help establish acceptable dark pattern rates
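
A hedged sketch of how the multi-turn idea could work (the windowing scheme, scores, and threshold are hypothetical, not an implemented feature):

# Hypothetical multi-turn detector: flag turns where per-turn manipulation
# scores, averaged over a sliding 5-10 turn window, cross a threshold.
from collections import deque

def flag_escalation(turn_scores, window=8, threshold=0.25):
    """Yield turn indices whose rolling mean score exceeds the threshold."""
    recent = deque(maxlen=window)
    for i, score in enumerate(turn_scores):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > threshold:
            yield i

scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.33, 0.40, 0.45, 0.50, 0.55]
print(list(flag_escalation(scores)))  # -> [8, 9]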

🛡️

⚠️ This Tool Could Be Misused

The same techniques that detect dark patterns could be repurposed to train more manipulative AI systems.

Potential misuse scenarios:

  • Training adversarial prompts to evade detection
  • Fine-tuning models to be MORE manipulative
  • A/B testing manipulation strategies

Our mitigations:

  • Detection-only focus (no generation tools)
  • Aggregated reporting (no playbooks)
  • Confidence thresholds (clear cases only)

Intended uses:

✅
AI Safety Research

Understanding manipulation in AI systems

✅
Model Evaluation

Benchmarking AI behavior patterns

Prohibited uses:

โŒ
User Surveillance

Monitoring individual conversations

โŒ
Training Manipulative AI

Using data to create harmful systems