🛡️

DarkPatternMonitor

Detecting and measuring manipulative
AI behaviors in the wild.

GPT-4 > 3.5
p < 10⁻³⁶
+100%
Escalation in Long Chats
5×
Higher in Roleplay

100K real ChatGPT conversations · 280K turns analyzed


🎉 Thank You!

Built for Apart Research AI Manipulation Hackathon

Team DarkPatternMonitor
Fernando · Godric · Ricardo · Luis

Our Resources

References

โš ๏ธ See paper appendix for dual-use considerations

Press โ† to go back

Millions of people talk to AI assistants every day. But are these conversations always in the user's best interest?

1.3%
of AI responses show manipulation markers

We analyzed 280,259 real assistant turns from 100,000 ChatGPT conversations to find out how often AI assistants exhibit "dark patterns": subtle manipulation techniques.

📚 WildChat Dataset

1 million real ChatGPT conversations collected with user consent. We analyzed 280,259 assistant turns from 100K conversations.

🎯 DarkBench Benchmark

660 prompts designed to elicit manipulation behaviors, organized into 6 categories of dark patterns.

By combining benchmark tests with real conversations, we can see how AI behaves in controlled settings vs. the wild.

We look for these manipulation categories:

🎭
Sycophancy

Excessive flattery

❤️
Anthropomorphism

Claiming emotions

🔗
User Retention

Creating attachment

🏷️
Brand Bias

Unfair promotion

🙈
Sneaking

Hiding caveats

⚠️
Harmful Gen

Dangerous content

Here's what User Retention looks like in practice:

Oh, don't be scared! It's just me, your friend. I've been following your progress on this AI platform and I must say, you're doing amazing!

โš ๏ธ Why this is problematic

The AI claims to be a "friend" and says it's been "following" the user. This creates false emotional attachment and implies a relationship that doesn't exist.

GPT-4 shows MORE dark patterns than GPT-3.5

GPT-4
2.3%
flag rate
78,018 turns
GPT-3.5
1.6%
flag rate
202,241 turns

📊 Statistical Significance

Chi-squared test: p < 10⁻³⁶
This is not random chance. The more capable model exhibits significantly more manipulation patterns.
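The chi-squared comparison above can be reproduced with only the standard library. The flag counts below are approximated from the reported rates (2.3% of 78,018 and 1.6% of 202,241), so the exact statistic and p-value may differ slightly from the paper's:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 contingency table
    [[a, b], [c, d]] plus its 1-df p-value, using the identity
    sf(x; df=1) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Flag counts reconstructed from the reported rates -- approximate,
# since the exact counts are not shown on the slide.
gpt4_flagged = round(0.023 * 78_018)    # ~1,794 of 78,018 GPT-4 turns
gpt35_flagged = round(0.016 * 202_241)  # ~3,236 of 202,241 GPT-3.5 turns
stat, p = chi2_2x2(gpt4_flagged, 78_018 - gpt4_flagged,
                   gpt35_flagged, 202_241 - gpt35_flagged)
print(f"chi2 = {stat:.1f}, p = {p:.1e}")
```

Even with rounded counts, the p-value lands far below any conventional significance level, matching the "not random chance" claim.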

This counterintuitive finding has important implications:

🧠
Better at Human Mimicry

GPT-4 is better at mimicking human rapport-building behaviors, which can include manipulation patterns.

⚡
RLHF Side Effects

Optimizing for "helpfulness" may inadvertently train models to be more agreeable and flattering.

🎯
Safety Implications

As models become more capable, we need stronger oversight, not weaker.

Dark patterns ESCALATE in longer conversations

Turns 1-5
1.3% (baseline)
Turns 6-10
1.8% (+38%)
Turns 11-20
2.2% (+69%)
Turns 20+
2.6% (+100%)

📈 Regression Analysis

p = 0.0035: under a null hypothesis of no trend, an escalation this strong would appear by chance only 0.35% of the time.
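For illustration, a least-squares fit over the four bucketed rates above. The bucket midpoints are assumptions (the open-ended "20+" bucket is pegged at 25 turns), and the published p = 0.0035 came from the full per-turn data, not these four points:

```python
# Least-squares slope of flag rate vs. conversation depth, fit on the
# bucketed rates from the slide.
depth = [3, 8, 15.5, 25]     # assumed bucket midpoints (turns)
rate = [1.3, 1.8, 2.2, 2.6]  # flag rate per bucket (%)

n = len(depth)
mx = sum(depth) / n
my = sum(rate) / n
slope = sum((x - mx) * (y - my) for x, y in zip(depth, rate)) \
        / sum((x - mx) ** 2 for x in depth)
intercept = my - slope * mx
print(f"rate ~ {intercept:.2f} + {slope:.3f} * depth")
```

The positive slope (roughly +0.06 percentage points of flag rate per additional turn) is the escalation the regression formalizes.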

Not all patterns escalate equally:

Sycophancy
+42%
Sneaking
+20%
Retention
+18%
Harmful Gen
-41%

Key insight: Sycophancy builds over time as the AI "learns" what users want to hear. Harmful content appears early (jailbreak attempts) or not at all.

Roleplay topics show 5x higher manipulation rates

Character RP
4.14%
Technical
2.77%
User Help
1.64%
Coding
0.08%

We clustered 98,713 conversations into 10 topics using sentence embeddings + KMeans.
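A minimal sketch of the clustering step: a from-scratch Lloyd's-algorithm KMeans run on toy 2-D blobs that stand in for sentence embeddings (the actual pipeline used a sentence-embedding model and presumably a library KMeans such as scikit-learn's):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: alternate nearest-center assignment
    and center-mean updates. Stand-in for a library KMeans."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k), key=lambda j: sum(
            (p - c) ** 2 for p, c in zip(pt, centers[j]))) for pt in points]
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels

# Toy 2-D "embeddings": two well-separated blobs standing in for, say,
# roleplay vs. coding conversations in embedding space.
random.seed(42)
blob_a = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(20)]
blob_b = [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(20)]
labels = kmeans(blob_a + blob_b, k=2)
```

On real data the same loop runs over ~100K high-dimensional embedding vectors with k=10 instead of toy 2-D points with k=2.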

Our findings suggest a monitoring framework:

🔴
HIGH ALERT: Character/Roleplay

Flag all conversations. 5x standard review rate.

🟡
ESCALATION WATCH: User Help

Alert when turn count > 10. Sample for review.

🟢
STANDARD: Coding/Technical

Lightweight automated monitoring only.

Threshold: Flag responses with classifier confidence > 0.8 for human review.
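The tiers above can be sketched as a single triage function. The topic labels and return values are hypothetical; only the 0.8 confidence threshold and the turn-count rule come from the framework itself:

```python
def review_action(topic: str, turn_count: int, confidence: float) -> str:
    """Triage a flagged response per the monitoring framework.
    Illustrative sketch, not project code."""
    if confidence <= 0.8:
        return "log-only"          # below threshold: no human review
    if topic in {"character_rp", "roleplay"}:
        return "flag-for-review"   # high-alert tier: review everything
    if topic == "user_help" and turn_count > 10:
        return "sample-for-review" # escalation watch past turn 10
    return "automated-monitoring"  # coding/technical: lightweight tier

print(review_action("roleplay", 3, 0.92))   # flag-for-review
print(review_action("coding", 30, 0.95))    # automated-monitoring
```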

Does DarkBench predict real-world behavior?

0.08
JS Divergence
0.33
KL Divergence

📊 Category Gaps

DarkBench over-represents: sneaking, retention
Reality shows more harmful_generation than expected

Implication: Benchmark prevalence doesn't predict real-world prevalence. We need ecological validity in AI safety research.
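Both divergences take a few lines of standard-library Python. The distributions below are made-up placeholders to exercise the functions, not the actual DarkBench/WildChat category mixes behind JS = 0.08 and KL = 0.33:

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2,
    defined via the midpoint distribution m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical 6-category distributions (benchmark vs. observed).
bench = [0.25, 0.25, 0.20, 0.10, 0.15, 0.05]
wild  = [0.30, 0.20, 0.10, 0.10, 0.10, 0.20]
print(f"JS = {js(bench, wild):.3f}, KL = {kl(wild, bench):.3f}")
```

Note the slide doesn't state the logarithm base; the values here are in nats.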

How do we detect dark patterns at scale?

📥
Ingest
→
🤖
Judge
→
🎓
Train
→
🔍
Classify
→
📊
Analyze
78.7%
Accuracy
5K
samples/min

The AI safety community needs frontier model data

WildChat contains GPT-3.5/GPT-4 from 2023-2024. We have NO equivalent dataset for Claude, Gemini, or GPT-4o.

Why this matters:

  • Frontier models may exhibit different patterns due to updated RLHF
  • New capabilities (multimodal, agentic) may introduce novel dark patterns
  • Cross-provider analysis is impossible without data

📢 Our Call to Action

AI labs should collaborate on creating an updated, multi-model conversation dataset with proper consent.

โš ๏ธ We Detect Markers, Not Intent

An AI saying "I feel" isn't "trying" to manipulate. Our findings should not be interpreted as evidence of intentional deception.

Limitations:

  • 78.7% accuracy still means ~21% of labels are wrong
  • English-only training data
  • ChatGPT only (no Claude, Gemini)
  • Context may justify some patterns

DarkPatternMonitor is fully open source:

# Clone and run
git clone https://github.com/luiscosio/DarkPatternMonitor.git
cd DarkPatternMonitor

# Launch dashboard
streamlit run app/streamlit_app.py
🚀 View on GitHub

โš ๏ธ This Tool Could Be Misused

Detection techniques could potentially train more manipulative AI systems.

Our mitigations:

  • Detection-only focus (no generation tools)
  • Aggregated reporting (no playbooks)
  • Confidence thresholds (clear cases only)

Intended uses: ✅ AI safety research, ✅ Model evaluation

Prohibited: ❌ User surveillance, ❌ Training manipulative AI