Engineering Knowledge Base

Anti-Pattern Database

Documented AI agent failures and anti-patterns from Zenpower's production sessions. Searchable reference for humans and AI agents.

163 Anti-Patterns
110 Sessions
8 Categories
21k+ Tool Calls

Database & Data Loss

#68
Wiping Docker Data Directories
NEVER delete, recreate, or chown .docker-data directories. Postgres, Redis, and other volumes are LIVE DATA. Use fix-permissions.sh if permissions are wrong — don't recreate.
Critical
#69
rsync --delete Without Proper Excludes
CATASTROPHIC: rsync --delete deletes any file in destination not in source. Always exclude .docker-data, backups, acme.json. The pattern docker-data does NOT match .docker-data (missing dot).
Critical
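The missing-dot pitfall in the entry above can be caught before any sync runs. The sketch below is only a stand-in: it assumes Python's fnmatch behaves like rsync for wildcard-free exclude patterns compared against a single file or directory name. It is not rsync's actual matcher, and unprotected() is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Names that must never be deleted by a sync (taken from the entry above).
PROTECTED = [".docker-data", "backups", "acme.json"]


def unprotected(excludes, protected=PROTECTED):
    """Return protected names that no exclude pattern matches."""
    return [
        name
        for name in protected
        if not any(fnmatchcase(name, pattern) for pattern in excludes)
    ]


# "docker-data" lacks the leading dot, so ".docker-data" stays exposed.
print(unprotected(["docker-data", "backups", "acme.json"]))
print(unprotected([".docker-data", "backups", "acme.json"]))
```

An empty result is a necessary check, not a sufficient one; the rsync dry run remains the real verification.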
#71
Using Raw rsync for Zenpower Sync
NEVER run rsync manually. Use /opt/zenpower/scripts/safe-sync.sh which has verified exclude patterns, dry-run preview, and safety checks. rsync caused the 2026-01-23 data loss.
Critical
#21
ALTER TABLE Instead of Migration
Always create proper Alembic migration files. Running ALTER TABLE directly against the live database bypasses the migration system and breaks alembic_version tracking.
High
#13
Using alembic stamp Instead of upgrade
alembic stamp marks a migration as applied WITHOUT running the SQL. Always use alembic upgrade head. Stamp is only for resolving state divergence, never as a substitute for running migrations.
High
#14
Not Verifying Migrations Actually Ran
After any migration, check the DB has the expected columns and tables. alembic upgrade head reporting success does not guarantee all SQL executed without error.
High
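A post-migration check for the entry above could look roughly like this. The sketch uses an in-memory SQLite database as a stand-in so it runs anywhere; against Postgres the same idea would query information_schema.columns instead. The agents table and its columns are hypothetical.

```python
import sqlite3


def missing_columns(conn, table, expected):
    """Return expected column names that are absent from the table."""
    # The table name is interpolated directly: pass trusted names only.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}  # row[1] is the column name
    return sorted(set(expected) - present)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT)")

# The migration was supposed to add "tier"; the check shows it did not.
print(missing_columns(conn, "agents", ["id", "name", "tier"]))
```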
#22
Manual alembic_version UPDATE
Manually updating the alembic_version table has the same effect as alembic stamp: it marks the migration as applied without running the SQL. Never do this.
High
#72
Single Backup Location
Always backup to BOTH /opt/zenpower/backups/ AND /home/zenpower/backups/. The 2026-01-23 incident deleted all backups alongside the data. Two locations means one survives.
High
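The two-location rule above can be enforced in one helper instead of being left to memory. A minimal sketch, assuming backups are single files; backup_to_all() is a hypothetical name, and the directories are passed in by the caller.

```python
import shutil
from pathlib import Path


def backup_to_all(source, destinations):
    """Copy one backup file into every destination directory."""
    source = Path(source)
    copies = []
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)
        # Refuse to report success unless the copy has the same size.
        if target.stat().st_size != source.stat().st_size:
            raise OSError(f"incomplete copy: {target}")
        copies.append(target)
    return copies
```

Usage would be along the lines of backup_to_all(dump_path, ["/opt/zenpower/backups", "/home/zenpower/backups"]).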
#AP
Alembic JSON Columns Wrapped with json.dumps()
SQLAlchemy Column(JSON) handles serialization automatically. Passing json.dumps({...}) double-encodes — the DB stores a JSON string containing escaped JSON. Fix: use plain Python dicts in migrations.
Medium
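The double-encoding described above can be reproduced with the standard json module alone, without SQLAlchemy. The config dict is a made-up example.

```python
import json

config = {"tier": "economy", "retries": 3}

# What a JSON column does for you: serialize the dict once.
stored_ok = json.dumps(config)

# What happens when json.dumps() output is handed to a JSON column:
# the already-serialized string is serialized a second time.
stored_bad = json.dumps(json.dumps(config))

print(type(json.loads(stored_ok)).__name__)   # dict
print(type(json.loads(stored_bad)).__name__)  # str
```

Reading the double-encoded value back yields a string of JSON text rather than a dict, which is the symptom to look for in the database.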
#AP
Using metadata as a Column Name
metadata is reserved in SQLAlchemy. Fix: extra_metadata = Column("metadata", JSON, ...) — different Python attribute name, same DB column name.
Medium

Docker & Infrastructure

#58
Installing Storage-Heavy Software Without Checking Disk
CRITICAL: Run df -h and check software disk requirements BEFORE installing. Bitcoin blockchain = 500GB+. Ethereum = 2TB+. This filled the server disk to 0 bytes overnight (2026-01-17), making the system unusable. ASK FIRST.
Critical
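The df -h check above can also be scripted as a gate. A minimal sketch, assuming the installer knows the software's disk requirement in bytes; has_room() and the 20 GiB reserve are illustrative choices, not Zenpower policy.

```python
import shutil


def has_room(path, required_bytes, reserve_bytes=20 * 1024**3):
    """True if `path` can hold `required_bytes` and still keep a reserve free."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes + reserve_bytes


# Example: would a 500 GB download fit on the filesystem holding "."?
print(has_room(".", 500 * 10**9))
```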
#80
Log Files Without Rotation
CRITICAL: process-supervisor.js used fs.appendFileSync without rotation — log grew to 133GB and filled the 290GB server disk. ALWAYS set up logrotate or use a logging library with rotation when deploying services.
Critical
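The incident above involved a Node.js service; the same safeguard is sketched here in Python using the standard library's RotatingFileHandler, which caps total log size at roughly max_bytes * (backups + 1). The function name and the default limits are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(path, max_bytes=10 * 1024 * 1024, backups=5):
    """Return a logger that rotates `path` once it reaches max_bytes."""
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger
```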
#3
Manual docker run Commands
NEVER use docker run directly. Always use docker compose --env-file /etc/zenpower/compose.env. Manual docker run broke Traefik causing 502 errors. Use compose only.
High
#41
Container Restart Doesn't Apply New Env Vars
docker restart does NOT re-read env vars from compose.env. Use docker compose up -d --force-recreate to apply environment variable changes.
High
#63
Creating Files as Root Without Fixing Permissions
After ANY root operations in /opt/zenpower, run bash /opt/zenpower/scripts/fix-permissions.sh. NEVER use chown -R blindly — this breaks Docker volumes, acme.json, and postgres data directories.
High
#64
Not Checking Expected vs Actual Container Count
Run docker ps --format "{{.Names}}" | wc -l and compare against COMPOSE_PROFILES. Missing containers always means missing profiles; don't assume things are running.
High
#AP
Traefik File Watcher Unreliable
ALWAYS run docker compose restart traefik after adding routes to dynamic.d/. The Traefik file provider watcher does not reliably pick up new or modified files.
High
#42
Docker Context Set to Rootless But Daemon Not Running
Check docker context ls. The zenpower user may default to a rootless context; if the rootless daemon is not running, switch to the default context for root operations.
Medium
#73
No Pre-Flight Verification for Destructive Commands
ALWAYS dry-run first. For rsync: rsync -n --delete. For rm: ls the path first. For docker: docker compose config before up. Every destructive command needs a verification step.
Medium
#77
Next.js in Alpine Resolves localhost to [::1]
In Alpine-based images, localhost can resolve to the IPv6 address [::1], so health checks using localhost fail when the server only listens on IPv4. Use 127.0.0.1 explicitly in health check commands and internal service URLs.
Medium
#76
Using LangFuse v3 (Requires ClickHouse)
LangFuse v3 requires ClickHouse as a dependency. For PostgreSQL-only deployments, always pin to langfuse/langfuse:2. Check release notes before upgrading major versions of any data service.
Medium

Authentication & Security

#37
Committing Files With API Keys or Secrets
ALWAYS check for secrets before commit. Never bypass pre-commit secret detection. --no-verify is only valid if the pre-commit hook has a false positive — verify manually first.
Critical
#43
Exposing API Keys in Command Output
NEVER echo or export API keys in visible commands. Use source /etc/zenpower/compose.env or set -a && source file && set +a to load secrets without exposing them.
Critical
#51
SameSite=Strict Cookies Causing Redirect Loops
Use SameSite=Lax for auth cookies that need to work across subdomain redirects. SameSite=Strict breaks the SIWE login flow because the cookie is not sent on cross-origin navigations.
High
#78
Forward-Auth Blocks All Requests Including Health Checks
Services behind forward-auth middleware reject unauthenticated health probes. Plan auth bypass for monitoring endpoints (/healthz, /api/public/health) or use container-internal health checks on 127.0.0.1.
High
#39
DNS resolv.conf Pointing to Inactive Resolver
If systemd-resolved is inactive, remove 127.0.0.1 from /etc/resolv.conf and use direct nameservers. Broken DNS silently breaks all container networking and service discovery.
High
#AP
JSON Config Columns Without field_validator
Unvalidated JSON config = potential config injection. tiers.economy was exploitable to force opus model selection. Always use @field_validator with tier-model check and unknown key rejection on all JSON config columns.
Medium
#AP
Docker Bypasses UFW Firewall Rules
UFW INPUT rules don't affect Docker-exposed ports. Add DROP rules to the DOCKER-USER iptables chain. Persist with /etc/docker/docker-user-iptables.sh + docker-iptables.service.
Medium

Agent Communication

#AP
POST /instruct Is NOT Idempotent — Never Call Twice
Every /instruct call fires the agent, costs money, logs to DB, and sends a Discord notification. Save the response to /tmp/{agent}_response.json then read the file. Never re-POST to re-read results. Session-17: called 4 agents twice = $0.23 wasted.
Critical
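The save-then-read rule above can be wrapped so a second call cannot re-fire the agent. A minimal sketch: send is a hypothetical callable that performs the single POST and returns the parsed response, and instruct_once() is not part of any Zenpower tooling.

```python
import json
from pathlib import Path


def instruct_once(response_path, send):
    """Call `send` at most once; later calls re-read the saved response."""
    path = Path(response_path)
    if path.exists():
        return json.loads(path.read_text())
    response = send()  # the single, billable POST
    path.write_text(json.dumps(response))
    return response
```

The file path is the deduplication key, so each distinct instruction needs its own response file.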
#118
P2 ACK Timeout Exceeded (300s Max)
P2 ACK timeout is 300 seconds maximum. Check inbox every 60 seconds. Session-18: context compaction killed the inbox loop resulting in a 36-minute P2 ACK gap. Build a dead-man switch so another agent takes over when root goes dark.
High
#88
No Watchdog for Orchestrator Context Compaction
When root's context compacts, the inbox loop dies and the entire team stalls. Pattern: if root goes dark for more than 5 minutes, another agent should take over coordination. Session-20: 36-min P2 ACK gap from compaction.
High
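The takeover rule above reduces to a single timestamp comparison. A minimal sketch, assuming the orchestrator records a heartbeat as a Unix timestamp somewhere the other agents can read; should_take_over() is a hypothetical name.

```python
import time

TAKEOVER_AFTER_SECONDS = 300  # "more than 5 minutes" from the entry above


def should_take_over(last_heartbeat, now=None, limit=TAKEOVER_AFTER_SECONDS):
    """True when the orchestrator has been silent for longer than `limit`."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > limit
```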
#89
Agent Sessions Dying Without Auto-Recovery
dev-wsl sessions die every 15-20 min with no auto-restart. The observed cycle: work for 15 min, session dies, 25+ min dark, root declares the agent dead. Needed: a heartbeat-based session health check plus an auto-restart mechanism.
High
#87
Launching Long Test Runs Without Estimating Duration
NEVER run a test suite longer than 60 seconds without explicit root/CEO approval. Estimate duration BEFORE running. Include the estimate in the ACK. Ask "is this run worth $X?" before launching.
High
#AP
Fixed /tmp Paths in Parallel-Invoked Scripts
In bash, $$ expands to the PID of the top-level shell, so it is NOT unique per background subshell ($BASHPID is). Use mktemp for temp files in any script that may be invoked in parallel. The agent_bridge.sh mktemp fix resolved data races in all 4 paths: zp_msg, zp_mem, zp_brain, zp_ack.
Medium
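The entry above concerns bash; for Python helpers that run in parallel, the equivalent of mktemp is tempfile.mkstemp, which creates the file atomically under a unique name. A minimal sketch; the zp_msg_ prefix is borrowed from the entry purely for illustration.

```python
import os
import tempfile


def new_scratch_file(prefix="zp_msg_"):
    """Create a uniquely named temp file and return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".json")
    os.close(fd)  # caller reopens by path; the file already exists
    return path
```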
#AP
Staff Agent Findings Without Verification
Agents hallucinate file names and line numbers. Cipher fabricated "line 109 / X-Auth-Tier header" — it doesn't exist. ALWAYS grep the codebase to confirm ANY agent finding before acting on it.
Medium

Code Quality

#AP
Module Shadowing
services/email.py shadows the stdlib email module. services/auth/ shadowed services/auth.py (renamed to services/staff_auth/). ALWAYS ls the target directory before creating new packages to spot conflicts.
High
#AP
Pydantic v2: model_config as Field Name
model_config is reserved in Pydantic v2. Use model_routing instead. Never name a Pydantic field with a framework-reserved attribute name — the error is silent and confusing.
High
#AP
Python Operator Precedence: and Before or
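A and B or C evaluates as (A and B) or C, so C decides the result whenever A and B is falsy, even when A itself is false. Use explicit parentheses, e.g. A and (B or C), when that is the intent. This pattern has caused multiple silent logic bugs across the codebase.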
High
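A concrete case makes the precedence bug above visible. The permission check below is an invented example, not code from the Zenpower codebase.

```python
def allowed_buggy(is_admin, is_active, is_owner):
    # Parsed as (is_admin and is_active) or is_owner.
    return is_admin and is_active or is_owner


def allowed_fixed(is_admin, is_active, is_owner):
    # Intent: an admin who is either active or the owner.
    return is_admin and (is_active or is_owner)


# A non-admin owner slips through the unparenthesized version.
print(allowed_buggy(False, False, True))  # True
print(allowed_fixed(False, False, True))  # False
```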
#AP
Inline Import Inside Async Function — UnboundLocalError Trap
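An import asyncio inside a function body makes asyncio a local name for that entire function, even when the module also imports it at the top. Any use of asyncio before the inline import line then raises UnboundLocalError. Fix: remove all inline imports once a module-level import exists for that symbol.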
High
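The trap above is easiest to see in a reduced form. In this sketch the inline import sits after the first use of asyncio inside the coroutine, which is the arrangement that triggers the error; the function names are illustrative.

```python
import asyncio


async def broken():
    await asyncio.sleep(0)  # fails: asyncio is a local name in this function
    import asyncio  # this line makes the name local for the whole body


async def fixed():
    await asyncio.sleep(0)  # uses the module-level import
    return "ok"


def demo():
    """Run the broken coroutine and report the exception type it raises."""
    try:
        asyncio.run(broken())
    except UnboundLocalError as exc:
        return type(exc).__name__
    return None


print(demo())
print(asyncio.run(fixed()))
```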
#28
Mixing Sync/Async Patterns in Background Jobs
For FastAPI cron endpoints: inline the async code, don't call a sync wrapper that uses asyncio.run(). Calling asyncio.run() inside an already-running event loop raises RuntimeError: asyncio.run() cannot be called from a running event loop.
Medium
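The failure above can be reproduced without FastAPI, since any coroutine already runs inside an event loop. A minimal sketch; do_work(), sync_wrapper() and the two endpoint functions are placeholders for the real job code.

```python
import asyncio


async def do_work():
    return "done"


def sync_wrapper():
    """Fine from sync code; an anti-pattern when called from async code."""
    coro = do_work()
    try:
        return asyncio.run(coro)
    finally:
        coro.close()  # avoids a "never awaited" warning if run() refused to start


async def endpoint_bad():
    try:
        return sync_wrapper()  # tries to start a second event loop
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


async def endpoint_good():
    return await do_work()  # inline the async code instead


print(asyncio.run(endpoint_bad()))
print(asyncio.run(endpoint_good()))
```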
#35
Conditional FastAPI Dependencies at Module Load
db: T = Depends(x) if COND else None fails with Pydantic 2.11. Use proper optional dependency functions (Annotated[Optional[T], Depends(get_db_optional)]) instead of runtime conditionals in parameter defaults.
Medium
#20
Using snake_case in MCP Schemas
The MCP specification uses camelCase: inputSchema, not input_schema. Incorrect casing causes silent schema rejection. Always check the MCP spec before writing tool definitions.
Medium
#AP
OpenRouter Model ID Slash Stripping
OpenRouter model IDs are org/model (e.g., deepseek/deepseek-chat-v3-0324). The _strip_provider_prefix function must preserve slashes. Never strip the org prefix from OpenRouter model IDs.
Medium
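One way to keep the org/model slash intact is to remove only a known leading routing prefix and nothing else. This is a sketch, not the project's actual _strip_provider_prefix; the "openrouter/" prefix is an assumption about how routed IDs are written.

```python
def strip_provider_prefix(model_id, provider_prefix="openrouter/"):
    """Remove only a leading routing prefix; keep the org/model slash intact."""
    if model_id.startswith(provider_prefix):
        return model_id[len(provider_prefix):]
    return model_id


print(strip_provider_prefix("openrouter/deepseek/deepseek-chat-v3-0324"))
print(strip_provider_prefix("deepseek/deepseek-chat-v3-0324"))
```

Splitting on "/" and keeping only the last segment is the variant to avoid, because it discards the org part.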

Deployment

#65
Claiming "Done" Without Ecosystem Tests Passing
ALWAYS run /opt/zenpower/scripts/ecosystem-tests/run.sh --quick before claiming completion. 667 passed + 6 failed = NOT DONE. Fix failures first. "Done" means ecosystem tests PASS.
Critical
#11
Deploy Without Verify
ALWAYS curl and verify live endpoints BEFORE committing. Tests that pass locally can fail in production due to env vars, volumes, or networking differences. The 3-second curl check beats a 15-minute test suite.
High
#15
Running 15-Min Test Suite Before 3-Sec Live Check
curl the endpoint FIRST (3 seconds), THEN run tests if the curl fails or the issue isn't clear. Never launch the full test suite when a simple live check would surface the issue immediately and save 15 minutes.
High
#33
Secrets Files Out of Sync
/opt/zenpower/.env.secrets and /etc/zenpower/.env.secrets can diverge after manual edits. Sync or symlink them. Divergence causes silent auth failures and API credential mismatches.
High
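Divergence between the two secrets files above can be detected without ever printing their contents by comparing digests. A minimal sketch; in_sync() is a hypothetical helper and the paths are supplied by the caller.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes; the contents are never printed."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def in_sync(path_a, path_b):
    """True when both files have byte-identical contents."""
    return file_digest(path_a) == file_digest(path_b)
```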
#57
Using Stale Stats From Previous Versions
Landing pages had 145 tools but the actual count was 389. ALWAYS curl the live endpoint to get current counts before updating documentation or landing pages. Never copy stats from old sources.
High
#26
Checking Git Tags for Version Instead of compose.env
APP_VERSION in /etc/zenpower/compose.env is the authoritative deployed version. Git tags may lag or diverge. Always check compose.env to know what version is actually running in production.
Medium
#83
DEV Assuming Production Server State
DEV writes code and tests. ROOT verifies on production. Never claim "verified working" from a DEV machine — always send a back-challenge for ROOT to verify live state. DEV cannot see container health, disk, or env vars.
Medium
#AP
Not Using deploy.sh for Deployments
Use /opt/zenpower/tools/ops/deploy.sh service [--force] for ALL application deploys. Fall back to direct compose only if deploy.sh fails. Raw docker compose up skips pre/post-deploy hooks.
Medium

Cost & Budget

#1
Calling External APIs Repeatedly Without Caching
Cache, throttle, and ask before expensive external API calls. Alchemy API spam wasted credits in multiple sessions. Never call a paid external API in a loop without explicit user approval.
High
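The caching half of the rule above can be as small as a time-limited dictionary in front of the paid call. A minimal sketch; TTLCache and the 300-second default are illustrative, and fetch stands in for whatever function actually calls the external API.

```python
import time


class TTLCache:
    """Remember results of an expensive call for `ttl` seconds."""

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._store = {}

    def get(self, key):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served from cache, no external call
        value = self._fetch(key)
        self._store[key] = (now, value)
        return value
```

Throttling and asking for approval are separate controls; the cache only prevents repeat calls for the same key.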
#17
Re-Running Timed-Out Test Suite Without Fixing Issue
If the test suite timed out, FIX THE ISSUE first. Don't retry blindly. A timeout indicates a hung test or infrastructure problem, not a transient failure. Retrying wastes compute and blocks coordination.
High
#AP
Claude Model IDs — Wrong Suffix Format
claude-opus-4-6 and claude-sonnet-4-6 have NO dated suffix. claude-sonnet-4-6-20250619 returns 404. Haiku still requires a date suffix: claude-haiku-4-5-20251001. Verify model IDs before deploying agents.
High
#6
Excessive grep on Large Codebases
grep -r on /opt/zenpower spiked server load from 5 to 32. Use the targeted Grep tool or defer broad searches to off-peak hours. Never broad-search the production filesystem during business hours.
Medium
#9
Verbose Responses — Token Waste
Be surgical and concise. Long explanations, unnecessary confirmations, and repeating file contents all waste tokens and user attention. Act, don't explain. Show results, not process.
Medium
#AP
Rebuilding Services Multiple Times Unnecessarily
The landing container was rebuilt 3x in one session because the wrong compose file was used. Always check the image name and compose file before building. One correct build beats three trial-and-error builds.
Medium

General

#24
Not Reading CLAUDE_FAILURES.md at Session Start
Read /opt/zenpower/docs/CLAUDE_FAILURES.md FIRST before making any infrastructure changes. This file contains 163 anti-patterns from 110 sessions. Skipping this guarantees repeating known failures.
Critical
#4
Edit Without Read
Always use the Read tool before any Edit. Editing a file without reading it first leads to mismatched context strings, dropped lines, and failed tool calls. This is the single most common failure mode.
High
#5
Guessing File Paths
Use ls or find to verify paths exist before referencing them in tool calls. Guessing paths wastes tool call budget and produces confusing errors that mask the actual problem.
High
#50
Not Running Tests in the Correct venv
Always source /opt/zenpower/.venv/bin/activate before running pytest. Running tests outside the venv picks up system packages and produces misleading import errors that hide the actual issue.
High
#47
Redesigning Systems Instead of Debugging
When the user reports "X doesn't work", DEBUG X. Don't propose replacing the whole system. The cost of a redesign is always higher than a targeted fix, and you probably haven't found the root cause yet.
High
#46
Claiming Completion Without Full Verification
Test the ACTUAL user flow, not just curl endpoints. Previous agent claimed "ZenCursor built and deployed to both repos" — no artifacts existed. Verify the artifact, the deployment, and the user-visible result before claiming done.
High
#54
Making Changes Without Committing
Commit changes with git commit in /opt/zenpower and push with git push origin main. Uncommitted changes are lost on the next git pull --rebase. The workflow is: edit → commit → deploy → verify → push.
Medium
#AP
Bash ! in Passwords — Never Inline
The CEO password contains !. Write JSON to /tmp first: python3 -c 'import json; ...' then curl -d @/tmp/file.json. Or use the api_auth.sh helper script. Inlining causes bash history expansion failures.
Medium
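The write-JSON-first approach above might look like this when spelled out. A sketch only: the username and password field names are assumptions about the login payload, and write_login_payload() is a hypothetical helper.

```python
import json
import os
import tempfile


def write_login_payload(username, password):
    """Write credentials to a private temp file and return its path."""
    # mkstemp creates the file readable and writable by the owner only.
    fd, path = tempfile.mkstemp(prefix="login_", suffix=".json")
    with os.fdopen(fd, "w") as handle:
        json.dump({"username": username, "password": password}, handle)
    return path  # pass to curl as -d @<path>, then delete the file
```

Because the password never appears on a bash command line, history expansion of ! cannot touch it.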
#AP
Skipping git pull --rebase at Session Start
cd /opt/zenpower && git pull --rebase origin main before any changes. Session-17: skipped this step, resulting in a 4-commit divergence and rebase conflicts in all modified files.
Medium
This page is auto-updated as new patterns are discovered. Source: docs/CLAUDE_FAILURES.md — 163 anti-patterns from 110 sessions as of 2026-02-26.