Engineering Knowledge Base
Anti-Pattern Database
Documented AI agent failures and anti-patterns from Zenpower's production sessions. Searchable reference for humans and AI agents.
163 Anti-Patterns, 110 Sessions, 8 Categories, 21k+ Tool Calls
Database & Data Loss
#68
Critical
Wiping Docker Data Directories
NEVER delete, recreate, or chown .docker-data directories. Postgres, Redis, and other volumes are LIVE DATA. Use fix-permissions.sh if permissions are wrong — don't recreate.
#69
Critical
rsync --delete Without Proper Excludes
CATASTROPHIC: rsync --delete deletes any file in the destination that is not in the source. Always exclude .docker-data, backups, and acme.json. The pattern docker-data does NOT match .docker-data (missing dot).
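The missing-dot pitfall is easy to reproduce. The sketch below uses Python's fnmatch as a rough stand-in for matching a literal exclude pattern against a directory name; this is an assumption for illustration only, since rsync's own filter rules are richer. The helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def is_excluded(name: str, patterns: List[str]) -> bool:
    """Return True if a directory name matches any exclude pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# Pattern list with the leading dot forgotten, and the corrected list.
wrong = ["docker-data", "backups", "acme.json"]
right = [".docker-data", "backups", "acme.json"]

print(is_excluded(".docker-data", wrong))  # the live data directory is NOT protected
print(is_excluded(".docker-data", right))  # the live data directory is protected
```

A check like this does not replace a dry run; it only shows why a near-miss pattern silently protects nothing.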
#71
Critical
Using Raw rsync for Zenpower Sync
NEVER run rsync manually. Use /opt/zenpower/scripts/safe-sync.sh, which has verified exclude patterns, dry-run preview, and safety checks. rsync caused the 2026-01-23 data loss.
#21
High
ALTER TABLE Instead of Migration
Always create proper Alembic migration files. Running ALTER TABLE directly against the live database bypasses the migration system and breaks alembic_version tracking.
#13
High
Using alembic stamp Instead of upgrade
alembic stamp marks a migration as applied WITHOUT running the SQL. Always use alembic upgrade head. Stamp is only for resolving state divergence, never as a substitute for running migrations.
#14
High
Not Verifying Migrations Actually Ran
After any migration, check that the DB has the expected columns and tables. alembic upgrade head reporting success does not guarantee all SQL executed without error.
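A post-migration check can be as small as comparing the expected column names with what the database reports. The sketch below uses an in-memory SQLite database purely as a stand-in for the real Postgres instance; the agents table and its columns are hypothetical.

```python
import sqlite3
from typing import Set


def missing_columns(conn: sqlite3.Connection, table: str, expected: Set[str]) -> Set[str]:
    """Return the expected column names that the table does not have."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    actual = {row[1] for row in rows}
    return expected - actual


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT)")

# Pretend a migration was supposed to add a "tier" column but did not.
print(missing_columns(conn, "agents", {"id", "name", "tier"}))
```

An empty result means every expected column exists; anything else means the migration did not fully apply.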
#22
High
Manual alembic_version UPDATE
Manually updating the alembic_version table has the same effect as alembic stamp: it marks the migration as applied without running the SQL. Never do this.
#72
High
Single Backup Location
Always back up to BOTH /opt/zenpower/backups/ AND /home/zenpower/backups/. The 2026-01-23 incident deleted all backups alongside the data. Two locations means one survives.
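The two-location rule can be sketched as a helper that copies one backup file into every destination and verifies each copy. The function name backup_to_all and the scratch directories are hypothetical; in production the destinations would be the two backup paths named in this entry.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def backup_to_all(source: Path, destinations: List[Path]) -> List[Path]:
    """Copy one backup file into every destination and verify each copy."""
    copies = []
    for dest_dir in destinations:
        dest_dir.mkdir(parents=True, exist_ok=True)
        target = dest_dir / source.name
        shutil.copy2(source, target)
        # A size mismatch means the copy cannot be trusted.
        if target.stat().st_size != source.stat().st_size:
            raise OSError(f"backup copy incomplete: {target}")
        copies.append(target)
    return copies


# Demo in a scratch directory so nothing real is touched.
root = Path(tempfile.mkdtemp())
dump = root / "db.dump"
dump.write_bytes(b"example backup contents")
copies = backup_to_all(dump, [root / "primary", root / "secondary"])
print([str(c) for c in copies])
```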
#AP
Medium
Alembic JSON Columns Wrapped with json.dumps()
SQLAlchemy Column(JSON) handles serialization automatically. Passing json.dumps({...}) double-encodes: the DB stores a JSON string containing escaped JSON. Fix: use plain Python dicts in migrations.
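The double-encoding effect can be shown with the standard library alone. In this sketch, one json.dumps call stands in for the serialization a JSON column performs on write (an assumption for illustration); the payload is hypothetical.

```python
import json

payload = {"tier": "economy", "limits": {"rpm": 60}}

# Correct: hand the column a plain dict; it is serialized exactly once.
stored_correct = json.dumps(payload)

# Bug: the caller already ran json.dumps, so the column serializes a string.
stored_double = json.dumps(json.dumps(payload))

loaded_correct = json.loads(stored_correct)
loaded_double = json.loads(stored_double)

print(type(loaded_correct).__name__)  # a dict comes back
print(type(loaded_double).__name__)   # a str comes back, still JSON-encoded
```

Reading the double-encoded value back yields a string, so every consumer would need a second json.loads to reach the data.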
#AP
Medium
Using metadata as a Column Name
metadata is a reserved attribute name on SQLAlchemy declarative models. Fix: extra_metadata = Column("metadata", JSON, ...), which uses a different Python attribute name with the same DB column name.
Docker & Infrastructure
#58
Critical
Installing Storage-Heavy Software Without Checking Disk
CRITICAL: Run df -h and check software disk requirements BEFORE installing. Bitcoin blockchain = 500GB+. Ethereum = 2TB+. This filled the server disk to 0 bytes overnight (2026-01-17), making the system unusable. ASK FIRST.
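The same pre-install check can be scripted. This is a minimal sketch using shutil.disk_usage; the helper name has_room and the reserve figure are assumptions, and the size requirement should come from the software's own documentation.

```python
import shutil

GB = 1024 ** 3


def has_room(path: str, required_bytes: int, reserve_bytes: int = 0) -> bool:
    """Return True if path's filesystem can hold required_bytes and still keep a reserve free."""
    free = shutil.disk_usage(path).free
    return free - reserve_bytes >= required_bytes


# Example: would a 500 GB dataset fit while keeping 20 GB free for the rest of the system?
print(has_room(".", 500 * GB, reserve_bytes=20 * GB))
```

A False result is the cue to stop and ask before installing.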
#80
Critical
Log Files Without Rotation
CRITICAL: process-supervisor.js used fs.appendFileSync without rotation; the log grew to 133GB and filled the 290GB server disk. ALWAYS set up logrotate or use a logging library with rotation when deploying services.
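As an illustration of "a logging library with rotation", the sketch below uses Python's RotatingFileHandler. The incident involved a Node.js service, so this shows the idea in a different language rather than the actual fix; the sizes and logger name are arbitrary.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "supervisor.log")

# Cap each file at 10 kB and keep two old files: disk use is bounded at about 30 kB.
handler = RotatingFileHandler(log_path, maxBytes=10_000, backupCount=2)
logger = logging.getLogger("supervisor_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Write far more than the cap; old data is rotated out instead of accumulating.
for i in range(5_000):
    logger.info("heartbeat %d", i)
handler.close()

files = sorted(os.listdir(log_dir))
total = sum(os.path.getsize(os.path.join(log_dir, f)) for f in files)
print(files, total)
```

The total stays bounded no matter how long the service runs, which is the property the unrotated append lacked.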
#3
High
Manual docker run Commands
NEVER use docker run directly. Always use docker compose --env-file /etc/zenpower/compose.env. Manual docker run broke Traefik, causing 502 errors. Use compose only.
#41
High
Container Restart Doesn't Apply New Env Vars
docker restart does NOT re-read env vars from compose.env. Use docker compose up -d --force-recreate to apply environment variable changes.
#63
High
Creating Files as Root Without Fixing Permissions
After ANY root operations in /opt/zenpower, run bash /opt/zenpower/scripts/fix-permissions.sh. NEVER use chown -R blindly: this breaks Docker volumes, acme.json, and postgres data directories.
#64
High
Not Checking Expected vs Actual Container Count
Run docker ps --format "{{.Names}}" | wc -l and compare against COMPOSE_PROFILES. Missing containers always means missing profiles; don't assume things are running.
#AP
High
Traefik File Watcher Unreliable
ALWAYS run docker compose restart traefik after adding routes to dynamic.d/. The Traefik file provider watcher does not reliably pick up new or modified files.
#42
Medium
Docker Context Set to Rootless But Daemon Not Running
Check docker context ls. The zenpower user may default to a rootless context; if the rootless daemon is not running, switch to the default context for root operations.
#73
Medium
No Pre-Flight Verification for Destructive Commands
ALWAYS dry-run first. For rsync: rsync -n --delete. For rm: ls the path first. For docker: docker compose config before up. Every destructive command needs a verification step.
#77
Medium
Next.js in Alpine Resolves localhost to [::1]
Alpine's /etc/hosts maps localhost to IPv6 [::1]. Health checks using localhost fail. Use 127.0.0.1 explicitly in health check commands and internal service URLs.
#76
Medium
Using LangFuse v3 (Requires ClickHouse)
LangFuse v3 requires ClickHouse as a dependency. For PostgreSQL-only deployments, always pin to langfuse/langfuse:2. Check release notes before upgrading major versions of any data service.
Authentication & Security
#37
Critical
Committing Files With API Keys or Secrets
ALWAYS check for secrets before commit. Never bypass pre-commit secret detection. --no-verify is only valid if the pre-commit hook has a false positive; verify manually first.
#43
Critical
Exposing API Keys in Command Output
NEVER echo or export API keys in visible commands. Use source /etc/zenpower/compose.env or set -a && source file && set +a to load secrets without exposing them.
#51
High
SameSite=Strict Cookies Causing Redirect Loops
Use SameSite=Lax for auth cookies that need to work across subdomain redirects. SameSite=Strict breaks the SIWE login flow because the cookie is not sent on cross-site navigations.
#78
High
Forward-Auth Blocks All Requests Including Health Checks
Services behind forward-auth middleware reject unauthenticated health probes. Plan auth bypass for monitoring endpoints (/healthz, /api/public/health) or use container-internal health checks on 127.0.0.1.
#39
High
DNS resolv.conf Pointing to Inactive Resolver
If systemd-resolved is inactive, remove 127.0.0.1 from /etc/resolv.conf and use direct nameservers. Broken DNS silently breaks all container networking and service discovery.
#AP
Medium
JSON Config Columns Without field_validator
Unvalidated JSON config = potential config injection. tiers.economy was exploitable to force opus model selection. Always use @field_validator with a tier-model check and unknown-key rejection on all JSON config columns.
#AP
Medium
Docker Bypasses UFW Firewall Rules
UFW INPUT rules don't affect Docker-exposed ports. Add DROP rules to the DOCKER-USER iptables chain. Persist with /etc/docker/docker-user-iptables.sh + docker-iptables.service.
Agent Communication
#AP
Critical
POST /instruct Is NOT Idempotent — Never Call Twice
Every /instruct call fires the agent, costs money, logs to DB, and sends a Discord notification. Save the response to /tmp/{agent}_response.json, then read the file. Never re-POST to re-read results. Session-17: called 4 agents twice = $0.23 wasted.
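The save-then-read pattern can be sketched as a wrapper that only issues the call when no saved response exists. The function instruct_once and the fake_post stand-in are hypothetical; a real caller would pass a function that performs the single POST.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def instruct_once(cache_path: Path, post: Callable[[], dict]) -> dict:
    """Call post() only if no saved response exists; otherwise read the file."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    response = post()  # the one and only billable call
    cache_path.write_text(json.dumps(response))
    return response


calls = []


def fake_post() -> dict:
    """Stand-in for the real request; records how often it is invoked."""
    calls.append(1)
    return {"agent": "example", "status": "ok"}


cache_file = Path(tempfile.mkdtemp()) / "example_response.json"
first = instruct_once(cache_file, fake_post)
second = instruct_once(cache_file, fake_post)  # served from the file, no second call
print(len(calls))
```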
#118
High
P2 ACK Timeout Exceeded (300s Max)
P2 ACK timeout is 300 seconds maximum. Check inbox every 60 seconds. Session-18: context compaction killed the inbox loop resulting in a 36-minute P2 ACK gap. Build a dead-man switch so another agent takes over when root goes dark.
#88
High
No Watchdog for Orchestrator Context Compaction
When root's context compacts, the inbox loop dies and the entire team stalls. Pattern: if root goes dark for more than 5 minutes, another agent should take over coordination. Session-20: 36-min P2 ACK gap from compaction.
#89
High
Agent Sessions Dying Without Auto-Recovery
dev-wsl sessions die every 15-20 min with no auto-restart. Work for 15 min, session dies, 25+ min dark, root declares dead. Need: heartbeat-based session health check + auto-restart mechanism.
#87
High
Launching Long Test Runs Without Estimating Duration
NEVER run a test suite longer than 60 seconds without explicit root/CEO approval. Estimate duration BEFORE running. Include the estimate in the ACK. Ask "is this run worth $X?" before launching.
#AP
Medium
Fixed /tmp Paths in Parallel-Invoked Scripts
In bash, $$ expands to the PID of the invoking shell and keeps that value inside subshells, so it is NOT unique per background subshell. Use mktemp for temp files in any script that may be invoked in parallel. The agent_bridge.sh mktemp fix resolved data races in all 4 paths: zp_msg, zp_mem, zp_brain, zp_ack.
#AP
Medium
Staff Agent Findings Without Verification
Agents hallucinate file names and line numbers. Cipher fabricated "line 109 / X-Auth-Tier header" — it doesn't exist. ALWAYS grep the codebase to confirm ANY agent finding before acting on it.
Code Quality
#AP
High
Module Shadowing
services/email.py shadows the stdlib email module. services/auth/ shadowed services/auth.py (renamed to services/staff_auth/). ALWAYS ls the target directory before creating new packages to spot conflicts.
#AP
High
Pydantic v2: model_config as Field Name
model_config is reserved in Pydantic v2. Use model_routing instead. Never name a Pydantic field with a framework-reserved attribute name — the error is silent and confusing.
#AP
High
Python Operator Precedence: and Before or
A and B or C evaluates as (A and B) or C, so a truthy C makes the whole expression truthy even when A is false. Use explicit parentheses. This pattern has caused multiple silent logic bugs across the codebase.
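A small example makes the difference concrete. The function and parameter names are hypothetical; only the grouping matters.

```python
def can_deploy_buggy(is_admin: bool, tests_pass: bool, force: bool) -> bool:
    # Parsed as (is_admin and tests_pass) or force: force alone is enough.
    return is_admin and tests_pass or force


def can_deploy_fixed(is_admin: bool, tests_pass: bool, force: bool) -> bool:
    # Explicit grouping: an admin is always required.
    return is_admin and (tests_pass or force)


# A non-admin passing force=True slips through the buggy version only.
print(can_deploy_buggy(False, False, True))
print(can_deploy_fixed(False, False, True))
```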
#AP
High
Inline Import Inside Async Function — UnboundLocalError Trap
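If import asyncio appears at module level AND again inside a function body, the inline import makes asyncio a local name for the whole function, so any use of asyncio before that inline import raises UnboundLocalError. Fix: remove all inline imports once a module-level import exists for that symbol.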
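The trap can be reproduced in a few lines. The function names are hypothetical; the behavior follows from Python treating any name bound inside a function, including by an import statement, as local to that whole function.

```python
import asyncio


async def broken() -> None:
    # asyncio is local to this function because of the import below,
    # so this line fails before the local name has been bound.
    await asyncio.sleep(0)
    import asyncio  # inline import that shadows the module-level one


async def fixed() -> None:
    # No inline import: the module-level asyncio is used.
    await asyncio.sleep(0)


def run_broken() -> str:
    """Run broken() and report which exception type it raised, if any."""
    try:
        asyncio.run(broken())
    except UnboundLocalError as exc:
        return type(exc).__name__
    return "ok"


print(run_broken())
asyncio.run(fixed())
```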
#28
Medium
Mixing Sync/Async Patterns in Background Jobs
For FastAPI cron endpoints: inline the async code; don't call a sync wrapper that uses asyncio.run(). Calling asyncio.run() inside an already-running event loop raises RuntimeError: asyncio.run() cannot be called from a running event loop.
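The failure and the fix can be sketched without FastAPI, since any coroutine already runs inside an event loop. All function names here are hypothetical stand-ins for a cron endpoint and its helper.

```python
import asyncio


async def do_work() -> str:
    await asyncio.sleep(0)
    return "done"


def sync_wrapper() -> str:
    """Sync helper that starts its own event loop; unsafe to call from async code."""
    coro = do_work()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # demo only: avoids a "never awaited" warning
        raise


async def endpoint_broken() -> str:
    # Already inside a running loop, so the nested asyncio.run() is rejected.
    try:
        return sync_wrapper()
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


async def endpoint_fixed() -> str:
    # Inline the async code: simply await it.
    return await do_work()


print(asyncio.run(endpoint_broken()))
print(asyncio.run(endpoint_fixed()))
```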
#35
Medium
Conditional FastAPI Dependencies at Module Load
db: T = Depends(x) if COND else None fails with Pydantic 2.11. Use proper optional dependency functions (Annotated[Optional[T], Depends(get_db_optional)]) instead of runtime conditionals in parameter defaults.
#20
Medium
Using snake_case in MCP Schemas
The MCP specification uses camelCase: inputSchema, not input_schema. Incorrect casing causes silent schema rejection. Always check the MCP spec before writing tool definitions.
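A lint-style check can catch the casing mistake before a tool definition ships. This sketch only inspects top-level keys, on the assumption that property names inside the schema itself are user-defined; the tool shown and the helper names are hypothetical.

```python
import re
from typing import List

SNAKE_CASE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")


def snake_case_keys(tool: dict) -> List[str]:
    """Return top-level keys of a tool definition that are written in snake_case."""
    return [key for key in tool if SNAKE_CASE.match(key)]


def to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


bad_tool = {
    "name": "search_docs",
    "description": "Example tool",
    "input_schema": {"type": "object", "properties": {}},
}

print(snake_case_keys(bad_tool))
print(to_camel("input_schema"))
```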
#AP
Medium
OpenRouter Model ID Slash Stripping
OpenRouter model IDs are org/model (e.g., deepseek/deepseek-chat-v3-0324). The _strip_provider_prefix function must preserve slashes. Never strip the org prefix from OpenRouter model IDs.
Deployment
#65
Critical
Claiming "Done" Without Ecosystem Tests Passing
ALWAYS run /opt/zenpower/scripts/ecosystem-tests/run.sh --quick before claiming completion. 667 passed + 6 failed = NOT DONE. Fix failures first. "Done" means ecosystem tests PASS.
#11
High
Deploy Without Verify
ALWAYS curl and verify live endpoints BEFORE committing. Tests that pass locally can fail in production due to env vars, volumes, or networking differences. The 3-second curl check beats a 15-minute test suite.
#15
High
Running 15-Min Test Suite Before 3-Sec Live Check
curl the endpoint FIRST (3 seconds), THEN run tests if the curl fails or the issue isn't clear. Never launch the full test suite when a simple live check would surface the issue immediately and save 15 minutes.
#33
High
Secrets Files Out of Sync
/opt/zenpower/.env.secrets and /etc/zenpower/.env.secrets can diverge after manual edits. Sync or symlink them. Divergence causes silent auth failures and API credential mismatches.
#57
High
Using Stale Stats From Previous Versions
Landing pages had 145 tools but the actual count was 389. ALWAYS curl the live endpoint to get current counts before updating documentation or landing pages. Never copy stats from old sources.
#26
Medium
Checking Git Tags for Version Instead of compose.env
APP_VERSION in /etc/zenpower/compose.env is the authoritative deployed version. Git tags may lag or diverge. Always check compose.env to know what version is actually running in production.
#83
Medium
DEV Assuming Production Server State
DEV writes code and tests. ROOT verifies on production. Never claim "verified working" from a DEV machine — always send a back-challenge for ROOT to verify live state. DEV cannot see container health, disk, or env vars.
#AP
Medium
Not Using deploy.sh for Deployments
Use /opt/zenpower/tools/ops/deploy.sh service [--force] for ALL application deploys. Fall back to direct compose only if deploy.sh fails. Raw docker compose up skips pre/post-deploy hooks.
Cost & Budget
#1
High
Calling External APIs Repeatedly Without Caching
Cache, throttle, and ask before expensive external API calls. Alchemy API spam wasted credits in multiple sessions. Never call a paid external API in a loop without explicit user approval.
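One way to keep a paid API out of a hot loop is a small time-to-live cache in front of the call. This is a minimal sketch, not the project's implementation; the names cached and fake_fetch and the 60-second TTL are assumptions.

```python
import time
from typing import Any, Callable, Dict, Tuple


def cached(ttl_seconds: float, fetch: Callable[[str], Any]) -> Callable[[str], Any]:
    """Wrap fetch so repeated lookups of the same key within the TTL reuse the result."""
    store: Dict[str, Tuple[float, Any]] = {}

    def get(key: str) -> Any:
        now = time.monotonic()
        hit = store.get(key)
        if hit is not None and now - hit[0] < ttl_seconds:
            return hit[1]  # fresh enough: no external call
        value = fetch(key)
        store[key] = (now, value)
        return value

    return get


api_calls = []


def fake_fetch(key: str) -> str:
    """Stand-in for a paid external API; records every real call."""
    api_calls.append(key)
    return f"result for {key}"


get_balance = cached(60.0, fake_fetch)
for _ in range(100):
    get_balance("wallet-1")  # 100 lookups, one real call
print(len(api_calls))
```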
#17
High
Re-Running Timed-Out Test Suite Without Fixing Issue
If the test suite timed out, FIX THE ISSUE first. Don't retry blindly. A timeout indicates a hung test or infrastructure problem, not a transient failure. Retrying wastes compute and blocks coordination.
#AP
High
Claude Model IDs — Wrong Suffix Format
claude-opus-4-6 and claude-sonnet-4-6 have NO dated suffix. claude-sonnet-4-6-20250619 returns 404. Haiku still requires a date suffix: claude-haiku-4-5-20251001. Verify model IDs before deploying agents.
#6
Medium
Excessive grep on Large Codebases
grep -r on /opt/zenpower spiked server load from 5 to 32. Use the targeted Grep tool or delegate to off-peak. Never broad-search the production filesystem during business hours.
#9
Medium
Verbose Responses — Token Waste
Be surgical and concise. Long explanations, unnecessary confirmations, and repeating file contents all waste tokens and user attention. Act, don't explain. Show results, not process.
#AP
Medium
Rebuilding Services Multiple Times Unnecessarily
Rebuilt the landing container 3x in one session from using the wrong compose file. Always check the image name and compose file before building. One correct build beats three trial-and-error builds.
General
#24
Critical
Not Reading CLAUDE_FAILURES.md at Session Start
Read /opt/zenpower/docs/CLAUDE_FAILURES.md FIRST before making any infrastructure changes. This file contains 163 anti-patterns from 110 sessions. Skipping this guarantees repeating known failures.
#4
High
Edit Without Read
Always use the Read tool before any Edit. Editing a file without reading it first leads to mismatched context strings, dropped lines, and failed tool calls. This is the single most common failure mode.
#5
High
Guessing File Paths
Use ls or find to verify paths exist before referencing them in tool calls. Guessing paths wastes tool call budget and produces confusing errors that mask the actual problem.
#50
High
Not Running Tests in the Correct venv
Always source /opt/zenpower/.venv/bin/activate before running pytest. Running tests outside the venv picks up system packages and produces misleading import errors that hide the actual issue.
#47
High
Redesigning Systems Instead of Debugging
When the user reports "X doesn't work", DEBUG X. Don't propose replacing the whole system. The cost of a redesign is always higher than a targeted fix, and you probably haven't found the root cause yet.
#46
High
Claiming Completion Without Full Verification
Test the ACTUAL user flow, not just curl endpoints. Previous agent claimed "ZenCursor built and deployed to both repos" — no artifacts existed. Verify the artifact, the deployment, and the user-visible result before claiming done.
#54
Medium
Making Changes Without Committing
Commit changes with git commit in /opt/zenpower and push with git push origin main. Uncommitted changes are lost on the next git pull --rebase. The workflow is: edit → commit → deploy → verify → push.
#AP
Medium
Bash ! in Passwords — Never Inline
The CEO password contains !. Write the JSON to /tmp first with python3 -c 'import json; ...', then send it with curl -d @/tmp/file.json. Or use the api_auth.sh helper script. Inlining the password causes bash history expansion failures.
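The write-to-file step might look like the sketch below. The payload field names, the helper name, and the sample password are all hypothetical; mkstemp is used so the file gets a unique name and owner-only permissions.

```python
import json
import os
import tempfile


def write_login_payload(username: str, password: str) -> str:
    """Write credentials to a private temp file and return its path for curl -d @path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as handle:
        # json.dump handles quoting, so "!" never passes through the shell.
        json.dump({"username": username, "password": password}, handle)
    return path


payload_path = write_login_payload("example-user", "s3cret!pass")
print(payload_path)
```

The file should be deleted once the request has been sent.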
#AP
Medium
Skipping git pull --rebase at Session Start
cd /opt/zenpower && git pull --rebase origin main before any changes. Session-17: skipped this step, resulting in a 4-commit divergence and rebase conflicts in all modified files.
This page is auto-updated as new patterns are discovered. Source: docs/CLAUDE_FAILURES.md, 163 anti-patterns from 110 sessions as of 2026-02-26.