WPEngine Monitoring

Overview

Automated monitoring of 228 production WordPress sites across all WPEngine installs. Backup, SSL, disk, and bandwidth metrics are collected every 2 hours, and a separate alert job posts threshold breaches to Slack.

Quick Start

# Check monitoring status
tail -100 ~/logs/flypilot-wpengine.log

# Check alerts
tail -100 ~/logs/flypilot-wpengine-alerts.log

# View recent metrics in Supabase (assumes SUPABASE_DB_URL is exported in your shell)
psql "$SUPABASE_DB_URL" -c "SELECT * FROM v_wpengine_health_summary LIMIT 10;"

What's Being Monitored

228 Production Sites

Monitoring covers:

  • OMG Interactive sites - Property management sites across multiple clients
  • Learned Media sites - Client WordPress installations
  • FLYPILOT sites - ShopSanel, SPS Fitness, SHAI USA, etc.

4 Metric Types

| Metric | Fields | Update Frequency |
|---|---|---|
| Backup | status, size_mb, hours_since | Every 2 hours |
| SSL | issuer, valid_to, days_remaining | Every 2 hours |
| Disk | used_mb, total_mb, percent_used | Every 2 hours |
| Bandwidth | used_gb, included_gb, percent_used | Every 2 hours |

Health Thresholds

Backup Monitoring

| Status | Condition | Action |
|---|---|---|
| 🔴 Critical | Backup failed | Slack alert immediately |
| ⚠️ Warning | >48 hours since last backup | Slack alert |
| ✅ Healthy | Backup within 48 hours | No action |

SSL Monitoring

| Status | Condition | Action |
|---|---|---|
| 🔴 Critical | ≤7 days until expiry | Slack alert immediately |
| ⚠️ Warning | ≤30 days until expiry | Slack alert |
| ✅ Healthy | >30 days until expiry | No action |

Disk Monitoring

| Status | Condition | Action |
|---|---|---|
| 🔴 Critical | ≥90% disk used | Slack alert immediately |
| ⚠️ Warning | ≥75% disk used | Slack alert |
| ✅ Healthy | <75% disk used | No action |

Bandwidth Monitoring

| Status | Condition | Action |
|---|---|---|
| 🔴 Critical | ≥90% bandwidth used | Slack alert immediately |
| ⚠️ Warning | ≥75% bandwidth used | Slack alert |
| ✅ Healthy | <75% bandwidth used | No action |
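
The threshold tables above can be read as a small set of classification rules. The following is a minimal sketch of those rules, not the code in scripts/wpengine_alerts.py; the function names and the "failed" status string are assumptions made for illustration.

```python
# Illustrative only: function names and the "failed" status value are hypothetical.
CRITICAL, WARNING, HEALTHY = "critical", "warning", "healthy"


def classify_backup(status: str, hours_since: float) -> str:
    """Failed backup is critical; more than 48 hours since the last one is a warning."""
    if status == "failed":
        return CRITICAL
    if hours_since > 48:
        return WARNING
    return HEALTHY


def classify_ssl(days_remaining: int) -> str:
    """7 days or fewer is critical; 30 days or fewer is a warning."""
    if days_remaining <= 7:
        return CRITICAL
    if days_remaining <= 30:
        return WARNING
    return HEALTHY


def classify_usage(percent_used: float) -> str:
    """Shared rule for disk and bandwidth: 90% and up is critical, 75% and up is a warning."""
    if percent_used >= 90:
        return CRITICAL
    if percent_used >= 75:
        return WARNING
    return HEALTHY
```

Critical is checked first in each function, so a value that satisfies both conditions (for example 5 days of SSL validity) is reported once, at the higher severity.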

Cron Schedule

Metrics Collection

# Every 2 hours
0 */2 * * * cd ~/GitHub/flypilot && source .venv/bin/activate && python scripts/sync_wpengine_metrics.py

What it does:

  1. Connects to WPEngine API
  2. Fetches metrics for all 228 sites
  3. Stores in Supabase wpengine_metrics table
  4. Updates last_synced_at timestamp

Alert Processing

# Every 15 minutes from 07:00 through 20:45, Mon-Fri (the hour field 7-20 includes the 8pm hour)
*/15 7-20 * * 1-5 cd ~/GitHub/flypilot && source .venv/bin/activate && python scripts/wpengine_alerts.py

What it does:

  1. Queries wpengine_metrics table
  2. Evaluates health thresholds
  3. Sends Slack alerts to #tech-alerts for issues
  4. Logs all activity to ~/logs/flypilot-wpengine-alerts.log
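
To check when the two cron entries above fire, the schedule fields can be expressed as simple predicates. This is a sketch for reasoning about the schedule only; the function names are hypothetical, and cron itself evaluates these fields in the host's local time.

```python
from datetime import datetime


def in_collection_schedule(ts: datetime) -> bool:
    """Mirrors `0 */2 * * *`: minute 0 of every even hour, every day."""
    return ts.minute == 0 and ts.hour % 2 == 0


def in_alert_schedule(ts: datetime) -> bool:
    """Mirrors `*/15 7-20 * * 1-5`: every 15 minutes, hours 7 through 20, Mon-Fri."""
    is_weekday = ts.weekday() < 5  # Python: Monday=0 ... Friday=4
    return is_weekday and 7 <= ts.hour <= 20 and ts.minute % 15 == 0
```

Because the hour range is inclusive, the last alert run of the day is at 20:45, not 20:00.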

Slack Alerts

All alerts are sent to the #tech-alerts channel.

Message format:

🔴 CRITICAL: SSL Certificate Expiring

Site: clientsite.com
Install: omg-client-1
Days Remaining: 5
Expires: 2026-02-08

Action Required: Renew SSL certificate immediately

Alert types:

  • 🔴 Critical - Requires immediate action
  • ⚠️ Warning - Requires attention soon
  • ✅ Resolved - Issue has been fixed
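
The message format shown above could be assembled along these lines. This is a sketch of the text layout only: `format_alert` and `ICONS` are hypothetical names, and the step that posts the message to Slack is deliberately left out.

```python
# Hypothetical helper; only the text layout is taken from the example message.
ICONS = {"CRITICAL": "🔴", "WARNING": "⚠️", "RESOLVED": "✅"}


def format_alert(severity: str, title: str, fields: dict, action: str = "") -> str:
    """Build the alert text: header line, blank line, field lines, optional action."""
    lines = [f"{ICONS[severity]} {severity}: {title}", ""]
    lines += [f"{label}: {value}" for label, value in fields.items()]
    if action:
        lines += ["", f"Action Required: {action}"]
    return "\n".join(lines)


message = format_alert(
    "CRITICAL",
    "SSL Certificate Expiring",
    {
        "Site": "clientsite.com",
        "Install": "omg-client-1",
        "Days Remaining": 5,
        "Expires": "2026-02-08",
    },
    action="Renew SSL certificate immediately",
)
print(message)
```

Keeping the formatting separate from delivery makes it possible to inspect the exact text in a dry run before anything reaches the channel.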

Supabase Dashboard Views

Five pre-built views are available for querying metrics:

v_wpengine_health_summary

Overall health score (0-100) per site based on all metrics.

SELECT * FROM v_wpengine_health_summary
WHERE health_score < 70
ORDER BY health_score ASC;

Health score calculation:

  • Backup status: 40% weight
  • SSL validity: 30% weight
  • Disk usage: 20% weight
  • Bandwidth usage: 10% weight
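
As a worked example of the weighting above, the sketch below combines four per-metric scores into one 0-100 value. It assumes each metric has already been reduced to its own 0-100 component score; how the view derives those component scores is not described here, so that part is an assumption, as are the names `WEIGHTS` and `health_score`.

```python
from typing import Mapping

# Weights in percent, taken from the list above.
WEIGHTS = {"backup": 40, "ssl": 30, "disk": 20, "bandwidth": 10}


def health_score(component_scores: Mapping[str, float]) -> float:
    """Weighted average of per-metric scores, each assumed to be on a 0-100 scale."""
    weighted_sum = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
    return weighted_sum / 100


# A site that is healthy everywhere except SSL, which scores 50:
# (40*100 + 30*50 + 20*100 + 10*100) / 100 = 85.0
print(health_score({"backup": 100, "ssl": 50, "disk": 100, "bandwidth": 100}))
```

With these weights, a failed backup alone costs a site up to 40 points, which is enough to push it below the `health_score < 70` filter used in the query above.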

v_wpengine_backup_status

Backup-specific metrics and status.

SELECT * FROM v_wpengine_backup_status
WHERE hours_since_last_backup > 48
ORDER BY hours_since_last_backup DESC;

v_wpengine_ssl_status

SSL certificate metrics and expiry tracking.

SELECT * FROM v_wpengine_ssl_status
WHERE days_remaining <= 30
ORDER BY days_remaining ASC;

v_wpengine_disk_status

Disk usage metrics and capacity tracking.

SELECT * FROM v_wpengine_disk_status
WHERE percent_used >= 75
ORDER BY percent_used DESC;

v_wpengine_bandwidth_status

Bandwidth usage metrics and overage tracking.

SELECT * FROM v_wpengine_bandwidth_status
WHERE percent_used >= 75
ORDER BY percent_used DESC;

Manual Operations

Run Metrics Collection Manually

cd ~/GitHub/flypilot
source .venv/bin/activate

# Collect metrics for all sites
python scripts/sync_wpengine_metrics.py

# Check logs
tail -50 ~/logs/flypilot-wpengine.log

Run Alert Check Manually

cd ~/GitHub/flypilot
source .venv/bin/activate

# Dry-run (no Slack alerts sent)
python scripts/wpengine_alerts.py --dry-run -v

# Run for real (sends Slack alerts)
python scripts/wpengine_alerts.py

Test Alert Filtering

# See what alerts would be triggered (-E enables portable alternation with |)
python scripts/wpengine_alerts.py --dry-run -v | grep -E -A 5 "Critical|Warning"

# Count alerts by severity
python scripts/wpengine_alerts.py --dry-run -v | grep -c "Critical"
python scripts/wpengine_alerts.py --dry-run -v | grep -c "Warning"

Phase 1 Deployment Status

Status: ✅ Production (deployed 2026-02-03)

What's Live:

  • ✅ Metrics collection (backup, SSL, disk, bandwidth)
  • ✅ Cron job running every 2 hours
  • ✅ Data syncing to Supabase
  • ✅ Dashboard views created
  • ✅ Alert system deployed
  • ✅ Slack notifications working

Completion: 100%


Future Enhancements (Phase 2-3)

Phase 2: Enhanced Monitoring

  • Performance metrics (page load times, TTFB)
  • Uptime monitoring (ping checks)
  • PHP error log monitoring
  • Plugin update notifications

Phase 3: Automation

  • Automatic SSL renewal reminders
  • Backup verification testing
  • Disk cleanup recommendations
  • Bandwidth optimization suggestions

Troubleshooting

"No metrics collected"

# Check the cron entries are installed
crontab -l | grep wpengine

# Check logs for errors
tail -100 ~/logs/flypilot-wpengine.log | grep -i error

# Test API connection
python scripts/sync_wpengine_metrics.py

"Alerts not showing in Slack"

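# Verify the Slack bot has been added to the #tech-alerts channel
# Check the Slack bot token is set (without printing the secret)
[ -n "$SLACK_BOT_TOKEN" ] && echo "SLACK_BOT_TOKEN is set" || echo "SLACK_BOT_TOKEN is missing"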

# Test alert system
python scripts/wpengine_alerts.py --dry-run -v

"Too many SSL alerts"

# Check SSL filtering in script
python scripts/wpengine_alerts.py --dry-run -v | grep -A 5 "SSL"

# Expected: fewer than 10 SSL alerts per run
# If more, review the SSL thresholds (ssl_warning_days, ssl_critical_days) in scripts/wpengine_alerts.py

"Metrics out of date"

# Check last collection time
psql "$SUPABASE_DB_URL" -c "SELECT MAX(last_synced_at) FROM wpengine_metrics;"

# Should be within 2 hours of current time
# If older, check cron and logs
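
The freshness rule above (latest sync within 2 hours) can also be checked programmatically once the timestamp has been fetched. A minimal sketch, assuming `last_synced_at` comes back as a timezone-aware value; the function name `is_stale` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    last_synced_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=2),
) -> bool:
    """True when the newest sync is older than the collection interval."""
    return now - last_synced_at > max_age


now = datetime(2026, 2, 3, 12, 0, tzinfo=timezone.utc)
print(is_stale(datetime(2026, 2, 3, 10, 30, tzinfo=timezone.utc), now))  # 1.5 hours old
print(is_stale(datetime(2026, 2, 3, 9, 0, tzinfo=timezone.utc), now))    # 3 hours old
```

Both arguments must be either timezone-aware or naive; Python raises a TypeError when the two kinds are mixed in a subtraction.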

Configuration

Environment Variables

Required in ~/.config/flypilot/automation.env:

WPENGINE_USER_ID="your-wpengine-user-id"
WPENGINE_PASSWORD="your-wpengine-password"
SLACK_BOT_TOKEN="xoxb-your-slack-bot-token"
SUPABASE_URL="https://your-project.supabase.co"
SUPABASE_SERVICE_ROLE_KEY="your-service-role-key"
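
Before a run, it can help to confirm that every required variable is present in the env file. The sketch below parses simple KEY="value" lines and reports missing names; it is an illustration only and does not describe how the scripts themselves load the file. The names `parse_env` and `missing_vars` are hypothetical.

```python
REQUIRED = (
    "WPENGINE_USER_ID",
    "WPENGINE_PASSWORD",
    "SLACK_BOT_TOKEN",
    "SUPABASE_URL",
    "SUPABASE_SERVICE_ROLE_KEY",
)


def parse_env(text: str) -> dict:
    """Parse KEY="value" lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: dict) -> list:
    """Return required names that are absent or empty, in the documented order."""
    return [name for name in REQUIRED if not values.get(name)]
```

To check the real file, pass its contents to `parse_env`, for example via `pathlib.Path("~/.config/flypilot/automation.env").expanduser().read_text()`, and print only the result of `missing_vars` so that no secret values reach the terminal.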

Alert Thresholds

Thresholds are configurable in scripts/wpengine_alerts.py:

THRESHOLDS = {
    'backup_warning_hours': 48,
    'ssl_critical_days': 7,
    'ssl_warning_days': 30,
    'disk_critical_percent': 90,
    'disk_warning_percent': 75,
    'bandwidth_critical_percent': 90,
    'bandwidth_warning_percent': 75,
}