Maintenance
Overview
Maintenance covers the day-to-day tasks that keep your DocumentDB cluster healthy and performant. Regular monitoring, log review, and proactive resource management prevent outages and help you catch issues before they affect applications.
Monitoring DocumentDB Cluster Health
DocumentDB Cluster Status
Check the overall health of your DocumentDB clusters:
# List all DocumentDB clusters and their status
kubectl get documentdb -n <namespace>
# Detailed cluster information
kubectl describe documentdb <cluster-name> -n <namespace>
| What to check | Normal | Investigate if |
|---|---|---|
| STATUS column | Cluster in healthy state | Any other status (e.g., Setting up primary, Creating replica) persists longer than a few minutes |
| AGE column | Consistent with deployment time | Unexpectedly recent, which may indicate an unplanned restart |
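To spot clusters that need attention without reading every row, you can filter the listing. The following is a minimal sketch: `unhealthy_clusters` is a hypothetical helper, and it assumes the text "Cluster in healthy state" appears verbatim on each healthy row of the `kubectl get documentdb` output.

```shell
# Print DocumentDB clusters whose status is anything other than healthy.
# Assumption: healthy rows contain the exact text "Cluster in healthy state".
# Reads `kubectl get documentdb --no-headers` rows on stdin.
unhealthy_clusters() {
  # grep exits non-zero when every row is healthy; treat that as success.
  grep -v 'Cluster in healthy state' || true
}

# Usage (needs cluster access):
#   kubectl get documentdb -n <namespace> --no-headers | unhealthy_clusters
```

An empty result means every listed cluster reported the healthy status at the time of the query.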
Pod Health
# Check pod status (each pod runs PostgreSQL + gateway sidecar)
kubectl get pods -n <namespace> -l app=<cluster-name>
# View pod resource usage
kubectl top pods -n <namespace>
| What to check | Normal | Investigate if |
|---|---|---|
| READY column | 2/2 (PostgreSQL container + gateway sidecar) | Less than 2/2: one or both containers are not ready |
| STATUS column | Running | CrashLoopBackOff, Error, Pending, or Init persisting beyond startup |
| RESTARTS column | 0 (or very low over the cluster lifetime) | High or rapidly increasing, which indicates repeated container crashes |
| Resource usage (kubectl top) | CPU and memory stable under normal workload | CPU consistently maxed out (throttling) or memory climbing steadily (OOMKill risk) |
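The READY, STATUS, and RESTARTS checks in the table can be applied to the `kubectl get pods` output automatically. This is a sketch rather than an official tool: `flag_unhealthy_pods` is a hypothetical helper, and the restart threshold (default 3) is an illustrative value you should tune for your environment.

```shell
# Flag pods that are not fully ready, not Running, or restarting often.
# Reads `kubectl get pods --no-headers` rows: NAME READY STATUS RESTARTS AGE.
# $1 is a hypothetical restart threshold (default 3).
flag_unhealthy_pods() {
  awk -v max_restarts="${1:-3}" '
    {
      split($2, ready, "/")   # READY looks like "1/2"
      if (ready[1] != ready[2])      { print $1 ": only " $2 " containers ready" }
      if ($3 != "Running")           { print $1 ": status " $3 }
      if ($4 + 0 > max_restarts + 0) { print $1 ": " $4 " restarts" }
    }'
}

# Usage (needs cluster access):
#   kubectl get pods -n <namespace> -l app=<cluster-name> --no-headers | flag_unhealthy_pods 5
```

A pod that fails several checks is reported once per failed check, so the output doubles as a short summary of what is wrong.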
Log Management
Tip
We recommend setting up centralized log collection as part of your observability strategy. See the telemetry playground for OpenTelemetry, Prometheus, and Grafana integration examples.
# Recent operator logs
kubectl logs -n documentdb-operator deployment/documentdb-operator --tail=100
# Follow operator logs in real time
kubectl logs -n documentdb-operator deployment/documentdb-operator -f
What's normal: Periodic reconciliation messages, successful backup notifications.
Investigate if: Repeated ERROR or WARNING lines, reconciliation failures, or stack traces appear.
Access PostgreSQL logs inside a specific pod:
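A minimal sketch of reading those logs with `kubectl logs`, assuming the PostgreSQL container is named `postgres` (the same container name used by the disk usage check in Storage Monitoring) and that PostgreSQL writes its log to the container's standard output. `pg_logs` is a hypothetical wrapper, not part of the operator.

```shell
# Show recent PostgreSQL log lines from one pod.
# Assumption: the PostgreSQL container is named "postgres".
# Arguments: pod name, namespace, optional number of lines (default 100).
pg_logs() {
  pod="$1"; namespace="$2"; lines="${3:-100}"
  kubectl logs "$pod" -n "$namespace" -c postgres --tail="$lines"
}

# Usage: pg_logs <pod-name> <namespace> 200
```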
What's normal: Startup messages, checkpoint completions, autovacuum activity.
Investigate if: FATAL, PANIC, or repeated ERROR entries appear. Watch for out of memory, no space left on device, or too many connections messages.
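To pull only the entries listed above out of a long log, you can filter it with `grep`. The helper below is a sketch: `scan_pg_log` is a hypothetical name, the pattern is a plain case-insensitive text match, and it does not try to detect "repeated" ERROR entries, which still need a manual look.

```shell
# Keep only log lines that mention the severities and messages to watch for.
# Reads log text on stdin; matching is case-insensitive.
scan_pg_log() {
  # grep exits non-zero when nothing matches; treat a clean log as success.
  grep -iE 'FATAL|PANIC|out of memory|no space left on device|too many connections' || true
}

# Usage (needs cluster access):
#   kubectl logs <pod-name> -n <namespace> -c postgres --tail=1000 | scan_pg_log
```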
Configuring Log Level
The spec.logLevel field controls the PostgreSQL instance log verbosity. It does not affect the DocumentDB operator or gateway logs.
Tip
For production deployments, use warning or error to reduce log volume. Reserve info or debug for troubleshooting.
Apply the change:
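One way to set the field without editing a manifest is a merge patch. This is a sketch under assumptions: `set_log_level` is a hypothetical wrapper, and it assumes the `documentdb` resource accepts a merge patch on `spec.logLevel` with one of the values mentioned in the tip above (`debug`, `info`, `warning`, `error`).

```shell
# Set spec.logLevel on a DocumentDB cluster with a merge patch.
# Arguments: cluster name, namespace, log level.
# Assumption: the level is one of debug, info, warning, error.
set_log_level() {
  cluster="$1"; namespace="$2"; level="$3"
  kubectl patch documentdb "$cluster" -n "$namespace" --type merge \
    -p "{\"spec\":{\"logLevel\":\"$level\"}}"
}

# Usage: set_log_level <cluster-name> <namespace> warning
```

If you manage the cluster declaratively, change `spec.logLevel` in the manifest instead so the next apply does not revert the patch.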
Resource Monitoring
# Pod resource consumption
kubectl top pods -n <namespace>
# Node resource consumption
kubectl top nodes
| What to check | Normal | Investigate if |
|---|---|---|
| Pod CPU usage | Varies with workload; no sustained spikes | Consistently maxed out — queries may be throttled |
| Pod memory usage | Stable and predictable | Climbing steadily or hitting node limits — pods may be OOMKilled. Check for memory-heavy queries. |
| Node resource usage | Enough headroom for pod scheduling and bursts | Nodes above 80% utilization — new pods may fail to schedule or existing pods may be evicted. |
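The 80% node utilization guideline can be checked against the `kubectl top nodes` output. A minimal sketch, assuming the usual column order (name, CPU cores, CPU percent, memory bytes, memory percent); `busy_nodes` is a hypothetical helper.

```shell
# List nodes whose CPU or memory utilization exceeds a limit (default 80%).
# Reads `kubectl top nodes --no-headers` rows:
#   NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
busy_nodes() {
  awk -v limit="${1:-80}" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent signs
      if (cpu + 0 > limit + 0 || mem + 0 > limit + 0) { print $1 " cpu=" cpu "% mem=" mem "%" }
    }'
}

# Usage (needs cluster access and a metrics server):
#   kubectl top nodes --no-headers | busy_nodes 80
```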
Storage Monitoring
Monitor persistent volume usage:
# Check PVC status and capacity
kubectl get pvc -n <namespace>
# Check actual disk usage inside a pod
kubectl exec -it <pod-name> -n <namespace> -c postgres -- df -h /var/lib/postgresql/data
| What to check | Normal | Investigate if |
|---|---|---|
| PVC STATUS | Bound | Pending: the storage class may not be able to provision a volume |
| Disk usage (df -h) | Below 70% of capacity | Above 80%: risk of the database halting when storage is full. Plan a migration to a larger volume. |
| Growth rate | Gradual and predictable | Sudden spikes — may indicate a bulk data load, excessive logging, or WAL accumulation |
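The disk usage threshold can be turned into a simple pass/fail check. This sketch swaps `df -h` for `df -P`, which prints one row per filesystem in the POSIX format and is easier to parse; it assumes the `df` in the container supports `-P`. `disk_usage_check` is a hypothetical helper.

```shell
# Report whether a volume is above a usage limit (default 80%).
# Reads `df -P <path>` output (a header line plus one row per filesystem):
#   Filesystem  1024-blocks  Used  Available  Capacity  Mounted-on
disk_usage_check() {
  awk -v limit="${1:-80}" '
    NR > 1 {
      pct = $5; sub(/%/, "", pct)   # Capacity column, e.g. "85%"
      if (pct + 0 > limit + 0) { print "WARN " $6 " at " pct "%" }
      else                     { print "OK " $6 " at " pct "%" }
    }'
}

# Usage (needs cluster access):
#   kubectl exec <pod-name> -n <namespace> -c postgres -- df -P /var/lib/postgresql/data | disk_usage_check 80
```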
Note
PVC resize is not currently supported but is planned for a future release. If storage usage approaches capacity, provision a new DocumentDB cluster with a larger pvcSize and restore from a backup. See Storage Configuration for details.
Events and Alerts
The operator emits Kubernetes events for significant state changes:
# View events for a DocumentDB cluster
kubectl get events -n <namespace> --field-selector involvedObject.name=<cluster-name>
# View all events in the namespace, sorted by time (most recent last)
kubectl get events -n <namespace> --sort-by=.lastTimestamp
Key events to watch for:
| Event | Meaning | Action |
|---|---|---|
| BackupSchedule | A scheduled backup created a Backup resource | No action needed. Verify periodically that backups are running on schedule. |
| BackupFailed | A backup failed | Investigate immediately. Check operator logs and storage configuration. Ensure your backup target is reachable. |
| InvalidSchedule | A ScheduledBackup has an invalid cron expression | Fix the spec.schedule field in your ScheduledBackup resource. |
| PVsRetained | PVs were retained after DocumentDB cluster deletion | Expected if reclaimPolicy: Retain. Clean up PVs manually if no longer needed. |
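To look for one of these events specifically, you can filter on the event reason. A minimal sketch, assuming the names in the Event column are emitted as the `reason` field of the Kubernetes event; `events_by_reason` is a hypothetical wrapper around `kubectl get events`.

```shell
# List events in a namespace that carry a given reason, most recent last.
# Assumption: the operator uses the table's event names as the event reason.
# Arguments: namespace, reason (e.g. BackupFailed).
events_by_reason() {
  namespace="$1"; reason="$2"
  kubectl get events -n "$namespace" \
    --field-selector "reason=$reason" --sort-by=.lastTimestamp
}

# Usage: events_by_reason <namespace> BackupFailed
```

Keep in mind that Kubernetes retains events only for a limited time, so an empty result does not prove that no backup ever failed.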