Maintenance

Overview

Maintenance covers the day-to-day tasks that keep your DocumentDB cluster healthy and performant. Regular monitoring, log review, and proactive resource management prevent outages and help you catch issues before they affect applications.

Monitoring DocumentDB Cluster Health

DocumentDB Cluster Status

Check the overall health of your DocumentDB clusters:

# List all DocumentDB clusters and their status
kubectl get documentdb -n <namespace>

# Detailed cluster information
kubectl describe documentdb <cluster-name> -n <namespace>

| What to check | Normal | Investigate if |
| --- | --- | --- |
| `STATUS` column | `Cluster in healthy state` | Any other status (e.g., `Setting up primary`, `Creating replica`) persists longer than a few minutes |
| `AGE` column | Consistent with deployment time | Unexpectedly recent, which may indicate an unplanned restart |
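As a minimal sketch of the status check above, the helper below filters the listing down to clusters whose row does not contain the healthy status string. The function name `unhealthy_clusters` is hypothetical, and the sketch assumes the healthy status text appears verbatim in the `kubectl get` output, as shown in the table.

```shell
# Hypothetical helper: reads `kubectl get documentdb --no-headers` output on
# stdin and prints only the rows that do NOT report the healthy status.
unhealthy_clusters() {
  # `|| true` keeps the exit status at 0 when every cluster is healthy.
  grep -v 'Cluster in healthy state' || true
}

# Usage (requires cluster access):
#   kubectl get documentdb -n <namespace> --no-headers | unhealthy_clusters
```

An empty result means every listed cluster reported the healthy state at the time of the query. A transitional status is only a concern if it persists, so rerun the check after a few minutes before investigating.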

Pod Health

# Check pod status (each pod runs PostgreSQL + gateway sidecar)
kubectl get pods -n <namespace> -l app=<cluster-name>

# View pod resource usage
kubectl top pods -n <namespace>

| What to check | Normal | Investigate if |
| --- | --- | --- |
| `READY` column | `2/2` (PostgreSQL container + gateway sidecar) | Less than `2/2`: one or both containers are not ready |
| `STATUS` column | `Running` | `CrashLoopBackOff`, `Error`, `Pending`, or `Init` persisting beyond startup |
| `RESTARTS` column | `0` (or very low over the cluster lifetime) | High or rapidly increasing, which indicates repeated container crashes |
| Resource usage (`kubectl top`) | CPU and memory stable under normal workload | CPU consistently maxed out (throttling) or memory climbing steadily (OOMKill risk) |
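The checks in this table can be automated with a small filter. The sketch below assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE); the function name `flag_unhealthy_pods` and the restart threshold are hypothetical and should be tuned for your environment.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints each pod that falls outside the "Normal" column, with the reason.
# $1 is an assumed restart threshold (default 3).
flag_unhealthy_pods() {
  max_restarts="${1:-3}"
  awk -v max="$max_restarts" '
    {
      split($2, ready, "/")          # READY looks like "2/2"
      reason = ""
      if (ready[1] != ready[2]) reason = reason " not-ready(" $2 ")"
      if ($3 != "Running")      reason = reason " status(" $3 ")"
      if ($4 + 0 > max)         reason = reason " restarts(" $4 ")"
      if (reason != "") print $1 ":" reason
    }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> -l app=<cluster-name> --no-headers | flag_unhealthy_pods 3
```

The filter reports a snapshot only. A pod that is still starting will be flagged as not ready, so treat the output as a prompt to look closer rather than as an alert by itself.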

Log Management

Tip

We recommend setting up centralized log collection as part of your observability strategy. See the telemetry playground for OpenTelemetry, Prometheus, and Grafana integration examples.

DocumentDB Operator Logs

# Recent operator logs
kubectl logs -n documentdb-operator deployment/documentdb-operator --tail=100

# Follow operator logs in real time
kubectl logs -n documentdb-operator deployment/documentdb-operator -f

What's normal: Periodic reconciliation messages, successful backup notifications.

Investigate if: Repeated ERROR or WARNING lines, reconciliation failures, or stack traces appear.

PostgreSQL Logs

Access PostgreSQL logs inside a specific pod:

kubectl exec -it <pod-name> -n <namespace> -c postgres -- \
  cat /controller/log/postgres

What's normal: Startup messages, checkpoint completions, autovacuum activity.

Investigate if: FATAL, PANIC, or repeated ERROR entries appear. Watch for out of memory, no space left on device, or too many connections messages.
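To surface only the entries worth investigating, the log output can be piped through a filter. This is a rough sketch: the function name `scan_postgres_log` is hypothetical, and the case-insensitive match is deliberately broad, so it may also catch benign lines that merely mention the word "error".

```shell
# Hypothetical helper: reads PostgreSQL log text on stdin and prints lines
# that contain a severity or message worth investigating.
scan_postgres_log() {
  # -i: case-insensitive, -E: extended regex.
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -iE 'FATAL|PANIC|ERROR|out of memory|no space left on device|too many connections' || true
}

# Usage (requires cluster access):
#   kubectl exec <pod-name> -n <namespace> -c postgres -- \
#     cat /controller/log/postgres | scan_postgres_log
```

No output means none of the patterns matched in the text that was scanned.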

Gateway Logs

Access gateway (sidecar) logs:

kubectl logs <pod-name> -n <namespace> -c documentdb-gateway

What's normal: Successful connection handling, startup messages.

Investigate if: Repeated connection refused errors, authentication failures, or TLS handshake errors appear.

Configuring Log Level

The spec.logLevel field controls the PostgreSQL instance log verbosity. It does not affect the DocumentDB operator or gateway logs.

spec:
  logLevel: "warning"  # Options: debug, info, warning, error

Tip

For production deployments, use warning or error to reduce log volume. Reserve info or debug for troubleshooting.

Apply the change:

kubectl apply -f documentdb.yaml
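If you prefer not to edit the manifest for a temporary change, one possible alternative is a merge patch on the live resource. The sketch below builds the patch body and rejects values outside the four options listed above. The function name `loglevel_patch` is hypothetical, and the sketch assumes the resource is addressable as `documentdb`, as in the `kubectl get documentdb` commands earlier on this page.

```shell
# Hypothetical helper: prints a JSON merge patch that sets spec.logLevel,
# or fails if the value is not one of the documented options.
loglevel_patch() {
  case "$1" in
    debug|info|warning|error)
      printf '{"spec":{"logLevel":"%s"}}\n' "$1"
      ;;
    *)
      echo "unsupported log level: $1" >&2
      return 1
      ;;
  esac
}

# Usage (requires cluster access):
#   kubectl patch documentdb <cluster-name> -n <namespace> \
#     --type merge -p "$(loglevel_patch debug)"
```

A patch applied this way is overwritten the next time the manifest is applied, which suits a short troubleshooting session where you want to return to `warning` afterwards.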

Resource Monitoring

# Pod resource consumption
kubectl top pods -n <namespace>

# Node resource consumption
kubectl top nodes

| What to check | Normal | Investigate if |
| --- | --- | --- |
| Pod CPU usage | Varies with workload; no sustained spikes | Consistently maxed out: queries may be throttled |
| Pod memory usage | Stable and predictable | Climbing steadily or hitting node limits: pods may be OOMKilled. Check for memory-heavy queries. |
| Node resource usage | Enough headroom for pod scheduling and bursts | Nodes above 80% utilization: new pods may fail to schedule or existing pods may be evicted. |
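The 80% node threshold from the table can be checked with a short filter. The sketch assumes the usual `kubectl top nodes` column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the function name `flag_busy_nodes` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` output on stdin
# and prints nodes whose CPU% or MEMORY% exceeds the threshold in $1
# (default 80, matching the table above).
flag_busy_nodes() {
  threshold="${1:-80}"
  awk -v t="$threshold" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the trailing percent sign
      if (cpu + 0 > t || mem + 0 > t) print $1 " cpu=" cpu "% mem=" mem "%"
    }'
}

# Usage (requires cluster access and a metrics source such as metrics-server):
#   kubectl top nodes --no-headers | flag_busy_nodes 80
```

Values that are not numeric are treated as zero by the arithmetic comparison, so nodes without metrics are silently skipped by this sketch.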

Storage Monitoring

Monitor persistent volume usage:

# Check PVC status and capacity
kubectl get pvc -n <namespace>

# Check actual disk usage inside a pod
kubectl exec -it <pod-name> -n <namespace> -c postgres -- df -h /var/lib/postgresql/data

| What to check | Normal | Investigate if |
| --- | --- | --- |
| PVC `STATUS` | `Bound` | `Pending`: the storage class may not be able to provision a volume |
| Disk usage (`df -h`) | Below 70% of capacity | Above 80%: risk of the database halting when storage is full. Plan a migration to a larger volume. |
| Growth rate | Gradual and predictable | Sudden spikes, which may indicate a bulk data load, excessive logging, or WAL accumulation |
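The disk usage threshold can be checked mechanically. The sketch below parses `df -P` output (the POSIX single-line format, which is easier to parse than `df -h`) and labels each filesystem against a threshold. The function name `check_disk_usage` is hypothetical, and the sketch assumes the mount point contains no spaces.

```shell
# Hypothetical helper: reads `df -P` output on stdin (header line included)
# and prints "<mount point> <use%> <state>" for each filesystem.
# $1 is the percentage above which to investigate (default 80).
check_disk_usage() {
  warn="${1:-80}"
  awk -v w="$warn" '
    NR > 1 {                       # skip the header line
      pct = $5; sub(/%/, "", pct)  # Capacity column, e.g. "85%"
      state = (pct + 0 > w) ? "INVESTIGATE" : "ok"
      print $6, pct "%", state
    }'
}

# Usage (requires cluster access):
#   kubectl exec <pod-name> -n <namespace> -c postgres -- \
#     df -P /var/lib/postgresql/data | check_disk_usage 80
```

Running this on a schedule and recording the percentage over time also gives a simple view of the growth rate.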

Note

PVC resize is not currently supported but is planned for a future release. If storage usage approaches capacity, provision a new DocumentDB cluster with larger pvcSize and restore from a backup. See Storage Configuration for details.

Events and Alerts

The operator emits Kubernetes events for significant state changes:

# View events for a DocumentDB cluster
kubectl get events -n <namespace> --field-selector involvedObject.name=<cluster-name>

# View all events in the namespace, sorted by time (most recent last)
kubectl get events -n <namespace> --sort-by=.lastTimestamp

Key events to watch for:

| Event | Meaning | Action |
| --- | --- | --- |
| `BackupSchedule` | A scheduled backup created a Backup resource | No action needed. Verify periodically that backups are running on schedule. |
| `BackupFailed` | A backup failed | Investigate immediately. Check operator logs and storage configuration. Ensure your backup target is reachable. |
| `InvalidSchedule` | A ScheduledBackup has an invalid cron expression | Fix the `spec.schedule` field in your ScheduledBackup resource. |
| `PVsRetained` | PVs were retained after DocumentDB cluster deletion | Expected if `reclaimPolicy: Retain`. Clean up PVs manually if no longer needed. |
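To pick out only the events that call for action, the event listing can be filtered by reason. The sketch assumes the default `kubectl get events` column order (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); the function name `actionable_events` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get events --no-headers` output on stdin
# and prints the reason and object for events that require follow-up.
actionable_events() {
  awk '$3 == "BackupFailed" || $3 == "InvalidSchedule" { print $3 ": " $4 }'
}

# Usage (requires cluster access):
#   kubectl get events -n <namespace> --sort-by=.lastTimestamp --no-headers | actionable_events
#
# A single reason can also be selected on the server side:
#   kubectl get events -n <namespace> --field-selector reason=BackupFailed
```

Kubernetes keeps events for a limited time only (one hour by default), so a check like this needs to run regularly, or events should be exported to your log pipeline.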