Many ESP32 projects run perfectly on a developer desk and collapse once deployed in apartments, offices, or industrial spaces with noisy Wi-Fi conditions. The main mistake is treating every disconnect as an exceptional event. In production, intermittent link issues are normal. Firmware should absorb them predictably.
1. Failure modes to design for
I model Wi-Fi failures in four classes:
- short RF drop (seconds)
- prolonged AP unavailability (minutes)
- DHCP or DNS instability
- credential or roaming mismatch
Each class needs a different recovery response. Using one generic reconnect loop is usually not enough.
2. Connection state machine
A robust ESP32 client should use explicit states:
BOOTWIFI_CONNECTINGWIFI_ONLINEWIFI_DEGRADEDOFFLINE_BUFFERING
Transitions should be driven by events and timers, not only callback side effects. This keeps behavior debuggable.
3. Backoff without reboot loops
Avoid immediate full-device resets after repeated failures. First apply reconnect attempts with exponential backoff and jitter:
- attempts 1-3: short delay
- attempts 4-10: medium delay
- attempts >10: long delay and reduced network activity
Only perform controlled restart after a clearly defined threshold and with a reboot reason log.
4. Offline buffering policy
Telemetry should not be dropped immediately when offline. Keep a bounded ring buffer for latest payloads:
- max entries by RAM budget
- payload compaction for repeated metrics
- include enqueue timestamp for staleness filtering
When connectivity returns, flush oldest-first with rate limiting.
5. Timeouts and watchdog strategy
Network calls must always have explicit timeout boundaries. Blocking forever on socket operations eventually deadlocks higher-level tasks.
Use watchdog supervision for the main loop and monitor task liveness counters. A watchdog should recover true stalls, not mask bad network logic.
6. Health signals for operators
Expose internal health indicators:
- current Wi-Fi RSSI
- reconnect count in current hour
- queue depth
- last successful publish timestamp
This allows remote diagnosis before users report outages.
7. Power interaction
Weak power rails often look like Wi-Fi instability. Brown-outs during TX bursts can mimic random disconnects. Add brown-out counter and measure voltage under peak transmission.
If battery-powered, align upload schedule with power budget and radio duty cycle.
8. Test matrix before deployment
Validate at minimum:
- AP reboot while device running
- DHCP server unavailable
- intermittent packet loss
- wrong credential fallback behavior
- long offline period with buffered data
Most reconnection bugs are found only with these induced failures.
Final note
ESP32 reliability comes from controlled degradation and measured recovery. A good firmware keeps operating locally, buffers data safely, and reconnects without panicking the whole device.