Many ESP32 projects run perfectly on a developer's desk and collapse once deployed in apartments, offices, or industrial spaces with noisy Wi-Fi conditions. The main mistake is treating every disconnect as an exceptional event. In production, intermittent link issues are normal, and firmware should absorb them predictably.

1. Failure modes to design for

I model Wi-Fi failures in four classes:

  • short RF drop (seconds)
  • prolonged AP unavailability (minutes)
  • DHCP or DNS instability
  • credential or roaming mismatch

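Each class needs a different recovery response. A single generic reconnect loop is usually not enough: fast retries suit a short RF drop, but they do nothing for a wrong password and only waste power while the AP is down for minutes.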
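
To make the "different response per class" idea concrete, here is a minimal sketch that maps each failure class to a first-line recovery action. The enum names and the specific pairing are my own assumptions for illustration, not a fixed recipe; a real project would tune them to its network and product.

```cpp
#include <cassert>

// The four failure classes from the list above (identifier names are hypothetical).
enum class WifiFailure { ShortRfDrop, ApUnavailable, DhcpOrDns, CredentialOrRoaming };

// Hypothetical first-line responses; the pairing below is an assumption.
enum class Recovery {
    QuickRetry,        // reconnect soon and keep application state
    BackoffAndBuffer,  // slow the retries and queue telemetry locally
    RenewLease,        // redo DHCP/DNS while staying associated
    RescanAndAlert     // rescan for APs and flag a possible credential problem
};

// Pick the first response to try for a given failure class.
inline Recovery recoveryFor(WifiFailure f) {
    switch (f) {
        case WifiFailure::ShortRfDrop:         return Recovery::QuickRetry;
        case WifiFailure::ApUnavailable:       return Recovery::BackoffAndBuffer;
        case WifiFailure::DhcpOrDns:           return Recovery::RenewLease;
        case WifiFailure::CredentialOrRoaming: return Recovery::RescanAndAlert;
    }
    assert(false && "unhandled WifiFailure value");
    return Recovery::BackoffAndBuffer;  // conservative default
}
```

Keeping the mapping in one function means the policy can be reviewed and unit-tested without any radio hardware.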

2. Connection state machine

A robust ESP32 client should use explicit states:

  • BOOT
  • WIFI_CONNECTING
  • WIFI_ONLINE
  • WIFI_DEGRADED
  • OFFLINE_BUFFERING

Transitions should be driven by explicit events and timers, not only by callback side effects. This keeps behavior debuggable.
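
One way to sketch this is a pure transition function over the states listed above. The event names and the particular transition table are assumptions I chose for illustration; the point is that every state change goes through one function that can be logged and tested.

```cpp
#include <cassert>

// States from the list above.
enum class State { BOOT, WIFI_CONNECTING, WIFI_ONLINE, WIFI_DEGRADED, OFFLINE_BUFFERING };

// Hypothetical events, produced by Wi-Fi callbacks and by timers.
enum class Event {
    StartRequested, GotIp, LinkLost,
    HealthCheckFailed, HealthCheckPassed,
    ConnectTimeout, RetryTimerFired
};

// Pure transition function: no side effects, so it is easy to log and test.
// Any (state, event) pair not listed leaves the state unchanged.
inline State nextState(State s, Event e) {
    switch (s) {
        case State::BOOT:
            if (e == Event::StartRequested) return State::WIFI_CONNECTING;
            break;
        case State::WIFI_CONNECTING:
            if (e == Event::GotIp) return State::WIFI_ONLINE;
            if (e == Event::ConnectTimeout) return State::OFFLINE_BUFFERING;
            break;
        case State::WIFI_ONLINE:
            if (e == Event::HealthCheckFailed) return State::WIFI_DEGRADED;
            if (e == Event::LinkLost) return State::WIFI_CONNECTING;
            break;
        case State::WIFI_DEGRADED:
            if (e == Event::HealthCheckPassed) return State::WIFI_ONLINE;
            if (e == Event::LinkLost) return State::WIFI_CONNECTING;
            break;
        case State::OFFLINE_BUFFERING:
            if (e == Event::RetryTimerFired) return State::WIFI_CONNECTING;
            break;
    }
    return s;
}
```

In this arrangement the Wi-Fi callbacks only post events to a queue, and a single task calls `nextState` and performs the side effects for the new state.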

3. Backoff without reboot loops

Avoid immediate full-device resets after repeated failures. First apply reconnect attempts with exponential backoff and jitter:

  • attempts 1-3: short delay
  • attempts 4-10: medium delay
  • attempts >10: long delay and reduced network activity

Perform a controlled restart only after a clearly defined threshold, and log the reboot reason before restarting.
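
A minimal sketch of the delay calculation follows. It uses capped exponential growth with "equal jitter" (half the window fixed, half random), which is one common variant rather than the only option. The base delay, cap, and restart threshold are placeholder values I picked for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>

// Assumed tuning values; adjust to the deployment.
constexpr uint32_t kBaseDelayMs = 500;        // window for the first attempt
constexpr uint32_t kMaxDelayMs = 300000;      // cap at 5 minutes
constexpr uint32_t kRestartThreshold = 50;    // consecutive failures before a controlled restart

// Upper bound of the delay window for a 1-based attempt number:
// base * 2^(attempt - 1), capped at kMaxDelayMs.
inline uint32_t backoffCeilingMs(uint32_t attempt) {
    assert(attempt >= 1);
    const uint32_t shift = std::min<uint32_t>(attempt - 1, 20);  // bound the shift
    const uint64_t grown = static_cast<uint64_t>(kBaseDelayMs) << shift;
    return static_cast<uint32_t>(std::min<uint64_t>(grown, kMaxDelayMs));
}

// Equal jitter: delay is uniform in [ceiling / 2, ceiling], so devices that
// lost the same AP do not all retry at the same instant.
template <class Rng>
uint32_t backoffDelayMs(uint32_t attempt, Rng& rng) {
    const uint32_t ceiling = backoffCeilingMs(attempt);
    std::uniform_int_distribution<uint32_t> dist(ceiling / 2, ceiling);
    return dist(rng);
}

// Restart is a last resort, taken only at the defined threshold.
inline bool shouldRestart(uint32_t consecutiveFailures) {
    return consecutiveFailures >= kRestartThreshold;
}
```

With these placeholder values the window is 0.5 to 2 s for attempts 1-3, 4 to 256 s for attempts 4-10, and the 5-minute cap afterwards, which roughly matches the three tiers above. On the device itself, the hardware random source (for example `esp_random()` in ESP-IDF) could replace the standard-library generator.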

4. Offline buffering policy

Telemetry should not be dropped immediately when the device goes offline. Keep a bounded ring buffer for the latest payloads:

  • max entries by RAM budget
  • payload compaction for repeated metrics
  • include enqueue timestamp for staleness filtering

When connectivity returns, flush oldest-first with rate limiting.
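
The buffering policy can be sketched as a fixed-capacity ring that overwrites the oldest entry when full and skips stale entries on the way out. The class and field names are hypothetical, and payload compaction and rate limiting are left out to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// One buffered payload with its enqueue time (seconds, any monotonic clock).
// std::string keeps the sketch short; fixed-size char arrays may suit a
// tight RAM budget better on the device.
struct Entry {
    uint32_t enqueuedAtS = 0;
    std::string payload;
};

template <std::size_t N>
class TelemetryRing {
    static_assert(N > 0, "capacity must be at least one entry");
public:
    // Add a payload. When the ring is full, the oldest entry is overwritten
    // so that the latest N payloads are kept.
    void push(uint32_t nowS, std::string payload) {
        const std::size_t tail = (head_ + count_) % N;
        buf_[tail].enqueuedAtS = nowS;
        buf_[tail].payload = std::move(payload);
        if (count_ == N) {
            head_ = (head_ + 1) % N;  // the oldest entry was just replaced
        } else {
            ++count_;
        }
    }

    // Pop the oldest entry that is at most maxAgeS old. Older (stale) entries
    // are discarded along the way. Returns false when nothing fresh remains.
    bool popFresh(uint32_t nowS, uint32_t maxAgeS, Entry& out) {
        while (count_ > 0) {
            Entry e = std::move(buf_[head_]);
            head_ = (head_ + 1) % N;
            --count_;
            if (nowS - e.enqueuedAtS <= maxAgeS) {
                out = std::move(e);
                return true;
            }
        }
        return false;
    }

    std::size_t size() const { return count_; }

private:
    std::array<Entry, N> buf_{};
    std::size_t head_ = 0;   // index of the oldest entry
    std::size_t count_ = 0;  // number of valid entries
};
```

Calling `popFresh` only a few times per loop iteration after reconnecting gives a crude form of rate limiting. Note that this sketch removes an entry before it is sent, so a failed send would need to push the entry back.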

5. Timeouts and watchdog strategy

Network calls must always have explicit timeouts. A socket operation that blocks forever eventually stalls the higher-level tasks that depend on it.

Use watchdog supervision for the main loop and monitor task liveness counters. A watchdog should recover true stalls, not mask bad network logic.
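
The liveness-counter idea can be sketched as follows: each supervised task bumps its own counter, and a supervisor feeds the hardware watchdog only when every counter has moved since the last check. The class name is hypothetical, and the sketch is single-threaded; on FreeRTOS the counters would need atomic or otherwise synchronized access.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

template <std::size_t N>
class LivenessMonitor {
public:
    // Called by task number `task` each time it completes a loop iteration.
    void beat(std::size_t task) {
        assert(task < N);
        ++counters_[task];
    }

    // Called periodically by the supervisor. Returns true only if every
    // task's counter advanced since the previous call, then records the
    // current values as the new baseline.
    bool allAdvanced() {
        bool ok = true;
        for (std::size_t i = 0; i < N; ++i) {
            if (counters_[i] == lastSeen_[i]) {
                ok = false;  // task i made no progress in this period
            }
            lastSeen_[i] = counters_[i];
        }
        return ok;
    }

private:
    std::array<uint32_t, N> counters_{};
    std::array<uint32_t, N> lastSeen_{};
};
```

The supervisor would feed the watchdog (for example with `esp_task_wdt_reset()` in ESP-IDF) only when `allAdvanced()` returns true. A task blocked in a network call with no timeout then shows up as a stalled counter instead of being hidden by a main loop that keeps feeding the watchdog.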

6. Health signals for operators

Expose internal health indicators:

  • current Wi-Fi RSSI
  • reconnect count in current hour
  • queue depth
  • last successful publish timestamp

This allows remote diagnosis before users report outages.
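
As a sketch, the four indicators fit in a small struct that is serialized into whatever status channel the device already has. The field names, the JSON keys, and the hourly bucketing scheme are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Counts events within the current clock hour and resets when the hour changes.
class HourlyCounter {
public:
    void record(uint32_t nowS) { roll(nowS); ++count_; }
    uint16_t value(uint32_t nowS) { roll(nowS); return count_; }
private:
    void roll(uint32_t nowS) {
        const uint32_t hour = nowS / 3600;
        if (hour != hour_) { hour_ = hour; count_ = 0; }
    }
    uint32_t hour_ = 0;
    uint16_t count_ = 0;
};

// Snapshot of the health indicators listed above.
struct HealthSnapshot {
    int8_t rssiDbm;              // current Wi-Fi RSSI
    uint16_t reconnectsThisHour; // reconnect count in the current hour
    uint16_t queueDepth;         // buffered payloads waiting to be sent
    uint32_t lastPublishEpochS;  // last successful publish, Unix seconds
};

// Serialize to a compact JSON object (key names are hypothetical).
inline std::string toJson(const HealthSnapshot& h) {
    char buf[128];
    const int n = std::snprintf(
        buf, sizeof buf,
        "{\"rssi\":%d,\"reconnects_1h\":%u,\"queue\":%u,\"last_pub\":%lu}",
        static_cast<int>(h.rssiDbm),
        static_cast<unsigned>(h.reconnectsThisHour),
        static_cast<unsigned>(h.queueDepth),
        static_cast<unsigned long>(h.lastPublishEpochS));
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    return std::string(buf);
}
```

Publishing this snapshot on a fixed interval, and once more right after each reconnect, gives operators a trend to watch rather than a single data point.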

7. Power interaction

Weak power rails often look like Wi-Fi instability. Brown-outs during TX bursts can mimic random disconnects. Add a brown-out counter and measure supply voltage under peak transmission.

If the device is battery-powered, align the upload schedule with the power budget and radio duty cycle.
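
To show how the upload schedule follows from the power budget, here is a back-of-the-envelope sketch using a simple two-level model: the device draws one current while the radio is active and another while asleep. The function name and the model itself are assumptions, and real current figures should come from measurements on the actual board.

```cpp
#include <cassert>

// Smallest upload period T (seconds) that keeps the average current within
// the budget, for a two-level duty-cycle model:
//
//   avg = (activeMa * activeS + sleepMa * (T - activeS)) / T <= budgetMa
//   =>  T >= activeS * (activeMa - sleepMa) / (budgetMa - sleepMa)
//
// activeMa: current while connected and transmitting
// sleepMa:  current while asleep between uploads
// activeS:  seconds spent active per upload (connect + send)
// budgetMa: allowed average current
inline double minUploadIntervalS(double activeMa, double sleepMa,
                                 double activeS, double budgetMa) {
    assert(budgetMa > sleepMa);   // otherwise no schedule can meet the budget
    assert(activeMa >= budgetMa); // otherwise the radio could stay on all the time
    return activeS * (activeMa - sleepMa) / (budgetMa - sleepMa);
}
```

For example, with placeholder values of 101 mA active, 1 mA asleep, 3 s of radio time per upload, and a 4 mA average budget, the model gives a minimum interval of 100 s between uploads.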

8. Test matrix before deployment

Validate at minimum:

  • AP reboot while device running
  • DHCP server unavailable
  • intermittent packet loss
  • fallback behavior with wrong credentials
  • long offline period with buffered data

Most reconnection bugs are found only with these induced failures.

Final note

ESP32 reliability comes from controlled degradation and measured recovery. Good firmware keeps operating locally, buffers data safely, and reconnects without panicking the whole device.