Optimizing OTA Updates: From 228s to 26s

Our Hoopi Pedal uses a Daisy Seed (STM32H7) for audio DSP and an ESP32 for WiFi connectivity. We update the audio effects by replacing the Daisy's firmware with an over-the-air (OTA) update.

An over-the-air update runs in five steps:

  1. The Hoopi App checks the cloud for updates and downloads the firmware
  2. The Hoopi App sends the firmware to the ESP32 via HTTP
  3. The ESP32 sends the firmware to the Daisy via UART
  4. The Daisy writes it to a QSPI flash staging area
  5. The Daisy copies it from the staging area to the active area and reboots

The initial implementation took 228 seconds for a 294KB firmware update. Users were waiting nearly 4 minutes, and the "critical window" (where power loss could brick the device) was over 60 seconds.

The Architecture

sequenceDiagram
    participant Cloud
    participant App as Hoopi App
    participant ESP32
    participant Daisy

    App->>Cloud: Check for updates
    Cloud-->>App: New version available
    App->>Cloud: Download firmware
    Cloud-->>App: Firmware binary
    App->>ESP32: HTTP POST /api/ota
    ESP32->>Daisy: OTA_START (size, crc32)
    Note right of Daisy: Erase QSPI staging area
    Daisy-->>ESP32: ACK (block_size, total_blocks)
    loop For each block
        ESP32->>Daisy: OTA_DATA (block_num, data)
        Daisy-->>ESP32: ACK
    end
    ESP32->>Daisy: OTA_VERIFY
    Daisy-->>ESP32: ACK (calculated_crc)
    ESP32->>Daisy: OTA_FINISH
    Note right of Daisy: Copy staging → active (critical!)
    Daisy-->>ESP32: ACK
    Note right of Daisy: Reboot into new firmware
    ESP32->>Daisy: REQUEST_DEVICE_INFO
    Daisy-->>ESP32: FW version (1.9)
    Note left of ESP32: Version matches!
    ESP32-->>App: OTA Complete
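Seen from the ESP32, the exchange above is a small state machine. The sketch below is one way to model it, not the firmware's actual code: STARTED and SENDING are state names that appear in the ESP32 logs, while every other state name, every event name, and the `next_state` function are hypothetical. Per-block ACKs are left out to keep it short.

```cpp
#include <cstdint>

// ESP32-side OTA states. STARTED and SENDING appear in the logs;
// the remaining names are hypothetical.
enum class OtaState : uint8_t {
    IDLE, STARTED, SENDING, VERIFYING, FINISHING, CONFIRMING, DONE, FAILED
};

// Events that move the transfer forward (hypothetical names).
enum class OtaEvent : uint8_t {
    UPLOAD_RECEIVED,   // App finished HTTP POST /api/ota
    START_ACK,         // Daisy ACKed OTA_START (staging area erased)
    ALL_BLOCKS_ACKED,  // Last OTA_DATA block was ACKed
    CRC_MATCH,         // OTA_VERIFY ACK carried the expected CRC
    CRC_MISMATCH,      // OTA_VERIFY ACK carried a different CRC
    FINISH_ACK,        // Copy from staging to active finished
    VERSION_MATCH,     // REQUEST_DEVICE_INFO reports the new version
    TIMEOUT            // No reply within the allowed time
};

// Returns the state that follows `s` when event `e` arrives.
// Any event that is not the expected one for an in-progress state fails the update.
constexpr OtaState next_state(OtaState s, OtaEvent e) {
    switch (s) {
        case OtaState::IDLE:
            return e == OtaEvent::UPLOAD_RECEIVED ? OtaState::STARTED : OtaState::IDLE;
        case OtaState::STARTED:
            return e == OtaEvent::START_ACK ? OtaState::SENDING : OtaState::FAILED;
        case OtaState::SENDING:
            return e == OtaEvent::ALL_BLOCKS_ACKED ? OtaState::VERIFYING : OtaState::FAILED;
        case OtaState::VERIFYING:
            return e == OtaEvent::CRC_MATCH ? OtaState::FINISHING : OtaState::FAILED;
        case OtaState::FINISHING:
            return e == OtaEvent::FINISH_ACK ? OtaState::CONFIRMING : OtaState::FAILED;
        case OtaState::CONFIRMING:
            return e == OtaEvent::VERSION_MATCH ? OtaState::DONE : OtaState::FAILED;
        case OtaState::DONE:
        case OtaState::FAILED:
            return s;  // Terminal states ignore further events
    }
    return OtaState::FAILED;  // Not reached for valid enum values
}
```

Keeping the transitions in one pure function makes the happy path and the failure paths easy to check on a desktop build before touching hardware.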

Building a Test Harness

Before optimizing, we needed fast iteration. Running full OTA cycles (upload → transfer → reboot) for each test would be painfully slow. Instead, we created a test mode that benchmarks QSPI operations directly at startup.

// hoopi.cpp - Test mode for QSPI benchmarking
#define DEBUG_PRINT 1
#define TEST_QSPI_SPEED 1  // Enable timing tests

#if TEST_QSPI_SPEED
    // Wait for USB serial connection
    hw.seed.StartLog(true);  // true = blocking wait for connection

    hw.PrintLine("=== QSPI Speed Test ===");

    uint32_t start, elapsed;
    constexpr uint32_t TEST_SIZE = 64 * 1024;  // 64KB test

    // Test 1: Erase timing
    hw.PrintLine("Erasing 64KB at staging area...");
    start = System::GetNow();
    hw.seed.qspi.Erase(OTA_QSPI_STAGING_ADDR, OTA_QSPI_STAGING_ADDR + TEST_SIZE);
    elapsed = System::GetNow() - start;
    hw.PrintLine("Erase 64KB: %lums", elapsed);

    // Test 2: Write timing with different block sizes
    uint8_t* test_buf = new uint8_t[32768];
    memset(test_buf, 0xAA, 32768);

    // 256-byte writes
    start = System::GetNow();
    for (uint32_t i = 0; i < TEST_SIZE; i += 256) {
        hw.seed.qspi.Write(OTA_QSPI_STAGING_ADDR + i, 256, test_buf);
    }
    elapsed = System::GetNow() - start;
    hw.PrintLine("Write 64KB (256B blocks): %lums", elapsed);

    // 32KB writes
    hw.seed.qspi.Erase(OTA_QSPI_STAGING_ADDR, OTA_QSPI_STAGING_ADDR + TEST_SIZE);
    start = System::GetNow();
    for (uint32_t i = 0; i < TEST_SIZE; i += 32768) {
        hw.seed.qspi.Write(OTA_QSPI_STAGING_ADDR + i, 32768, test_buf);
    }
    elapsed = System::GetNow() - start;
    hw.PrintLine("Write 64KB (32KB blocks): %lums", elapsed);

    delete[] test_buf;
    hw.PrintLine("=== Test Complete ===");
#endif

This gave us immediate feedback:

=== QSPI Speed Test ===
Erase 64KB: 1408ms
Write 64KB (256B blocks): 11023ms
Write 64KB (32KB blocks): 137ms
=== Test Complete ===

The 256B vs 32KB write difference (80x!) immediately showed us where to focus.
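Dividing each result by the number of Write() calls makes the pattern clearer. The sketch below is plain arithmetic on the figures printed above; the helper name `ms_per_call` is illustrative and not part of the firmware.

```cpp
#include <cstdint>

// Average cost of one Write() call in milliseconds, given the total time
// for a run and the number of bytes passed per call.
constexpr double ms_per_call(double total_ms, uint32_t total_bytes, uint32_t bytes_per_call) {
    return total_ms / (total_bytes / bytes_per_call);
}

constexpr uint32_t BENCH_SIZE = 64 * 1024;  // Same 64KB region as the benchmark

// 256 calls of 256 bytes took 11023 ms in total
constexpr double small_call_ms = ms_per_call(11023.0, BENCH_SIZE, 256);

// 2 calls of 32KB took 137 ms in total
constexpr double large_call_ms = ms_per_call(137.0, BENCH_SIZE, 32768);
```

A 256-byte call averages about 43 ms and a 32KB call about 68.5 ms, so moving 128 times more data per call costs well under twice the time. The number of calls, not the number of bytes, dominates the write time in this benchmark.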


Optimization 1: 32KB Write Chunks

Problem: The original code wrote firmware in 256-byte pages during the final copy.

// BEFORE: 256-byte page writes (SLOW!)
for (uint32_t i = 0; i < ota_expected_size; i += 256) {
    hw.seed.qspi.Write(OTA_QSPI_ACTIVE_ADDR + i, 256,
                       (uint8_t*)(OTA_QSPI_STAGING_ADDR + i));
}

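Solution: Write in 32KB chunks instead. The benchmark above shows that the number of Write() calls, not the number of bytes, dominates the write time, so fewer and larger calls finish much sooner.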

// AFTER: 32KB chunk writes (5.5x faster!)
constexpr uint32_t CHUNK_SIZE = 32 * 1024;
uint8_t* sram_buf = new uint8_t[CHUNK_SIZE];
uint32_t bytes_copied = 0;

while (bytes_copied < ota_expected_size) {
    // The last chunk may be shorter than 32KB
    uint32_t chunk_size = std::min(CHUNK_SIZE, ota_expected_size - bytes_copied);

    // Copy to SRAM buffer first (can't write directly from QSPI)
    memcpy(sram_buf, (uint8_t*)(OTA_QSPI_STAGING_ADDR + bytes_copied), chunk_size);

    // Write the whole chunk (up to 32KB) in one call
    hw.seed.qspi.Write(OTA_QSPI_ACTIVE_ADDR + bytes_copied, chunk_size, sram_buf);
    bytes_copied += chunk_size;
}

delete[] sram_buf;  // Release the SRAM buffer once the copy is done

Result: Critical window reduced from ~62s to ~11s.


Optimization 2: 64KB Block Erase

Problem: libDaisy's QSPI erase used 4KB sector erase commands (0xD7), requiring 256 erase operations for 1MB.

Discovery: We benchmarked the erase operations:

Erase 64KB (16x 4KB sectors): 80,000ms  // Calling EraseSector 16 times
Erase 64KB (bulk Erase):       1,700ms  // Still using 4KB internally
Erase 64KB (64KB block cmd):     115ms  // Using 0xD8 command

The IS25LP080D flash chip supports 64KB block erase (command 0xD8), but libDaisy wasn't using it!

Solution: Patch libDaisy to use 64KB block erase when possible:

// Added to libDaisy's qspi.cpp
QSPIHandle::Result QSPIHandle::Impl::EraseBlock64K(uint32_t address)
{
    QSPI_CommandTypeDef s_command;
    s_command.Instruction = 0xD8;  // 64KB block erase (was 0xD7 for 4KB)
    s_command.AddressMode = QSPI_ADDRESS_1_LINE;
    s_command.AddressSize = QSPI_ADDRESS_24_BITS;
    s_command.Address     = address;
    // ... rest of command setup

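    // Stop at the first failure instead of reporting a successful erase
    if (WriteEnable() != QSPIHandle::Result::OK)
        return QSPIHandle::Result::ERR;
    if (HAL_QSPI_Command(&halqspi_, &s_command, HAL_QPSI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
        return QSPIHandle::Result::ERR;
    if (AutopollingMemReady(HAL_QPSI_TIMEOUT_DEFAULT_VALUE) != QSPIHandle::Result::OK)
        return QSPIHandle::Result::ERR;
    return QSPIHandle::Result::OK;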
}

// Modified Erase() to use 64KB blocks when aligned
QSPIHandle::Result QSPIHandle::Impl::Erase(uint32_t start_addr, uint32_t end_addr)
{
    constexpr uint32_t BLOCK_64K = 0x10000;
    constexpr uint32_t SECTOR_4K = 0x1000;

    while (end_addr > start_addr) {
        uint32_t block_addr = start_addr & 0x0FFFFFFF;

        // Use 64KB block erase when aligned and enough space remaining
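        if ((block_addr % BLOCK_64K) == 0 && (end_addr - start_addr) >= BLOCK_64K) {
            if (EraseBlock64K(block_addr) != QSPIHandle::Result::OK)
                return QSPIHandle::Result::ERR;
            start_addr += BLOCK_64K;
        } else {
            // Fall back to 4KB
            if (EraseSector(block_addr) != QSPIHandle::Result::OK)
                return QSPIHandle::Result::ERR;
            start_addr += SECTOR_4K;
        }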
    }
    return QSPIHandle::Result::OK;
}

Result: Erase time reduced from ~24s to ~1.6s (15x faster).
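To see how many commands the modified Erase() issues for a given range, the same loop can be run on a desktop with the flash commands replaced by counters. This is a sketch for reasoning about alignment, not part of libDaisy: `ErasePlan` and `plan_erase` are hypothetical names, and the addresses in the usage note are examples inside the STM32H7's memory-mapped QSPI region at 0x90000000.

```cpp
#include <cstdint>

// Number of erase commands of each kind needed for a range.
struct ErasePlan {
    uint32_t blocks_64k;
    uint32_t sectors_4k;
};

// Mirrors the decision logic of the modified Erase() but only counts
// commands instead of talking to the flash chip.
constexpr ErasePlan plan_erase(uint32_t start_addr, uint32_t end_addr) {
    constexpr uint32_t BLOCK_64K = 0x10000;
    constexpr uint32_t SECTOR_4K = 0x1000;
    ErasePlan plan{0, 0};

    while (end_addr > start_addr) {
        uint32_t block_addr = start_addr & 0x0FFFFFFF;

        // 64KB erase only when aligned and at least 64KB remains
        if ((block_addr % BLOCK_64K) == 0 && (end_addr - start_addr) >= BLOCK_64K) {
            ++plan.blocks_64k;
            start_addr += BLOCK_64K;
        } else {
            ++plan.sectors_4k;
            start_addr += SECTOR_4K;
        }
    }
    return plan;
}
```

For an aligned 1MB range, `plan_erase(0x90000000, 0x90100000)` gives 16 block erases and no sector erases, compared with 256 sector erases before the patch. A range that starts 4KB past a block boundary, such as `plan_erase(0x90001000, 0x90021000)`, needs 15 sector erases to reach the boundary, one block erase, and one trailing sector erase, so aligning the staging and active areas to 64KB gets the most out of the patch.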


Optimization 3: Larger UART Blocks

Problem: Sending firmware in 256-byte blocks meant 1,177 UART packets with protocol overhead for each.

Solution: Increase block size to 4KB, reducing packets from 1,177 to just 74:

// hoopi.h
#define OTA_BLOCK_SIZE    4096   // Was 256
#define UART_MAX_DATA_LEN 4108   // 4096 + header room
#define UART_RING_SIZE    8192   // Larger DMA buffer
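The packet counts follow from a ceiling division of the image size by the block size. In the sketch below, `ota_block_count` is an illustrative helper and `EXAMPLE_IMAGE_SIZE` is a hypothetical value: the article only gives the size as 294KB, so the constant was chosen to be consistent with the 1,177 and 74 block counts.

```cpp
#include <cstdint>

// Number of OTA_DATA blocks needed to send an image (ceiling division).
constexpr uint32_t ota_block_count(uint32_t image_size, uint32_t block_size) {
    return (image_size + block_size - 1) / block_size;
}

// Hypothetical image size, slightly over 294KB (294 * 1024 = 301,056 bytes).
constexpr uint32_t EXAMPLE_IMAGE_SIZE = 301100;

constexpr uint32_t blocks_256  = ota_block_count(EXAMPLE_IMAGE_SIZE, 256);   // 1177
constexpr uint32_t blocks_2048 = ota_block_count(EXAMPLE_IMAGE_SIZE, 2048);  // 148
constexpr uint32_t blocks_4096 = ota_block_count(EXAMPLE_IMAGE_SIZE, 4096);  // 74
```

Every block costs one frame header, one CRC, and one ACK round trip, so cutting the count by a factor of about 16 removes most of that fixed cost.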

Extended frame format for large payloads:

+-------+------+--------+--------+-------+----------+----------+
| START | 0xFE | LEN_LO | LEN_HI |  CMD  |  DATA    |  CRC16   |
+-------+------+--------+--------+-------+----------+----------+
|  0xAA | 0xFE |    2 bytes      | 1 byte| 4096 B   |  2 bytes |
+-------+------+--------+--------+-------+----------+----------+
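A minimal encoder for this layout might look like the sketch below. Several details are assumptions because the article does not specify them: the CRC16 variant (CRC-16/CCITT-FALSE is used here), the bytes the CRC covers (CMD plus DATA), the CRC byte order (low byte first), and whether LEN counts only DATA (assumed here). The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection.
// Assumption: the protocol's real CRC16 variant is not stated in the article.
uint16_t crc16_ccitt(const uint8_t* data, size_t len) {
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; ++i) {
        crc = static_cast<uint16_t>(crc ^ (data[i] << 8));
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
        }
    }
    return crc;
}

// Builds START | 0xFE | LEN_LO | LEN_HI | CMD | DATA | CRC16.
std::vector<uint8_t> build_extended_frame(uint8_t cmd, const std::vector<uint8_t>& payload) {
    assert(payload.size() <= 0xFFFF);  // LEN is a 16-bit field
    const uint16_t len = static_cast<uint16_t>(payload.size());

    std::vector<uint8_t> frame;
    frame.reserve(payload.size() + 7);  // 5 header bytes + 2 CRC bytes
    frame.push_back(0xAA);                              // START
    frame.push_back(0xFE);                              // Extended-frame marker
    frame.push_back(static_cast<uint8_t>(len & 0xFF));  // LEN_LO
    frame.push_back(static_cast<uint8_t>(len >> 8));    // LEN_HI
    frame.push_back(cmd);                               // CMD
    frame.insert(frame.end(), payload.begin(), payload.end());

    // Assumption: the CRC covers CMD + DATA and is sent low byte first
    const uint16_t crc = crc16_ccitt(frame.data() + 4, payload.size() + 1);
    frame.push_back(static_cast<uint8_t>(crc & 0xFF));
    frame.push_back(static_cast<uint8_t>(crc >> 8));
    return frame;
}
```

With a 4096-byte payload the frame is 4103 bytes long, so the fixed overhead is 7 bytes per 4096 instead of per 256.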

Result: Total time reduced from 181s to 54s.


Optimization 4: Higher Baud Rate

Problem: At 115200 baud, the UART transfer of the 294KB image took ~34 seconds.

Solution: Increase the UART baud rate to 460800 (4x the bit rate):

// hoopi.h - InitUart()
uart_conf.baudrate = 460800;  // Was 115200

We tested 921600 baud but saw instability. 460800 proved reliable.

Result: UART transfer time reduced from ~34s to ~14s.
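A quick estimate of the time the payload spends on the wire helps put these numbers in context. The sketch assumes 8N1 framing (10 bits per byte) and back-to-back bytes, neither of which the article states, and `uart_wire_seconds` is an illustrative helper.

```cpp
#include <cstdint>

// Seconds needed to clock `bytes` through a UART, assuming 8N1 framing
// (1 start + 8 data + 1 stop = 10 bits per byte) and no idle gaps.
constexpr double uart_wire_seconds(uint32_t bytes, uint32_t baud) {
    return (bytes * 10.0) / baud;
}

constexpr uint32_t IMAGE_BYTES = 294 * 1024;  // 294KB payload, headers excluded

constexpr double wire_115200 = uart_wire_seconds(IMAGE_BYTES, 115200);  // about 26.1 s
constexpr double wire_460800 = uart_wire_seconds(IMAGE_BYTES, 460800);  // about 6.5 s
```

Under these assumptions the wire time drops from about 26 s to about 6.5 s, while the measured transfer went from ~34s to ~14s. The remaining 7 to 8 seconds in both cases is time that does not scale with baud rate, presumably per-block ACK round trips and staging writes on the Daisy, which is why a 4x baud rate gave roughly a 2.4x faster transfer.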


Final Results

Version  Total Time  Optimizations
──────────────────────────────────────────────────
v16      228s        Original implementation
v1.4     193s        Baseline (cleaned up)
v1.5     181s        32KB writes + 64KB block erase
v1.6     55s         2048-byte OTA blocks
v1.7     54s         4096-byte OTA blocks
v1.9     26s         460800 baud
──────────────────────────────────────────────────

Timing Breakdown (v1.9)

Phase                     Duration
────────────────────────────────────
WiFi upload               ~1.6s
QSPI erase                ~1.6s
UART transfer (74 blocks) ~14s
CRC verification          ~0.1s
Copy to active (critical) ~2.8s
Reboot + confirm          ~4s
────────────────────────────────────
Total                     ~26s
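The CRC verification row corresponds to the OTA_VERIFY step, where the Daisy computes a CRC32 over the staged image and the ESP32 compares it with the expected value. The article does not show that code or name the CRC variant, so the sketch below assumes the common reflected CRC-32 (polynomial 0xEDB88320, as used by zlib) and shows how it can be computed chunk by chunk. `crc32_update` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reflected CRC-32 (polynomial 0xEDB88320), bitwise for brevity.
// Start with crc = 0 and pass the previous result back in for each
// further chunk; the final return value is the CRC of all bytes seen.
// Assumption: the firmware's actual CRC32 variant is not stated.
uint32_t crc32_update(uint32_t crc, const uint8_t* data, size_t len) {
    assert(data != nullptr || len == 0);
    crc = ~crc;  // Undo the final inversion of the previous call
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the low bit was set
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

Because the running value can be carried across calls, the checksum could be updated as each OTA_DATA block arrives or computed over the staging area in pieces, without needing one buffer for the whole image. A table-driven version would be faster if this step ever became a bottleneck.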

Critical Window

The most important metric for safety - how long a power failure could brick the device:

Version  Critical Window
────────────────────────
v16      >60s
v1.9     ~3s
────────────────────────

Log Comparison

Before (v16): 228 seconds

I (266660) HOOPI: OTA block 1177/1177 sent (100%)
I (266660) HOOPI: OTA_VERIFY sent
I (266760) OTA: OTA CRC verified: 0xb80f1bb5
I (266760) OTA: OTA_FINISH sent, waiting for copy to complete...
I (271760) OTA: Waiting for copy to complete... 5s
I (276760) OTA: Waiting for copy to complete... 10s
...
I (326760) OTA: Waiting for copy to complete... 60s
I (329000) OTA: OTA_FINISH ACK received - copy complete, Daisy rebooting...

After (v1.9): 26 seconds

I (19469) OTA: State -> STARTED, sending OTA_START...
I (21069) OTA: OTA_START ACK received: block_size=4096, total_blocks=74
I (21069) OTA: State -> SENDING, 74 blocks to send
I (35289) HOOPI: OTA_VERIFY sent
I (35389) OTA: OTA CRC verified: 0x170d76b6
I (35389) OTA: OTA_FINISH sent, waiting for copy to complete...
I (38179) OTA: OTA_FINISH ACK received - copy complete, Daisy rebooting...
I (42439) OTA: FW version 1.9 matches expected, OTA complete!

Key Takeaways

  1. Build a test harness first - Creating isolated QSPI benchmarks with StartLog(true) gave us instant feedback without running full OTA cycles. This turned a 4-minute test loop into seconds.

  2. Batch operations - Larger chunks reduce per-operation overhead dramatically (1,177 packets → 74 packets).

  3. Test incrementally - Each optimization was tested independently before combining, making it easy to isolate regressions.

The final result: 8.8x faster updates with a 20x smaller critical window.

Final sequence chart

sequenceDiagram
    participant Hoopi App
    participant ESP32
    participant Daisy

    Note over Hoopi App,Daisy: OTA Update Flow (294KB @ 460800 baud ≈ 26s)

    Hoopi App->>ESP32: Firmware binary (WiFi)
    Note right of ESP32: ~1.6s upload
    ESP32->>Daisy: OTA_START (size, crc32)
    Note right of Daisy: Stop audio<br/>Erase QSPI staging area
    Note right of Daisy: ~1.6s (64KB block erase)
    Daisy-->>ESP32: ACK (block_size=4096, total=74)
    loop 74 blocks (~14s)
        ESP32->>Daisy: OTA_DATA (block_num, 4096 bytes)
        Note right of Daisy: Write to staging area
        Daisy-->>ESP32: ACK (block_num)
    end
    ESP32->>Daisy: OTA_VERIFY
    Note right of Daisy: Calculate CRC32
    Daisy-->>ESP32: ACK (calculated_crc)
    alt CRC Match
        ESP32->>Daisy: OTA_FINISH
        Note right of Daisy: Copy staging → active<br/>(critical window ~3s)
        Daisy-->>ESP32: ACK
        Note right of Daisy: Reboot
        Note over Daisy: Bootloader loads<br/>new firmware
        Daisy-->>ESP32: Device ready
        ESP32->>Daisy: REQUEST_DEVICE_INFO
        Daisy-->>ESP32: version = 1.9
        Note over ESP32: OTA Complete!
    else CRC Mismatch
        ESP32->>Daisy: OTA_ABORT
        Note right of Daisy: Resume audio<br/>Discard staging
        Daisy-->>ESP32: ACK
    end