Looks like I finally have a fix for the flash write issue, thanks to some help from Espressif. Embarrassingly it turns out to be my fault, but I’m not too proud to admit that. So what was the problem? Since SDK v1.5.1 people have been reporting intermittent failed OTA updates and I’ve been able to reproduce it easily myself. This only seemed to occur when using wifi for long sustained writes.
So what was happening? Occasionally the network receive would get a much larger packet than normal. In my latest testing they are usually around 1436 bytes, but occasionally one would arrive that was 5744 (this is the only large value I have seen myself, but presumably other values could occur). The flash is erased in 4k sectors and the rBoot code erases them as required, checking if the current data block will fit in the current sector and if not erasing the next one too. Where I went wrong was to assume that a receive would always be less than the 4k sector size, having never seen one before SDK v1.5.1 that was anywhere near that large. When one of these very large packets arrives it could span 3 sectors. The second sector would get erased correctly but the third sector would not. The flash write command would not return an error when it tried to write to the non-erased sector, so no fault was noticed at this time. Then on the next write that third sector, now being the first sector for that next write, gets erased and the new data written part way through it (where it should be).
Moral of the story. Don’t make assumptions and don’t ignore the edges cases – I’m usually pretty good at this second point, but occasionally I seem to need reminding. In this case I had thought about it and added code that detected a chunk over 4k that would need more erasing. As I didn’t think this could ever happen I merely put a comment in the if statement to say what would need to be done in that scenario, if I’d thrown an error at that point instead this problem would have been easily diagnosed. However, I did also assume that the flash write would fail if it tried to write to flash that was not erased, so I expected to see an error of some kind.
Why did it suddenly happen at v1.5.1? I don’t know that but presumably Espressif made some change in the SDK that makes this more likely to occur. While playing with some code for Pete Scargill I did manage to reproduce the problem with v1.4.0 so it wasn’t impossible for it to happen there, but I never had any reports of it there previously. I also found that the timers in Pete’s code made it more likely to happen in my testing, so I suspect the extra processing these caused was impacting the performance of the network stack and causing more packets to be bunched together and delivered as larger chunks to the application. Further testing showed RF interference could also cause the same result.
The fix is available in the rBoot GitHub repo, and has also been updated in Sming.