Looks like I finally have a fix for the flash write issue, thanks to some help from Espressif. Embarrassingly it turns out to be my fault, but I’m not too proud to admit that. So what was the problem? Since SDK v1.5.1 people have been reporting intermittent failed OTA updates and I’ve been able to reproduce it easily myself. This only seemed to occur when using wifi for long sustained writes.
So what was happening? Occasionally the network receive would get a much larger packet than normal. In my latest testing they are usually around 1436 bytes, but occasionally one would arrive that was 5744 bytes (this is the only large value I have seen myself, but presumably other values could occur). The flash is erased in 4k sectors and the rBoot code erases them as required, checking whether the current data block will fit in the current sector and, if not, erasing the next one too. Where I went wrong was to assume that a receive would always be smaller than the 4k sector size, having never seen one anywhere near that large before SDK v1.5.1.
When one of these very large packets arrives it can span 3 sectors. The second sector would get erased correctly but the third would not. The flash write command does not return an error when it writes to a non-erased sector, so no fault was noticed at the time. Then, on the next write, that third sector (now the first sector for that write) gets erased and the new data is written part way through it, where it should be. The tail of the large packet, which belonged at the start of that sector, is lost.
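To make the erase logic concrete, here is a minimal host-side sketch in C. It is not the rBoot source: `sim_flash`, `sim_write_status`, `sim_write_init` and `sim_write_chunk` are hypothetical names, the flash is a plain array, and programming is modelled as a bitwise AND on the assumption that the flash can only clear bits without an erase. It also assumes a write session starts on a sector boundary. The point is the `while` loop, which erases every sector the chunk touches rather than just the next one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 4096u
#define SIM_SECTORS 8u

/* Simulated flash for host-side testing; it starts "dirty" (all zero). */
static uint8_t sim_flash[SIM_SECTORS * SECTOR_SIZE];

/* Hypothetical write state, loosely modelled on the description above. */
typedef struct {
    uint32_t next_addr;          /* where the next chunk will be written */
    int32_t  last_sector_erased; /* start sector - 1 means nothing erased yet */
} sim_write_status;

static void sim_erase_sector(uint32_t sector) {
    assert(sector < SIM_SECTORS);
    memset(&sim_flash[sector * SECTOR_SIZE], 0xff, SECTOR_SIZE);
}

/* Start a new write session; the erase tracking is reset every time. */
static void sim_write_init(sim_write_status *s, uint32_t start_addr) {
    s->next_addr = start_addr;
    s->last_sector_erased = (int32_t)(start_addr / SECTOR_SIZE) - 1;
}

/* Write one received chunk of any length, erasing sectors as needed. */
static bool sim_write_chunk(sim_write_status *s, const uint8_t *data, uint32_t len) {
    if (len == 0) return true;
    if (s->next_addr > sizeof(sim_flash) || len > sizeof(sim_flash) - s->next_addr) {
        return false; /* would run off the end of the simulated flash */
    }

    /* Sector containing the last byte of this chunk. */
    int32_t last_sector = (int32_t)((s->next_addr + len - 1) / SECTOR_SIZE);

    /* Erase every not-yet-erased sector up to and including that one.
     * A single "if" here would erase at most one extra sector. */
    while (s->last_sector_erased < last_sector) {
        s->last_sector_erased++;
        sim_erase_sector((uint32_t)s->last_sector_erased);
    }

    /* Assumed model: programming can only clear bits and never reports an error. */
    for (uint32_t i = 0; i < len; i++) {
        sim_flash[s->next_addr + i] &= data[i];
    }
    s->next_addr += len;
    return true;
}
```

Feeding this two 1436-byte chunks followed by one 5744-byte chunk (which happens to be exactly 4 × 1436) ends at offset 8616, so the last chunk starts in sector 0 and finishes in sector 2, and the loop erases sectors 1 and 2 before writing.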
Moral of the story: don’t make assumptions and don’t ignore the edge cases. I’m usually pretty good at the second point, but occasionally I seem to need reminding. In this case I had thought about it and added code that detected a chunk over 4k that would need more erasing. As I didn’t think this could ever happen, I merely put a comment in the if statement to say what would need to be done in that scenario. If I’d thrown an error at that point instead, this problem would have been easily diagnosed. However, I did also assume that the flash write would fail if it tried to write to flash that was not erased, so I expected to see an error of some kind.
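Both lessons can be sketched in a few lines of C. This is an illustration only, not code from rBoot or the SDK: `check_chunk_len`, `model_flash_program` and `program_and_verify` are hypothetical helpers, and the flash model assumes programming can only clear bits and never reports a failure. The first helper fails loudly on an "impossible" chunk size instead of leaving a comment; the second reads back what was written rather than trusting the write call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 4096u

typedef enum {
    CHUNK_OK = 0,
    CHUNK_ERR_TOO_LARGE, /* chunk is bigger than the erase logic was written for */
    CHUNK_ERR_VERIFY     /* data read back differs from the data written */
} chunk_result;

/* Hypothetical guard: report the unsupported case instead of ignoring it. */
static chunk_result check_chunk_len(uint32_t len) {
    return (len > SECTOR_SIZE) ? CHUNK_ERR_TOO_LARGE : CHUNK_OK;
}

/* Assumed flash model: bits can only go from 1 to 0, and no error is returned. */
static void model_flash_program(uint8_t *cells, const uint8_t *data, uint32_t len) {
    for (uint32_t i = 0; i < len; i++) {
        cells[i] &= data[i];
    }
}

/* Program, then compare the cells against the source buffer. */
static chunk_result program_and_verify(uint8_t *cells, const uint8_t *data, uint32_t len) {
    assert(cells != NULL && data != NULL);
    model_flash_program(cells, data, len);
    return (memcmp(cells, data, len) == 0) ? CHUNK_OK : CHUNK_ERR_VERIFY;
}
```

In this model a 5744-byte chunk is rejected up front by `check_chunk_len`, and a write into cells that were never erased is caught by the read-back comparison even though the programming step itself reports nothing.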
Why did it suddenly happen at v1.5.1? I don’t know, but presumably Espressif made some change in the SDK that makes it more likely to occur. While playing with some code for Pete Scargill I did manage to reproduce the problem with v1.4.0, so it wasn’t impossible for it to happen there, but I never had any reports of it previously. I also found that the timers in Pete’s code made it more likely to happen in my testing, so I suspect the extra processing they caused was impacting the performance of the network stack, causing more packets to be bunched together and delivered to the application as larger chunks. Further testing showed RF interference could also cause the same result.
The fix is available in the rBoot GitHub repo, and has also been updated in Sming.
4 thoughts on “Flash write bug fix”
Thank you for rboot!
With some effort I have it now running with nodemcu-dev and esp-iot-v1.51.
I think I have found a little problem in the fix for the flash write bug. When making 2 or more consecutive calls to rboot.ota(), the rom0 sectors were overwritten with 0xff. I started rboot.ota() from rom 1.
I think the problem is a missing (re-)initialization of status.last_sector_erased in rboot_write_init() (rboot-api.c).
That variable stays at the last value of the previous rboot.ota() call and the loop in rboot_write_flash() is then doing the wrong stuff.
Adding that here made the problem disappear.
I’m not sure I follow. The value of status.last_sector_erased is reinitialised in rboot_write_init():
Sorry, looks like I had an older version of the code in my build environment, or I missed it when merging. I had added the same code as found in the link above, only 2 lines further down.
No worries. It was in the original fix https://github.com/raburton/rboot/commit/75ca33be0524ab7a5f2bd55065add28ad812e045 so I guess it just got missed when you merged.