Hey everyone, by popular request, I'm going to start doing some blog posts that walk through the details of building a small hardware/software company. (Partly to help others and partly to improve my writing, feel free to skip if you're not into this stuff)
Firmware Updates that Don’t Suck
Since 2017 we’ve shipped over 250,000 Arsenal’s (a AI photography assistant for DSLR/mirrorless cameras). Being a small company, we didn’t have the customer support resources to be walking users through the update process. Updates need to work, be fast, and be 100% resilient. Here’s everything we did to get there and some tips for anyone trying to do the same.
When building Arsenal 1, we decided to time-box the firmware update “feature”. We had so many things to do, we had to be ok with good enough. Spoiler: It wasn’t until Arsenal 2 that we had the time to get everything right.
What We Needed
I did quite a bit of research and at the time (2017) and there weren’t any good off the shelf solutions that met our requirements. Our perfect update system needed the following:
- designed for linux not microcontrollers (Arsenal runs Linux on an arm processor)
- fault tolerant, no way to brick a unit (surprisingly not something most off the shelf solutions provided)
- updates over wifi (from our mobile app)
- minimal disk overhead (no duplicates of the whole firmware image) fast updates
A Lesson In What Not To Do
The camera industry is the classic example of terrible update experiences. Most cameras require you to navigate a website, find your camera model, download a firmware binary, put it in a specific path on the SD card, insert the SD card in the camera, then update through the camera menu. It’s a struggle for the user, but to be fair, it is simple to engineer and seems resilient.
Sony is probably the worst example of bad updates, instead of the SD card workflow, for most cameras they require you install a custom driver and kernel extension (yes, you read that right) on your Mac/PC to talk to the camera over USB.
With Arsenal we wanted to raise the bar.
Arsenal 1 Firmware Build
We were racing against a deadline to ship Arsenal 1, so moving fast was the top priority. Our embedded/Linux engineer and I brainstormed a few options of how to implement a more production firmware build process. For Arsenal 1, we created a simple build process based on Arch linux and bash scripts.
Our build process involved the following:
- configure and build our Linux kernel and drivers
- install a stock Arch image on the board — along with our custom kernel and drivers
- install our dependencies with pacman and gcc
- remove the build related dependencies
- copy our software to the disk
- image and compress the disk image (The firmware image was just a tar of our disk image)
- The whole thing was glued together with bash scripts. Not the most elegant, but it worked. And more importantly, we got it built quickly.
Disk Overhead and Operation Never Brick
Embedded firmware images are usually really small, so a common technique is to have two partitions on disk. You write the new image to one partition, then flip to it after a successful update. If something goes wrong, you don’t flip.
Our firmware images were huge, so we didn’t have the disk space to dedicate to two copies of everything. Instead we created a very minimal image (using openembedded) that only had the firmware updater, usb/wifi drivers, and an http server to upload to. We called this the “recovery image.”
“Recovery” sat on a small partition at the start of the disk, after that was the root partition for the main firmware image, then a user partition for all user data.
We made the recovery image and root partitions as read only as possible so things like overflowing log files wouldn’t cause boot issues. Storing all user data (in our case image caches) on the user partition meant a full user disk wouldn’t affect the actual boot, and our device code could clear the cache on boot even if we coded something wrong. (Which I assure you never happened, not even once)
This kind of defensive design saved us a few times. Just assume you will screw something up at some point and try to design a firmware update process that can handle some screw ups.
Handling Screw Ups
Even with the precautions on the main firmware/root partition, I knew mistakes would be made. So we needed to be able to handle a failed start on the root partition. Either from a bad disk write, corrupted bits, or the software getting into a bad state (from a bug), we needed failed boots of the root partition to be recoverable.
The recovery image was very small and targeted, so we decided if something went wrong with the root firmware boot, we would fail over to the recovery partition. From there you could re-upload the firmware image and be on your way again. To be safe, recovery updates clear the user partition, but Arsenal has very little on device state, so for users it’s not an issue.
You might wonder how you can be sure the device will flip over to recovery if something like a bad disk write happens. This is where hardware watchdog’s come in. The hardware watchdog’s are set up by uboot, after the root partition boots (takes about 6 seconds to boot), then our software has 30 seconds to disable the watchdog. If the watchdog isn’t disabled in 30 seconds, the unit will reboot. After 3 failed boots, it kicks over to recovery. (Check your processors data sheet for details on how to set up these watchdogs)
Simple but effective. From the mobile app, we can treat recovery updates the same way as normal updates.
The important thing with hardware watchdogs is that they are implemented in hardware (funny how that works). No amount of software mistakes (except accidentally disabling the watchdog) will prevent the reboot. Only the happy path in our case prevents Arsenal from entering recovery.
Ok, so what happens if the recovery image gets corrupted? Our recovery image is about 10MB’s, so we just did the obvious thing and added a 2nd copy of the recovery image, again with a watchdog to fail over and boot the 2nd image. These images are actually files on the recovery partition which uboot mounts as a read only file system.
I don’t have any metrics on how many times having the 2nd copy of recovery has saved us. Data on the frequency of cosmic ray induced bit-flips and some napkin math says it has happened. We have yet to get back a unit that wouldn’t boot into recovery.
As a user, you mostly access Arsenal via a mobile app. To simplify things, I decided early that the mobile app should only connect if the device firmware version matched the mobile app version. (Otherwise you go from a single app/device version that needs to testing to a matrix of version combinations to test)
On Arsenal 1, when the mobile app updated (which happens automatically on phones), the mobile app would see the versions didn’t match and require you to download and install the firmware update to continue. The process was fairly smooth, though automatically disconnecting and reconnecting wifi required a ton of debugging work and Wifi driver fixes (especially to deal with Android)
ASIDE — HERE BE DRAGONS WARNING: even today, Android’s wifi stacks are really up to the phone manufacturer and the quality of drivers and even implementation of the core Android Wifi API’s is very poor. If you need automatic Wifi connections to a device from Android, be prepared for a lot of pain as you purchase and debug issues with the way your wifi chips drivers communicate to any of the 30,000+ unique Android phones.
The Missing Requirements
Arsenal 1’s firmware update process had a huge flaw which you may have seen already. If your mobile app updated (as happens in the background), and you weren’t on internet when you opened the new version, you couldn’t download the firmware and connect. Our firmware images were 500MB’s, so you needed a good internet connection and possibly a free data plan. Even if you were on Wifi or had cell, waiting for the update to download while the good light was fading made for unhappy users. (DJI and GoPro had the same issue in their apps at the time)
This turned out to be a much harder problem to fix than you might think. In a perfect world, we would have blocked automatic app updates for our app, but that’s not an option.
For Arsenal 2, we added two more requirements for our firmware updater.
- either have offline update support or allow non-matched version connections
- updates in less than 2 minutes start to finish (“door to door” as we call it, the time from pressing update until you’re connected to the updated Arsenal)
The <2 minute update requirement seemed like an impossible hurtle. Arsenal 1 updates took about 8 minutes. (ugh) To get to <2, we would need to rebuild every part of the system, but we’ll come back to that.
First we debated if we could somehow handle the testing of unmatched mobile and device code versions. We thought about just supporting connections from 5 versions back, then requiring an update if it was older. This would still mean 5² = 25 combinations to test on each release. (And multiply that by the 100+ cameras we support) QA already took weeks, so I nixed that idea pretty quickly.
Our only option was embedding the firmware in the app itself. When we built Arsenal 1, the amount of available storage on phones (especially iPhones) was really limited, users told us a 550MB app was a non-starter. So we opted to download the firmware image when you needed it.
A lot had changed by Arsenal 2 (2020). I checked my phone, I had 4 apps all taking up more than 500MB. I had honestly never noticed. Users were still telling me that 500MB was too big, but I decided we would try embedding the firmware and see how many complaints we got.
Our #1 complaint on Arsenal 1 was blocked connections when a firmware update was needed and you were offline, so we didn’t have a ton to lose.
Shrinking the Firmware
I never liked the firmware build process for Arsenal 1. I had managed to integrate it into our CI pipeline, but it ran on a physical unit that booted into a DFU type mode. We needed to send a special sequence to a pin on the board to get it into the DFU mode, so we wired an FTDI controller to the pin and flipped its GPIO pin to send the sequence. (#hacky)
For Arsenal 2, firmware/software builds needed to be:
- faster (Arsenal 1 builds took hours)
- repeatable (Arsenal 1 builds were downloading things from pacman and github during the build, not good)
- more compact, the Arsenal 1 image was 500MB, about 250MB of that was machine learning models, the rest was mostly Arch linux and the rest of the binary dependencies.
Our linux/embedded engineer had worked with yocto and openembedded before. These projects let you build a custom linux distribution with exactly the things you need (and nothing you don’t.) They can even strip out unused symbols from binaries and do some other space optimization tricks.
They are really powerful, but not for the faint of heart. If it had just been me, I would opt for something like buildroot. The great thing about all of these projects is that they cross compile every package, so you can run the build (in parallel) on a large machine. My physically massive — 64 core, 256GB machine learning machine would do fine.
Also, my favorite thing about openembedded is that there’s some container/chroot jail type stuff that happens during the build, and inside of that is a fake
sudo type command that’s named
pseudo Best name ever.
It took our embedded engineer a few months to get everything moved over, but once we did we ended up with a firmware image that was <50MB with all of our dependencies. A 5x space savings, not bad. There was still an additional 250MB for the machine learning models. The smaller image made for a 2x faster boot. And lastly, openembedded brought our build time down to about 10 minutes.
Compressing the hell out of it
The challenge with Arsenal 2 was that we had both a Standard and a Pro version. The Pro version ran a different Arm architecture than Standard, so we would need to embed both firmware images into the mobile app.
We were looking at about 300MB per firmware image with gzip compression.
Faster Install Times
The 300MB of firmware image would upload in about 45 seconds, so that gave us one minute, 30 seconds to install the firmware and reboot. With gzip compression, the install time was … 5 minutes … I benchmarked the disk and figured out the bottleneck was decompression performance. Moving to parallel gzip decoding improved things, but not enough.
I spent a day building a script to test every compression format I thought might work well on our hardware. (I tried gzip, bzip, deflate, snappy, zstd, and blosc) The tl;dr is zstd won, the compression times were much longer, but all that mattered for us was the decompression times and compression ratio. Zstd’s decompression uses SIMD instructions on ARM and decompression time doesn’t increase as compression level goes up.
Zstd took our 300MB image with light gzip compression to a 220MB image with very high zstd compression.
The smaller image and parallel decoding got our install time down to 1 minute, and the smaller image shaved the upload time down to about 30 seconds. Disabling TCP-slow start for the firmware upload connection got it the rest of the way there.
After months of work, we had a firmware update that took 1 minute, 30 seconds from when you push the update button to when you’re reconnected on the new version.
The Moment of Truth
We had sent out a few hundred Arsenal 2’s to some early testers. The first version had the old Arsenal 1 installer. When it was ready, we pushed the new app version with the new installer. Then we waited…
I knew that embedding the firmware was our best option, so I was ready to defend why we had to make the app 500MB’s.
The messages started coming in, but not the messages I thought. It was all positive feedback about how fast the update process was. To this day, the number of complaints about app size is probably less than 10. (I’m not sure this would have been the case in 2017, but since the app downloads in the background, most people never even look at the size.)
Firmware updates are a seldom discussed area of building hardware products, the quality of your update process can have a huge impact on how your product is received. Here’s what worked for us:
- openembedded for building firmware/disk images
- zstd for fast decoding and high compression on ARM hardware
- embed the images into the mobile app
- recovery images and hardware watchdog’s for safety
If you’re about to build or integrate an existing firmware update solution, my one piece of advice is to be sure to allocate enough dev time. Firmware updates can seem like something easy form the outside, but once you dive in there’s complexity hiding under every corner. Be sure to program defensively and make sure your system can handle shipping a main image that will fail to boot under some condition.