[{"data":1,"prerenderedAt":260},["ShallowReactive",2],{"blog-hardware-software-integration-guide":3},{"id":4,"title":5,"body":6,"datePublished":247,"description":248,"extension":249,"meta":250,"navigation":251,"path":252,"seo":253,"stem":254,"tags":255,"updatedAt":247,"__hash__":259},"blog\u002Fblog\u002Fhardware-software-integration-guide.md","Hardware-Software Integration: The Firmware-to-Cloud Gap",{"type":7,"value":8,"toc":230},"minimark",[9,13,16,21,24,27,30,37,43,54,57,61,64,69,72,75,79,82,86,89,93,96,100,103,107,110,116,122,128,132,135,138,157,161,164,210,213,217],[10,11,12],"p",{},"The hardware works on the bench. The app demos beautifully on the office Wi-Fi. Then five hundred units ship, and within a month you're fielding support tickets about devices that show \"online\" but ignore commands, charts full of garbage readings, and a firmware update that half the fleet refuses to take. Nobody on either team wrote a bug, exactly. The product is failing in the space between them.",[10,14,15],{},"We do hardware-software integration for a living, and this is the single most consistent thing we see: connected products almost never die because the electronics were badly designed or the cloud was badly architected. They die at the boundary — the place where firmware, connectivity, and cloud meet — because that boundary belonged to nobody.",[17,18,20],"h2",{"id":19},"the-two-teams-problem","The two-teams problem",[10,22,23],{},"Most IoT product development happens like this. A company hires an electronics or firmware consultancy to build the device, and a separate web agency to build the app and backend. Both are competent in their own domain. Between them sits a contract: a protocol document, a JSON schema, maybe an OpenAPI spec and a shared Slack channel.",[10,25,26],{},"The contract describes the happy path. 
The product lives everywhere else.",[10,28,29],{},"Three things reliably fall into the gap:",[10,31,32,36],{},[33,34,35],"strong",{},"Protocols."," The document says the device publishes a reading every 60 seconds over MQTT. It doesn't say what happens when the broker connection drops mid-publish, whether messages are queued on-device or discarded, whether QoS 1 duplicates are possible (they are), or who deduplicates them. The firmware team assumes the backend is idempotent. The backend team assumes the device sends each reading once. Both assumptions are reasonable. Both can't be true.",[10,38,39,42],{},[33,40,41],{},"Timing."," Firmware engineers think in milliseconds, interrupt latencies, and watchdog timeouts. Backend engineers think in request-response cycles and eventual consistency. So the device times out and retries after 5 seconds, while the backend takes 8 seconds to process under load, and now every slow request is processed twice. Nobody specified the timeout relationship because each side's timing was an internal detail — until it wasn't.",[10,44,45,48,49,53],{},[33,46,47],{},"Error states."," The device reports fault code ",[50,51,52],"code",{},"0x4F",". The backend logs it as \"unknown error\" because the fault table lives in a C header file the cloud team has never seen. Meanwhile the cloud returns HTTP 429 under load, and the firmware treats anything that isn't 200 as \"retry immediately\", which is exactly the wrong response to a rate limit.",[10,55,56],{},"The gap is contract-shaped: everything written down works, and everything implied breaks.",[17,58,60],{"id":59},"where-integrated-products-actually-die","Where integrated products actually die",[10,62,63],{},"After enough of these projects, the failure points stop being surprising. They cluster in five places.",[65,66,68],"h3",{"id":67},"over-the-air-updates","Over-the-air updates",[10,70,71],{},"OTA is the hardest problem in connected hardware because it spans the entire stack. 
The bootloader and A\u002FB partition scheme are firmware concerns. The update server, signing infrastructure, and staged rollout policy are cloud concerns. The decision logic — when to update, how to verify health afterwards, when to roll back — is both, which in a two-team project usually means neither.",[10,73,74],{},"The questions that matter: what happens if power is cut mid-flash? How does a device prove the new firmware is healthy before committing to it? Can the cloud keep talking to version 1.2 devices while version 1.4 rolls out, and for how long? An update mechanism that hasn't answered these isn't an update mechanism; it's a fleet-bricking mechanism on a delay.",[65,76,78],{"id":77},"fleet-provisioning","Fleet provisioning",[10,80,81],{},"Everything works when an engineer hand-flashes ten development boards with credentials. It falls apart when a contract manufacturer needs to produce five thousand units, each with a unique identity and certificate, claimed by the right customer account on first boot. Provisioning has to be idempotent — devices get factory-reset, returned, refurbished, and re-sold — and it has to work without an engineer in the loop. This is where projects discover, late, that \"how does a device become trusted?\" was never anyone's job.",[65,83,85],{"id":84},"device-state-synchronisation","Device state synchronisation",[10,87,88],{},"A connected device has two states: what the device is actually doing, and what the cloud believes it's doing. The moment connectivity is imperfect — which is always — those diverge. If the app writes commands straight at the device and assumes success, a user toggles a setting while the device is offline and the system now permanently disagrees with itself. Solving this properly means a desired-state\u002Freported-state model (a device shadow or twin), conflict rules, and a UI honest about staleness. 
It's well-understood architecture, but it has to be designed across the boundary, not bolted on from one side.",[65,90,92],{"id":91},"flaky-connectivity","Flaky connectivity",[10,94,95],{},"The bench has perfect Wi-Fi. The field has cellular dead zones, captive portals, hostile NAT, and routers that lie about DNS. Two specific failures recur. First, devices that retry without exponential backoff and jitter: after a regional power cut, ten thousand devices reconnect in the same second and take down your own broker — a self-inflicted denial of service. Second, on-device buffering with no policy: how much data do you store offline, what do you drop first, and what timestamp do you attach when the real-time clock hasn't synced yet? \"It reconnects eventually\" is not a design.",[65,97,99],{"id":98},"instrument-data-formats","Instrument data formats",[10,101,102],{},"For scientific and industrial equipment, the payload itself is a boundary. Packed binary structs with endianness assumptions, raw ADC counts that need device-specific calibration coefficients, units that exist only in a comment in the firmware source. Then firmware 1.3 adds a field, half the fleet is still on 1.2, and the cloud parser has to handle both — forever. Payload schema versioning across a mixed-version fleet is unglamorous and absolutely product-critical.",[17,104,106],{"id":105},"what-this-looks-like-when-it-goes-wrong","What this looks like when it goes wrong",[10,108,109],{},"These are hypothetical, but each is a composite of failure modes we've seen variants of in the wild.",[10,111,112,115],{},[33,113,114],{},"Imagine a lab instrument company."," The device's clock resets to 1970 on power loss and corrects itself once NTP syncs, a minute or so after boot. Firmware timestamps readings locally; the cloud trusts device timestamps. Every power cycle produces a burst of readings dated fifty-six years ago, which the backend's retention policy silently deletes as expired. 
The data isn't wrong in either codebase. It's wrong in the gap.",[10,117,118,121],{},[33,119,120],{},"Imagine a heating controller."," The app reads state from the cloud's cached record, which says \"on\" because that was the last command sent. The device has been offline for two days. Customers stare at an app confidently displaying a fiction, and the support team has no tool that shows the difference between \"device reports on\" and \"we told it to turn on and heard nothing back\".",[10,123,124,127],{},[33,125,126],{},"Imagine an OTA rollout"," where the new firmware reports update health over a connection that a customer's captive-portal Wi-Fi quietly blocks. The device can't confirm health, dutifully rolls back, re-downloads the update overnight, and repeats: an invisible update loop chewing through a metered SIM. The bootloader behaved correctly. The server behaved correctly. The product failed.",[17,129,131],{"id":130},"what-changes-when-one-team-owns-sensor-to-screen","What changes when one team owns sensor-to-screen",[10,133,134],{},"When the same team designs the firmware, the protocol, and the cloud (what we'd call sensor-to-screen ownership), the gap doesn't get bridged. It stops existing.",[10,136,137],{},"Decisions get made once, with the whole system in view. The message format is designed knowing both the device's RAM budget and the dashboard's query patterns. The retry policy is one policy, with backoff tuned against real broker limits. Error states share a single taxonomy from ADC fault to support dashboard, so when a customer calls, someone can actually trace the failure. And the integration gets tested as a system: hardware-in-the-loop rigs in CI, cloud deployments validated against real devices on a desk, not two halves meeting for the first time in a customer's building.",[10,139,140,141,146,147,151,152,156],{},"This is also where the way we work matters. 
Our ",[142,143,145],"a",{"href":144},"\u002Fservices\u002Fembedded-hardware","embedded & hardware integration"," practice and our ",[142,148,150],{"href":149},"\u002Fservices\u002Fcloud-devops","cloud & DevOps"," practice are the same people, deliberately. Hyperfocus means the protocol design, the power budget, and the rollout policy are held in one head at the same time, which is precisely what boundary problems require, because they're invisible to anyone holding only half the system. Pattern recognition across layers is how you spot, before it ships, that a \"random\" cloud data anomaly is actually a clock-sync issue three layers down. Cambridge happens to be a good place to do this from: the region is dense with instrument makers, deep-tech spinouts, and hardware companies, which is a large part of why we focus on ",[142,153,155],{"href":154},"\u002Fsoftware-development-cambridge","software development in Cambridge"," and the surrounding area for embedded systems development in the UK.",[17,158,160],{"id":159},"questions-to-ask-any-vendor-about-the-boundary","Questions to ask any vendor about the boundary",[10,162,163],{},"Whoever you're evaluating, including us, these questions expose whether the gap has an owner:",[165,166,167,174,180,186,192,198,204],"ol",{},[168,169,170,173],"li",{},[33,171,172],{},"Who owns the protocol spec, and how does it change after launch"," when devices in the field still speak the old version?",[168,175,176,179],{},[33,177,178],{},"Walk me through an OTA update that loses power halfway."," What does the device do? What does the cloud see? 
Who finds out?",[168,181,182,185],{},[33,183,184],{},"How does a unit get its identity and certificates at the factory",", and what happens when one is factory-reset and resold?",[168,187,188,191],{},[33,189,190],{},"What does the app show when a device has been offline for an hour?"," Where do commands sent during that hour go?",[168,193,194,197],{},[33,195,196],{},"What happens when the whole fleet reconnects at once"," after a network outage?",[168,199,200,203],{},[33,201,202],{},"How is the integration tested?"," Is real hardware in the CI loop, or do firmware and cloud first meet in the field?",[168,205,206,209],{},[33,207,208],{},"How many firmware versions can the backend handle simultaneously",", and for how long?",[10,211,212],{},"A vendor who answers these crisply has built integrated products before. A vendor who says \"the firmware team handles that\" is showing you the gap.",[17,214,216],{"id":215},"building-or-rescuing-a-connected-product","Building or rescuing a connected product?",[10,218,219,220,224,225,229],{},"If you're planning an integrated product — or have one in the field that's behaving in ways nobody can explain — we're happy to talk through the architecture. No pitch deck, no obligation; sometimes a one-hour conversation about where your boundary lives is genuinely all that's needed. 
Email us at ",[142,221,223],{"href":222},"mailto:hello@overclockminds.co.uk","hello@overclockminds.co.uk"," or ",[142,226,228],{"href":227},"\u002Fcontact","get in touch here",".",{"title":231,"searchDepth":232,"depth":232,"links":233},"",2,[234,235,243,244,245,246],{"id":19,"depth":232,"text":20},{"id":59,"depth":232,"text":60,"children":236},[237,239,240,241,242],{"id":67,"depth":238,"text":68},3,{"id":77,"depth":238,"text":78},{"id":84,"depth":238,"text":85},{"id":91,"depth":238,"text":92},{"id":98,"depth":238,"text":99},{"id":105,"depth":232,"text":106},{"id":130,"depth":232,"text":131},{"id":159,"depth":232,"text":160},{"id":215,"depth":232,"text":216},"2026-06-11","Connected products rarely die in the firmware or the cloud. They die in the gap between. Here's where it happens and how to close it.","md",{},true,"\u002Fblog\u002Fhardware-software-integration-guide",{"title":5,"description":248},"blog\u002Fhardware-software-integration-guide",[256,257,258],"embedded","hardware","iot","gI4mrvFN2-IF3Ue87zcXhx_v95N6HqGdGGmsKC4lY-Q",1781193457578]