The thing that amazes me is how they've rolled out such a buggy change at such a scale. I would assume that for such critical systems, there would be a gradual rollout policy, so that not everything goes down at once.
Lack of gradual, health-mediated rollout is absolutely the core issue here. False-positive signatures, crash-inducing blocks, etc. will always slip through testing at some rate, no matter how good the testing is. The necessary defense in depth here is to roll out ALL changes (binaries, policies, etc.) in a staggered fashion with some kind of health check in between (did > 10% of the endpoints the change went to go down and stay down right after the change was pushed?).
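For what it's worth, a minimal sketch of what such a health-gated, staggered push could look like. The wave sizes, the 10% threshold, and the push_to/heartbeat_ok hooks are all assumptions of mine for illustration, not anything resembling CrowdStrike's actual pipeline:

```python
import time

# Hypothetical staggered rollout with a simple health gate between waves.
# Wave fractions and the failure threshold are illustrative assumptions.
WAVES = [0.01, 0.05, 0.25, 1.00]   # cumulative fraction of the fleet per wave
FAILURE_THRESHOLD = 0.10           # halt if >10% of a wave stops reporting

def rollout(change, fleet, push_to, heartbeat_ok, soak_seconds=900):
    pushed = 0
    for fraction in WAVES:
        wave = fleet[pushed:int(len(fleet) * fraction)]
        for host in wave:
            push_to(host, change)      # deliver the channel file / binary
        time.sleep(soak_seconds)       # let the wave settle before judging health
        failed = sum(1 for host in wave if not heartbeat_ok(host))
        if wave and failed / len(wave) > FAILURE_THRESHOLD:
            raise RuntimeError(
                f"halting rollout: {failed}/{len(wave)} hosts unhealthy after push")
        pushed = int(len(fleet) * fraction)
    return pushed
```

Even a crude gate like this turns "the entire fleet is down" into "a small slice of the fleet is down and the push stopped itself."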
Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that this was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.
You can stagger changes out within a reasonable timeframe - the blocks already take hours/days/weeks to come up with; taking an extra hour or two to trickle the change out gradually, with some basic sanity checks between staggers, is a tradeoff everyone would embrace in order to avoid the disaster we're living through today.
They need a reset of their security:uptime balance point.
Wow!! Good to know the real reason for the non-staggered release of the software...
> Crowdstrike bit my company with a false positive that severely broke the entire production fleet because they pushed the change everywhere all at once instead of staggering it out. We pushed them hard in the RCA to implement staggered deployments of their changes. They sent back a 50-page document explaining why they couldn't, which basically came down to "that would slow down blocks of true positives" - which is technically true, but from follow-up conversations it was quite clear that this was not the real reason. The real reason is that they weren't ready to invest the engineering effort into doing this.
There's some irony there, in that the whole point of CrowdStrike itself is that it does behaviour-based interventions, i.e. it notices "unusual" activity over time and then can react to that autonomously. So them telling you they can't engineer this is kind of like telling you they don't know how to do a core feature they actually sell and market the product as doing.
It's quite handy that all the things that pass QA never fail in production. :)
On a serious note, we have no way of knowing whether their update passed some QA or not; likely it hasn't, but we don't know. Regardless, the post you're replying to, IMHO, correctly makes the point that no matter how good your QA is, it will not catch everything. When something slips, you are going to need good observability and staggered, gradual, rollbackable rollouts.
Ultimately, unless it's a nuclear power plant or something mission critical with no redundancy, I don't care if it passes QA, I care that it doesn't cause damage in production.
Had this been halted after bricking 10, 100, 1,000, 10,000, heck, even 100,000 machines or a whopping 1,000,000 machines, it would have barely made it outside of the tech-circle news.
> On a serious note, we have no way of knowing whether their update passed some QA or not
I think we can infer that it clearly did not go through any meaningful QA.
It is very possible for there to be edge-case configurations that get bricked regardless of how much QA was done. Yes, that happens.
That's not what happened here. They bricked a huge portion of internet connected windows machines. If not a single one of those machines was represented in their QA test bank, then either their QA is completely useless, or they ignored the results of QA which is even worse.
There is no possible interpretation here that doesn't make Crowdstrike look completely incompetent.
If there had been a QA process, the kill rate could not have been as high as it is, because there'd have to be at least one system configuration that's not subject to the issue.
I agree that testing can reduce the probability of having huge problems, but there are still many ways in which a QA process can fail silently, or even pass properly, without giving a good indication of what will happen in production due to data inconsistencies or environmental differences.
Ultimately we don't know if they QA'd the changes at all, if this was data corruption in production, or anything really. What we know for sure is that they didn't have a good story for rollbacks and enforced staggered rollouts.
My understanding of their argument is that they can't afford the time to see if it breaks the QA fleet. Which I agree with GP is not a sufficient argument.
If and when there is a US Cyber Safety Review Board investigation of this incident, documents like that are going to be considered with great interest by the parties involved.
Often it is the engineers working for a heavily invested customer at the sharp end of the coal face who get a glimpse underneath the layers of BS and stare into the abyss.
This doesn’t look good, they say. It looks fine from up top! Keep shoveling! Comes the reply.
Sure, gradual rollout seems obviously desirable, but think of it from a liability perspective.
You roll out a patch to 1% of systems, and then a few of the remaining 99% get attacked and they sue you for having a solution but not making it available to them. It won't matter that your sales contract explains that this is how it works and the rollout is gradual and random.
Then push that decision down to the customer; better yet, provide integration points with other patch-management software (no idea if you can integrate with WSUS without doing insane things, but it's not the only system that handles that, etc.).
Another version of the "fail big" or "big lie" type phenomenon. Impact 1% of your customers and they sue you saying the gradual rollout demonstrates you had prior knowledge of the risk. Impact 100% of your customers and somehow you get off the hook by declaring it a black swan event that couldn't have been foretold.
This. I can see such an update shipping out for a few users. I mean I've shipped app updates that failed spectacularly in production due to a silly oversight (specifically: broken on a specific Android version), but those were all caught before shipping the app out to literally everybody around the world at the same time.
The only thing I can think of is they were trying to defend from a very severe threat very quickly. But... it seems like if they tested this on one machine they'd have found it.
Unless that threat was a 0day bug that allows anyone to SSH to any machine with any public key, it was not worth pushing it out in haste. Full stop. No excuses.
I also blame the customers here to be completely honest.
The fact the software does not allow for progressive rollout of a version in your own fleet should be an instantaneous "pass". It's unacceptable for a vendor to decide when updates are applied to my systems.
Absolutely. I may be speaking from ignorance here, as I don't know much about Windows, but isn't it also a big security red flag that this thing is reaching out to the Internet during boot?
I understand the need for updating these files, they're essentially what encodes the stuff the kernel agent (they call it a "sensor"?) is looking for. I also get why a known valid file needs to be loaded by the kernel module in the boot process--otherwise something could sneak by. What I don't understand is why downloading and validating these files needs to be a privileged process, let alone something in the actual kernel. And to top it all off, they're doing it at boot time. Why?
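As a sketch of the split being asked for here (the file paths and hash manifest are my own invented examples, not how the actual agent works): the download and validation could run entirely unprivileged, and the privileged side would only ever consume a file that has already passed validation, falling back to the last-known-good one if the new file fails to parse.

```python
import hashlib
from pathlib import Path

# Invented paths for illustration only.
STAGING = Path("/var/lib/sensor/staging")
ACTIVE = Path("/var/lib/sensor/active/current.bin")

def validate_and_stage(blob: bytes, expected_sha256: str) -> Path:
    """Runs as an ordinary user: check the content hash, then stage the file."""
    digest = hashlib.sha256(blob).hexdigest()
    if digest != expected_sha256:
        raise ValueError("content file failed validation; refusing to stage")
    STAGING.mkdir(parents=True, exist_ok=True)
    staged = STAGING / f"{digest}.bin"
    staged.write_bytes(blob)
    return staged

def activate(staged: Path) -> None:
    """The only step that needs privilege: swap a validated file into place."""
    ACTIVE.parent.mkdir(parents=True, exist_ok=True)
    tmp = ACTIVE.with_suffix(".tmp")
    tmp.write_bytes(staged.read_bytes())
    tmp.replace(ACTIVE)  # atomic rename of the new, already-validated file
```

Nothing in that flow has to happen at boot time, and nothing except the final swap (and the eventual parse by the driver) needs elevated rights.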
I hope there's an industry wide safety and reliability lesson learned here. And I hope computer operators (IT departments, etc) realize that they are responsible for making sure the things running on their machines are safe and reliable.
At the risk of sounding like a douche-bag, I honestly believe there's A LOT of incompetence in the tech world, and it permeates all layers: security companies, AV companies, OS companies, etc.
I really blame the whole power structure. It looked like the engineers had the power, but over the last 10 years tech has been turned upside down and exploited like any other industry, controlled by opportunistic and greedy people. Everything is about making money and shipping features; the engineering is lost.
Would you rather tick compliance boxes easily or think deeply about your critical path? Would you rather pay 100k for a skilled engineer or hire 5 cheaper (new) ones? Would you rather sell your HW now, despite pushing a feature-incomplete, buggy app that ruins the experience for many, many customers? Will you listen to your engineers?
I also blame us, the software engineers: we are way too easily pushed around by these types of people who have no clue. Have professional integrity; tests are not optional or something that can be cut, they're part of software engineering. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.
I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.
Apple recognised that kernel extensions brought all sorts of trouble for users, such as instability, crashes, etc., and presented a juicy attack surface. They deprecated and eventually disallowed kernel extensions, supplanting them with a system extensions framework that provides interfaces for VPN functionality, EDR agents, and so on.
A CrowdStrike agent using this interface couldn't panic or boot-loop macOS due to a bug in its code.
> I know people really dislike how Apple restricts your freedom to use their software in any way they don't intend. But this is one of the times where they shine.
Yes, the problem here is that the system owners had too much control over their systems.
No, no, that's the EXACT OPPOSITE of what happened. The problem is Crowdstrike had too much control of systems -- arguing that we should instead give that control to Apple is just swapping out who's holding the gun.
> arguing that we should instead give that control to Apple is just swapping out who's holding the gun.
apple wrote the OS, in this scenario they're already holding a nuke, and getting the gun out of crowdstrike's hands is in fact a win.
it is self-evident that 300 countries having nukes is less safe than 5 countries having them. Getting nukes (kernel modules) out of the hands of randos is a good thing even if the OS vendor still has kernel access (which they couldn't possibly not have) and might have problems of their own. IDK why that's even worthy of having to be stated.
don't let the perfect be the enemy of the good; incremental improvement in the state of things is still improvement. there is a silly amount of black-and-white thinking around "popular" targets like apple and nvidia (see: anything to do with the open-firmware driver) etc.
"sure google is taking all your personal data and using it to target ads to your web searches, but apple also has sponsored/promoted apps in the app store!" is a similarly trite level of discourse that is nonetheless tolerated when it's targeted at the right brand.
This is good nuance to add to the conversation, thanks.
I think in most cases you have to trust some group of parties. As an individual you likely don't have enough time and expertise to fully validate everything that runs on your hardware.
Do you trust the OSS community, hardware vendors, OS vendors like IBM, Apple, M$, do you trust third party vendors like Crowdstrike?
For me, I prefer to minimize the number of parties I have to trust, and my trust is based on historical track record. I don't mind paying and giving up functionality.
Even if you've trusted too many people, and been burned, we should design our systems such that you can revoke that trust after the fact and become un-burned.
Having to boot into safe mode and remove the file is a pretty clumsy remediation. Better would be to boot into some kind of trust-management interface, distrust CrowdStrike updates dated after July 17, then rebuild your system accordingly (this wouldn't be difficult to implement with nix).
Of course you can only benefit from that approach if you trust the end user a bit more than we typically do. Physical access should always be enough to access the trust management interface, anything else is just another vector for spooky action at a distance.
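To make the distrust-by-date idea concrete, a toy sketch of the policy filter such an interface might apply. The update records and field names are invented for illustration; this is not a real nix or OS interface:

```python
from datetime import datetime, timezone

# Hypothetical cutoff matching "distrust updates dated after July 17".
DISTRUST_AFTER = datetime(2024, 7, 17, tzinfo=timezone.utc)

def trusted_updates(updates, vendor="crowdstrike"):
    """Drop any update from the distrusted vendor dated after the cutoff.

    Each update is assumed to be a dict with 'vendor' and a timezone-aware
    'published' datetime.
    """
    return [u for u in updates
            if u["vendor"] != vendor or u["published"] <= DISTRUST_AFTER]

# A declarative system (nix-style) would then rebuild from this filtered set,
# leaving you on the last trusted state instead of hand-deleting files in safe mode.
```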
It is some mix of priorities along the frontier, with Apple being on the significantly controlling end, such that I wouldn't want to bother. Your trust should also be based on prediction, and giving a major company even more control over what your systems are allowed to do has historically been bad and only gets worse. Even if Apple is properly ethical now (I'm skeptical; I think they've found a decently sized niche and that most of their users wouldn't drop them even if they moved to significantly higher levels of telemetry, due to being a status good in part), there's little reason to give them that power in perpetuity. Removing that control when it is abused hasn't gone well in the past.
Microsoft is also trying to make drivers and similar safer with HVCI, WDAC, ELAM and similar efforts.
But given how a large part of their moat is backwards compatibility, very few of those things are the default and even then probably wouldn't have prevented this scenario.
These customers wouldn't be able to do that in time frames measured in anything but decades and/or they would risk going bankrupt attempting to switch.
Microsoft has far more leverage than they choose to exert, for various reasons.
I can't run a 10-year-old game on my Mac, but I can run a 30-year-old game on my Windows 11 box. Microsoft prioritizes backwards compatibility for older software.
For Apple you just need to be an Apple customer; they do a good job of crashing computers with their macOS updates, like Sonoma. I remember my first MacBook Pro Retina couldn't go to sleep because it wouldn't wake up until Apple decided to release a fix for it. Good thing they don't make server OSes.
I remember fearing every OS X update because, until they switched to just shipping read-only partition images, you had a considerable chance of hitting a bug in Installer.app that resulted in an infinite loop... (the bug existed from ~10.6 until they switched to image-based updates...)
30 years ago would be 1994. Were there any 32-bit Windows games in 1994 other than the version of FreeCell included with Win32s?
16-bit games (for DOS or Windows) won't run natively under Windows 11 because there's no 32-bit version of Windows 11 and switching a 64-bit CPU back to legacy mode to get access to the 16-bit execution modes is painful.
Maybe. Have you tried? 30-year-old games often did not implement delta timing, so they advance ridiculously fast on modern processors. Or the games required a memory mode not supported by modern Windows (see real mode, expanded memory, protected mode), requiring DOSBox or another emulator to run today.
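For anyone unfamiliar with delta timing, a toy loop showing the difference; the speeds and structure are made up for illustration:

```python
import time

SPEED_UNITS_PER_SECOND = 120.0   # frame-rate independent
SPEED_UNITS_PER_FRAME = 2.0      # how many old games were written

def move_with_delta_time(frames: int) -> float:
    """Movement scales with elapsed wall-clock time, not with frame count."""
    position, previous = 0.0, time.monotonic()
    for _ in range(frames):
        now = time.monotonic()
        position += SPEED_UNITS_PER_SECOND * (now - previous)
        previous = now
    return position

def move_per_frame(frames: int) -> float:
    """Ties game speed to frame rate: a CPU that runs the loop 100x faster
    makes everything in the game move 100x faster too."""
    return SPEED_UNITS_PER_FRAME * frames
```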
Well - recognition where it's due - that actually looks pretty great. (Assuming that, contrary to prior behavior, they actually support it, and fix bugs without breaking backwards compatibility every release, and don't keep swapping it out for newer frameworks, etc etc)
> I also blame us, the software engineers: we are way too easily pushed around by these types of people who have no clue. Have professional integrity; tests are not optional or something that can be cut, they're part of software engineering.
Then maybe most of what's done in the "tech-industry" isn't, in any real sense, "engineering"?
I'd argue the areas where there's actual "engineering" in software are the least discussed; an example being hard real-time systems for Engine Control Units, ABS systems, etc.
That _has_ to work, unlike the latest CRUD/React thingy that had "engineering" processes of cargo-culting whatever framework is cool now and subjective nonsense like "code smells" and whatever design pattern is "needed" for "scale" or some such crap.
Perhaps actual engineering approaches could be applied to software development at large, but it wouldn't look like what most programmers do, day to day, now.
How is mission-critical software designed, tested, and QA'd? Why not try those approaches?
Amen to that. Software Engineering as a discipline badly suffers from not incorporating well-known methods for preventing these kinds of disasters from Systems Engineering.
> How is mission-critical software designed, tested, and QA'd? Why not try those approaches?
Ultimately, because it is more expensive and slower to do things correctly. Though I would argue that while you lose speed initially with activities like actually thinking through your requirements and your verification and validation strategies, you gain speed later when you're iterating on a correct system implementation, because you have established extremely valuable guardrails that keep you focused and on the right track.
At the end of the day, the real failure is in the risk estimation of the damage done when these kinds of systems fail. We foolishly think that this kind of widespread disastrous failure is less likely than it really is, or the damage won't be as bad. If we accurately quantified that risk, many more systems we build would fall under the rigor of proper engineering practices.
Accountability would drive this. Engineering liability codes are a thing, trade liability codes are a thing. If you do work that isn't up to code, and harm results, you're liable. Nobody is holding us software developers accountable, so it's no wonder these things continue to happen.
"Listen to the engineers?" The problem is that there are no engineers, in the proper sense of the term. What there are is tons and tons of software developers who are all too happy to be lax about security and safe designs for their own convenience and fight back hard against security analysts and QA when called out on it.
Engineers can be lazy and greedy, too. But at least they should better understand the risks of cutting corners.
> Have professional integrity; tests are not optional or something that can be cut, they're part of software engineering. Gradual rollouts, feature toggles, fallbacks/watchdogs, etc. are basic tools everyone should know.
In my career, my solution for this has been to just include doing things "the right way" as part of the estimate, and not give management the option to select a "cutting corners" option. The "cutting corners" option not only adds more risk, but rarely saves time anyway when you inevitably have to manually roll things back or do it over.
Sigh, I've tried this. So management reassigned the work to a dev who was happy to ship a simulacrum of the thing that, at best, doesn't work or, at worst, is full of security holes and gives incorrect results. And this makes management happy because something shipped! Metrics go up!
And then they ask why, exactly, did the senior engineer say this would take so long? Why always so difficult?
I don't know that incompetence is the best way to describe the forces at play but I agree with your sentiment.
There is always tension between business people and engineering: the engineers want things to be perfect and safe, because we're the ones who have to fix the arising issues during nights and weekends.
The business people are interested in getting features released, and don't always understand the risks of pushing arbitrary dates.
It's a tradeoff that is well managed in healthy organizations where the two sides and leadership communicate effectively.
> The engineers want things to be perfect and safe, because we're the ones who have to fix the arising issues during nights and weekends. The business people are interested in getting features released, and don't always understand the risks of pushing arbitrary dates.
Isn't this issue a vindication of the engineering approach to management, where you try to _not_ brick thousands of computers because you wanted to meet some internal deadline faster?
> There is always tension between business people and engineering.
Really? I think this situation (and the situation with Boeing!) shows that the tension is ultimately between responsibility and irresponsibility.
Can it not be said that this is a win for short-sighted and incompetent business people?
If people don't understand the risks they shouldn't be making the decisions.
I think this is especially true in businesses where the thing you are selling is literally your ability to do good engineering. In the case of Boeing the fundamental thing customers care about is the "goodness" of the actual plane (for example the quality, the value for money, etc). In the case of Crowdstrike people wanted high quality software to protect their computers.
Yeah, good point. If you buy a carton of milk and it's gone off you shrug and go back to the store. If you're sitting in a jet plane at 30,000ft and the door goes for a walk... Twilight Zone. (And if the airline's security contractor sends a message to all the planes to turn off their engines... words fail. It's not... I can't joke about it. Too soon.)
Yes. I have been working in the tech industry since the early aughts and I have never seen the industry so weak on engineer-led firms. Something really happened and the industry flipped.
In most companies, businesspeople without any real software dev experience control the purse strings. Such people should never run companies that sell life-or-death software.
The reality is there is plenty of space in the software industry to trade off velocity against "competent" software engineering. Take Instagram as an example. No one is going to die if e.g. a bug causes someone's IG photo upload to only appear in a proper subset of the feeds where it should appear.
In the civil engineering world, at least in Europe, the lead engineer signs papers that make them personally liable if a bridge or a building structure collapses on its own. Civil engineers face literal prison time if they do sloppy work.
In the software engineering world, we have TOSs that deny any liability if the software fails. Why?
It boils my blood to think that the heads of CrowdStrike would maybe get a slap on the wrist and everything will slowly continue as usual as the machines will get fixed.
Let's think about this for a second. I agree to some extent with what you are trying to say; I just think there's a critical thing missing from your consideration, and that is usage of the product outside its intended purpose/marketing.
Civil engineers build bridges knowing that civilians use them, and that structural failure can cause deaths. The line of responsibility is clear.
For SW companies (like CrowdStrike (CS)) it MAY BE less straightforward.
A relevant real-world example is the use of consumer drones in military conflicts. Companies like DJI design and market their drones for civilian use, such as photography. However, these drones have been repurposed in conflict zones, like Ukraine, to carry explosives. If such a drone malfunctioned during military use, it would be unreasonable to hold DJI accountable, as this usage clearly falls outside the product's intended purpose and marketing.
The liability depends on the guarantees they make. If they market it as AV for critical infrastructure, such as healthcare (seems like they do: https://www.crowdstrike.com/platform/), then by all means it's reasonable to hold them accountable.
However, SW companies should be able to sell products as long as they're clear about what the limitations are, and those limitations need to be clearly communicated to the customers.
We have those TOSs in the software world because it would be prohibitively expensive to make all software as reliable as a publicly used bridge. For those who died as a direct result of CrowdStrike, that's where the litigious nature of the US becomes a rare plus. And CrowdStrike will lose a lot of customers over this. It isn't perfect, but the market will arbitrate CrowdStrike's future in the coming months and years.
We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.
I mean back in the mid teens we had the whole “move fast and break things” motif. I think that quickly morphed into “be agile” because no one actually felt good about breaking things.
We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.” Like, let’s create our own oath.
> We don’t really have any software engineering leaders these days. It would be nice if one stood up and said “stop being awful. Let’s be professionals and earn our money.”
I assume you realize that you don't get very far in many companies when you do that. I'm not humble-bragging, but I used to say just this over the past 10-15 years, even when in senior/leadership positions, and it ended up giving me a reputation of "oh, gedy is difficult", and you get sidelined by more "helpful" junior devs and managers who are willing to sling shit over the wall to please product. It's really not worth it.
It’s a matter of getting a critical mass of people who do that. In other words, changing the general culture. I’m lucky to work at a company that more or less has that culture.
Yeah I’ve found this is largely cultural, and it needs to come from the top.
The best orgs have a gnarly, time-wisened engineer in a VP role who somehow is also a good people person, and pushes both up and down engineering quality above all else. It’s a very very rare combination.
> We’re definitely in a moment. I’ve seen a large shift away from discipline in the field. People don’t seem to care about professionalism or “good work”.
Agreed. Thinking back to my experience at a company like Sun, every build was tested on every combination of hardware and OS releases (and probably patch levels, don't remember). This took a long time and a very large number of machines running the entire test suites. After that all passed ok, the release would be rolled out internally for dogfooding.
To me that's the base level of responsibility an engineering organization must have.
Here, apparently, Crowdstrike lets a code change through with little to no testing and immediately pushes it out to the entire world! And this is from a product that is effectively a backdoor to every host. What could go wrong? YOLO right?
This mindset is why I grow to hate what the tech industry has become.
As an infra guy, it seems like all my biggest fights at work lately have been about quality. Long abandoned dependencies that never get updated, little to no testing, constant push to take things to prod before they're ready. Not to mention all the security issues that get shrugged off in the name of convenience.
I find both management and devs are to blame. For some reason the amazingly knowledgeable developers I read on here daily are never to be found at work.
Yes. I’ve had the same experience. Literally have had engineers get upset with me when I asked them to consider optimizing code or refactor out complexity. “Yeah we’ll do it in a follow up, this needs to ship now,” is what I always end up hearing. We’re not their technical leads but we get pulled into a lot of PRs because we have oversight on a lot of areas of the codebase. From our purview, it’s just constantly deteriorating.
IMO, if you want to write code for anything mission critical you should need some kind of state certification, especially when you are writing code for stuff that is used by govt., hospitals, finance etc.
Not certification, licensure. That can and will be taken away if you violate the code of ethics. Which in this case means the code of conduct dictated to you by your industry instead of whatever you find ethical.
Like a license to be a doctor, lawyer, or civil engineer.
There’s - perhaps rightfully, but certainly predictably - a lot of software engineers in this thread moaning about how evil management makes poor engineers cut corners. Great, licensure addresses that. You don’t cut corners if doing so and getting caught means you never get to work in your field again. Any threat management can bring to the table is not as bad as that. And management is far less likely to even try if they can’t just replace you with a less scrupulous engineer (and there are many, many unscrupulous engineers) because there aren’t any because they’re all subject to the same code of ethics. Licensure gives engineers leverage.
I think that could cause a huge shift away from contributing to or being the maintainer of open source software. It would be too risky if those standards were applied and they couldn't use the standard "as is, no warranties" disclaimers.
Actually, no it wouldn't, as the licensure would likely be tied to providing the service on a paid basis to others. You could write or maintain any codebase you want. Once you start consuming it for an employer, though, the licensure kicks in.
Paid/subsidized maintainers may be a different story though. But there absolutely should be some level of teeth and stakes that a professional SWE can wield to resist pushes to "just do the unethical/dangerous thing" by management.
I might have misunderstood. I took it to mean that engineers would be responsible for all code they write - the same as another engineer may be liable for any bridge they build - which would mean the common "as is", "no warranty", "not fit for any purpose" cute clauses common to OSS would no longer apply as this is clearly skirting around the fact that you made a tool to do a specific thing, and harming your computer isn't the intended outcome.
You can already enforce responsibility via contract but sure, some kind of licensing board that can revoke a license so you can no longer practice as a SWE would help with pushback against client/employer pressure. In a global market though it may be difficult to present this as a positive compared to overseas resources once they get fed up with it. It would probably need either regulation, or the private equivalent - insurance companies finding a real, quantifiable risk to apply to premiums.
Trouble is, the bridge built by any licensed engineer stands in its location and can't be moved or duplicated. Software, however, is routinely duplicated and copied to places that might not be suitable for its original purpose.
I’d be ok with this so long as 1) there are rules about what constitutes properly built software and 2) there are protections for engineers who adhere to these rules
Far from being douchey, I think you've hit the nail on the head.
No one is perfect, we're all incompetent to some extent. You've written shitty code, I've definitely written shitty code. There's little time or consideration given to going back and improving things. Unless you're lucky enough to have financial support while working on a FOSS project where writing quality software is actually prioritized.
I get the appeal software developers have to start from scratch and write their own kernel, or OS, etc. And then you realize that working with modern hardware is just as messy.
We all stack our own house of cards upon another. Unless we tear it all down and start again with a sane stable structure, events like this will keep happening.
I think you are correct on that many SWEs are incompetent. I definitely am. I wish I had the time and passion to go through a complete self-training of CS fundamentals using Open Course resources.
> I honestly believe there's A LOT of incompetence in the tech-world
I can understand why. An engineer with expertise in one area can be a dunce in another; the line between concerns can be blurry; and expectations continue to change. Finding the right people with the right expertise is hard.
100%. What we've seen in the last couple of decades is the march of normies into the technosphere, to the detriment of the prior natives.
We've essentially watched digital colonialism, and it certainly peaks with Elon Musk's wealth and ego attempting to buy up the digital marketplace of ideas.
Applying rigorous engineering principles is not something I see developers doing often. Whether or not it's incompetence on their part, or pressure from 'imbecile MBAs and marketers', it doesn't matter. They are software developers, not engineers. Engineers in most countries have to belong to a professional body and meet specific standards before they can practice as professionals. Any asshat can call themselves a 'software engineer', the current situation being a prime example, or was this a marketing decision?
You're making the title be more than it is. This won't get solved by more certification. The checkbox of having certified security is what allowed it to happen in the first place.
No. Engineering means something. This is a software ‘engineering’ problem. If the field wants the nomenclature, then it behooves them to apply rigour to who can call themselves an engineer or architect. Blaming middle management is missing the wood for the trees. The root cause was a bad patch. That is developments fault, and no one else’s. As to why this fault could happen, well the design of Windows should be scrutinised. Again, middle management isn’t really to blame here, software architects and engineers design the infrastructure, they choose to use Windows for a variety of reasons.
The point I'm trying to make is that blaming "MBAs and marketing" shifts blame and misses the wood for the trees. The OP is on a holier-than-thou "engineer" trip. They are not engineers.
I think engineering only means something because of culture. It all starts from the collective culture of the people who define and decide which principles are to be followed and why. All the certifications and licensing that are prerequisites to becoming an engineer are outcomes of the culture that defined them.
Today we have pockets of code produced by one culture linked (literally) with pockets of code produced by completely different ones, and somehow we expect the final result to adhere to the most principled and disciplined culture.
Not entirely true. The company I worked for, major network equipment provider, had a customer user group that had self-organised to take it in turns to be the first customer to deploy major new software builds. It mostly worked well.
This is the thing that gets me most about this. Any Windows systems developer knows that a bug in a kernel driver can cause BSODs - why on earth would you push out such changes en-masse like this?!
In 2012 a local bank rolled out an update that basically took all of their customer services offline. Couldn't access your money. Took them a month to get things working again.