For most of us, the holiday season means fun parties, good times, family events, and a pleasant and relaxed atmosphere. That’s for most of us; not for Santa and his staff, though. For them, this is the busy season – and thanks to worldwide supply chains, source manufacturing, international shipping via supertankers, and micro-marketing via social media and mobile apps, it’s busier and more complicated than ever.
Complications invite mistakes – for all his sleigh-flying powers, Santa is still human (though we’re not quite sure what species elves belong to). And when humans come into contact with complicated systems, errors often ensue. How can Santa protect himself, his operation, and his reputation from “service outages” or snafus that deliver the wrong toy to the wrong child – or worse, deliver nothing at all to the deserving kids who set out milk and cookies for St. Nick?
Santa, of course, is a metaphor (don’t tell your kids, though!) – but the super-complicated world of holiday gifting is all too real. How does a department store, major e-commerce site, toy manufacturer, or any other player in the holiday shopping stakes get its products to market, reach customers (retail or wholesale), collect payments, and ensure that goods get delivered on time?
Digital data is the nuts and bolts of what makes the business world run today. Cloud-based e-commerce systems, huge customer databases, and advanced shipping systems that coordinate two-day shipping between points thousands of miles apart are just a small sample of the deep technology involved in the gift business.
That all these diverse systems work together – or work at all – is something of a tech miracle, considering the many things that could go wrong. And as it turns out, many things do go wrong. From hack attacks to major service outages, hardware and services fail constantly, causing untold delays and losses for the businesses that depend on them. A recent benchmark study of cloud resiliency at some of the top US companies found significant risks of downtime and security breaches in every cloud environment tested. Most had multiple downtime risks, and the vast majority (82%) also had data loss risks.
In a study of over 100 companies, no fewer than 97% had at least one unplanned service outage in the past year. While huge outages that hit the retail sector (for example, when ATMs belonging to HSBC go on strike for a week) make headlines, few people hear about an outage in a logistics system that holds up shipping because its databases were corrupted. Yet it happens, and the price paid for those outages in lost time and efficiency – as much as $75,000 a minute – is made up for at the cash register, at both the wholesale and retail levels.
These outages are clearly due to the extreme complexity of IT systems. These systems are constantly updated to provide more, faster, and better service to users, but often the upgrade process itself is the cause of a service outage. Installing new hardware or software to improve or add services can inadvertently create conflicts and incompatibilities with other services and introduce new risks into the system.
When problems crop up, time is of the essence. According to a report by Ponemon, each minute of downtime can cost a company as much as $750,000!
But to expect an IT team, talented as it is, to immediately ferret out the source of a problem and repair it, is to expect a holiday miracle. There is no way IT teams can be familiar with the correct settings of thousands of configuration files, how they interact with services and virtual or online systems, the details of the dependencies of services, which one takes priority or requires more resources relative to others, etc. Many companies set up mirror servers before installation to test new software or add services, but those isolated test environments do not reflect the full IT environment. Usually, the only way an IT team will find out that there is a problem is when a problem ensues.
Leaving the forensics of that search in the hands of humans could take days, if not weeks – and with just X shopping days until Christmas, a company that relies on that kind of tech remediation might as well just close up shop until the spring thaw.
An automated big data system that constantly crawls a company’s IT infrastructure – checking dependencies, determining how resources are allocated, and proactively detecting problems – could help IT teams ensure full availability. When a change occurs that could cause a service disruption, the system alerts IT personnel, pointing out where the problem is and what needs to be done. The issues are highly visible, and remediation procedures are available to allow for a quick resolution before the business is affected.
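For the technically curious, here is a minimal sketch of the kind of check such a system might run. It is an illustration only, not any particular product’s method: the function names (`detect_drift`, `missing_dependencies`), the service names, and the settings are all hypothetical, and a real system would gather this data by crawling live infrastructure rather than reading hard-coded dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Finding:
    """One setting whose current value differs from the known-good baseline."""
    service: str
    key: str
    expected: str
    actual: Optional[str]  # None means the setting is missing entirely


def detect_drift(baseline: Dict[str, Dict[str, str]],
                 current: Dict[str, Dict[str, str]]) -> List[Finding]:
    """Compare current settings against a baseline and list every mismatch."""
    findings = []
    for service, expected_settings in baseline.items():
        actual_settings = current.get(service, {})
        for key, expected in expected_settings.items():
            actual = actual_settings.get(key)
            if actual != expected:
                findings.append(Finding(service, key, expected, actual))
    return findings


def missing_dependencies(dependencies: Dict[str, List[str]],
                         running: Set[str]) -> Dict[str, List[str]]:
    """Return, per service, the dependencies that are not currently running."""
    report = {}
    for service, needs in dependencies.items():
        missing = sorted(set(needs) - running)
        if missing:
            report[service] = missing
    return report


# Hypothetical data: a known-good baseline and what a crawl found today.
baseline = {
    "checkout": {"db_pool_size": "50", "timeout_s": "30"},
    "shipping": {"region": "us-east"},
}
current = {
    "checkout": {"db_pool_size": "5", "timeout_s": "30"},
    "shipping": {},
}

for finding in detect_drift(baseline, current):
    print(f"ALERT {finding.service}.{finding.key}: "
          f"expected {finding.expected!r}, found {finding.actual!r}")
```

Comparing against a recorded baseline is only one possible design, chosen here because it keeps the alert specific: each finding names the service, the setting, and the value it should have, which is the “where the problem is and what needs to be done” part of the job.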
The system might, for example, recommend changing the dependencies or contents of configuration files, or reallocating resources to provide more memory to a system that is slowing down everything else. In this way, automated detection can ensure that IT services remain live and operate properly, or at least can be quickly revived in the event of a problem. Santa and his enterprise partners are traditional sorts – but this kind of IT issue remediation technology is something that even a fellow who still hasn’t traded in his reindeer for a rocket ship can embrace.