Depending on the faith, the Season of Forgiveness falls at different times of year – but in the world of e-commerce, it appears that July was the month of sin and salvation. That was the month, after all, that thousands of merchants on the Etsy platform apologized to their customers, the month that Etsy itself apologized to the public, and the month that Worldpay – the group that caused the apology-fest – itself apologized for its failings.
Pity poor Etsy; the debacle wasn’t its fault at all, but still, in order to placate angry customers, the site profusely apologized for its “wrongdoing.” If you’ve done something wrong, apologies are important, of course – but shouldn’t we be striving to avoid having anything to apologize for in the first place?
Worldpay indeed owes Etsy an apology – but not just Etsy. It isn’t only Etsy and the customers it let down; it’s also shareholders, workers, and their families, all of whom will probably suffer in the wake of the damage done to the service’s reputation. The lesson for Worldpay – and any other service provider – has to be how to avoid getting into a situation where apologies are needed at all. The question for the enterprise is: is that even possible?
In a debacle that eventually stretched across three weeks, merchants who use UK-based Worldpay – among them British Airways and the UK’s National Lottery – noticed that payments were not going through. Most affected by the issue were customers of Etsy, the crafts sales platform that sells more than $1.5 billion in merchandise a year; many Etsy merchants couldn’t process payments for three weeks, resulting in angry e-mails from customers, loss of reputation – and, of course, irretrievable loss of sales.
What went wrong?
The specific reason for the outages was not revealed, with Worldpay saying initially that it was a “glitch,” and later blaming it on a software update. The world may never know the nature of that glitch, of course, but it could have been one of a million things – perhaps a poorly configured file that stopped functioning when a key piece of software was updated, or perhaps a permissions problem that prevented transactions from being executed on a server. If the history of these kinds of outages in the enterprise is any indication, chances are that even the Worldpay team may not know exactly what went wrong.
It took time, but both Etsy and Worldpay apologized to each other, and to the merchants who were caught up in the situation. On July 19th, Worldpay issued a statement saying that it was “experiencing an isolated issue with one of its gateways which is affecting a very small proportion of our customers (substantially less than 1 per cent) and a small proportion of the transactions that we process daily. Efforts to resolve the issues causing settlement delays are ongoing. We sincerely apologize for the inconvenience this has caused.” In another statement the next day, Worldpay said that “We are taking steps to implement changes, with further testing already underway, with the aim of restoring normal operational service as soon as possible, and have proactively communicated with all affected customers. We sincerely apologize for the inconvenience this has caused.”
In its own statement, Etsy told merchants on July 18th that the company was “deeply sorry for the inconvenience and frustration these delays have caused. We thank you for your continued patience and for being part of our community.” The site repeated the apology on July 25th, adding that – nearly four weeks after the problems began on July 1st – it appeared that the glitch had been resolved. That, of course, was little comfort to the merchants who lost the sales, and the goodwill, of long-time customers – and who were likely to lose at least some of them permanently, despite the apologies they themselves were forced to issue.
While the Etsy debacle garnered a lot of attention because it affected so many people on a consumer-facing site, we need to realize that the same kind of thing goes on every hour of every day on enterprise, government, business, and infrastructure sites. Such outages can be caused by routine, day-to-day IT activities – upgrades of hardware, software, or even cloud-based tools that a company uses to provide services to clients – as well as by misconfiguration, bugs (often introduced in software upgrades), traffic issues, power outages, security issues – and, of course, human error, like “fat finger” syndrome, where an IT worker simply presses the wrong button.
But according to a study by the University of Chicago, the biggest reason for outages is “unknown” – as in, the IT system is too damn complicated for workers to figure out what went wrong. Can anything be done to prevent an “unknown” glitch? Not by people who don’t know what to look for.
Given modern IT’s ever-growing complexity, there are thousands of things that can go wrong, and in order to catch such glitches proactively – as opposed to learning of their existence when all hell breaks loose – more stringent quality controls must be put in place, preferably following each and every change. This is, obviously, not something that can be done manually! With thousands of virtual machines, millions of configurable items, rapid technology evolution, and frequent daily changes, there’s just not enough time.
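To make the idea of post-change quality controls concrete, here is a minimal sketch of what an automated validation pass might look like: a set of small checks run against a snapshot of the environment after every change. Every name here – the check functions, the required services, the sample snapshot – is a hypothetical illustration, not any vendor’s actual tooling.

```python
# Hypothetical post-change validation sketch: each check inspects a
# configuration snapshot and returns a list of findings (empty = pass).

def check_cert_expiry(config, min_days=30):
    """Flag certificates expiring within min_days."""
    return [c["name"] for c in config.get("certs", [])
            if c["days_left"] < min_days]

def check_required_services(config, required=("payment-gateway", "settlement")):
    """Flag required services that are not running."""
    running = set(config.get("running_services", []))
    return [s for s in required if s not in running]

def validate(config):
    """Run every check; return only the checks that found problems."""
    checks = {
        "expiring_certs": check_cert_expiry,
        "missing_services": check_required_services,
    }
    results = {name: fn(config) for name, fn in checks.items()}
    return {name: findings for name, findings in results.items() if findings}

# A snapshot taken right after a software update:
snapshot = {
    "certs": [{"name": "gateway.example.com", "days_left": 12}],
    "running_services": ["payment-gateway"],
}
print(validate(snapshot))
```

Run automatically after each change, a pass like this surfaces problems minutes after they are introduced, instead of weeks later when payments stop going through.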
The only valid approach is to harness the power of automation. Indeed, more and more enterprises in the financial, telco, utility, retail and public sectors have come to rely on daily, automated configuration validation systems. The guiding principle is to deploy a risk detection engine, coupled with a dynamic knowledge base loaded with relevant risk signatures. Much like anti-virus tools in the end-point computing arena, such risk detection tools can harness the experience of multiple vendors and enterprises to provide a community-driven knowledge base. To date, some of the offerings in this field go as far as automating thousands of risk signature checks. With the power of automation, it is possible to proactively detect a huge portion of the issues that today remain dormant in IT – dramatically improving resilience and proactively preventing the next outage.
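The engine-plus-knowledge-base principle can be sketched in a few lines: signatures pair an identifier with a predicate over a configuration snapshot, and the engine simply runs every signature, much as an anti-virus scanner runs its definitions. The signature IDs, descriptions, and configuration keys below are all invented for illustration – real products’ knowledge bases contain thousands of such checks.

```python
# Hedged sketch of a signature-driven risk detection engine. The
# knowledge base is just data, so new community-contributed signatures
# can be added without touching the engine.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RiskSignature:
    sig_id: str
    description: str
    detect: Callable[[Dict], bool]  # returns True if the risk is present

KNOWLEDGE_BASE = [
    RiskSignature("KB-0001", "DNS configuration differs across cluster nodes",
                  lambda cfg: len({tuple(n["dns"]) for n in cfg["nodes"]}) > 1),
    RiskSignature("KB-0002", "A node is low on disk space (<10% free)",
                  lambda cfg: any(n["disk_free_pct"] < 10 for n in cfg["nodes"])),
]

def scan(config: Dict) -> List[str]:
    """Run every signature in the knowledge base; report detected risk IDs."""
    return [sig.sig_id for sig in KNOWLEDGE_BASE if sig.detect(config)]

cluster = {"nodes": [
    {"dns": ["10.0.0.2"], "disk_free_pct": 43},
    {"dns": ["10.0.0.9"], "disk_free_pct": 7},
]}
print(scan(cluster))  # both signatures fire on this snapshot
```

The point of keeping signatures as data rather than code is the same one anti-virus vendors learned long ago: the engine rarely changes, while the knowledge base grows daily as the community encounters new failure modes.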