Broadband Reports posts about what happened:
So many people were waiting for the promotion that the entire Amazon website - not just the promotion page - sank without a trace from just before 2pm, to at least 2:12pm. The home page, the product pages, everything, were unavailable.
Sounds familiar. When I was at Amazon, every year we in engineering would try to avoid spikes in traffic, especially around peak holiday loads, and every year marketing folks would want to run some promotion specifically designed to create a mad frenzy on the site. Usually, we convinced them to change the promotion, but apparently engineering lost (or was asleep at the switch) this year.
Broadband Reports goes on to point out that this reflects badly on Amazon:
We wonder how many amazon shoppers elsewhere in the site abandoned their purchases halfway through after they found their experience destroyed by the vote rush going on in the next room ... Some people got quite irate.
I don't think it quite works that way. A DDoS attack, which this effectively was, can generate way over the x10 peak load for which the website would be designed. Even so, it still is pretty lame for Amazon to DDoS itself.
The poor performance of the amazon site during the giveaway also reflects badly on the Amazon "elastic compute cloud" offering (Amazon EC2) which is designed, supposedly, to offer instant capacity to companies which need to deal with exactly this kind of sudden rush.
It appears the contest is running again next week with the same structure. I wonder if Amazon will crash itself again?
Update: It appears Amazon is looking at changing the structure of this promotion to prevent another brownout. Currently, there is a message up that says, "Due to the popularity of Amazon Customers Vote, we are extending the Week 2 voting period. Customers who cast a vote will be sent an e-mail notification of the new sale date."
Update: Mike at TechDirt reports that "Amazon Cries 'Uncle' On Promotion Traffic" by changing the rules to prevent another outage.
9 comments:
How else could Amazon get people to stress-test their network (and their entire system) except by engineering a massive overload at a predictable time?
At a relatively small cost they can see exactly how the system reacts to high loads and where the points of failure are.
You don't stress test the website right before the holiday shopping season by DDoSing it with a special. You do it way earlier in the year, in the middle of the night, with your own tests. You especially do not do it in such a way as to leave many, many customers unhappy because they had no fair shot at a good deal. Customers will not tolerate feeling like they were unfairly denied prices that others received. It is about the worst thing you can do to a customer besides ignore them or steal from them.
They sold 1000 of these. The people who got them will either use them to make a quick buck, or be happy with their purchase. Probably 50x-100x that number are pissed off at Amazon because they couldn't get in on the deal. An even greater number are pissed because Amazon went kaput in the middle of the day on a day where everyone's off work and probably thinking about holiday shopping. In other words, this generated next to no good will, and had a huge negative impact on sales, both of which are the exact opposite results that are usually intended from these Holiday Frenzy-type sales.
So, "a relatively small cost" is understating it to a large degree. That said, of course, I am only guessing based on my experience there, so maybe they only lost $10. I doubt it.
Further, even if this did point out "points of failure" it's a little late in the game to make major changes to solve those problems, and further, it's not clear that they should. They CAUSED this spike themselves. They couldn't even take normal DDoS precautions because these were all REAL customers. In no way will this situation ever come up in "real life," therefore spending money to fix it is low-return. They'd be better off just not shooting themselves in the foot like this.
Does this mean that S3 doesn't scale? :)
Any word on whether this affected Amazon's S3 or EC2 services?
Amazon doesn't use ec2 and s3 for its website. Also, you don't know where the bottleneck was - could have been an Oracle bug for all we know.
I was one of the many people who tried to get at the Xbox360 deal. What amazed me was that other parts of the site were down as well as the promotion page. I would have expected better from them.
One thing I have noticed is that, when the website is down, Amazon's web services might still be responding. Last year, when I was doing a project with ECS, and the website was down, I was still able to query ECS. With regard to EC2, I have no idea...
I'm really curious to know what happened to EC2 and S3. I recently asked one of the leaders of the web services team why developers and web site owners should trust Amazon with their sites, data, etc. when Amazon was unwilling to provide any details on how the data was stored/served/etc or any guarantees in terms of uptime and responsiveness. His response was a smug "Trust us" and he pointed out that the main site ran on identical architecture. Well, clearly the site is not invulnerable. Furthermore, I'm curious to know whether Amazon starts flipping its "instant" capacity from those services over to the main site when they need it.
I'm fairly certain that amazon does use ec2 and s3 for its web site - well, at least they use the same hardware and stuff internally that ec2 and s3 use. I've read in various places that, like google, they have the kind of IT where even the internal web sites use something like the web services APIs they offer externally.
The point is that unless amazon logically and physically separates the infrastructure, network connections and software that web services go through, a downtime of amazon is a downtime for web services. That might not be true in reverse (i.e. if they bandwidth or CPU-limit their web services apps, a rogue web site using ec2 can't take down amazon.com).
Now, to be fair, amazon's got a pretty good uptime record. This latest incident can be viewed as a "blip". But this was clearly not the intended result. They wanted people to be excited and for this to generate buzz and hopefully follow-on sales (the items were priced in a "loss-leader" fashion). This wasn't a "stress test" - it was a poorly thought out marketing promotion that pissed off both those who knew about the items but felt it was impossible to get one and those who were otherwise shopping the site and couldn't get through.
No one I've talked to, including my wife (who just tried to buy one of the non-XBOX items) was happy. Most felt like it was a scam, and the slowness and "fine print" (couldn't use one-click) made them think they should shop for bargains elsewhere.
Just a typical big corporation mistake, if you ask me. Like Walmart heavily advertising some set of items for sale on Black Friday and then having 1/10th the needed amount.
Forgive me for being perhaps overly blunt, but this is a frustratingly lame promotion for reasons way beyond the (obviously also important) DDOS'ing.
- It rewards speed over anything else. Dialup users are hosed. People with slower computers that can't refresh a browser page a zillion times a minute are SOL.
- Two out of two weeks, geeks are winning the vote. Hey, I *am* a geek, but I'm hella annoyed that no matter what the other three items are, the geek item is gonna win every time. Amazon Prime for 50 cents? Probably would still lose out to a geek toy. And you can't tell me that this accurately reflects Amazon.com's current or desired shopping demographic.
- 1000 items. How lame is that! That is a tiny, tiny drop in the bucket of goodwill, as someone else noted.
* * *
Seriously, is this all Amazon.com's marketing folks got up their sleeves for holiday time? Methinks there needs to be some more creative ideas available out there...