Saturday, March 25, 2006

Early Amazon: Oracle down

The load on Amazon's website was remarkable.

As an ex-Amazonian said on one of my earlier posts, "There wasn't anything [Amazon] could buy ... that would handle the situations Amazon was encountering ... it was all ... written in-house out of necessity."

Not everything was written in-house, of course. Amazon needed a database. It used Oracle.

Unfortunately, our use of Oracle was also unusual. The strange usage patterns and high loads we inflicted on our Oracle databases exposed bugs and race conditions Oracle had never seen before.

At one point in early 1998, the database had a prolonged outage, taking the Amazon.com website with it. The database tables were corrupted. Attempts to restore the database failed, apparently tickling the same bug immediately. We were dead in the water.

My DBA skills are even weaker than my sysadmin skills. Though some, like Bob, impressively leapt at the problem with pure debugging fury, there was little that others like me could do to help. We sat on the sidelines, watched, and waited.

It was like having a loved one in the hospital. Amazon was down and stumbled every time she tried to rise.

Through the herculean efforts of Amazon's DBAs and the assistance of Oracle's engineers, the problem was debugged. Oracle sent us a patch. The helpless fear lifted. Amazon came back to us at last.

5 comments:

zippy said...

I think this is a more common experience with software than most people realize -- once you step outside the zone where 95% of the customers live, you're in unexplored territory. Everything from Oracle to Mac OS X to Linux libraries shows this trait. Customers are the best debuggers, or at least they're the best source of error cases, so if you're an atypical user, you're blazing a trail.

A similar problem bit us at Amazon subsidiary Alexa Internet. We wanted to use off-the-shelf software to do our site rankings, but our logs contained far more unique sites than the off-the-shelf software was designed to handle. Somewhere deep inside, there was a table sized to hold the unique sites for a "large" server's weblog -- maybe one with a few hundred vhosts. At Alexa, we saw hundreds of thousands (if not millions) of unique hosts in our logs (this was around 1999-2000). After several days with the manufacturer's lead coder on-site, we gave up and rolled our own.

--Pat

Chris said...

Hi!

I think you are talking about this:

http://www.ora-600.net/articles/disaster-diary.pdf

Chris

Greg Linden said...

Thanks, Chris. Starting on page 3, that document by Jeremiah Wilton (the DBA at Amazon at the time) has an exceedingly detailed description of the outage.

Thanks for pointing it out.

Denish Patel said...

I think export would have been a solution for Mr. Wilton... set up a new database and import schema by schema!

Jeremiah Wilton said...

Hi Greg, what's up? Just a brief response to Mr. Patel. The database could not be opened. Upon attempting to open it, it raised ORA-600[4000]. To use export, a database must be open. During the array corruption episode, the solution that was used required far less downtime than exp/imp would have. Sure seems funny to still be talking about an outage that took place well over eleven years ago.