Tuesday, March 14, 2006

Remote storage on Amazon S3

The Amazon web services team just launched Amazon S3, "a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web."

Michael Arrington at TechCrunch gushes about the new service, saying, "S3 changes the game entirely," and, "Entire classes of companies can be built on S3 that would not have been possible before."

Similarly, Mike at TechDirt says, "It's like Amazon just provided much of the database and middleware someone might need to develop a web-based app."

But there are two obvious problems with building anything on top of S3: latency and reliability.

On latency, despite Amazon's marketing goo, any time you go across the internet to a remote machine to get data, you're looking at 100ms+ of latency. Compare that to 1-3ms for local disk and effectively 0ms for local memory (or memory of local machines on a LAN) and you see the problem. There's no way you can make more than a couple of data requests to S3 in the 0.5-1 seconds you have to serve your web page in real time.
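
A rough back-of-the-envelope sketch of that budget, using only the figures quoted above. The function name and the assumption that requests are issued one after another are illustrative, not anything Amazon documents. The remote counts are upper bounds, since 100ms is a floor on the round trip and the page budget also has to cover everything else the server does.

```python
# Latency budget sketch. All figures come from the paragraph above and are
# rough estimates, not measurements.

REMOTE_MS = 100.0     # lower bound for one remote round trip ("100ms+")
LOCAL_DISK_MS = 3.0   # upper end of the 1-3ms local disk estimate


def max_sequential_requests(budget_ms: float, per_request_ms: float) -> int:
    """Number of back-to-back requests that fit entirely within the budget."""
    return int(budget_ms // per_request_ms)


if __name__ == "__main__":
    for budget_ms in (500.0, 1000.0):
        remote = max_sequential_requests(budget_ms, REMOTE_MS)
        local = max_sequential_requests(budget_ms, LOCAL_DISK_MS)
        print(f"{budget_ms:.0f}ms budget: at most {remote} remote requests, "
              f"versus {local} local disk reads")
```

Even in the best case the whole page budget buys only a handful of sequential remote requests, compared with well over a hundred local disk reads.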

Reliability is the second problem. Amazon says the system is reliable -- "99.99% availability ... All failures must be tolerated ... without any downtime" -- but you can see if they're willing to stand behind that by looking at the legal guarantees on uptime. There are none. The licensing agreement says the service is provided "as is" and "as available".
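
To put the 99.99% figure in concrete terms, here is a minimal sketch that converts an availability percentage into the downtime it permits. The function name and the 365-day year are assumptions made for illustration.

```python
# Convert an availability target into the downtime it allows.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime permitted over the period at the given availability (0 to 1)."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    minutes = allowed_downtime_minutes(0.9999)
    print(f"99.99% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99%, that works out to a little under an hour of downtime per year.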

I don't think it would be wise to use this for a serious, real-time system. As others have pointed out (in the comments to the TechCrunch post), it might be usable for an asynchronous product like online backups.

Even for online backups, there are problems. You must be willing to tolerate the lack of a guarantee of being able to recover your data once stored. And, with Amazon's fee of $15/month for storing 100GB, it would be hard to add a surcharge to support your little online backup startup on top of Amazon's fees and still have a price point on your service that is attractive to customers.
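
The pricing squeeze can be sketched as simple arithmetic. The storage rate below is derived from the $15/month for 100GB figure above; the 30% markup is a purely hypothetical number chosen for illustration, and transfer charges are left as a parameter because no rate is given here.

```python
# Reseller pricing sketch. Only the storage rate comes from the post;
# the markup and any transfer rate are hypothetical inputs.

STORAGE_RATE_PER_GB = 15.0 / 100.0  # dollars per GB per month


def monthly_cost(gb_stored: float,
                 gb_transferred: float = 0.0,
                 transfer_rate_per_gb: float = 0.0) -> float:
    """What the reseller pays per month for one customer's data."""
    return gb_stored * STORAGE_RATE_PER_GB + gb_transferred * transfer_rate_per_gb


def retail_price(cost: float, markup: float) -> float:
    """Price charged to the customer for a given fractional markup (0.30 = 30%)."""
    return cost * (1.0 + markup)


if __name__ == "__main__":
    cost = monthly_cost(gb_stored=100.0)
    price = retail_price(cost, markup=0.30)  # hypothetical markup
    print(f"cost ${cost:.2f}/month, retail ${price:.2f}/month")
```

Whatever markup the startup picks, the customer-facing price can never drop below the underlying storage cost, and any transfer charges only push it higher.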

Amazon S3 is an interesting idea. I think we will see some cool things implemented on top of S3, mostly smaller projects by hobbyists. But this is not a game changer for startups.

Update: While I do see a little support ([1] [2]) for being skeptical about Amazon S3, it is clear that I am swimming against the tide ([1] [2] [3] [4] [5] [6] [7] [8] [9]) on this one. Definitely worth reading the more optimistic folks and coming to your own conclusion.

Update: About two weeks after launch, Amazon S3 had a seven-hour outage. As I said, the two obvious problems with building anything serious on top of web services like S3 are latency and reliability.

Update: Five months later, the CEO of SmugMug raves about S3. Even though they are only using it as backup, it is a great counter-example to the arguments I made above. [via New Media Hack]

Update: Nine months later, Amazon S3 had an extended outage that caused some to question the reliability of the service and the wisdom of using it for real-time applications.

Update: A year later, Don MacAskill posts a presentation (PDF) with plenty of great details about SmugMug's experience using Amazon S3. Their experience is mostly positive, though reliability and speed are concerns.

15 comments:

Anonymous said...

Hey Greg. Good points. I didn't mean to imply that this was an enterprise-level offering or a scalable offering for large numbers of users. Instead, I'm talking about smaller "situated apps" that can suddenly be created without much difficulty. The ease of being able to just use this rather than setting up a system on your own can make it much easier to do rapid prototyping and simple design for small focused apps -- an area I think is going to be very important.

Andy Harbick said...

Definitely applicable for backup/disaster recovery. Also interesting for something like storing images (à la Flickr).

However it doesn't expose any sort of search API so you can store data but you can't find it unless you know the exact ID you're looking for. Seems to me that any intelligent application is going to want search... So you keep the data on S3 and have indexes elsewhere in your infrastructure? Why?

Anonymous said...

I would think there is a lot of room to add a surcharge on top of the $15/month for 100GB.

XDrive charges over $1100/month for 50GB!

Of course, the bandwidth charges from Amazon would increase your costs somewhat.

Anonymous said...

S3 is offering a base platform with a defined set of economics that startups will have to beat in order to be competitive. Hence my suggestion that this is sort of raising the minimum bar.

Anonymous said...

I see this being useful for web applications that require large amounts of storage, but where storage as such is not the product.

For example, a good photo sharing service is not primarily a storage service, but it would require large amounts of online storage in a way that's difficult or expensive for small startups to provision; the primary value-add is not in the storage, however, but in the photo sharing app itself.

Greg Linden said...

That's an interesting thought, Guan. But I'm having difficulty seeing exactly how that might work.

So, you'd build out your Flickr knock-off photo sharing site front-end, but only have a small pool of webservers for your app? Use Amazon S3 for storing all the photos?

So, you probably can't retrieve all the photo data in real time from Amazon S3 due to latency. At the very minimum, you'd have to cache the active set of thumbnails on your webservers. You could store inactive thumbnails and the full photo data on Amazon S3.
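
A minimal sketch of that arrangement: a read-through cache on the webserver that only goes to the remote store on a miss. The class, the `fake_remote_store` function, and the key names are hypothetical stand-ins, not the S3 API, and a real cache would also need eviction and invalidation.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve hot items from local memory; fetch from remote storage on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote
        self._local: Dict[str, bytes] = {}
        self.remote_fetches = 0  # counts slow round trips, for illustration

    def get(self, key: str) -> bytes:
        if key in self._local:
            return self._local[key]       # fast path: no network involved
        data = self._fetch_remote(key)    # slow path: remote round trip
        self.remote_fetches += 1
        self._local[key] = data           # keep it for later requests
        return data


def fake_remote_store(key: str) -> bytes:
    """Hypothetical stand-in for a remote fetch; a real one would use the network."""
    return f"photo-bytes-for-{key}".encode()


if __name__ == "__main__":
    thumbnails = ReadThroughCache(fake_remote_store)
    thumbnails.get("thumb/koala-1")
    thumbnails.get("thumb/koala-1")  # second read is served locally
    print(f"remote fetches: {thumbnails.remote_fetches}")
```

Only the first request for each item pays the remote latency, which is why the active set has to live on the webservers.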

Okay, so let's say you do all this. What savings did you get from using Amazon S3?

You still have to build out the webserver cluster and thumbnail storage cluster. You outsourced having to build a large array of disks for storing the larger image files.

Is that all that much easier? Even if it is, does the hit to response time on your website, the higher ongoing costs of using S3 over your own disk cluster, and loss of control from outsourcing this piece of your infrastructure make this a good tradeoff?

Not at all clear to me. But maybe I'm the only one.

Anonymous said...

I can't see anyone using this to build a real business. There are countless web hosting services that offer significantly cheaper storage, PHP/CGI capabilities and no latency (assuming you are building a web app hosted on their machines). Or, you buy and co-locate a machine if you need more power/storage.

If you're building an application that needs access to a massive amount of storage you wouldn't use someone else's remote web-based storage. It just doesn't make any sense.

Just like your comment about other "Web 2.0" sites turning to sh*t once they get big enough to be used by more than hobbyists, this is something that will never scale. Oh sure, amazon will have plenty of disk space and bandwidth to serve billions of people, but no business-class (read: revenue generating) service that uses a lot of either would tolerate the crappy performance and security issues.

I can't even see this being used for amazon's web merchants. They pay some hosting company a small fee so they can have a URL and some files there, and then use amazon's merchant interfaces and existing databases for everything else. They don't need S3. Unless amazon gets into the "domain/web site hosting" business and then everything is on their servers.

I guess that hobbyists who pay for a very small hosting package and need storage for their service that makes it so you never have to leave the house to interact with the 43 people who share photo libraries of koala bears with you via personalized RSS search. Maybe they need S3.

Web 2.0 is just as much bubble and hype as Web 1.0.

John Cormie said...

Internet latency doesn't *have* to be 100s of milliseconds. Having datacenters close to most of your users or edge caching can take that down to 10s of millis. And with asynchronous javascript you can make web applications *feel* responsive even though their data is more than a local disk seek away. Look at Google maps (an app that uses both these techniques) and reconsider your comment "There's no way you can make more than a couple data requests to S3 in the 0.5 - 1 seconds you have to serve your web page in real-time."

amckinnis said...

It's amazing how someone can sneeze at Google and it's front page news, while there are other very cool companies (much like yours Greg) doing cool stuff. It's nice to be "hot" until of course the time comes when you're "cold". S3 looks interesting and Amazon has been very quiet about it.

Anonymous said...

How about using the S3 service as a second tier back up, where all you house there is encrypted datasets that you go get when something really bad happens?

Anonymous said...

I suppose I'm sitting on both sides of the fence here. On the one hand, I've just added automated backup onto Amazon S3 as a feature of Cardbox, and I can definitely see some opportunities for building exciting businesses on the basis of S3.

On the other hand, there are some serious risks too.

I've been going into some detail about both the benefits and the risks of using S3 in an ongoing blog series. I'm using an (imaginary) iTunes backup service called Tunesafe as an example. The benefits were covered last week; this week I'm going through the risks, and they get worse the longer you look. By the end of the series it looks as if using S3 will put you at risk of arrest...

Anonymous said...

Check out http://www.btdigitalvault.bt.com and also http://www.bell.ca/PersonalVault

The big telcos are catching up - the differentiator is what you do with your data after you upload it. Hence a portal to share all your media and also social networking extensions.

I believe that this solution is built by a company called Casero and the website is www.casero.com

Anonymous said...

Greg, I always appreciate the way you do a "look back" on a post and give updates as time passes.

Good stuff,
Joff

Robert Vadnais said...

S3 now has a service level agreement

http://www.amazon.com/gp/browse.html?node=379654011

Why do the comments list only the time they're posted and not the date? How useless and annoying.

Greg Linden said...

Thanks, Robert, glad to see that they are doing that.

Oops, sorry about the timestamp format on the comments. I fixed that.