Tuesday, December 13, 2005

AWSP offers shell access at Alexa?

I thought Amazon Mechanical Turk was one of the strangest things I've seen in a while, but Amazon is weirding me out again with their new Amazon Web Search Platform (AWSP).

AWSP is supposed to be a developer framework for innovating on top of the crawl and index data available from Alexa. As part of this package, it appears that AWSP offers SSH access to the Alexa cluster, where you can write and run arbitrary C code.

This is either incredibly bold or absurdly foolish. On the one hand, this could be a useful platform for some developers, a utility computing server farm where you can rent machines by the CPU hour and access the incredible Web data available from Alexa. On the other hand, arbitrary C code can do arbitrary things, nicely accessing the data it is supposed to or evilly cracking the machine, fondling other people's data, and launching attacks on other servers.

You have to hand it to Amazon. They've been doing an amazing job thinking outside the box lately. But, sometimes, the box is there for a reason.

Update: In the comments, a couple of people are arguing that these accounts appear to be isolated in virtual machines and that I may be overstating the risk. They might be right. Perhaps I am being too paranoid, especially given that there are easier targets out there.


Anonymous said...

I haven't looked at it, but unless Unix has a bug, users should be protected from each other. I assume they also limit network access in various ways on those boxes.

Greg Linden said...

It's really the potential for mischief that I'm concerned about. Ideally, yeah, users would be isolated, the servers bulletproof, and network access crippled.

However, I am concerned that people will be clever about finding ways to abuse this system in truly unexpected ways. We'll find out soon enough, I suppose.

Anonymous said...

I'd assume they've spent quite some time thinking about those issues. Probably you get access to fresh VMs ("vm01"); I think it's quite within reach to lock those down appropriately.

Greg Linden said...

Thanks, Christian. Perhaps you're right. This may look no worse than the issues involved with leased shared servers, both for Alexa and AWSP users.

But, that brings up another question. Who is the target? Would companies be willing to entrust their data and code to any shared environment?

If the target is not companies, how many individuals would be willing to pay the costs of this service (a single scan of the 100T crawl data would cost at least $1/50G * 100T = $2000)?
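The arithmetic behind that estimate can be sketched as follows. This is only a rough check under the figures quoted in the comment ($1 per 50 GB of data processed, a 100 TB crawl) and assuming decimal units (1 TB = 1000 GB); the function and constant names are hypothetical, not part of any AWSP API.

```python
# Figures taken from the comment above; treat them as assumptions, not official pricing.
PRICE_PER_CHUNK_USD = 1.0   # assumed: $1 per chunk of data processed
CHUNK_GB = 50               # assumed: one chunk is 50 GB
CRAWL_TB = 100              # assumed: size of the full crawl
GB_PER_TB = 1000            # assumed: decimal units (1 TB = 1000 GB)


def scan_cost_usd(crawl_tb=CRAWL_TB, chunk_gb=CHUNK_GB, price=PRICE_PER_CHUNK_USD):
    """Cost of reading the given amount of crawl data once, in US dollars."""
    return crawl_tb * GB_PER_TB / chunk_gb * price


print(scan_cost_usd())  # one full scan of the crawl
```

Passing a smaller `crawl_tb` gives the cost of touching only part of the crawl, which scales linearly under these assumptions.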

Anonymous said...

My guess is that this is another classic case of a large company extrapolating "too far" from what's interesting internally and assuming it's interesting to anyone else.

Mechanical Turk? Interesting if you're Amazon and can't hire fast enough. Not Amazon... then you probably don't care.

Search web services? Interesting if you can't get one of your subsidiaries to innovate fast enough. Not Amazon... then you probably don't care.

And any issues like the ones you pointed out (why would you want to share so much with Amazon?) are just dismissed by insiders, even though those ultimately prove to be the reason these things (see Passport) don't catch on.

Amazon is rapidly becoming Microsoft. About 87% of what Microsoft produces never makes it out the door. Another 20% makes it out the door but ultimately doesn't catch on (everything except Xbox, Office and the OS). All the money's made in the last 10%.

For Amazon, that's product sales + marketplace sales. That's it. Everything else is R&D (or bloat, depending on how you see it).

Anonymous said...

Greg - on the crawl data costs, I'm thinking that for many vertical search applications, you'd do an initial selection on the index metadata, and avoid touching the full crawl data until you know what you want to build a new index on.

I haven't read far enough to tell if that's possible, or whether you'd just have to move each arc file across anyway.