Geeking with Greg: 01/01/2006

Tuesday, January 31, 2006

Talk on new economics of media

Umair Haque has an interesting talk on the "New Economics of Media" (PPT).

To summarize, the talk argues that lower barriers on media publication increase the abundance of media and the importance of attention. We need tools that surface what is relevant and interesting, that help people find what they need.

I particularly liked slide 54 on smart aggregators:

The Aggregator 2.0

Allows consumers to navigate complex media landscapes by efficiently allocating scarce attention according to preferences and expectations

Leverage deep information about content to predict utility derived by consumers, slashing search and transaction costs of consumption

Examples:
Collaborative filters
Recommendation & rating systems
Similarity & difference filters
Etc.
Smart aggregation is aggregation of content plus aggregation of information, expectations, and preferences about content

And slide 57 on reconstructors:

The Reconstructor is the aggregator 3.0

Deconstruct micromedia by altering, remixing, and filtering microchunks to reconstruct 'casts of personal media ... [For example,] blog entries from individual blogs, tracks from individual playlists

See also Scott Karp's summary of the talk.

See also my previous posts, "The RSS explosion" and "RSS sucks and information overload".

Monday, January 30, 2006

Early Amazon: Xmas at the warehouse

Most retailers do most of their sales in the fourth quarter around Christmas. Amazon was no different.

This creates a bit of a problem. For physical retailers, parking lots are full, checkout lines grow, and stores become crowded. For online retailers, traffic on the website spikes, databases are strained, customer service overwhelmed, and the warehouses buried.

The warehouse is a particularly interesting story. The huge influx of orders meant a huge outflow of shipments. It's a nice problem to have, but someone has to pick those books and pack them in boxes so they get out the door.

Someone was us. For much of November and December, everyone at Amazon who wasn't holding the wheels on the website was either answering customer service or packing books at the warehouse. Everyone means everyone. Jeff Bezos, the CTO, software engineers, web devs, marketing guys, editors, everyone.

It may sound like a burden, but it was actually quite interesting. Working in the warehouse means learning everything about how a book gets from a virtual order in a database to a physical package on your doorstep.

In those days, Amazon only had one warehouse, a scrappy, cluttered building in the industrial neighborhood of south Seattle. By today's standards, the building was tiny. Amazon's largest distribution center can hold 13 football fields now; that first warehouse might have been satiated with just one or two.

I spent many days shuffling through the warehouse. I pulled incoming books off the truck and placed them on the shelves. I picked books for orders; my favorite was always singletons, one pick and you're done. I packed orders into boxes, making sure to slap the mailing label onto my shirt before closing the box so it didn't get lost. The only thing I didn't do was gift wrapping. A gorilla had more talent with wrapping paper than I.

I touched thousands of shipments on their way out the door. Who knows, that book you ordered one Christmas may have been shipped by me. Or, perhaps, by Jeff Bezos.

Packing books in the warehouse continued for many years, at first out of necessity, then just to give us stuffy people in corporate HQ a clue about life on the front lines. I might have complained at the time, but those days in the warehouse were a remarkable experience.

Saturday, January 28, 2006

Google and the tipping point

After fawning over the upstart search engine, are the press and public opinion starting to turn against Google?

Ivan Fallon at The Independent describes the mood:

It has been the week from hell for Google.

Once the much-loved and unblemished hero of the web, the giant internet group has suffered a series of blows that have exposed for the first time its feet of clay. The company that stood for "freedom of the net" is accused of humiliatingly submitting to Chinese censorship, conniving at the suppression of freedom in Tibet, exploiting the work of American writers and of running what is arguably the biggest porn and violence website in the business.

The biggest worry of all ... was the abrupt shift in sentiment among its almost messianic customers, who are suddenly asking awkward questions .... Google's problem is that the world expects better of it. It had stood up to the US government and championed free speech, but now ... it has lost the high ground.

Every company contains the seeds of its own destruction, and it may be that even Google, the miracle of the new media age, has reached the tipping point in the past week.

As John Battelle said, "Google? We thought...well, we thought you were different."

[Found on Findory]

Update: Dennis Kneale at Forbes has a more even-handed take in his excellent article, "Gunning for Google".

Early Amazon: Dogs

Amazon allowed dogs in the office. The most famous was Rufus.

Rufus was the proud child of two early Amazonians. He was a friendly, fun Corgi, a pleasure to have around.

In the deteriorating Columbia Building on 2nd Avenue, having dogs was not much of a problem. Water leaks might have been a problem, but dogs were not.

Once Amazon moved to fancier surroundings, it took some effort to find a pet-friendly building. Apparently, one of the sticking points on the PacMed building lease was whether dogs would be allowed. While I find it hard to imagine Rufus ever would have been banished, some other doggies may have.

This was quite serious. The bond between man and canine was so strong that some threatened to leave Amazon if their furry companions were banned.

Ultimately, PacMed allowed dogs. The battle was won. Dogs, once again, roamed free.

To this day, Rufus has a presence on the Amazon.com website. If you fail to find a page, Rufus will be there to help.

Friday, January 27, 2006

Early Amazon: Inventory cache

Like the projects at Google that have come out of 20% time, what people are supposed to be working on at Amazon sometimes could be less important than what they played with on the side.

I was working on a few projects, but I wanted to step out and learn more parts of the code. Just reading code gets to be dull. I needed a specific task in mind, a purpose that forced me to shine my light into the back corners of the source.

So, in idle moments, I wandered off looking for performance optimizations. Focusing on the high traffic pages -- home page, book detail pages, search results -- I asked, where was big bad obidos spending its time?

I turned up some interesting tidbits. The first thing I found had to do with shopping carts.

When you walk into the grocery store, the first thing you probably do is grab a shopping cart. Similarly, the first thing Amazon did when someone appeared at our store was hand them a shopping cart, reserving a little bit of space in our database for them to store all their virtual loot.

However, grocery stores don't have to contend with robot hordes or other window shoppers. If they did, they would have to have a lot more shopping carts around, and almost all those shopping carts would be empty.

Given all the looky-loos, it makes more sense for Amazon to wait a bit and quickly slip a shopping cart into your hands when you grab the first thing you want to buy.

This little change helped more than you might think. All those shopping carts add up.
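Here is a minimal sketch of that lazy allocation in Python. It is not the original Amazon code; `CartStore`, the session ids, and the in-memory dictionary standing in for the database are all hypothetical names chosen for illustration.

```python
class CartStore:
    """Hypothetical stand-in for the database table that holds shopping carts."""

    def __init__(self):
        self.carts = {}  # session_id -> list of item ids

    def add_item(self, session_id, item_id):
        # Allocate the cart only at the moment the first item is added.
        cart = self.carts.setdefault(session_id, [])
        cart.append(item_id)
        return cart

    def view_cart(self, session_id):
        # Browsing never allocates storage; an absent cart reads as empty.
        return list(self.carts.get(session_id, []))


store = CartStore()
for visitor in ("robot-1", "robot-2", "browser-3"):
    store.view_cart(visitor)           # window shoppers cost nothing
store.add_item("buyer-4", "book-123")  # first purchase creates the cart
print(len(store.carts))                # prints 1
```

Three visitors looked around and only the one who bought something ended up with a cart.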

But a bigger issue was the real time availability lookups. When you looked at a book at Amazon, the site went off to the warehouse, rummaged among the shelves, and checked if we had any copies. If it turned up bupkis, it checked how quickly we could order the book. All in real time.

This turned out to be the single most expensive operation on a book detail page. Ugly business, checking availability.

But, do you really need to know the availability right now? Maybe knowing what it was N minutes ago is okay. Huh, right, cache the data. It's okay if it is a little stale.

Because I was doing this on my own time, I started playing with some less obvious methods of doing availability caching. I thought that, given how much this would be hammered by the site, I might try to find a way to minimize locking. I also thought that I might be able to load the cache preemptively, so there would be no delays to shoppers on the site when refreshing the parts of the cache.
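To make the idea concrete, here is a rough sketch in Python of a cache that tolerates slightly stale data, can be loaded ahead of time, and never makes readers wait on a lock. It is an illustration under my own assumptions, not the code I wrote back then: `AvailabilityCache`, `lookup`, `max_age_seconds`, and `clock` are invented names, and the lock-free read relies on CPython rebinding an attribute in a single step.

```python
import time


class AvailabilityCache:
    """Hypothetical stale-tolerant cache in front of a slow availability lookup."""

    def __init__(self, lookup, max_age_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup            # the slow, real-time availability check
        self._max_age = max_age_seconds  # how stale an answer may be
        self._clock = clock
        self._snapshot = {}              # item_id -> (availability, fetched_at)

    def get(self, item_id):
        entry = self._snapshot.get(item_id)
        if entry is not None and self._clock() - entry[1] <= self._max_age:
            return entry[0]              # fresh enough: skip the slow lookup
        # Miss or too stale: take the slow path and remember the result.
        value = self._lookup(item_id)
        self._snapshot = {**self._snapshot, item_id: (value, self._clock())}
        return value

    def refresh(self, item_ids):
        # Preemptive reload: build the new snapshot off to the side, then
        # swap it in with one assignment so readers never wait on a lock.
        now = self._clock()
        fresh = dict(self._snapshot)
        for item_id in item_ids:
            fresh[item_id] = (self._lookup(item_id), now)
        self._snapshot = fresh


calls = []


def slow_lookup(item_id):
    calls.append(item_id)  # count trips to the "warehouse"
    return "in stock"


cache = AvailabilityCache(slow_lookup, max_age_seconds=300.0)
cache.refresh(["book-1", "book-2"])  # warm the cache ahead of shoppers
cache.get("book-1")                  # served from the snapshot
print(len(calls))                    # prints 2
```

The trade-off in this sketch is that two writers racing each other can lose an update, which only costs an extra slow lookup later. For data that is allowed to be a little stale, that seemed like a fair price for never blocking a shopper.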

I hacked up something that seemed to work well. In tests, latency to a shopper on the website dropped from entirely too long to very near zero. I was starting to talk to a few other people about the prototype, asking what they thought, seeing how it could be improved.

Right about then, several other people were working on a major redesign of the Amazon site, some combination of an extreme makeover and new features. I was approached by someone who wanted to show book availability on search result pages, something that was completely impossible without caching, but would be possible if my quick prototype could be dressed up and pushed out the door. And out it went.

Of course, all of this is obsolete at this point. Back when I built the inventory cache, it was designed for one small Seattle warehouse and a single big honkin' iron webserver. The massive inventories across several huge distribution centers -- some of which can swallow thirteen football fields and come back for seconds -- combined with a switch to a cluster of commodity webservers eventually made the old cache inappropriate. It lasted well beyond its time, so long that the heroics of its youth lay forgotten under the problems of its senility.

Today, I look back at the inventory cache as just one of many examples of the benefits of time to wander. 20% time has value well beyond its proportions.

Thursday, January 26, 2006

Early Amazon: boy-am-i-hard-to-please

Quite a bit of Amazon's early code was written by the first two employees. Nearly all of obidos and many of the tools outside of obidos showed their fingerprints.

Somehow, and I have no idea how, these two managed to find time to do some fun side projects. One of my favorites was called Eyes.

Eyes was well before its time. It allowed readers to sign up to get an e-mail any time a new book came out that matched a search query. It was a great way to hear about new releases, especially in non-fiction.
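The core of a service like this is small. Here is a minimal Python sketch of matching a new release against saved queries; I don't know how Eyes actually did its matching, so the substring rule, the `Book` fields, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    subjects: tuple


def matches(query, book):
    # Assumed rule: a query matches when it appears, ignoring case,
    # in the author name or in any subject.
    q = query.lower()
    return q in book.author.lower() or any(q in s.lower() for s in book.subjects)


def subscribers_to_notify(subscriptions, book):
    # subscriptions maps an e-mail address to that reader's saved queries.
    return sorted(
        email
        for email, queries in subscriptions.items()
        if any(matches(q, book) for q in queries)
    )


subscriptions = {
    "reader-a@example.com": ["astronomy"],
    "reader-b@example.com": ["Jane Doe", "cooking"],
    "reader-c@example.com": ["gardening"],
}
new_book = Book("A Hypothetical Title", "Jane Doe", ("Astronomy", "Physics"))
print(subscribers_to_notify(subscriptions, new_book))
# prints ['reader-a@example.com', 'reader-b@example.com']
```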

The page describing Eyes was amusing. It said:

Eyes, your automated searcher, is amazing. Tell it what authors and subjects interest you, and it will track every newly released book matching your interests, author by author and subject by subject. Sign up with Eyes and we'll send you e-mail when the books you want to know about are published.

If you don't think this free service is both cool and useful, please send mail to boy-am-I-hard-to-please@amazon.com and tell us why.

Eyes lived on for many years. It was eventually replaced by Amazon Alerts, a good service, but one that is missing the snarkiness of the old.

Sometimes, memories lurk deep only to surface as strange inside jokes. From the beginning of Findory, one of the customer service e-mail addresses has been boy-am-i-hard-to-please@findory.com.

Microsoft Live Labs and Search Labs

Microsoft announced two new R&D groups, Microsoft Live Labs and Microsoft Search Labs:

Live Labs will investigate a broad and comprehensive set of research topics such as multimedia search, machine learning, distributed computing and data mining, and will engage in rapid prototyping and the incubation of disruptive technologies.

Search Labs will focus on areas such as personalization, socialization and improved user experiences while maintaining strict regard for user privacy.

It is interesting that Search Labs is focusing on personalized search and social search. With Yahoo's efforts in social search and Google's personalized search already launched, this move may be a reaction to a perception that Microsoft is falling behind in these areas.

I found another part of the announcement also to be telling:

Unlike basic research, which is geared toward visionary discoveries that may or may not end up in actual products, and product development, which is feature-focused and geared toward solving tactical engineering problems, Live Labs' applied research will study the relationship and applicability of theories or principles to the solution of a problem or an actual product or service.

At Google, the hordes of PhDs are integrated into the broader organization. Everyone is doing applied R&D.

At Microsoft Research, the researchers are isolated. This preserves their independence, but makes it quite hard for the good ideas in MSR to make it to the product teams, a problem that is aggravated by an NIH attitude in the product teams that extends even to MSR.

Live Labs and Search Labs sound like an effort to bridge that gap, to be a little more Google-like with their brainiacs.

A bit more information on Microsoft Live Labs is available at labs.live.com.

See also articles by Gary Price and Todd Bishop.

See also my previous post, "Microsoft adLab and targeted ads".

Update: Interesting thoughts from Nathan Weinberg:

Everyone, both people working at Microsoft and outsiders, agrees that MS gets outdone by three-person startups that can be more nimble, more reckless and more innovative.

Live Labs ... is free from the restrictions normally imposed on development teams. They will be able to work without worrying about how their product affects existing teams and existing revenue models, with the end results being the sole purpose of the team.

In a sense, it's a startup within Microsoft.

On a related note, Guy Kawasaki, in his post "Intrapreneurship", has some good advice for those seeking to create a team that acts like a startup within a much larger firm.

Wednesday, January 25, 2006

Early Amazon: Get big fast

Amazon wasn't the first online bookstore, but it grew to be the biggest.

I love books. I loved selling books. It's a noble mission, spreading knowledge, promoting education, encouraging reading. It felt great to be growing so fast, doing something I loved and respected.

And, in the middle of 1997, Amazon was growing fast. To us software engineers, it seemed almost too fast. When building new systems, we'd often design for x4 current load, only to see x4 current load in just a few months. Often, it felt like it was all we could do to keep the wheels on.

But to Jeff, it wasn't fast enough. Jeff saw that Barnes & Noble and other booksellers would come online eventually. Amazon had to be big enough to stand our ground when those giants noticed the upstart challengers and finally turned our way.

"Get big fast", Jeff said. And we grew.

This picture is of the back of the T-shirt from the 1997 Amazon.com summer picnic. Like everything at Amazon, it was a frugal affair. A few kegs out on the lawn. A band made up of Amazon employees (they were good). And, yes, cheap hot dogs.

Looking back at those days now, Amazon succeeded beyond our wildest dreams. We had a small online bookstore with hopes of becoming a big bookstore.

Amazon is now a superstore, truly a Wal-Mart of the Web, selling books, music, movies, software, electronics, food, hardware, apparel, toys, and more.

"Get big fast," Jeff said. And we did.

New detail in Google Maps

Chikai Ohazama announces on the official Google blog that Google Maps has new levels of detail available.

Fun stuff. Check out this remarkable view of the Frank Gehry building in downtown Seattle.

Tuesday, January 24, 2006

Early Amazon: BookMatcher

When I interviewed at Amazon, I arrived full of ideas. I rattled off a dozen ways I thought I could improve the Amazon website, including an idea about offering book recommendations.

I love books. I loved the idea of a recommender system for books. What fun, discovering new, interesting books. I wanted to make it happen.

In my first week at Amazon, I was disappointed to discover that some at Amazon were already working on a book recommender system. Early users of Amazon.com might remember it. It was called BookMatcher.

BookMatcher users would start by rating 20-30 books, then get recommendations. As you can imagine, few used it. The hurdle of rating 20+ items was too much for most users.

The software that powered BookMatcher was provided by an outside firm. Unfortunately, it didn't work very well. The initial hurdle was just one issue. The recommendations tilted toward bestsellers and away from the tail. The system was slow. Topping it all off, the system was unreliable, falling down under load several times.

This wasn't what Amazon needed. Book recommendations at Amazon needed to work from sparse data, just a few ratings or purchases. It needed to be fast. The system needed to scale to massive numbers of customers and a huge catalog. And it needed to enhance discovery, surfacing books from deep in the catalog that readers wouldn't find on their own.

There had to be a better way.

Yahoo gives up?

Yahoo CFO Susan Decker said:

We don't think it's reasonable to assume we're going to gain a lot of share from Google. It's not our goal to be No. 1 in Internet search. We would be very happy to maintain our market share.

Very happy just to maintain? Is Yahoo entering its senior years? Very happy if they just don't decline any further?

Even if Susan believes this to be true, this statement is utterly destructive. Who wants to work at a company that has lost its edge? Who on the Yahoo search team wouldn't look at this and think, maybe I should go 5.8 miles down the road to a place where they do care about making the best search?

[via Steve Rubel and Danny Sullivan]

Monday, January 23, 2006

The value of recommendations

Laurie Flynn at the New York Times reports on the value of recommendations for e-commerce companies. Some excerpts:

Web technology capable of compiling vast amounts of customer data now makes it possible for online stores to recommend items tailored to a specific shopper's interests. Companies are finding that getting those personalized recommendations right - or even close - can mean significantly higher sales.

At NetFlix ... roughly two-thirds of the films rented were recommended to subscribers by the site - movies the customers might never have thought to consider otherwise. The company credits the system's ability to make automated yet accurate recommendations as a major factor in its growth from 600,000 subscribers in 2002 to nearly 4 million today.

Similarly, Apple's iTunes online music store features a system of recommending new music as a way of increasing customers' attachment to the site and, presumably, their purchases.

The article mentions Amazon.com too but offers few details. I am surprised that the article doesn't explore Amazon in more depth given that, as with Netflix, the recommendations lead to a substantial percentage of Amazon's sales.

See also my previous post, "Zen and the art of Amazon recommendations".

[Found on Findory]

BusinessWeek on Yahoo's Social Circle

Ben Elgin at BusinessWeek writes about Yahoo's community and social software strategy. Some excerpts:

By cultivating online communities -- and encouraging people to tap into the collective knowledge of these groups -- Yahoo is hoping to change the way people find information online ... "social search"

All major engines analyze the link structure of the Web as a key ingredient in determining what pages are most relevant ... Social search aims to shift power from Web publishers, who create these links, to everyday Internet users by examining their bookmarks or giving them tools to express their opinions.

It sounds great, but there are plenty of skeptics. ... Some question whether enough Internet users will spend the time on these sites needed to make them effective ... Others doubt the wisdom of crowds will offer much of an upgrade over the feats of raw computing power. "It really adds very little value to what is available now," says Raul Valdes-Perez, CEO of Vivisimo ... "The best description of a document is the document itself."

Google's long-term bet remains on personalization -- using its mammoth computing horsepower to sort through data and better discern what users are thinking.

For more thoughts on tagging and social search, don't miss Danny Sullivan's old posts, "Tagging Not Likely The Killer Solution For Search" and "Yahoo My Web Tagging & Why (So Far) It Sucks".

See also my previous posts, "Yahoo gets social with MyWeb" and "Questioning tags".

Saturday, January 21, 2006

Google recommended news

Google News just launched news recommendations!

If you are signed in to Google News, do a search or two and click on a few stories. Come back to the Google News front page. There should be a section added to the front page, "Recommended for X", with a list of recommended stories.

Just like early experiments with personalized news at MSNBC and MSN Newsbot, only the "Recommended for X" section seems to change, not the entire front page.

So, how does it work? From the help page:

Google News can suggest news stories just for you. If you have Personalized Search enabled, you can sign in to your Google Account to get recommended news stories based on your past news selections.

Google News can ... compare your tastes to the aggregate tastes of other groups of similar Google News users. Simply put, we recommend news stories to you that have been read by many other users who've also read similar stories as you in the past.

Personalization comes to Google News.

It is unclear exactly what technique Google News is using for their news recommendations. Their description implies they are using some kind of collaborative filtering or clustering-based technique.

However, when I used the news widget in Google Sidebar (part of Google Desktop), which also learns from the articles you read, it appeared that the recommendations were based on subject categories. So, read a technology article, get another technology article.

In my usage of the new Google News feature, I found the recommendations also seemed to be showing most popular articles based on topic. Clicking on three articles on Google brought up articles on space probes, digital music, WiFi, and intelligent design. A quick check of Google's most popular articles for tech and science turned up many of these articles. Clicking on a few more articles about Google did not change the recommendations noticeably.

This would suggest that the underlying recommendation engine either is using some combination of subjects and similar subjects (e.g. read an article on internet technology, get recommendations of popular articles in subjects similar to internet technology) or a clustering approach (build a large number of groups of different readers with various interests, match you to the most appropriate cluster based on your history, recommend things popular with other readers in the cluster).
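As a rough illustration of the clustering variant, here is a small Python sketch. It is my guess at the general shape, not Google's implementation: the cluster names, the overlap rule for picking a cluster, and the functions `best_cluster` and `recommend` are all hypothetical.

```python
from collections import Counter


def best_cluster(history, clusters):
    # Pick the cluster whose members' combined reading overlaps most
    # with this reader's history (ties go to the first name alphabetically).
    def overlap(name):
        read_in_cluster = set().union(*clusters[name])
        return len(read_in_cluster & history)

    return max(sorted(clusters), key=overlap)


def recommend(history, clusters, k=3):
    # Recommend what is most popular inside the matched cluster,
    # skipping anything the reader has already seen.
    name = best_cluster(history, clusters)
    counts = Counter(article for member in clusters[name] for article in member)
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [article for article, _ in ranked if article not in history][:k]


# Each cluster is a list of readers; each reader is the set of articles read.
clusters = {
    "tech": [
        {"wifi", "digital-music", "google"},
        {"wifi", "google", "space-probe"},
        {"wifi", "digital-music"},
    ],
    "sports": [{"playoffs", "trade-rumors"}, {"playoffs"}],
}
print(recommend({"google"}, clusters))
# prints ['wifi', 'digital-music', 'space-probe']
```

Notice that once the cluster is chosen, the ranking is nothing more than popularity within that cluster.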

That's okay, but the problem is that the recommendations will tilt toward the popular and away from the tail. You want the recommendations to be surprising, to enhance discovery, to help you find things you wouldn't have found on your own. Clustering or subject-based techniques won't get you there.

Very cool stuff. Now that Google is doing both personalized search and personalized news, I expect we'll see a lot more activity in this area.

See also reviews and thoughts from Gary Price, Philipp Lenssen, and Steve Rubel.

Update: If you are signed in and still cannot see the recommendations, make sure Search History is enabled.

Update: Nathan Weinberg posts a review. He says, "If Google's personalization engine works, it is either very subtle, or needs a lot of data." It's a great point. Changing immediately, obviously, and appropriately on any new data is important. Google seems to have missed that.

Update: Principal Scientist Krishna Bharat announces on the official Google Blog that Google News is now out of beta and talks about the new recommendation feature: "You'll receive recommended news stories based on the previous stories you've read ... All of this is done automatically using algorithms. For example, we might recommend news stories to you that many other users have read, especially when you and they have read similar stories in the past."

Friday, January 20, 2006

Early Amazon: Door desks

Amazon often seems to be lumped into the world of dot com excess, but it never enjoyed the Aeron chairs or free massages of the VC heavy startups.

Amazon was a frugal culture. When I joined, the health plan was a high deductible plan that offered little comfort, catastrophic coverage only. Salaries were low, the worst offer I received, as I recall.

But the quintessential example of Amazon's frugality was the door desk.

Door desk, you say? Okay, I have a door. How do I make that into a desk?

Leave it to Jeff Bezos. Buy a wooden door, preferably a hollow core wooden door with no holes predrilled. Saw a couple 4" x 4" x 6' pillars in half. Bolt them to the door with a couple of scary looking angle brackets. Put it in front of a programmer. Door desk.

In addition to being inexpensive, door desks offered a lot of surface area. Put your computer monitor on top and it barely makes a dent. If Amazon wasn't so bloody cheap, you could have put three more monitors up there. Plenty of space for all of your crap.

Ergonomically, door desks leave a lot to be desired. Keyboards were usually too high. Typing for hours could be uncomfortable. And those angle brackets have sharp edges; accidentally scraping exposed flesh against those was a mistake that wouldn't be repeated.

But door desks came to symbolize the Amazon frugal culture. They took on a life of their own. Years later, in 2001, there was a 6.8 magnitude earthquake here in Seattle. We were in a different building by then -- the attractive PacMed building up on Beacon Hill above downtown Seattle -- but we still had our door desks. And I can't tell you how happy we were to have that door desk over our heads as the building shuddered and swayed around us.

Thursday, January 19, 2006

Early Amazon: Group discounts

What do you get when you pair an eager but inexperienced graduate student with a freshly minted MBA? Yep, you know where this is going.

My first project at Amazon was group discounts. Now, I'm no marketing guy, but my understanding is that one of the tools marketing folks like to use is discounting. Want more shoppers? Offer a sale. Want new customers? Give them a discount on their first purchase.

Selective discounting did violate some of the basic philosophies of Amazon. Jeff envisioned Amazon as the "Wal-Mart of the Web", a place with every day low prices. Nevertheless, Amazon's fledgling marketing group felt they were missing a valuable tool. They wanted a way to offer discounts as part of various promotional campaigns.

Someone had to be sacrificed, and the new AI grad student programmer was the lamb. But I didn't see the project that way at the beginning. Changing pricing impacts systems throughout the code, from the website to warehouse to accounting. I was excited by this chance to dig my teeth into some juicy code.

To get started, this new hire computer geek met with the new hire marketing MBA. This project was ours, ours to define, ours to build.

Pairing two new hires without a clue is bad enough, but this was made worse by my wide-eyed enthusiasm. I started gathering up a list of what we wanted to build. The list grew and grew as ideas poured out. The answer to "do you need to..." was always "yes". Soon, we had requirements for the mother of all discounting systems, something that allowed individual discounts, group discounts for everyone coming to Amazon through a specific URL, limited time promotions, buy X get Y free, and every other promotional offer you could imagine.

Fear started to set in. Fresh out of school, I didn't know much, but I had heard others warn of being crushed under the steaming turd ball of project bloat. And this was starting to smell of steaming turd ball.

I tried narrowing the scope of the project. Having investigated the code changes required, I proposed offering simpler coupon codes instead of group discounts. In addition to having less complicated offers (all of the form "$X off a purchase of at least $Y"), the discounts would only appear at the end of the order form instead of on nearly every page on the website.
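Part of the appeal was how little code the coupon rule needs. Here is a minimal Python sketch of an offer that takes a fixed dollar amount off any order at or above a minimum; the function name and parameters are hypothetical, and this is an illustration rather than anything Amazon shipped.

```python
from decimal import Decimal


def apply_coupon(order_total, amount_off, minimum_purchase):
    # The discount applies only when the order reaches the minimum,
    # and it can never push the total below zero.
    if order_total < minimum_purchase:
        return order_total
    return max(order_total - amount_off, Decimal("0.00"))


print(apply_coupon(Decimal("42.50"), Decimal("5.00"), Decimal("25.00")))  # prints 37.50
print(apply_coupon(Decimal("19.99"), Decimal("5.00"), Decimal("25.00")))  # prints 19.99
```

Because the rule only touches the final order total, it can live in one place at the end of the order form.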

Coupons were soundly rejected. They wanted group discounts. They wanted the discount to appear on every page on the site. My prototypes probably did little to temper their enthusiasm. Once they saw a book detail page that said, "AARP members get an extra 5% off!" and "AARP Price: $X", it was like smack, and they were not going to be denied their prize.

So, I started coding.

To start, I knew nothing about Oracle databases. Fortunately, I worked at a bookstore. I plowed through textbooks like they were comics, learning about databases and good schema design. Despite my best efforts, the first review of the dozen or so new tables in my database schema was scathing.

I pushed on. Callbacks had to be inserted into almost every page on the site, detail pages, search results, browse pages, home pages, the shopping cart, the order form. These callbacks applied the discount and changed the wording on the page. New entry pages had to be created that greeted group discount members. Accounting and customer service tools had to be modified. New tools had to be created to create, modify, and maintain the various discounts.

It was huge. And, in the end, it was done. It all worked. Oh yeah, I got your group discounts right here.

It was an utter failure. The marketing group ran a half a dozen discount programs before deciding they wanted to deemphasize discounts in favor of every day low prices. A few years later, someone at Amazon implemented a quite successful coupon system.

Old code never dies. It lives on in the eddies between the lines. Several times, I begged to remove the group discount code, but I was always rebuffed. Those callbacks lived on for years, broken, but alive, pissing off any programmer who came across them and wasted time exploring what they were.

For years, I would cringe whenever anyone mentioned "group discounts". Even now, it pains me to hear it.

The mainstream and saving people time

In his post, "Web 2.0 Is Not Media 2.0", Scott Karp argues most Web 2.0 products miss the key feature for the mainstream, saving people time. Some excerpts:

The average person does not have much time (if any) to spend creating media and has patience for only a finite amount of choice.

Bloggers and others who put a lot of time and effort into media consumption and media creation are outliers -- people may want something more customized than the morning paper, but they still want the simplicity and leisure feel.

Most people don't have time to do a lot of voting, tagging, saving, and commenting -- there's already too much filing and sorting to do at work and with the monthly bills. For the average person, media consumption consists of reading or viewing and then moving on to something else.

These kinds of tools are only suitable for early adopters, people who like to tinker and are willing to endure some level of suffering.

But most people are lazy. If you ask them to do a lot of work, they won't do it. As they see it, you're only of value to them if you save them time. And, you know what, they are right.

See also my previous post, "People are lazy".

Wednesday, January 18, 2006

Early Amazon: The first week

Scruffy. Chaotic. Exciting. That was Amazon in early 1997.

Amazon's offices were on 2nd Avenue in Seattle, near Pike Place Market, a couple floors of a run-down brick structure called the Columbia Building. Those with a window in the front enjoyed a view of the local methadone clinic and a bizarre wig shop. You couldn't quite see the strip clubs; they skulked out of view a couple blocks away.

Now, I didn't have a window in the front, of course. That would be a bit much for the rookie, some wide-eyed student fresh out of grad school.

I had the kitchen. Space always was at a premium at Amazon, and that time was no different. My first day, I was led into my office, a card table set up in the back corner of the kitchen with a PC sitting on it.

The kitchen office was actually quite a bit of fun. I knew few at Amazon, and most people were too heads down for idle chit-chat. But they did want tea and coffee off the counter a few feet from my face. I set up a candy jar -- mmm, free candy -- and tried my best to suck knowledge out of anyone who passed by too closely.

My first assignment was to start learning the code base. Open a shell, fire up emacs, and start reading the code. I spent a few days tracing through the dispatches for different URLs, tracking how good ol' obidos -- the large CGI program that powered the website -- handled different queries, the home page, book detail pages, search, shopping cart, and the order pipeline. To this day, most URLs at Amazon still contain /exec/obidos.

Next, group discounts...

Early Amazon: The series

I have been enjoying the Xooglers series by Ron Garret and Doug Edwards about their experiences working at Google.

Inspired by their example, I thought I'd experiment with a series here on Geeking with Greg about my early days at Amazon.com.

I will cover some tidbits related to personalization, but I suspect it may be a bit off topic for this blog. If you like what I'm doing or don't like what I'm doing, please leave a comment to let me know.

Recommender systems and toy problems

In a recent literature search, I came across a paper by Guy Shani, Ronen Brafman, and David Heckerman called "An MDP-Based Recommender System".

This paper starts off by recasting the recommendation problem:

Typical Recommender systems adopt a static view of the recommendation process and treat it as a prediction problem. We argue that it is more appropriate to view the problem of generating recommendations as a sequential decision problem.

We suggest the use of Markov Decision Processes (MDP) ... [that] can take into account the utility of a particular recommendation [and] suggest an item whose immediate reward is lower, but leads to more likely or more profitable rewards in the future.

The paper goes on to say that, in their tests, "all the MC [Markov Chain] models ... yielded better results than every one of the Predictor-k models."

Sounds good so far, but the first sign of trouble is in the definition of their model:

The states in our MC model ... contains all possible sequences of user selections.

Of course, this formulation leads to an unmanageable state space with the usual associated problems -- data sparsity and MDP solution complexity. To reduce the size of the state space, we consider only sequences of at most k items, for some relatively small value of k.

As you might expect, the state space for the model has to be harshly clipped to make it manageable. But it gets worse. Look at the data sets used in the experiments:

We used two data sets, one containing user transactions (purchases) and the other containing user browsing paths obtained from web logs. We filtered out items that were bought/visited less than 100 times and users who bought/browsed no more than one item. We were left with 116 items and 10820 users in the transactions data set, and 65 items and 6678 users in the browsing data set.

116 items in the largest data set, 11k users. In contrast, Amazon has 7M+ items in their catalog and 40M+ users.
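To get a feel for why the state space has to be clipped so harshly, here is a rough sketch that counts states when a state is any ordered sequence of at most k items. The function name is made up for illustration, and allowing repeated items within a sequence is a simplifying assumption, not necessarily the paper's exact definition.

```python
def count_states(num_items: int, k: int) -> int:
    """Count ordered sequences of length 0..k drawn from num_items items.

    Assumes repeats are allowed, so there are num_items**length sequences
    of each length. This is an illustrative count under that assumption,
    not the paper's exact state definition.
    """
    return sum(num_items ** length for length in range(k + 1))


if __name__ == "__main__":
    catalogs = [("paper's transaction data", 116), ("7M item catalog", 7_000_000)]
    for name, n in catalogs:
        for k in (1, 2, 3):
            print(f"{name}: n={n}, k={k}, states={count_states(n, k):.3e}")
```

Under these assumptions, 116 items with k = 3 already gives about 1.6 million states, while 7M items with k = 3 gives on the order of 10^20.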

It is not hard to beat an algorithm designed to work on massive data sets with an algorithm that can only work on toy problems. The challenge is to find techniques that outperform existing solutions on realistic data sets.

I can't tell you how many times this happens. So many papers I come across propose some exciting new technique with claims of higher accuracy, but the technique is impractical, unable to scale to real world problems.

In this case, the paper claims its contribution is an MDP recommender system. But applying a well-known technique like MDP to product recommendations is not difficult. Finding a way to make that technique scale is difficult.

In the end, papers like this leave me with the feeling that everyone just wasted their time. The hard problems were ducked. The researchers left us tinkering with toys.

I only hope that no readers are misled into believing that their experience with small toys prepares them for the crushing data sets of the real world.

Tuesday, January 17, 2006

Recommender Systems Research at Yahoo

At the Beyond Personalization 2005 workshop early last year, several people from Yahoo presented a short paper, "Recommender Systems Research at Yahoo! Research Labs" (PDF).

The paper describes "some of the ongoing projects at Yahoo! Research Labs that involve recommender systems ... and solutions relevant to Yahoo's business."

The paper is too short to have much detail, but it does give some examples, including a content-based movie recommendation prototype that sounds vaguely similar to some stuff at IMDb.com and a more unusual project attempting to make music recommendations based on similarities in the raw audio streams. It also briefly discusses some more general questions, including the cold start problem and using content versus user behavior data.

The paper says that Yahoo plans to integrate recommendation technology into Yahoo Search and Overture, among other places. Personalized search and personalized advertising, coming to you soon from Yahoo.

Yahoo acquires SearchFox

Last week, SearchFox announced that they are shutting down.

Today, Mike Arrington at TechCrunch reports that SearchFox's assets and some employees will be acquired by Yahoo.

SearchFox never came out of closed beta, so I didn't get to see it, but it apparently was some kind of personalized feed reader. From Mike Arrington's profile of SearchFox:

Our RSS reader learns by watching what individuals and the entire community find interesting ... Existing RSS readers only show information chronologically, which quickly leads to information overload.

Our goal is to that you see what's interesting to you on the first page, rather than on the 20th page. Initial studies show that our personalization engine surfaces 50% of the interesting posts to the first page after a week of use, and reaches the 90% level after two weeks of use.

Huh, that sounds familiar ([1] [2] [3]).

I have seen some comments by people in the closed beta group who very much liked SearchFox and were disappointed by the shutdown. But I have seen other comments that SearchFox had gotten annoyingly slow even though the user base was tiny.

I am surprised that SearchFox never got out of closed beta. Even if they were desperate, why would SearchFox not open their doors to the world?

Perhaps they didn't because they couldn't. Building scalable personalization systems is hard. Techniques that work fine on toy problems completely break down at scale. The systems have to be designed from the start to do fast recommendations in real-time for hundreds of thousands of users.

Maybe SearchFox would have worked fine with a large user base. I don't know. But, if it would have, I think it is odd that they didn't just launch it.

Friday, January 13, 2006

Microsoft adLab and targeted ads

It is being widely reported that Microsoft announced their new research group for advertising, Microsoft adCenter Incubation Lab (adLab).

From the press release:

Today at the adCenter Demo Fest on the Microsoft campus, the researchers gathered to present prototypes ... [that] promise to change online advertising dramatically in areas such as paid search, behavioral targeting and contextual advertising.

adLab will be headed jointly by Ying Li, Ph.D., of Microsoft adCenter in Redmond and Jian Wang, Ph.D., of Microsoft Research Asia in Beijing, and will consist of a team of dedicated scientists with specializations in the areas of data mining, information retrieval, statistical analysis, artificial intelligence, auction theory, visual computing and digital media.

John Cook at the Seattle PI wrote, "In the next six months, the company plans to handle all of the search advertising functions on [MSN Search] ... a task that is currently outsourced to Yahoo."

The most interesting part for me is what they are planning on doing with behaviorally targeted advertising.

Shankar Gupta reports:

One product scheduled for imminent release was a behavioral targeting tool that maps users' Web-browsing habits, and allows advertisers to create their own segmentation, choosing what sort of Web-goers they want to target with MSN search ads.

Allison Linn at AP gives more details:

They aim to give advertisers a better sense of the age, gender and other traits of people who are viewing certain information online. For example, the technology could give a car advertiser the best shot at reaching women over 45, or men under 25. A movie company, in turn, could be given a better chance of reaching people who are or have recently visited sites related to entertainment.

This process sounds more like segmentation -- picking ads based on categories or clusters, usually from explicitly provided registration data -- rather than personalization -- using implicit data about interests derived from user behavior -- but it certainly could be a step toward personalized advertising.

Segmentation and personalization appear to be the key differentiators for Microsoft adCenter. Mike Grehan chatted with two members of the Microsoft team:

I asked both women what they thought would set MSN apart from Google's AdWords product. They immediately focused on years of profiling, user behavior, and data mining. In short, they know a heck of a lot more about their audience than Google knows about its own.

If Google has an Achilles heel, this is it.

See also my previous post, "Kill Google, Vol. 2".

See also my previous posts, "Google wants to change advertising", "Yahoo testing ads targeted to behavior", and "Is personalized advertising evil?"

Update: Don Dodge at Microsoft posts some interesting thoughts on the future of online advertising.

IBDNetwork interviews Findory

Findory has a short interview posted on IBDNetwork's Under the Radar weblog.

Update: The interview appears to have moved. It is now here.

Power, performance, and Google

Google Principal Engineer Luiz Andre Barroso wrote an ACM article called "The Price of Performance" where he discusses issues with power consumption in the Google cluster.

Don't miss the graphs on the first page showing that performance per watt has been relatively flat and that the cost of power over the lifetime of the hardware may soon exceed the cost of the hardware.

Luiz also makes an interesting point about the slow development of CMP (chip multiprocessor) commodity hardware, saying, "Desktop volume still largely subsidizes the enormous cost of server CPU development and fabrication, the lack of threads in the desktop has made CMPs less universally compelling."

That is a problem with building massive server clusters on commodity hardware. The hardware is cheap, but it is designed to solve a different problem, powering a box in a desktop environment.

On a related note, notebook sales exceeded desktop sales for the first time in 2005. If this trend continues, the bigger market of mobile processors may be the driving force in CPU development in the future.

It will be interesting to see if Google switches to using mobile processors for their cluster. Notebooks prioritize power consumption, which is an issue they share with Google's massive cluster.

And I do wonder how much Google would actually benefit from multiprocessor hardware. Luiz seems to suggest that they would, but I would think that the cluster is mainly bound by disk I/O. That would mean the goal is to keep as much data in memory as possible across the cluster, so additional processing per node would have less value than additional RAM.

If so, the surprising conclusion might be that switching to slower, low power mobile processors may actually increase overall throughput if it allowed more nodes and more data to be held in memory across the cluster.

Update: Seven months later, a paper on Google Bigtable mentions that Google is using machines with two dual-core Opteron 2 GHz chips (almost certainly the low power HE chips). Google seems to be putting a lot of processing in each node, more than I would have expected. The memory per node was not disclosed.

Thursday, January 12, 2006

Podcast recommendations

Matt Marshall posts about Loomia, a podcast aggregator.

Loomia has an unusual personalization feature, recommendations based on the podcasts you rate. From their About and FAQ pages:

Loomia provides search, recommendations, and personalization for podcasts and videocasts.

Rating podcast channels and items tells the recommendation system what you like and don't like. The more information that the system has, the better recommendations it can provide. Ratings also factor in whether a channel or an item is a top-rated one or not.

I'm not much of a fan of podcasts -- I find they take too much time to listen to -- but I did try it out.

Loomia is easy to use and, impressively, is able to make reasonable recommendations even with just a few ratings. It seems like a useful way to discover interesting podcasts that might be difficult to find on your own.

In my experiments, however, the recommendations seemed a bit slow and seemed to tilt toward popular items and away from the tail. It is difficult to determine if these are widespread problems.

Nevertheless, this looks like a promising effort to help people cut through the crap and surface good podcasts.

See also reviews of Loomia by Michael Arrington and Alex Williams.

Tuesday, January 10, 2006

Google Video and missing the wow

Mike at TechDirt perfectly captures my reaction to the new Google Video in his post, "We Sat Around Waiting For Google Video And All We Got Was This?":

Remember the good old days when Google used to "wow" people with their new products? AdSense. Gmail. Google Maps. These were products that took what was apparently a mature market and completely changed the game -- and did so in a way that made people say "wow."

Google ... really misplaced the magic pixie dust with Google Video.

The product clearly isn't ready for prime time ... Sure, it's "beta." But just because it's "beta" doesn't mean it should be terrible ... Something has gone horribly, horribly wrong.

As Brad Hill said, painfully disappointing.

Update: David Pogue at the NYT writes, "Review: Google Video lays an egg". [Found on Findory]

Monday, January 09, 2006

Summing collective ignorance

A month ago, when I was talking about Yahoo Answers, I wrote:

A popularity contest isn't the best way of getting to the truth.

People don't know what they don't know. Majority vote doesn't work if people don't have the information they need to have an informed opinion.

Today, I saw Nathan Torkington's post on O'Reilly Radar, "Digging the Madness of the Crowds":

Steve Mallett, O'Reilly Network editor and blogger, was very publicly accused, via a Digg story, of stealing Digg's CSS pages. The story was voted up rapidly and made the homepage, acquiring thousands of diggs (thumbs-up) from the Digg community along the way. There was only one problem: Steve didn't steal Digg's CSS pages.

Take a majority vote from people who don't know the answer, and you're not going to get the right answer. Summing collective ignorance isn't going to create wisdom.
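A quick way to see this is to compute how often a majority vote lands on the right answer. The sketch below assumes a simple yes/no question and voters who decide independently, each being right with the same probability p; the function name and those modeling assumptions are mine, not anything from the Digg story.

```python
from math import comb


def majority_correct(p: float, voters: int) -> float:
    """Probability that a strict majority of independent voters is right.

    Each voter picks the correct one of two answers with probability p.
    Uses the exact binomial sum; use an odd number of voters to avoid ties.
    """
    needed = voters // 2 + 1
    return sum(
        comb(voters, k) * p**k * (1 - p) ** (voters - k)
        for k in range(needed, voters + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6):
        for n in (1, 11, 101):
            print(f"p={p}, voters={n}: {majority_correct(p, n):.4f}")
```

In this toy model, adding voters helps only when the typical voter is right more often than not. When p is below 0.5, a bigger crowd makes the majority more confidently wrong.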

See also my previous post, "Digg, spam, and most popular lists".

On a side note, the O'Reilly Radar story mentions a site called Pligg. Like de.lirio.us for del.icio.us, Pligg is a free open source clone of Digg.

Update: There is a good discussion going on in the comments to this post.

Update: Yahoo Answers PM Yumio Saneyoshi and Yahoo My Web Community Manager Matt Stevens dropped by and left comments on this post. Well worth reading their thoughts.

Friday, January 06, 2006

A Google personalized TV ad engine?

Robert Cringely's latest column speculates that Google may change the TV advertising industry by targeting ads to your individual interests:

Google is an advertising company. Their edge is granularity.

How often do you see an ad on TV for something you're currently in the market for? I'm guessing almost never. But imagine if everyone watching "American Idol" only saw ads for things they might really buy? Or, better yet, only saw ads for things they had already expressed an interest in? The value of those same 30-second commercial slots would increase by orders of magnitude.

Google imagines a world where only single people see match.com ads, and people who can't drive see ads from taxi companies where others see Toyota campaigns. Where fraternities see ads for strip clubs, beer, Cancun weekends and LSAT prep courses, and only seniors (and their adult children) see ads for Alzheimer's drugs.

What would be the value of that increased efficiency?

We are all bombarded by advertising in our daily lives. Junk mail, ads in magazines, TV ads, it is all ineffective mass market noise pummeling us with things we don't want. It is a useless waste of time, a missed opportunity to capture a fleeting glimpse of my attention.

Advertising is content, potentially useful information about products and services. The advertisements we see should be useful and interesting, not annoying and irrelevant. Targeting advertising to individual interests, personalizing advertising, can make advertisements helpful and relevant.

See also my previous posts, "Personalized TV advertising" and "Google wants to change advertising".

[Cringely article via Michael Bazeley]

Monday, January 02, 2006

RSS sucks and information overload

Paul Kedrosky makes a great point in his post "RSS Sucks":

RSS is just a clunky high-volume replacement for web browsing. Rather than making it easier to consume information, it makes it easier to drown in context-free news, inducing that panicked feeling we all eventually learn too well when you see an RSS folder stuffed full with hundreds of unread posts.

If you use a feed reader, there must have been at least a few times you've looked at the overwhelming pile of unread articles with a sigh. So much to read. You have to go in, click on each feed, laboriously skim the articles, and slog on through the pile.

But the issue here isn't so much with RSS. RSS is just a data format after all. The problem is that the current generation of feed readers merely reformat RSS for display. They don't do anything else, no prioritization, no filtering, no help dealing with the flood of information.
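As a rough illustration of what prioritization could look like, here is a minimal sketch that orders unread items by a guess at the reader's interest instead of by time. Everything in it is hypothetical: the Item fields, the per-feed click counts, and the scoring formula are stand-ins for whatever a real feed reader would learn from behavior.

```python
from dataclasses import dataclass


@dataclass
class Item:
    feed: str
    title: str
    age_hours: float


def prioritize(unread, clicks_by_feed, shown_by_feed, limit=10):
    """Return up to `limit` unread items, most promising first.

    Score is the reader's past click rate on the item's feed, discounted
    by the item's age. Both signals are hypothetical placeholders.
    """
    def score(item):
        shown = shown_by_feed.get(item.feed, 0)
        clicks = clicks_by_feed.get(item.feed, 0)
        # Smoothed click rate so feeds with no history do not score zero.
        click_rate = (clicks + 1) / (shown + 2)
        # Older items count for less; an item one day old is halved.
        freshness = 1.0 / (1.0 + item.age_hours / 24.0)
        return click_rate * freshness

    return sorted(unread, key=score, reverse=True)[:limit]


if __name__ == "__main__":
    unread = [
        Item("noisy-feed", "Yet another press release", 1.0),
        Item("favorite-feed", "Post you would actually read", 5.0),
    ]
    clicks = {"noisy-feed": 1, "favorite-feed": 40}
    shown = {"noisy-feed": 100, "favorite-feed": 50}
    for item in prioritize(unread, clicks, shown):
        print(item.feed, "-", item.title)
```

Even a crude score like this puts the feed the reader actually clicks on ahead of a newer item from a feed they ignore, which a purely chronological view cannot do.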

I saw a quote a while back that, I think, perfectly captures the problem:

We are drowning in information but are starving for knowledge.

The problem is scaling attention. Readers have limited time. They don't want information. They want knowledge. Our job is to help them, to help them focus, prioritize, and find what they need.

Next-generation feed readers should help people find knowledge. Cut through the undifferentiated glut of information and find focus. Cut through the noise and discover knowledge.

See also my previous posts, "Turning noise to knowledge" and "Organizing chaos and information overload".

Sunday, January 01, 2006

Two years of Findory

Tomorrow, January 2, 2006, will be the two year anniversary of when I launched Findory.com.

It has been great fun building Findory as it grows and grows. And Findory sure has been growing. Below is a graph of total viewed hits on Findory.com each quarter for the last two years:

"Viewed hits" means hits on the webservers excluding robots and redirects. Viewed hits in December 2005 was 4.5M, total hits 7.1M, viewed page views 3.2M, and total page views 5.9M.

Growth was 16% month-to-month in Q4 2005, a little slower than the growth rate earlier in the year, but still healthy growth. At that rate, Findory's traffic doubles every five months.
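The doubling time follows from compound growth: with a constant monthly rate r, traffic doubles after ln(2) / ln(1 + r) months. A small sketch of that arithmetic, assuming the 16% rate holds steady (the function name is just for illustration):

```python
import math


def doubling_time(monthly_growth: float) -> float:
    """Months needed to double at a constant compound monthly growth rate."""
    return math.log(2) / math.log(1 + monthly_growth)


if __name__ == "__main__":
    months = doubling_time(0.16)
    print(f"Doubling time at 16% per month: {months:.1f} months")
```

At 16% per month this works out to about 4.7 months, which rounds to the five months quoted above.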

It is interesting to compare Findory's traffic levels with some related startups and websites. For one example, according to Alexa, Findory's traffic is roughly the same level as Rojo, Memeorandum, and MSN's Start.com.

When I finished the bookkeeping for this month, I was pleased to discover that Findory was also cash flow positive (by a few dollars) in December 2005. Fun milestone.

Including early work and prototypes before Findory.com launched, I have been working on Findory for about 2 1/2 years now. Running this scrappy, resource-starved, self-funded startup has been a remarkable experience, learning how to do everything from system administration to economical viral marketing to wading through legal goo. I look forward to what 2006 will bring.

Digg, spam, and most popular lists

Digg is a news site that allows users to submit articles and shows the most highly rated articles -- articles people "dig" -- prominently on the front page.

The site has grown rapidly in recent months, so rapidly that Wired Magazine gushed that "Digg Just Might Bury Slashdot".

Predictably, the growth increased the incentives to spam. I have seen some spam on Digg recently, so I wasn't surprised to see an announcement from the Digg team about upcoming anti-spam features. Since the same list is shown to everyone, it makes an attractive target for spammers.

I am still skeptical about Digg. Yahoo, CNN, Bloglines, and many others already have lists of most popular or most highly rated articles.

Spam is one problem, but the bigger issue is that most popular lists are always a poor match to individual interests. They tilt toward the sensationalistic and frivolous, toward the mass market and away from the tail. It is a simple herding of the hordes, not an effort to uncover the wisdom of the crowd.

See also my previous post, "Getting the crap out of user-generated content".

See also Steven Cohen's post, "Digg fights spam".

Update: Alex Bosworth posts some interesting thoughts on spam in social bookmarking systems.

Update: Russell Beattie says, "Digg.com is ... really full of crap ... The links produced are pure garbage. It seems that the ranking system they’ve created ends up promoting only the most sensationalist headlines to the front page."

Update: Rashmi Sinha has an excellent post arguing that Digg's "social structure is more like a mob" because Digg fails to maintain "independence of members from one another."

Update: Fourteen months later, Nick Wilson at Search Engine Land writes:

The Digg community is out of control. Nobody would argue otherwise, it just remains to be seen whether they can turn it around and get the more aggressive and abusive elements of the mob to stop frothing at the mouth long enough to realize that they're ruining the site.