Wednesday, April 05, 2006

MyLifeBits, Memex, and Google Desktop Search

Gary Price recently pointed to an interesting new MyLifeBits technical report, "MyLifeBits: A Personal Database for Everything".

The MyLifeBits project at Microsoft Research is an attempt to make most aspects of a person's work and life experiences searchable. MyLifeBits captures every e-mail, every web page visited, documents read, every phone call, every bit of music, every photo. They even started taking video of the researcher's daily life and making that searchable.

As you might expect and as discussed in the paper, the project is inspired by Vannevar Bush's Memex.

The paper was a fun read, but I was shocked when I discovered they initially expected users to spend time organizing and cataloguing all this information. Unsurprisingly, they found that unworkable. Some excerpts:
With large quantities of information, users are not just unwilling to classify, but are in fact unable to do it ....

Even with convenient classifications and labels ready to apply, we are still asking the user to become a filing clerk -- manually annotating every document, email, photo, or conversation.

We have worked on improving the tools, and to a degree they work, but to provide higher coverage of the collection more must be done automatically ....

Even capture itself must be more automatic on this scale so that the user isn't forced to interrupt their normal life in order to become their own biographer.
In the future work appendix to the paper, the authors describe how automatic speech to text for audio and video, along with face and object recognition for images and video, would reduce the work required to classify and catalogue all the accumulated data.

I have to say, there is no way I would organize or tag this kind of data manually. The gigabytes of photos I have on my computer look like the "big shoebox" mentioned in the paper, and, if changing that requires any effort from me, they will never look like anything else.

In general, while many would get value from something that searched over all the data generated in their daily life, I suspect few would be willing to do substantial work to get those benefits.

In fact, I find that Google Desktop Search is approaching what I need. It already finds information about who I have e-mailed, meetings I had, documents I read, and web pages I visited.

It would be marginally more useful if it searched every phone conversation (after dealing with legal issues and improving speech to text), every photo (after improving face and object recognition), and every TV show and movie (legal issues, speech to text). But only marginally.

Having searchable video of my daily life (360 degree video 24 hours/day) might also be useful, but the privacy, legal, and technical issues there are extreme.

So, this has me wondering. How close is Google Desktop Search to the low hanging fruit, the most useful parts, of Memex?

After reading the MyLifeBits paper, I am wondering if the endpoint envisioned in that paper is really all that desirable. Perhaps we are closer than we might think to the parts of Memex we really need?

See also the Microsoft Research project, "Stuff I've Seen".

See also my Oct 2004 post, "Google Memex".


chad said...

The most useful part of the Memex, though, is having everyone ELSE's knowledge available too. I guess you could say the Google search engine plays that role for the Desktop search product.

isb said...

I agree that there is no point doing manual (or automated) classification if you have powerful search, especially when the collection size is small. However, I think there are some obvious things like grouping that can be done to improve the presentation of search information.

Anonymous said...

While I enjoy the freedom that comes from good search tools for my documents, emails, etc. (I use Lookout, for Outlook), I don't want any more computer intrusion in my life. What's so bad about not being able to remember whether the person you saw walking on 3rd avenue was wearing a green or brown shirt? Or why would you need to know how many of the people you've met in your life were black women vs. Hispanic men?

Nicholas Carr said something similar about Bill Gates' vision of a "digital lifestyle":

"What's revealing about Gates's vision of the future is that it is completely devoid of direct human contact. It's a geek's paradise. You get to fiddle with software all day, from the moment you get out of bed to the moment you fall back into it. We're not freed from the box; we're trapped inside it. Endlessly."

Along with all the privacy concerns you mentioned (which should not be brushed away lightly), I see no evidence that auto-recording and tagging our entire lives will lead to anyone feeling better about themselves. In fact, I can't see how it will help people at all, beyond being "Neat" and another way to avoid focusing on how to save more for retirement, devote more time to volunteer work, enjoy your job more, spend more time with your kids, get more exercise or eat healthy.

Show me the web services that help me do those things (online banking is an example of something that frees time, potentially, for other stuff) and I'll do it.

Having an endless record of all the trivia in my life (which most events on most days are) is "Neat", but nothing else.

Arnab Nandi said...

There is a lot of technology available to process information, but the important missing piece here is relations. A piece of information is many times more useful when it is displayed in relation to other information. For example, Google Desktop can point you to a random email that says, "yes, works for me", without showing you the draft document the conversation was about, or the people involved. Gmail's "conversations" concept is a domain-specific but feasible implementation of this idea.

Interesting projects in this area include MIT's Haystack and IBM's ReMail, among others.

As far as mining audio and video are concerned, the Reality Mining project has done a fair amount of work here.

Greg Linden said...

Thanks for the references, Arnab.

I agree that there is value in the relations between the data. The problem is that, if the exponential quantity of relationships needs to be manually specified by the user, it is never going to get done.

If those relationships can be extracted automatically, then we would be able to enjoy that value.

An interesting start would be to make it easy to browse related documents.

For example, after searching for an e-mail I received that mentioned "information overload", I would like to be able to easily browse all e-mails from that person, ordered by relevance to "information overload", then jump to all e-mails from that person about "Google", then browse a PDF file I read that mentioned both "Google" and "information overload".

All of that should be done automatically. I should not have to categorize, classify, or tag anything. The relationships should be exposed for me to discover, browse, and explore, all with little effort on my part.
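To make that browsing concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Item` record, the function names, and the sample data are hypothetical, and counting phrase occurrences is only a crude stand-in for real relevance ranking. It is not how Google Desktop Search works; it only shows that no manual tagging is needed for these particular jumps.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One indexed item (hypothetical record, for illustration only)."""
    kind: str    # "email" or "pdf"
    sender: str  # empty string for items that are not e-mails
    text: str


def score(item: Item, phrase: str) -> int:
    """Crude relevance: case-insensitive count of the phrase in the text."""
    return item.text.lower().count(phrase.lower())


def emails_from(items, sender, phrase):
    """All e-mails from one sender, most relevant to the phrase first."""
    mails = [i for i in items if i.kind == "email" and i.sender == sender]
    # sorted() is stable, so equally relevant e-mails keep their original order.
    return sorted(mails, key=lambda i: score(i, phrase), reverse=True)


def mentioning_all(items, kind, phrases):
    """Items of one kind that mention every phrase at least once."""
    return [i for i in items
            if i.kind == kind and all(score(i, p) > 0 for p in phrases)]


# Made-up sample data.
ITEMS = [
    Item("email", "alice", "Information overload is real. Google helps with information overload."),
    Item("email", "alice", "Lunch on Friday?"),
    Item("email", "alice", "Google announced a new desktop tool."),
    Item("email", "bob", "Information overload again."),
    Item("pdf", "", "A paper on Google and information overload."),
    Item("pdf", "", "A paper on databases."),
]

if __name__ == "__main__":
    # Browse alice's e-mails by relevance to one topic, then another,
    # then jump to a PDF that mentions both topics.
    for mail in emails_from(ITEMS, "alice", "information overload"):
        print(mail.text)
    for mail in emails_from(ITEMS, "alice", "Google"):
        print(mail.text)
    for doc in mentioning_all(ITEMS, "pdf", ["Google", "information overload"]):
        print(doc.text)
```

Each jump is just a filter plus a sort over data the desktop index already holds, such as the sender, the file type, and the text.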

In this sense, the tool becomes an assistant, almost a second brain, allowing me to rediscover old data and uncover new relationships in that data.

It seems to me that this is still low hanging fruit. Much of it could probably be done with relatively simple extensions to Google Desktop Search and without the more ambitious data collection and classification described by MyLifeBits.

Anonymous said...

Greg wrote: In this sense, the tool becomes an assistant, almost a second brain, allowing me to rediscover old data and uncover new relationships in that data.

Bingo! This is one of the biggest benefits of tool-based information retrieval. Sometimes it is just as useful, when using Ask or Vivisimo or whatever, to see the set of clusters or related suggestions or "narrow your search" suggestions, as it is to use those tools to better express your information need. Those tools are essentially a window into the relationships between and patterns among the collection you are searching.

I don't understand why you prefer personalization over tools in the web search environment (to the point of almost complete tool exclusion, it sounds like) but don't mind tools in the personal search environment. Wouldn't tools be just as useful, in the web domain, for exposing relationships "for [you] to discover, browse, and explore, all with little effort on [your] part"?

Anonymous said...

Another follow up: Google goes gaga for another tool. This time, it is exactly as I hoped. Instead of the search engine deciding for you what the most relevant aspect of your query is, this tool lets you select from a number of related topics, so that you can give the engine explicit feedback about which aspect of the query is most relevant to you.

Google might give lip service to the notion that users are too lazy, and won't ever use tools. But they just paid good money that says otherwise.