Query Logs and Privacy
Wednesday, May 9th, 2007
Should search engines retain a record of search queries? What benefits or harms flow from retaining that data? Should academic researchers be able to get access to “query log” data from search companies? What kinds of research can be done with this data? And, critically, what about the privacy of search engine users?
All of these questions were debated and discussed in a workshop yesterday at the WWW 2007 conference entitled “Query Log Analysis: Social and Technological Challenges.” WWW is the leading annual academic conference focused on the Web and the Internet. This year the conference is in Banff in the Canadian Rockies (making staying indoors for the sessions quite a challenge).
The Query Log workshop addressed a fascinating set of issues, the foremost of which is the significant privacy risk raised by the retention (or distribution) of logs of search terms by sites such as Google, MSN, Yahoo, and Ask. Because WWW is an academic conference, there was much attention to the plight of researchers outside the search companies. Researchers are frustrated that they have little or no access to real data: the actual queries entered into search engines.
The companies are hesitant to disclose search data, both out of concern about compromising trade secrets about how they execute and track searches and because of the backlash over the August 2006 incident in which AOL released millions of search terms from about 650,000 users. Although AOL replaced user IDs with pseudonyms, it was relatively easy to identify some individual people from their search terms. There was, appropriately, a huge uproar about the harm to privacy, and AOL quickly took the data down.
Although the release of the data was clearly a mistake, AOL’s intentions were in fact honorable: AOL was trying to give academic researchers access to actual search data. And ironically, the AOL data release did allow researchers to analyze core issues about privacy. The data included, for example, Social Security and credit card numbers (which raise privacy concerns by themselves), and researchers were able to document how privacy could be breached by aggregating an individual’s searches.


