What search engines store about you
And what you can do about it.
By Mary Brandel, Computerworld | Computerworld UK | Published: 01:00, 20 July 2007
What if there were a giant database that contained your hidden insecurities, embarrassing medical questions, and the fact that you still think from time to time about your high-school romance? Well, such a data store does exist -- if you've ever plugged such private topics into a search engine.
The fact is, search engines like Google, Yahoo, and Microsoft Live Search all record and retain in their vast data banks every term you query, along with the date and time your query was processed, the IP address of your computer, and a cookie-based unique ID. Unless you delete that cookie, it lets the search engine recognise requests coming from that particular computer, even if the connection changes.
Microsoft Live Search also records the type of search you conducted (image, web, local, etc.), while Google additionally stores your browser type and language. And when you click on a link displayed on Google, that may also be recorded and associated with your computer's IP address.
While Google recently announced that it would make its search logs anonymous after 18 months' time by deleting part of the IP address and obfuscating cookies associated with search queries, Microsoft and Yahoo haven't yet made their retention policies public. AOL stores this data for just one month.
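To make "deleting part of the IP address" concrete, here is a minimal sketch of one common way to do it: zeroing the low-order bits of an IPv4 address. The article does not say which part of the address Google removes, so the function name `anonymize_ipv4` and the choice to keep the first 24 bits are assumptions made purely for illustration.

```python
import ipaddress


def anonymize_ipv4(ip: str, keep_bits: int = 24) -> str:
    """Zero out the low-order bits of an IPv4 address.

    keep_bits=24 blanks the final octet; this is an assumed value,
    not a documented policy of any search engine.
    """
    # strict=False lets us pass a host address and have the host bits masked off.
    network = ipaddress.IPv4Network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    # 203.0.113.0/24 is a range reserved for documentation examples.
    print(anonymize_ipv4("203.0.113.77"))
```

With the last octet removed, a logged address can no longer be distinguished from up to 255 others in the same block, although it still points to the same network.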
The upshot: If someone were to ask one of these search engine companies to produce a list of IP addresses or cookie values that searched on a particular search term, they conceivably could. Conversely, given an IP address or cookie value, the search engine firm could produce a list of terms searched by the user of that address or cookie value.
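The two lookups described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout follows the fields the article lists (query, date and time, IP address, cookie ID), but the names `SearchLogEntry`, `who_searched`, and `searches_by`, and the sample data, are hypothetical and do not reflect how any search engine actually stores its logs.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SearchLogEntry:
    """One hypothetical log record with the fields the article lists."""
    query: str
    timestamp: datetime
    ip: str
    cookie_id: str


def who_searched(log, term):
    """Return the (IP, cookie) pairs whose queries contain the term."""
    term = term.lower()
    return {(e.ip, e.cookie_id) for e in log if term in e.query.lower()}


def searches_by(log, ip=None, cookie_id=None):
    """Return every query logged for a given IP address or cookie ID."""
    return [
        e.query
        for e in log
        if (ip is not None and e.ip == ip)
        or (cookie_id is not None and e.cookie_id == cookie_id)
    ]


# Invented sample data; the IP addresses come from documentation-only ranges.
SAMPLE_LOG = [
    SearchLogEntry("knee pain remedies", datetime(2007, 7, 1, 9, 0), "198.51.100.7", "cookie-A"),
    SearchLogEntry("houses for sale", datetime(2007, 7, 2, 20, 15), "203.0.113.20", "cookie-A"),
    SearchLogEntry("knee pain specialist", datetime(2007, 7, 3, 8, 30), "192.0.2.44", "cookie-B"),
]

if __name__ == "__main__":
    print(who_searched(SAMPLE_LOG, "knee pain"))
    print(searches_by(SAMPLE_LOG, cookie_id="cookie-A"))
```

Note that "cookie-A" appears under two different IP addresses in the sample: that is the cookie doing its job of tying searches to one computer even when the connection changes.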
Don't worry; be happy
Some people say there's not much to worry about because the server logs don't associate these search terms with personally identifiable information, such as your name or email address. However, if you have an account with or have registered for any of the additional services on a search engine site -- email, social networks, calendars, shopping lists -- it's feasible that that connection could be made, says Brad Templeton, chairman of the board at the Electronic Frontier Foundation, a group that protects liberties and privacy in cyberspace.
In the case of Microsoft and Yahoo, that information can be extensive because of how much personal information these search engine firms ask for on their account registration forms, including your occupation, job title and marital status, and the number of children in your household.
According to Whitney Burk, PR manager at Microsoft, "There is no systematic way of identifying, isolating, or cross-referencing search data with personally identifiable information." Google also says it stores the two types of information separately. However, according to Templeton, "it would be very difficult to make it impossible for someone to make that correlation."
Templeton emphasises that he doesn't know exactly how any of the search engine systems are designed, but -- given typical designs -- there are many different ways that someone with the right access and knowledge could make a retroactive correlation between search terms and personally identifiable information. Considering that search terms can reveal personal information that ranges from medical prescriptions to religious beliefs and political preferences, that's not an association many of us would be happy to see.
Even if you didn't provide any personal information, an IP address alone could be traced back through a reverse DNS lookup to the Internet service provider and city of the IP address, according to Danny Sullivan, editor-in-chief of Search Engine Land, a blog dedicated to search news. Contacting the ISP could result in a positive identification of the account holder by finding out which account accessed the search engine at the time recorded in the search log.
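For readers curious what a reverse DNS lookup looks like in practice, the sketch below uses Python's standard `socket.gethostbyaddr` call. The wrapper name `reverse_dns` is hypothetical. The hostname that comes back, when there is one, often contains the ISP's domain and sometimes a hint of location, but that depends entirely on how the ISP names its machines; many addresses have no reverse record at all.

```python
import socket


def reverse_dns(ip: str):
    """Return the hostname registered for an IP address, or None if there is none."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # Covers socket.herror and socket.gaierror: no record, or lookup failed.
        return None


if __name__ == "__main__":
    # Replace with any address you are entitled to look up; results vary by network.
    print(reverse_dns("127.0.0.1"))
```

The lookup alone does not name a person. As the article notes, that step requires the ISP to match the address and timestamp against its own account records.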
Last year, reporters at The New York Times didn't even need an IP address to track down the identity of an AOL user when AOL published anonymous search logs of 500,000 users over a three-month period. The identification was made possible simply based on the specificity of the search terms the user queried, such as real estate searches in the small town where she lived. (If you have any question as to what collected search terms reveal about an individual -- accurate or not -- check out those AOL search logs.)
Hello, George Orwell?
If all this sounds Big Brother-ish to you, you're not alone. Individual users, consumer interest groups, government regulatory committees around the world, and privacy groups are growing increasingly worried about how much personal data search engine firms retain and what they could do -- or be forced to do -- with this information.
In recent months, Google seemingly hasn't been able to make a move without drawing speculation and suspicion about its ability to construct personal portraits of user behaviour.
Several consumer interest groups have filed a complaint with the US Federal Trade Commission regarding Google's acquisition of DoubleClick. The groups claim it would give Google unprecedented insight into consumer behaviour because it could track both people's Internet searches and their website visits. And when Google released its History feature, which associates individuals' search and page visitation histories with their account information, some observers, such as veteran blogger Anil Dash, called it both "brilliant" and "scary."
"With the release of Web History, especially in the context of its recent acquisitions and announcements, Google may have crossed the line where regular users start to react with skepticism and caution instead of unabashed enthusiasm," Dash says in his blog.
The creepiness factor surrounding online search data was also upped by the revelation in early 2006 that the US Justice Department had subpoenaed Google, Yahoo, Microsoft, and AOL to turn over a random list of web queries conducted over the course of a week, divorced from the names of those submitting them.
AOL, Microsoft, and Yahoo turned over some of the requested information, but Google resisted. Although Google set a good precedent by doing so, "Governments can and will do things that companies have to comply with," says Chris Sherman, executive editor of Search Engine Land, referring to the Chinese government pressuring Yahoo to turn over the name of a user who posted to an online forum.
"The US government is a growing concern because over the last several years, it's been expanding its power to ask for such information, especially as the political climate has changed," Templeton says. For instance, he points out, warrants have been easier to get since the USA Patriot Act was enacted. "It's something we should worry about more than in the past," he says.
Of course, some in the government are trying for more, not less, protection for online data. US congressman Edward Markey introduced a bill in early 2006 (H.R. 4731) to require owners of websites -- not just search engine firms -- to destroy obsolete data containing personal consumer information.
Putting aside the government, there are other ways for private data to be revealed, Templeton says, in the form of internal employees. "Everyone knows the history of most large database [breaches] comes down to a story of corrupt employees who sell access to private records," he says. "In the private investigation world, they can use bribes to get people's tax returns."
Learning from history
All of this raises the question: Why do search firms store all this data? Google offers three reasons: It can help the company improve its services, maintain security and prevent abuse by looking for patterns indicating fraudulent activity, and comply with legal obligations to retain data. The company asserts that it can use this information to determine how often users are satisfied with the first result of a query and how often they proceed to later results. Or it can determine how many times an advertisement is clicked in order to calculate how much the advertiser should be charged.
In his blog, Sullivan is more direct. "Google is big on personalisation," he writes. "Big, big, big. For Google, getting up close and personal with individuals is seen as a big leap forward on many fronts -- and 2007 is the year Google is going all out after it."
The more Google can know about you, Sullivan explains, the more it believes it can deliver you a better experience, not to mention more targeted ads. "But in particular," he says, "personalisation is seen as the next generational step in delivering better search results."
But Templeton questions whether the search firms need to store as much information as they do and for as long as they do. "We regularly advise Google that they're keeping too much information," he says. While some people, such as Sullivan, applaud Google's move to limit the amount of time it retains search logs, saying that will make it nearly impossible to trace any query back to a particular computer, Templeton thinks total destruction of the data would be far better.
"History is full of incidents where people thought they could anonymise or destroy data, but people find a way to recover it," he says. "You have to be more thorough in their destruction -- you have to eventually destroy the connection between the IP address and the searches."
Plus, Google's move may or may not address backup data, which Sullivan notes is not as easily accessible or altered. And the 18-month window also doesn't address data stored via its Web History feature. That information, Sullivan says, is not being destroyed or anonymised over time. "If you want it wiped out," he says on his blog, "Google says you have to do that separately." On the positive side, Web History lets users know exactly what data pertaining to them is stored, and they can take it into their own hands to delete those histories at any time. The same is true of Yahoo's MyWeb feature, which stores user searches if it's been switched on.
With all the focus on user privacy, the search engine firms say they are taking measures to increase the anonymity of users. Google has said it will build privacy protections into its non-search products, including Google Talk's "off the record" feature, as well as Google Desktop's "pause" and "lock search" controls. It has also said it will provide easy-to-understand privacy policies for users on its website.
For its part, Microsoft says it's actively engaged with data protection authorities around the world on what user information is collected and for what purposes as well as policies around notice and consent. "There is no universal consensus on the 'right' policies," Burk says. "However, we will continue to be active with privacy advocates and authorities as these decisions are made."
In the end, Sherman says, it's up to individual users to decide whether they trust search engine firms with their personal information. "It's something everyone has to decide -- at what level am I comfortable with the reality of improving my search results vs. my identity being connected with the types of queries I do?" he says. And he's careful to note that the question extends way beyond Google, which in his eyes takes strong measures to secure user data. "When you go to Google, you can't get anywhere near their datacentres," he says. "There are levels of security in the company where few people are cleared to get into areas where people can see personally identifiable information."
But to EFF's Templeton, that's not enough. "Even when people try to do a good job, things happen, and data still gets out," he says. "If it's collected and in a place that can be accessed, it can get out."
Furthermore, Sherman points out, it's not just search engines that store personal data. "Your ISP knows more about you than any search engine -- not just what you're searching on but every website you've visited," he says. On his blog Sullivan adds, "Google may be anonymising its records, but your ISP might not be."