Google result counts are a meaningless metric.

You've come to this page because you've used the result count in a Google search, of any flavour, as supposed evidence to back up an argument.

This is the Frequently Given Answer to such arguments.

You can find another approach to this answer in the paper Kilgarriff 2007.

Google result counts are a meaningless metric. The count that you are pointing to proves nothing at all. Stop using this meaningless metric and make a proper argument based upon proper research instead.

The basic problem with the Google hit count reported in search results, particularly for phrases and searches using "AND" or "OR" operators, is that it is an estimate. It's not actually a count of anything, at all. It's the result of a calculation based solely upon the words that the query comprises, as Kevin Marks notes. Google explicitly states in the GSA documentation that it's an estimate, although it is coy about what that estimate is actually based upon. To quote one unnamed Google employee, "these are all estimates, and we just haven't tried that hard to make the estimates precise". A named Google employee said much the same (Feldman 2010) after this Frequently Given Answer had been around for some years.[1]

For example: When Google Web reports 17,200 results for the string "de Boyne Pollard" (as it does at the time of writing this Frequently Given Answer), it hasn't searched its entire database to count all of the pages that match that string. That would be very inefficient, considering that it only needs to find (by default) 10 matches in its database in order to return a result page, and that many people don't go beyond the first few pages (or even the first page) of results. What it has done is take the individual words "de", "Boyne", and "Pollard" and look up how frequently each occurs in the word frequency tables that the Google Web spider generates when it crawls the World Wide Web. From those three frequencies it has produced an estimate of the number of pages that probably would match.
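To make that concrete, here is a minimal sketch of an estimate built from nothing but per-word frequencies. It is not Google's actual calculation, which is not public. The function name, the table contents, the total page count, and the assumption that words occur independently of one another are all hypothetical, chosen only to show that such a number can be produced without counting a single matching page.

```python
def estimate_hits(query, doc_freq, total_docs):
    """Estimate how many pages match, using per-word frequencies alone.

    Hypothetical model: words are assumed to occur independently, so the
    fraction of pages containing all of them is the product of the
    per-word fractions. No page is ever examined or counted.
    """
    fraction = 1.0
    for word in query.lower().split():
        # A word missing from the table is treated as occurring nowhere.
        fraction *= doc_freq.get(word, 0) / total_docs
    return round(fraction * total_docs)


# Made-up figures, for illustration only.
TOTAL_DOCS = 1_000_000
DOC_FREQ = {"de": 200_000, "boyne": 5_000, "pollard": 10_000}

estimate = estimate_hits("de Boyne Pollard", DOC_FREQ, TOTAL_DOCS)
```

Whatever number comes out depends only on the table and the model, not on how many pages really contain the string.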

To demonstrate for yourself that these estimates are meaningless numbers, take a few searches and click on the "Next" button to bring up further pages of results until you reach the last page. You'll see that the actual number of results, known once you reach the last page, will almost always be nothing like the estimated number of results that appeared on all of the prior pages.

Even the actual page count isn't necessarily correct. In part this is because Google caps all queries at 1000 results, and in part it is because of several other problems that affect the Google hit count, both estimated and final.

One problem is that Google normalizes the words in queries. It strips punctuation, transforms diacritics and ligatures, combines inflections, performs "stemming", and substitutes alternative spellings. A related problem is that it also includes as results pages that don't necessarily include the matching words at all, but where those words are used in hyperlinks pointing to those pages (cf. the practice of link bombing that takes advantage of this). So it isn't always matching, or making estimations using, the actual words specified in the query in the first place.
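A rough sketch of the first part of that normalization shows how different query strings collapse into one. This is an illustration using Python's standard library, not Google's code; the function name normalize_term is hypothetical, and stemming and spelling substitution are left out entirely.

```python
import string
import unicodedata


def normalize_term(term):
    """Strip punctuation and fold diacritics and ligatures (illustrative only)."""
    # NFKD splits accented letters into base letter plus combining mark,
    # and expands compatibility ligatures such as U+FB01 into "fi".
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop the combining marks, leaving only the base letters.
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove ASCII punctuation characters.
    return no_marks.translate(str.maketrans("", "", string.punctuation))
```

After this step a search for "café!" and a search for "cafe" are the same search, so neither count says anything about the spelling that was actually typed.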

Similarly, searches are not context-sensitive, and pay no attention to syntax, parts of speech, or even punctuation and sentence divisions. (Non-phrase searches, in particular, include pages where words occur in completely unrelated pieces of text.) So the fact that, for example, Google Web finds this very page, along with several others, when looking for the nonsense string "metric the count" doesn't prove anything at all. It certainly doesn't prove that it's a phrase with any actual meaning.
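The effect is easy to reproduce with a toy matcher. The sketch below is an assumption-laden illustration, not how any real search engine is implemented: the helper names are hypothetical, and "words" are simply runs of the letters a to z.

```python
import re


def words(text):
    """Lower-case the text and split it into words, discarding punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def matches_all_words(query, page_text):
    """True if every query word occurs somewhere on the page, in any order."""
    return set(words(query)) <= set(words(page_text))


def matches_phrase(query, page_text):
    """True if the query words occur consecutively, ignoring sentence breaks."""
    q, p = words(query), words(page_text)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


page = ("Google result counts are a meaningless metric. "
        "The count that you are pointing to proves nothing at all.")

phrase_found = matches_phrase("metric the count", page)
```

Here phrase_found is True even though "metric" ends one sentence and "The count" begins the next, and matches_all_words is looser still, since it accepts the words anywhere on the page.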

And then there's the fact that the estimates are rounded to multiples of 10, or 100, or even 1000 sometimes (depending in part upon how many results per page one has asked for).
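A small sketch shows what that rounding throws away. The function name and the choice of round-half-up are assumptions made for illustration; how Google actually rounds is not documented here.

```python
def round_to_multiple(count, multiple):
    """Round a non-negative count to the nearest multiple (halves round up)."""
    return (count + multiple // 2) // multiple * multiple


# Distinct underlying figures become indistinguishable once rounded.
reported = {n: round_to_multiple(n, 100) for n in (17_151, 17_200, 17_249)}
```

All three figures are reported as 17,200, so a difference of nearly a hundred pages either way is invisible in the number shown.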

Other problems include polysemic words (words with multiple meanings, such as "bush"), and, as Lionel Beehner points out, people, places, and things that share the same names. So even if the hit counts were correct, they still wouldn't necessarily be meaningful.

These latter problems, in addition to the former problem of the "hit count" not actually being a count in the first place, cause problems for linguists who try to use the page counts and results reported by search engines as metrics. Linguists have long since discovered that hit counts are meaningless metrics, and hardly useful even for simple word and phrase frequency counts that don't involve what the words or phrases actually mean. (For just some of the literature on this, see Baroni & Ueyama 2006 and Nakov & Hearst 2005.)

These problems are not specific to Google, by any means. (It's just that people generally don't employ the result counts of other search engines in flawed attempts to prove things.) Yahoo! and Bing have exactly the same problems. For a detailed study of how these hit count estimates vary on Windows Live Search (now called Bing), for example, see the paper Thelwall 2008.

Footnotes

  1. Even though the Google search API documentation told them that the Estimated Result Count was an estimate, and even though this Frequently Given Answer existed through all that time pointing to the documentation and explaining how and why the count was estimated, search engine optimizers and linkspammers failed to comprehend the word "estimated" that was right in front of their noses, and whinged to Google for two years that this estimate should be "fixed". Finally, in November 2010 Adam Feldman of Google provided some basic Clue:

    The Estimated Result Count is just that — it's an estimate of the number of results available. Due to the nature of calculation, this estimate is not stable, and can change from request to request, mimicking the similar count on google.com.

Further reading


© Copyright 2008,2016 Jonathan de Boyne Pollard. "Moral" rights asserted.
Permission is hereby granted to copy and to distribute this web page in its original, unmodified form as long as its last modification datestamp is preserved.