But the elves were surprised when they rolled out the new update in November 2003. "The results are horrible, and we could achieve the same thing with simple randomization!" they cried. The marketing suits soon called them on the carpet. "We don't know what you algo elves think you're doing, but you should keep on doing it. Our ad revenue is up dramatically." At that point the elves realized that they were obsolete, and allowed the algorithm to continue unchecked. By the end of January it took over new sectors of ecommerce, and expanded into Western Europe. In February marketing moved AdWords into Eastern Europe, to prepare for the next update. The elves framed their PhDs and displayed them on the wall, and pretended that they were still in control. Sadly, they were addicted to the free gourmet lunches in Googleland. And that, boys and girls, is the story of how Google lost its way.
The Great Google Filter Fiasco:
Mom & Pop take it on the chin
by Daniel Brandt
December 2003 - February 2004
On or about November 17, 2003, countless ecommerce sites no longer appeared near the top of the rankings when their owners used the search terms that they considered most important. Four days later, I discovered that by adding a nonsense exclusion term, the links returned by Google shifted dramatically, and the results were very close to what these site owners had come to expect over the last few months.
A filter was in place, and it could be defeated by using one, or sometimes two, exclusion terms. If an exclusion term consists of characters that would never be found on a web page, then normally the addition of this excluded term to your usual terms will make little or no difference. Under normal circumstances, a search for callback service should return the same links as a search for callback service -qwzxwq because no sane web page has the term qwzxwq on the page.
There have been thousands of posts about the November update on various forums where webmasters trade information. When the exclusion trick was discovered on the evening of November 21, I assumed that we had only the two-day weekend to discover more about this filter. A similar trick using a hyphen between two keywords worked briefly for about three days prior to November 21, at which point it stopped working. It was this previous trick, discovered by someone else, that made me aware of the filter in the first place. When it stopped working I already knew what to look for, and was in a position to try other tricks. That's how I stumbled onto the exclusion word as a means to turn off the filter. I announced the new trick on a popular forum and other webmasters confirmed that it worked on their keywords. A day later some webmasters reported that if you use three words in your favorite search term, then you often need two different exclusion terms after them to defeat the filter.
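The exclusion trick is nothing more than string construction, and can be sketched in a few lines of Python. The function name `with_exclusions`, the second nonsense token, and the three-word example term are illustrative assumptions, not anything taken from Google or from the forum posts:

```python
def with_exclusions(search_terms, nonsense_terms):
    """Append excluded nonsense terms (each prefixed with '-') to a query."""
    excluded = " ".join("-" + term for term in nonsense_terms)
    return (search_terms + " " + excluded).strip()

# Two-word term: one exclusion term was usually enough.
print(with_exclusions("callback service", ["qwzxwq"]))
# prints: callback service -qwzxwq

# Three-word term (hypothetical example): webmasters reported that two
# different exclusion terms were often needed.
print(with_exclusions("cheap callback service", ["qwzxwq", "zxqjvk"]))
# prints: cheap callback service -qwzxwq -zxqjvk
```

The nonsense tokens only need to be strings that no real page would contain, so that excluding them should not, under normal circumstances, change the results.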
By Wednesday of the following week, it was still working. I didn't expect this, and started this Scroogle site using a script that compared the top 100 results from Google with exclusion terms, against the top 100 results for the same terms, but without the exclusion. I began recording the terms entered by visitors to the site, along with the "casualty rate" for those terms. This rate is the number of links in the unfiltered top 100 that were missing from the filtered top 100.
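The casualty rate is easy to compute once the two result lists are in hand. This is a minimal sketch of the comparison only, not the actual Scroogle script; the function name and the sample URLs are made up for illustration, and fetching the results from Google is left out:

```python
def casualty_rate(unfiltered_top, filtered_top):
    """Count links in the unfiltered results that are missing from the filtered results.

    unfiltered_top: URLs returned when the exclusion terms are added (filter defeated).
    filtered_top:   URLs returned for the same search terms without the exclusion.
    """
    return len(set(unfiltered_top) - set(filtered_top))

# Tiny hypothetical example with four results instead of 100.
unfiltered = ["a.example", "b.example", "c.example", "d.example"]
filtered = ["a.example", "x.example", "c.example", "y.example"]
print(casualty_rate(unfiltered, filtered))  # prints: 2
```

With full top-100 lists, a result of 0 would mean the filter left the search untouched, and 100 would mean none of the unfiltered links survived.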
It's confusing, that's for sure
There is no easy answer about which terms to avoid. Two-word terms are often more deadly than either of the words alone. With three-word terms there are even more variables, and just moving the terms around sometimes makes a big difference. There is some sort of initial threshold that determines whether your search terms will be subjected to the filter. Perhaps it is a probability variable. It seems that information sites, such as .edu, .org and .gov, have been exempted from the filter, either due to their domain name or because the terms used to reach those sites don't show up in the so-called "filter dictionary." Blogs have been unaffected also. The target is ecommerce.
Once your search terms are found in the dictionary (this is an oversimplification, but will do for now), then the pages returned by the search are analyzed for their "over-optimization" on those terms. The use of the terms in the title, in headlines, in links (domain, path and filename) and in anchor text attracts extra attention. Word density and incoming links may also play a role. There is evidence that the over-optimized keywords for ecommerce pages are precomputed. This would mean that even if you clean up your page, you still have to wait for the next run of this computation. All Google would have to do is store a few suspect words along with each web page from dot-com sites, according to some algorithm that calculates various optimization characteristics on that page. They can do this by crawling their own database. Such precomputation is likely because this would mean that most of the computational overhead is done off-line, and only once per page.
Each of these two thresholds -- the dictionary lookup and the page parsing for optimized terms -- is more complex than represented here. More than one layer of analysis is involved, and there are no easy answers. Some have suggested that the final level introduces a bit of randomization, solely for the purpose of keeping us all guessing.
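To make the two-threshold idea concrete, here is a toy sketch in Python of how such a filter could be organized. It is speculation in code form, not Google's algorithm: the dictionary contents, the page fields, the `min_hits` threshold, the exempt suffixes, and every function name are hypothetical.

```python
from urllib.parse import urlparse

# Everything below is hypothetical: names, data, and thresholds.
FILTER_DICTIONARY = {"callback service"}
EXEMPT_SUFFIXES = (".edu", ".org", ".gov")

def suspect_terms(page, candidate_terms, min_hits=3):
    """Offline, once per page: flag terms that show up in many prominent places."""
    fields = [page["title"], page["headline"], page["url"], page["anchor_text"]]
    flagged = set()
    for term in candidate_terms:
        words = term.lower().split()
        hits = sum(1 for field in fields
                   if all(word in field.lower() for word in words))
        if hits >= min_hits:
            flagged.add(term)
    return flagged

def filter_results(query, result_urls, suspect_index):
    """Query time: cheap lookups only, using the precomputed index."""
    if query not in FILTER_DICTIONARY:      # first threshold: dictionary lookup
        return list(result_urls)
    kept = []
    for url in result_urls:
        host = urlparse(url).hostname or ""
        exempt = host.endswith(EXEMPT_SUFFIXES)
        # Second threshold: is the query among the page's precomputed suspect terms?
        if exempt or query not in suspect_index.get(url, set()):
            kept.append(url)
    return kept

shop = {
    "title": "Callback Service - Best Callback Service",
    "headline": "Cheap callback service",
    "url": "http://callback-service.example.com/callback-service.html",
    "anchor_text": "callback service",
}
index = {shop["url"]: suspect_terms(shop, FILTER_DICTIONARY)}
results = [shop["url"], "http://telecom.example.edu/callback.html"]

print(filter_results("callback service", results, index))
print(filter_results("callback", results, index))
```

In this toy version the first search drops the over-optimized shop page and keeps the .edu page, while the second search is not in the dictionary and passes through untouched. The sketch also shows why cleaning up a page would not help immediately: the query-time step only consults the index built by the earlier offline pass.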
Why did they do it?
This filter is taking down a lot of innocent sites at the worst possible time of year, the Christmas shopping season. How can Google be so dumb? By now it has been confirmed by the vice president of engineering at Google, Wayne Rosing, that this is part of a new algorithm. In other words, it's mainly deliberate. We knew this already, because otherwise Google would have turned it off or rolled back the update. The algorithm may have produced unexpected results, and it can certainly be described as a "screw up," but it's not merely a bug.
The best scenario for what happened goes back about eight months, when Google stopped their monthly crawl of the web. Something ugly happened and Google had to throw out an entire crawl, and revert to old data. Ever since then, Google has functioned without the old-style, once-per-month calculation of PageRank. I speculated on this in a June essay, "Is Google Broken?" Today I still think this is a reasonable point of departure for understanding Google.
For the last eight months it has been easy to spam Google using keywords in the anchor text of external links. Such linking overrode PageRank so completely that strange results were showing up on the first page for very competitive searches. It used to be called "Googlebombing" when bloggers started playing with it a year earlier. But that was different. The bloggers would have fun bombing for a few weeks only. Then the next monthly crawl came along, PageRank was recomputed for the entire web, and their cute tricks were buried in the rankings. The first year of Googlebombing was mainly a consequence of the "freshbot," not the "deepbot." During the last eight months, however, the same tricks were sticking from one month to the next, as the old-style monthly crawl was discontinued. This opened the door for a lot of ecommerce spam.
The current filter appears to be a rear-guard attack on this ecommerce spam. It was ill-advised. One poster on a webmaster forum speculated that it got approved at Google due to a statistical oversight. Someone may have assumed that the rate of false positives for the algorithm was acceptable, but was computing this rate for the entire ecommerce sector. When you apply it to the actual web you discover -- too late! -- that the probability equation's false positives are horribly skewed at the top of the results. Only the first two pages of links (10 links per page) really matter much for searchers. Indeed, the hit rate for ecommerce terms is very high, even going as deep as 100 links. Many innocent mom and pop sites are getting buried, and many of the sites that remain are spammy directories. It's not working well.
Another observer felt that the entire effort was aimed at affiliate programs, which are concentrated in the travel, real estate, adult, gambling and pharmacy sectors of ecommerce. But baby products, maternity, and bridal accessories, which are often home businesses run by women, are also hit hard. Innocent site owners such as these are angry with Google. Many feel that they are being deliberately forced to bid on AdWords so as to enhance Google's profit margins in the months before filing an IPO. For its part, Google claims that the department responsible for the main index has nothing to do with the advertising side of Google. Whatever you choose to believe, and whether it was deliberate or not, the fact of the matter is that the "dictionary" terms used in the filter overlap very substantially with the terms that fetch the highest AdWords bids.
My feeling is that Google has reached the limits of fast software when it comes to separating and ranking web pages. They cannot merely slap a new algorithm onto their index at this point without butchering innocent site owners. Perhaps it's time for Google to do some real content analysis and clustering of pages. But that would mean more computational overhead, more hardware, more money, lower profits, and slower speeds.
Short of that, Google could use some sort of structured appeal process for webmasters who have been treated unfairly by new algorithms. Google won't consider this because it isn't "scalable" -- which means you can't expand it in your quest to take over the web unless you keep throwing more money and effort into it. Algorithms, on the other hand, are cool because you write them once, and copy them to 10,000 cheap computers.
Google has to do something, and they could afford to hire some ombudsmen if they wanted to. Even a contract employee on minimum wage, with a little training, can tell the difference between a spammy affiliate site and a family niche business. If Google's Ph.D.s with their clever algorithms can't do as well as temp employees, then the Ph.D.s should be replaced.
It's a mess. Google's integrity is on the line. If they keep this up, all their dreams of riches from stock options will vanish. Who's in charge at the Googleplex anyway? There isn't much time.
December 27 update: What's the verdict?
Attempts to discover consistent patterns in the evidence continue, and the filter is still a major factor that affects the ranking for search terms related to ecommerce. There is no question by now that this is a new algorithm instead of a bug. If it's some sort of self-learning algorithm, which would be the most benevolent interpretation of Google's intentions, then it also has to be admitted that it doesn't learn very quickly. For the last two weeks, the filtered results have been fairly stable.
A small number of webmasters have claimed improvements after tweaking their pages to avoid the filter. They rarely want to say exactly what they've done, but it can be assumed that certain two-word terms that seemed toxic were broken up, or synonyms used, or variations introduced by using stemmed forms of those words. It must be said, however, that any jump in rankings may have merely meant that the change was picked up by the freshbot, so that the usual freshbot boost was temporarily given to the new page independently of the filter. If this is the case, the next major update cycle could spell trouble for their pages once again. There was an apparent mild update one week ago, which provided some evidence of this.
For almost all search terms, the filter is sufficiently "fuzzy" so that it is nearly impossible to come up with specific antidotes. There are various theories about why it's so fuzzy. One theory is that Google is able to tweak the filter's sensitivity to certain broad categories of keywords. This happened during the second week in December, at the same time that Google fixed the exclusion term glitch. Many suspect that the Applied Semantics CIRCA technology, acquired by Google in April 2003, is the best explanation for the fuzziness. CIRCA is used in AdSense. It allows targeting and tuning based on meanings and concepts derived from terms used on the page, or in links pointing to the page, after distilling the context in which those terms appear.
Another theory is that Google is experimenting with regional targeting of ecommerce terms. They already have a system working that can do this with AdWords. Google is ahead of other advertisers in this area, and there is money to be made once it starts working well. For the user, it would mean that you don't have to put in the name of the city when doing an ecommerce search; you'd get steered to sites in your region automatically, as Google detects your IP address. Many advertisers will pay more for this sort of targeting.
Finally, it's possible that Google has deliberately introduced a random element into ecommerce searches. This would frustrate those who are most likely to spam Google's results, and force them into AdWords or Froogle. The fact that innocent niche ecommerce sites are caught up in this is probably regarded by Google as acceptable collateral damage.
The ecommerce sites most affected by the filter are in the areas of real estate, hotels, rentals, travel, and vacations. This in itself is suspicious, because such geolocated ecommerce is an area where Google has yet to introduce a major advertising initiative. Froogle is only for products that can be shipped, which leaves a huge market for lodging, help wanted, professional services of every type, and sales of items from autos to condos. Search terms with "real estate" in them have been partially desensitized from the effects of the filter. Of all the geolocated services, real estate websites are the best organized as a profession. They would have been the first to file a complaint with the Federal Trade Commission, or file a class action lawsuit, if they could collect solid evidence.
Google's continued silence about the new algorithm has increased the number of webmasters who feel that Google is trying to improve its profit margins in the months prior to an IPO, by forcing more ecommerce sites to bid for AdWords. If true, the FTC would expect Google to disclose this in the interests of consumer protection. If not true, everyone expected that Google would have rolled back the filter by now.
Google is in a situation where the best option is to keep quiet, and stealthily promote fear, uncertainty, and doubt. No answers, just more fuzziness. Our major corporate media, which is currently drooling over the prospect of a Google IPO that rejuvenates the entire technology sector on Wall Street, can be counted on to support a Google coverup.
February 16, 2004 update: The filter is deliberate
There was a major update during the last week in January. Some ecommerce sites in the U.S. that were hit in November reported better results. Others reported that the page tweaks that had reversed their November losses were now set back to zero. Overall, the movement among U.S. ecommerce sites failed to reveal a consistent pattern. The filter is still in place, at approximately the same level of force.
The big news in January was that the filter was extended into Western Europe. Many non-English-language pages there felt the effects of the filter for the first time. Simultaneously, Google began advertising for AdWords workers who could speak various Eastern European languages.
Meanwhile, search engine optimization pundits in the U.S. were spouting off new theories: "authority sites," "hilltop algorithm," "topic-sensitive PageRank," and so on. Anyone crude enough to suggest that Google was doing anything greedy was part of the great unwashed who are too stupid to understand new algorithms. Or better yet, they're conspiracy theorists who wear tin-foil hats. At least no one could still insist that Google was going to fix this bug and roll back the filter. (God, please save us from SEO pundits. If God won't, it looks like Google might.)
By now, I have reached two conclusions. One is that the filter is a deliberate strategy rather than a temporary aberration. Scroogle works almost as well as it ever did, using a slightly modified technique after Google patched the original technique. The results jump around a little bit, perhaps as different data centers are accessed in each subsequent search (I rotate Google's IP addresses on our scraper), but overall the "before" and "after" results are stable. The filter is here to stay.
The other conclusion is that you don't need a stack of algorithm papers by Google PhDs to figure out what's happening. Google has apparently decided to separate web sites into "information" and "commercial." Think of the white pages and the yellow pages. If you look up a company in the white pages, you get a list of departments and telephone numbers. If you use the yellow pages, you access it by keywords and you get paid ads. Google wants the same thing -- information on the left side of the screen, and ads on the right. Generic commercial keywords produce directory pages about ecommerce on the left. The ads on the right are produced by sites that outbid their competitors for the relevant search terms.
If your website is selling one or two very specific products, then it may get filtered out of the top results for those one or two specific search terms. Google appears to have a self-learning list of word associations for ecommerce. Try a search for ~widget -widget [ tilde + keyword + space + hyphen + keyword ] to see what other terms Google might associate with the term "widget." This search asks Google for synonyms for "widget," but not the word itself. The search results may come back with highlighted synonyms.
If this "information" vs. "commercial" theory is correct, then Google will reward (by neglecting to filter) ecommerce pages that include more of these synonyms, on the suspicion that these pages are more informative. Of course, this technique will discriminate against countless deserving sites that are structured a certain way. I'm not trying to excuse Google, I'm merely trying to understand what might be happening.
This might explain why directory pages that list commercial sites or categories are replacing more specific commercial pages. Google's filter considers directory pages to be generic information about ecommerce. It might also explain why real estate and travel/accommodation sites were hit so hard by the filter. Such sites are frequently geo-specific, which means "not generic" and "not information." Moreover, they are easy to identify by analyzing the content against a dictionary of place names. Along with Google's geo-targeting, which operates on both a country level and also on a marketing-region level within the U.S., anything geo-specific that is selling products or services belongs in the yellow pages. Get out your checkbooks!
Finally, it explains why noncommercial sites continue to be unaffected by the filter. Many noncommercial sites are doing better now than they have ever done in Google. This move to separate information sites from commercial sites will also make Google more competitive, and better equipped to fight spam. So far there is no evidence that either Yahoo or Microsoft is serious about crawling the noncommercial sector. If true, then this move by Google will set Google apart as a search engine. And, by the way, driving ecommerce into buying ads will make Google even richer. It's a win-win for Google, a win for nonprofits, and a big loss for mom and pop ecommerce.