Stop Word Tokenizers in Search Engines
While researching stop words in search engines, I came across a very interesting concept: Google’s approach to search as a multi-stage query process.
A quick search of the web, on Google no less, turned up a copy of their patent for this specific search process, from 2008.
It appears that even back then, Google was wrestling with the limitations of stop words, and perhaps stop words are becoming an antiquated and inflexible way of deciding what is meaningful and relevant in a search.
An interesting point, specifically about Finite State Transducers as they relate to search engines: at the most basic level of using the automaton, the “finite [set] of complex symbols” it operates on is actually derived from an index built by each search engine, and so reflects the bias inherent in that indexing, whether algorithmic or simply based on paid sponsorship.
Nonetheless, it is very interesting to see how Google’s methods apply to stop words in general, whether they are ignored or not. In fact, there appears to be an effort to look at meaningful relationships between stop words and the actual search terms, rather than simply ignoring them.
One approach is to build an “exceptional” list of word and phrase clusters that are known to be more relevant as a group, and to check queries against it. In this way, stop words with meaningful relationships are included in the search.
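The exception-list idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the patent: the stop word set, the protected phrases, and the function name filter_query are all hypothetical.

```python
# Hypothetical data for illustration; neither list comes from Google.
STOP_WORDS = {"a", "an", "be", "in", "is", "of", "or", "the", "to", "who"}
EXCEPTION_PHRASES = [
    ("the", "who"),
    ("to", "be", "or", "not", "to", "be"),
]


def filter_query(query):
    """Drop stop words, except where they are part of a protected phrase."""
    tokens = query.lower().split()
    kept = []
    i = 0
    while i < len(tokens):
        # Longest protected phrase that starts at position i, if any.
        match = max(
            (p for p in EXCEPTION_PHRASES if tuple(tokens[i:i + len(p)]) == p),
            key=len,
            default=None,
        )
        if match:
            kept.extend(match)       # keep the whole phrase, stop words included
            i += len(match)
        elif tokens[i] in STOP_WORDS:
            i += 1                   # ordinary stop word: drop it
        else:
            kept.append(tokens[i])   # ordinary search term: keep it
            i += 1
    return kept
```

With these assumed lists, filter_query("The Who in concert") keeps "the" and "who" because they form a protected phrase, while filter_query("who is the president") drops all three stop words.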
Another approach is to run the search both with and without the stop words.
The two result lists are then compared for similarity, and the most meaningful results are extracted.
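A rough sketch of the with-and-without comparison, again in Python. Everything here is an assumption made for illustration: the toy corpus, the AND-style search function, the use of Jaccard similarity as the comparison, and the rule for picking a winner. The source does not say how the results are actually compared or chosen.

```python
# Hypothetical stop words and toy corpus, for illustration only.
STOP_WORDS = {"a", "at", "in", "is", "of", "the", "to", "who"}
DOCS = {
    "d1": "the who live at leeds",
    "d2": "who is the president",
    "d3": "live concert tickets",
    "d4": "live music tonight",
}


def search(terms):
    """Toy AND search: ids of documents that contain every term."""
    if not terms:
        return set()
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if all(term in text.split() for term in terms)
    }


def jaccard(a, b):
    """Similarity of two result sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dual_search(query, threshold=0.5):
    """Run the query with and without stop words and compare the results."""
    full = query.lower().split()
    stripped = [t for t in full if t not in STOP_WORDS]
    with_stops = search(full)
    without_stops = search(stripped)
    overlap = jaccard(with_stops, without_stops)
    # Assumed rule: if the two result sets differ a lot, the stop words
    # carried meaning, so prefer the results of the full query.
    if overlap < threshold and with_stops:
        return with_stops, overlap
    return without_stops, overlap
```

For the query "the who live", the stripped query is just "live" and matches three documents, while the full query matches only d1. The low overlap signals that the stop words mattered, so the full-query result is returned.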
Some searching revealed a basic short list of stop words specific to Google, for when the methods above are not involved: (I, a, about, an, are, as, at, be, by, com, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where, who, will, with, www).
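For reference, here is a minimal sketch of what plain stop word removal with that short list might look like. The tokenizer (a simple regular expression) and the names tokenize and remove_stop_words are assumptions for illustration, not how any search engine actually tokenizes text.

```python
import re

# The short list quoted above, lowercased and stored as a set,
# so any repeated entry collapses into one.
GOOGLE_SHORT_LIST = frozenset(
    "i a about an are as at be by com for from how in is it of on or that "
    "the this to was what when where who will with www".split()
)


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(text, stop_words=GOOGLE_SHORT_LIST):
    """Return the tokens of the text that are not in the stop word list."""
    return [token for token in tokenize(text) if token not in stop_words]
```

Under these assumptions, remove_stop_words("How is the weather in www.example.com?") leaves only "weather" and "example", while a query like "The Who" is reduced to nothing at all, which is exactly the limitation the approaches above try to work around.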
There is an informative, if outdated (2009), comparison chart of how the major search engines, Bing, Yahoo, Google and Ask, approach stop words in terms of frequency. A big realization is that while the stop words themselves might be similar, the real difference lies in how each engine treats them when determining the relevance of results, based on comparing those words against its index.
Also, even now, it appears that Yahoo search is powered by Bing in the US and by Google in Japan.
Perhaps this is due to differences in how tokenizers operate on different character sets and languages?
