Feb 18, 2011
Stop List – Stop Tokenizer – Google Patent
“Stopwords” are those words (and potentially phrases) that search engines and search parsers filter out of a query. In my own experience, we frequently refer to them as “noise”. Typical examples include “a”, “an” and “the”. Most electronic content management systems (“ECMs”) that store large quantities of text data will remove these from the full-text index because they take up a significant amount of space without adding much value.
When a search string is put together, the parser that takes the search string will often remove these terms.
This is usually a good thing, but not always. I find myself amazed that I have real-world experience with this. One of my responsibilities in my current job is to act as an “electronic discovery liaison”, which essentially means acting as an attorney but, in part, also crafting or helping to craft queries to gather electronic documents for discovery purposes.
In really abbreviated terms, “discovery” is the process by which one side in a legal dispute has the right to request documents and other materials from the other side as part of litigation, a regulatory inquiry or the like. Often this involves very broad requests, for example “all documents relating to home mortgages”. At one of my former employers, roughly 1 billion unique email messages were generated every year. At that scale, putting together and parsing an appropriate full-text query is a very challenging process that requires real research on the subject and, frequently, interviews with the experts.
Unfortunately, stop words can sometimes be a hindrance. For example, if one were searching for information on “Subordinated Convertible Debentures, 2008 Series A” (as opposed to Series B), removal of the letter “A” would be problematic, because the query could no longer tell the two series apart.
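To make the “Series A” problem concrete, here is a minimal sketch of a naive stop word filter in plain Java. The class name, the tiny stop list and the split-on-non-alphanumerics rule are all my own assumptions for illustration; no particular search engine or ECM is claimed to work this way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class NaiveStopFilter {
    // Hypothetical, deliberately tiny stop list (illustration only).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    /** Lower-cases the query, splits on non-alphanumerics, and drops stop words. */
    static List<String> filter(String query) {
        List<String> kept = new ArrayList<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The "a" is silently dropped, so Series A loses its distinguishing term.
        System.out.println(filter("Subordinated Convertible Debentures, 2008 Series A"));
        // The "b" survives because it is not on the stop list.
        System.out.println(filter("Subordinated Convertible Debentures, 2008 Series B"));
    }
}
```

After filtering, the Series A query is a strict subset of the Series B query, so it would match documents about either series.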
Apparently Google has recognized this issue and has, sadly, obtained a patent on its solution to it: http://www.seobythesea.com/?p=1109. In simple terms, it appears that Google runs searches both with and without the stop words, analyzes the results and determines whether the stop words were meaningful to the query.
Stop words are obviously an important search engine optimization (“SEO”, or “how do I get my page to show up first?”) issue, and apparently SEO analysts have done much work to determine how the major search engines approach it. A really interesting analysis appears here: http://www.seobythesea.com/?p=2795, which, among other things, offered a graph (not reproduced here).
Ironically, simple searches on Google, Yahoo and Bing all turned up both basic *and* huge stop word lists:
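The general idea of “search both ways and compare” can be sketched as a toy. To be clear, this is not the patented method and not Google's implementation; the in-memory document list, the match-all-terms search and every name below are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class StopWordComparison {
    // Hypothetical stop list (illustration only).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    /** Lower-cases the text and splits it on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns the indices of the documents that contain every query term. */
    static Set<Integer> search(List<String> docs, List<String> terms) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i < docs.size(); i++) {
            if (new HashSet<>(tokenize(docs.get(i))).containsAll(terms)) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** True when keeping the stop words changes which documents match. */
    static boolean stopWordsMatter(List<String> docs, String query) {
        List<String> all = tokenize(query);
        List<String> filtered = new ArrayList<>(all);
        filtered.removeAll(STOP_WORDS);
        return !search(docs, all).equals(search(docs, filtered));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The 2008 Series A debentures prospectus",
            "The 2008 Series B debentures prospectus");
        System.out.println(stopWordsMatter(docs, "2008 Series A"));       // true
        System.out.println(stopWordsMatter(docs, "the 2008 debentures")); // false
    }
}
```

In the first query the “A” narrows the results from two documents to one, so dropping it changes the answer; in the second, “the” appears in every document and removing it changes nothing.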
http://www.lextek.com/manuals/onix/stopwords1.html
http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
http://www.textfixer.com/resources/common-english-words.txt
http://www.ranks.nl/resources/stopwords.html
I opted for a relatively small stop word list and modified the sample code by Heather Dewey-Hagborg, feeding it some random text pulled from http://watchout4snakes.com/CreativityTools/RandomParagraph/RandomParagraph.aspx, which generated the following paragraph:
The birthday accommodates stop words on top of an explanatory lyric. The degenerate overlooks stop words throughout the league. The originator precedes search engines underneath an eccentric injustice. How will stop words orbit without a clock? Stop words discriminates search engines within an appearance. The rose figures stop words into each regardless obstruction.
The result set follows:
START  END  TOKEN
    4   12  |birthday|
   13   25  |accommodates|
   26   30  |stop|
   31   36  |words|
   40   43  |top|
   50   61  |explanatory|
   62   67  |lyric|
   73   83  |degenerate|
   84   93  |overlooks|
   94   98  |stop|
   99  104  |words|
  105  115  |throughout|
  120  126  |league|
  132  142  |originator|
  143  151  |precedes|
  152  158  |search|
  159  166  |engines|
  167  177  |underneath|
  181  190  |eccentric|
  191  200  |injustice|
  211  215  |stop|
  216  221  |words|
  222  227  |orbit|
  228  235  |without|
  238  243  |clock|
  245  249  |stop|
  250  255  |words|
  256  269  |discriminates|
  270  276  |search|
  277  284  |engines|
  285  291  |within|
  295  305  |appearance|
  311  315  |rose|
  316  323  |figures|
  324  328  |stop|
  329  334  |words|
  340  344  |each|
  345  355  |regardless|
  356  367  |obstruction|
And the code:
import java.util.Set;

import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
import com.aliasi.tokenizer.StopTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.CollectionUtils;
import com.lingpipe.book.tok.DisplayTokens;

public class SimpleStopTokenizer_v2 {

    /* Dave Boyhan - david.boyhan@gmail.com 2/18/11
     * A small modification of the SimpleStopTokenizer written by Heather Dewey-Hagborg
     * here --> http://itp.nyu.edu/varwiki/uploads/SimpleStopTokenizer
     * The Indo-European tokenizer will tokenize, the resulting tokens will be
     * converted to lower case, and then stop words will be removed.
     */
    public static void main(String[] args) {
        String text = "The birthday accommodates stop words on top of an explanatory lyric. "
            + "The degenerate overlooks stop words throughout the league. The originator precedes search engines underneath "
            + "an eccentric injustice. How will stop words orbit without a clock? Stop words discriminates search engines"
            + " within an appearance. The rose figures stop words into each regardless obstruction. ";

        // Note that punctuation was added to the stop list as well
        Set<String> stopSet = CollectionUtils.asSet(
            ".", "?", "a", "able", "about", "across", "after", "all", "almost", "also",
            "am", "among", "an", "and", "any", "are", "as", "at", "be", "because",
            "been", "but", "by", "can", "cannot", "could", "dear", "did", "do", "does",
            "either", "else", "ever", "every", "for", "from", "get", "got", "had", "has",
            "have", "he", "her", "hers", "him", "his", "how", "however", "i", "if",
            "in", "into", "is", "it", "its", "just", "least", "let", "like", "likely",
            "may", "me", "might", "most", "must", "my", "neither", "no", "nor", "not",
            "of", "off", "often", "on", "only", "or", "other", "our", "own", "rather",
            "said", "say", "says", "she", "should", "since", "so", "some", "than", "that",
            "the", "their", "them", "then", "there", "these", "they", "this", "tis", "to",
            "too", "twas", "us", "wants", "was", "we", "were", "what", "when", "where",
            "which", "while", "who", "whom", "why", "will", "with", "would", "yet", "you",
            "your");

        // Chain the factories: tokenize, then lower-case, then remove stop words
        TokenizerFactory f1 = IndoEuropeanTokenizerFactory.INSTANCE;
        TokenizerFactory f2 = new LowerCaseTokenizerFactory(f1);
        TokenizerFactory f3 = new StopTokenizerFactory(f2, stopSet);

        // Upon trying the EnglishStopTokenizerFactory below, I received a poorer result set
        // TokenizerFactory f3 = new EnglishStopTokenizerFactory(f2);

        DisplayTokens.displayTokens(text, f3);
    }
}

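For readers without LingPipe on the classpath, the same tokenize, lower-case, stop-filter pipeline can be approximated with nothing but the JDK. This is only a rough sketch: the regex-based tokenizer below is my own stand-in and will not split text exactly the way IndoEuropeanTokenizerFactory does, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexStopTokenizer {
    /** A token with its character offsets (end is exclusive) in the original text. */
    static final class Token {
        final int start;
        final int end;
        final String text;

        Token(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }

        @Override
        public String toString() {
            return start + " " + end + " |" + text + "|";
        }
    }

    // Assumption: a token is any run of letters or digits; punctuation is never emitted.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Tokenizes, lower-cases, and drops any token found in stopSet. */
    static List<Token> tokenize(String text, Set<String> stopSet) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String lower = m.group().toLowerCase(Locale.ROOT);
            if (!stopSet.contains(lower)) {
                tokens.add(new Token(m.start(), m.end(), lower));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("the", "on", "of", "an"));
        String text = "The birthday accommodates stop words on top of an explanatory lyric.";
        System.out.println("START END TOKEN");
        for (Token t : tokenize(text, stopSet)) {
            System.out.println(t);
        }
    }
}
```

Because the pattern never matches punctuation, there is no need to add “.” and “?” to the stop list here, unlike in the LingPipe version.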