Stop List – Stop Tokenizer – Google Patent

“Stopwords” are those words (and potentially phrases) that search engines and search parsers filter out of a query. In my own experience, we frequently refer to them as “noise”. Typical examples include “a”, “an” and “the”. Most electronic content management systems (“ECMs”) that store large quantities of text data will remove these from the full-text index because they take up a significant amount of space without adding much value.

When a search string is put together, the parser that takes the search string will often remove these terms.
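To make that concrete, here is a minimal sketch of the kind of filtering a query parser might do. The class name `QueryStopFilter`, the tiny stop list and the whitespace splitting are all my own assumptions for illustration; a real parser will be considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not taken from any particular search engine or ECM.
final class QueryStopFilter {
    // Tiny illustrative stop list; real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "on", "to"));

    // Split the query on whitespace, lower-case each term, and drop stop words.
    static List<String> filter(String query) {
        List<String> kept = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            String normalized = term.toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty() && !STOP_WORDS.contains(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [history, home, mortgage]
        System.out.println(filter("the history of the home mortgage"));
    }
}
```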

This is usually a good thing, but not always. I find myself amazed that I have real-world experience with this. One of my responsibilities in my current job is to act as an “electronic discovery liaison”, which essentially means acting as an attorney but also, in part, crafting or helping to craft queries to gather electronic documents for discovery purposes.

In really abbreviated terms, “discovery” is the process by which one side in a legal dispute has the right to request documents and other materials from the other side as part of litigation, a regulatory inquiry or the like. Often this involves very broad requests, for example, “all documents relating to home mortgages”. At one of my former employers, roughly 1 billion unique email messages were generated every year. This means that putting together and parsing an appropriate full-text query is a very challenging process that requires real research on the subject and, frequently, interviews with the experts.

Unfortunately, sometimes stop words can be a hindrance. For example, if one were searching for information on “Subordinated Convertible Debentures, 2008 Series A” (as opposed to Series B), removal of the letter “A” would be problematic.
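A small sketch of the problem, under some loud assumptions: the class `SeriesAProblem`, the three-word stop list, and the "document matches if it contains every query term" rule are all invented for illustration, and the document is indexed with all of its words intact. With the stop word stripped, the Series A query happily matches a Series B document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not how any real ECM matches documents.
final class SeriesAProblem {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Lower-case, split on anything that is not a letter or digit,
    // and optionally drop stop words.
    static List<String> terms(String text, boolean removeStopWords) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.isEmpty()) continue;
            if (removeStopWords && STOP_WORDS.contains(t)) continue;
            out.add(t);
        }
        return out;
    }

    // Assumed matching rule: a document matches when it contains every query term.
    static boolean matches(String document, List<String> queryTerms) {
        return new HashSet<>(terms(document, false)).containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String query = "Subordinated Convertible Debentures, 2008 Series A";
        String seriesB = "Prospectus: Subordinated Convertible Debentures, 2008 Series B";
        System.out.println(matches(seriesB, terms(query, true)));   // true: "a" was stripped
        System.out.println(matches(seriesB, terms(query, false)));  // false: "a" was kept
    }
}
```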

Apparently Google has recognized this issue and has, sadly, obtained a patent on its solution to it: http://www.seobythesea.com/?p=1109. In simple terms, it appears that Google runs searches both with and without stop words, analyzes the results and determines whether the stop words were effective.
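To be clear, I have no idea how Google actually implements this, and the following is not the patented method. It is only a toy reading of "run the search both ways and compare", with an invented class (`DualSearch`), an invented in-memory corpus and an invented decision rule: keep the stop words if doing so still returns something but returns fewer documents than the stripped query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy sketch; the decision rule below is an assumption, not Google's.
final class DualSearch {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> terms(String text, boolean removeStopWords) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.isEmpty() || (removeStopWords && STOP_WORDS.contains(t))) continue;
            out.add(t);
        }
        return out;
    }

    // Return every document that contains all of the query terms.
    static List<String> search(List<String> corpus, List<String> queryTerms) {
        List<String> hits = new ArrayList<>();
        for (String doc : corpus) {
            if (new HashSet<>(terms(doc, false)).containsAll(queryTerms)) hits.add(doc);
        }
        return hits;
    }

    // Hypothetical rule: stop words "matter" when keeping them narrows the
    // result set without emptying it.
    static boolean stopWordsMatter(List<String> corpus, String query) {
        int with = search(corpus, terms(query, false)).size();
        int without = search(corpus, terms(query, true)).size();
        return with > 0 && with < without;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "Debentures 2008 Series A offering",
                "Debentures 2008 Series B offering",
                "Notes on the 2008 debentures");
        System.out.println(stopWordsMatter(corpus, "debentures 2008 series a")); // true
        System.out.println(stopWordsMatter(corpus, "an offering"));              // false
    }
}
```

A rule this crude will misfire often (a query like "the 2008 debentures" also narrows the results here), which is presumably why the real analysis is worth a patent.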

Image: most common words chart, from http://www.seobythesea.com/?p=2795

Stop words are obviously an important search engine optimization (“SEO”, or “how do I get my page to show up first?”) issue, and apparently SEO analysts have done much work to determine how the major search engines approach it. A really interesting analysis is here: http://www.seobythesea.com/?p=2795. Among other things, it offers the graph shown above.

Ironically, a simple search of Google, Yahoo and Bing turned up both basic *and* huge stop word lists:

http://www.lextek.com/manuals/onix/stopwords1.html

http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/

http://www.textfixer.com/resources/common-english-words.txt

http://www.ranks.nl/resources/stopwords.html

I opted for a relatively small stop word list and modified the sample code by Heather Dewey-Hagborg, feeding it some random text pulled from http://watchout4snakes.com/CreativityTools/RandomParagraph/RandomParagraph.aspx, which generated the following paragraph:

The birthday accommodates stop words on top of an explanatory lyric. The degenerate overlooks stop words throughout the league. The originator precedes search engines underneath an eccentric injustice. How will stop words orbit without a clock? Stop words discriminates search engines within an appearance. The rose figures stop words into each regardless obstruction.

The result set follows:

START   END TOKEN
4    12  |birthday|
13    25  |accommodates|
26    30  |stop|
31    36  |words|
40    43  |top|
50    61  |explanatory|
62    67  |lyric|
73    83  |degenerate|
84    93  |overlooks|
94    98  |stop|
99   104  |words|
105   115  |throughout|
120   126  |league|
132   142  |originator|
143   151  |precedes|
152   158  |search|
159   166  |engines|
167   177  |underneath|
181   190  |eccentric|
191   200  |injustice|
211   215  |stop|
216   221  |words|
222   227  |orbit|
228   235  |without|
238   243  |clock|
245   249  |stop|
250   255  |words|
256   269  |discriminates|
270   276  |search|
277   284  |engines|
285   291  |within|
295   305  |appearance|
311   315  |rose|
316   323  |figures|
324   328  |stop|
329   334  |words|
340   344  |each|
345   355  |regardless|
356   367  |obstruction|
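The START and END columns above are character offsets into the input string, with END pointing one past the token's last character. As a rough approximation using only the standard library (this is not LingPipe's tokenizer, and the class `OffsetTokens`, the letters-only regex and the four-word stop list are my own assumptions), the first sentence can be reproduced like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stdlib-only approximation of the table above; not the LingPipe implementation.
final class OffsetTokens {
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "on", "of", "an"));
    // Assumption: a token is a run of letters, so punctuation never becomes a token.
    static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Build one "START END |token|" row per surviving token.
    static List<String> rows(String text) {
        List<String> rows = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String token = m.group().toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(token)) continue;
            // start() is inclusive, end() is exclusive, matching the table.
            rows.add(m.start() + " " + m.end() + " |" + token + "|");
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "The birthday accommodates stop words on top of an explanatory lyric.";
        for (String row : rows(text)) {
            System.out.println(row); // first row: 4 12 |birthday|
        }
    }
}
```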

And the code:

import java.util.Set;

import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
import com.aliasi.tokenizer.StopTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.CollectionUtils;
import com.lingpipe.book.tok.DisplayTokens;

public class SimpleStopTokenizer_v2 {
    /*
     * Dave Boyhan - david.boyhan@gmail.com 2/18/11
     * A small modification of the SimpleStopTokenizer written by Heather Dewey-Hagborg
     * here --> http://itp.nyu.edu/varwiki/uploads/SimpleStopTokenizer
     * The Indo-European tokenizer will tokenize, the resulting tokens will be converted
     * to lower case, and then stop words will be removed.
     */

    public static void main(String[] args) {
        String text = "The birthday accommodates stop words on top of an explanatory lyric. " +
            "The degenerate overlooks stop words throughout the league. The originator precedes search engines underneath " +
            "an eccentric injustice. How will stop words orbit without a clock? Stop words discriminates search engines" +
            " within an appearance. The rose figures stop words into each regardless obstruction. ";

        // Note that punctuation was added to the stop list as well
        Set<String> stopSet = CollectionUtils.asSet(".", "?", "a", "able", "about", "across", "after", "all", "almost", "also", "am",
            "among", "an", "and", "any", "are", "as", "at", "be", "because", "been", "but", "by", "can", "cannot",
            "could", "dear", "did", "do", "does", "either", "else", "ever", "every", "for", "from", "get", "got", "had",
            "has", "have", "he", "her", "hers", "him", "his", "how", "however", "i", "if", "in", "into", "is", "it",
            "its", "just", "least", "let", "like", "likely", "may", "me", "might", "most", "must", "my", "neither",
            "no", "nor", "not", "of", "off", "often", "on", "only", "or", "other", "our", "own", "rather", "said",
            "say", "says", "she", "should", "since", "so", "some", "than", "that", "the", "their", "them", "then",
            "there", "these", "they", "this", "tis", "to", "too", "twas", "us", "wants", "was", "we", "were", "what",
            "when", "where", "which", "while", "who", "whom", "why", "will", "with", "would", "yet", "you", "your");

        // Chain the factories: tokenize, then lower-case, then remove stop words
        TokenizerFactory f1 = IndoEuropeanTokenizerFactory.INSTANCE;
        TokenizerFactory f2 = new LowerCaseTokenizerFactory(f1);
        TokenizerFactory f3 = new StopTokenizerFactory(f2, stopSet);
        // Upon trying the EnglishStopTokenizerFactory below, I received a poorer result set
        // TokenizerFactory f3 = new EnglishStopTokenizerFactory(f2);

        DisplayTokens.displayTokens(text, f3);
    }

}