Stop Word Sets – Tokenization – Search
A “stop list” or “stop word list” is essentially a low-level gateway through which initial tokenization takes place. When thinking about its importance, it’s hard not to consider, on a more general level, the increasingly competitive realm of search technology and how its respective players approach this space differently (or not), including their use of “stop lists” and other key points of search mediation.
We first have to understand that the body of information being searched is in no way finite or perfect. All the available information on the internet is a dynamically changing structure: a body or set that changes constantly, not only through additional information but through context and various cultural filters that are themselves changing. Any effective search method will therefore have to be adaptive and evolve in a number of ways.
That said, it would appear that the competitive market for search technology is a great space for variation, in which each player might take advantage of diverse market demands through specialization. This can be seen in some cases with services like Wolfram Alpha and Ask.com, though more often we see competitive search models that veer towards homogenization. This could be seen as stemming directly from users comparing one service against another: when a user notices that one engine can find something more easily than another, the general assumption is that it can find everything more easily than the other. Perhaps when ad sales hang in the balance we encounter such things as one search engine, like Microsoft’s Bing.com, directly copying the results of another, like Google.com. Until recently such occurrences have been rare at best, and they are presumably the result of some hard-fisted executive demanding competitive results at whatever the cost. When this happens, the question of one service’s approach as opposed to another’s, including their specific methods of stop tokenization, becomes moot. The system and market become discreetly hierarchical, with one dominant presence defining the validity of the others below it.
On the other hand, specialized approaches to tokenization are hugely important, especially with regard to personalization services in which a system learns specifically from the patterns of a single user, custom tailoring the relevancy of search results to that individual. This does, however, bring forth another problem: the possible constraining of an individual’s search experience, in which they are limited to a stylized subset of information.
Generally speaking, this kind of filtering is a good thing, in that the growing body of searchable information overwhelms our ability to parse it. We require filters, even adaptive personalized ones, to return information that is relevant to us, but without unnecessarily constraining true discovery. While current search technology is impressive in its ability to return even faintly related strings of information, it is always related information in some proximal sense. What is missing is the ability to discover something that has no apparent logical or referential connection to the user but that is rewarding or stimulating all the same. This is the major disconnect with recommendation engines, and there are a lot of different approaches to bridging it.
A mildly successful project called ipexplore.com (a collaborative effort between myself and Andrew Childs) approached this idea in much the same way that one might think of sequentially “flipping” through channels on late night television… perhaps stopping unexpectedly on some public access oddity. We saw the IP structure of the net as a root-level, unbiased door to things that we wouldn’t even know how to search for. By randomizing IP addresses within the actively used range we found such things as http://98.130.162.140/ (what appears to be some kind of Korean gaming site) and http://98.130.162.26/ (a Zambian risk prevention company; read: guns for hire who know how to use computers). We soon realized just how much of the IP band was dead space: large swaths owned by government agencies, with an Apache server login now and then. That is perhaps why finding something… anything in the dark abyss of the IP space felt more significant than a conventional search. The finds were context free: true unknowns, without any connection to anything.
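For the curious, the random-IP idea can be roughed out in a few lines of Python. This is only a sketch and not the ipexplore.com code: the helper names random_public_ipv4 and candidate_url are made up for illustration, and "actively used range" is approximated here as "globally routable and not multicast", which is an assumption on my part.

```python
import ipaddress
import random

def random_public_ipv4(rng=random):
    """Return a random IPv4 address outside the private/reserved ranges.

    Hypothetical helper; rng is anything with a getrandbits() method.
    """
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        # Keep only globally routable addresses and skip multicast
        if addr.is_global and not addr.is_multicast:
            return addr

def candidate_url(addr):
    """Build a bare-IP URL of the kind quoted in the text."""
    return "http://{}/".format(addr)

# Print one candidate to try in a browser (no request is made here)
print(candidate_url(random_public_ipv4()))
```

Most candidates produced this way will simply not answer, which matches the dead space described above.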
In my opinion the real problems posed for the future of search are that we often don’t know what it is that we want when we search, and that when we do know what we want we seldom know how to ask the right questions. Expert systems that can learn a user’s individual shortcomings in articulating a query, perhaps through machine learning rather than simply labeling each user with a tailored genre, could someday bridge that gap. In other words, a system that learns dialogically on a personal and mass level… always adapting to the crowd and the source, seeing over time that they are one and the same.
Assignment – To write a simple Stop Tokenizer:
While my intention was to use the LingPipe installation through the Eclipse IDE, a serious meltdown involving resources and file permissions made that more difficult. Three rebuilds later I have a working install, but I resorted to a proof of concept in Python during the downtime. I ran a simple word frequency count against the entirety of “War and Peace”, took the top candidate words to build my stop list, then applied that list as a set against an input string: in this case, the first paragraph of the same text.
My stop list:
the and to of a he in his that was with had it her not him at i but as on you as are for she is said all from by be were what they who this one which have so dont an up them or when did been there their no would now only if me are out my could will do about into how we then
Derived from a simple word frequency script:
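# WordFrequency - Common words
import re

filename = 'WarAndPeace.txt'

# Read the whole text, lowercase it and split on whitespace
with open(filename) as f:
    word_list = re.split(r'\s+', f.read().lower())
print('Words in text:', len(word_list))

freq_dic = {}
punctuation = re.compile(r'[.?!,":;]')
for word in word_list:
    # Strip punctuation before counting
    word = punctuation.sub("", word)
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1
print('Unique words:', len(freq_dic))

# Sort by count, most frequent first, so stop word candidates come out on top
freq_list = sorted(freq_dic.items(), key=lambda item: item[1], reverse=True)
for word, freq in freq_list:
    print(word, freq)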
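Cutting the counts down to the top candidates could look something like the sketch below. It uses collections.Counter from the standard library instead of a hand-built dictionary, and the helper name top_words and the sample sentence are invented for illustration; how many words to keep is a judgment call.

```python
import re
from collections import Counter

def top_words(text, n):
    """Return the n most frequent lowercase words in text (hypothetical helper)."""
    # Same small punctuation set as the frequency script
    punctuation = re.compile(r'[.?!,":;]')
    words = punctuation.sub("", text.lower()).split()
    return [word for word, _ in Counter(words).most_common(n)]

# Made-up sample sentence, just to show the shape of the result
sample = "The prince and the princess and the count."
print(top_words(sample, 2))  # ['the', 'and']
```

Run against the full text with a larger n, the result is the raw material for a stop list; it still deserves a read-through by hand before use.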
My input string:
“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist I really believe he is Antichrist I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you sit down and tell me all the news.”
Sets intersected – Output as difference against the stop set:
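# WaR'n'PeAce::..Stop Set - Alex Dodge - ITP - Bit by Bit 2011
import sys
import re

# Note: inside the class, ':-;' is a range covering ':' and ';', so '-' itself is not stripped
punctuation = re.compile(r'[.?!,"\':-;@$%^&*()<>|\/]')

# Delimit and convert text to a Word Set
def stringSet(txt):
    words = set()
    for line in txt:
        line = line.strip()
        line = line.lower()
        line = punctuation.sub("", line)
        # Collect words from every line, not just the last one read
        words.update(line.split())
    return words

# Usage: first argument is the input text file, second is the stop list file
with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    t1 = stringSet(f1)
    t2 = stringSet(f2)

# Input Set Against Stop Word Set (a set difference, so word order is not kept)
remaining = t1.difference(t2)

# Put it all back together
output = " ".join(remaining)
print("")
print(output)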
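Since the heading mentions intersection while the script calls difference, a toy example may help keep the two apart. The word sets here are made up for illustration and are not taken from the actual files.

```python
# Made-up miniature sets standing in for the input text and the stop list
input_words = {"prince", "and", "war", "the"}
stop_words = {"and", "the", "of"}

# Intersection: words shared with the stop set, i.e. the ones being dropped
dropped = input_words & stop_words
# Difference: what is left after subtracting the stop set, i.e. the tokens kept
kept = input_words - stop_words

print(sorted(dropped))  # ['and', 'the']
print(sorted(kept))     # ['prince', 'war']
```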
And finally the output, in theory the tokens:
perpetrated infamies family yourself horrors antichrist still frightened prince try faithful defend call really tell friend more slave means warn nothing news believe me down longer just sit well see war estates genoa buonapartes lucca
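Because the script works through Python sets, the words come back in arbitrary order and without repeats. If reading order matters, one option is to filter a list against the stop set instead. The sketch below assumes a smaller punctuation set than the script uses, and only a handful of entries from my stop list; remove_stop_words is a hypothetical helper name.

```python
import re

def remove_stop_words(text, stop_words):
    """Drop stop words but keep the remaining tokens in reading order."""
    # Strip a small punctuation set (an assumption; extend as needed)
    cleaned = re.sub(r'[.?!,"\':;]', "", text.lower())
    return [word for word in cleaned.split() if word not in stop_words]

# A few entries taken from the stop list above
stop_words = {"so", "and", "are", "now", "of", "the"}
sentence = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes."
print(" ".join(remove_stop_words(sentence, stop_words)))
# well prince genoa lucca just family estates buonapartes
```

The stop list itself stays a set, so each membership check is still fast; only the input keeps its list form.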