Learning_Bit_by_Bit

Clusters and Medical Resources (and lawyers)

Posted on April 8, 2011 at 8:41 am

In working with clusters, I was reminded of the old statistics saw that “correlation does not necessitate causation” (or something to that effect). The point, among other things, is that identifying clusters does not mean there is any underlying meaning behind them. Unsurprisingly, what sounds at first glance like a [...]

Computer Generated Suggestions – Ubiquitous, Annoying, Generally Worthless

Posted on April 1, 2011 at 10:37 am

Once I started really looking at this, I was utterly amazed at how it affects every aspect of our interaction with computers. And how it has become, in essence, more noise to be ignored than useful. Spell-Check / Autocorrect: Everywhere, every day. Occasionally useful in word processing, less useful and more annoying in Gmail, genuinely [...]

PageRank and “the Market”

Posted on March 25, 2011 at 11:02 am

I had an argument once with a colleague about the quality of writing and communication and the popularity of those communications. In short, my colleague argued that even if another colleague (a “coac” or “colleague of a colleague”) wrote a blog that only 10 other people read, it was still a good blog. My counter was that, [...]

Parts of Speech, Bayesian Analysis, Magic

Posted on March 4, 2011 at 10:56 am

Oh, spelling and grammar. My greatest educational and professional bane. My (close to) greatest source of embarrassment. Until roughly the late 19th century, spelling for most people, even the “educated classes,” was essentially a process of deciding what looked good. It wasn’t even probabilistic so much as optimistic. Note that Terry Pratchett described one character’s [...]

Ngrams, Press Releases and the SEC

Posted on February 25, 2011 at 12:37 pm

I tried a few different datasets to generate the text. The first was a set of press releases. The second was from filings made by companies with the Securities and Exchange Commission. press-releases, text-release-generated, 10q_mda, generated
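For readers who want to see the general idea, here is a minimal sketch of n-gram text generation in Python. The excerpt does not say which n-gram order or which code was actually used, so this assumes a simple bigram (2-gram) model; the function names `build_bigram_model` and `generate` and the sample corpus are hypothetical, made up for illustration.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, length, seed=None):
    """Walk the model from `start`, picking a random follower at each step."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:
            break  # dead end: this word never had a follower in the corpus
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Hypothetical stand-in for a press-release corpus.
corpus = "the company reported strong results and the company expects growth"
model = build_bigram_model(corpus)
print(generate(model, "the", 8, seed=1))
```

Because followers are stored with repeats, a word that follows another more often in the corpus is proportionally more likely to be picked, which is why the output tends to echo the style of whatever dataset it was built from.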

Stop List – Stop Tokenizer – Google Patent

Posted on February 18, 2011 at 10:41 am

“Stopwords” are those words (and potentially phrases) that search engines and search parsers filter out from the query. In my own experience, we frequently refer to them as “noise”. Typical examples include “a”, “an” and “the”. Most electronic content management systems (“ECMs”) that store large quantities of text data will remove these from the full text [...]
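The filtering step described in this excerpt can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine or ECM does it: the stop list is limited to the three examples named above, and the function name `remove_stopwords` and the tokenizing pattern are my own assumptions.

```python
import re
from typing import Iterable, List

# Illustrative stop list using only the examples from the post;
# real systems typically use much longer lists.
STOPWORDS = frozenset({"a", "an", "the"})


def remove_stopwords(query: str, stopwords: Iterable[str] = STOPWORDS) -> List[str]:
    """Lowercase the query, split it into word tokens, and drop stopwords."""
    stop = set(stopwords)
    tokens = re.findall(r"[a-z0-9']+", query.lower())
    return [token for token in tokens if token not in stop]


print(remove_stopwords("The history of an idea"))
# prints ['history', 'of', 'idea']
```

Passing the stop list as a parameter makes it easy to swap in a longer or domain-specific list without touching the tokenizer.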

Chatterbot – The Interview

Posted on February 11, 2011 at 10:09 am

I’m still playing with this, and getting the code to work as an applet embedded in a WordPress page is proving more complex than I thought. But it is working and compiling, and it was a total hoot to play with. For those, like me, wrestling with RegEx, there are several extremely good websites on it [...]

Turing vs. Searle

Posted on February 3, 2011 at 8:09 pm

I’m a little embarrassed to admit that I originally read Searle’s piece more than 25 years ago. I’ve diligently re-read it, together with re-reading Turing’s piece. And, in truth, I still think Searle is hiding the ball when it comes to the core issue. In short, Searle makes completely valid arguments through about 70% of [...]