Data

With physical computing we were making the point that more of the expression of your body and its environment needs to be digitized. We set about putting sensors, cameras and actuators in unexpected places to turn our expression into a bunch of numbers that computers can make use of, namely data. Besides all your crazy new sensors putting out data, the increasing part of your experience that passes through laptop and smartphone keyboards is already leaving behind a wide trail of easy-to-manipulate, pre-digitized artifacts of your conscious and unconscious expression. The companies, venues and countries that you come near are already pretty good at provoking you to generate such data and then gathering and analyzing it. Politically it seems like if you want to dominate the world, or at least not be dominated, you should learn about data analysis. But artistically this is a tool with new capabilities to reveal things about yourself and about groups. The computer lends us a prosthesis for seeing smaller and larger contexts, over timescales, and in permutations of juxtaposition that our unaided minds can’t.

Strings

Data is mostly going to come to us as something called a String. I guess it is called a string because it is a bunch of characters strung together. You have run across this already because we mostly chose to have the Arduino send strings to Processing. Even when we sent numbers we sent “100” instead of 100 (print() instead of write()).

String, The Special Object

A string is an object of type String, so when you declare the variable to hold it you use the type “String.” Because you make strings so often, you don’t have to bother with the “new” and parentheses you use when creating all other objects.
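A minimal sketch in plain Java (the language underneath Processing) of the shortcut next to the longhand form. The class and method names are made up for illustration.

```java
class StringDeclaration {
    // The shortcut: a quoted literal makes the String object for you.
    static String shortcut() {
        String greeting = "hello";
        return greeting;
    }

    // The longhand: "new" and parentheses, as with every other object.
    static String longhand() {
        String greeting = new String("hello");
        return greeting;
    }

    public static void main(String[] args) {
        System.out.println(shortcut());
        System.out.println(longhand());
    }
}
```

Both print the same text; the shortcut just saves typing.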

 

All the Things a String Object Can Do

The easy way of making a String seems normal because that is how you would make an int or a float or a byte or a boolean. But those are all just primitive values that have nothing else inside of them. Don’t let the easy way String objects are made hide the fact that Strings are objects with tons of functions inside them, like length(), charAt(), substring() and toUpperCase().
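Here is a small plain-Java sketch that calls a few of those functions on one string. The describe() helper is a made-up name for this example only.

```java
class StringFunctions {
    // Calls several functions that live inside every String object.
    static String describe(String s) {
        return s.length() + " chars, starts with '" + s.charAt(0)
            + "', shouted: " + s.toUpperCase()
            + ", first three: " + s.substring(0, 3);
    }

    public static void main(String[] args) {
        // prints: 6 chars, starts with 's', shouted: SENSOR, first three: sen
        System.out.println(describe("sensor"));
    }
}
```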

 

You can get a look at some of the power inside Strings in the Processing documentation. But this might be a good moment to show you that you are really writing Java code rather than Processing code. Processing gives you a more civilized view of what is really Java. Although it is more civilized, it is also more limited. The String object is a place where you might want to see the full power of maybe 50 String functions in Java beneath the civilized veil of maybe 7 listed in Processing. For instance the useful indexOf() function, which searches within a string, is a Java String function, along with many others that are not mentioned in the Processing subset.
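As a rough sketch of what indexOf() gives you, here it is paired with substring() to pull out whatever follows a marker. The after() helper and the "temp=72" input are invented for the example.

```java
class FindInString {
    // Returns the text after the first occurrence of marker, or "" if it is absent.
    static String after(String text, String marker) {
        int at = text.indexOf(marker);   // position of the match, or -1 if not found
        if (at == -1) return "";
        return text.substring(at + marker.length());
    }

    public static void main(String[] args) {
        System.out.println(after("temp=72", "="));  // prints 72
    }
}
```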

 

A Common Error with String in If Statements

The special easy way of making String objects leads to a very common mistake: using them in “if statements” like you would primitive numbers instead of objects. You have to compare them using the equals() function inside the String object, not the == that you are used to using for numbers. The == just checks to see if it is the same object, not if the contents are the same.
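A minimal plain-Java sketch of the difference. The new String("yes") stands in for text that arrives at runtime (from serial or the keyboard, say), and the helper names are hypothetical.

```java
class CompareStrings {
    // Compares the characters inside the two strings.
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    // Only asks whether both variables point at the very same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String typed = new String("yes");   // a separate object with the same letters
        String expected = "yes";
        System.out.println(sameContents(typed, expected)); // prints true
        System.out.println(sameObject(typed, expected));   // prints false
    }
}
```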

Drawing a String

The text("Drawme", 100, 100) function is used to draw text on the screen. You just supply the string and the x and y coordinates. Pretty easy. To use a particular font you need to use a font object. Hopefully you are used to the routine of objects by now: declare it at the top, make it in setup and use it in draw (usually).

Much More About Strings

This was a very quick tour through the String object.  This is a great tutorial with more details, particularly about animating the text.

Words (old-fashioned data)

People have been exchanging strings of characters in the form of prose for millennia, and the advent of the computer has radically increased the number of words a typical person writes. While it is hard to get a computer to be as original as humans or to understand ambiguity as we can, the parsing and comparing of words that computers can do is a very complementary skill to ours. It is obviously useful when computers read through all your emails for the words “Mussolini Memorabilia” so you don’t have to spend hours doing it. But computers can also find patterns in your prose that you would never find no matter how much time you spent.

Word Counting

Textual analysis starts by counting words and finding their relative frequency. To find the topics that are important in a particular corpus of text, you could find the words that get repeated frequently there and infrequently in a bigger corpus of text. In other words, check the most commonly used words in your college application against all college applications. You can do this fairly intuitively or apply some more sophisticated math to it. This naturally causes the small and common words that everyone uses all the time to fall away, leaving just the “important” words.
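One intuitive way to do that comparison, sketched in plain Java: divide a word’s relative frequency in your text by its relative frequency in the bigger corpus. This ratio is just one simple choice, not the only math you could apply, and the function names and sample sentences are made up.

```java
class Distinctive {
    // Fraction of the words in text that are exactly this word.
    static double relativeFrequency(String word, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        int hits = 0;
        for (String w : words) {
            if (w.equals(word)) hits++;
        }
        return (double) hits / words.length;
    }

    // Above 1 means the word is more common in "mine" than in "everyone".
    static double distinctiveness(String word, String mine, String everyone) {
        double background = relativeFrequency(word, everyone);
        if (background == 0) return Double.POSITIVE_INFINITY; // never seen elsewhere
        return relativeFrequency(word, mine) / background;
    }

    public static void main(String[] args) {
        String mine = "the robot and the sensor";
        String everyone = "the cat and the dog and the robot sat on the mat";
        System.out.println(distinctiveness("the", mine, everyone));
        System.out.println(distinctiveness("robot", mine, everyone));
    }
}
```

With these toy sentences “robot” scores higher than “the”, which is the falling away of common words described above.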

Another approach is to concentrate on the small words, called function words. These words don’t tell you much about the topic of the text but they can tell you a lot about the style. Try counting the relative use of pronouns like I, me and my. The Secret Life of Pronouns is an interesting read on this subject. Counting these small words and comparing style is an example of something that is extremely hard to do without a computer.

Using Dictionaries to Store Key/Value Pairs

So to count words you need to keep track of the words and the counts for each.  Because there are a lot of them, it sounds like a job for an array and that might work.  Actually there is a special kind of an array in Processing called a “dictionary” or “dict” for short that is perfect for the job because every entry has a key/value pair, in this case the word and the count.  It would look something like this:

[“of”:55, “the”:22, “Mussolini”:5, “Memorabilia”:4]

Processing has an IntDict, a FloatDict and a StringDict, but IntDict is the one we will use here. These allow you to look up a word like in a dictionary or an index instead of going through a repeat loop to find the one you are looking for.

You find these special arrays with key/value pairs in all programming languages, called different things like associative arrays (PHP), named arrays (ActionScript) and HashMaps (Java; we might actually use these later).

The code below for doing word counting with IntDicts has two new things. We have run across the split command before, when we tried reading in multiple sensors from Arduino and split them apart on the basis of the comma. In fact the split command is probably the most common command in the world as text passes back and forth across the internet and needs to be parsed. You can imagine that in prose we would mostly want to split on the basis of the space character between words instead of the comma. In this example we use splitTokens(), which can split based on the multiple punctuation characters that might delimit words.

The other new thing is how to use IntDict. The main function we call is increment(), which should really be called addIfItDoesn’tExistOrIncrementIfItDoes() but that seemed too long. Later, after we have added all the words to the dictionary, we use keyArray() to get just the keys (the words) as an array.

Finally we get the count for a given word from the dictionary with the get() function.

Here is all the code together:
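A minimal sketch of the same idea in plain Java, assuming a HashMap stands in for IntDict and a regular-expression split stands in for splitTokens(). The class name and sample sentence are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class WordCount {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on runs of spaces and common punctuation (the role splitTokens() plays).
        for (String word : text.toLowerCase().split("[ ,.?!;:\\n]+")) {
            if (word.isEmpty()) continue;          // a leading delimiter yields one empty token
            counts.merge(word, 1, Integer::sum);   // add with 1, or increment if it exists
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("The cat. The dog, the end!");
        for (String word : counts.keySet()) {      // the list of just the keys (words)
            System.out.println(word + ": " + counts.get(word));
        }
    }
}
```

The merge() call does the job described for increment(): it adds the word with a count of 1 if it is new and adds one to the count otherwise.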

Here are some other examples of word counting: Word counting animation

Now the dictionaries don’t work if you want to save more than just the count for each word. For that you would probably create an object for each word. For instance, for visualizations you might want to keep the x and y location or color of each word inside an object. Instead of a dictionary you would use a HashMap where the key is still the word but the value is one of your objects. Here is an example of using a HashMap.
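A rough sketch of that arrangement in plain Java. WordInfo is a hypothetical class, and the layout rule (each new word placed 50 pixels to the right of the last) is an arbitrary choice for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical object holding everything we want to remember about one word.
class WordInfo {
    int count;
    float x, y;
}

class WordMap {
    static Map<String, WordInfo> build(String[] words) {
        Map<String, WordInfo> map = new HashMap<>();
        float nextX = 0;
        for (String w : words) {
            WordInfo info = map.get(w);    // look the word up by key
            if (info == null) {            // first sighting: make its object
                info = new WordInfo();
                info.x = nextX;
                info.y = 100;
                nextX += 50;
                map.put(w, info);
            }
            info.count++;
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, WordInfo> map = build(new String[] {"data", "string", "data"});
        WordInfo d = map.get("data");
        System.out.println("data: " + d.count + " at " + d.x + "," + d.y);
    }
}
```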

It would be fun to try downloading all of your years of Gmail and parsing it.

Here are some other sources of text (thank you Adam Parrish):

  • Prepared example texts that I reference frequently in class
  • Project Gutenberg
  • Common Crawl, “a repository of web crawl data that is openly accessible to everyone”
  • Corpus of Contemporary American English: search for frequencies and contexts of words and phrases in “the largest freely-available corpus of English.” (Provides no API, unfortunately.)
  • Wordnik, a dictionary. The Wordnik API “lets you request definitions, example sentences, spelling suggestions, related words like synonyms and antonyms, phrases containing a given word, word autocompletion, random words, words of the day, and much more.”
  • Corpus resources

Computer Data

Now that we have computers we often change the way we punctuate our words to make them easier for the computer to parse. For instance, as compared with normal prose, when you sent data from Arduino to Processing you used commas instead of the usual spaces between words and an end of line character ('\n') instead of the period you use in prose. This is a very common form of computer punctuation called CSV, or Comma Separated Values. This was easy to parse on the receiving side using readStringUntil('\n') and split(input, ","). So you have already used computer-friendly punctuation for easy computer parsing.
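The receiving side can be sketched in plain Java like this, assuming each line holds whole-number sensor readings. The parse() helper and the sample readings are made up.

```java
class CsvLine {
    // Turns one line such as "100,512,37\n" into an array of ints.
    static int[] parse(String line) {
        String[] fields = line.trim().split(",");   // trim() drops the trailing '\n'
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        int[] sensors = parse("100,512,37\n");
        System.out.println(sensors[0] + " " + sensors[1] + " " + sensors[2]);
    }
}
```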

Sources of Data

Scraping HTML off Web Pages using loadStrings() (not civilized)

You should not try this part because it is not civilized to parse HTML. It is formatted for the benefit of your eyeball, not your code. Still it is pretty powerful to know that anything on the Web could be your data source and manipulated as data by Processing. Try viewing the source of this page (in Chrome use the menu View>Developer>View Source) and you will see the big long string of HTML formatting that the browser parses and uses to render this page. It is not ideal but you could write code to parse this, for instance finding all the headlines by searching for all the <h2>’s in the string (see below). But HTML is used to format text for the benefit of your eyeballs, mostly describing the colors and fonts rather than the meaning (semantics) of the text. For instance it might say that “10011” should be red when you want to know that it is a zip code. While HTML is not ideal as a data source, sometimes you have no other choice and you have to “scrape” the page and derive the semantics based on conventions of the formatting (hoping no one decides to give their web site a new look).

The key function here is loadStrings(), which can be fed either a url or a filename of text. Notice that the text gets loaded as an array of Strings, one item in the array for each line in the text. The join() function right after the load function is used to glue the items of that array into one big string, disregarding the new lines. We then enter an endless loop using the indexOf() function to find what we are looking for and the substring() function to pull out the parts, breaking out of the loop when indexOf() finds nothing more. This is scraping and not very civilized.
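The scraping loop can be sketched in plain Java as follows, working on a string that has already been loaded and joined. It assumes lowercase <h2> tags with no attributes, which real pages often violate, and the sample HTML is invented.

```java
import java.util.ArrayList;
import java.util.List;

class Scrape {
    // Collects the text between every <h2> and </h2> pair in one big string.
    static List<String> headlines(String html) {
        List<String> found = new ArrayList<>();
        String open = "<h2>";
        String close = "</h2>";
        int from = 0;
        while (true) {
            int start = html.indexOf(open, from);   // next opening tag, or -1
            if (start == -1) break;
            int end = html.indexOf(close, start);   // its closing tag, or -1
            if (end == -1) break;
            found.add(html.substring(start + open.length(), end));
            from = end + close.length();            // continue after this headline
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<h1>ICM</h1><h2>Data</h2><p>text</p><h2>Strings</h2>";
        System.out.println(headlines(page));
    }
}
```

The loop depends entirely on the page’s formatting conventions, which is why a redesign of the site would break it.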

Data Formatted with Machines in Mind

Making data available to your eyeballs using HTML is good, but if the data is formatted especially for the machine to read, your code can more easily do you all sorts of favors, searching and correlating massive and diverse data sets. Luckily the power of machine readable datasets is well known and many organizations with data have found it in their hearts to deliver machine readable data.

Data Formats CSV, TSV, XML, JSON

Like I said, you have already dealt with formatting. When you sent the data using commas and end of line characters ('\n') from Arduino to Processing you were using CSV (Comma Separated Values). All these formats mainly differ in the separators used to delimit the “fields” (sensors) and records (different readings over time). A good separator is one that will not be found in the fields you are trying to separate. So if you fear that a comma might be in one of the fields you want to separate, you might use TSV (Tab Separated Values) where you separate the fields by '\t' instead of a comma. The separators in HTML and XML are “tags.” But where HTML uses formatting tags like <b>, XML uses semantic tags like <age>. You may have heard of RSS, which is a way for web pages to provide an XML version in addition to the HTML version so they are machine readable. For instance here is the XML version of this blog https://itp.nyu.edu/classes/icm-dano-spring2014/feed/ (you can just add feed to the end of your blog url). XML is a little wordy and so another format called JSON is getting more popular; it uses a collection of curly brackets, colons, commas and square brackets for delimiters. Very conveniently Processing has commands called loadTable() for loading CSV formatted data, loadXML() for XML formatted data and loadJSONObject() and loadJSONArray() for, you guessed it, JSON formatted data. Check out these examples:
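To see why the choice of separator matters, here is a small plain-Java sketch that splits tab separated records whose first field itself contains a comma. The city and zip code rows are made up, and this hand-rolled split is only an illustration, not how the Processing load functions work inside.

```java
class TsvTable {
    // Splits text into records on '\n' and each record into fields on '\t'.
    static String[][] parse(String data) {
        String[] records = data.split("\n");
        String[][] table = new String[records.length][];
        for (int i = 0; i < records.length; i++) {
            table[i] = records[i].split("\t");
        }
        return table;
    }

    public static void main(String[] args) {
        String data = "New York, NY\t10011\nAustin, TX\t78701";
        String[][] table = parse(data);
        // The comma inside the city field survives because tab is the separator.
        System.out.println(table[0][0] + " has zip " + table[0][1]);
    }
}
```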

Notice that all of these load functions also have matching save functions. This allows you to add to a database and become a data source!

API’s

Sometimes you just get the data as a big long string. Other times you have to ask for certain parts of the data. This is useful if the data source is very big or complex and the provider does you the favor of sorting or selecting parts of it for you. The data provider will expose an API (Application Programming Interface), which provides an interface for your program rather than your fingers and eyes. For instance Google could not very well just give you all the images in their database, so you have to ask for them using search terms. The API in this case is just a url with what is called a query string:


The query string is everything to the right of the question mark in any url. It usually takes the form of key/value pairs separated by ampersands, like v=10&q="dog+sled". Using an API in this way is a bit like calling a function, and the query string is a bit like the parameters. Here is the full Google image API example, but be aware that they put limits on how many queries a minute you can make. Here is a Processing library for getting the Yahoo Weather. Here is a list of API’s that don’t need an account.
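A minimal sketch of assembling such a url in plain Java. The base address https://example.com/search and the parameter names are placeholders, not a real API; URLEncoder takes care of turning the space in “dog sled” into a plus sign.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryString {
    // Appends ?key=value&key=value to the base url, escaping each piece.
    static String build(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl);
        char separator = '?';                     // first pair follows '?', the rest '&'
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);   // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("v", "10");
        params.put("q", "dog sled");
        System.out.println(build("https://example.com/search", params));
    }
}
```

The resulting string is what you would hand to a load function, much like passing arguments to a function call.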

Another reason providers like to provide API’s instead of just the raw data is that it gives them some control over who uses the data and how. So you have to sign up for an account and get some secret words to place in your code that identify you to the supplier. This is called authentication, and sometimes it is very simple, like the NYTimes API using JSON, and sometimes a little tricky. The most famous system for doing authentication is called OAuth, which seems like a bother but is not really so bad. For example Twitter now requires you to authenticate in this way, and Jer Thorp has a very clear tutorial about how to use it.

Using Threads for Networked Communication

The problem with the loadStrings() function and all the related load functions is that they are blocking functions. That means that each one stops all activity in your program until the data it was asking for comes back. Because networks can be slow or unpredictable, it would be better to multitask and allow your main draw loop to continue animating while another loop or “thread” runs in parallel, asking for data and waiting for it to come back.
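A rough plain-Java sketch of the idea, with a sleep standing in for the slow network call and a print statement standing in for drawing a frame. The BackgroundLoader class is hypothetical; in a Processing sketch the built-in thread("functionName") call is the usual way to start the second loop.

```java
import java.util.function.Supplier;

class BackgroundLoader {
    // Stays null until the data arrives; volatile so the main loop sees the update.
    volatile String result = null;

    // Starts the slow request on its own thread and returns immediately.
    Thread start(Supplier<String> slowSource) {
        Thread t = new Thread(() -> result = slowSource.get());
        t.start();
        return t;
    }

    static void pause(int ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BackgroundLoader loader = new BackgroundLoader();
        loader.start(() -> { pause(200); return "data from the network"; });
        while (loader.result == null) {       // the "draw loop" keeps running meanwhile
            System.out.println("drawing a frame...");
            pause(50);
        }
        System.out.println(loader.result);
    }
}
```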

Resources:
