Data

With physical computing we were making the point that more of the expression of your body and its environment needs to be digitized. We set about putting sensors, cameras and actuators in unexpected places to turn our expression into a bunch of numbers that computers can make use of, namely data. Besides all your crazy new sensors putting out data, the increasing part of your experience that passes through laptop and smartphone keyboards is already leaving behind a wide trail of easy-to-manipulate, pre-digitized artifacts of your conscious and unconscious expression. The companies, venues and countries that you come near are already pretty good at provoking you to generate such data and then gathering and analyzing it. Politically, it seems that if you want to change the world, or at least not be dominated by it, you should learn about data analysis. But artistically this is also a tool with new capabilities to reveal things about yourself and about groups. The computer lends us a prosthesis to see smaller and larger contexts, over timescales, and in permutations of juxtaposition that our unaided minds can’t. Check out these two articles:

Strings (aka Text)

Data is mostly going to come to us as something called a String. I guess it is called a string because it is a bunch of characters strung together. You have run across this already because we previously chose to have the Arduino send strings to P5. Even when we sent numbers we sent “100” instead of 100 (print() instead of write()).

String, The Special Object

A string is an object of type String. In some other languages (like C, aka Arduino, or Java, aka Processing), when you declare the variable to hold your string (aka your text) you use the type “String.” Because you make strings so often, you don’t have to bother with the “new” and parentheses you use when creating all other objects.

For example, if you wanted to use text inside your Arduino sketch you could say:

In Javascript and P5 however, we can just declare a string and store it in a variable the way we declare any other variable:

Javascript knows it’s a string because we put our string in between quotation marks “ ” (you can also use single quotes ‘ ’ to declare strings).
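As a minimal sketch of that idea (the variable names here are made up for illustration), both kinds of quotes produce the same kind of value:

```javascript
// Two equivalent ways to write the same string literal.
const greeting = "hello world";
const sameGreeting = 'hello world';

// typeof reports what kind of value a variable holds.
console.log(typeof greeting);

// The quote style does not change the contents of the string.
console.log(greeting === sameGreeting);
```

Picking one quote style and sticking to it is mostly a matter of taste; single quotes are handy when the text itself contains double quotes.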

All the Things a String Object Can Do

The easy way of making a String seems normal, and that is how you would make an int or a float or byte or a boolean. But those are all just primitive values that have nothing else inside of them. Don’t let the easy way String objects are made hide the fact that Strings are objects with tons of functions inside them, like:
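Here is a small, hedged sampler of a few built-in Javascript string functions. The example phrase and variable names are invented for illustration; the functions themselves are standard Javascript.

```javascript
const phrase = "Physical Computing";

// length is a property: how many characters are in the string.
const howLong = phrase.length;

// toUpperCase() returns a new string; the original is left unchanged.
const shouted = phrase.toUpperCase();

// charAt(0) gives the character at position 0 (counting starts at zero).
const firstLetter = phrase.charAt(0);

// substring(0, 8) gives characters 0 up to, but not including, 8.
const firstWord = phrase.substring(0, 8);

console.log(howLong, shouted, firstLetter, firstWord);
```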

 

You can get a look at some of the power inside Strings in this W3 Schools reference. But this might be a good moment to show you that you are really writing Javascript code rather than p5 code. P5 gives you a more civilized view of what is really Javascript. Although it is more civilized, it is also more limited. The String object is a place where you might want to see the full power of the many String functions in Javascript beneath the civilized veil of the few listed in the p5 documentation. So for instance the useful indexOf() function is used to search within a string, although it is not mentioned in the Processing subset.
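A quick sketch of indexOf() in plain Javascript (the sentence is just a made-up example). It returns the position where the search text starts, or -1 if the text is not there:

```javascript
const sentence = "the quick brown fox";

// Position of the first character of "brown" inside the sentence.
const where = sentence.indexOf("brown");

// Searching for something that is not present gives -1.
const missing = sentence.indexOf("zebra");

if (missing === -1) {
  console.log("no zebra here, but brown starts at position " + where);
}
```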

 

Comparing strings

In Javascript and therefore p5 we can compare strings the same way we do our other variables.
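For example, a minimal sketch with invented variable names. Note that the comparison is case sensitive and that the less-than operator compares strings in dictionary order:

```javascript
const typed = "yes";
const expected = "yes";

// Same characters in the same order, so these are equal.
const sameText = typed === expected;

// Upper and lower case letters are different characters.
const differentCase = "Yes" === "yes";

// < and > compare strings character by character.
const alphabetical = "apple" < "banana";

console.log(sameText, differentCase, alphabetical);
```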

We can also combine strings in a variety of ways:
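Three common ways to glue strings together, sketched with made-up names. All three build the same result here:

```javascript
const first = "Ada";
const last = "Lovelace";

// The + operator sticks strings end to end.
const plus = first + " " + last;

// Template literals (backticks) drop variables into the text.
const template = `${first} ${last}`;

// join() combines the items of an array with a separator between them.
const joined = [first, last].join(" ");

console.log(plus, template, joined);
```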

 

This next bit may not be too important right now, but it may come up at some point. Javascript is also funny in the sense that you can choose to be strict about comparing strings to other variable types by using the triple equals ( === ) operator. The double equals ( == ) operator compares the two values after converting them to a common type. The triple equals ( === ) operator compares both value and variable type.
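A small sketch of the difference, using the “100” versus 100 case from the Arduino example (variable names are invented):

```javascript
const asText = "100";   // a string, like the one Arduino sent with print()
const asNumber = 100;   // an actual number

// == converts the string to a number before comparing.
const looseMatch = asText == asNumber;

// === refuses to match because the types differ.
const strictMatch = asText === asNumber;

// Converting explicitly makes the strict comparison succeed.
const converted = Number(asText) === asNumber;

console.log(looseMatch, strictMatch, converted);
```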

Don’t worry too much about this yet though.

 

Drawing a String

The text(“Drawme”, 100, 100) function is used to draw text on the screen. You just supply the string and the x and y coordinates. Pretty easy. To use a particular font you need to declare it.
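A minimal p5 sketch of this, assuming the p5.js library is loaded on the page. The font name “Georgia” and the canvas size are just assumptions for illustration; any font installed on your machine should work the same way with textFont().

```javascript
function setup() {
  createCanvas(400, 200);
  textFont("Georgia"); // assumed font name; swap in one you have
  textSize(32);        // size in pixels
}

function draw() {
  background(220);
  fill(0);                   // black letters
  text("Drawme", 100, 100);  // the string, then x and y
}
```

The y coordinate is the baseline of the text by default, so the letters sit just above y = 100.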

 

Much More About Strings

This was a very quick tour through the String object.  This is a great processing tutorial with more details, particularly about animating the text.

Words (old-fashioned data)

People have been exchanging strings of characters in the form of prose for millennia, and the advent of the computer has radically increased the number of words a typical person writes. While it is hard to get a computer to be as original or to understand ambiguity as well as humans can, the parsing and comparing of words that computers can do is a very complementary skill to ours. It is obviously useful when computers read through all your emails for the phrase “Mussolini Memorabilia” so you don’t have to spend hours doing it. But computers can also find patterns in your prose that you would never find no matter how much time you spent.

Word Counting

Textual analysis starts by counting words and finding their relative frequency. To find the topics that are important in a particular corpus of text you could find the words that get repeated frequently there and infrequently in a bigger corpus of text. In other words, check the most commonly used words in your college application against all college applications. You can do this fairly intuitively or apply some more sophisticated methods to it. This naturally causes the small and common words that everyone uses all the time to fall away, leaving just the “important” words.
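One rough way to sketch this comparison in plain Javascript is below. Everything here is an assumption for illustration: the function names tallyWords and distinctiveWords are made up, the two tiny texts stand in for real corpora, and the scoring (your rate divided by the big corpus rate, with one added to the big count so unseen words don’t divide by zero) is just one simple choice among many.

```javascript
// Count each word in a text and remember the total number of words.
function tallyWords(text) {
  const words = text.toLowerCase().split(/[^a-z']+/).filter((w) => w.length > 0);
  const counts = new Map();
  for (const w of words) {
    counts.set(w, (counts.get(w) || 0) + 1);
  }
  return { counts, total: words.length };
}

// Score each of my words by how much more often I use it than everyone else.
function distinctiveWords(myText, bigText) {
  const mine = tallyWords(myText);
  const everyone = tallyWords(bigText);
  const scored = [];
  for (const [word, n] of mine.counts) {
    const myRate = n / mine.total;
    // Add one so words missing from the big corpus do not divide by zero.
    const bigRate = ((everyone.counts.get(word) || 0) + 1) / (everyone.total + 1);
    scored.push({ word, score: myRate / bigRate });
  }
  // Highest score first: these are the words most particular to my text.
  scored.sort((a, b) => b.score - a.score);
  return scored;
}

// Tiny made-up stand-ins for "my essay" and "all essays".
const essay = "I love robots and I build robots";
const allEssays = "I love school and I love sports and I love my family and I build character";
const ranked = distinctiveWords(essay, allEssays);
console.log(ranked[0].word);
```

With real corpora you would want much more text on both sides before trusting the ranking.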

Another approach is to concentrate on the small words, called function words. These words don’t tell you much about the topic of the text but they can tell you a lot about the style. Try counting the relative use of function words like “like”, “I”, “me” and “my”. The Secret Life of Pronouns is an interesting read on this subject. Counting these small words and comparing style is an example of something that is extremely hard to do without a computer.

So to count words you need to keep track of the words and the counts for each.  Because there are a lot of them, it sounds like a job for an array and that might work.

The code below for doing word counting uses two new things. We have run across the split command before, when we tried reading in multiple sensors from Arduino and split them apart on the basis of the comma. In fact the split command is probably the most common command in the world as text passes back and forth across the internet and needs to be parsed. You can imagine that in prose we would mostly want to split on the basis of the space character between words instead of the comma. In this example we use splitTokens(), which can split based on the multiple punctuation characters that might delimit words.

Here is all the code together:
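What follows is a hedged reconstruction rather than the original listing. It is a plain Javascript sketch that uses split() with a regular expression as a stand-in for p5’s splitTokens(), and a Map to hold each word with its count. The function name countWords and the sample sentence are assumptions for illustration.

```javascript
function countWords(text) {
  // Split on runs of whitespace and common punctuation
  // (a plain Javascript stand-in for splitTokens()).
  const words = text
    .toLowerCase()
    .split(/[\s.,;:!?"()]+/)
    .filter((w) => w.length > 0); // drop empty leftovers at the edges

  // A Map pairs each word with the number of times we have seen it.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

const counts = countWords("The cat sat. The cat ran, and the dog sat!");

// Turn the Map into [word, count] pairs and sort by count, biggest first.
const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
console.log(sorted);
```

Lower-casing first means “The” and “the” are counted as the same word, which is usually what you want for this kind of tally.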

 

Here are some other examples of word counting: Word counting animation

It would be fun to try downloading all of your years of Gmail and parsing it.

Here are some other sources of text (thank you Allison Parrish):

  • Prepared example texts that I reference frequently in class
  • Project Gutenberg – e-books
  • Common Crawl, “a repository of web crawl data that is openly accessible to everyone”
  • Corpus of Contemporary American English: search for frequencies and contexts of words and phrases in “the largest freely-available corpus of English.” (Provides no API, unfortunately.)
  • Wordnik, a dictionary. The Wordnik API “lets you request definitions, example sentences, spelling suggestions, related words like synonyms and antonyms, phrases containing a given word, word autocompletion, random words, words of the day, and much more.”
  • Darius Kazemi’s Corpora
  • Corpus resources

Computer Data

Now that we have computers we often change the way we punctuate our words to make them easier for the computer to parse. For instance, as compared with normal prose, when you sent data from Arduino to Processing you used commas instead of the usual spaces between words and an end-of-line character (‘\n’) instead of the period you use in prose. This is a very common form of computer punctuation called CSV, or Comma Separated Values. This was easy to parse on the receiving side using readStringUntil(‘\n’) and split(input, “,”). So you have already used computer-friendly punctuation for easy computer parsing.
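As a small sketch of that parsing in plain Javascript, assuming three made-up sensor readings per line: first split on the end-of-line character to get the records, then split each record on the comma to get the fields.

```javascript
// Two made-up readings, as they might arrive from the serial port.
const incoming = "512,87,1\n498,90,0\n";

const records = incoming
  .trim()        // drop the trailing end-of-line character
  .split("\n")   // one record per line
  .map((line) => line.split(",").map(Number)); // fields, turned into numbers

console.log(records);
```

Each field arrives as text, which is why the last step converts it with Number before you do any math on it.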

Sources of Data

Data Formatted with Machines in Mind

Making data available to your eyeballs using HTML is good, but if the data is formatted especially for the machine to read, your code can more easily do you all sorts of favors, searching and correlating massive and diverse data sets. Luckily the power of machine-readable datasets is well known, and many organizations with data have found it in their hearts to deliver machine-readable data.

Data Formats CSV, TSV, XML, JSON

Like I said, you have already dealt with formatting. When you sent the data using commas and end-of-line characters (‘\n’) from Arduino to P5 you were using CSV (Comma Separated Values). All these formats mainly differ in the separators used to delimit the “fields” (sensors) and records (different readings over time). A good separator is one that will not be found in the fields you are trying to separate. So if you fear that a comma might be in one of the fields you want to separate, you might use TSV (Tab Separated Values), where you separate the fields by ‘\t’ instead of a comma. The separators in HTML and XML are “tags.” But where HTML uses formatting tags like <b>, XML uses semantic tags like <age>. You may have heard of RSS, which is a way for web pages to provide an XML version in addition to the HTML version so they are machine readable. For instance here is the XML version of this blog https://itp.nyu.edu/classes/icm-dano-spring2014/feed/ (you can just add feed to the end of your blog url). XML is a little wordy and so we will mostly use (for now) another format called JSON.

JSON comes from Javascript: it stands for JavaScript Object Notation. It is getting more popular, and it uses a collection of curly brackets, colons, commas and square brackets for delimiters.
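Here is a minimal sketch of what those delimiters look like and how plain Javascript reads them. The sensor data is invented for illustration; JSON.parse() and JSON.stringify() are built into Javascript.

```javascript
// Curly brackets hold named fields, square brackets hold lists.
const jsonText = '{"name": "sensor1", "readings": [512, 498, 530], "location": {"room": "lab"}}';

// JSON.parse turns the text into a live Javascript object.
const parsed = JSON.parse(jsonText);

// Now the fields can be reached with dots and square brackets.
const secondReading = parsed.readings[1];
const room = parsed.location.room;

// JSON.stringify goes the other way, from object back to text.
const backToText = JSON.stringify(parsed);

console.log(secondReading, room, backToText);
```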

Very conveniently P5 has commands called loadTable() for loading CSV formatted data, loadXML() for XML formatted data and loadJSON() for, you guessed it, JSON formatted data. Check out these examples!

Notice that these load functions also have save counterparts. This allows you to add to a database and become a data source!

APIs

Sometimes you just get the data as a big long string. Other times you have to ask for certain parts of the data. This is useful if the data source is very big or complex and the provider does you the favor of sorting or selecting parts of it for you. The data provider will expose an API (Application Programming Interface), which provides an interface for your program rather than your fingers and eyes.

For instance if we want weather data from OpenWeatherMap, we can’t just access the entire database of all the worldwide weather data. We have to ask for it using search terms. The API in this case is just a url with what is called a query string:

 

The query string is everything to the right of the question mark in any url. It usually takes the form of key-value pairs separated by ampersands, like v=10&q=“dog+sled”. Using an API in this way is a bit like calling a function, and the query string is a bit like the parameters.
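A hedged sketch of building such a url in plain Javascript follows. The endpoint address and the parameter names q, units and appid are assumptions based on OpenWeatherMap’s current weather API, so check their documentation before relying on them; YOUR_API_KEY is a placeholder, not a real key. URLSearchParams is a standard helper that takes care of the ampersands and of encoding spaces.

```javascript
// Assumed endpoint; confirm it against the provider's documentation.
const baseUrl = "https://api.openweathermap.org/data/2.5/weather";

// Each key-value pair becomes one piece of the query string.
const params = new URLSearchParams({
  q: "New York",          // the city we are asking about
  units: "imperial",      // ask for Fahrenheit
  appid: "YOUR_API_KEY",  // placeholder for your own key
});

// Everything after the ? is the query string.
const url = baseUrl + "?" + params.toString();
console.log(url);

// Reading a value back out works like looking up a function parameter.
const city = new URL(url).searchParams.get("q");
console.log(city);
```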

Also notice we need an API key. Another reason providers like to provide APIs instead of just the raw data is that it gives them some control over who uses the data and how. So you have to sign up for an account and get some secret words (an API key) to place in your code that identify you to the supplier. This is called authentication and sometimes it is very simple. Try it with OpenWeatherMap. Sign up for a free account and copy the API key into your code. Leave no spaces in your string when you do so.

 

We now need to use the loadJSON function to actually get that weather data into our sketch. Notice that instead of using the preload() function we are now using a callback to asynchronously load our data. loadJSON takes 2 parameters here: the url to your data source and the callback function, which I named getData().

We then need to write the getData function so the sketch knows what to do with the data once we get it. The data argument inside the parentheses will automatically be populated, and we can store that information in our weather variable:
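A minimal sketch of those two steps, assuming p5.js is loaded. The url below is an assumed OpenWeatherMap address with a placeholder key, so substitute your own; the names weather and getData follow the text above.

```javascript
// Assumed url with a placeholder key; replace with your own.
const url = "https://api.openweathermap.org/data/2.5/weather?q=New+York&units=imperial&appid=YOUR_API_KEY";

let weather; // stays undefined until the data arrives

function setup() {
  createCanvas(400, 400);
  // loadJSON returns right away; p5 calls getData later, when the data is in.
  loadJSON(url, getData);
}

// p5 hands the parsed JSON to this function as its first argument.
function getData(data) {
  weather = data;
}
```

Because the data shows up some time after setup() finishes, any code that uses weather should first check that it is not still undefined.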

Now let’s visualize our data. I’ll make a circle whose diameter is dependent on the temperature of a given city.
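One way this could look is sketched below, assuming p5.js is loaded. To keep the example self-contained, weather starts out as a made-up stand-in object shaped like an OpenWeatherMap response (temperature under main.temp, which is an assumption to verify against the real data). The helper temperatureToDiameter and its 0 to 100 degree range are also invented for illustration.

```javascript
// Stand-in for what the API callback would store; values are made up.
let weather = { name: "New York", main: { temp: 72 } };

// Map 0..100 degrees F onto diameters of 20..300 pixels.
function temperatureToDiameter(tempF) {
  const clamped = Math.min(Math.max(tempF, 0), 100); // keep within range
  return 20 + (clamped / 100) * 280;
}

function setup() {
  createCanvas(400, 400);
}

function draw() {
  background(255);
  if (!weather) return; // nothing to draw until the data has arrived
  const d = temperatureToDiameter(weather.main.temp);
  fill(255, 120, 0);
  ellipse(width / 2, height / 2, d, d); // circle centered on the canvas
}
```

Clamping the temperature keeps an unusually hot or cold day from drawing a circle bigger than the canvas or smaller than a dot.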

Much More:

There are many different APIs for all kinds of publications, social media platforms, news sources, and other data providers. The ease of use and level of security and authentication vary from source to source:

One authentication protocol you will encounter as you try new APIs is called OAuth (and OAuth2), which seems like a bother but is not really so bad. For example, Twitter now requires you to authenticate in this way, and Jer Thorp has a very clear tutorial about how to use it. Roopa Vasudevan also has some examples for using the YouTube API.

Resources:
