Vision

Parsable Space

Telling stories in moving pictures is the most powerful technology ever developed. Unfortunately, in the usual progression toward an even more powerful medium that can be parsed and then recombined like text, motion pictures are stuck in the age of scribes. Film and video have been kept from having their "movable type moment" because scenes never get broken down beyond the level of the frame. Each frame is merely a collection of colored pixels instead of a collection of separable and recombinable elements. We are seeing hints of progress here, with elements being captured and manipulated separately in video games and special effects. This week we are going to take some steps toward finding some structure in video images.

Getting an Image

From a URL or a File

Many of you have already done this. If you use a filename rather than a URL, you first have to put the file in a "data" folder in your sketch folder. You can use the "Sketch>Add File" menu to do this in Processing. Make sure you declare the variable at the top so you can load it in setup() but use it in draw(). If you load it over and over in draw() your program will be slow. The image() function can take two coordinates for the position of the image, but optionally you can supply two more that scale the width and height of the image.
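For example, a minimal sketch along these lines ("flower.jpg" stands in for whatever file you added, and the URL is a placeholder):

PImage img;  // declare at the top so setup() and draw() can both see it

void setup() {
  size(640, 480);
  // load once in setup(); loading in draw() would repeat the slow file read every frame
  img = loadImage("flower.jpg");
  // or from a URL: img = loadImage("http://example.com/flower.jpg");
}

void draw() {
  image(img, 0, 0);                       // just a position
  // image(img, 0, 0, width/2, height/2); // optional width and height rescale it
}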


From the Camera

We are going to make a "Capture" object for pulling in the images from your webcam. This process is going to feel a little bit like setting up the serial port to listen to Arduino. First, because the camera hardware is different on every platform (e.g. Mac vs. PC), getting hold of the cameras is handled by libraries specific to those platforms. You have to import that library in the first line (or you can have the "Sketch>Import Library" menu do it for you). Luckily the correct Capture library for your platform is downloaded with Processing.

Next we declare a variable named “video” of type “Capture.”  This will behave just like the PImage object in the earlier example.  You should declare it at the top so it can be created in setup() and used in draw().

In setup() you should create the capture object and put it in a variable. By default it will probably use the webcam on your laptop. If you want to use another camera attached to your laptop, check how to specify that in the documentation of the Capture object. That will feel like picking the port in the Arduino example. Don't forget to kickstart the capture object by calling its start() function (this is newish to Processing, so you will see some examples without it).

Finally you want to show the images in draw(). You check to see if a new image is "available" and if so you "read" it. (Note this is similar to the polling method that I discouraged last week for checking for serial bytes coming from Arduino. Indeed you can use a callback with a captureEvent() function, like the old serialEvent(), but I don't find that it works well.) You use the image() function just as if it were a still image. From this point on, video is treated the same as a still image.
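Putting those steps together, a minimal capture sketch might look like this (assuming a 640x480 camera; see Capture.list() in the reference for picking a different one):

import processing.video.*;  // or use the "Sketch>Import Library" menu

Capture video;  // declared at the top so setup() and draw() can share it

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);  // probably your built-in webcam
  video.start();                        // don't forget the kickstart
}

void draw() {
  // polling style: only read when a new frame is waiting
  if (video.available()) {
    video.read();
  }
  image(video, 0, 0);  // from here on it is just an image
}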

From a Movie

You can also bring movies into Processing. This is a little less interesting to me than the camera because it is not interactive. You can see this example uses the callback style instead of the polling style to see when new frames come in. But as with live video capture, at the end of the day everything is just an image. You can look at the reference for more options.
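A minimal movie sketch might look like this ("clip.mov" stands in for your own file in the data folder):

import processing.video.*;

Movie myMovie;

void setup() {
  size(640, 480);
  myMovie = new Movie(this, "clip.mov");
  myMovie.loop();
}

// callback style: Processing calls this whenever a new frame is ready
void movieEvent(Movie m) {
  m.read();
}

void draw() {
  image(myMovie, 0, 0);  // at the end of the day, just an image
}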

Getting the Pixels Array from an Image

Images are not very computationally interesting. They are really take-it-or-leave-it, all-or-nothing entities with very little opportunity for interactivity. We want to break down beyond the frame into the pixels. Regardless of where your images came from (the internet, a file, a camera, a movie), in the draw loop they are all just images.

Behind every image there is an array of numbers with one integer (int) for each pixel of the picture. It is very easy to use dot notation to get at the array of pixels from inside the image object. We will use this notation to ask about a pixel color or to set a pixel color.

video.pixels[0] // the first pixel of the image called video
video.pixels[55]  //the 56th pixel of the image called video

One Number Per Pixel?  That’s Weird

You could describe grayscale pixels with one number, but for color pixels we typically use three separate numbers for red, green and blue. Luckily an "int" variable is big enough that it can hold four bytes of information: one byte for red, one for green, one for blue and one for alpha, or transparency. It would usually be a bummer that all the components of the color get packed into one number, but Processing comes to the rescue with some functions for packing and unpacking.

float r = red(video.pixels[55]);          // unpacks the red component of the pixel at index 55
float g = green(video.pixels[55]);        // unpacks the green component
float b = blue(video.pixels[55]);         // unpacks the blue component
float br = brightness(video.pixels[55]);  // unpacks the brightness
float h = hue(video.pixels[55]);          // unpacks the hue
float s = saturation(video.pixels[55]);   // unpacks the saturation

video.pixels[55] = color(255, 0, 255);  // packs a color with a lot of red and a lot of blue into the pixel at index 55

Load Pixels and Update Pixels

Before you start operating on the pixels of an image object you should call that object's loadPixels() function to make sure the pixels are fresh. On the other side, once you are done changing the individual pixels of an image object you should call updatePixels() to make sure the array refreshes the image when it is displayed. You can get away without doing this quite a lot.
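The bracketing looks like this, in a minimal sketch that inverts the camera image as a test:

import processing.video.*;

Capture video;

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
}

void draw() {
  if (video.available()) video.read();
  video.loadPixels();  // make sure the pixels[] array is fresh
  for (int i = 0; i < video.pixels.length; i++) {
    color c = video.pixels[i];
    video.pixels[i] = color(255 - red(c), 255 - green(c), 255 - blue(c));  // invert
  }
  video.updatePixels();  // refresh the image from the changed array
  image(video, 0, 0);
}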

Visit All the Pixels

There are a lot of pixels, so you will need a repeat loop to visit each pixel. It is a pretty powerful thing to be able to visit hundreds of thousands of pixels every 33 milliseconds (30 frames/second), ask each one a question (how bright?) and set it to a new color accordingly.
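For example, here is a sketch that asks every pixel how bright it is and pushes it to pure black or white (the 127 cutoff is an arbitrary middle value):

import processing.video.*;

Capture video;

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
}

void draw() {
  if (video.available()) video.read();
  video.loadPixels();
  for (int i = 0; i < video.pixels.length; i++) {
    // the question: how bright is this pixel?
    if (brightness(video.pixels[i]) > 127) {
      video.pixels[i] = color(255);  // bright enough, make it white
    } else {
      video.pixels[i] = color(0);    // too dark, make it black
    }
  }
  video.updatePixels();
  image(video, 0, 0);
}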

Try changing this code to look for edges by comparing each pixel to the previous.

Image Processing vs Computer Vision

As powerful as that repeat loop is, it is not good enough. The code above does image processing, and then the results are delivered to your eyeballs to interpret. We want the machine to do a little more interpretation itself, for instance separate an object from the background and tell me the position of the object in the scene. Rather than delivering the results only to your eyeballs, I want it to deliver the results as numbers that can then be used in your code for other things. For instance you could have an if statement that says if their x position is greater than 100, pour a virtual bucket of water on them. Or play a higher pitched note as your hand moves up. Your next goal should be to track the position of something in the camera's view.

Art School vs Homeland Security

You should contrive the physical side of your tracking situation as much as possible to make the software side as easy as possible. Computer vision is pretty hard if you are trying to find terrorists at the Super Bowl. But your users have a stake in the thing working and will play along. Rather than tracking natural things, it is okay to have the user wear a big orange oven mitt.

The Classic: Rows and Columns and Questions

Unfortunately the pixels are given to you in one long array of numbers, but we think of images in terms of rows and columns. Instead of a single for loop we will have a "for loop" inside a "for loop." The first loop will visit every column and the second will visit each row in the column (it is the same to do the reverse: visit each row and then each column in the row). Then you pull out the colors of the pixel and ask a question about them (more on the possible questions below). This "for loop" within a "for loop," extracting color components, followed by an if statement question, will be the heart of every computer vision program you ever write.
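As a skeleton it looks like this (the question at the center is just an example; swap in your own):

import processing.video.*;

Capture video;

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
}

void draw() {
  if (video.available()) video.read();
  video.loadPixels();
  for (int x = 0; x < video.width; x++) {      // visit every column
    for (int y = 0; y < video.height; y++) {   // visit every row in the column
      int loc = x + y * video.width;           // find this pixel's spot in the 1D array
      float r = red(video.pixels[loc]);        // extract the color components
      float g = green(video.pixels[loc]);
      float b = blue(video.pixels[loc]);
      // the if statement question, for example: more red than blue?
      if (r > b) {
        video.pixels[loc] = color(r, 0, 0);
      }
    }
  }
  video.updatePixels();
  image(video, 0, 0);
}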

Formula for Location in Pixel Array for a Given Row and Column

There is just one trick in there for finding the offset in the long linear array of numbers for a given row and column location. The formula is:

index = column + (row * width)

or in Processing terms: int loc = x + y * video.width;

You perform this formula every time you figure out how many people are in a theater. Assuming people filled all the seats in order, you would take the number of seats in a row, multiply by how many rows are full, and then add how many people are in the last, partially full row.

Comparing Colors

As we will see, after you extract the colors of a pixel you might want to compare it to a color you are looking for, to the color of the adjacent pixel, to the same pixel in the last frame, etc. You could just subtract and use absolute value to find the difference in color. But it turns out that in color space, as in real space, finding the distance as the crow flies between two colors is better done with the Pythagorean theorem. Coming to the rescue in Processing is the dist() function.
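For example, a little helper function (colorDist is a made-up name, not part of Processing):

// straight-line distance between two colors, treating (red, green, blue)
// as a point in three-dimensional space
float colorDist(color c1, color c2) {
  return dist(red(c1), green(c1), blue(c1),
              red(c2), green(c2), blue(c2));
}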

Things You Can Ask Pixels

Color Tracking: Which Pixel Is Closest to a Particular Color?

In this code we will use three variables: one to keep track of how close the best match so far is, and two more to store the location of that best match.

You click on a pixel to set the color you are tracking. Notice we used the formula again in mousePressed() to find the offset in the big long array for the particular row and column you clicked on. Remember it is usually better to insert some patch of unnatural color, say a brightly colored toy, into your scene than to make software that can be very discerning of nature.
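Here is a sketch along those lines (my reconstruction, in the spirit of the color tracking example in Learning Processing; worldRecord starts at 500 because no two colors can be farther apart than about 442):

import processing.video.*;

Capture video;
color trackColor;  // the color we are hunting for

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
  trackColor = color(255, 0, 0);  // start out looking for red
}

void draw() {
  if (video.available()) video.read();
  image(video, 0, 0);

  video.loadPixels();
  float worldRecord = 500;  // variable 1: how close the best match so far is
  int closestX = 0;         // variables 2 and 3: where that best match lives
  int closestY = 0;

  for (int x = 0; x < video.width; x++) {
    for (int y = 0; y < video.height; y++) {
      int loc = x + y * video.width;
      color current = video.pixels[loc];
      float d = dist(red(current), green(current), blue(current),
                     red(trackColor), green(trackColor), blue(trackColor));
      if (d < worldRecord) {  // a new winner
        worldRecord = d;
        closestX = x;
        closestY = y;
      }
    }
  }

  fill(trackColor);
  ellipse(closestX, closestY, 16, 16);  // mark the winning pixel
}

void mousePressed() {
  // the offset formula again: column + row * width
  trackColor = video.pixels[mouseX + mouseY * video.width];
}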


Smoother Color Tracking: What Is the Average Location of Pixels That Are Close Enough to a Particular Color?

The key part of this code is that I am not looking for a single pixel but instead using a threshold to find a group of pixels that are close enough. In this case I will average the locations of all the pixels that were close enough. In the previous example a bunch of similar pixels would win the contest of being closest in different frames, so it was kind of jumpy. Now the whole group of close ones wins, so it will be smoother.

Another thing to notice in this code is the keyPressed() function. That is used to change variables, especially the threshold, dynamically while the program is running. This is important in computer vision because light conditions can change so easily.

This part is not very important, so feel free to skip it, but I am using a timer to test the performance. As fast as computers are these days, you will really be testing them with computer vision, where you need to visit millions of pixels per second. If you add a third repeat loop in there you will really need to start checking performance. We have seen this method before: use a variable to store millis() and then check millis() against that variable to see how much time has passed. There are about 33 milliseconds in a frame of video at 30 frames/second.

Okay here is some smoother tracking.
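Something like this (a sketch, including the keyPressed() threshold nudging and the millis() timer; the starting threshold of 40 and the '+' and '-' keys are arbitrary choices):

import processing.video.*;

Capture video;
color trackColor;
float threshold = 40;  // how close counts as "close enough"
int startTime;

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
  trackColor = color(255, 0, 0);
}

void draw() {
  startTime = millis();  // store millis() so we can time this frame
  if (video.available()) video.read();
  image(video, 0, 0);

  video.loadPixels();
  float sumX = 0;
  float sumY = 0;
  int count = 0;  // how many pixels qualified

  for (int x = 0; x < video.width; x++) {
    for (int y = 0; y < video.height; y++) {
      int loc = x + y * video.width;
      color current = video.pixels[loc];
      float d = dist(red(current), green(current), blue(current),
                     red(trackColor), green(trackColor), blue(trackColor));
      if (d < threshold) {  // close enough: join the group
        sumX += x;
        sumY += y;
        count++;
      }
    }
  }

  if (count > 0) {
    fill(trackColor);
    ellipse(sumX / count, sumY / count, 24, 24);  // the group's average location
  }

  // the budget is about 33 ms per frame at 30 frames/second
  println("frame took " + (millis() - startTime) + " ms");
}

void keyPressed() {
  // nudge the threshold while running, because light conditions change
  if (key == '+') threshold++;
  if (key == '-') threshold--;
}

void mousePressed() {
  trackColor = video.pixels[mouseX + mouseY * video.width];
}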


Looking for Change: How Does This Pixel Compare to the Background?

For this we need to store another set of background pixels to compare the incoming frame to. If you use the previous frame as the background pixels you will be tracking change. If you set the background in setup() or in mousePressed() (as in this example) you will be doing background removal.
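A sketch of the background removal version might look like this (click to memorize the current frame as the background; the threshold of 50 is a guess to tune):

import processing.video.*;

Capture video;
PImage background;     // the stored pixels to compare against
float threshold = 50;

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
  background = createImage(640, 480, RGB);  // starts out all black
}

void draw() {
  if (video.available()) video.read();

  video.loadPixels();
  background.loadPixels();
  loadPixels();  // the sketch window's own pixels array

  for (int i = 0; i < video.pixels.length; i++) {
    color fg = video.pixels[i];
    color bg = background.pixels[i];
    float d = dist(red(fg), green(fg), blue(fg),
                   red(bg), green(bg), blue(bg));
    // different enough from the background? call it foreground
    if (d > threshold) {
      pixels[i] = fg;
    } else {
      pixels[i] = color(0);
    }
  }
  updatePixels();
}

void mousePressed() {
  // memorize the current frame as the background
  background.copy(video, 0, 0, video.width, video.height,
                  0, 0, video.width, video.height);
}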


Skin Rectangles: Is This Pixel Skin Colored and Is It Part of an Existing Group of Pixels?

Okay, this example combines two new things. The most important is that we are looking for multiple groups of pixels, so finding a single average for the whole frame, as we were doing, won't be enough. Instead we are going to ask each pixel first whether it qualifies against a threshold, and second whether it is close to an existing group or needs to start a new group. This could be used for pixels that qualify for things other than skin color; for example, the pixels could qualify for being bright or for being part of the foreground.

This example also happens to be about skin color. It is a heartwarming fact that regardless of race we are all pretty much the same color, because we all have the same blood. We are, however, different brightnesses. Because brightness is not a separable notion in the RGB color space, we use a "normalized" color space where we look for percentages of red and green (the third color will just be the remaining percentage).

Okay, here is the code in two parts. There is also a rectangle object for holding the qualifying pixels.
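Here is a rough sketch of the idea. The Blob class below is a stand-in for the rectangle object mentioned above; the loop strides by 4 pixels for speed, and the normalized red and green ranges are rough numbers to tune, not gospel:

import processing.video.*;

Capture video;
ArrayList<Blob> blobs = new ArrayList<Blob>();

void setup() {
  size(640, 480);
  video = new Capture(this, 640, 480);
  video.start();
}

void draw() {
  if (video.available()) video.read();
  image(video, 0, 0);

  blobs.clear();  // start the groups fresh each frame
  video.loadPixels();

  for (int x = 0; x < video.width; x += 4) {     // stride to save time
    for (int y = 0; y < video.height; y += 4) {
      color c = video.pixels[x + y * video.width];
      float r = red(c);
      float g = green(c);
      float total = r + g + blue(c);
      if (total == 0) continue;  // skip pure black to avoid dividing by zero
      // normalized color: what percentage of this pixel is red? green?
      float pr = r / total;
      float pg = g / total;
      // question 1: does it qualify as skin colored?
      if (pr > 0.38 && pr < 0.55 && pg > 0.25 && pg < 0.37) {
        // question 2: close to an existing group, or start a new one?
        boolean joined = false;
        for (Blob b : blobs) {
          if (b.isNear(x, y)) {
            b.add(x, y);
            joined = true;
            break;
          }
        }
        if (!joined) blobs.add(new Blob(x, y));
      }
    }
  }

  noFill();
  stroke(0, 255, 0);
  for (Blob b : blobs) b.display();
}

// a rectangle that grows to swallow the qualifying pixels near it
class Blob {
  float minX, minY, maxX, maxY;

  Blob(float x, float y) {
    minX = maxX = x;
    minY = maxY = y;
  }

  boolean isNear(float x, float y) {
    return x > minX - 20 && x < maxX + 20 &&
           y > minY - 20 && y < maxY + 20;
  }

  void add(float x, float y) {
    minX = min(minX, x);
    minY = min(minY, y);
    maxX = max(maxX, x);
    maxY = max(maxY, y);
  }

  void display() {
    rect(minX, minY, maxX - minX, maxY - minY);
  }
}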


Other Examples:

Face detection

Kinect

Network Camera

Related reading:
Learning Processing, Chapters 15-16
Assignment

Pixels Project

Track a color and have some animation or sound change as a result.  Create a software mirror by designing an abstract drawing machine which you color according to pixels from live video.
Create a video player. Consider combining your pcomp media controller assignment and build a Processing sketch that allows you to switch between videos, process pixels of a video, scrub a video, etc.
Use the Kinect to track a skeleton. Can you "puppeteer" an avatar/animation with the Kinect?
