shahar @ itp

data viz

wikipedia exploration tool

For my Expression Frameworks final project, I decided to work on a Wikipedia exploration/free-association/network browsing tool, continuing what I had started in the last 4in4. Some examples exist, but they are mostly concerned with the large-scale network structure. The closest thing to what I’m imagining would be wikimindmap, but it’s not quite what I’m looking for.

Last time my project stopped short of the goal because of the performance problems I encountered while trying to work with the API live – it was just too slow for my purposes. This time I’ll be using the Wikipedia link dataset, which should make things much faster.

Right now I’m looking into database options for holding this massive (massive for me, at least) dataset. I initially thought about using a NoSQL DB such as Redis and the like, but one thing I forgot was that a plain vanilla database won’t give me even the most basic graph features without some kind of workaround. For example, if your DB has a page as a key and the pages it links to as values, you would also need to store the reverse mapping (page -> pages that link to it) to answer “what links here?” without scanning every key. Graph databases solve that problem.
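To make the workaround concrete, here is a minimal in-memory sketch in Python. It is not tied to Redis or any other store; the `LinkIndex` class and its method names are made up for illustration. The point is only that every link has to be written twice, once in each direction, if both lookups are to be cheap.

```python
from collections import defaultdict


class LinkIndex:
    """Toy stand-in for a key-value store holding Wikipedia links (hypothetical)."""

    def __init__(self):
        # page -> pages it links to
        self.outgoing = defaultdict(set)
        # page -> pages that link to it (the reverse mapping)
        self.incoming = defaultdict(set)

    def add_link(self, source, target):
        # Each link is stored in both directions so neither lookup needs a scan.
        self.outgoing[source].add(target)
        self.incoming[target].add(source)

    def links_from(self, page):
        # .get() avoids creating empty entries for unknown pages.
        return set(self.outgoing.get(page, ()))

    def links_to(self, page):
        return set(self.incoming.get(page, ()))


index = LinkIndex()
index.add_link("Graph theory", "Leonhard Euler")
index.add_link("Königsberg", "Leonhard Euler")
index.add_link("Leonhard Euler", "Königsberg")

print(index.links_from("Leonhard Euler"))
print(index.links_to("Leonhard Euler"))
```

With a real key-value store the two dictionaries would become two sets of keys, and keeping them in sync on every write is exactly the bookkeeping a graph database takes care of.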

Greg pointed me to Twitter’s recently open-sourced FlockDB, which seems very cool but doesn’t have a built-in RESTful API (I’ll be using JavaScript + canvas/processing.js), lacks graph traversal features, which I might need in the future, and generally looks like overkill that would be a real pain to install/deploy. Neo4j, on the other hand, is a true graph database, is simpler to get started with and offers a RESTful API, so I’m planning to start with that.


dataviz data sources & schema

  1. The ER database
  2. World Database of Happiness: Nation, Satisfaction with Life (1-10 scale). Might be interesting to combine with the CIA World Factbook.
  3. Freebase: An entity graph of people, places and things. Not quite sure it is possible to represent with a simple schema. The Wikipedia link dataset is also similar.
  4. Metafilter Infodump – postid, userid, datestamp, category, comment count, favorites count, deletion status & reason.
  5. Activity/experience sampling self-study (using an iPhone app?) – Similar to, but not quite like, daytum, mappiness or your.flowingdata. The idea is to use the iPhone to do what is done in “beeper studies” (similar to what Csikszentmihalyi did in his Flow experiments) – basically, you sample what a person is doing and/or how they’re feeling at random times, and given enough samples you may be able to find some interesting correlations. The simplest schema for the data would have Time Sampled and Activity. Mood could be an additional column.
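The simple schema from item 5 could be sketched roughly like this, using Python’s built-in sqlite3 module. The table name, the column names and the 1-10 mood scale are all assumptions for illustration, as are the helper functions; nothing here comes from an existing app.

```python
import sqlite3


def create_samples_table(conn):
    # Hypothetical schema: Time Sampled, Activity, plus an optional Mood column.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS samples (
               time_sampled TEXT NOT NULL,  -- ISO 8601 timestamp of the "beep"
               activity     TEXT NOT NULL,
               mood         INTEGER         -- assumed 1-10 scale, may be NULL
           )"""
    )


def record_sample(conn, time_sampled, activity, mood=None):
    conn.execute(
        "INSERT INTO samples (time_sampled, activity, mood) VALUES (?, ?, ?)",
        (time_sampled, activity, mood),
    )


def average_mood_by_activity(conn):
    # Samples without a mood rating are left out of the average.
    rows = conn.execute(
        "SELECT activity, AVG(mood) FROM samples "
        "WHERE mood IS NOT NULL GROUP BY activity"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
create_samples_table(conn)
record_sample(conn, "2010-05-01T10:42:00", "reading", 7)
record_sample(conn, "2010-05-01T18:05:00", "commuting", 4)
print(average_mood_by_activity(conn))
```

Grouping mood by activity is about the simplest correlation you could look for; with enough samples, time of day would be the next obvious thing to group by.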

Other interesting sources I’ve stumbled upon: NYC Data Mine, Infochimps, Data Wrangling Blog, Numbrary.