Margaret McKenna

Reading the News Anew

This project presents both a visualization of how the New York Times distributes its international news coverage by region and a novel interface for exploring the news.

http://mlmckenna.com/news/

Classes
Introduction to Computational Media


This project analyzes the frequency and length of news articles published in the "World" section of The New York Times in 2010. Articles are visualized as circles and animate through time, in views of full month, topic category within a month, and single article.

The month view gives the user an idea of the cumulative regional news coverage, as well as the ability to pick out some significant news events (in January, for example, the Americas have a visibly dense cluster of articles, most of which pertain to the Haiti earthquake). As the animation continues, the month view morphs to display only articles in particular topic categories (Political, Economic, Social, War, and Disaster).

Individual articles can be explored by mousing over each article's circle.

Though this project began as a way to quantify regional news coverage, it has become an interesting way to read the news. The ability to quickly mouse over many articles and see their titles allows for broader exploration of the news than a traditional page format where space is limited to 20 or so articles. In the month view, you can select from more than 500 articles and quickly understand something about the content of the article, the context of the article (what else was going on internationally at the time), and the significance of the article (is it front page news or a few lines in the "World" section that will be unnoticed by most readers).

The project also raises interesting questions: Do the frequency and depth of reporting reflect an actual bias in coverage, or are they simply an objective reflection of where "news" happens? Future work might include comparing this view of the Times to a similar one from the Guardian, comparing the source of articles across regions (e.g. is the article from the AP, Reuters, or the Times), and analyzing historical data.


User Scenario
The visualization animates through time and topics, and will continue to do so without user intervention. In order to stop the animation at any point, press the space bar. Mousing over the circles will reveal the title of the article. Clicking on the highlighted circle will take you to the full article on the New York Times website.

Implementation
Visualization technique:

Each article is displayed as an ellipse, and the area of the ellipse is determined by the number of words in the article. In this way the difference between a 100-word regional briefing and a front-page story is emphasized. The visualization suggests that there are regional tendencies in both frequency and depth of reporting by the Times.
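A minimal sketch of the area-based sizing described above, not the project's actual code. The function name and the areaPerWord constant (square pixels per word) are assumptions chosen for illustration. The point is that the diameter must grow with the square root of the word count for the area to stay proportional to it.

```javascript
// Hypothetical helper: returns the diameter of a circle whose AREA is
// proportional to an article's word count.
// areaPerWord is an assumed tuning constant (square pixels per word).
function diameterForWordCount(wordCount, areaPerWord) {
  var area = Math.max(0, wordCount) * areaPerWord;
  // area = PI * r^2, so r = sqrt(area / PI) and diameter = 2 * r
  return 2 * Math.sqrt(area / Math.PI);
}
```

In a Processing-style sketch the result would be passed as both the width and the height of the ellipse. Under this scaling, an article with four times as many words is drawn twice as wide, not four times as wide.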

Data methodology:

The data for this project was culled from The New York Times Article Search API, using the following query terms:

-Query terms: nytd_section_facet:[World]&begin_date=20100101&end_date=20101215
-Returned fields: title, url, geo_facet, nytd_geo_facet, nytd_des_facet, des_facet, classifiers_facet, word_count, byline, date, day_of_week_facet, small_image, page_facet, source_facet
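
The query above could be assembled into a request URL along the following lines. This is a sketch under stated assumptions: the base URL and key passed in are placeholders, and the "fields" and "api-key" parameter names are assumptions about the API version used. Only the query, begin_date, and end_date values come from the project.

```javascript
// Sketch: builds a search URL from the query terms listed above.
// baseUrl and apiKey are placeholders supplied by the caller; the
// "fields" and "api-key" parameter names are assumptions.
function buildSearchUrl(baseUrl, apiKey, params) {
  var parts = [
    "query=" + encodeURIComponent(params.query),
    "begin_date=" + params.beginDate,
    "end_date=" + params.endDate,
    "fields=" + encodeURIComponent(params.fields.join(",")),
    "api-key=" + encodeURIComponent(apiKey)
  ];
  return baseUrl + "?" + parts.join("&");
}

var exampleUrl = buildSearchUrl("https://example.invalid/search", "YOUR_KEY", {
  query: "nytd_section_facet:[World]",
  beginDate: "20100101",
  endDate: "20101215",
  fields: ["title", "url", "word_count", "des_facet", "classifiers_facet"]
});
```

Encoding the query value keeps the colon and square brackets from being misread as URL syntax.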

From a raw total of 10,927 articles, the following categories of articles were dropped from the sample set:

-Names of the Dead (197)
-Slideshow/Interactive pieces (2,529)
-Duplicates by same title, different url (119)
-Duplicates by same title, different publication date (583)
-Articles with no regional identification, as determined by title and classifiers_facet fields (1,275)
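
A rough sketch of how the exclusions above might be applied, assuming each article is an object with hypothetical title, isSlideshow, and region properties (these names are illustrative, not the API's). Duplicates are detected by title alone, keeping the first occurrence. The final lines work out the sample size that would remain if the five counts above do not overlap, which is also an assumption.

```javascript
// Sketch only: the properties title, isSlideshow and region are hypothetical.
function cleanArticles(articles) {
  var seenTitles = {};
  return articles.filter(function (a) {
    if (/^names of the dead/i.test(a.title)) return false; // casualty lists
    if (a.isSlideshow) return false;                        // slideshow/interactive
    if (!a.region) return false;                            // no regional identification
    var key = a.title.trim().toLowerCase();
    if (Object.prototype.hasOwnProperty.call(seenTitles, key)) return false; // duplicate title
    seenTitles[key] = true;
    return true;
  });
}

// Remaining sample size, ASSUMING the five dropped groups are disjoint.
var dropped = [197, 2529, 119, 583, 1275];
var remaining = 10927 - dropped.reduce(function (sum, n) { return sum + n; }, 0);
```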

Regional groupings are those of The New York Times editorial staff, and were determined from parsing the "classifiers_facet" and the title of non-classified articles (e.g. "WORLD BRIEFING | AMERICAS").
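
For titles that follow the briefing convention, the region can be pulled out with a small pattern match. The sketch below is illustrative: the function name is hypothetical, and the assumption that a colon or semicolon ends the region label is a guess about the title format, not something confirmed by the data.

```javascript
// Hypothetical helper: extracts the region label from a briefing-style title,
// e.g. "WORLD BRIEFING | AMERICAS". Returns null when the title does not match.
// Assumes a colon or semicolon, if present, ends the region label.
function regionFromTitle(title) {
  var match = /^\s*WORLD BRIEFING\s*\|\s*([^;:]+)/i.exec(title);
  return match ? match[1].trim().toUpperCase() : null;
}
```

Articles for which this returns null and which also lack a usable "classifiers_facet" value would fall into the "no regional identification" group described above.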

Each topic category (Political, Social, Economic, War, Disaster) was determined by grouping over 1,000 unique tags from the "des_facet" field in the Times API. With this method, 81% of articles could be filed in one of the five high-level categories. (1,029 articles were without des_facet.)
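
The tag grouping can be pictured as a lookup table from "des_facet" tags to the five categories. The sketch below is a simplified illustration: the four tags shown are made-up examples standing in for the 1,000-plus real ones, and taking the first tag that matches is an assumption about how articles with several tags were resolved.

```javascript
// Illustrative mapping only: these tags are examples, not the project's table.
var TAG_TO_TOPIC = {
  "ELECTIONS": "Political",
  "EARTHQUAKES": "Disaster",
  "INTERNATIONAL TRADE": "Economic",
  "EDUCATION": "Social"
};

// Returns the topic of the first recognised tag, or null if the article has
// no des_facet tags or none of them is in the mapping.
function topicForArticle(desFacet) {
  if (!desFacet || desFacet.length === 0) return null;
  for (var i = 0; i < desFacet.length; i++) {
    var tag = String(desFacet[i]).toUpperCase();
    if (Object.prototype.hasOwnProperty.call(TAG_TO_TOPIC, tag)) {
      return TAG_TO_TOPIC[tag];
    }
  }
  return null;
}
```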

We perform regression analysis with fixed effects for each unique day of news coverage in order to determine whether each continent's average articles per day and words per article differ from the overall average across continents. This analysis is performed across all data and within each topic category. Detailed results from this analysis are available upon request.


Conclusion
I learned lessons on two fronts: 1) technology and 2) data.

On the technology front, I chose to use processing.js because it loads quickly and easily in all HTML5-enabled browsers and thus increases the likelihood that users will interact with it. (Alternatively, I could have used Processing and created a Java applet, but the poor user experience of waiting for the applet to load might keep some people from interacting with it.) The downside of using processing.js is that it is a relatively new port of Processing and thus has very poor documentation. Much of my initial effort was spent figuring out the library and how to integrate it with the features of JavaScript I wanted to use. Another hurdle was browser performance. In my initial code the CPU usage was so great that I had to disable some of the interactive elements of the visualization in order to prevent a browser crash. To improve performance and increase interactivity, I handed off "processing-like" tasks to JavaScript where I could (e.g. the highlighted areas in the yearly view are created using HTML elements and CSS), limited the use of the draw function where possible (the year view is drawn once), and improved the method by which article objects were stored and referenced. In all, this was a great lesson in JavaScript performance issues, interactivity vs. performance trade-offs, and usability (some performance-heavy tasks had to stay because without them the experience would be notably less usable).

In addition, there were a number of lessons learned on the data front. For one, lots of publicly available data (including articles made available through the NYT API) are curated by hand and thus don't have the consistent taxonomies and attributes that would allow for more "perfect" or objective analysis of the data. During this project, many decisions had to be made about how to classify and group the articles (e.g. geo data for the articles might be a city or state with no relationship to a country, or a country with no relationship to a region, so we relied on a mixture of "classifier" data and article naming conventions like "WORLD BRIEFING|AMERICAS"). Similarly, many "articles" returned in the search were not articles as we think of them: there were slideshows, infographics, and lists of "Names of the Dead" (soldiers killed in war) that had to be expunged from the data. The process of doing this (and further analysis on keyword tagging of the articles) made me think critically about the data being displayed to me in other visualizations that I see. As data visualization becomes a more common means of explaining the world, it's important that visualizations come with a rigorous explanation of the data methodology used, or else one may come away with an impression that does not do justice to the raw data.