R code for processing US Bureau of Economic Analysis Input Output Tables

I've just posted to github some R code that reads in the make and use tables generated by the US Bureau of Economic Analysis.  As a warning, I'm not an economist by training, so people who find this useful are highly encouraged to look through the source code and comment below on any problems. This represents my best attempts to understand and apply the Concepts and Methods of the U.S. Input-Output Accounts manual (starting at page 12-21). 

There are a few known issues, and things I'm not entirely clear on.  For example, the U and V matrices in the code do not contain scrap and noncomparable imports, since including these would result in rectangular matrices and (it seems to me) that the calculations performed on these later require square matrices.  Additionally, the BEA documentation, in working through the formulas of input-output analysis, derives several identities, which I used to check the calculations. These numbers match up except for one element in the vector (see part on qError and gError), and I haven't had time to track this down.  It may be related to the scrap and noncomparable imports data.

Despite these small warts, the code is interesting since it downloads the Excel spreadsheets straight from the BEA site, extracts and cleans the relevant bits of data, and populates the variables needed to do basic input-output analysis.  With the BEA spreadsheets, each year is represented as a separate sheet, and by specifying the year you are interested in (1998 through 2009), the code will access the data on that sheet.  This also means that if you want to construct time series data from the make and use tables, you just have to call the function I defined within a loop, and then merge the data together in whatever form you need.

For people interested in network analysis, this code is useful since it gives you data on interdependencies between industries and commodities in the US economy for more than a decade.  The code essentially creates adjacency matrices which can then be used directly with libraries such as igraph.

Beer, Cephalopods and Coconuts: Using R to explore what the FAO knows about what different countries eat

This post is an exploration of how to deal with data that has a high number of dimensions, in particular FAO data that describes the amount of calories per capita per day for over 110 different food groups.  This is a pretty incredible data set as it essentially describes the diverse types of diets that people around the world have.  The challenge in exploring this data is that people in different countries don't have diametrically opposed types of diets, so it's a bit difficult to say how similar or how different they are, especially given the diversity of food types.

The first tool used is the t-SNE package for R, which I discovered after finding an interesting example that used it to help characterize mutual funds. With the data we're looking at, if there were only three types of food groups, it would be easy to visualize the differences in diet, just by making a 3d scatter plot where each point was a country, and the axes each represented a different food group.  However, we're dealing with 110 different dimensions, which means that we need to reduce those down to a more manageable number.  With the t-SNE algorithm, we are able to reduce those dimensions down to two.  As seen in the image below, the algorithm works reasonably well, and there are clear clusters around Western Europe, the Caribbean, and southeast Asia among others.  Essentially, this demonstrates that it's possible to extract some information about geography just by looking at what's on people's dinner plates.  At the same time, strange examples can be found, such as Lithuania being grouped next to Chile.  It isn't immediately clear if this is incorrect, since there is a chance they may have similar diets.  The x and y dimensions of the image don't really mean anything.  The main thing to pay attention to are the clusters and how close different clusters are to each other.


Clustering of countries based on similar diets, using FAO data

While the image gives us a rough idea about the similarity of diets, it would be interesting to look a bit deeper into all 110 different food types used to created this clustering.   The image below uses the same layout as the one above, except that the size of the dot represents the amount of calories per capita per day for that country.  Corresponding this to the image above, we see that maize is quite popular in Central America, while rice is popular in the Caribbean, Africa, and Asia.  Admittedly, this isn't the most easy to read representation if we're trying to trace back how much of what each country eats, but probably best viewed as a test of how well the algorithm worked in reducing the huge number of dimensions down to something that we could visualize.


Comparing consumption of maize vs. rice for countries

Below is an image you can click to enlarge showing all of the categories, instead of just the two shown above.  The size of the dot represents the actual number of calories, so by summing up the size of the dots for a country, you could deduce the total amount of calories consumed.  What we can clearly see is that there are roughly ten categories of food that provide most of the calories for people's diets.

 A slightly different view is given below where the amount of calories is normalized for each food group.  In other words, the largest dots represent the countries which consume the most amount of calories for that group.  This also gives an indication of how the consumption of a type of food varies among countries.  As can be seen, for some categories, everyone eats the same amount, while for certain categories, one country eats much more than anyone else.

The images above are an illustration of the entire data set that the t-SNE algorithm had to work with. For the grid of food types shown, there does indeed seem to be reasonable clustering for many of the types.  At the same time, I need to do more work to really evaluate what this means, and to understand if the algorithm would have to be turned using different settings to get better results.  In the meantime, below we see what can be achieved by using the same visualization techniques as above, but overlaying their locations on a map.  The image below shows why interpreting the image at the top of this post is difficult.  Both maize and rice are consumed to a large degree in both Central America and West Africa, but Southern Africa focuses on Maize, while Asia prefers rice. 

Comparison of consumption of maize vs. rice, mapped

 

In the next graphs, the values for each food category are normalized, so it's easier to see which countries consume the most of one category, relative to the rest of the world.  From this we see that Europe leads the world in calories from beer: 

 
Cephalopods are popular in the Mediterranean, Japan, and Korea:
 
 Coconut Oil is popular in the tropics, but also strangely in some countries of West Europe:
 
 
Tea is actually more popular in South America than Britain or India:
 

 
 ...and finally, New Caledonia leads the world by a large margin in their consumption of animal organs.
 
 
Below is a ginormous picture showing the graphs generated for all food categories:
 

For those interested, I have included the R code used to perform this analysis and generate the graphs up on my github repository. The code also includes some preliminary work using the igraph library to visualize how the physical layout generated by the tsne algorithm matches the probability matrix generated by it.

Farewell Botnet Blog Readers

The recent shutdown of a large botnet seems to have had a large effect on the population of unknown readers of this blog.  I already knew something weird was happening when certain ip addresses would scan all the category links in rapid succession or keep hitting the pages throughout the day...

...however, my latest blog stats shows something absolutely astounding:

 Blog stats

It appears that my loyal bot following has been vanquished to the bit bucket.

Visualizations for TEDxRotterdam

Recently one of my colleagues, Igor Nikolic, was fortunate to be invited to speak at TEDxRotterdam where he gave a great talk on how we at TU Delft use insights from Complex Systems Theory in order to deal with complex socio-technical systems.

 

 

The movies used in this presentation come from my work in exploring how visualization techniques can help us to better understand complex systems.  Below the behind-the-scenes details regarding the creation of these is discussed for those interested in making their own visualizations, or understanding the data sources used.

The movies with the spinning globes were built using the Shapes3D library for the Processing visualization package, based on the example here. The Shapes3D library is great and I was able to create the visuals with a surprisingly small amount of code.  However, I wasn't able to figure out how to anti-alias the frames, so I just blurred them slightly with the with the imagemagick utility to get rid of the graininess.  This seemed to make things look more realistic, just as the atmosphere slightly blurs the surface of the earth.

The first movie uses as a base image the classic earth at night images by NASA, which shows how ubiquitous electricity consumption is around the globe.

 

 

The second movie overlays the NASA image with data from the Carbon Monitoring for Action site (in red dots), which contains information on locations, emissions and power output for over 50,000 power plants around the world.  What's interesting with this movie of power plants, is that you can see the different strategies that countries have undertaken in power generation.  For example, countries like France do not seem to have many power plants, but they also rely on large-scale nuclear power plants.  Other countries like Denmark and Germany seem to have a much more distributed portfolio of power plants.

 

 
 
The third movie uses the same frames as the second movie, but now adds data from the USGS Mineral Resources Data System (in yellow dots) which describes metallic and nonmetallic mineral resources throughout the world. It's quite a rich dataset containing several hundred thousand locations where people have dug into the ground for all sorts of things.  The data for the US is quite extensive even to the point of including gravel pits.  A very cool aspect of this dataset is that it actually lists the date of discovery/production for quite a large number of mines, meaning that you can see things like when the Spanish Conquistadors entered into South America, and the gold rush in California.
 
 
 

The next movie was built using the excellent Prefuse visualization package, and is meant to demonstrate the concept of intractability, which has been defined as "problems that can be solved, but not fast enough for the solution to be usable"(1). The white line represents the path that you actually take through life, and the nodes represent decisions made which can alter your path.  Every time you make a decision, you leave behind a whole realm of future possibilities that you could have lived (all the red branches are where you married someone else, pursued a different career, etc.).  This means that the future is hard to predict because of these combinatorial explosions.  The fastest way to predict the future is to just live through it.

 

 

What's interesting about this movie is that the frames are actually played backwards from the order in which they were rendered, due to the intractable nature of the layout algorithm used.  In other words, the only way to make this movie was to wait for the future to happen and then play backwards the path we used to get there. To explain, this figure uses a force-directed layout, where nodes push away from each other and edges bring them together like springs.  Essentially, this is a physics simulation.  The problem is that there's no way to easily predict/calculate where to add the nodes to the outside so that the forces are evenly balanced so that the inner nodes aren't pushed around, causing the whole figure to tremble in a rather distracting manner.  If you place the new node too far out, then it will drag the inner nodes outward.  If you place it too near, it will push the inner nodes further inwards.  To get around this problem, I first drew all the nodes and waited 30 minutes for all the forces to balance out and reach an equilibrium.  From there I started subtracting nodes randomly from the outside, until only a single node was left.

This last movie shows the evolution of the wiki at TU Delft since late 2004. Just like the intractability visualization above, this was done in Prefuse and uses a force-directed layout.  This layout works particularly well for complex network structures since it allows for the visualization to self-organize, whereby tightly interconnected pages migrate to the center, while nodes with few links are pushed to the outside.  For the colors of the nodes, red indicates inactive pages, while yellow means that edits or views are happening on that page. 

 

 

If you look closely at this movie, you can see breaks for weekends and summer vacations.  Blobs at the outside are likely notes that people made during conferences, or classes where students kept their notes and reports online.  This wiki actually started on a computer under Igor's desk, and once it became a part of the university ICT infrastructure in early 2009 you can see a significant flurry of activity.  Much of the activity on the outside is likely Google and other search engines indexing the site, while on the inside we see the adoption of the wiki by new groups across the university. 

Data Mining OpenEI.org, The US Department of Energy's Semantic Wiki

What will be shown here is an analysis of renewable energy trends in the US, using data extracted live from a US Government wiki that anyone can contribute to.

Most people are quite aware of the concepts of wikis, and their use in allowing people to collaboratively create content.  However, the problem with wikis is that their information moves at the speed at which people can read.  Search engines do help to locate content, but when it comes to people digesting information, they are still limited to reading the contents of a single page at a time.  Related information spread over multiple pages can only be gathered by going to each and every page.

Currently, there are new exciting extensions being used with wikis that allow for them to use Semantic Web standards.  These semantic wikis allow for a hybrid approach, where people can add plain text, just like on "traditional" wikis, but they also have the option to semantically annotate information.  This means that annotated information spread over multiple wiki pages can be queried in a similar manner as a database.  This semantically annotated information can be the types of information that is commonly placed in infoboxes on Wikipedia.  For example, a look at the pages on New York City or Paris shows infoboxes that contain a wealth of structured data, such as the population and the mayor.  Currently on Wikipedia, you can't ask questions like "who are the mayors of the world's 20 largest cities".  With semantic wikis, these types of questions can easily be answered if these types of facts are annotated.

A very prominent example of a semantic wiki is Open Energy Info, which is sponsored by the U.S. Department of Energy and developed by the National Renewable Energy Laboratory.  Browsing through the site reveals that many categories of things are documented, such as Energy Generation Facilities, which cover renewable energy generation facilities.  Clicking on one of these such as the Bear Creek Wind Farm reveals that it has an infobox set up for it, as shown below.

What's interesting about this is that 1) it's structured data, and 2) all of the other pages in these categories have these infoboxes as well, which allows us to examine the historical development of renewable energy in the US, solely based on information spread out over more than 1200 wiki pages.  

This next section goes deep into the more advanced features behind the scenes of the OpenEI wiki, in an effort to demonstrate what is currently possible when publicly available open resources are published using Semantic Web technologies.  A less technical option is available in the form of the more limited inline queries, that can also be conducted on the site.

The power of this site is that information is not locked into a single view.  OpenEI has provided a SPARQL endpoint where developers can write queries using the SPARQL language, which is somewhat similar to the SQL language in widespread use for querying databases.  With the query below, we get a table of the generating capacity of renewable energy power plants, along with their commercial online date:

select * where {
?plant <http://openei.org/resources/Property-3AGeneratingCapacity> ?capacity .
?plant <http://openei.org/resources/Property-3ACommercialOnlineDate> ?onlineDate  .
}

With a little cleanup of the results, the chart below was created:

 

 

This is a start, and shows that many renewable energy plants where constructed in the 1980's, with another large peak in the past five years.  However, it's a bit cluttered, and if two plants came online at the same year with the same capacity, then it would be impossible to distinguish them in the plots.  It would be much more useful to be able to see the total capacity added per year.  The query below is similar to the previous one, except that it sums the capacity added per date. Since the dates listed for these facilities all begin with January 1st, we know that this query will give us the total capacity added every year.

select sum(?capacity) as ?totalCapacity ?onlineDate where {
?plant <http://openei.org/resources/Property-3AGeneratingCapacity> ?capacity .
?plant <http://openei.org/resources/Property-3ACommercialOnlineDate> ?onlineDate .
} group by ?onlineDate order by ?onlineDate

 

 

This actually shows some surprising results given the media dialogue around politics in the US. For example, Ronald Reagan was known to be unfavorable towards renewable energy and notably removed the solar hot water heating system from the White House which was installed by Jimmy Carter.  According to the current OpenEI data, there is actually an uptick in the amount of renewable energy capacity installed during his term (1981-1989).  Also, for Bill Clinton (1993-2001) large amounts of capacity only came online near the end of his term, with not much happening before then.  In further interpreting this graph, we have to recognize that presidents can't achieve things overnight, and for each of these facilities, one could expect a time delay of several years for the design, permitting, and building processes.  Furthermore, there's a deeper question of how much influence they actually had, since progress may be more driven at the state or regional level.  Again, we can perform a query to investigate this deeper, such as finding all the facilities built while Reagan was in office, this time listing their locations.  This still doesn't give us conclusive results to explain the trends, but it does show that the largest facilities built were geothermal plants in California.

The graphs above lump all technologies together (wind, geothermal, biomass, and solar), and it would be interesting now to perform the same query as above, but now grouped by technology:

select sum(?capacity) as ?totalCapacity ?onlineDate ?sector where {
?plant <http://openei.org/resources/Property-3AGeneratingCapacity> ?capacity .
?plant <http://openei.org/resources/Property-3ACommercialOnlineDate> ?onlineDate .
?plant <http://openei.org/resources/Property-3ASector> ?sector .
} group by ?onlineDate ?sector order by ?sector ?onlineDate 

With a little help from the R statistics package, we're able to clean up the data and generate the stacked area chart below.  The width of each band represents the total capacity per sector.  Since each band is additive, the top of the curve shows the total installed capacity for renewable energy for all technologies. 

 


Clearly solar is not very dominant, although the data does not include household installations, so this may be a bit skewed.  What's interesting here is that most of the technologies have more or less leveled off, while the amount of wind capacity has absolutely exploded.  Another view on this can be made by looking at how the total capacity per year is divided among the four technologies, in terms of relative percentages.  In other words, this shows how much a particular technology dominates the total mix over time.

 

The many results presented here are only a subset of what can be done with the data on this website. One of the points I wish to get across in this exercise, is that it's not just an analysis of renewable energy trends, but rather it's an analysis of US Government data, grabbed live from a wiki that has been opened up to public contributions.  This is truly revolutionary, although perhaps under-appreciated or unnoticed by many.  The interesting implication is that if you find problems in the data, or missing information that could lead to new types of analysis and understanding, then you can just get an account for the OpenEI wiki and begin contributing.  What is shown here is really only the beginning of what is possible.

The cleaned data retrieved from the queries, along with the source code for generating the graphs is available here.