Data Mining, The US Department of Energy’s Semantic Wiki

What will be shown here is an analysis of renewable energy trends in the US, using data extracted live from a US Government wiki that anyone can contribute to.

Most people are quite aware of the concepts of wikis, and their use in allowing people to collaboratively create content.  However, the problem with wikis is that their information moves at the speed at which people can read.  Search engines do help to locate content, but when it comes to people digesting information, they are still limited to reading the contents of a single page at a time.  Related information spread over multiple pages can only be gathered by going to each and every page.

Currently, exciting new extensions allow wikis to use Semantic Web standards.  These semantic wikis allow for a hybrid approach, where people can add plain text, just like on "traditional" wikis, but they also have the option to semantically annotate information.  This means that annotated information spread over multiple wiki pages can be queried in much the same way as a database.  This semantically annotated information can be the kind of information that is commonly placed in infoboxes on Wikipedia.  For example, a look at the pages on New York City or Paris shows infoboxes that contain a wealth of structured data, such as the population and the mayor.  Currently on Wikipedia, you can’t ask questions like "who are the mayors of the world’s 20 largest cities".  With semantic wikis, these types of questions can easily be answered, provided the relevant facts are annotated.

A very prominent example of a semantic wiki is Open Energy Info, which is sponsored by the U.S. Department of Energy and developed by the National Renewable Energy Laboratory.  Browsing through the site reveals that many categories of things are documented, such as Energy Generation Facilities, which cover renewable energy generation facilities.  Clicking on one of these such as the Bear Creek Wind Farm reveals that it has an infobox set up for it, as shown below.

What’s interesting about this is that 1) it’s structured data, and 2) all of the other pages in these categories have these infoboxes as well, which allows us to examine the historical development of renewable energy in the US, solely based on information spread out over more than 1200 wiki pages.  

This next section goes deep into the more advanced features behind the scenes of the OpenEI wiki, in an effort to demonstrate what is currently possible when publicly available open resources are published using Semantic Web technologies.  A less technical option is available in the form of the more limited inline queries that can also be run on the site.

The power of this site is that information is not locked into a single view.  OpenEI has provided a SPARQL endpoint where developers can write queries using the SPARQL language, which is somewhat similar to the SQL language in widespread use for querying databases.  With the query below, we get a table of the generating capacity of renewable energy power plants, along with their commercial online date:

select * where {
?plant <> ?capacity .
?plant <> ?onlineDate .
}
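
For readers who want to run a query like this from a script, here is a minimal Python sketch. It is not the code used for this article. It assumes the endpoint follows the standard SPARQL protocol (the query is passed as a `query` URL parameter and results come back in the W3C SPARQL JSON results format). The endpoint URL and the function names are placeholders of my own.

```python
import json
import urllib.parse
import urllib.request

# Placeholder, not the real address: substitute the endpoint URL published by the site.
ENDPOINT = "http://example.org/sparql"


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL (standard SPARQL protocol)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = json.loads(payload)["results"]["bindings"]
    return [{name: cell["value"] for name, cell in row.items()} for row in rows]


def run_query(endpoint, query):
    """Send the query and return the parsed rows (requires network access)."""
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Every value comes back as a string, so capacities and dates still need to be converted before plotting, which is part of the cleanup mentioned in the text.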

With a little cleanup of the results, the chart below was created:



This is a start, and shows that many renewable energy plants were constructed in the 1980s, with another large peak in the past five years.  However, it’s a bit cluttered, and if two plants came online in the same year with the same capacity, it would be impossible to distinguish them in the plot.  It would be much more useful to see the total capacity added per year.  The query below is similar to the previous one, except that it sums the capacity added per date.  Since the dates listed for these facilities all fall on January 1st, we know that this query will give us the total capacity added every year.

select (sum(?capacity) as ?totalCapacity) ?onlineDate where {
?plant <> ?capacity .
?plant <> ?onlineDate .
} group by ?onlineDate order by ?onlineDate
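
The same aggregation can also be done on the client side, which is handy if the dates ever stop falling on January 1st. The sketch below is a Python illustration, not the code used for this article. It assumes the rows are dictionaries with `capacity` and `onlineDate` keys (matching the query variables) and that dates are ISO-formatted strings. The sample rows are made up.

```python
from collections import defaultdict


def total_capacity_by_year(rows):
    """Sum capacity per calendar year, mirroring `group by ?onlineDate`."""
    totals = defaultdict(float)
    for row in rows:
        year = int(row["onlineDate"][:4])  # assumes ISO dates such as "1985-01-01"
        totals[year] += float(row["capacity"])
    return dict(sorted(totals.items()))


# Made-up rows for illustration only, not OpenEI data.
example_rows = [
    {"capacity": "10", "onlineDate": "1985-01-01"},
    {"capacity": "5.5", "onlineDate": "1985-01-01"},
    {"capacity": "20", "onlineDate": "2008-01-01"},
]
print(total_capacity_by_year(example_rows))  # {1985: 15.5, 2008: 20.0}
```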



This actually shows some surprising results given the media dialogue around politics in the US.  For example, Ronald Reagan was known to be unsympathetic to renewable energy, and he notably removed from the White House the solar hot water heating system that Jimmy Carter had installed.  Yet according to the current OpenEI data, there is an uptick in the amount of renewable energy capacity installed during his term (1981-1989).  Under Bill Clinton (1993-2001), large amounts of capacity only came online near the end of his term, with not much happening before then.  In further interpreting this graph, we have to recognize that presidents can’t achieve things overnight, and that each of these facilities could be expected to take several years to get through design, permitting, and construction.  Furthermore, there’s a deeper question of how much influence presidents actually had, since progress may be driven more at the state or regional level.  Again, we can perform a query to investigate this further, such as finding all the facilities built while Reagan was in office, this time listing their locations.  This still doesn’t give us conclusive results to explain the trends, but it does show that the largest facilities built were geothermal plants in California.

The graphs above lump all technologies together (wind, geothermal, biomass, and solar), so it would be interesting to run the same query again, this time grouped by technology:

select (sum(?capacity) as ?totalCapacity) ?onlineDate ?sector where {
?plant <> ?capacity .
?plant <> ?onlineDate .
?plant <> ?sector .
} group by ?onlineDate ?sector order by ?sector ?onlineDate 

With a little help from the R statistics package, we’re able to clean up the data and generate the stacked area chart below.  The thickness of each band represents the total capacity for that sector.  Since the bands are stacked on top of each other, the top of the curve shows the total installed capacity for renewable energy across all technologies.
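
To make the stacking concrete, here is a small Python sketch of how the band edges in a stacked area chart can be computed. It is an illustration rather than the R code used for the chart. It assumes the query results have already been pivoted into a dictionary of per-sector, per-year capacities. The function name and the numbers are made up.

```python
def stack_bands(capacity_by_sector, years):
    """Return, per sector, the upper edge of its band in a stacked area chart."""
    running = [0.0] * len(years)
    tops = {}
    for sector, by_year in capacity_by_sector.items():
        # Each band sits on top of the running total of the bands below it.
        running = [base + by_year.get(year, 0.0) for base, year in zip(running, years)]
        tops[sector] = running
    return tops


# Made-up numbers for illustration only, not OpenEI data.
example_capacity = {
    "Geothermal": {1985: 30.0, 2008: 10.0},
    "Wind": {2008: 50.0},
}
print(stack_bands(example_capacity, [1985, 2008]))
# {'Geothermal': [30.0, 10.0], 'Wind': [30.0, 60.0]}
```

The upper edge of the last sector is the total across all technologies, which is why the top of the chart can be read as the overall figure.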


Clearly solar is not very dominant, although the data does not include household installations, so this may be a bit skewed.  What’s interesting here is that most of the technologies have more or less leveled off, while the amount of wind capacity has absolutely exploded.  Another view on this can be made by looking at how the total capacity per year is divided among the four technologies, in terms of relative percentages.  In other words, this shows how much a particular technology dominates the total mix over time.
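
The relative view boils down to dividing each technology's capacity by the yearly total. A minimal Python sketch of that calculation follows. Again it is an illustration rather than the code behind the chart, with a made-up function name and made-up numbers.

```python
def sector_shares(capacity_by_sector, years):
    """Return each sector's percentage of the yearly total capacity."""
    shares = {sector: [] for sector in capacity_by_sector}
    for year in years:
        total = sum(by_year.get(year, 0.0) for by_year in capacity_by_sector.values())
        for sector, by_year in capacity_by_sector.items():
            value = by_year.get(year, 0.0)
            # Years with no capacity at all are reported as 0% for every sector.
            shares[sector].append(100.0 * value / total if total else 0.0)
    return shares


# Made-up numbers for illustration only, not OpenEI data.
example_mix = {
    "Geothermal": {1985: 30.0, 2008: 10.0},
    "Wind": {2008: 30.0},
}
print(sector_shares(example_mix, [1985, 2008]))
# {'Geothermal': [100.0, 25.0], 'Wind': [0.0, 75.0]}
```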


The results presented here are only a subset of what can be done with the data on this website.  One of the points I wish to get across in this exercise is that it’s not just an analysis of renewable energy trends, but rather an analysis of US Government data, grabbed live from a wiki that has been opened up to public contributions.  This is truly revolutionary, although perhaps under-appreciated or unnoticed by many.  The interesting implication is that if you find problems in the data, or missing information that could lead to new types of analysis and understanding, then you can just get an account for the OpenEI wiki and begin contributing.  What is shown here is really only the beginning of what is possible.

The cleaned data retrieved from the queries, along with the source code for generating the graphs, is available here.
