This post is an exploration of how to deal with data that has a high number of dimensions, in particular FAO data that describes the amount of calories per capita per day for over 110 different food groups. This is a pretty incredible data set as it essentially describes the diverse types of diets that people around the world have. The challenge in exploring this data is that people in different countries don’t have diametrically opposed types of diets, so it’s a bit difficult to say how similar or how different they are, especially given the diversity of food types.
The first tool used is the t-SNE package for R, which I discovered after finding an interesting example that used it to help characterize mutual funds. With the data we’re looking at, if there were only three food groups, it would be easy to visualize the differences in diet just by making a 3D scatter plot where each point is a country and each axis represents a different food group. However, we’re dealing with 110 dimensions, which means that we need to reduce them to a more manageable number. With the t-SNE algorithm, we are able to reduce those dimensions down to two. As seen in the image below, the algorithm works reasonably well, and there are clear clusters around Western Europe, the Caribbean, and Southeast Asia, among others. Essentially, this demonstrates that it’s possible to extract some information about geography just by looking at what’s on people’s dinner plates. At the same time, strange examples can be found, such as Lithuania being grouped next to Chile. It isn’t immediately clear whether this is incorrect, since they may in fact have similar diets. The x and y dimensions of the image don’t have any meaning of their own. The main things to pay attention to are the clusters and how close different clusters are to each other.
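To give a feel for what t-SNE starts from, here is a minimal Python sketch of its first step: turning distances between diet vectors into neighbour probabilities. It is not the code behind this post (that was written in R). The three "countries", their calorie values, and the single fixed `sigma` are all assumptions made up for illustration. Real t-SNE picks a separate bandwidth for each point from a perplexity setting, and then searches for a 2D layout whose own similarities match these probabilities.

```python
import math

# Made-up diet vectors (kcal per capita per day) for three food groups.
# Illustrative numbers only, not FAO figures.
diets = {
    "A": [900.0, 100.0, 50.0],
    "B": [850.0, 150.0, 60.0],
    "C": [100.0, 900.0, 300.0],
}


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def conditional_affinities(points, sigma):
    """Return p[i][j]: how strongly point i 'picks' j as a neighbour.

    Each row is a Gaussian kernel over distances, normalised to sum to 1.
    Assumption: one fixed sigma for every point (real t-SNE tunes it per point).
    """
    names = list(points)
    p = {}
    for i in names:
        weights = {
            j: math.exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
            for j in names
            if j != i
        }
        total = sum(weights.values())
        p[i] = {j: w / total for j, w in weights.items()}
    return p


affinities = conditional_affinities(diets, sigma=200.0)
```

In this toy example, A and B have similar diets, so almost all of A's probability mass lands on B, while C is a distant neighbour of both.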
While the image gives us a rough idea about the similarity of diets, it would be interesting to look a bit deeper into all 110 food types used to create this clustering. The image below uses the same layout as the one above, except that the size of each dot represents the calories per capita per day that the country gets from a particular food group. Comparing this with the image above, we see that maize is quite popular in Central America, while rice is popular in the Caribbean, Africa, and Asia. Admittedly, this isn’t the easiest representation to read if we’re trying to trace back how much of each food a country eats. It is probably best viewed as a test of how well the algorithm worked in reducing the huge number of dimensions down to something that we could visualize.
Below is an image you can click to enlarge showing all of the categories, instead of just the two shown above. The size of each dot represents the actual number of calories, so by summing the sizes of the dots for a country, you could deduce the total number of calories consumed. What we can clearly see is that roughly ten categories of food provide most of the calories in people’s diets.
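The observation that a handful of categories supply most of the calories can also be checked numerically rather than by eye. The Python sketch below is one way to do that, assuming a dictionary of total calories per food group. The group totals and the `groups_covering` helper are hypothetical, made up for illustration, and are not taken from the FAO data or from the original R code.

```python
# Made-up calorie totals per food group (illustrative, not FAO figures).
calories_by_group = {
    "wheat": 500.0,
    "rice": 300.0,
    "maize": 100.0,
    "sugar": 60.0,
    "beer": 40.0,
}


def groups_covering(totals, share):
    """Return the largest groups, in descending order, needed to reach
    at least `share` (a fraction between 0 and 1) of all calories."""
    grand_total = sum(totals.values())
    covered = 0.0
    chosen = []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    for name, kcal in ranked:
        if covered >= share * grand_total:
            break
        chosen.append(name)
        covered += kcal
    return chosen


top_groups = groups_covering(calories_by_group, share=0.85)
```

Run against the real table, the length of the returned list would say how many food groups are needed to cover a given share of the calories.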
A slightly different view is given below, where the number of calories is normalized within each food group. In other words, the largest dots represent the countries that consume the most calories from that group. This also gives an indication of how the consumption of a type of food varies among countries. As can be seen, for some categories everyone eats about the same amount, while for others one country eats much more than anyone else.
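One simple way to do this kind of per-group normalization is to divide every value by the largest value in its food group, so the top consumer of each group gets 1.0. The Python sketch below assumes that scheme. The post does not say exactly which normalization the original R code uses, and the countries and values are made up for illustration.

```python
# Made-up kcal per capita per day: outer keys are countries, inner keys are food groups.
table = {
    "A": {"beer": 120.0, "rice": 200.0},
    "B": {"beer": 30.0, "rice": 800.0},
    "C": {"beer": 60.0, "rice": 400.0},
}


def normalise_by_group(data):
    """Scale each food group so that its largest consumer has value 1.0.

    Assumption: divide-by-maximum scaling; a missing entry counts as 0.
    """
    groups = sorted({group for row in data.values() for group in row})
    maxima = {
        group: max(row.get(group, 0.0) for row in data.values())
        for group in groups
    }
    return {
        country: {
            group: (row.get(group, 0.0) / maxima[group]) if maxima[group] > 0 else 0.0
            for group in groups
        }
        for country, row in data.items()
    }


scaled = normalise_by_group(table)
```

After scaling, dot sizes are comparable within a food group but no longer across groups: a 1.0 for beer and a 1.0 for rice can stand for very different calorie counts.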
The images above are an illustration of the entire data set that the t-SNE algorithm had to work with. For the grid of food types shown, there does indeed seem to be reasonable clustering for many of the types. At the same time, I need to do more work to really evaluate what this means, and to understand whether the algorithm would have to be tuned with different settings to get better results. In the meantime, below we see what can be achieved by using the same visualization techniques as above, but placing each country’s dot at its location on a map. The image below shows why interpreting the image at the top of this post is difficult. Both maize and rice are consumed to a large degree in both Central America and West Africa, but Southern Africa focuses on maize, while Asia prefers rice.
In the next graphs, the values for each food category are normalized, so it’s easier to see which countries consume the most of one category, relative to the rest of the world. From this we see that Europe leads the world in calories from beer:
For those interested, I have put the R code used to perform this analysis and generate the graphs in my GitHub repository. The code also includes some preliminary work using the igraph library to visualize how the physical layout generated by the t-SNE algorithm matches the probability matrix it generates.