Last year the Danish newspaper Ugebrevet A4 published an interactive infographic called "Navnehjulet" ("the name wheel"). It's simple: you enter a first name and it shows you all sorts of statistics about people in Denmark who have that name, e.g. job, age, crime rate and so on. The data was originally collected from Statistics Denmark, but for some reason A4 doesn't want to share the raw data with us, only the pretty graphics.
I will go through the process of mining the data behind the infographic to build a dataset for various statistical analyses. The tools I used were Firefox and a Linux terminal. We start by looking at the source of the site. On line 306 we find the first clue:
<script src="http://a4-project.s3-website-us-east-1.amazonaws.com/assets/js/build/build.js"></script>
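If you prefer the terminal to the browser's view-source, fetching the page and grepping for the script tag finds the same line:
# Fetch the page source and print the line referencing build.js (grep -n shows the line number)
curl -s 'http://www.ugebreveta4.dk/navnehjulet' | grep -n 'build.js'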
The script contained in build.js is responsible for building the infographic and responding to user input in the search form. We can look at it in our browser by visiting:
http://a4-project.s3-website-us-east-1.amazonaws.com/assets/js/build/build.js
It doesn't look pretty, but after sending it through jsbeautifier.org it is easier to read. Looking at the beautified script we see something interesting on lines 16794-16801:
search_form = {
    data: [],
    initialize: function() {
        var self = this;
        $.ajax({
            url: absolute_path + "data/namelist.json",
            dataType: "jsonp",
            jsonpCallback: "nameListCallback",
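If you would rather do the beautification from the terminal than through jsbeautifier.org, the js-beautify command-line tool gives the same result (this assumes Node.js and npm are installed, and line numbers may differ slightly depending on settings):
# Install the beautifier and run it on a local copy of the script
npm install -g js-beautify
curl -s 'http://a4-project.s3-website-us-east-1.amazonaws.com/assets/js/build/build.js' -o build.js
js-beautify build.js > build.pretty.js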
We could try to load the JSON directly in the browser, but a 403 error tells us that the file is protected. Using the developer tools in Firefox we can instead watch the GET request that loads the name list for the search form, and copy it as a cURL command from the Network tab.
Now it's easy to get the file by simply running the request through the terminal and saving the output to a file called names.json:
curl 'http://a4-project.s3-website-us-east-1.amazonaws.com/data/namelist.json?callback=nameListCallback&_=1430417753369' -H 'Host: a4-project.s3-website-us-east-1.amazonaws.com' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0' -H 'Accept: */*' -H 'Accept-Language: en-GB,en;q=0.5' --compressed -H 'Referer: http://www.ugebreveta4.dk/navnehjulet' -H 'Connection: keep-alive' > names.json
Now we have a list of all 2358 names included in the dataset. The format of names.json is:
nameListCallback({"names": [{"id": "ca63debf4c.json", "value": "Aage"}, {"id": "019ae1390e.json", "value": "Aaron"}, ... ]})
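As a quick sanity check we can count the entries: each entry contains exactly one "value" key, so this should print 2358:
# Count the name entries in the JSONP payload
grep -o '"value"' names.json | wc -l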
The next step is a bit of regex work, which I will not cover in detail here: from names.json I created two new files, one called "ids" and one called "names". One way to do it is sketched below.
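A sketch using GNU grep's Perl-style regexes (not necessarily the exact commands I used):
# Extract the ids, stripping the ".json" suffix since the download script below appends it
grep -oP '"id": "\K[^"]+' names.json | sed 's/\.json$//' > ids
# Extract the names themselves
grep -oP '"value": "\K[^"]+' names.json > names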
Using the developer tools in Firefox again while entering a name in the search form gives us a new GET request that loads all the data assigned to that name. The request uses the id associated with the name, so we now know how to extract the data for all the names. This time I made a small bash script that loops over all the ids, reusing the GET request from the search form. The input is the file "ids", which is placed in the same folder as the script. The script looks like this:
#!/bin/bash
# Download the JSON data file for every id listed in the "ids" file.
FILE=ids
exec 0<$FILE
while read -r line
do
    curl 'http://a4-project.s3-website-us-east-1.amazonaws.com/data/'$line'.json?callback=nameWheelCallback&_=1430417753370' -H 'Host: a4-project.s3-website-us-east-1.amazonaws.com' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0' -H 'Accept: */*' -H 'Accept-Language: en-GB,en;q=0.5' --compressed -H 'Referer: http://www.ugebreveta4.dk/navnehjulet' -H 'Connection: keep-alive' > "$line".json
    # Pause between requests; Amazon blocks faster scraping (see below)
    sleep 1s
done
Save the file, make it executable and run it through the terminal:
chmod +x dataMiner.sh
./dataMiner.sh
The sleep command is necessary as Amazon has some kind of DDoS detection. I found that a wait of 200 ms was too short, so I just went with 1 s to be safe from blocking.
Now we have a JSON file with a lot of data for each name. The data is stored very inefficiently: a lot of boilerplate HTML is included in the JSON files, so the data for these 2358 names takes up almost 18 MB. Weird.
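You can check the total size from the terminal, assuming the downloaded files sit in the current folder:
# Print the combined size of the downloaded JSON files
du -ch ./*.json | tail -n 1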
So far so good. But our data is not complete yet. It says a lot about how the different names compare to the population average, but we have no information about the sample size for each name. We will have to look somewhere else for this, which is where the "names" file we created earlier comes in.
We go to Statistics Denmark, which has an online query service called "How many Danes have the name..."
Entering a name into the query while watching the developer tools reveals that Statistics Denmark also uses a simple request (a POST, judging by the form data it carries) to return the number of people with a certain name. We again write a small script to mine all the data; this time the input file is "names" instead of "ids":
#!/bin/bash
# Query Statistics Denmark for the number of people carrying each name in the "names" file.
FILE=names
exec 0<$FILE
while read -r line
do
    curl 'http://dst.dk/da/Statistik/emner/navne/HvorMange.aspx?ajax=1' -H 'Host: dst.dk' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0' -H 'Accept: text/html, */*; q=0.01' -H 'Accept-Language: en-GB,en;q=0.5' --compressed -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: http://dst.dk/da/Statistik/emner/navne/HvorMange.aspx' -H 'Cookie: ASP.NET_SessionId=4tvm22lszncxsau02qxebxls; website#lang=da; cookiebannershown=true; DstDkStrIdentifier=Value=1894130' -H 'Connection: keep-alive' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache' --data 'firstName='$line'&lastName=' > "$line".html
done
Save the file, make it executable and run it through the terminal:
chmod +x nameMiner.sh
./nameMiner.sh
This time sleep is not necessary, as Statistics Denmark apparently doesn't have any DDoS detection. The output is saved as HTML files because the request returns an HTML table, which will have to be cleaned later to extract the numbers of interest.
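As a taste of that cleanup, here is a crude first pass (hypothetical, since the exact markup of the returned table isn't reproduced here): strip all tags and keep only the lines that still contain digits.
# Rough inspection of one result file, e.g. for the name Aage
sed -e 's/<[^>]*>/ /g' Aage.html | grep -E '[0-9]' | tr -s ' '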
Now we have a JSON file with all the data for each name, as well as an HTML file with population counts for each name for both 2013 and 2014. What is left is to clean and reformat the data and probably do some principal component analysis. When the statistical analysis is complete, the results as well as the full dataset will be published at The Winnower.