
Word count of blog posts

Edit: This post is about a theme I made for Nibbleblog, which I am not using anymore.

I have encountered a number of blogs that display, at the top of each post, the number of words in the post or an estimate of the time it takes to read it. I thought this would be a nice little exercise to implement on this blog. First, I turned to Stack Overflow for an easy way of counting words with JavaScript:

function countWords(str) {
  var matches = str.match(/[\w\d]+/gi);
  return matches ? matches.length : 0;
}

Given a string, this function matches all words (sequences of alphanumerics and underscores) and puts them in an array. If any matches are found, the length of the array (the number of words) is returned; otherwise, match returns null and the function returns zero. This might result in a slight overestimation, as single letters and lone numbers are also counted as words. On this blog the main content is placed inside an <article> tag. The function above should therefore be run with the following as input, in order to count only the words of the post, excluding headers and footers:

str = document.getElementsByTagName("article")[0].textContent;

Next, I want to make an estimate of the time it takes to read the post. According to Wikipedia, the average reading speed is:

The average adult reads prose text at 250 to 300 words per minute. While proofreading materials, people are able to read at 200 wpm on paper, and 180 wpm on a monitor.

I think that people tend to read a blog post somewhat faster than they proofread a text. I therefore decided to divide the number of words by 200 to estimate the reading time in minutes.

Now I have a decimal number and could write something like "Read in x.y minutes". However, I don't think people care much about decimal precision when deciding whether or not to read a blog post. I therefore decided to round the estimated reading time up to the nearest whole minute, so that I can instead write: "Read in less than x minutes". In my opinion this is a more welcoming formulation, and it makes the time seem shorter. People today are busy, so I think it's important to tell your readers how little time they have to sacrifice to read through your thoughts. Otherwise they might not even care to scroll through it. The "swipers" of today are just looking for another quick fix.

I briefly thought about counting the number of equations and images in a post and adding a predefined number of seconds for each. Their contribution is, however, very difficult to estimate, and it would only inflate the reading time, which might discourage some readers. I therefore chose to stick with counting regular words.

I also thought about estimating the reading time statistically, by measuring how much time each visitor spends on the page and only including cases where they have actually scrolled to the bottom. This would require extra implementation effort, and I don't think it would result in a more precise prediction than simply counting words: many people would probably just scroll through without reading everything, and others might leave the site open in a tab for hours before eventually reading it.

The final things to add are:

  1. Making sure that the function is only called on actual "/post/" pages
  2. Discriminating between singular (minute) and plural (minutes)
  3. Populating an empty <span> with the id readTime

This is how the full function currently looks:

$(function () {
    // Only run if we are on a post page
    if (window.location.pathname.indexOf("/post/") != -1) {
        // Get all text in the blog post
        var str = document.getElementsByTagName("article")[0].textContent;
        // Find words and put them in an array
        var matches = str.match(/[\w\d]+/gi);
        // Count words. If no words were found, use 0
        var words = matches ? matches.length : 0;
        // Calculate reading time and round up to the nearest minute
        var readTime = Math.ceil(words / 200);
        // Insert estimated reading time in the post header
        if (readTime <= 1) {
            $("#readTime").html("Read in less than " + readTime.toString() + " minute");
        } else {
            $("#readTime").html("Read in less than " + readTime.toString() + " minutes");
        }
    }
});

Find this and future versions at GitHub.

Easy I/O operations with Python dictionaries

Yes, this is sort of a hack. But it results in a couple of neat functions compared to other solutions using pickle and csv.

For a long time I have wanted an easy way to save and load dictionaries in Python. The idea below recently occurred to me, as a result of Python's dynamic typing mixed with me being tired and forgetting the type of the data I was handling. Of course, this idea is nothing new to the world, but it was to me.

Here goes. First we need to import NumPy. Then we define a dictionary:

In [1]: import numpy as np

In [2]: d = {'a': 1, 'b':2, 'c':3}

The trick is that a dictionary can be turned into an array with no dimensions:

In [3]: dd = np.array(d)

In [4]: dd
Out[4]: array({'a': 1, 'c': 3, 'b': 2}, dtype=object)

In [5]: dd.shape
Out[5]: ()

This array can now be saved to disk using numpy.save (or numpy.savez for multiple arrays) and loaded again using numpy.load. There are two semantically different ways of turning the zero-dimensional array back into a dictionary:

In [6]: dd.item()
Out[6]: {'a': 1, 'b': 2, 'c': 3}

In [7]: dd[()]
Out[7]: {'a': 1, 'b': 2, 'c': 3}

This also works for dictionaries/arrays inside dictionaries.
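For example, a dictionary holding another dictionary and a NumPy array round-trips just the same. A quick check (note that recent NumPy versions require allow_pickle=True when loading, since object arrays are stored using pickle under the hood):

In [8]: nested = {'x': {'a': 1}, 'y': np.arange(3)}

In [9]: np.save('nested.npy', nested)

In [10]: np.load('nested.npy', allow_pickle=True).item()
Out[10]: {'x': {'a': 1}, 'y': array([0, 1, 2])}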

Functions!

The trick above can be turned into two neat functions for easy saving and loading of Python dictionaries:

import numpy as np

def savedict(dictname, dic):
    # numpy.save appends '.npy' to the file name by itself
    np.save(dictname, dic)

def loaddict(dictname):
    # allow_pickle=True is required by recent NumPy versions,
    # as object arrays are stored using pickle
    return np.load(dictname + '.npy', allow_pickle=True).item()

Now simply save your dictionary to a file with:

savedict('woo', d)

And load in with:

d = loaddict('woo')

That's it.

Correlations in evolution of GDP

Some time ago I read a chapter called Statistical Physics Models for Group Decision Making in a book on Econophysics. This post is not a review, but a presentation of an idea sparked by that book. In this chapter they have the following note about GDP:

A large number of political, economic, social, and administrative decisions are embodied in the economic growth, currently measured by GDP and GDP per capita. Starting from the fluctuations of these indicators, one can establish correlations among the countries that adopt similar lines of development, in a given time interval.

They then move on to construct a fully connected graph, in which each link is assigned a weight based on the correlation in the evolution of GDP between the two countries it connects. When all the links have been assigned a weight, they impose a threshold, such that links with a weight smaller than this threshold are deleted. This is one way of doing a cluster analysis on similarities in economic development. Another approach could be plotting the components of a principal component analysis, which is also explored in the book, and has also been done here.
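A minimal sketch of that procedure in Python could look like the following. Note that np.corrcoef (pairwise Pearson correlation) stands in for whatever correlation measure the book uses, and the growth series and country names are made up for illustration:

import numpy as np

def correlation_network(growth, labels, threshold):
    # Pairwise Pearson correlations between the countries' growth series
    corr = np.corrcoef(growth)
    # Start from the fully connected graph and keep only links
    # whose weight reaches the threshold
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if corr[i, j] >= threshold:
                edges.append((labels[i], labels[j], round(corr[i, j], 2)))
    return edges

# Made-up annual growth rates, one row per country
growth = [[2.1, 1.8, 2.5, 0.3, 1.2],
          [2.0, 1.9, 2.4, 0.5, 1.1],
          [4.0, 3.5, 1.0, 2.2, 3.1]]
print(correlation_network(growth, ['Denmark', 'Sweden', 'Greece'], 0.8))

Countries whose links survive the thresholding end up in the same connected component, and these components are the clusters.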

I wish to repeat the process of creating a network with weighted links and a variable threshold, as I am not impressed by the visualizations in the book. The problem with old-school printing is that it is a very static medium compared to, for example, this blog.

The first thing to do is to obtain data. I found a table of "Real GDP growth - Annual growth in percentage" for the OECD countries in the so-called OECD FACTBOOK 2006. According to this document: "Real growth rates are obtained by converting GDP to constant prices and calculating the change from year to year." So far so good. Next, I calculate the correlations of these time series, which are shown in the figure below.

[Figure: matrix of pairwise correlations of annual real GDP growth between the OECD countries]

This figure looks fancy in itself, but it is difficult to get an overview and tell whether there is any clustering. Also, it is static. The next step is to turn these results into a JavaScript object, so they can be visualized in a browser.
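Continuing the sketch above, one way to do this is to dump the network as JSON in the nodes/links shape that D3's force layout expects. The file name, and the choice to export every link with its weight and let the browser do the thresholding, are my assumptions about the setup:

import json
import numpy as np

def export_network(growth, labels, filename='gdp_network.json'):
    corr = np.corrcoef(growth)
    nodes = [{'name': name} for name in labels]
    # Export every link with its weight; the threshold slider in the
    # browser then decides which links are actually drawn
    links = [{'source': i, 'target': j, 'weight': float(corr[i, j])}
             for i in range(len(labels))
             for j in range(i + 1, len(labels))]
    with open(filename, 'w') as f:
        json.dump({'nodes': nodes, 'links': links}, f)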

Below is a screenshot of the interactive visualization I have made. Click the image or this link to go there. It is now possible to select all countries, or just a single one, and to set the level of the correlation threshold for links to be shown. In this visualization the threshold (left handle) defines the middle of an interval whose width can be tuned using the handle to the right.

[Screenshot of the interactive visualization]

In the book they mention the following four clusters: Scandinavian, Continental, Anglo-Saxon, and Mediterranean. See if you can find them by tuning the correlation threshold and the width of the interval.

The visualization was made using the D3.js library. The source code is available at GitHub.