I wanted to create a dataset of tweets, replies, favorites and so on, but couldn't find a tool that did this. I therefore wrote my own simple Twitter scraper in Python, using BeautifulSoup for parsing HTML and Selenium as the webdriver.

This post is a short description of the scraper, named tScrape. The source and a detailed description are posted on GitHub.

The minimal setup requires only a list of Twitter handles and a date that tells the scraper how far back in time to go. The scraper saves the entire page source, which is then processed by a parser. For each handle, the parser writes two output files:

  1. handle-stats.json: a file containing information about the page.
  2. handle-tweets.json: a file containing all tweets from the page all the way to the defined date.

The page information and tweets can be imported from the JSON files as Python dictionaries. Everything is then available in an easy-to-use format for all kinds of statistical and semantic analyses.
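As a minimal sketch of that import step, the snippet below reads both files for one handle using only the standard library. The helper name `load_handle_data` is made up for this post and is not part of tScrape, and I am assuming here that "handle" in the file names stands for the actual Twitter handle. The keys inside each file are not shown, since they depend on the parser; see the GitHub description for those.

```python
import json
from pathlib import Path


def load_handle_data(handle, directory="."):
    """Load the two parser output files for one handle.

    Hypothetical helper, not part of tScrape. Assumes the files are
    named "<handle>-stats.json" and "<handle>-tweets.json" and are
    stored in `directory`.
    """
    base = Path(directory)

    # Page information for the handle.
    with open(base / f"{handle}-stats.json", encoding="utf-8") as f:
        stats = json.load(f)

    # All tweets back to the configured date.
    with open(base / f"{handle}-tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)

    return stats, tweets
```

Calling `stats, tweets = load_handle_data("some_handle")` would then give two Python objects that can be passed straight to whatever analysis tools you prefer.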

I hope this can be useful for others. Feel free to suggest further functionality on the GitHub page. This project is a work in progress, so follow the git repository for updates.