I wanted to create a dataset of tweets, replies, favorites and so on, but couldn't find a tool that did this. I therefore wrote my own simple Twitter scraper in Python, using Beautiful Soup to parse the HTML and Selenium to drive the browser.
The minimal setup only requires a list of Twitter handles and a date that tells the scraper how far back in time to go. The scraper saves the entire page source, which is then processed by a parser. For each handle, the parser writes two output files:
- handle-stats.json: a file containing information about the page.
- handle-tweets.json: a file containing all tweets from the page all the way to the defined date.
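The file naming and the date cutoff described above can be sketched as follows. This is only an illustration, not the scraper's actual code: the function names, the example handle, and the assumption that each tweet carries an ISO-formatted `date` field are all hypothetical.

```python
from datetime import date


def output_paths(handle):
    """Return the two per-handle output file names (stats, tweets)."""
    return f"{handle}-stats.json", f"{handle}-tweets.json"


def tweets_since(tweets, cutoff):
    """Keep only tweets posted on or after the cutoff date.

    Assumes each tweet is a dict with an ISO 'date' string
    (a hypothetical field name, chosen for this sketch).
    """
    return [t for t in tweets if date.fromisoformat(t["date"]) >= cutoff]


# Hypothetical minimal setup: a list of handles and a cutoff date.
handles = ["example_handle"]
cutoff = date(2018, 1, 1)

# Made-up sample data standing in for parsed tweets.
sample = [
    {"date": "2018-03-05", "text": "newer"},
    {"date": "2017-12-31", "text": "older"},
]

for handle in handles:
    stats_file, tweets_file = output_paths(handle)
    kept = tweets_since(sample, cutoff)
    print(stats_file, tweets_file, len(kept))
```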
From the JSON files, the page information and tweets can be imported as Python dictionaries, so everything is available in an easy-to-use format for all kinds of statistical and semantic analyses.
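As a minimal sketch of that import step, the two files for a handle can be read with Python's standard `json` module. The helper name `load_handle` and its `directory` parameter are assumptions made for this example. The shape of the loaded objects depends on what the parser actually wrote.

```python
import json
from pathlib import Path


def load_handle(handle, directory="."):
    """Load <handle>-stats.json and <handle>-tweets.json from a directory.

    Returns a (stats, tweets) pair of whatever Python objects
    the JSON files contain.
    """
    base = Path(directory)
    with open(base / f"{handle}-stats.json", encoding="utf-8") as f:
        stats = json.load(f)
    with open(base / f"{handle}-tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)
    return stats, tweets
```

With the output files in the current directory, `stats, tweets = load_handle("example_handle")` would give the two objects, ready for further analysis.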
I hope this can be useful for others. Feel free to suggest further functionality on the GitHub page. This project is a work in progress, so follow the git repository for updates.