This is like a data collection step for a much bigger problem. You might want to do some form of analysis of the stories being posted on HackerNews or see which user has received the most upvotes in the past year. To do anything like that, we need the data first. Since I like Python, I tried to figure out how to do it and am sharing it here.
To get started, we will need to install the Firebase Python package. It is easy to do via pip; just follow the steps here.
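For reference, the package I used exposes FirebaseApplication and usually installs as python-firebase - double-check the linked steps if the name differs for you:

```
pip install python-firebase
```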
Once Firebase is installed, we can go ahead and write the code to do our work.
Do the imports:
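Mine looked roughly like this (I'm assuming the python-firebase package and the exceptions from requests, since python-firebase uses requests under the hood):

```python
import json
import time

from firebase import firebase
from requests.exceptions import HTTPError, ConnectionError
```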
We will see below why we need to import the exceptions.
Now define our base HackerNews URL. This is the URL through which HackerNews serves its stories via Firebase.
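Something like this (the variable name is my own):

```python
# Official HackerNews Firebase endpoint
HN_URL = "https://hacker-news.firebaseio.com"
```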
Now we will write our scraper class. We will need to initialize a FirebaseApplication with the URL we defined above.
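A minimal sketch of that initialization (class and attribute names are mine):

```python
class HNScraper(object):
    def __init__(self):
        # The second argument is the auth object; the HN API is public,
        # so None works fine here.
        self.hn = firebase.FirebaseApplication(HN_URL, None)
```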
At this point we need to decide what range of data we want to get from HackerNews. Let's assume I want the data from the year 2016 - i.e. from 1st Jan 2016 00:00:00 GMT to 1st Jan 2017 00:00:00 GMT. I'll go here and convert these dates into UNIX time. Once I have the times, I need to find a starting story index. The stories on HackerNews are numbered starting from 1, in increments of 1 for each story, and every post, comment, etc. on HackerNews is a story. I could start from 1 and filter out only the ones I want - but that would take an eternity to reach 2016's stories. Instead, with some trial and error you can figure out a reasonable starting index by filling in the last field in https://hacker-news.firebaseio.com/v0/item/[story-id] and finding a story with a timestamp close to our starting time.
Once we have done that, we can define some variables as below. The values are just for illustration and do not really mean anything.
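For example (the timestamps correspond to the 2016 window above; the starting story id and filename are purely illustrative):

```python
START_TIME = 1451606400    # 1st Jan 2016 00:00:00 GMT as UNIX time
END_TIME = 1483228800      # 1st Jan 2017 00:00:00 GMT as UNIX time
START_STORY_ID = 10830000  # a story id posted somewhere near START_TIME
OUTPUT_FILE = 'hn_2016.json'
```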
Now for the main functions that do the fetching work. The first one is fetch_stories, which, given a start story index, a start time and an end time, fetches all the stories that fall within the specified time period. It also saves the results to the filename we provide in the last argument every 100 stories - a safety measure in case the script has to stop for some reason.
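My version was roughly the sketch below (the exact signature and saving logic are mine, so treat it as a rough outline). It lives on the scraper class from above, walks forward one story id at a time, keeps every story whose time falls inside the window, and rewrites the output file every 100 stories:

```python
class HNScraper(object):
    # __init__ as defined earlier

    def fetch_stories(self, start_id, start_time, end_time, filename):
        stories = []
        story_id = start_id
        while True:
            item = self.get_item(story_id)
            if item is not None and 'time' in item:
                # Story ids grow with time, so once we see a story past
                # the end of our window we can stop.
                if item['time'] >= end_time:
                    break
                if item['time'] >= start_time:
                    stories.append(item)
                    # Checkpoint every 100 stories in case the script dies.
                    if len(stories) % 100 == 0:
                        with open(filename, 'w') as f:
                            json.dump(stories, f)
            story_id += 1
        with open(filename, 'w') as f:
            json.dump(stories, f)
        return stories
```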
The fetch_stories function calls a get_item function, which actually fetches a particular story from Firebase.
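Roughly like this (again, the names are mine; as far as I can tell, python-firebase's get() takes the path and the item id and returns the parsed JSON as a dict):

```python
class HNScraper(object):
    # __init__ and fetch_stories as defined earlier

    def get_item(self, story_id):
        item = None
        while True:
            try:
                item = self.hn.get('/v0/item', story_id)
                break
            except (HTTPError, ConnectionError):
                # Network hiccup - wait a moment and retry the same story.
                time.sleep(1)
        return item
```

With that in place, a run for our window is just HNScraper().fetch_stories(START_STORY_ID, START_TIME, END_TIME, OUTPUT_FILE).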
The exception handling in the above code is pretty useful. It happens quite often that an HTTPError or a ConnectionError occurs while requesting a story. If we don't handle the exception here, our script will die and we'll need to restart it.
By handling the exceptions in a while loop, our script never really dies. It will just keep trying until it gets through to the story and then move forward.
This handling turned out to be really useful in writing a robust data collector.
That is the entire code and our data collector is now ready! It still takes a lot of time to gather data for just 1 year. But once we have it, we can have all sorts of fun with it!