Automatically Generating A Financial News Corpus Using News Corpus Builder

I needed to quickly develop a model that could categorize articles across a wide range of financial topics. Building the corpus manually was not an option, and writing custom crawlers for specific news sites would have been tedious.

I had already had to build a different corpus once before and had no financial corpus to start from, so I created News Corpus Builder to let myself and others generate corpora on any topic or set of topics.

Google News is the source of the articles. Google limits results to 100 articles per search term, but you can still build a large corpus by using several related search terms for each topic. Alternatively, you can simply run it daily to grow your corpus over time.
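To make the multiple-terms idea concrete, here is a minimal sketch of how results for several search terms could be merged into one per-topic collection. It is not News Corpus Builder's own API: `collect_articles`, `fake_fetch`, the `CANNED` data and the article record shape are all hypothetical names made up for illustration, and the search itself is replaced by a stand-in function so the example runs offline. The only detail taken from the text is the cap of 100 articles per term.

```python
from typing import Callable, Dict, Iterable, List, Optional

MAX_PER_TERM = 100  # per-term cap mentioned above

Article = Dict[str, str]  # hypothetical record shape: {"url": ..., "title": ...}


def collect_articles(
    topic_terms: Dict[str, List[str]],
    fetch: Callable[[str, int], Iterable[Article]],
    corpus: Optional[Dict[str, Dict[str, Article]]] = None,
) -> Dict[str, Dict[str, Article]]:
    """Merge results for several search terms per topic, keyed by URL.

    Passing a corpus from an earlier run lets repeated (e.g. daily) runs
    add only the articles that have not been seen before.
    """
    corpus = {} if corpus is None else corpus
    for topic, terms in topic_terms.items():
        seen = corpus.setdefault(topic, {})
        for term in terms:
            results = list(fetch(term, MAX_PER_TERM))[:MAX_PER_TERM]
            for article in results:
                # Keep the first copy of each URL; later duplicates are ignored.
                seen.setdefault(article["url"], article)
    return corpus


# Stand-in for a real search call (assumption: returns dicts with "url" and "title").
CANNED = {
    "interest rates": ["a", "b", "c"],
    "federal reserve": ["b", "c", "d"],
    "ipo": ["e"],
}


def fake_fetch(term: str, limit: int) -> List[Article]:
    slugs = CANNED.get(term, [])[:limit]
    return [{"url": f"https://example.com/{s}", "title": s} for s in slugs]


topics = {
    "monetary_policy": ["interest rates", "federal reserve"],
    "equities": ["ipo"],
}
corpus = collect_articles(topics, fake_fetch)
print({topic: len(articles) for topic, articles in corpus.items()})
# prints {'monetary_policy': 4, 'equities': 1}
```

The two monetary policy terms return six results between them, but only four distinct URLs survive, which is why overlapping terms are safe to combine. Calling `collect_articles(topics, fake_fetch, corpus)` again with the existing corpus is the shape a daily run would take.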

 

