Automatically Generating A Financial News Corpus Using News Corpus Builder

I needed to quickly develop a model that could categorize articles across a wide range of financial topics. Building the corpus manually was not an option, and writing custom crawlers for specific news sites would have been tedious.

Having already had to build a different corpus in the past, and lacking a financial one, I created News Corpus Builder so that I and others can generate corpora on any topic or set of topics.

Google News is the source of the articles. Although Google returns at most 100 articles per search term, you can still build a large corpus by using several related search terms per topic. Alternatively, you can run the tool daily to keep growing your corpus.
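Here is a minimal sketch of that idea. It assumes a hypothetical `search` callable that returns `(title, url)` pairs for one term; this is a stand-in for illustration, not News Corpus Builder's actual API. Results from several terms are merged and de-duplicated by URL, since related terms often return the same article, and the result of an earlier run can be passed back in to keep growing the collection.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Article = Tuple[str, str]  # (title, url)

MAX_PER_TERM = 100  # per-term limit mentioned above


def collect_articles(
    terms: Iterable[str],
    search: Callable[[str], List[Article]],
    seen: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Merge results for several search terms into a {url: title} dict.

    Pass the dict returned by an earlier run as `seen` to keep growing
    the same collection across daily runs.
    """
    collected = dict(seen) if seen else {}
    for term in terms:
        for title, url in search(term)[:MAX_PER_TERM]:
            collected.setdefault(url, title)  # keep the first title seen per URL
    return collected


# Hypothetical stand-in so the sketch runs without network access:
# every term returns one shared article plus one term-specific article.
def fake_search(term: str) -> List[Article]:
    shared = [("Oil slides again", "http://example.com/oil-slides")]
    return shared + [(f"{term} story", f"http://example.com/{term.replace(' ', '-')}")]


oil_articles = collect_articles(["oil", "oil prices", "natural gas price"], fake_search)
print(len(oil_articles))
```

Keying on the URL is only one possible de-duplication rule; syndicated copies of the same story under different URLs would still slip through.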

 


The finance task required a corpus covering the following topics (the search terms used for each topic are in parentheses):

* Policy (licenses, regulation, SEC, monetary, fed, fiscal, IMF)

* International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)

* Economy (GDP, jobs, unemployment, housing, economy)

* Raising Capital (IPO, equity)

* Real Estate

* Mergers & Acquisitions (merger, acquisitions)

* Oil (oil, oil prices, natural gas price)

* Commodities (commodities, gold, silver)

* Fraud (insider trading, ponzi scheme, finance fraud)

* Litigation (company litigation, company settlement)

* Earning Reports
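The topic list above maps naturally onto a dictionary of search terms per category. The sketch below is an illustration rather than the tool's own configuration format. The category labels follow the article-count table, and falling back to the topic name for the two topics with no listed terms (Real Estate, Earning Reports) is an assumption, since the post does not say which terms were used for them.

```python
# Search terms per topic, copied from the list above.
TOPIC_TERMS = {
    "Policy": ["licenses", "regulation", "SEC", "monetary", "fed", "fiscal", "imf"],
    "International_Finance": ["global finance", "IMF", "ECB",
                              "trouble in Greece", "RMB devaluation"],
    "Economy": ["GDP", "Jobs", "unemployment", "housing", "economy"],
    "Capital": ["ipo", "equity"],
    "Real_Estate": [],  # no terms listed in the post
    "Mergers_Acquisitions": ["merger", "acquisitions"],
    "Oil": ["oil", "oil prices", "natural gas price"],
    "Commodities": ["commodities", "gold", "silver"],
    "Fraud": ["insider trading", "ponzi scheme", "finance fraud"],
    "Litigation": ["company litigation", "company settlement"],
    "Earning_Reports": [],  # no terms listed in the post
}


def search_jobs(topic_terms):
    """Return (topic, term) pairs, one per search to run.

    Assumption: a topic without explicit terms is searched by its own name.
    """
    jobs = []
    for topic, terms in topic_terms.items():
        for term in terms or [topic.replace("_", " ")]:
            jobs.append((topic, term))
    return jobs


jobs = search_jobs(TOPIC_TERMS)
print(len(jobs), "searches across", len(TOPIC_TERMS), "topics")
```

Because each search is capped at 100 articles, the number of terms per topic sets a rough ceiling on how many articles a single run can return for that topic.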

You can download the Finance Corpus here. If you're not comfortable with the command line, you can use SQLite Browser to explore the corpus. Please leave comments or suggestions.
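If you prefer to explore the corpus from Python, the standard `sqlite3` module is enough to see what a SQLite file contains. The sketch below makes no assumption about the corpus schema: it lists every table with its row count. It runs against a small in-memory database whose `articles` table is invented purely for the demo; to inspect the real file, replace `":memory:"` with the path to your download and drop the demo inserts.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Demo database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Fed holds rates", "Policy"), ("Gold rallies", "Commodities")],
)

counts_by_table = table_row_counts(conn)
print(counts_by_table)
conn.close()
```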

The table below shows the number of articles obtained per topic:

Topic Number of Articles
Capital 245
Commodities 252
Earning_Reports 262
Economy 257
Fraud 266
International_Finance 255
Litigation 202
Mergers_Acquisitions 223
Oil 249
Policy 262
Real_Estate 76
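Since the corpus is meant for training a classifier, it is worth checking how balanced the categories are. The short sketch below uses the counts from the table to compute the total and each topic's share. Most topics sit between roughly 200 and 270 articles, while Real_Estate is much smaller, which you may want to account for when training.

```python
# Article counts per topic, copied from the table above.
ARTICLE_COUNTS = {
    "Capital": 245,
    "Commodities": 252,
    "Earning_Reports": 262,
    "Economy": 257,
    "Fraud": 266,
    "International_Finance": 255,
    "Litigation": 202,
    "Mergers_Acquisitions": 223,
    "Oil": 249,
    "Policy": 262,
    "Real_Estate": 76,
}

total_articles = sum(ARTICLE_COUNTS.values())
smallest_topic = min(ARTICLE_COUNTS, key=ARTICLE_COUNTS.get)

# Print topics from smallest to largest with their share of the corpus.
for topic, n in sorted(ARTICLE_COUNTS.items(), key=lambda kv: kv[1]):
    print(f"{topic:<22} {n:>4} {n / total_articles:6.1%}")
print("total:", total_articles, "| smallest topic:", smallest_topic)
```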

 
