I had to quickly develop a model that could categorize articles across a wide range of financial topics. Building the corpus manually was not an option, and writing custom crawlers for specific news sites would have been tedious.
Having had to create a different corpus previously and lacking such a financial corpus, I created News Corpus Builder so that I and others could generate corpora on any topic or set of topics.
Google News is used as the source of articles. Although Google limits results to 100 articles per search term, you can still build a large corpus by using multiple or similar search terms per topic. Alternatively, you could run it daily to grow the corpus over time.
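The multi-term approach can be sketched roughly as follows. This is only an illustration and not News Corpus Builder's actual API: `collect_links` and `fake_fetch` are hypothetical names, `fetch_links` stands in for whatever performs the search, and the example URLs are made up. The cap of 100 mirrors the per-term limit mentioned above.

```python
from typing import Callable, Iterable, List

MAX_PER_TERM = 100  # per-term cap described above


def collect_links(terms: Iterable[str],
                  fetch_links: Callable[[str, int], List[str]],
                  max_per_term: int = MAX_PER_TERM) -> List[str]:
    """Merge article links from several search terms, dropping duplicates."""
    seen = set()
    merged = []
    for term in terms:
        # Slice defensively in case the fetcher returns more than the cap.
        for link in fetch_links(term, max_per_term)[:max_per_term]:
            if link not in seen:
                seen.add(link)
                merged.append(link)
    return merged


def fake_fetch(term: str, limit: int) -> List[str]:
    """Hypothetical stand-in for a real search; returns canned links."""
    canned = {
        "oil": ["http://example.com/a", "http://example.com/b"],
        "oil prices": ["http://example.com/b", "http://example.com/c"],
    }
    return canned.get(term, [])[:limit]


links = collect_links(["oil", "oil prices"], fake_fetch)
print(links)
```

Similar terms tend to return overlapping articles, so deduplicating on the link keeps the same article from entering the corpus twice. The same function also works for the daily-run approach if previously seen links are loaded into `seen` first.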
The finance task required a corpus with the following topics (topic-related search terms are in brackets):
* Policy (licenses, regulation, SEC, monetary, fed, fiscal, imf)
* International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
* Economy (GDP, Jobs, unemployment, housing, economy)
* Raising Capital (ipo, equity)
* Mergers & Acquisitions (merger, acquisitions)
* Oil (oil, oil prices, natural gas price)
* Commodities (commodities, gold, silver)
* Fraud (insider trading, ponzi scheme, finance fraud)
* Litigation (company litigation, company settlement)
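
One way to hold this topic list in code is a plain mapping from topic to search terms, flattened into one search job per term. This is a sketch under assumptions: the names `SEARCH_TERMS` and `search_jobs` are hypothetical and not part of News Corpus Builder, and the terms are copied from the list above.

```python
from typing import Dict, Iterator, List, Tuple

# Topic -> search terms, taken from the list above.
SEARCH_TERMS: Dict[str, List[str]] = {
    "Policy": ["licenses", "regulation", "SEC", "monetary", "fed", "fiscal", "imf"],
    "International Finance": ["global finance", "IMF", "ECB",
                              "trouble in Greece", "RMB devaluation"],
    "Economy": ["GDP", "Jobs", "unemployment", "housing", "economy"],
    "Raising Capital": ["ipo", "equity"],
    "Mergers & Acquisitions": ["merger", "acquisitions"],
    "Oil": ["oil", "oil prices", "natural gas price"],
    "Commodities": ["commodities", "gold", "silver"],
    "Fraud": ["insider trading", "ponzi scheme", "finance fraud"],
    "Litigation": ["company litigation", "company settlement"],
}


def search_jobs(topics: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield one (topic, term) pair per search that needs to be run."""
    for topic, terms in topics.items():
        for term in terms:
            yield topic, term


jobs = list(search_jobs(SEARCH_TERMS))
print(len(jobs), "searches to run")
```

With 32 search terms and a limit of 100 articles per term, a single pass can return at most 3,200 articles before duplicates are removed. Note that "imf" appears under both Policy and International Finance, so the same article may be retrieved under two topic labels.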
The number of articles obtained per topic is shown in the table below:
|Topic||Number of Articles|