Build Your First News Data Pipeline With Python & Newscatcher
The newscatcher Python package allows you to automatically collect the latest news data from over 3,000 major online news websites.
As I write this article, many people are working from home, and some have extra free time. You can use this time to build your portfolio, enhance your skills, or start a side project.
The newscatcher package makes it easy to collect and normalize news article data without any external dependencies. We built it while working on our main data-as-a-service product, Newscatcher API. We are a developers-first team, so we open-source as much as possible, letting coders partially replicate our work for free.
Using the package is simple: you pass a website URL as input and get the latest articles as output. Each article includes a title, full URL, short description, published date, author(s), and a few more variables.
Quick Start
You need Python 3.7+.
Install the package via PyPI:
pip install newscatcher
Import the Newscatcher class from the package:
from newscatcher import Newscatcher
For example, say we want to see the latest articles from nytimes.com. You must pass the base form of the URL: no www., no https://, and no trailing /.
news_source = Newscatcher('nytimes.com')
last_news_list = news_source.news
news_source.news is a list of feedparser.FeedParserDict objects. Feedparser is a Python package that normalizes RSS feeds. In another Medium post, I explain in more detail what RSS is and how to work with it using feedparser:
Collecting news articles through RSS/Atom feeds using Python: Or how to stop being dependent on data providers (towardsdatascience.com)
One important thing to know is that each RSS/Atom feed may have its own set of attributes. Even though feedparser does a great job structuring it, the article attributes may vary from one news publisher to another.
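Because the attribute set varies by publisher, it can help to normalize every entry down to the fields your project actually needs. Below is a minimal sketch of that idea; plain dicts stand in for feedparser entries (which are also dict-like), and the field list is my own assumption, not something fixed by newscatcher:

```python
# Fields we choose to keep; any feed may be missing some of them.
WANTED = ("title", "link", "summary", "published", "author")

def normalize(entry):
    """Pull a fixed set of fields, defaulting to None when a feed omits one."""
    return {key: entry.get(key) for key in WANTED}

# A hypothetical entry that lacks 'summary', 'published', and 'author'.
entry = {"title": "Example headline", "link": "https://example.com/a"}
print(normalize(entry))
# {'title': 'Example headline', 'link': 'https://example.com/a',
#  'summary': None, 'published': None, 'author': None}
```

Using `.get()` instead of attribute access means a missing field yields None rather than an exception, which keeps a pipeline running when one publisher's feed is sparser than another's.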
For nytimes.com, the list of article attributes is the following:
article = last_news_list[0]
article.keys()
dict_keys(['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'summary', 'summary_detail', 'published', 'published_parsed', 'tags', 'media_content', 'media_credit', 'credit'])
In the code above, we take the first article in the list called last_news_list and check its keys (each article is a dictionary).
Let's go through the main attributes:
print(article.title)
Coronavirus Live Updates: New York City Braces for a Deluge of Patients; Costs of Containment Grow
print(article.summary)
Soldiers in Spain found nursing home patients abandoned. Officials warned that New York was experiencing a virus “attack rate” of five times that elsewhere in the United States. Washington edged closer to a $2 trillion relief package.
print(article.link)
https://www.nytimes.com/2020/03/24/world/coronavirus-updates-maps.html
print(article.published)
Tue, 24 Mar 2020 10:35:54 +0000
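The published field above is an RFC 2822 date string, which the Python standard library can parse into a timezone-aware datetime, handy for sorting or filtering articles by time. A short sketch using the date shown above:

```python
from email.utils import parsedate_to_datetime

# RSS 'published' dates follow RFC 2822; parse into a datetime object.
published = "Tue, 24 Mar 2020 10:35:54 +0000"
dt = parsedate_to_datetime(published)
print(dt.isoformat())  # 2020-03-24T10:35:54+00:00
```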
Most likely, you will find all of the attributes above in the article data from every news publisher.
For now, the biggest news publishers are available in Newscatcher, so you can set up an end-to-end data project. There are many things you might add on top, for example:
+ a pipeline to collect and save news articles to some kind of database (with deduplication)
+ an NLP layer (named entity recognition, for example)
+ tracking of a particular topic
+ scraping the full text of an article
+ visualization of the aggregated data
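The deduplication step in the first idea can be sketched in a few lines. This is a minimal in-memory version, assuming the article's link is a stable unique key; real projects would persist the seen set in a database, and the dicts here stand in for feedparser entries:

```python
def dedupe_articles(articles, seen=None):
    """Keep only articles whose link has not been seen before."""
    seen = set() if seen is None else seen
    fresh = []
    for art in articles:
        link = art.get("link")
        if link and link not in seen:
            seen.add(link)       # remember this link for future batches
            fresh.append(art)    # only first occurrence is kept
    return fresh, seen

# Two hypothetical fetches with one overlapping article.
batch1 = [{"link": "https://a.com/1", "title": "One"},
          {"link": "https://a.com/2", "title": "Two"}]
batch2 = [{"link": "https://a.com/2", "title": "Two"},
          {"link": "https://a.com/3", "title": "Three"}]

fresh1, seen = dedupe_articles(batch1)
fresh2, seen = dedupe_articles(batch2, seen)
print([a["title"] for a in fresh2])  # ['Three']
```

Passing the seen set back in on each fetch is what makes the pipeline safe to run on a schedule: re-fetched articles are silently skipped.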
I would recommend the following packages/technologies while working with the newscatcher package:
Elasticsearch to store and query data
Newspaper3k Python package to automatically scrape the full text of an article
spaCy for NLP
A quick tip for those of you who want to enhance your portfolios: showing a recruiter that you can deliver an end-to-end project will make your resume stand out from the crowd. The business value of your work matters as well, so try to build something that serves a real business case. Good luck.
If you want to support our team, you can help us by signing up for the closed beta of Newscatcher API. Our API allows you to search the most relevant news published in the past.