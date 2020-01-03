newscatcher

Or how to stop being dependent on data providers

In one of my previous posts I showed how you could scrape and analyze news articles with just 5 lines of code:

Scrape and Summarize News Articles in 5 Lines of Python Code

This time I will show you how to set up a pipeline that automatically collects all the new articles published by almost any news provider (such as NY Times, CNN, Bloomberg, etc.).

To achieve this, I will show you how to automate news collection using the feedparser Python package, which helps normalize RSS/Atom feeds.

Who is this article for?

For data engineers and data scientists who want to collect their own data and practice building data pipelines.

What is RSS and Atom?

RSS is an XML-based plain-text format that provides a brief summary of articles recently published by a content provider (news, podcasts, personal blogs, etc.).

The most common producers of RSS are news publishers.

An RSS feed exists to provide access to the latest news (for news aggregators and news syndicators, for example).

In most cases, an RSS feed does not contain the entire article text, but it provides some basic information such as author, title, description, publication time, etc.

Atom is another XML format, developed as an alternative to RSS. Atom seems to be more advanced than RSS, but I am not going to compare the two formats in this post.

An example of RSS XML:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
  <title>RSS Title</title>
  <description>This is an example of an RSS feed</description>
  <link>http://www.example.com/main.html</link>
  <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000</lastBuildDate>
  <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
  <ttl>1800</ttl>

  <item>
    <title>Example entry</title>
    <description>Here is some text containing an interesting description.</description>
    <link>http://www.example.com/blog/post/1</link>
    <guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
    <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
  </item>

</channel>
</rss>
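To make the structure concrete, here is a minimal sketch that reads the items of the example feed above using only the Python standard library. The function name read_items and the choice of fields are my own, picked for illustration; real feeds may omit some of these tags.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example feed above (one channel, one item).
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
  <title>RSS Title</title>
  <description>This is an example of an RSS feed</description>
  <link>http://www.example.com/main.html</link>
  <item>
    <title>Example entry</title>
    <description>Here is some text containing an interesting description.</description>
    <link>http://www.example.com/blog/post/1</link>
    <guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
    <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
  </item>
</channel>
</rss>"""


def read_items(rss_text):
    """Return one dict per <item>, keeping only a few common fields.

    findtext() returns None when a tag is missing, so incomplete
    items do not raise an error.
    """
    root = ET.fromstring(rss_text.encode("utf-8"))
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
            "published": item.findtext("pubDate"),
        })
    return items


items = read_items(SAMPLE_RSS)
print(items[0]["title"])
print(items[0]["link"])
```

This works for a single well-formed RSS 2.0 feed. It does not handle Atom or publisher-specific tags, which is exactly the gap feedparser fills later in this post.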

Scrape the news via an RSS endpoint

So, the only thing left is to collect the URLs (endpoints) of the news publishers we are interested in.

For this article, I use the NY Times feed endpoint. To find it, I had to:

go to https://www.nytimes.com/

“inspect” the source code of the page

search for the term “rss”

grab the first result

Source code of the NY Times page
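The manual search above could be sketched in code as well. This is only an illustration: it assumes the page advertises its feed through a <link> tag with type application/rss+xml (or application/atom+xml), which is a common convention but not something every publisher follows. FeedLinkFinder, find_feed_urls and the sample HTML are hypothetical names and data of my own.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link> tags that advertise an RSS/Atom feed."""

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        link_type = (attributes.get("type") or "").lower()
        if link_type in self.FEED_TYPES and attributes.get("href"):
            self.feed_urls.append(attributes["href"])


def find_feed_urls(html_text):
    """Return every feed URL advertised in the given HTML source."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feed_urls


# Hypothetical page source, standing in for a real news home page.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/main.css">
<link rel="alternate" type="application/rss+xml" title="RSS"
      href="https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml">
</head><body></body></html>"""

print(find_feed_urls(SAMPLE_HTML))
```

To run it against a live site you would first download the page source yourself and pass it to find_feed_urls. If nothing is found, fall back to searching the source by hand as described above.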

Let’s grab that link and check if it looks like something that we need.

https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml

Alright, as we can see, it is the “NYT > Top Stories” RSS feed.

Under the <channel> section you will find general information about the feed itself: its description, when it was built, its language, etc.

Each <item> in this RSS feed represents one article. The first item has the title (<title>) “Trump Bet He Could Isolate Iran and Charm North Korea. It’s Not That Easy.”

If we follow the link (<link>) under this <item>, we are taken to the original article page:

https://www.nytimes.com/2020/01/01/us/politics/trump-iran-north-korea.html?emc=rss&partner=rss

RSS will not give us the full text of an article, but it provides a short <description> instead.

Feedparser

Now that we know what RSS is and how we can use it, we can try to automate the way we obtain new articles.

The main drawback of RSS/Atom feeds is that they are not normalized. According to the Wikipedia page, RSS/Atom feeds have only a few mandatory fields (link, title, description).

https://en.wikipedia.org/wiki/RSS

This means that if you want to store data from different news publishers, you should either account for all the possible key-value pairs or use some schema-free technology (Elasticsearch, for example).
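As a rough sketch of the first option, you could map every entry onto a fixed set of keys and put whatever is left into a catch-all field. The schema below (SCHEMA_FIELDS, the source and extra keys) is my own choice for illustration, not something RSS or any library prescribes. The function accepts any dict-like entry.

```python
# Hypothetical target schema: these field names are an assumption for this example.
SCHEMA_FIELDS = ("title", "link", "summary", "author", "published")


def to_fixed_schema(entry, source):
    """Map one parsed entry (any dict-like object) to a fixed set of keys.

    Missing fields become None instead of raising KeyError.
    """
    record = {field: entry.get(field) for field in SCHEMA_FIELDS}
    record["source"] = source
    # Keep everything we did not map, so publisher-specific data is not lost.
    record["extra"] = {k: v for k, v in entry.items() if k not in SCHEMA_FIELDS}
    return record


# Two made-up entries with different sets of keys, as two publishers might send them.
entry_a = {"title": "A", "link": "https://example.com/a", "author": "Jane", "media_content": []}
entry_b = {"title": "B", "link": "https://example.com/b", "summary": "short text"}

records = [
    to_fixed_schema(entry_a, "publisher-1"),
    to_fixed_schema(entry_b, "publisher-2"),
]
for record in records:
    print(sorted(record))
```

Both records end up with the same keys, so they fit one table. The extra field can be stored as JSON if you want to inspect the leftovers later.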

Explore the feedparser package

pip install feedparser

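import feedparser

# Download and parse the NY Times feed we found above
feed = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml')

Now our feed is loaded into the feed variable. Its .feed attribute holds the metadata of the feed itself.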

feed.feed

Out[171]:
{'title': 'NYT > Top Stories',
 'title_detail': {'type': 'text/plain',
  'language': None,
  'base': 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
  'value': 'NYT > Top Stories'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://www.nytimes.com?emc=rss&partner=rss'},
  {'href': 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
   'rel': 'self',
   'type': 'application/rss+xml'}],
 'link': 'https://www.nytimes.com?emc=rss&partner=rss',
 'subtitle': '',
 'subtitle_detail': {'type': 'text/html',
  'language': None,
  'base': 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
  'value': ''},
 'language': 'en-us',
 'rights': 'Copyright 2020 The New York Times Company',
 'rights_detail': {'type': 'text/plain',
  'language': None,
  'base': 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
  'value': 'Copyright 2020 The New York Times Company'},
 'updated': 'Thu, 02 Jan 2020 15:03:52 +0000',
 'updated_parsed': time.struct_time(tm_year=2020, tm_mon=1, tm_mday=2, tm_hour=15, tm_min=3, tm_sec=52, tm_wday=3, tm_yday=2, tm_isdst=0),
 'published': 'Thu, 02 Jan 2020 15:03:52 +0000',
 'published_parsed': time.struct_time(tm_year=2020, tm_mon=1, tm_mday=2, tm_hour=15, tm_min=3, tm_sec=52, tm_wday=3, tm_yday=2, tm_isdst=0),
 'image': {'title': 'NYT > Top Stories',
  'title_detail': {'type': 'text/plain',
   'language': None,
   'base': 'https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',
   'value': 'NYT > Top Stories'},
  'href': 'https://static01.nyt.com/images/misc/NYT_logo_rss_250x40.png',
  'links': [{'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://www.nytimes.com?emc=rss&partner=rss'}],
  'link': 'https://www.nytimes.com?emc=rss&partner=rss'}}

The most important fields are rights (the copyright notice) and published.

feedparser takes care of assigning the correct values to those attributes, so you do not have to waste time normalizing them yourself.
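The *_parsed values in the output above are time.struct_time objects, which are awkward to store or compare. Here is a small sketch that turns one into a timezone-aware datetime. It assumes the value is expressed in UTC, which is how the feedparser documentation describes its parsed dates. The helper name struct_time_to_datetime is my own.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_datetime(parsed):
    """Convert a UTC time.struct_time into a timezone-aware datetime."""
    # calendar.timegm() interprets the tuple as UTC (unlike time.mktime()).
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# The updated_parsed value from the output above, rebuilt by hand for this example.
updated_parsed = time.struct_time((2020, 1, 2, 15, 3, 52, 3, 2, 0))

print(struct_time_to_datetime(updated_parsed).isoformat())
# 2020-01-02T15:03:52+00:00
```

With a real feed you would pass feed.feed.updated_parsed (or an entry's published_parsed) instead of the hand-built value, after checking that the attribute exists.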

Normalized articles

As with the feed metadata, you will find the information about each article under the .entries attribute.

feed.entries[0].title

Out[5]: 'Trump Bet He Could Isolate Iran and Charm North Korea. It’s Not That Easy.'

That way, we will know the basic information about each element of the feed.

If you want the full text of an article, you have to take the URL and use newspaper3k. See my other article, linked at the beginning of this post.

Further work

Try to think about how you could build a data pipeline that collects new articles and deduplicates them against those your database has already seen. An additional NLP pipeline on top could also yield lots of useful insights (the spaCy Python package is perfect for that).
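As a starting point for the deduplication part, here is a minimal sketch built on SQLite from the standard library. It assumes the article link is a good enough unique key, which may not hold if a publisher changes tracking parameters in its URLs. The table layout and the open_store and keep_new helpers are hypothetical, and entries are plain dict-like objects with a link key.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (if needed) a table that remembers every article link seen so far."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (link TEXT PRIMARY KEY, title TEXT)")
    return conn


def keep_new(conn, entries):
    """Store unseen entries and return only those."""
    fresh = []
    for entry in entries:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO seen (link, title) VALUES (?, ?)",
            (entry["link"], entry.get("title")),
        )
        if cursor.rowcount == 1:  # 0 means the link was already stored
            fresh.append(entry)
    conn.commit()
    return fresh


conn = open_store()

# Two made-up polling rounds; the second one repeats an article from the first.
first_round = [
    {"title": "A", "link": "https://example.com/a"},
    {"title": "B", "link": "https://example.com/b"},
]
second_round = [
    {"title": "B", "link": "https://example.com/b"},
    {"title": "C", "link": "https://example.com/c"},
]

new_first = keep_new(conn, first_round)
new_second = keep_new(conn, second_round)
print(len(new_first), len(new_second))
print([entry["title"] for entry in new_second])
```

In a real pipeline you would pass a file path to open_store so the table survives restarts, and feed it the entries of each feed on every polling round.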

On my personal blog, I talk about how I build newscatcher, an API to access news data from the most popular news publishers. If you would like to know how to scale what I have described above to thousands of feeds, follow my Medium blog and tune in to my Twitter.