"Reverse Engineering" Google News RSS Feed. Part I.
Or how I wrote the missing Google News RSS documentation
TL;DR You can get a fairly narrow Google News RSS feed of aggregated news: search by keyword, geolocation, time range, topic, etc. You just need to know the syntax. So I’ve spent the past few months figuring out Google’s syntax. Plus, I published an API to access normalized data, as well as a Python library that handles all of the scraping.
Demo of the Python package:
Why do I care about Google News RSS?
Short answer — web scraping.
Don’t know what RSS is? Here is a great article by Digital Trends.
Google News UI VS Google News RSS
If we go to PageSpeed Insights (by Google) and compare the score and the page weight, we will see that:
The Google News browser version scores only 68 overall for Desktop, and its page weight is about 1,000 KB
The Google News RSS page scores 100 overall for Desktop, and its page weight is about 30 KB. It is just XML, but it still contains the data that I need.
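Since the feed is plain XML, reading it does not require a browser or a third-party parser. Here is a minimal sketch using only the Python standard library. It assumes the usual RSS 2.0 layout (`item` elements with `title`, `link` and `pubDate`); the sample document and the `parse_items` helper are made up for illustration and are not real Google News output.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the standard RSS 2.0 layout (an assumption for this sketch,
# not real Google News output).
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline - Example Source</title>
      <link>https://example.com/articles/1</link>
      <pubDate>Mon, 06 Jan 2020 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline - Another Source</title>
      <link>https://example.com/articles/2</link>
      <pubDate>Mon, 06 Jan 2020 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of dicts with the title, link and pubDate of each <item>."""
    # Encode to bytes so the XML declaration's encoding is honored by the parser.
    root = ET.fromstring(rss_text.encode("utf-8"))
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry["published"], "|", entry["title"])
```

The same function should work on a real feed body as long as it follows this layout; any extra elements are simply ignored.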
The Google News RSS feed is not that good compared to the usual ones, such as the NYTimes feed. Google includes neither a thumbnail picture nor a short description of the article.
However, Google News RSS has a few features that no other RSS feed has:
It is the RSS feed of an aggregator. You can get the feed by topic, country, or language. Best news powered by the best search engine.
You can search. It works the same way as a Google search, but it returns only news, and in RSS.
In some cases (the top headlines feed, for example) each article’s description is a list of related articles.
You can search by date!
Scraping a website VS scraping the RSS feed
Have you ever tried to scrape Google search? If not, here is a short description of what happens: you get blocked fast.
There is a separate type of API service called a SERP (Search Engine Results Page) API. Long story short, these services scrape the data that you would get from the Google search page. I have checked a few; they all work, but all have a latency of about 7 seconds. That is a lot.
I assume SERP APIs are slow for 2 main reasons:
The Google search page takes long to load. Plus, you have to use a headless browser to load all the JS code
SERP API providers have to manage an army of proxies
4 types of Google News RSS
Top headlines: https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en
Geolocation headlines: https://news.google.com/rss/headlines/section/geo/NY?hl=en-US&gl=US&ceid=US:en
Top headlines by search (latest articles that mention AAPL in the title AND published over the past hour): https://news.google.com/rss/search?q=intitle:AAPL+when:1h&hl=en-US&gl=US&ceid=US:en
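The URLs above share one base and the same three locale parameters, so they are easy to build in code. Below is a small sketch, not an official client: the function names are my own, the locale values are copied from the examples, and only the `intitle:` and `when:` operators shown in the search example are used.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://news.google.com/rss"
# Locale parameters copied from the example URLs; other values are not covered here.
DEFAULT_LOCALE = {"hl": "en-US", "gl": "US", "ceid": "US:en"}


def _with_query(path, params):
    # safe=":" leaves "US:en" and "intitle:AAPL" unescaped, matching the example URLs.
    return f"{BASE_URL}{path}?{urlencode(params, safe=':')}"


def top_headlines_url(locale=None):
    """URL of the top headlines feed."""
    return _with_query("", locale or DEFAULT_LOCALE)


def geo_headlines_url(place, locale=None):
    """URL of the geolocation headlines feed for a place such as "NY"."""
    return _with_query(f"/headlines/section/geo/{quote(place)}", locale or DEFAULT_LOCALE)


def search_url(query, locale=None):
    """URL of the search feed; spaces in the query become "+"."""
    return _with_query("/search", {"q": query, **(locale or DEFAULT_LOCALE)})


if __name__ == "__main__":
    print(top_headlines_url())
    print(geo_headlines_url("NY"))
    print(search_url("intitle:AAPL when:1h"))
```

Each function returns a plain string, so the result can be passed to whatever HTTP client you already use.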
I’ll be covering each RSS type in detail in the next parts of this series, so make sure you are subscribed. Most likely, those long reads will be available in full only to my paid subscribers. Good news: I still have fewer than 100 paid subscribers (far fewer, you know), so you can still subscribe for $11 a year. That’s an 80% discount.
P.S. What is the legal side of scraping and using data from Google News?
That is a question I am asked very often. I have been reading about it for quite a while, I have spoken to some lawyers, and I went through many similar services to check their terms of service.
The short answer (I AM STILL NOT YOUR LAWYER!): it is not illegal, unless you use it for criminal things. Thousands of companies do similar things on a daily basis. I will write a separate long read where I will lay out all of the facts that I could discover on my own. I will also try to get comments from a few people in the Data-as-a-Service industry, as well as from lawyers.
Make sure you are subscribed to my newsletter. It is free. Plus, I still have my 80% discount while I am getting my first 100 paid subscribers. It’s just $11 instead of $55.
But like I mentioned, only for my first 100 paid subscribers.