Analysing News Article Content with Google Cloud Natural Language API
Demo on How to Use Google Cloud Natural Language and AI Platform Notebooks
This is a slightly modified version of an article originally posted on Nordcloud Engineering blog.
In my previous blog post I showed how to use AI Platform Training to fine-tune a custom NLP model using PyTorch and the transformers library. In this post we take advantage of Google's pre-trained AI models for NLP and use the Cloud Natural Language API to analyse text. Google's pre-trained machine learning APIs are great for building working AI prototypes and proofs of concept in a matter of hours.
Google's Cloud Natural Language API allows you to do named entity recognition, sentiment analysis, content classification and syntax analysis using a simple REST API. Client libraries are available for Python, Go, Java, Node.js, Ruby, PHP and C#; in this post we'll be using the Python client.
Before we jump in, let's define our use case. To highlight the simplicity and power of the API I'm going to use it to analyse the content of news articles. In particular I want to find out if the latest articles published in The Guardian's world news section contain mentions of famous people and if those mentions have a positive or a negative sentiment. I also want to find out the overall sentiment of the news articles. To do this, we will go through a number of steps.
- We will use the Guardian's RSS feed to extract links to the latest news articles in the world news section.
- We will download the HTML content of the articles published in the past 24 hours and extract the article text in plain text.
- We will analyse the overall sentiment of the text using Cloud Natural Language.
- We will extract named entities from the text using Cloud Natural Language.
- We will go through all named entities of type PERSON and see if they have a Wikipedia entry (for the purposes of this post, this will be our measure of the person being "famous").
- Once we've identified all the mentions of "famous people", we will analyse the sentiment of the sentences mentioning them.
- Finally, we will print the names, Wikipedia links and the sentiments of the mentions of all the "famous people" in each article, together with the article title, url and the overall sentiment of the article.
We will do all this using Google Cloud AI Platform Notebooks.
To launch a new notebook, make sure you are logged in to the Google Cloud Console and have an active project selected. Navigate to AI Platform Notebooks and select New Instance. For this demo you don't need a very powerful notebook instance, so we will make some changes to the defaults to save cost. First, select Python 3 (without CUDA) from the list and give your notebook a name. Next, click the edit icon next to Instance properties. From Instance properties, select n1-standard-1 as the Machine type. You'll see that the estimated cost of running this instance is only $0.041 per hour.
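If you prefer the command line, a roughly equivalent instance can also be created with the gcloud CLI. This is just a sketch: the instance name, image family and zone below are assumptions, so adjust them to your project.
gcloud notebooks instances create nlp-demo-notebook \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=common-cpu \
    --machine-type=n1-standard-1 \
    --location=us-central1-a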
Once the instance has been created and is running, click the Open JupyterLab link next to it. In JupyterLab, create a new Python 3 notebook.
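Note that the Cloud Natural Language API has to be enabled for your project before you can call it. If it isn't enabled yet, you can do this from the Cloud Console, or directly from a notebook cell, assuming the instance's service account has permission to enable services:
!gcloud services enable language.googleapis.com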
Steps 1-2: Extract the Latest News Articles
We start by installing some required Python libraries. The following command uses pip to install lxml, Beautiful Soup and feedparser. We use lxml and Beautiful Soup for parsing and processing the HTML content. feedparser will be used to parse the RSS feed, identify the latest news articles and get the links to the full text of those articles.
!pip install lxml bs4 feedparser
Once we have installed the required libraries we need to import them together with the other libraries we need for extracting the news article content.
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
import re
import feedparser
import time
from tqdm import tqdm
Next, we define the URL of the RSS feed as well as the time period we want to limit our search to.
feed = "https://www.theguardian.com/world/rss"
days = 1
We then define two functions for extracting the main article text from the HTML document. The text_from_html function parses the HTML, extracts the text and uses the tag_visible function to filter out everything but the main article text.
def tag_visible(element):
    # Skip HTML comments
    if isinstance(element, Comment):
        return False
    # Keep only text that sits inside paragraph tags
    if element.parent.name in ['p']:
        return True
    return False
def text_from_html(html):
    soup = BeautifulSoup(html.content, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)
Once we have defined these functions we will parse the RSS feed, identify the articles published in the past 24 hours and extract the required attributes for those articles. We will need the article title, link, publishing time and, using the functions defined above, the plain text version of the article text.
newsfeed = feedparser.parse(feed)
articles = []
# Get all the entries from within the last day
entries = [entry for entry in newsfeed.entries if time.time() - time.mktime(entry.published_parsed) < (86400*days)]
for entry in tqdm(entries, total=len(entries)):
    html = requests.get(entry.link)
    src_text = text_from_html(html)
    article = dict()
    article["title"] = entry.title
    article["link"] = entry.link
    article["src_text"] = src_text
    article["published"] = entry.published_parsed
    articles.append(article)
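At this point each article is stored as a dictionary in the articles list, so a quick check confirms what we fetched:
print(f"Articles fetched in the past {days} day(s): {len(articles)}")
for article in articles[:3]:
    print("-", article["title"])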
Steps 3-7: Analyse the Content with Cloud Natural Language
To use the Cloud Natural Language API we first import the Python client library.
from google.cloud import language_v1
from google.cloud.language_v1 import enums
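The code in this post uses the pre-2.0 version of the google-cloud-language client library, which still provides the enums module. If your notebook environment has a newer version installed, the import above will fail; in that case you can pin an older release (the version constraint below is an assumption) and restart the kernel before re-running the imports:
!pip install 'google-cloud-language<2'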
Next, we define the main function for the demo. Below, in 21 lines of code, we do all the needed text analysis and print the results. The function takes document as its input, analyses the contents and prints the results. We will look at the contents of the document input later.
To use the API we need to initialise a LanguageServiceClient. We then define the encoding type which we need to pass to the API together with the document.
The first API call analyze_entities(document, encoding_type=encoding_type)
takes the input document and the encoding type and returns a response of the following form:
{
  "entities": [
    {
      object(Entity)
    }
  ],
  "language": string
}
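Each Entity in the response carries, among other things, a name, a type, a salience score and a metadata map. A minimal sketch of inspecting them, assuming response holds the result of the analyze_entities call described above:
for entity in response.entities[:5]:
    # Entity type as a readable string, e.g. PERSON, LOCATION, ORGANIZATION
    print(entity.name, enums.Entity.Type(entity.type).name, entity.salience)
    # The metadata map may contain a 'wikipedia_url' key for well-known entities
    print(dict(entity.metadata))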
We will then call the API to analyse the sentiment of the document as well as to get the sentiments of each sentence in the document. The response has the following form:
{
  "documentSentiment": {
    object(Sentiment)
  },
  "language": string,
  "sentences": [
    {
      object(Sentence)
    }
  ]
}
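Both the document-level and the per-sentence sentiments expose a score (roughly -1.0 for very negative to 1.0 for very positive) and a magnitude (the overall strength of emotion). A minimal sketch of reading them, assuming annotations holds the result of the analyze_sentiment call:
# Overall document sentiment
print(annotations.document_sentiment.score, annotations.document_sentiment.magnitude)
# Sentiment of the first few sentences
for sentence in annotations.sentences[:3]:
    print(sentence.text.content, sentence.sentiment.score)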
The overall document sentiment is stored in annotations.document_sentiment.score. We assign the document an overall sentiment of POSITIVE if the score is above 0, NEGATIVE if it is below 0 and NEUTRAL if it is exactly 0.
We then go through all the entities identified by the API and create a list of those that have the type PERSON. Once we have this list, we loop through it and check which ones have a wikipedia_url entry in their metadata. As mentioned above, we use this as our measure of the person being "famous". When we identify a "famous person", we print the person's name and the link to their Wikipedia entry.
We then check the sentiment-annotated sentences for occurrences of the identified "famous people" and use the same thresholds as above to determine the sentiment category of those sentences. Finally, we print the sentiments of all the sentences mentioning each person.
def print_sentiments(document):
    client = language_v1.LanguageServiceClient()
    encoding_type = enums.EncodingType.UTF8
    # Get entities from the document
    response = client.analyze_entities(document, encoding_type=encoding_type)
    # Get sentiment annotations from the document
    annotations = client.analyze_sentiment(document, encoding_type=encoding_type)
    # Get overall document sentiment score
    overall_sentiment = 'POSITIVE' if annotations.document_sentiment.score > 0 else 'NEGATIVE' \
        if annotations.document_sentiment.score < 0 else 'NEUTRAL'
    print(f"Overall sentiment: {overall_sentiment}")
    # Construct a list of entities where the entity type is PERSON
    entities = [entity for entity in response.entities if enums.Entity.Type(entity.type).name == 'PERSON']
    # Loop through the persons
    for entity in entities:
        # Check if the entity has a metadata entry containing a Wikipedia link
        for metadata_name, metadata_value in entity.metadata.items():
            if metadata_name == 'wikipedia_url':
                name = entity.name
                wiki_url = metadata_value
                print(f"\nPerson: {name}")
                print(f"- Wikipedia: {wiki_url}")
                # Get all sentences mentioning the person
                sentences = [sentence for sentence in annotations.sentences if name in sentence.text.content]
                # Display whether the sentences mentioning the person are negative, positive or neutral
                for index, sentence in enumerate(sentences):
                    sentence_sentiment = 'POSITIVE' if sentence.sentiment.score > 0 else 'NEGATIVE' \
                        if sentence.sentiment.score < 0 else 'NEUTRAL'
                    print(f"- Sentence: {index + 1} mentioning {name} is: {sentence_sentiment}")
Now that we have extracted the text from the news site and defined the function to analyse the contents of each article, all we need to do is go through the articles and call the function. The input for the function is a dictionary containing the plain-text contents of the article, the type of the document (which in our case is PLAIN_TEXT) and the language of the document (which for us is English). We also print the title of each article and the link to the article.
For demo purposes we limit our analysis to the first 3 articles.
language = "en"
type_ = enums.Document.Type.PLAIN_TEXT
# Analyse the latest 5 articles
for article in articles[:3]:
print('\n' + '#'*50 + '\n')
print(article["title"])
print(article["link"])
document = {"content": article["src_text"], "type": type_, "language": language}
print_sentiments(document)
print('\n' + '#'*50)
##################################################
‘We have to win’: Myanmar protesters persevere as forces ramp up violence
https://www.theguardian.com/world/2021/feb/28/we-have-to-win-myanmar-protesters-persevere-as-forces-ramp-up-violence
Overall sentiment: NEGATIVE
Person: Min Aung Hlaing
- Wikipedia: https://en.wikipedia.org/wiki/Min_Aung_Hlaing
- Sentence: 1 mentioning Min Aung Hlaing is: NEUTRAL
Person: Aung San Suu Kyi
- Wikipedia: https://en.wikipedia.org/wiki/Aung_San_Suu_Kyi
- Sentence: 1 mentioning Aung San Suu Kyi is: POSITIVE
##################################################
White House defends move not to sanction Saudi crown prince
https://www.theguardian.com/world/2021/feb/28/white-house-defends-not-sanction-saudi-crown-prince-khashoggi-killing
Overall sentiment: NEGATIVE
Person: Joe Biden
- Wikipedia: https://en.wikipedia.org/wiki/Joe_Biden
- Sentence: 1 mentioning Joe Biden is: NEGATIVE
Person: Mark Warner
- Wikipedia: https://en.wikipedia.org/wiki/Mark_Warner
- Sentence: 1 mentioning Mark Warner is: NEGATIVE
Person: Khashoggi
- Wikipedia: https://en.wikipedia.org/wiki/Jamal_Khashoggi
- Sentence: 1 mentioning Khashoggi is: NEGATIVE
- Sentence: 2 mentioning Khashoggi is: NEGATIVE
- Sentence: 3 mentioning Khashoggi is: NEGATIVE
Person: Jen Psaki
- Wikipedia: https://en.wikipedia.org/wiki/Jen_Psaki
- Sentence: 1 mentioning Jen Psaki is: NEGATIVE
Person: Democrats
- Wikipedia: https://en.wikipedia.org/wiki/Democratic_Party_(United_States)
- Sentence: 1 mentioning Democrats is: NEGATIVE
Person: Gregory Meeks
- Wikipedia: https://en.wikipedia.org/wiki/Gregory_Meeks
- Sentence: 1 mentioning Gregory Meeks is: POSITIVE
Person: Prince Mohammed
- Wikipedia: https://en.wikipedia.org/wiki/Mohammed_bin_Salman
- Sentence: 1 mentioning Prince Mohammed is: NEGATIVE
##################################################
Coronavirus live news: South Africa lowers alert level; Jordan ministers sacked for breaches
https://www.theguardian.com/world/live/2021/feb/28/coronavirus-live-news-us-approves-johnson-johnson-vaccine-auckland-starts-second-lockdown-in-a-month
Overall sentiment: NEGATIVE
Person: Germany
- Wikipedia: https://en.wikipedia.org/wiki/Germany
- Sentence: 1 mentioning Germany is: NEGATIVE
- Sentence: 2 mentioning Germany is: NEUTRAL
Person: Nick Thomas-Symonds
- Wikipedia: https://en.wikipedia.org/wiki/Nick_Thomas-Symonds
- Sentence: 1 mentioning Nick Thomas-Symonds is: NEGATIVE
Person: Cyril Ramaphosa
- Wikipedia: https://en.wikipedia.org/wiki/Cyril_Ramaphosa
- Sentence: 1 mentioning Cyril Ramaphosa is: NEGATIVE
Person: Raymond Johansen
- Wikipedia: https://en.wikipedia.org/wiki/Raymond_Johansen
- Sentence: 1 mentioning Raymond Johansen is: NEGATIVE
Person: Archie Bland
- Wikipedia: https://en.wikipedia.org/wiki/Archie_Bland
- Sentence: 1 mentioning Archie Bland is: NEUTRAL
##################################################
As you can see, all three articles we analysed have an overall negative sentiment. We also found quite a few mentions of people with Wikipedia entries, along with the sentiment of each sentence mentioning them.
Conclusion
As we saw, the Cloud Natural Language API is a super simple yet powerful tool that allows us to analyse text with just a few lines of code. This is great when you are working on a new use case and need to quickly test the feasibility of an AI-based solution. It is also the go-to resource when you don't have data to train your own machine learning model for the task. However, if you need to create a more custom model for your use case, I recommend using AutoML Natural Language or training your own model with AI Platform Training.
Hope you enjoyed this demo. Feel free to contact me if you have any questions.
- Twitter: @AarneTalman
- Website: basement.ai