Article Ranking with Natural Language Inference
Demo on how to use an NLI model to rank news articles from a news feed
In my previous post I showed how to fine-tune a pre-trained transformers model for the natural language inference (NLI) classification task. In this post I'm taking a fine-tuned NLI model and use it to classify and rank articles in a news feed.
The idea is simple: we use an NLI model that has been trained on the MultiNLI task and pass an excerpt of the source text together with a search term we are interested in to the model. The model will then check if the source text entails the search term and returns a score. We can use these scores to rank the articles. We could use the model we trained on the previous demo, but luckily the people at Huggingface have made our lives much easier by releasing a new pipeline for zero-shot classification which uses a pre-trained NLI model.
So let's get started! First we need to install the required libraries and import them. In addition to the PyTorch and transformers libraries we are also installing and importing some libraries we need for retrieving and processing the news feeds and the articles.
!pip install transformers torch lxml bs4 feedparser
from transformers import pipeline, logging
import torch
import sys
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
import re
import feedparser
import time
from operator import itemgetter
from IPython.display import Markdown, display
from tqdm import tqdm
logging.set_verbosity_error()
Next we will define some functions we need when we process html content retrieved from the websites. The text_from_html
function uses BeautifulSoup
to filter out unwanted content like html tags, comments, etc. We also define a function we will use to print out the results in markdown format.
def tag_visible(element):
if element.parent.name in ['p']:
return True
if isinstance(element, Comment):
return False
return False
def text_from_html(html):
soup = BeautifulSoup(html.content, 'lxml')
texts = soup.findAll(text=True)
visible_texts = filter(tag_visible, texts)
return u" ".join(t.strip() for t in visible_texts)
def printmd(string):
display(Markdown(string))
We need to define our classifier function that will take the source text and the search term and provide the result. The model we are using has a 1024 limit for the input text. Note that this will significantly impact the results as we might be cutting out some important information. However, for our demonstration purposes we will not care about this. You could of course split the text into smaller junks and perform classification of those junks separately and then combine the results at the end.
def classifier(source_text, search_term):
src_text = source_text[:1024]
classification = pipeline("zero-shot-classification", device=0)
results = classification(src_text, search_term)
return results
Now that we have defined our classifier function we have to define the news feed we want to retrieve the articles from. For this demo I want to understand what news releases have come out from Amazon Web Services (AWS) about machine learning in the past 7 days. So I'm using AWS blog as the source and "machine learning" as the search term / classification label. We also define the number of articles we want to display. Let's say we want to see the top 4 articles about machine learning.
feed = "https://aws.amazon.com/blogs/aws/feed/"
search_term = "machine learning"
days = 7
number_of_articles = 4
Next, we retrieve the newsfeed using the feedparser
library. We then retrieve the html source for all the articles from the feed that have been published in the last 7 days. We use the text_from_html
function to extract the text from the html source and call the classifier function. Finally, we save the classification score and other relevant information for each article.
newsfeed = feedparser.parse(feed)
articles = []
entries = [entry for entry in newsfeed.entries if time.time() - time.mktime(entry.published_parsed) < (86400*days)]
for entry in tqdm(entries, total=len(entries)):
html = requests.get(entry.link)
src_text = text_from_html(html)
# This is where we call our classifier function using the source text and the search term
classification = classifier(src_text, search_term)
article = dict()
article["title"] = entry.title
article["link"] = entry.link
article["src_text"] = src_text
article["published"] = entry.published_parsed
article["relevancy"] = classification["scores"][0]
articles.append(article)
Now that we have a list of classified articles we can sort them using the classification scores.
sorted_articles = sorted(articles, key=itemgetter("relevancy"), reverse=True)
Before we display the results, I'm defining another useful function that utilises the transformers summarization pipeline. We will use this function to create a short summary of each article on our list.
def summarise(source_text):
src_text = source_text[:1024]
summarization = pipeline("summarization")
summary_text = summarization(src_text, min_length = 100)[0]['summary_text']
summary_text = re.sub(r'\s([?.!",](?:\s|$))', r'\1', summary_text)
return summary_text
Finally, we can summarise the texts for our top 4 articles and print the results in a sorted order based on their ranking.
print('*'*20 + ' Start of output ' + '*'*20)
for article in sorted_articles[:number_of_articles]:
summary = summarise(article["src_text"])
printmd("**{}**<br>{}<br>{}<br>**Search term:** {} | **Score:** {:6.3f}<br><br>".format(article["title"],
article["link"],
summary,
search_term,
100*article["relevancy"]))
print('*'*20 + ' End of output ' + '*'*20)
There we have it: a working article ranker using a pre-trained NLI model. Super easy and fun! There are literally hundreds of use cases where these models and pipelines can be used to create useful applications.
Hope you enjoyed this demo. Feel free to contact me if you have any questions.
- Twitter: @AarneTalman
- Website: basement.ai