Using state-of-the-art summarization to cut through tedious documents
Boring reading is for machines
I am Guillaume Barrois, CTO at eXplain. Our mission is to transform how companies do business with governments. To do so, we develop software that helps them find information in the huge quantity of documents produced by local governments.
Imagine you are a business developer at a windmill farm company. Your job is to find towns and villages willing to have windmills built on their territory. One key source of the information you need is the official documents published by the various local public institutions. But there are hundreds or thousands of small municipalities, each publishing documents. Every morning, there is the equivalent of 1,000 pages of documents you should read to be sure not to miss a critical piece of information, e.g., a competitor planning a new project or a municipal council taking a stand against windmill farms.
To make things worse, these official documents can be long, are often written in a technical, verbose style, and are difficult to read. See for instance the documents below:
Scanning through the whole text to identify relevant excerpts, reading these excerpts carefully, and remembering their main message can all be time-consuming.
This makes for a long, tedious experience. At eXplain, we believe this type of non-human-friendly reading should be helped by AI. Boring reading is for machines.
Our technology already enables us to scan through all the documents and identify the relevant excerpts. And yet, many excerpts remain long and tedious to read.
Therefore, we decided to go one step further, to help our users read and understand each excerpt faster. To do this, we would leverage state-of-the-art summarization techniques.
In theory, abstractive summarization techniques should help us solve two problems at once:
- quickly read long excerpts of interest and identify their main message,
- reformulate them in an easy-to-read, straight-to-the-point way.
But would abstractive summarization actually work on our specific dataset and use cases?
The summarization revolution: extractive vs. abstractive summarization
Until a few years ago, the main way to summarize text automatically was to use extractive summarization1. Extractive summarization selects a subset of sentences from the text to create a summary. These methods are fundamentally limited by the fact that only part of the text itself can be used as its summary.
But the development of large, transformer-based language models, building on encoders such as BERT, has changed everything: it is now possible for algorithms to write text indistinguishable from human-written text. This led to huge improvements in abstractive summarization techniques, in particular using BART architectures2: it is now possible to generate new, better-phrased sentences (not included in the original text) as the summary of a given text.
In the last couple of years, libraries such as Hugging Face's Transformers, together with the availability of models in languages other than English, have made it even easier for companies to apply these models to their own data.
Finding the right language model is not enough
Model used: BARThez
Thanks to the work of Moussa Kamal Eddine et al. at École Polytechnique3, there is a great model corresponding to our use case! BARThez is a BART model trained on a large French corpus. A version of this model has been fine-tuned on OrangeSum, a dataset of 23k French newspaper article summaries. Luckily for us, this model is open source and available on the Hugging Face Hub, making it very easy to integrate. We decided to use the fine-tuned version, even though our documents (official documents vs. newspaper articles) are of a different nature. More on this below.
Off-the-shelf results
When you ask BARThez to summarize an excerpt of an administrative document, you can obtain very impressive results such as the summary below.
But in other cases, BARThez summaries are much less impressive:
The quality of the summaries written by BARThez is heterogeneous. As the author, Moussa Kamal Eddine, told me when I interviewed him, this is probably a consequence of the difference in domain (official documents vs. newspaper articles) between our dataset and the OrangeSum dataset used for fine-tuning. So one option for us would have been to fine-tune the model on a corpus drawn from our data. However, creating such a dataset from scratch is difficult (writing good summaries is actually quite hard) and time-consuming. Therefore, it was not an option for a first iteration.
The second option was to integrate BARThez into a pipeline and build additional components around it to improve the quality of the output.
Adding components around the model to improve performance
The documents are first selected from the corpus based on a request made by a user (for instance “éolien dans le Nord-Pas-de-Calais”) using our database of administrative documents indexed in Elasticsearch. Those documents are then sent to a data pipeline that I will describe in more detail below.
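As a rough sketch, the retrieval step amounts to a full-text query plus highlighting to extract candidate excerpts. The field name (`text`), the fragment size, and the result size below are all illustrative assumptions; the post does not describe eXplain's actual Elasticsearch mapping.

```python
def build_search_query(user_request: str, size: int = 50) -> dict:
    """Build an illustrative Elasticsearch full-text query for a user request.
    The field name "text" and all parameter values are hypothetical."""
    return {
        "size": size,
        "query": {
            "match": {
                "text": {
                    "query": user_request,
                    "operator": "or",
                }
            }
        },
        # Ask Elasticsearch to return highlighted fragments around the hits;
        # these become the candidate excerpts for summarization.
        "highlight": {"fields": {"text": {"fragment_size": 1000}}},
    }

query = build_search_query("éolien dans le Nord-Pas-de-Calais")
```

The query body would then be passed to an Elasticsearch client's search call; the downstream pipeline consumes the highlighted fragments rather than whole documents.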
The excerpts returned by Elasticsearch first need to be pre-processed. As the text is extracted by OCR from documents of varying quality, it can be noisy and needs to be cleaned of artifacts (page numbers, text from tables or captions…). When an excerpt is too noisy (too short, lists of names, charts and data tables, etc.), it is simply discarded.
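A minimal sketch of such a cleaning step could look like the following. The specific rules (page-number pattern, 20-word minimum, letter-ratio threshold) are assumptions for illustration, not eXplain's actual heuristics.

```python
import re
from typing import Optional

def clean_excerpt(text: str) -> Optional[str]:
    """Light OCR cleanup: drop page-number artifacts, collapse whitespace,
    and return None when the excerpt is too noisy to summarize.
    All thresholds are illustrative."""
    # Remove isolated page-number lines such as "12" or "Page 12/48".
    lines = [
        line for line in text.splitlines()
        if not re.fullmatch(r"\s*(page\s*)?\d+\s*(/\s*\d+)?\s*", line, re.IGNORECASE)
    ]
    cleaned = re.sub(r"\s+", " ", " ".join(lines)).strip()

    # Discard excerpts that are too short to carry a message.
    if len(cleaned.split()) < 20:
        return None
    # Discard excerpts that look like tables or charts:
    # a low ratio of letters suggests mostly digits and punctuation.
    alpha = sum(c.isalpha() for c in cleaned)
    if alpha / max(len(cleaned), 1) < 0.6:
        return None
    return cleaned
```

Excerpts that come back as `None` are dropped before reaching the summarizer.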
The BARThez summarizer is deployed on AWS SageMaker using the Hugging Face API. Summarization tasks can then be triggered via a call to a dedicated API endpoint. This allows us to easily scale up capacity when we need to quickly generate summaries of a large number of excerpts.
As we know that the quality of the summaries is heterogeneous, we cannot be fully confident that the “best” summary chosen by the model is of sufficient quality. We need an a posteriori evaluation to decide whether a summary is good enough. Therefore, we ask the model to generate many options, with sufficient variability. Specifically, we use group beam search (to explore different branches of the probability tree during generation) with a diversity penalty (to penalize beams returning the same tokens) to increase diversity, and we keep 12 candidate summaries to be evaluated in the post-processing phase.
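In Hugging Face's Transformers, group beam search is enabled by setting `num_beam_groups > 1` together with `diversity_penalty` in the `generate` call. A sketch of such a configuration is below; the specific values (other than keeping 12 candidates) are illustrative, not the ones used in production.

```python
def candidate_generation_kwargs(num_candidates: int = 12) -> dict:
    """Illustrative generation settings for producing diverse candidate
    summaries with Transformers' `generate`. Values other than the
    12 returned candidates are assumptions."""
    return {
        "num_beams": num_candidates,             # total beams explored
        "num_beam_groups": 4,                    # must divide num_beams; >1 enables group beam search
        "diversity_penalty": 1.0,                # penalize groups that repeat each other's tokens
        "num_return_sequences": num_candidates,  # keep all candidates for post-processing
        "max_length": 80,                        # target one- to two-sentence summaries
        "early_stopping": True,
    }

kwargs = candidate_generation_kwargs()
# The actual call would then be: model.generate(**inputs, **kwargs)
```

The 12 returned sequences are then handed to the post-processing phase rather than trusting the single top-scoring beam.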
Post-processing consists of several ad hoc rules aimed at removing common errors encountered in generated summaries (cleaning) or discarding low-quality summaries (validation).
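Two toy examples of such rules are sketched below: one cleaning rule (collapsing a word the model repeated) and one validation rule (rejecting short or highly repetitive candidates). These are stand-ins to illustrate the idea; the real rule set is not described in the post.

```python
import re

def clean_summary(summary: str) -> str:
    """Cleaning rule: fix common generation artifacts (illustrative)."""
    s = summary.strip()
    # Collapse immediately repeated words ("le le conseil" -> "le conseil"),
    # a common artifact of text generation.
    s = re.sub(r"\b(\w+)( \1\b)+", r"\1", s, flags=re.IGNORECASE)
    # Make sure the summary ends with terminal punctuation.
    if s and s[-1] not in ".!?":
        s += "."
    return s

def is_valid_summary(summary: str) -> bool:
    """Validation rule: drop obviously bad candidates (illustrative thresholds)."""
    words = summary.split()
    if len(words) < 5:                         # too short to carry a message
        return False
    if len(set(words)) / len(words) < 0.5:     # heavily repetitive text
        return False
    return True
```

Candidates that fail validation are removed; the survivors move on to the final selection step.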
Finally, to select the summary, we combine a custom evaluation of its intrinsic quality with a measure of similarity to the initial request. This gives us the final summary: a one- or two-sentence version of the initial excerpt.
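The selection step can be sketched as a weighted scoring over the surviving candidates. The token-overlap similarity, the quality-score interface, and the weighting below are assumptions for illustration; the actual measures used by the pipeline are not detailed in the post.

```python
import re

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between lowercase token sets - a simple stand-in
    for the pipeline's real similarity measure."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def select_summary(candidates, request, quality_score, alpha=0.5):
    """Pick the candidate maximizing a weighted mix of intrinsic quality
    (a caller-supplied scoring function) and similarity to the request."""
    return max(
        candidates,
        key=lambda s: alpha * quality_score(s)
        + (1 - alpha) * token_overlap(s, request),
    )
```

With a constant quality score, the candidate closest to the user's request wins; in practice the two signals are balanced against each other.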
The results obtained using this whole pipeline improve significantly over the results obtained directly from BARThez:
In general, combining BARThez with the pre- and post-processing steps allows us to generate summaries that are more relevant, contain fewer errors, and are better written.
To quantify the overall quality of the generated summaries, we ran our summarization over 200 documents, covering several topics (the formerly hotly debated issue of Notre-Dame-des-Landes in the west of France, a Center Parcs project in Gironde, etc.). Each summary was categorized as relevant or not relevant. Overall, 80.5% of the summaries were relevant. In the other 19.5% of cases, what made the summary non-relevant was vague text or text unrelated to the topic.
While we are not at 100% relevancy yet, these results are clearly good enough to be put in front of a user. By running our summarization pipeline on all the documents about a topic in an area, we are able to give a quick overview of the events, discussions, and other data points gathered from thousands of documents. We save our users precious hours of work and offer an augmented, human-friendly reading experience.
Next steps: two paths to domain-specific fine-tuning
To further improve the quality of the summaries we generate, the obvious next step is to fine-tune the model on our domain (official documents in French), which means constructing a domain-specific summary dataset. How can we do that?
The most straightforward path would be to build a summary dataset from scratch by manually writing summaries for parts of official documents. However, this task requires well-trained readers and writers, and is long and hard to standardize.
An alternative is to leverage the pipeline we have already built to design an iterative fine-tuning cycle. Indeed, we can use our pipeline to generate a first dataset of summaries and ask humans to review them, deciding whether to keep them, discard them, or edit away obvious mistakes. Evaluating or lightly editing summaries is much easier for humans than writing summaries from scratch. Moreover, the process can be made iterative: the fine-tuned model is used to generate a new dataset, which is again reviewed by human annotators, and so on, until convergence.
Combined with document retrieval, abstractive summarization techniques have proved extremely valuable to help our users dig into lengthy textual information. Thanks to the open source communities, we have found that state-of-the-art models can be deployed relatively easily. Yet, they still require careful adaptation to specific data and use cases. We began by building additional components on top of an open source model and by selecting among multiple outputs of the model.
We will now continue working on our model to improve the quality of the summaries and handle new types of documents. Stay tuned!