Text Similarity with (and without) AI

Did you mean…?

There are several reasons why you might want to measure the similarity of two (or more) pieces of text. Your system might need to recommend articles about similar topics, or group together pieces of text that closely resemble each other. Spelling correction systems are another good example: they need to suggest words with high similarity to the misspelled word. And text search algorithms need to understand how relevant each search result is to your query.
Lately, Natural Language Processing (NLP) has been advancing at breakneck speed, and innovations keep popping up that let us solve, or at least solve a lot better, problems that have been hard for a very long time.

Edit distance

Let’s start with the example of spelling correction — for instance, Google’s ‘did you mean…’ feature. This problem has commonly been solved with edit distance algorithms, although recently Google has introduced “a new spelling algorithm that uses a deep neural net to significantly improve our ability to decipher misspellings. In fact, this single change makes a greater improvement to spelling than all of our improvements over the last five years”.

Anyway, edit distance is a way of quantifying how dissimilar two character sequences (strings) are to one another by counting the minimum number of operations required to transform one string into the other. It is commonly used as part of spelling correction systems that can find corrections by selecting the words from a dictionary that have a low distance to the word in question.

There are several types, such as Hamming distance, Levenshtein distance, Jaro distance, and more.

For instance, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. The example below shows that the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits (a small code sketch follows the list):

  1. kitten → sitten (substitution of “s” for “k”)
  2. sitten → sittin (substitution of “i” for “e”)
  3. sittin → sitting (insertion of “g” at the end).
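To make this concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the classic dynamic-programming approach. It is purely illustrative; in practice you would probably reach for an optimized library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; we only keep two rows at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```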

Jaccard similarity

While edit distance can work well for single words and short strings, it won’t be great when dealing with whole sentences. For comparing sentences, Jaccard similarity can be applied: the size of the intersection divided by the size of the union of the two sets of words in each sentence. Crudely put, it represents how many words the two texts have in common.
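A minimal sketch, using nothing more than lowercasing and whitespace tokenization:

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the intersection divided by the size of the union
    of the two texts' word sets."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # treat two empty texts as identical
    return len(words_a & words_b) / len(words_a | words_b)

print(jaccard_similarity("the cat sat on the mat",
                         "the dog sat on the log"))  # 3 shared words / 7 total ≈ 0.43
```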

To improve results, stopwords can be removed, and lemmatization or stemming can be used to reduce words to their root form. Stemming is a rule-based way of stripping parts of a word that look like grammatical morphemes (e.g. ‘crudely’ becomes ‘crude’), while lemmatization is more context-aware and groups different forms of a word by reducing them to the word’s lemma, or dictionary form. For instance, the verb ‘to walk’ can appear as ‘walk’, ‘walked’, ‘walks’ or ‘walking’, but the base form is ‘walk’. Another example is de-pluralization (e.g. ‘women’ becomes ‘woman’).
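A quick sketch with NLTK, assuming it is installed and the WordNet data has been downloaded (e.g. via `nltk.download("wordnet")`); exact outputs can vary slightly by NLTK version:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("crudely"))                   # rule-based stemming -> 'crude'
print(lemmatizer.lemmatize("women"))             # dictionary-based -> 'woman'
print(lemmatizer.lemmatize("walking", pos="v"))  # 'walk', when told it's a verb
```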

Bag-of-words and TF-IDF

The problem with Jaccard similarity is that we’re simply comparing overlapping words, but not all words are equal.
TF-IDF (term frequency-inverse document frequency) is a way to understand the importance or relevance of a word in a piece of text. TF-IDF, or a variation of it, is often used at the core of search engines to score and rank a document’s relevance to a user’s search query. It can also be used for other tasks such as text summarization or document classification.

Term frequency means how many times a word occurs in a document. If a word occurs many times in a document, it has high relevance to that document. The problem is that this will favor common words such as ‘the’, ‘a’, etc. The inverse document frequency is used to diminish the weight of those words that are common across all documents in a corpus. So TF-IDF basically ranks words on how important they are to the document versus how important they are to all documents. If a word is used frequently in a document, but not frequently in all documents, it has high importance to that document.
Simply counting how many times each word occurs in a text disregards grammar and even word order, which is why this representation is often referred to as a bag-of-words.
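As a sketch, scikit-learn’s TfidfVectorizer handles the counting and weighting, and cosine similarity between the resulting vectors gives a document-similarity score; the toy corpus below is just for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock markets fell sharply on Monday.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)  # sparse matrix: documents x vocabulary

similarities = cosine_similarity(tfidf)      # pairwise similarity between documents
print(similarities.round(2))                 # docs 0 and 1 score much higher than 0 and 2
```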

Word embeddings and neural networks

While TF-IDF is good for measuring text similarity, there are many different ways to describe the same thing, and synonyms and homonyms can make things even harder. To get a sense of the semantic similarity and contextual content, algorithms will need to understand words and their relationships.

Several years ago Google published Word2Vec, a group of models that produce word embeddings. Word embeddings are typically fixed-length vectors that can be seen as coordinates in a multi-dimensional space. By ‘training’ a model on large amounts of text, these coordinates are updated and ‘learned’ so that words that are closer together in the vector space are also more similar in meaning. A word embedding can thus be seen as an encoded meaning of a word.

One of the most notable examples of the power of word2vec’s embeddings was that you could take the vector for king, subtract man, add woman, and get close to the vector of queen. The angle and distance in the latent space between man and woman were similar to those of king and queen.
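You can try this yourself with gensim and the pre-trained Google News vectors (a large download); the model name below is the one exposed by gensim’s downloader:

```python
import gensim.downloader as api

# Pre-trained 300-dimensional word2vec vectors trained on Google News
vectors = api.load("word2vec-google-news-300")

# vector('king') - vector('man') + vector('woman') lands close to vector('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(vectors.similarity("king", "queen"))  # cosine similarity between the two words
```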

(Figure: visualizations of real embeddings showing geometric relationships that capture semantic relations, such as the relation between a country and its capital. Source: Google Developers)

There are multiple models with pre-trained word embeddings available for download; Word2Vec, GloVe, and fastText are probably the most famous.
Most deep learning language models, such as BERT (which I also briefly wrote about in Making your trained AI model faster and smaller), also learn word embeddings as part of training, and those can be used in a similar way.

Averaging embeddings

Let’s say we want to know the similarity of two articles. This could be done by getting the embeddings for all the words in each article, summing or averaging the vectors, and using cosine similarity to measure the distance between the two resulting vectors. However, summing or averaging the vectors like that loses semantic and contextual meaning, because every word gets the same weight. A better approach is to combine the embeddings with TF-IDF weights to get a weighted average.
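A minimal sketch of that idea, assuming `vectors` is the gensim KeyedVectors object from the earlier snippet. For simplicity it weights each word occurrence by its IDF value, which is one common way to approximate a TF-IDF-weighted average:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def weighted_doc_vector(doc, vectors, idf):
    """Average the word vectors of a document, weighted by IDF."""
    total = np.zeros(vectors.vector_size)
    weight_sum = 0.0
    for word in doc.lower().split():
        if word in vectors and word in idf:
            total += idf[word] * vectors[word]
            weight_sum += idf[word]
    return total / weight_sum if weight_sum else total

docs = ["the queen addressed the nation", "the king spoke to the country"]
tfidf = TfidfVectorizer().fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

doc_a, doc_b = (weighted_doc_vector(d, vectors, idf) for d in docs)
similarity = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(float(similarity), 3))
```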

Using a neural network’s hidden state

Embeddings are often created by training a language model. This involves using large amounts of text data and training a model that can predict the next word in a sentence, or something similar to that.

Neural nets, or more specifically encoder-decoder models, consist of an encoder and a decoder. Simply put, the encoder encodes the input into a hidden state, and the decoder takes that hidden state as input and produces the desired output. It’s possible to replace only the decoder of a model and continue training on a downstream task; this is known as fine-tuning. So once a language model is trained, it can be re-used and fine-tuned for all kinds of tasks, such as classification.

For semantic similarity, we’re interested in the hidden state of the model. You can think of the hidden state as what the neural net ‘understands’ from the input before it makes a final prediction or output.

(Figure: simplified representation of an encoder-decoder model for sentiment classification. The input is a string of text. The encoder encodes the text into something like a matrix of numbers, where each value represents a different characteristic of the input. The decoder’s responsibility is to interpret that encoded state and decode it back into the desired output, such as a vector of probabilities for each class.)

Let’s say we have a pre-trained language model and we have fine-tuned it for a classification task. We can now use the encoder to encode our documents and use cosine similarity of the encoded states of each input to measure similarity.
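Here is a sketch using the Hugging Face transformers library: encode two texts with a pre-trained BERT encoder, mean-pool the last hidden state over the tokens, and compare the pooled vectors with cosine similarity. The model name and the pooling strategy are illustrative choices, not the only options (libraries like sentence-transformers package this up for you):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Encode a text and mean-pool the encoder's last hidden state."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # average over the tokens

a = embed("The movie was fantastic, I loved it.")
b = embed("A really enjoyable film.")
print(torch.cosine_similarity(a, b, dim=0).item())
```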
What’s cool about this is that it works for more than just text data. This is what I was referring to when I wrote about the visual similarity of logos, and what I’ll be covering in a future article.

Recap

The field of NLP is moving insanely fast, and I’ve tried to cover a wide variety of old and new techniques used to measure text similarity. Relatively old and simple algorithms can still go a long way, but deep learning approaches are rapidly becoming the standard.
As with anything, when it comes to choosing one solution over another, the answer is always “it depends”. It depends on the nature of the problem you’re trying to solve as well as your other constraints and requirements.

I hope this has been interesting or useful to you, and if you have any questions or feedback please drop a comment. I’d love to hear from you!
