Identify Things that are Similar

April 4 2019

In this article we demonstrate how the K-means algorithm, an unsupervised machine learning technique, can be used to group similar documents using their feature vectors.

We have seen how to use tf-idf to identify what words or phrases make a document unique, and how those tf-idf vectors can be used to compare one document against another. Your next question might be: can we use some more fancy math to group similar documents using their entire tf-idf vectors, and not just one or two words? Funny you should ask, because the answer is yes, there is some fancy math we can use. (For those of you who jumped right to this article and are saying "what the tf-idf are we doing?", please go back and have a quick read of the previous two articles.)

The fancy math is K-means clustering. This is also where we depart from preparing and comparing data and enter the realm of Artificial Intelligence, or Machine Learning. K-means clustering is one of the most widely used unsupervised machine learning algorithms; it forms clusters of documents based on how similar the documents are to one another.

What is K-means

The first step is to set the number of clusters, or groups, you want to organize your documents into. The K in K-means refers to the number of clusters you want to create. There can be a lot of discussion about what K should be, depending on the number of documents and what you are looking for. We use a few different values of K depending on the collection and the task; the exact values and where we use them are a trade secret.

The K-means algorithm starts by randomly choosing one point as the center point for each cluster. The center point is the perfect match for being a member of a group. Think of it as setting up a way to group something: put all the murder mystery books here, all the sci-fi books there, and the romance books over there.

The algorithm then sorts the data into the groups. It then determines how different each of the documents is from being a perfect match for the group it has been assigned to. Perhaps there are a bunch of sci-fi romance books, or sci-fi murder mysteries, or murder mystery romances. We would make a judgement call as to which group the book is more like: this book is more sci-fi than romance, or this one is way more romance than murder mystery.

The algorithm then uses the information on how different each of the documents is from the ideal to form new groups, by adjusting the centers of the groups, which changes what counts as a perfect match. In our book example we might drop sci-fi and create a new group, action-adventure.

The algorithm repeats this process until it has created the most distinct groupings it can, making each document as close as possible to the perfect match for its group.
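In practice you rarely write this loop yourself. A minimal sketch of the whole procedure using scikit-learn is below; the tiny corpus here is made up purely for illustration, and real use would involve many more documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A made-up three-document corpus for illustration.
docs = [
    "the detective found the body in the library",
    "the starship jumped to warp and left orbit",
    "she kicked the ball and he threw it back",
]

# Turn each document into a tf-idf vector.
tfidf = TfidfVectorizer().fit_transform(docs)

# Ask for two clusters; n_init restarts the algorithm with
# different random starting centers and keeps the best result.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

print(km.labels_)  # the cluster assigned to each document
```

Setting `random_state` makes the random choice of starting centers repeatable, which is handy when you want the same grouping on every run.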

How does it work?

The algorithm uses the straight-line distance between each data instance and the center of each group to determine which group a document belongs to. It will continue to move the center point of each of the groups until the distance between each member of a group and the center point of the group is minimized.
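The "straight-line distance" here is just the Euclidean distance; a minimal sketch in Python:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# In two dimensions this is the familiar Pythagorean distance:
# a 3-4-5 right triangle has a hypotenuse of 5.
print(euclidean((0, 0), (3, 4)))  # 5.0
```

The same formula works unchanged on full tf-idf vectors, which simply have one dimension per word instead of two.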

Not very helpful? Let's look at a simple example. We will again turn to the "Kick vs Throw" example. For expedience's sake, let us assume the tf-idf scores for kick and throw for several documents are as displayed in the table below.

Now let's plot these coordinates on a grid, as shown below. Let us also assume the algorithm has selected the first three points as the initial centers of the three groups we have asked for. These centers are marked by an x on the grid. Remember, these points are randomly selected so that we don't pre-determine where a group might be located.

The algorithm then assigns each document to the group it is closest to, calculating the straight-line distance between the points. As an example, the distance from Doc3 to Doc4 is 1, and the distance from Doc3 to Doc8 is about 5. At the start the algorithm puts Doc1, Doc3, Doc4, and Doc6 into Group 1. It places Doc2, Doc7, and Doc9 in Group 2. It places Doc5 and Doc8 in Group 3.

The algorithm then determines new group centers by calculating the average of the x and y coordinates of all the points within each group. This is shown in the image below.
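Averaging the coordinates of a group's members to get its new center can be sketched as follows (the points here are made up, not the ones from the figures):

```python
def new_center(points):
    """New group center: the mean of the x and y coordinates of its members."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n)

# Three made-up member points; the center lands in their middle.
print(new_center([(1, 1), (3, 3), (2, 5)]))  # (2.0, 3.0)
```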

The algorithm then calculates the distance from each of the points to each of the new group centers. The distance from the Group 1 center to Doc4 is 0.8, and the distance from the Group 1 center to Doc6 is 3.

The algorithm then once again places the documents into the groups that they are closest to. The new document groups are:

Group1: Doc1, Doc3, Doc4

Group2: Doc2, Doc7, Doc9

Group3: Doc6, Doc5, Doc8

The algorithm then determines the new centers by again averaging the x and y coordinates of the documents within each group. The new centers are displayed in the figure below.

The algorithm stops this cycle of measuring and adjusting once the members of the group stop changing.
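Putting the pieces together, the measure-and-adjust cycle can be sketched from scratch; the points and starting centers below are made up for illustration, not the ones from the figures:

```python
import math

def kmeans(points, centers, max_iter=100):
    """Assign points to their nearest center, recompute centers,
    and repeat until the centers (and so the memberships) stop changing."""
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its closest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the average of its group.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
               for i, g in enumerate(groups)]
        if new == centers:  # nothing moved, so the grouping is stable
            break
        centers = new
    return groups, centers

# Two obvious clumps of made-up points, with two starting centers.
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
groups, centers = kmeans(pts, centers=[(0, 0), (5, 5)])
print(groups)
```

In real use the starting centers are chosen at random (or by a smarter seeding scheme), which is why different runs can settle on different groupings.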

What to make of the results

Once the K-means algorithm is done, we can treat the groups as the most distinct groupings of the documents. One way we use this technique is to take a set of documents we understand deeply, create the groupings via K-means, and then see which group an unknown document would be placed in. This is a simple first step toward document classification.
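That last step, dropping an unknown document into its best-fitting group, is just a nearest-center lookup. A small sketch, with made-up centers and a made-up tf-idf vector for the unknown document:

```python
import math

def classify(vector, centers):
    """Return the index of the group center closest to the given vector."""
    return min(range(len(centers)),
               key=lambda i: math.dist(vector, centers[i]))

# Suppose K-means on the known documents produced these two centers
# (kick score, throw score); the unknown document lands nearer the second.
centers = [(0.9, 0.1), (0.1, 0.8)]
print(classify((0.2, 0.7), centers))  # 1
```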