2. Organize the seed set
With the input set determined and the embedding representations retrieved, you have a few options for measuring similarity to the seed set of patents.
Let's go through each option in more detail.
1. Calculating an overall embedding point (centroid, medoid, etc.) for the entire input set and computing similarity against that single point. Under this method, one point is calculated to represent the entire input set, meaning that the input set of embeddings, which could contain information on hundreds or thousands of patents, ends up pared down to a single point (see the sketch after this list).
There are drawbacks to any methodology that depends on one point. If the point itself is not well chosen, all results from the search will be poor. Even if the point is well chosen, the search still depends on that one embedding alone, so all search results may represent the same narrow area of a topic or technology. By reducing the entire set of inputs to one point, you lose significant information about the input set.
2. Seed set x N similarity, i.e., calculating the similarity of each patent in the input set to all other patents. Under this method, you apply the chosen vector distance metric between every patent in the input set and every other patent in existence. This approach presents a few issues:
Lack of tractability. Calculating similarity for (seed_set_size x all_patents) pairs is expensive in both time and compute; for example, a seed set of 1,000 patents searched against tens of millions of patents means tens of billions of distance calculations.
Outliers in the input set are treated as equal to highly representative patents.
Dense areas around a single point could be overrepresented in the results.
Searching from the original input points may fail to expand beyond the space the inputs already cover.
3. Clustering the input set and computing similarity against each cluster. We recommend clustering as the approach to this problem, since it overcomes many of the issues presented by the other two methods. With clustering, information about the seed set is condensed into multiple representative points, none of which is an exact replica of a single input. With multiple representative points, you can capture the various technologies, features, etc. present in the input set.
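To make the trade-offs of option 1 concrete, here is a minimal sketch (with hypothetical array names, not the pipeline from this walkthrough) of collapsing the seed set to a single centroid and ranking candidates by cosine similarity against it:

```python
import numpy as np

def single_point_search(seed_embeddings, candidate_embeddings, top_k=10):
    """Option 1: collapse the seed set to one centroid, then rank candidates.

    seed_embeddings:      (n_seed, 64) array of seed patent embeddings.
    candidate_embeddings: (n_candidates, 64) array to search over.
    """
    # The entire seed set is reduced to a single 64-dimensional point.
    centroid = seed_embeddings.mean(axis=0)

    # Cosine similarity between the centroid and every candidate.
    norms = np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(centroid)
    sims = candidate_embeddings @ centroid / norms

    # Every candidate is judged against the same single point, so the top
    # results tend to come from one region of the embedding space.
    return np.argsort(-sims)[:top_k]
```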
3. Cluster the seed set
A couple of notes about the embeddings on BigQuery:
Each embedding is a vector of 64 numbers, meaning the data is high-dimensional.
As noted earlier, the embeddings were trained on a prediction task, not explicitly trained to capture the "distance" (difference) between patents.
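For reference, the embeddings can be pulled with the BigQuery client library. This is a minimal sketch assuming the public `patents-public-data.google_patents_research.publications` table and its `embedding_v1` column; the publication number shown is only a placeholder seed:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed names: the public Google Patents Research table and its
# 64-dimensional `embedding_v1` column; adjust to your own dataset.
query = """
SELECT publication_number, embedding_v1
FROM `patents-public-data.google_patents_research.publications`
WHERE publication_number IN UNNEST(@seed_set)
"""
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("seed_set", "STRING", ["US-9741000-B2"])
        ]
    ),
)
embeddings = {row.publication_number: list(row.embedding_v1) for row in job}
```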
Because of how the embeddings were trained, the clustering algorithm needs to handle clusters of varying density effectively. Since the embeddings were not trained to separate patents evenly, some areas of the embedding space will be denser than others while still representing similar information between documents.
Furthermore, with high-dimensional data, distance measures can degrade rapidly. One approach to overcoming this is to use a secondary metric to represent the notion of distance: rather than using absolute distance values, it has been shown that ranking data points by their distances (and discarding the distance magnitudes) produces more stable results on higher-dimensional data. So our clustering algorithm should not depend solely on absolute distance.
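As a small illustration of the rank idea (a hypothetical helper, not the walkthrough's code), raw pairwise distances can be reduced to neighbor orderings like this:

```python
import numpy as np

def neighbor_ranks(X):
    """Convert pairwise distances over (n, 64) embeddings into rank matrices."""
    # Pairwise Euclidean distances: (n, n).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # argsort twice turns distances into ranks: in each row, rank 0 is the
    # point itself, rank 1 its nearest neighbor, and so on. The distance
    # magnitudes no longer matter, only their ordering.
    return d.argsort(axis=1).argsort(axis=1)
```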
It's also important that the clustering method be able to detect outliers. With a large set of input patents, you can expect that not every document will fall into a clear sub-grouping. When the clustering algorithm cannot group certain documents, it should be able to set them aside rather than force them into a cluster.
Several clustering algorithms (hierarchical, clique-based, HDBSCAN, etc.) have the properties we require, and any of them could be applied to this problem in place of the algorithm used here. In this application, we used the shared nearest neighbor (SNN) clustering method to determine the patent groupings.
SNN is a clustering method that evaluates the neighbors of each point in a dataset and compares the neighbors shared between points to find clusters. SNN is useful for determining clusters of varying density, and it works well with high-dimensional data because the explicit distance value is not used in its calculation; instead, it uses the ranked neighborhood overlap between points. The complete clustering code is available in the GitHub repo; a simplified sketch of the core idea follows.
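This is a rough sketch of the SNN idea, not the repo's implementation; `k` and `min_shared` are illustrative parameters. It builds clusters from a k-nearest-neighbor search plus a shared-neighbor threshold, and leaves unlinked points as outliers:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_clusters(X, k=20, min_shared=10):
    """Toy shared nearest neighbor clustering.

    Links two points when each appears in the other's k-nearest-neighbor
    list and they share at least `min_shared` of those k neighbors, then
    takes connected components of the resulting graph. Unlinked points
    are labeled -1 (outliers).
    """
    n = len(X)
    # Column 0 of the neighbor result is the point itself, so ask for k + 1.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neigh = knn.kneighbors(X, return_distance=False)[:, 1:]
    neigh_sets = [set(row) for row in neigh]

    # Union-find over the SNN graph.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in neigh[i]:
            # Merge mutual neighbors that share enough of their neighborhoods.
            if i in neigh_sets[j] and len(neigh_sets[i] & neigh_sets[j]) >= min_shared:
                parent[find(i)] = find(j)

    # Relabel components 0..m-1; singleton components stay -1 (outliers).
    sizes = {}
    for i in range(n):
        sizes[find(i)] = sizes.get(find(i), 0) + 1
    labels, ids = np.full(n, -1), {}
    for i in range(n):
        r = find(i)
        if sizes[r] > 1:
            labels[i] = ids.setdefault(r, len(ids))
    return labels
```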
For each cluster found, we determine a representative point to perform the search against. Two common choices for a geometric center are the centroid and the medoid. The centroid is simply the mean of each of the 64 embedding dimensions across the cluster's members; the medoid is the cluster member whose average dissimilarity to all other members is minimized. In this walkthrough, we're using the centroid method.
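Both representative points are cheap to compute from a cluster's embedding matrix; here's a short sketch with hypothetical names:

```python
import numpy as np

def centroid(cluster):
    """Mean of each of the 64 embedding dimensions; not itself a real patent."""
    return cluster.mean(axis=0)

def medoid(cluster):
    """The actual cluster member with the smallest average distance to the rest."""
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
    return cluster[d.mean(axis=1).argmin()]
```

The trade-off: the centroid is cheapest but may land in a spot no real patent occupies, while the medoid is always an actual patent's embedding at the cost of a pairwise distance computation.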
Below you'll see a Python code snippet of the clustering application and calculations of some cluster characteristics, along with a visualization of the clustering results. The dimensions in the visualization were reduced using t-SNE, and outliers in the input set have been grayed out. Like colors indicate patents grouped into the same cluster:
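If you want to reproduce a similar plot yourself, a generic sketch with scikit-learn's t-SNE might look like the following (this is not the repo's plotting code; `X` and `labels` are assumed to come from the clustering step, with -1 marking outliers):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Reduce the (n, 64) embeddings to two dimensions for plotting.
xy = TSNE(n_components=2).fit_transform(X)

# Gray out the outliers; color the clustered points by cluster id.
is_outlier = labels == -1
plt.scatter(*xy[is_outlier].T, c="lightgray", s=10)
plt.scatter(*xy[~is_outlier].T, c=labels[~is_outlier], cmap="tab10", s=10)
plt.show()
```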