An Idea: Pre-process Clustering of Data as a Way of Improving Data Matching

While working through a side project involving clustering, classification, the Poisson distribution, and so on, I came across an interesting idea: what if we could cluster records into groups and use those groups as database keys for information retrieval? Inserting data into a database for data matching is a critical component of a data company’s business.

The benefits could be enormous if this is deployed correctly. Naive data matching is quadratic: comparing every record against every other takes n*(n-1) comparisons (n^2-n). Limiting the set of candidate matches and choosing an appropriate threshold to match on could speed up the process significantly for perhaps a few seconds of extra insert time. The customer gets faster matching with only a slight impact on inserts, as long as any training sets are saved appropriately.
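To put rough numbers on that, here is a small worked example. The record count and the number of clusters are made up for illustration, and the clusters are assumed to be perfectly even, which real data will not be.

```python
def pairwise_comparisons(n: int) -> int:
    """Comparisons needed to match n records against each other: n * (n - 1)."""
    return n * (n - 1)


def blocked_comparisons(block_sizes) -> int:
    """Comparisons needed when records are only matched within their own cluster."""
    return sum(pairwise_comparisons(size) for size in block_sizes)


# Hypothetical numbers: 10,000 records split evenly into 100 clusters.
n_records = 10_000
n_clusters = 100
block_sizes = [n_records // n_clusters] * n_clusters

full = pairwise_comparisons(n_records)
blocked = blocked_comparisons(block_sizes)

print(f"all pairs:      {full:,}")
print(f"within cluster: {blocked:,}")
print(f"reduction:      {full / blocked:.0f}x")
```

The saving shrinks when clusters are uneven, and any true match that lands in a different cluster is missed unless the search is widened.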

The most efficient way to do that may just be to make the cluster id one part of an index and a distance measurement (say, distance from the cluster's centroid) another. Could this suffice as an index on its own, without another key?

Probably not, since accuracy for algorithms such as KMeans, affinity propagation, and k-nearest neighbors (KNN, if you plan on applying SVD and then grouping into classes) varies greatly and is rarely within 5 percent of 100 percent. However, centroid- and similarity-based methods do have an advantage: with the proper index, a search can quickly be expanded to other nearby centroids.
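Expanding a search to nearby centroids could look something like the sketch below. It is plain Python with toy data. The names `nearest_centroids`, `probe_candidates` and `n_probe` are my own, and the dictionary stands in for whatever index the database would actually provide.

```python
import math


def nearest_centroids(query, centroids, n_probe=2):
    """Return the ids of the n_probe centroids closest to query (Euclidean)."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda cid: math.dist(query, centroids[cid]),
    )
    return ranked[:n_probe]


def probe_candidates(query, centroids, index, n_probe=2):
    """Collect the record ids stored under the n_probe nearest clusters."""
    found = []
    for cid in nearest_centroids(query, centroids, n_probe):
        found.extend(index.get(cid, []))
    return found


# Hypothetical toy data: three centroids and a cluster id -> record ids index.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
index = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
query = (3.0, 3.0)

print(nearest_centroids(query, centroids, n_probe=1))        # [1]
print(probe_candidates(query, centroids, index, n_probe=2))  # ['c', 'a', 'b']
```

Raising `n_probe` trades speed for recall: more clusters are searched, so fewer matches are lost to an imperfect cluster assignment.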

This is really just a thought at this point. I do not have a large database to play with. Scikit-learn, however, is a terrific tool to test this on. Try indexing on a cluster id together with each record's Euclidean distance from the centroid of its cluster, and do let me know how it works if you do.
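As a starting point, here is a minimal sketch of that experiment with scikit-learn's KMeans on synthetic data. The blob data, the cluster count, the in-memory list standing in for a database index, and the helper `candidates_near` are all assumptions for illustration. This is untested at any real scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for real records (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# transform() returns the distance to every centroid; keep each row's own one.
dist_to_own = km.transform(X)[np.arange(len(X)), cluster_ids]

# Composite key (cluster_id, distance): lexsort treats the LAST key as primary.
order = np.lexsort((dist_to_own, cluster_ids))
index_rows = [(int(cluster_ids[i]), float(dist_to_own[i]), int(i)) for i in order]


def candidates_near(query, radius):
    """Rows in the query's cluster whose centroid distance is within radius of the query's."""
    q = np.asarray(query, dtype=float).reshape(1, -1)
    cid = int(km.predict(q)[0])
    d_q = float(km.transform(q)[0, cid])
    # A real index would do a range scan here instead of a full pass.
    return [row for (c, d, row) in index_rows if c == cid and abs(d - d_q) <= radius]


print(index_rows[:3])
print(len(candidates_near(X[0], radius=1.0)), "candidates out of", len(X))
```

By the triangle inequality, any record within `radius` of the query that sits in the same cluster must have a centroid distance within `radius` of the query's. The distance band is therefore a safe filter inside one cluster, but not a match in itself: two records can be equally far from the centroid and still far from each other, and matches in neighbouring clusters are not covered.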

Fig 1: Choosing the Right Estimator, ca. 2015. Scikit-learn.

For good measure, don’t forget about Pandas. Also, Rocchio is missing from the chart above, so I am providing a good link to a tutorial on the Rocchio algorithm, even though the chart already includes k-nearest neighbors, in case it is a better algorithm for your situation. Rocchio uses centroids to group data points.
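If you want to try a Rocchio-style classifier without leaving scikit-learn, `NearestCentroid` is the closest fit that I know of. Its documentation notes that it is also known as the Rocchio classifier when used on tf-idf text vectors. The sketch below compares it with KNN on synthetic blobs, which is hypothetical data, so the scores say nothing about your own records.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Synthetic labelled data (hypothetical), split into train and test sets.
X, y = make_blobs(n_samples=400, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rocchio-style: one centroid per class, predict the class of the nearest one.
rocchio = NearestCentroid().fit(X_train, y_train)

# KNN for comparison: vote among the 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("centroids:", rocchio.centroids_.shape)  # one row per class
print("nearest centroid accuracy:", rocchio.score(X_test, y_test))
print("KNN accuracy:", knn.score(X_test, y_test))
```

The centroid model stores only one point per class, so it is cheap to keep alongside an index. KNN has to keep the whole training set around.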
