More students are entering CS, which is both scary and interesting for job seekers, yet Americans struggle with math. That will likely send attrition rates skyward, but it also draws out a crucial opinion: it may be better to have a math degree than a CS degree.
There are some very basic reasons:
- Mathematics, especially linear algebra, is of increasing importance. A CS major takes linear algebra as well as probability and statistics, and those courses are full of interpolation and independence, but a mathematics major understands the material better.
- Programming is now everywhere. R and Python are becoming staples of the programming world, whereas everyone wants to get out of Java.
- As I start a company, I find myself using more math than CS. In fact, CS sits on the sidelines: systems and other components are subservient to the demands for memory, speed, and distribution that come from the mathematics I’ve learned from the papers of others.
- Some schools all but require a CS minor for mathematics majors, which can include databases, algorithms, and basic programming (the areas where math majors are weak).
- There is just more logic in a maths course.
Perhaps it is time to take another look at the CS curriculum and incorporate more math or create a stronger combined major. I still place CS higher than any business degree, but if I start hiring, a mathematician may be a priority.
While working through a side project involving clustering, classification, the Poisson distribution, and so on, I came across an interesting idea: what if we could cluster data into groups and use those groups to form database keys for information retrieval? Inserting data into a database for data matching is a critical component of a data company’s business.
The benefits are enormous if deployed correctly. Data matching is a quadratic algorithm: matching every record against every other takes n*(n-1) comparisons (n^2-n), or half that if each pair only needs to be compared once. Limiting the set of candidate matches and choosing an appropriate threshold to match on could speed up the process significantly at the cost of just a few seconds of extra insert time. The customer gets faster matching with only a slight impact, as long as any training sets are saved appropriately.
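To make the quadratic growth concrete, here is a small sketch that counts comparisons with and without restricting matches to records in the same cluster. The function names and the evenly split toy data are made up for illustration; real clusters will rarely be this balanced.

```python
from collections import Counter


def all_pairs(n: int) -> int:
    """Ordered comparisons when every record is matched against every other."""
    return n * (n - 1)


def blocked_pairs(cluster_ids) -> int:
    """Ordered comparisons when records are only matched within their own cluster."""
    sizes = Counter(cluster_ids)
    return sum(size * (size - 1) for size in sizes.values())


# Hypothetical example: 1,000 records spread evenly over 10 clusters.
cluster_ids = [i % 10 for i in range(1000)]

full = all_pairs(len(cluster_ids))
blocked = blocked_pairs(cluster_ids)
print(full, blocked)  # 999000 99000
```

With ten equal clusters the work drops roughly tenfold. The saving shrinks when one cluster dominates, and any true match that lands in a different cluster is missed unless the search is widened.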
The most efficient way to do that may just be to use the cluster id as one part of an index and a distance measurement as another. Could this suffice as an index on its own, without another key?
Probably not, since the accuracy of algorithms such as KMeans, affinity propagation, and k-nearest neighbors (KNN), if you plan on using SVD and then grouping into classes, varies greatly and is rarely within 5 percent of 100 percent. However, centroid- and similarity-based methods do have an advantage: with the proper index, a search can quickly be expanded to other nearby centroids.
This is really just a thought at this point, as I do not have a large database to play with. Scikit-learn, however, is a terrific tool to test this on. Try indexing on a cluster id as well as the Euclidean distance from the centroid of that group, and do let me know how it works if you do.
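As a starting point for that experiment, here is a minimal sketch using scikit-learn's KMeans on synthetic data. The `candidates` helper, the `n_probe` parameter, and the choice of three clusters are my own assumptions for illustration, and the two arrays stand in for what would be index columns in a real database.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for real records: 300 points in 2-D around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# The two "index columns": cluster id and Euclidean distance to that
# cluster's centroid.
cluster_id = kmeans.labels_
dist_to_centroid = np.linalg.norm(X - kmeans.cluster_centers_[cluster_id], axis=1)


def candidates(query, n_probe=1):
    """Row indices in the n_probe clusters whose centroids are nearest to query.

    Hypothetical helper: raising n_probe widens the search to nearby centroids.
    """
    d = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    nearest = np.argsort(d)[:n_probe]
    return np.flatnonzero(np.isin(cluster_id, nearest))


query = X[0]
narrow = candidates(query, n_probe=1)  # only the closest cluster
wide = candidates(query, n_probe=2)    # expanded to the next-nearest centroid
print(len(narrow), len(wide), len(X))
```

The stored distance could be used to prune further inside a cluster, for example by comparing only rows whose distance to the centroid is close to the query's. I have not measured how many true matches that would drop.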
Fig 1: Choosing the Right Estimator, ca. 2015. Scikit-learn.
For good measure, don’t forget about Pandas. Also, Rocchio is missing from the set above, so despite the inclusion of KNN I am providing a good link to a tutorial on the Rocchio algorithm in case it is a better fit for your situation. Rocchio uses centroids to group data points.
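For a feel of how centroids group data points, here is a from-scratch nearest-centroid classifier in the spirit of Rocchio. It is a simplified sketch, not the tutorial's implementation: the function names and toy data are made up, and it uses plain Euclidean distance on raw coordinates rather than the tf-idf vectors Rocchio is usually applied to.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """Average the points of each class to get one centroid per class."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in groups.items()
    }


def predict(centroids, point):
    """Assign a point to the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Made-up toy data: two classes in 2-D.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]

centroids = fit_centroids(points, labels)
print(predict(centroids, (0.5, 0.5)))  # a
print(predict(centroids, (5.5, 5.5)))  # b
```

Unlike KNN, prediction only compares against one centroid per class, so the cost does not grow with the number of stored points.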