How Bad Data Collection is Messing Up Data Analysis

Big data is driving the world, but are companies driving their big data programs correctly? Here I make an argument for more generic algorithms (now that I know more about this subject after working on it for the past year) and for better test data. Basically, my rant from initial research is now an awesome plug for SimplrTek and SimplrTerms and whichever ABC-style company comes out of SimplrTek research.

Data Collection

I need to make a clarification and a confession: I make up data for my own purposes at my own LLC, but only for testing (a previous statement was a little ambiguous here, since I work for a company and am also trying to create one, futile as that may seem in this increasingly competitive market). This is the sort of task that can hurt a company's bottom line if done wrong, and made-up data should never be sold.

But why can using this sort of data mess with building large-scale, timely algorithms?

Companies are basing their decisions on the results of distributions built from samples that may not be representative or even correct. Algorithms follow suit: they are driven and affected in large part by the shape of the data they are built with. They are predictive, but they work more like exponential smoothing than rectangular or triangular smoothing (they base decisions on what they were trained on in the past). Basically, current approaches are often not adaptive to change or corrective for awful data and, while likely using machine learning, use it in a way that is rigid and inflexible.
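To make the smoothing analogy concrete, here is a minimal sketch (the series, alpha, and window values are invented for illustration): an exponentially smoothed estimate keeps blending in all of its history, while a rectangular (plain moving-average) estimate forgets anything outside its window.

```python
# Toy contrast: exponential smoothing drags old history along,
# rectangular smoothing forgets it once it leaves the window.

def exponential_smooth(series, alpha=0.3):
    """Each estimate blends the newest point with ALL prior history."""
    est = series[0]
    out = [est]
    for x in series[1:]:
        est = alpha * x + (1 - alpha) * est
        out.append(est)
    return out

def rectangular_smooth(series, window=3):
    """Each estimate is a plain average of only the last `window` points."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# A series whose true level suddenly shifts from 10 to 20.
data = [10.0] * 5 + [20.0] * 5

exp_est = exponential_smooth(data)
rect_est = rectangular_smooth(data)

# The rectangular average fully reaches 20 three points after the shift;
# the exponential estimate is still weighed down by the old level.
print(round(rect_est[-1], 2))  # 20.0
print(round(exp_est[-1], 2))   # still below 20.0
```

A model trained once on a stale sample behaves like the exponential case: the past never fully washes out.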

The results of making up data and using poor distributions or records can thus have a deeply wrong impact on a company's bottom line. If the distribution shows that the best way to expand the number of records (testing often occurs on a portion of the records) without throwing it off is to create or use a 30-year-old, camino-driving pot smoker who also happens to be a judge, something is seriously wrong. If your models and algorithms are based on this, your company is screwed: your algorithm may take pot smoking to be the key to how that judge rules. In production, with thousands and even hundreds of thousands of records being requested in a timely manner, there is no time to make sure that the different groups in the data used to build a model are good representations of the groups marked for analysis.
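A cheap sanity check can catch the worst of this before training ever starts. Here is a hedged sketch (the category names and the 10% tolerance are made up for illustration): compare each group's share in the sample against its share in the full population, and refuse to build on samples that drift too far.

```python
# Sketch: reject a sample whose group proportions stray too far
# from the population's. Tolerance of 0.10 is an arbitrary choice.

def proportions(records, key):
    """Share of each category value among the records."""
    counts = {}
    for r in records:
        counts[r[key]] = counts.get(r[key], 0) + 1
    total = len(records)
    return {k: v / total for k, v in counts.items()}

def is_representative(sample, population, key, tol=0.10):
    """True if every group's share in the sample is within `tol`
    of that group's share in the population."""
    pop = proportions(population, key)
    samp = proportions(sample, key)
    return all(abs(samp.get(k, 0.0) - share) <= tol
               for k, share in pop.items())

# Judges are 5% of the population, but half of the (bad) sample.
population = [{"occupation": "judge"}] * 5 + [{"occupation": "clerk"}] * 95
bad_sample = [{"occupation": "judge"}] * 50 + [{"occupation": "clerk"}] * 50

print(is_representative(bad_sample, population, "occupation"))  # False
```

In a real pipeline you would run a proper goodness-of-fit test instead of a fixed tolerance, but even this crude gate stops the camino-driving-judge sample at the door.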

This affects everything from clustering to hypothesis testing (whether or not the testing follows from clustering). How well would marketing aimed at the MMJ crowd be received by our supposedly camino-driving judge, unless, of course, he really isn't sober as a judge? So, by all means, find a representative sample when building projects, and spend the money to purchase good test data.

Bad data is a huge problem.

A good part of the solution is to collect data from the environment related to the specific task. I would say: design better surveys with open-ended questions, keep better track of customers with better database design, centralize data collection, and modernize the process with a decent system and little downtime. It is also possible to just flat out purchase everything. However, this should be incredibly obvious.

Fix a Problem with Responsive Algorithms and Clustering Techniques or Neural Nets

Now for a plug. I am working on algorithms that can help tackle this very problem: generic algorithms that remove intuition, pre-trained modelling, and thus the aforementioned problems from data. Check us out; our demos are starting to materialize. If you would like to help or meet with us, definitely contact us as well.

Still, one thing I am finding, judging by the bright deer-in-the-headlights look that related questions provoke, is that people fail to adequately generate test data. Cluster the data on the known groups that will use it. For instance, my pot-smoking judge could be ferreted out by clustering against representative samples of judges and criminals and setting a cosine cutoff distance so that only test records that fit the judge category well are kept. For more variation, maybe use a neural net trained on good records, blatantly bad records, and records from somewhere in between, and use the same cutoff approach to generate test data.
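The cosine-cutoff idea can be sketched in a few lines. This is a toy, not the production approach: the feature vectors, the 0.9 cutoff, and the "years on bench / prior convictions / income" features are all invented for illustration. A generated record is kept only if it sits close enough, by cosine similarity, to the centroid of known-good judge records.

```python
# Sketch of the cosine-cutoff filter: keep a candidate test record only
# if it resembles the centroid of real records from its target group.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical features: (years on bench, prior convictions, income index)
judges = [[20.0, 0.0, 9.0], [15.0, 0.0, 8.0], [25.0, 1.0, 9.5]]
judge_center = centroid(judges)

def accept(record, center, cutoff=0.9):
    """Keep the generated record only if it fits the group well."""
    return cosine(record, center) >= cutoff

plausible_judge = [18.0, 0.0, 8.5]
pot_smoking_outlier = [3.0, 12.0, 2.0]   # looks nothing like a judge

print(accept(plausible_judge, judge_center))      # True
print(accept(pot_smoking_outlier, judge_center))  # False
```

The same accept/reject gate works whether the score comes from cosine distance, as here, or from a trained classifier's confidence, as in the neural-net variation above.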

You may ask: why not just make the records by hand? Because it is time consuming. Big data algorithms need gigabytes or terabytes of data, and with real data you can do things like map or predict fake income ranges, map people to actual locations, and build demos and the like.

Whatever is chosen, a little thought goes a long way.
