Sports, Running, and Big Data

After aggravating an old knee injury, I have started to get into running. It isn’t the kind of injury that prevents running, like a torn ACL, but merely a loss of the connecting tendons in my left knee. Like everyone before me, I am starting to realize that sports are a perfect place for data science, and a thought dawned on me: can I make myself faster using data? Is there an optimal set of parameters, even if I need dimensionality reduction techniques to handle too many factors, that can tell me the ideal BMI, the best training regimen, and more? I have nowhere near enough data for this thought experiment and I may never have it, but at least we can try. Even lacking data, I will see if I can at least get a regression analysis (the worst kind, where life and genetic factors play such a large role) in just a year or so.

The Data Points

Sports are physical, and so are the data points, to state the obvious. Running is also terrain based. But which points do I need to capture? This is the first question in the great science of data.

For terrain values, I need to understand course type, elevation gains and losses with average grade, percent trail, percent path, percent road, average altitude, peak altitude, minimum altitude, and mileage. This list may grow or shrink, but LSA and other dimensionality reduction techniques exist for when there are too many factors.

I also need to understand the physical side: Body Mass Index (BMI), average caloric intake, supplement use (possibly down to chemical composition), and lifting regimens for calves, hamstrings, quads, and core.

Weather is a factor in any sort of outdoor activity. Average temperature for runs during a period and temperatures at different points are both worth recording.

Finally, training is important. The same terrain values should be recorded for training runs. The average long run, average short run, and average run length should also be taken for different time periods before a course.
Average pace and mileage are also important and should likewise be taken at snapshots in time.

Data that are difficult to obtain but may be useful include medical history, past experience with the sport (in this case running), and periods of inactivity.

Technically, anything that could have an effect on the outcome should be recorded and placed into vectors for further analysis.
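
As a rough illustration of what "placed into vectors" could look like, here is a minimal sketch in Python. The RunRecord class, its field names, and the sample values are all made up for this example; the real list of fields would come from the data points described above.

```python
from dataclasses import dataclass, astuple


@dataclass
class RunRecord:
    # Hypothetical field names; add or drop columns as the data set evolves.
    miles: float
    elevation_gain_ft: float
    avg_grade_pct: float
    pct_trail: float
    pct_road: float
    avg_temp_f: float
    bmi: float
    avg_pace_min_per_mile: float


def to_vector(run: RunRecord) -> list[float]:
    """Flatten one record into a fixed-order list of floats."""
    return [float(value) for value in astuple(run)]


# Made-up values for a single run, in the field order declared above.
run = RunRecord(6.2, 450.0, 1.4, 30.0, 70.0, 58.0, 23.5, 8.75)
vector = to_vector(run)
print(vector)
```

Keeping the field order fixed matters more than the exact fields, since every later algorithm assumes that column three means the same thing in every vector.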

The Algorithms

Data science algorithms are many. Where there are hidden patterns and the goal is prediction, neural nets are a strong predictive technique.

Data science algorithms such as neural nets are dimensional: they rely on vectors. Each dimension (field) is placed in its own column of the vector. Neural nets analyze these vectors for patterns, using multiple layers to capture less obvious relationships. Nodes in each layer are connected to nodes in the previous and next layers. An activation function at each node determines how strongly its signal is passed on to the next layer, and the output of the final layer is used for prediction. For text, one or two ‘hidden’ layers are often sufficient. The hidden layers stand between the input layer (vectors separated by categories) and the output layer.
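
To make the layer idea concrete, here is a minimal sketch of a single forward pass through a net with one hidden layer, in plain Python. The weights are hypothetical and untrained, the three inputs are assumed to be scaled mileage, elevation, and temperature, and the sigmoid activation is just one common choice. A real model would learn its weights from data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output node sees every input node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> one hidden layer (sigmoid) -> one linear output node."""
    hidden = dense(x, w_hidden, b_hidden, sigmoid)
    return dense(hidden, w_out, b_out, lambda z: z)[0]


# Hypothetical inputs and untrained weights, for illustration only.
x = [0.6, 0.3, 0.5]
w_hidden = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]  # two hidden nodes
b_hidden = [0.0, 0.1]
w_out = [[1.5, -0.8]]  # one output node
b_out = [8.0]

prediction = forward(x, w_hidden, b_hidden, w_out, b_out)
print(prediction)
```

Training would consist of adjusting the weights and biases until predictions like this one match recorded results, which is where the data requirement comes from.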

Forming the input layer depends on the goal, whether it is mileage, time or both. Clustering algorithms can be used to separate the vectors into common categories. Algorithms exist to determine the proper number of clusters.
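
Here is a small sketch of that clustering step using k-means, written in plain Python. The six runs are made-up (miles, elevation gain in hundreds of feet) pairs, the first k points are used as starting centroids to keep the example deterministic, and comparing the within-cluster error across values of k (the "elbow" heuristic) is only one of several ways to pick the number of clusters.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Basic k-means; returns (labels, total within-cluster squared error)."""
    centroids = [list(p) for p in points[:k]]  # deterministic start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it was
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points]
    error = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, error


# Made-up runs: three short flat ones, three long hilly ones.
runs = [(3.0, 1.0), (4.0, 1.0), (3.5, 0.5), (12.0, 10.0), (13.0, 12.0), (11.0, 9.0)]

labels, error_k2 = kmeans(runs, 2)
print(labels)

# Elbow heuristic: look for the k after which the error stops dropping sharply.
for k in (1, 2, 3):
    print(k, kmeans(runs, k)[1])
```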

The convolutional neural network takes ideas from imaging and utilizes a convolution kernel, a small set of weights slid across the input, to pick up local patterns and further improve the network.
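
A kernel can slide over a sequence as easily as over an image. The sketch below applies a one-dimensional kernel to a made-up series of paces from consecutive runs. The kernel values are hand-picked here to show the mechanics, whereas a convolutional network would learn them during training.

```python
def convolve1d(signal, kernel):
    """Slide the kernel across the signal with no padding.

    As in most neural network libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    steps = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(steps)
    ]


# Made-up average paces (minutes per mile) for five consecutive runs.
paces = [9.0, 9.0, 8.5, 8.5, 8.0]

# A difference kernel: negative outputs mark runs that got faster.
change = convolve1d(paces, [-1.0, 1.0])
print(change)
```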

Unfortunately, neural nets take large amounts of data to train. To be accurate, each hidden pattern and input category needs enough data to establish it. This could be five or more data points, but the 32 data point rule is likely better. An alternative is a form of multiple regression, whether the highly likely logistic regression or the less likely exponential regression. Gains are likely to diminish as people become more fit and train harder.
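
To show what diminishing gains look like in a regression, here is a minimal sketch that fits pace against the logarithm of weeks trained using ordinary least squares. This is a simple log-linear fit, not the logistic or exponential models mentioned above, and the data are synthetic, generated from a known curve so the fit can be checked. The function name fit_log_curve is made up for this example.

```python
import math


def fit_log_curve(weeks, paces):
    """Least-squares fit of pace = a + b * ln(weeks); returns (a, b)."""
    xs = [math.log(w) for w in weeks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(paces) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, paces))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic data: pace starts at 10 min/mile and improves by a fixed
# amount each time the number of training weeks doubles.
weeks = [1, 2, 4, 8, 16]
paces = [10.0 - 0.5 * math.log(w) for w in weeks]

a, b = fit_log_curve(weeks, paces)
print(a, b)
```

A negative b means pace improves with training, while the logarithm means each additional week buys less than the one before it.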

Final Remarks

Stay tuned and I will try to find a working model.
