Sports, Running, and Big Data

After aggravating an old knee injury, I have started to get into running. It isn't the kind of injury that prevents running outright, like a torn ACL, but merely a loss of the connecting tendons in my left knee. Like everyone before me, I am starting to realize that sports are a perfect place for data science, and a thought dawned on me: can I make myself faster using data? Is there an optimal set of parameters, even if I need to apply dimensionality reduction to too many factors, that can tell me the ideal BMI, the best training regimen, and more? I have nowhere near enough data for this thought experiment, and I may never, but at least we can try. Even lacking data, I will see if I can at least get a regression analysis (the worst kind, where life and genetic factors play such a large role) in just a year or so.

The Data Points

To state the really obvious, sports are physical, and so are the data points. Running is terrain based, another obvious point. But what data points do I need to capture? That is the first question in the great science of data.

For the terrain values, I need to understand course types: elevation gains and losses with average grade, percent trail, percent path, percent road, average altitude, peak altitude, minimum altitude, and mileage. This list may grow or shrink, but LSA and other dimensionality reduction techniques exist.

I also need to understand the physical side: Body Mass Index (BMI), average caloric intake, and supplement use, possibly down to chemical composition. Lifting regimens for the calves, hamstrings, quads, and core are also important.

Weather is a factor in any sort of outdoor activity, so the average temperature for runs during a period and the temperatures at different points in a run should be captured as well.

Finally, training itself is important. The same terrain values should be accounted for, along with the average long run at different time periods before a course, the average short run for those time periods, and the average run length over those periods. Average pace and mileage are also important and should be taken as snapshots in time.

Data that are difficult to obtain but may still be useful include medical history, past experience with the sport (in this case running), and periods of inactivity.

Technically, anything that could have an effect on the outcome should be recorded and placed into vectors for further analysis.
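As a rough sketch of what one observation might look like, here is a small Scala record built from the fields above. The field names, units, and the choice of pace as the value to predict are my own assumptions for illustration, not a fixed schema.

    // Hypothetical run record; fields and units are illustrative only.
    case class RunRecord(
      miles: Double,
      elevationGainFt: Double,
      avgGradePct: Double,
      pctTrail: Double,
      pctRoad: Double,
      avgTempF: Double,
      bmi: Double,
      avgWeeklyMileage: Double,
      paceMinPerMile: Double // the outcome we would like to predict
    )

    object RunRecord {
      // Flatten a record into the numeric vector a model would consume.
      def toVector(r: RunRecord): Array[Double] =
        Array(r.miles, r.elevationGainFt, r.avgGradePct, r.pctTrail,
              r.pctRoad, r.avgTempF, r.bmi, r.avgWeeklyMileage)
    }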

The Algorithms

Data science algorithms are many. Where there are hidden patterns and the goal is prediction, neural networks are a strong technique.

Data science algorithms such as neural networks are dimensional: they rely on vectors, with each dimension (field) placed in the appropriate column of the vector. A neural network analyzes these vectors for patterns across multiple layers. Nodes in each layer are connected to nodes in the previous and next layers, and activation functions determine how strongly each node feeds into the next layer. The output layer is used for prediction. For problems such as text, one or two 'hidden' layers are sufficient; a hidden layer sits between the input layer (the feature vectors, possibly separated by category) and the output layer.
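To make the layer idea concrete, here is a toy feedforward pass in Scala. The weights, inputs, and the choice of a sigmoid activation are made up purely for illustration; a real network would be trained with a library rather than hand-coded.

    // Minimal feedforward sketch: output_j = sigmoid(bias_j + sum_i input_i * weight_ji)
    object TinyNet {
      def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

      // One layer: each row of `weights` holds the incoming weights for one node.
      def layer(input: Array[Double], weights: Array[Array[Double]], bias: Array[Double]): Array[Double] =
        weights.indices.map { j =>
          sigmoid(bias(j) + input.indices.map(i => input(i) * weights(j)(i)).sum)
        }.toArray

      def main(args: Array[String]): Unit = {
        val input  = Array(6.2, 350.0, 3.1) // e.g. miles, elevation gain, average grade
        val hidden = layer(input, Array(Array(0.1, 0.01, 0.2), Array(-0.3, 0.05, 0.1)), Array(0.0, 0.0))
        val output = layer(hidden, Array(Array(0.7, -0.4)), Array(0.1)) // single prediction node
        println(output.mkString(", "))
      }
    }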

Forming the input layer depends on the goal, whether that is mileage, time, or both. Clustering algorithms can be used to separate the vectors into common categories, and methods such as the elbow criterion exist to determine the proper number of clusters.
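As a sketch of that clustering step, here is a bare-bones k-means in Scala with placeholder run vectors. In practice a library implementation would be used, and k would be chosen with something like the elbow criterion rather than hard-coded.

    import scala.util.Random

    object SimpleKMeans {
      type Vec = Array[Double]

      def dist(a: Vec, b: Vec): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      def mean(vs: Seq[Vec]): Vec =
        vs.transpose.map(col => col.sum / col.length).toArray

      def kmeans(data: Seq[Vec], k: Int, iters: Int = 20): Seq[Vec] = {
        var centers: Seq[Vec] = Random.shuffle(data).take(k)
        for (_ <- 1 to iters) {
          // Assign each vector to its nearest center, then move each center to its group's mean.
          val groups = data.groupBy(v => centers.minBy(c => dist(v, c)))
          centers = centers.map(c => groups.get(c).map(mean).getOrElse(c))
        }
        centers
      }

      def main(args: Array[String]): Unit = {
        // Placeholder vectors: (miles, elevation gain, average pace).
        val runs = Seq(
          Array(3.0, 100.0, 9.5), Array(3.2, 120.0, 9.4),
          Array(10.0, 900.0, 11.0), Array(11.0, 950.0, 11.2))
        kmeans(runs, 2).foreach(c => println(c.mkString(", ")))
      }
    }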

The convolutional neural network takes ideas from imaging and utilizes a convolution kernel to further improve the network.

Unfortunately, neural nets take large amounts of data to train. To be accurate, each of the hidden patterns and input categories needs enough examples to establish it; that could be as few as five, but the 32 data point rule of thumb is likely safer. An alternative is a form of multiple regression, most likely logistic regression, or less likely an exponential fit, since gains tend to diminish as people become more fit and train harder.
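Since regression is the realistic near-term fallback, here is the simplest possible version: a closed-form, single-variable least squares fit in Scala. The weekly mileage and pace numbers are invented for illustration; logistic or multiple regression would follow the same pattern with more machinery.

    // Simple linear regression: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
    object SimpleFit {
      def fit(x: Array[Double], y: Array[Double]): (Double, Double) = {
        val (mx, my) = (x.sum / x.length, y.sum / y.length)
        val slope = x.zip(y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum /
                    x.map(xi => (xi - mx) * (xi - mx)).sum
        (my - slope * mx, slope) // (intercept, slope)
      }

      def main(args: Array[String]): Unit = {
        val weeklyMiles = Array(10.0, 20.0, 30.0, 40.0) // made-up training volume
        val paceMinMile = Array(10.0, 9.2, 8.6, 8.2)    // made-up race pace
        val (b, m) = fit(weeklyMiles, paceMinMile)
        println(f"pace = $b%.2f + ($m%.3f * weeklyMiles)")
      }
    }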

Final Remarks

Stay tuned and I will try to find a working model.


Headless Testing and Scraping with Java FX

There is a lot of JavaScript in the world today, and there is a need to get things moving quickly. Whether testing multiple websites or acquiring data for ETL and/or analysis, a tool needs to exist that does not leak memory as badly as Selenium. Until recently, Selenium was really the only option for WebKit, although JCEF and writing native bindings for Chromium have been options for a while. Java 7 and Java 8 have stepped into the void with the JavaFX tools, which can be used to automate scraping and testing where plain network calls for HTML, JSON, CSVs, PDFs, or whatnot are more tedious and difficult.

The FX Package

FX is much better than the television channel, with some exceptions. Java created a sleek embedded browser based on WebKit. While WebKit suffers from some serious setbacks, Java FX also incorporates nearly any part of the java.net framework: setting SSL handlers, proxies, and the like works the same as with java.net. Therefore, FX can be used to intercept traffic (e.g. directly stream incoming images to a file named by URL without making more network calls), present a nifty front end controlled by JavaScript, and query the page for components.
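A minimal sketch of the idea in Scala, using the standard javafx.scene.web.WebEngine: load a page, wait for the load worker to succeed, then pull the rendered DOM with executeScript. The URL is a placeholder, and note this still spins up the JavaFX runtime (headless operation is covered in the Ui4j section below).

    import javafx.application.{Application, Platform}
    import javafx.beans.value.{ChangeListener, ObservableValue}
    import javafx.concurrent.Worker
    import javafx.scene.Scene
    import javafx.scene.web.WebView
    import javafx.stage.Stage

    class FxScrape extends Application {
      override def start(stage: Stage): Unit = {
        val view   = new WebView()
        val engine = view.getEngine
        // Wait for the page (and its JavaScript) to finish loading, then read the DOM.
        engine.getLoadWorker.stateProperty.addListener(new ChangeListener[Worker.State] {
          override def changed(obs: ObservableValue[_ <: Worker.State],
                               oldState: Worker.State, newState: Worker.State): Unit =
            if (newState == Worker.State.SUCCEEDED) {
              val html = engine.executeScript("document.documentElement.outerHTML").toString
              println(html.take(500))
              Platform.exit()
            }
        })
        engine.load("https://example.com")   // placeholder URL
        stage.setScene(new Scene(view))      // the stage never needs to be shown just to scrape
      }
    }

    object FxScrape {
      def main(args: Array[String]): Unit = Application.launch(classOf[FxScrape], args: _*)
    }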

Ui4J

Ui4j is every bit as nifty as the FX package. While FX is not capable of going headless without a lot of work, Ui4j takes the work out of such a project using Monocle or Xvfb. Unfortunately, there are some issues getting Monocle to run by setting -Dui4j.headless=true on the command line or via system properties after jdk1.8.0_20; Oracle removed Monocle from the JDK after this release, forcing programs to use OpenMonocle instead. However, xvfb-run -a works equally well, with the -a option automatically choosing a server number. The GitHub site does claim compatibility with Monocle, though.

On top of headless mode, the authors have made working with FX simple: run JavaScript as needed, incorporate interceptors with ease, and avoid nasty waitFor calls and Selenese (an entire language within your existing language).
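A rough Ui4j sketch in Scala, assuming the BrowserFactory/navigate calls from the project's README as I recall them (executeScript also appears in the TestFX section below); the URL is a placeholder.

    import com.ui4j.api.browser.{BrowserEngine, BrowserFactory, Page}

    object Ui4jPull {
      def main(args: Array[String]): Unit = {
        // Equivalent to passing -Dui4j.headless=true on the command line.
        System.setProperty("ui4j.headless", "true")
        val browser: BrowserEngine = BrowserFactory.getWebKit
        val page: Page = browser.navigate("https://example.com") // placeholder URL
        // Run JavaScript against the rendered DOM instead of parsing raw HTML.
        val html = page.executeScript("document.documentElement.innerHTML").asInstanceOf[String]
        println(html.take(500))
        browser.shutdown()
      }
    }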

TestFX

TestFX is an alternative to Ui4j that is geared towards testing. Rather than wrapping an assert around (String) page.executeScript("document.documentElement.innerHTML"), methods such as verifyThat exist. Combine it with Scala and have a wonderfully compact day. The authors have also managed a workaround for the Monocle problem.
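For flavor, here is what a tiny TestFX test might look like when written in Scala. The button id and label are invented for the example, and the imports follow the TestFX 4 JUnit pattern, so treat it as a sketch rather than a definitive setup.

    import javafx.scene.Scene
    import javafx.scene.control.Button
    import javafx.stage.Stage
    import org.junit.Test
    import org.testfx.api.FxAssert.verifyThat
    import org.testfx.framework.junit.ApplicationTest
    import org.testfx.matcher.control.LabeledMatchers.hasText

    // Hypothetical UI under test: a single button with an id and label.
    class ButtonSpec extends ApplicationTest {
      override def start(stage: Stage): Unit = {
        val button = new Button("run report")
        button.setId("report")
        stage.setScene(new Scene(button, 200, 100))
        stage.show()
      }

      @Test def reportButtonHasLabel(): Unit =
        verifyThat("#report", hasText("run report"))
    }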

Multiple Proxies

The only negative side effect of FX is that multiple instances must be run to use multiple proxies, since Java (and Scala, for that matter) sets one proxy per JVM. Luckily, both Java and Scala have subprocess modules. The lovely, data friendly language that is Scala makes this task as simple as Process("java -jar myjar.jar -p my:proxy").!. The call blocks until the command completes and returns the exit status (see Futures to make this non-blocking); use a tool like Scopt to parse the proxy argument and set it in the new browser session. Better yet, take a look at my Scala macros article for some tips on loading code from a file (please don't pass it on the command line). RMI would probably be a bit better for large amounts of code, but it may be possible to better secure a file than compiled code using checksums.
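Expanding the one-liner above into a runnable sketch: one child JVM per proxy, each blocking call returning its exit status. The jar name, -p flag, and proxy addresses are hypothetical.

    import scala.sys.process._

    object ProxyRunner {
      def main(args: Array[String]): Unit = {
        // Hypothetical proxy list and jar; each child JVM sets its own proxy.
        val proxies = Seq("10.0.0.1:8080", "10.0.0.2:8080")
        val exitCodes = proxies.map { proxy =>
          // Blocks until the child exits and returns its exit status;
          // wrap in a Future for a non-blocking variant.
          Process(s"java -jar myjar.jar -p $proxy").!
        }
        println(exitCodes)
      }
    }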

Conclusion

Throw out Selenium, get rid of the extra Selenese parsing, and pick up Ui4j or TestFX for WebKit testing. Sadly, they do not work with Gecko, so Chromium is needed to replace those tests and obtain such terrific options as --ignore-certificate-errors. There are cases where fonts served over SSL will wreak havoc before you can even handle the incoming text, no matter how low level you write your connections. For simple page pulls, stick to Apache HTTP Components, which contains a fairly fast asynchronous client (with mid-tier RAM usage) usable in Java or Scala. Sorry for the brevity, folks, but I tried to answer a question or two that was not in the tutorials or documentation. Busy!
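For those simple page pulls, a sketch with Apache HttpAsyncClient 4.x from Scala might look like this; the URL is a placeholder and error handling is kept to the bare minimum.

    import org.apache.http.HttpResponse
    import org.apache.http.client.methods.HttpGet
    import org.apache.http.concurrent.FutureCallback
    import org.apache.http.impl.nio.client.HttpAsyncClients
    import org.apache.http.util.EntityUtils

    object SimplePull {
      def main(args: Array[String]): Unit = {
        val client = HttpAsyncClients.createDefault()
        client.start()
        val future = client.execute(new HttpGet("https://example.com"), // placeholder URL
          new FutureCallback[HttpResponse] {
            override def completed(resp: HttpResponse): Unit =
              println(EntityUtils.toString(resp.getEntity).take(200))
            override def failed(ex: Exception): Unit = ex.printStackTrace()
            override def cancelled(): Unit = println("cancelled")
          })
        future.get() // block so the JVM does not exit before the callback runs
        client.close()
      }
    }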

Tune Postgres for Faster Inserts

Insertions in PostgreSQL take a while, especially for large amounts of data. One project of mine at work relies on large batch updates, while another issues single insert statements because they are preferable for its workload. There are ways to tune PostgreSQL to make sure that you are getting the most out of your database.

These variables should be set in postgresql.conf.

  • Tune the WAL (write-ahead logging) by setting wal_level to minimal and ensuring a decent wal_writer_delay: when WAL writes are frequent and the content is large, this can greatly affect servers. In my case, 10,000 record inserts went from 30 minutes to 2 seconds by extending the wal_writer_delay from effectively per transaction (0 seconds) to 100 milliseconds; the default delay is 200 milliseconds. (See the sample postgresql.conf excerpt after this list.)
  • Increase thread limits: increased thread limits help large transactions. A 10,000 record batch update in my case went from 15 minutes to 3 seconds when the thread limit went from 10,000 to 50,000 threads.
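For reference, the first bullet might translate into postgresql.conf lines like the following. The values simply mirror the numbers above, and the setting the post calls "wal_logging" is named wal_level in the PostgreSQL documentation; the "thread limit" in the second bullet depends on your client or connection pool, so it is not shown here.

    # postgresql.conf excerpt (illustrative values only)
    wal_level = minimal          # least verbose WAL; note this rules out WAL-based replication
    wal_writer_delay = 100ms     # flush WAL roughly every 100 ms instead of the 200 ms default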

Be aware that other configuration options will also affect your system, and that WAL logging is necessary for replication, since WAL records are used to recreate state in a hot-standby environment.