Open Source Data Science, the Great Resource No One Knows About

There is a growing industry for online technology courses that is starting to gain traction among many who may have been in school when certain fields like data science were still the plaything of graduate students and phds in Computer Science, statistics, and even, to a degree, biology. However, these online courses will never match the pool of knowledge one could drink from by even taking an undergraduate Computer Science or mathematics class at a middling state school today (I would encourage everyone to avoid business schools like the plague for technology).

In an industry that is constantly transforming itself and especially where the field of data will provide long-term work, these courses may appear quite appealing. However, they are often too shallow to provide much breadth and just thinking that it is possible to pick up and understand the depth of the 1000 page thesis that led to the stochastic approach to matrix operations and eventually Spark is ridiculous. We are all forgetting about the greatest resources available today. The internet, open source code, and a search engine can add layers of depth to what would otherwise be an education not able to provide enough grounding for employment.

Do Take the Online Courses

First off, the online courses from Courses from Coursera are great. They can provide a basic overview of some of the field. Urbana offers a great data science course and I am constantly stumbling into blogs presenting concepts from them. However, what can someone fit into 2-3 hours per week for six weeks in a field that may encompass 2-3 years of undergraduate coursework and even some masters level topics to begin to become expert-level.

You may learn a basic K Means or deploy some subset of algorithms but can you optimize them and do you really know more than Bayesian probabilities that you likely also learned in a statistics class.

Where Open Source Fits In

Luckily, many of the advanced concepts and a ton of research is actually available online for free. The culmination of decades of research is available at your fingertips in open source projects.

Sparse Matrix research, edge detection algorithms, information theory, text tiling, hashing, vectorizing, and more are all available to anyone willing to put in the time to learn them adequately.

Resources

Documentation is widely available and often on github for:

These github accounts also contain useful links to websites explaining the code, containing further documentation (javadocs), and giving some conceptual depth and further research opportunities.

A wide majority of conceptual literature can be found with a simple search.

Sit down, read the conceptual literature. Find books on topics like numerical analysis, and apply what you spent tens or even hundreds of thousands of dollars to learn in school.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s