Data collection,cleaning, and presentation are a pain, especially when dealing with a multitude of sources. When APIs aren’t available and every step is taken to keep people from getting data, it can be incredibly tedious just to get the data. Parsing in this instance, of course, can be made easier by relating terms in a dictionary and using the documents structure to your advantage. At worst it is just a few lines of regex or several expath expressions and more cleaning with Pentaho. I’ve gone a bit further by enforcing key constraints and naming conventions with the help of Java Spring.
It seems that IBM is making this process a little less time consuming with Watson. Watson appears to have the capacity to find patterns and relations with minimal effort from a CSV or other structured file.
This could really benefit database design by applying a computer to the finding and illumination of the patterns driving key creation and normalization. After all, I would love to be able to focus less on key creation in a maximum volume industry and more on pumping scripts into my automated controllers. The less work and more productivity pre person, the more profit.