So, I would like to find specific data from scraped web pages, PDFs, and just about anything under the sun without taking a lot of time. After looking over various fuzzy matching algorithms such as Jaro-Winkler, Metaphone, and Levenshtein, and finding that no single one had a wide enough application, I decided that developing a regular-expression-based distance algorithm might be more feasible.
The idea is simple: start with a regular expression, build a probability distribution across one or more known-good data sets, and test for the appropriate expression across every web page. The best score across multiple columns would be the clear winner.
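As a rough sketch of what that scoring could look like, here is one way to rate candidate expressions against known-good values. The function name, samples, and the choice of "fraction of full matches" as the score are my own placeholders, not a finished design:

```python
import re

def pattern_score(pattern, known_values):
    """Fraction of known-good values the candidate pattern matches in full.

    Higher scores mean the pattern better describes the training data,
    so it is more likely to pull the same kind of value out of a new page.
    """
    compiled = re.compile(pattern)
    hits = sum(1 for value in known_values if compiled.fullmatch(value))
    return hits / len(known_values) if known_values else 0.0

# Known-good samples of the field to extract (illustrative only).
samples = ["ABC 123", "XYZ 9", "QRS 4501"]

candidates = [r"[A-Z]+\s+[0-9]+", r"[a-z]+\s+[0-9]+", r"[A-Z\s0-9a-z]+"]
for candidate in candidates:
    print(candidate, pattern_score(candidate, samples))
```

Run over a real data set, the candidate with the best score across every column would be the one kept.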
Building out the expression would include taking known-good data and either finding a combination of the base pattern and the data that works or building an entirely new pattern. Patterns that appear across a large proportion of the set should be combined. If [A-Z]+[\s]+[0-9]+[A-Z] and [A-Z]+[\s]+[0-9]+ appear often in the same or an equivalent place, or even [a-z]+[\s]+[0-9]+, then the result should likely be [A-Z\s0-9a-z]+, provided the set is similarly structured (see the sketch below). Since the goal is to save time in programming regular expressions to further parse XPath or other regular expression results, this is useful.
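A very naive version of that combining step could just union the character classes of patterns that describe the same field. This is only an illustration of the idea above; the helper and its name are hypothetical:

```python
import re

def merge_character_classes(patterns):
    """Union the [...] character classes of patterns describing the same field,
    producing one broader class, as in the [A-Z\s0-9a-z]+ example above.
    """
    classes = set()
    for pattern in patterns:
        # Pull the contents of every [...] class out of the pattern.
        classes.update(re.findall(r"\[([^\]]+)\]", pattern))
    merged = "".join(sorted(classes))
    return f"[{merged}]+"

print(merge_character_classes([r"[A-Z]+[\s]+[0-9]+", r"[a-z]+[\s]+[0-9]+"]))
# -> "[0-9A-Z\sa-z]+"
```

A real implementation would have to respect position (merging only classes that occur in the same slot of the pattern) rather than flattening everything, but the principle is the same.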
The tricky part of the project will be designing a similarity score that adequately relates the expressions without producing too many outliers. Whether this is done with a simple difference test resulting in a statistical distribution or with a straightforward score still needs to be tested.
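One candidate for such a simple difference test is an overlap measure on the character classes two expressions use. This is just one possible scoring function among several worth testing, not the chosen one:

```python
import re

def expression_similarity(pattern_a, pattern_b):
    """Rough similarity between two expressions based on shared character classes
    (Jaccard overlap); purely illustrative.
    """
    tokens_a = set(re.findall(r"\[[^\]]+\]", pattern_a))
    tokens_b = set(re.findall(r"\[[^\]]+\]", pattern_b))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(expression_similarity(r"[A-Z]+[\s]+[0-9]+[A-Z]", r"[A-Z]+[\s]+[0-9]+"))  # 1.0
print(expression_similarity(r"[A-Z]+[\s]+[0-9]+", r"[a-z]+[\s]+[0-9]+"))       # 0.5
```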
In all likelihood, recurring words should be used to break ties or bolster weak scores.
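That tie-breaker might look something like the following, where text pulled out by a pattern earns a small bonus for each word that also recurs in the known data. The weight and vocabulary here are assumptions for the sake of the example:

```python
from collections import Counter

def recurring_word_bonus(extracted_text, known_vocabulary, weight=0.1):
    """Small bonus per word that also appears in the known data set
    (hypothetical tie-breaker; the weight is an untuned assumption).
    """
    words = Counter(extracted_text.lower().split())
    overlap = sum(count for word, count in words.items() if word in known_vocabulary)
    return weight * overlap

vocabulary = {"suite", "street", "avenue"}
print(recurring_word_bonus("742 Evergreen Street Suite 3", vocabulary))  # 0.2
```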
The new project will hopefully be available on SourceForge for data integration and pre-curation processes.