Text is difficult to store in a database. Structured numeric data is much easier, since the distinguishing attributes are usually obvious (e.g. form a primary key on age, income, and zip code). Text, by contrast, is fuzzy and often contains multiple topics. There is a clustering-based approach that could alleviate most, if not all, of this problem. Using topic modeling, representative documents chosen the way K-means++ chooses its initial K centers, and the notion behind a blocking key (think blocks of objects), it is possible to develop a valid index, and even a primary key, for storing whole text documents in a database.
Topic modelling encompasses quite a few algorithms, from Bayesian classifiers to neural nets to LDA. Given a block of documents and a set number of topics, a related method, Latent Semantic Analysis, can form a generic model using Singular Value Decomposition and some basic magic. If the document set is somewhat better defined, speed is important due to re-indexing, and documents are plentiful, training a naive Bayesian model might be preferable. NLTK has a decent trainer for Python programmers. The advantage of Bayes is not having to choose one category but being able to use many: a probability is determined for each category based on the 'features' provided, with feature frequencies driving the probabilities. Still, with more hard-line modelling it is possible to rank categories by taking cosines against each group's most representative document, or through other means.
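To make the per-category-probability idea concrete, here is a minimal sketch of a naive Bayesian trainer in plain Python. The function names, toy documents, and Laplace smoothing are illustrative choices; NLTK's `nltk.NaiveBayesClassifier.train` and `prob_classify` offer the same workflow over feature dictionaries.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, category). Returns a simple model."""
    cat_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, cat in labeled_docs:
        cat_counts[cat] += 1
        word_counts[cat].update(tokens)
        vocab.update(tokens)
    return cat_counts, word_counts, vocab

def prob_classify(model, tokens):
    """Return a Laplace-smoothed probability for every category."""
    cat_counts, word_counts, vocab = model
    total_docs = sum(cat_counts.values())
    log_probs = {}
    for cat in cat_counts:
        lp = math.log(cat_counts[cat] / total_docs)  # prior
        denom = sum(word_counts[cat].values()) + len(vocab)
        for tok in tokens:
            lp += math.log((word_counts[cat][tok] + 1) / denom)
        log_probs[cat] = lp
    # normalize log-probabilities into a distribution over categories
    m = max(log_probs.values())
    exps = {c: math.exp(lp - m) for c, lp in log_probs.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}

docs = [
    ("buy cheap pills".split(), "spam"),
    ("meeting agenda schedule".split(), "ham"),
]
model = train_nb(docs)
probs = prob_classify(model, "buy cheap".split())
```

Because every category receives a probability, the best category can be stored as an attribute while the full distribution remains available for finer-grained decisions.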
Either way, storing the best category choice as an attribute is likely a good solution.
Choosing Representative Documents
A drawback of assigning only a category is that the user or program often wants to glean specific pieces of information, not just the general topic. For this, individual clusters within each category grouping are useful.
It is possible to use the idea behind K-means++ seeding to find representative texts in each category. Start by finding the most representative document, as this will be needed anyway. Next, compute the cosine similarities and take the document farthest from it; this is the second representative document. Then take the document whose average cosine similarity to the chosen documents is lowest. Continue this process up to an optimal number of documents, perhaps four or five. Vectorize these documents for cosine analysis and save them in an appropriate table, likely with a lookup table mapping category name to key and using that key as a foreign key in this table. PostgreSQL allows NoSQL data to be stored in a more relational format using the hstore extension or, since 9.4, the jsonb type.
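The selection process above can be sketched as follows. This is a minimal illustration, assuming documents are already vectorized; the seed here is the medoid (the document most similar, on average, to all others), and each subsequent pick is the document least similar on average to those already chosen.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def representative_docs(vectors, k=4):
    """Pick k representative vectors, K-means++ style:
    seed with the medoid, then repeatedly add the document
    least similar, on average, to those already chosen."""
    n = len(vectors)
    first = max(range(n), key=lambda i: sum(
        cosine(vectors[i], vectors[j]) for j in range(n) if j != i))
    chosen = [first]
    while len(chosen) < min(k, n):
        nxt = min((i for i in range(n) if i not in chosen),
                  key=lambda i: sum(cosine(vectors[i], vectors[c])
                                    for c in chosen) / len(chosen))
        chosen.append(nxt)
    return chosen

vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
reps = representative_docs(vectors, k=2)  # indices of the representatives
```

Four or five representatives per category, as suggested above, keeps the later cosine computations cheap.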
The representative documents should then be used to compute the cosine similarity of each document in the topic cluster against each representative. All of this data will be stored for the blocking key.
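A per-document cosine profile might look like the following sketch. The `rep_` key names and four-digit rounding are illustrative assumptions; the resulting dictionary is the kind of payload that could land in a PostgreSQL jsonb column.

```python
import json
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cosine_profile(doc_vec, rep_vecs):
    """Cosine of one document against every representative document,
    stored whole alongside the document."""
    return {f"rep_{i}": round(cosine(doc_vec, r), 4)
            for i, r in enumerate(rep_vecs)}

row = cosine_profile([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
payload = json.dumps(row)  # serialized form for a jsonb column
```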
Building a Blocking Key
Building the blocking key is a crucial step. It speeds up indexing and querying by limiting large document sets to a manageable number for future analysis. It also has the added advantage of accommodating more or less data easily (again, see PostgreSQL's NoSQL capabilities). The key can be built by concatenating all of the discovered data, formatting numbers to a fixed width of course, or through a more complicated process. Another option, still treating this data as a string, is to use something akin to Horner's method to generate a hash key for the data. Key generation should avoid topic overlap as much as possible; think of the new table more like a hash table. Since cosine similarity comes from vectors with magnitude and direction but, on this data, does not provide true distance and direction, using distance for the blocking key is more difficult. A structured encoding such as ASN.1 can combine multiple strings, but simply concatenating the strings together may produce a better numeric solution.
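The string-then-hash option can be sketched as below. The separator, the four-digit fixed-width formatting, base 31, and the bucket count are all illustrative assumptions; any consistent choices would do, as long as they are applied identically at index time and query time.

```python
def blocking_key(category_id, cosines, buckets=10_000):
    """Concatenate the category id with fixed-width cosine values,
    then hash the string Horner-style (polynomial rolling hash)."""
    # fixed-width formatting so keys compare and sort consistently
    parts = [f"{category_id:04d}"] + [f"{c:.4f}" for c in cosines]
    s = "|".join(parts)
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % buckets  # Horner's method, base 31
    return h

key = blocking_key(1, [0.5, 0.25])
```

Documents sharing a blocking key fall into the same block, so a query only needs to compare against that block rather than the whole table.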