Apache OpenNLP chunker/POS noun detection - Solr

I am experimenting with Apache OpenNLP for one of my projects; my requirement is to detect nouns in email contents and check them against our customer database (this DB consists of individual names, organization names, etc., and my search engine is Solr-based).
For normal English nouns the default trained model works properly (for most cases), but
one tricky requirement is that we have business organizations with abbreviations like OK, LET, etc., and thus in a few scenarios I need to treat OK, LET, etc. as nouns.
As an example
1) "sending some items to LET, please expect delay in payment"
2) "let us go for a party"
In #1 I want to treat LET as a noun, while in #2 LET is not a noun.
If I can achieve this, I can reduce a significant number of false-positive matches in my search engine.
Any help is highly appreciated.

Make a dictionary of the special nouns and perform dictionary-based extraction as a post-processing step. The dictionary-based extraction should take the distinction between lowercase and uppercase into account, in particular for those entries that are acronyms.
In terms of implementation of the dictionary lookup:
As long as the entities in question are single tokens (or each consists of at most a small, predefined maximum number M of tokens), implementing the dictionary as a HashSet<String>, tokenising the text and looking up each token (and each group of up to M tokens) in the hash should work very well; see the sketch after this list.
If you are dealing with very long entities, or if tokenization is a problem, the use of a search trie or finite state machine implementation of the dictionary is sensible.
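Here is a minimal sketch of that single-token / up-to-M-token lookup, assuming a naive tokeniser that splits on non-word characters; the class name, the dictionary entries and the value of M are all illustrative placeholders:

```java
import java.util.*;

// Case-sensitive dictionary extraction as a post-processing step.
// Dictionary entries and the maximum entity length M are illustrative.
public class DictionaryExtractor {

    private static final int M = 3; // max tokens per dictionary entry
    private final Set<String> dictionary = new HashSet<>(
            Arrays.asList("LET", "OK", "Acme Holdings"));

    public List<String> extract(String text) {
        // Naive tokenisation on non-word characters; swap in your own tokenizer.
        String[] tokens = text.split("\\W+");
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            StringBuilder candidate = new StringBuilder();
            for (int len = 1; len <= M && i + len <= tokens.length; len++) {
                if (len > 1) candidate.append(' ');
                candidate.append(tokens[i + len - 1]);
                // Exact, case-sensitive match: "LET" hits, "let" does not.
                if (dictionary.contains(candidate.toString())) {
                    hits.add(candidate.toString());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        DictionaryExtractor ex = new DictionaryExtractor();
        System.out.println(ex.extract("sending some items to LET, please expect delay in payment")); // [LET]
        System.out.println(ex.extract("let us go for a party"));                                     // []
    }
}
```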
Finally, as always with NLP, you will need to look at a significant sample of the results to identify any remaining problems. Depending on the level of ambiguity in your entity list, you may need to refine the detection further by adding either heuristics or a statistical / ML-based decision mechanism on top of the case-sensitive dictionary look-up.

Related

Common ways for pair matching of users in a database?

hope you are safe and well!
I have a question about regular or common ways of pair-matching when there is a database of users: say each user has a few properties, and when matching, each user can change the filtering options to only match those who fit their own requirements (so there is mutual selection between users), and we want to efficiently match 1000 users as precisely as possible.
For example, let's say there are 3 properties for every user: gender (female/male/other), study level (elementary/intermediate/advanced), and grade (freshman/sophomore/junior/senior), and when matching, each user can choose to only match with people of their selected gender, study level and grade.
When focusing on one user, I guess that, from the database perspective, we could use the filtering options in queries and get a list of those who satisfy both "my requirements" and "I fit their requirements". However, I think this would be slow and would cause concurrency problems when there are 1000+ users in the matching phase at the same time.
I saw another post here that discussed the blossom algorithm and the greedy algorithm, which seem appealing when the problem is viewed as a graph. Are they doable in this case? I guess if two users mutually fit each other's requirements, there would be an edge between the two nodes, and the weight of the edge could be a combined matching score over all 3 properties?
Anyway, I'm wondering whether there is a common way to do the pair matching precisely with at least 1000+ users at the same time.
Thank you so much!
If the requirement is that each match has to have the exact same properties, then the solution is fairly simple; just do a multiple-criteria sort (e.g. first sort by gender, then within each gender category sort by study level, etc.) and pair the identical users.
However, in a random dataset you're very unlikely to have perfect matches for all users. In that case you would want to score pairs by how closely each category matches and use a more complex algorithm to maximize your overall matches. What you would do depends heavily on your use case and userbase size. Honestly, 1000 users is a very small number for modern computers; pretty much any polynomial-time method (including blossom, as you mentioned) would work fine.
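As a rough sketch of the exact-match case, sort on all three properties and pair adjacent identical users; the User record and its fields are made up for illustration, and for imperfect matches you would replace the equality check with a pairwise score and hand the resulting weighted graph to a matching algorithm:

```java
import java.util.*;

// Pair users whose (gender, studyLevel, grade) are identical by sorting
// on all three criteria and walking the sorted list once.
public class ExactPairMatcher {

    record User(String id, String gender, String studyLevel, String grade) {}

    static List<User[]> pairIdentical(List<User> users) {
        List<User> sorted = new ArrayList<>(users);
        sorted.sort(Comparator.comparing(User::gender)
                .thenComparing(User::studyLevel)
                .thenComparing(User::grade));

        List<User[]> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); ) {
            User a = sorted.get(i), b = sorted.get(i + 1);
            boolean identical = a.gender().equals(b.gender())
                    && a.studyLevel().equals(b.studyLevel())
                    && a.grade().equals(b.grade());
            if (identical) {
                pairs.add(new User[] {a, b});
                i += 2;               // both users are now matched
            } else {
                i += 1;               // a stays unmatched for now
            }
        }
        return pairs;
    }
}
```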

Relational database design for hierarchical data?

I am trying to design a database to act as a language dictionary where each word is associated not only with its definition but also with its grammatical "taxon". E.g., it should look something like this:
"eat": verb.imperative
"eat": verb.present
"ate": verb.past
"he": pronoun.masculine.singular
"she": pronoun.feminine.singular
"heiress": noun.feminine.singular
"heirs": noun.masculine.plural
"therefore": adverb
"but": conjunction
It seems that a natural data structure to hold such a grammatical "taxonomy" should be some kind of tree or graph. Although I haven't thought it through, I presume that should make it easier to perform queries of the type
plural OF masculine OF "heiress" -> "heirs"
At this point, however, I am just trying to come up with the least ineffective way to store such a dictionary in a regular relational database (namely LibreOffice Base). What do you suggest the data schema should look like? Is there something more efficient than the brute-force method where I'd have as many boolean columns as there are grammatical types and sub-types? E.g., "she" would be true for the columns pronoun, feminine, and singular, but false for all other columns (verb, adverb, conjunction, etc.)?
This is a really wide-open question, and there are many applications and much related research. Let me give some pointers based on software I have used.
One column would be the lexeme, for example "eat." A second column would give the part of speech, which in your data above would be a string or other identifier that shows whether it is a verb, pronoun, noun, adverb or conjunction.
It might make sense to create another table for verb information. For example, tense, aspect and mood might each be separate columns. But these columns would only make sense for verbs. For the nouns table, the columns would include number (singular, plural) and gender, and perhaps whether it is a count or mass noun. Pronouns would also include person (first, second or third person).
Do you plan to include every form of every word? For example, will this database store "eats" and "eating" as well as "jumps" and "jumping?" It is much more efficient to store rules like "-s" for present singular and "-ing" for progressive. Then if there are exceptions, for example "ate," it can be described as having the underlying form of "eat" + "-ed." This rule would go under the "eat" lexeme, and there would be no separate "ate" entry.
There are also rules such as the plural changing words that end in -y to -ies. This would go under the plural suffix ("-s"), not under individual nouns.
With these things in mind, I offer a more specific answer to your question: No, I do not think this data is best described hierarchically, nor with a tree or graph, but rather analytically and relationally. LibreOffice Base would be a reasonable choice for a fairly simple project of this type, using macros to help with the processing.
So for:
"heiress" -> masculine plural = "heirs"
The first thing to do would be to analyze "heiress" as "heir" + feminine. Then compose the desired wordform by combining "heir" and "-s."
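A toy sketch of that analyze-then-compose step; the mini-lexicon and the suffix rules below are purely illustrative placeholders, not a real morphology component:

```java
import java.util.*;

// Toy analyze/compose step: strip a known suffix to find the stem,
// then attach the suffix for the requested feature.
public class WordFormComposer {

    // stem -> features it carries by itself (hypothetical mini-lexicon)
    private static final Map<String, String> LEXICON = Map.of(
            "heir", "noun.masculine.singular");

    // suffix rules: feature -> suffix (grossly simplified)
    private static final Map<String, String> SUFFIXES = Map.of(
            "feminine", "ess",
            "plural",   "s");

    static String analyzeStem(String form) {
        // e.g. "heiress" -> "heir" once the feminine suffix is stripped
        for (String suffix : SUFFIXES.values()) {
            String candidate = form.endsWith(suffix)
                    ? form.substring(0, form.length() - suffix.length()) : null;
            if (candidate != null && LEXICON.containsKey(candidate)) {
                return candidate;
            }
        }
        return form;
    }

    static String compose(String stem, String feature) {
        return stem + SUFFIXES.getOrDefault(feature, "");
    }

    public static void main(String[] args) {
        String stem = analyzeStem("heiress");          // "heir"
        System.out.println(compose(stem, "plural"));   // "heirs"
    }
}
```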
I was going to add a list of related software such as Python NLTK, but for one thing, the list of available software is nearly endless, and for another, software recommendations are off-topic for Stack Overflow.

short text syntactic classification

I am a newbie at machine learning and data mining. Here's the problem: I currently have one input variable, which is a short text made up of non-standard nouns, and I want to classify it into a target category. I have labels for about 40% of the entire dataset as training data; the remaining 60% we would like to classify as accurately as possible. The following are some input values, across multiple observations, that are assigned the 'LEAD_GENERATION_REPRESENTATIVE' title.
"Business Development Representative MFG"
"Business Development Director Retail-KK"
"Branch Staff"
"Account Development Rep"
"New Business Rep"
"Hong Kong Cloud"
"Lead Gen, New Business Development"
"Strategic Alliances EMEA"
"ENG-BDE"
I think the above gives an idea of what I mean by non-standard nouns. I can see a few tokens here that are meaningful, like 'development', 'lead', 'rep'. Others seem random, without any semantics, but they may appear multiple times in the data. Another thing is that some tokens like 'rep' and 'account' can appear for multiple categories. I think that will make weighting/similarity a challenging task.
My first question is "is it worth automating this kind of classification?"
Second : "is it a good problem to learn machine learning classification?". There are only 30k such entries and handful of target categories. I can find someone to manually do that which will also be more accurate.
Here's my take on this problem so far:
Full-text engine: something like Solr to build an index and query rules that draw matches based on tokens - words, phrases, synonyms, acronyms, descriptions. I can get someone to define a detailed taxonomy for each category. Use boosting and a pluggable scoring library.
Machine learning:
Naive Bayes classification
Decision tree
SVM
I have tried out Solr for this with a reverse lookup, though, since I don't have a taxonomy available at the moment. It seems I can get about 80% true positives (I'll have to dig more into the confusion matrix to reduce false positives). My query is a bunch of boolean terms and phrases with proximity and boosts, plus negations to reduce errors. I'm afraid this approach may lead to overfitting and won't scale.
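For reference, a SolrJ query in roughly that shape (phrases with proximity slop, boosted terms, negations) might look like the sketch below; the core name, the title field, and the specific terms and boost values are all hypothetical:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Boolean/phrase query with proximity, boosts and a negation, sent via SolrJ.
public class TitleLookup {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/titles").build()) {

            // "~2" = phrase slop (proximity), "^3" = boost, "-" = negation
            SolrQuery q = new SolrQuery(
                    "title:(\"business development\"~2^3 OR \"lead gen\"~2^3 OR rep^2) "
                    + "-title:staff");
            q.setRows(10);

            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("title")));
        }
    }
}
```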
I am aware that people usually try multiple modeling techniques to see which one works best, or derive a combination of techniques. I want to understand this problem from a feasibility and complexity point of view. If it's too broad a question, please just comment on the feasibility of a solution.

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say with XX% certainty that Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say: based on your content you're most like Record X, as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations); however, for us there are two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
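A sketch of the "known types as regular expressions" idea: keep a list of patterns with the variable parts generalised, and treat anything that matches none of them as a candidate new type. The patterns and class name below are invented around the two example records:

```java
import java.util.*;
import java.util.regex.Pattern;

// Score each record against a growing list of "known type" patterns;
// records that match nothing become candidates for a new type.
public class KnownTypeMatcher {

    private final List<Pattern> knownTypes = new ArrayList<>(List.of(
            // variable parts (transaction id, server id) generalised to \w+
            Pattern.compile("Change Transaction \\w+ Assigned To Server \\w+"),
            Pattern.compile("User \\w+ Logged In From \\S+")));   // hypothetical second type

    /** Returns the index of the matching known type, or -1 for "new type". */
    public int classify(String record) {
        for (int i = 0; i < knownTypes.size(); i++) {
            if (knownTypes.get(i).matcher(record).matches()) {
                return i;
            }
        }
        return -1;   // no match: inspect and possibly add a new pattern
    }

    public static void main(String[] args) {
        KnownTypeMatcher m = new KnownTypeMatcher();
        System.out.println(m.classify("Change Transaction ABC123 Assigned To Server US91")); // 0
        System.out.println(m.classify("Change Transaction XYZ789 Assigned To Server GB47")); // 0
        System.out.println(m.classify("Disk Full On Server GB47"));                          // -1
    }
}
```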
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a pair of consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine will give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
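For illustration, a small helper that produces those underscore-joined bigrams from a log line (indexing them as extra terms is left to your Lucene/Solr analysis chain):

```java
import java.util.*;

// Build word bigrams ("Change_Transaction", "Transaction_XYZ789", ...) from a log line.
public class Bigrams {

    static List<String> bigrams(String line) {
        String[] words = line.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(words[i] + "_" + words[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("Change Transaction XYZ789 Assigned To Server GB47"));
        // [Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47]
    }
}
```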
Two main strategies come to my mind here:
the ad-hoc one. Use an information-retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard of people doing this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. Usually the "information retrieval" approach is, however, only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (that may however be sparse) e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return you some similar existing log lines, without much quantitative information on how common such a line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
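A compact sketch of the second strategy: turn each line into a TF-IDF vector and compare lines by cosine similarity (a real clustering run would feed such vectors into k-means or a similar algorithm rather than comparing pairs by hand):

```java
import java.util.*;

// TF-IDF vectors over a small corpus of log lines, plus cosine similarity.
public class TfIdfSimilarity {

    static List<Map<String, Double>> tfidf(List<String> docs) {
        List<Map<String, Integer>> tf = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) counts.merge(term, 1, Integer::sum);
            }
            counts.keySet().forEach(t -> df.merge(t, 1, Integer::sum));
            tf.add(counts);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> counts : tf) {
            Map<String, Double> v = new HashMap<>();
            counts.forEach((term, c) ->
                    v.put(term, c * Math.log((double) docs.size() / df.get(term))));
            vectors.add(v);
        }
        return vectors;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
                "Change Transaction ABC123 Assigned To Server US91",
                "Change Transaction XYZ789 Assigned To Server GB47",
                "Disk Full On Server GB47");
        List<Map<String, Double>> v = tfidf(logs);
        System.out.printf("sim(0,1)=%.2f  sim(0,2)=%.2f%n",
                cosine(v.get(0), v.get(1)), cosine(v.get(0), v.get(2)));
    }
}
```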
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().

Flagging possible identical users in an account management system

I am working on a possible architecture for an abuse-detection mechanism in an account management system. What I want is to detect possible duplicate users based on certain correlating fields within a table. To keep the problem simple, let's say I have a USER table with the following fields:
Name
Nationality
Current Address
Login
Interests
It is quite possible that one user has created multiple records within this table. There might be a certain pattern in which this user has created his/her accounts. What would it take to mine this table to flag records that may be possible duplicates? Another concern is scale. If we have, let's say, a million users, taking one user and matching them against all remaining users is computationally unrealistic. What if these records are distributed across various machines in various geographic locations?
What are some of the techniques that I can use to solve this problem? I have tried to pose this question in a technology-agnostic manner in the hope that people can provide me with multiple perspectives.
Thanks
The answer really depends upon how you model your users and what constitutes a duplicate.
There could be a user that uses names from all the Harry Potter characters. Good luck finding that pattern :)
If you are looking for records that are approximately similar, try this simple approach:
Hash each word in the record and pick the minimum hash value. Do this for k different hash functions. Concatenate these min-hashes. What you have is a near-duplicate signature.
To be clear, let's say a record has words w_1, ..., w_n and your hash functions are h_1, ..., h_k.
Let m_i = min_j h_i(w_j), and the signature is S = m_1.m_2.m_3...m_k.
The cool thing about this signature is that if two documents contain 90% of the same words, then there is a good chance that the signatures will be the same for the two documents. Hence, instead of looking for near duplicates, you look for exact duplicates among the signatures. If you want to increase the number of matches, decrease k; if you are getting too many false positives, increase k.
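A small sketch of that signature scheme, using k seeded hash functions over a record's words; the seeds, the value of k, and the crude hash mixing are arbitrary placeholders:

```java
import java.util.*;

// Min-hash signature: for each of k seeded hash functions, keep the minimum
// hash over the record's words, then concatenate the minima (S = m1.m2....mk).
public class MinHashSignature {

    private static final int K = 4;                  // number of hash functions
    private static final int[] SEEDS = {17, 31, 61, 131};

    private static int seededHash(String word, int seed) {
        return word.hashCode() * seed ^ (seed << 7); // crude but deterministic
    }

    static String signature(String record) {
        String[] words = record.toLowerCase().split("\\W+");
        StringBuilder sig = new StringBuilder();
        for (int i = 0; i < K; i++) {
            int min = Integer.MAX_VALUE;
            for (String w : words) {
                if (!w.isEmpty()) min = Math.min(min, seededHash(w, SEEDS[i]));
            }
            sig.append(min).append('.');
        }
        return sig.toString();
    }

    public static void main(String[] args) {
        // Records sharing most of their words have a good chance of identical signatures.
        System.out.println(signature("john smith london uk chess football"));
        System.out.println(signature("john smith london uk chess"));
    }
}
```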
Of course, there is also the approach of using implicit features of users, such as their IP addresses, cookies, etc.
