Using Topic Model, how should we set up a "stop words" list? - stop-words

There are some standard stop lists containing words like "a", "the", "of", "not" that get removed from the corpus. However, I'm wondering: should the stop list change case by case?
For example, I have 10K articles from a journal, and because of the structure of an article you will basically see words like "introduction", "review", "conclusion", "page" in every article. My concern is: should we also remove these words from our corpus, i.e. the words that appear in every document? Thanks for every comment and suggestion.

I am working on a similar problem, but for text categorization. From my experience, it is good to have a domain-specific stop word list along with the standard list. Otherwise, words like "introduction", "review", etc. will show up in the term-frequency matrix, as you will see if you analyse it. They can mislead your models by giving more weight to these domain-specific keywords.
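A minimal sketch of that idea, using scikit-learn's built-in English list as the "standard" list; the domain-specific additions and the documents are just placeholders for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Stand-ins for your 10K journal articles
documents = [
    "Introduction: this review studies topic models ...",
    "Conclusion: see page 12 of the introduction ...",
]

# Domain-specific stop words layered on top of the standard English list
domain_stop_words = {"introduction", "review", "conclusion", "page", "abstract"}

vectorizer = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS | domain_stop_words))
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())  # neither standard nor domain stop words appear
```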

It is worth considering that the stop words might not affect your model as much as you fear. Have you tried not removing them and comparing the results?
See also this 2017 paper: "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models." http://www.cs.cornell.edu/~xanda/stopwords2017.pdf
In conclusion they say (paraphrasing) that removing stopwords had no real negative effect on the quality of the LDA model, and if needed they could still be removed afterwards without impacting the model.
Alternatively, you can always remove words with a high document frequency automatically, i.e. set a threshold for the share of documents a word may appear in (e.g. 50%) and remove all words that are more frequent than that as stopwords.
I don't think this will meaningfully impact the model itself, but I'm sure it will speed up the computation of the model, simply because there are fewer words to process.
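A sketch of that automatic cutoff using gensim; the 0.5 threshold and the toy documents are only examples:

```python
from gensim.corpora import Dictionary

# tokenized_docs would be your articles, already split into tokens
tokenized_docs = [
    ["introduction", "topic", "models", "review"],
    ["introduction", "conclusion", "lda", "review"],
    ["introduction", "review", "corpus", "page"],
]

dictionary = Dictionary(tokenized_docs)
# Drop every word that appears in more than 50% of the documents
dictionary.filter_extremes(no_below=1, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
```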


solr fuzzy vs wildcard vs stemmer

I have a couple of questions here.
I want to search a term jumps
With Fuzzy search, I can do jump~
With wild card search, I can do jump*
With stemmer I can do, jump
My understanding is that fuzzy search gives "pump". Wildcard search gives "jumping" as well. Stemmer gives "jumper" also.
I totally agree with the results.
What is the performance of these three?
Wildcard is not recommended at the beginning of a term - my understanding is that it has to be matched against all the tokens in the index - but in this case it would be all the tokens that start with "jump".
Fuzzy search gives me unpredictable results - it has to do some kind of spellcheck, I assume.
Stemmer suits only particular scenarios; for example, it can't match "pumps".
How should I use these so that they give more relevant results?
I am probably more confused about all this because of this section. Any suggestions, please?
Question 1
Wildcard queries are (generally) not analysed (i.e. they're not tokenized or run through filters), meaning that anything that depends on filters processing the input/output tokens will give weird results (for example if the input string is broken into multiple tokens).
The matching happens on the tokens, so what you input is matched almost directly (lowercasing still works) against the prefix / postfix of the tokens in the index. Generally you'd want to avoid wildcard queries for general search queries, since they're rather limited for natural search and can give weird results (as shown).
Fuzzy search is based on "edit distance" - i.e. a number that tells Solr how many characters can be removed/inserted/changed to get to the resulting token. This will give your users OK-ish results, but might be hard to decipher in the sense of "why did this give me a hit" when the allowed distance is larger (Lucene/Solr supports up to 2 in edit distance which is also the default if no edit distance is given).
Stemming is usually the way to go, as it's the actual "formal" process of taking a term and reducing it down to its stem - the "meaning" of the word (it doesn't really know anything about meaning in the natural language processing sense, but works according to a set of static rules and exceptions for the configured language). It can be adjusted per language, with rules suitable for that language, which neither of the two other options can.
For your downside regarding stemming ("Since it can't match pumps") - that might actually be a good thing. It'll be clearer to your users what the search results are based on, and instead of including pumps in your search result, include it as a spelling correction ("Did you mean pump / pumps instead?"). It'll give a far better experience for any user, where the search results will more closely match what they're searching for.
The requirements might differ based on what your actual use case is; i.e. if it's just for programmatic attempts to find terms that look similar.
Question 2
Present those results you deem more relevant as the first hits - if you're doing wildcard or fuzzy searches you can't do this through scoring alone, so you'll have to make several queries and then present them after each other. I usually suggest making that an explicit action by the user of the search when discussing this in projects.
Instead, as the main search, you can use an NGramFilter in a separate field and use a copyField instruction to get the same content into both fields - and then score hits in the ngram field far lower than hits in the more "exact" field. Usually you want three fields in that case - one for exact hits (non-stemmed), one for stemmed hits and one for ngram hits - and then weight them appropriately with the qf parameter to edismax. It usually gives you the quickest and easiest route to decent search results for your users, but make sure to give them decent ways of either filtering the result set (facets) or changing their queries into something more meaningful (did you mean, also see xyz, etc.).
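A rough sketch of such a query over the three fields, sent from Python with requests; the core name and the field names (text_exact, text_stemmed, text_ngram) are assumptions, and the boost values are only a starting point:

```python
import requests

params = {
    "q": "jumps",
    "defType": "edismax",
    # exact hits score highest, stemmed hits lower, ngram hits lowest
    "qf": "text_exact^5 text_stemmed^2 text_ngram^0.5",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["response"]["numFound"])
```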
Guessing the user's intent is usually very hard unless you have invested a lot of time and resources into personalisation (think Google), so leave that for later - most users are happy as long as they have a clear and distinct way of solving their own problems, even if you don't get it perfect for the first result.
For question 2, you can go from strict to permissive.
Option one: only return strict search results. If nothing is found, return stemmer results. Continue with fuzzy or wildcard search if nothing was found before (see the sketch below).
Option two: return all results, but rank them by level (i.e. exact matches first, then stemmer results, ...).
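A sketch of option one as a fallback chain; the field names and the core URL are assumptions, and 2 is Solr's maximum fuzzy edit distance:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/mycore/select"  # assumed core name

def solr_docs(query):
    resp = requests.get(SOLR_SELECT, params={"q": query, "wt": "json"})
    return resp.json()["response"]["docs"]

def strict_to_permissive(term):
    # 1. exact field, 2. stemmed field, 3. fuzzy search as a last resort
    return (solr_docs(f"text_exact:{term}")
            or solr_docs(f"text_stemmed:{term}")
            or solr_docs(f"text_exact:{term}~2"))

results = strict_to_permissive("jumps")
```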

How to get paragraph result from solr keyword search after using tika to index some documents?

I use Tika to index documents. Then I want to get the whole paragraph, from paragraph start to paragraph end, that contains the keywords. I tried to use HighlightFragsize but it does not work. For example, there is a document like the one below:
When I was very small, my parents took me to many places, because they wanted me to learn more about the world. Thanks to them, I witnessed the variety of the world and a lot of beautiful scenery.
But no matter where I go, in my heart, the place with the most beautiful scenery is my hometown.
There are two paragraphs above. If I search for 'my parents', I hope to get the whole paragraph "When I was very small, my parents ... a lot of beautiful scenery", not only part of it. I used HighlightFragsize to limit the sentence, but the result is not what I want. Please help, thanks in advance.
You haven't provided a lot of information to go off of, but I'm assuming that you're using a highlighter, so here are a couple of things you should check:
The field that holds your parsed data - is it stored? Can you see the entire contents?
If (1), is the text longer than 51200 chars? The default highlighter configuration has a setting maxAnalyzedChars that is set to 51200. This means that the highlighter will not process more than 51200 characters from a highlighted field in a matched document to look for highlights. If this is the case, increase this value until you get the desired results.
Highlighting on extremely large fields may incur a significant performance penalty which you should be mindful of before choosing a configuration.
See this for more details.
UPDATE
I don't think there's any parameter called HighlightFragsize but there's one called hl.fragsize which can do what you want when set to zero.
Try the following query and see if it works for you:
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0
Additionally, you should, in any case, be mindful of the first 2 points I posted above.
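Sent from Python, the same query might look like this, with hl.maxAnalyzedChars raised as per point 2; the core name is an assumption, and my_field is whatever stored field holds the Tika-extracted text:

```python
import requests

params = {
    "q": "my parents",
    "hl": "true",
    "hl.fl": "my_field",               # the stored field holding the extracted text
    "hl.fragsize": 0,                  # 0 = no fragmenting, use the whole field value
    "hl.maxAnalyzedChars": 200000,     # raise the 51200-character default if needed
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["highlighting"])
```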
UPDATE 2
I don’t think there’s a direct way to do what you’re looking for. You could possibly split up your field into a multi valued field with each paragraph being stored as a separate value.
You can then possibly use hl.preserveMulti, hl.maxMultiValuedToExamine and hl.maxMultiValuedToMatch to achieve what you need.
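A sketch of that pre-processing step: split the extracted text on blank lines and index each paragraph as one value of a multivalued field. The core name and the paragraphs field are assumptions and would need a matching schema:

```python
import requests

extracted_text = """When I was very small, my parents took me to many places ...

But no matter where I go, in my heart, the place with the most beautiful scenery is my hometown."""

# One value per paragraph, so the highlighter can return a whole paragraph as a snippet
paragraphs = [p.strip() for p in extracted_text.split("\n\n") if p.strip()]

doc = {"id": "doc1", "paragraphs": paragraphs}
requests.post("http://localhost:8983/solr/mycore/update?commit=true", json=[doc])
```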

LDA: Why sampling for inference of a new document?

Given a standard LDA model with a few thousand topics and a few million documents, trained with Mallet / a collapsed Gibbs sampler:
When inferring a new document: why not just skip sampling and simply use the model's term-topic counts to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes the topic mixture of the new document into account, which in turn influences how topics are composed (beta, term-frequency distributions). However, as the topics are kept fixed when inferring a new document, I don't see why this should be relevant.
An issue with sampling is its probabilistic nature - sometimes the topic assignments inferred for a document vary greatly across repeated invocations. Therefore I would like to understand the theoretical and practical value of sampling vs. just using a deterministic method.
Thanks Ben
Just using the term-topic counts of the last Gibbs sample is not a good idea. Such an approach doesn't take the topic structure into account: if a document has many words from one topic, it's likely to have even more words from that topic [1].
For example, say two words have equal probabilities in two topics. The topic assignment of the first word in a given document affects the topic probability of the other word: the other word is more likely to be in the same topic as the first one. The relation works the other way also. The complexity of this situation is why we use methods like Gibbs sampling to estimate values for this sort of problem.
As for your comment on topic assignments varying, that can't be helped, and could be taken as a good thing: if a word's topic assignment varies, you can't rely on it. What you're seeing is that the posterior distribution over topics for that word has no clear winner, so you should take any particular assignment with a grain of salt :)
[1] assuming the prior on document-topic distributions (alpha in the usual notation) encourages sparsity, as is typically chosen for topic models.
The real issue is computational complexity. If each of the N tokens in a document can take any of K topics, there are K^N possible configurations of topic assignments. With two topics and a document the size of this answer, you have more possibilities than there are atoms in the universe.
Sampling from this search space is, however, quite efficient, and usually gives consistent results if you average over three to five consecutive Gibbs sweeps. You get to do something computationally impossible, and what it costs you is some uncertainty.
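To make the fixed-topics case concrete, here is a minimal sketch (not Mallet's actual inferencer) of Gibbs inference for one new document: the topic-word distributions phi stay fixed, only the document's own assignments get resampled, and the estimate is averaged over the last few sweeps:

```python
import numpy as np

def infer_doc_topics(phi, doc_words, alpha=0.1, n_sweeps=20, n_average=5, seed=0):
    """phi: (K, V) fixed topic-word probabilities; doc_words: word ids of the new document."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    N = len(doc_words)
    z = rng.integers(K, size=N)                    # random initial topic assignments
    n_dk = np.bincount(z, minlength=K).astype(float)
    theta_sum = np.zeros(K)

    for sweep in range(n_sweeps):
        for i, w in enumerate(doc_words):
            n_dk[z[i]] -= 1                        # remove token i from the doc counts
            p = phi[:, w] * (n_dk + alpha)         # topic fit x local document context
            p /= p.sum()
            z[i] = rng.choice(K, p=p)
            n_dk[z[i]] += 1
        if sweep >= n_sweeps - n_average:          # average over the last few sweeps
            theta_sum += (n_dk + alpha) / (N + K * alpha)

    return theta_sum / n_average
```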
As was noted, you can get a "deterministic" result by setting a fixed random seed, but that doesn't actually solve anything.

how to do fuzzy search in big data

I'm new to this area and I am mostly wondering what the state of the art is and where I can read about it.
Let's assume that I just have a key/value store and I have some distance(key1, key2) defined somehow (not sure if it must be a metric, i.e. whether the triangle inequality must always hold).
What I want is mostly a search(key) function which returns all items whose keys are within a certain distance of the search key. Maybe that distance limit is configurable. Maybe this is also just a lazy iterator. Maybe there can also be a count limit, and an item (key, value) is in the returned set with some probability P, where P = 1/distance(key, search-key) or so (i.e., a perfect match would certainly be in the set and close matches at least with high probability).
One example application is fingerprint matching in MusicBrainz. They use the AcoustID fingerprint and have defined this compare function. They use the PostgreSQL GIN index and I guess (although I haven't fully understood/read the acoustid-server code) the GIN partial match algorithm, but I haven't fully understood whether that is what I am asking for and how it works.
For text, what I have found so far is to use some phonetic algorithm to simplify words based on their pronunciation. An example is here. This mostly serves to cut the search space down to a smaller one. However, it has several limitations, e.g. it must still be a perfect match in the smaller space.
But anyway, I am also searching for a more generic solution, if that exists.
There is no (fast) generic solution; each application will need a different approach.
Neither of the two examples actually does traditional nearest-neighbor search. AcoustID (I'm the author) just looks for exact matches, but it searches over a very large number of hashes in the hope that some of them will match. The phonetic search example uses Metaphone to convert words to their phonetic representation and is also only looking for exact matches.
You will find that if you have a lot of data, exact search using huge hash tables is the only thing you can realistically do. The problem then becomes how to convert your fuzzy matching to exact search.
A common approach is to use locality-sensitive hashing (LSH) with a smart hashing method, but as you can see from your two examples, sometimes you can get away with an even simpler approach.
Btw, if you are looking specifically for text search, the simplest way to do it is to split your input into N-grams and index those. Depending on how your distance function is defined, that might give you the right candidate matches without too much work.
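A toy sketch of that n-gram trick: index character trigrams of every key, then use shared trigrams to generate candidates, which a real distance function can re-rank afterwards.

```python
from collections import defaultdict

def trigrams(text, n=3):
    text = f"  {text.lower()} "            # pad so word edges get their own n-grams
    return {text[i:i + n] for i in range(len(text) - n + 1)}

class NGramIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # trigram -> set of keys containing it

    def add(self, key):
        for g in trigrams(key):
            self.postings[g].add(key)

    def candidates(self, query, min_shared=2):
        counts = defaultdict(int)
        for g in trigrams(query):
            for key in self.postings.get(g, ()):
                counts[key] += 1
        return sorted((k for k, c in counts.items() if c >= min_shared),
                      key=lambda k: -counts[k])

index = NGramIndex()
for key in ["jumping", "jumper", "pumps", "yacht"]:
    index.add(key)
print(index.candidates("jumps"))           # exact-match lookups, fuzzy-ish results
```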
I suggest you take a look at FLANN (Fast Library for Approximate Nearest Neighbors). Fuzzy search in big data is also known as approximate nearest neighbor search.
This library offers different metrics, e.g. Euclidean and Hamming, and different clustering methods: LSH or k-means, for instance.
The search always happens in two phases. First you feed the system with data to train the algorithm; this is potentially time-consuming, depending on your data.
I successfully clustered 13 million data points in less than a minute, though (using LSH).
Then comes the search phase, which is very fast. You can specify a maximum distance and/or a maximum number of neighbors.
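To illustrate that two-phase pattern (build an index once, then query it cheaply), here is a sketch using scikit-learn's NearestNeighbors as a stand-in; it is not FLANN's API, just the same shape of workflow:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = rng.random((100_000, 32))                 # "training" phase: build the index once
index = NearestNeighbors(n_neighbors=5).fit(data)

query = rng.random((1, 32))                      # search phase: fast per-query lookups
dists, ids = index.kneighbors(query)             # k nearest neighbours
r_dists, r_ids = index.radius_neighbors(query, radius=2.0)  # everything within a max distance
print(ids, r_ids[0].shape)
```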
As Lukas said, there is no good generic solution; each domain will have its own tricks to make it faster or to find a better approach using the inner properties of the data you are using.
Shazam uses a special technique with geometrical projections to quickly find your song. In computer vision we often use BoW (bag of words), which originally appeared in text retrieval.
If you can see your data as a graph, there are other methods for approximate matching using spectral graph theory for instance.
Let us know.
Depending on what your keys/values are like, the Levenshtein algorithm (also called edit distance) can help. It calculates the least number of edit operations necessary to transform one string into another.
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.levenshtein.net/
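For reference, a compact sketch of the edit distance itself:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("jumps", "pumps"))  # 1
```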

Is there a way to rank the difficulty of pronunciation of a word?

I'm trying to build a collection of English words that are difficult to pronounce.
I was wondering if there is an algorithm of some kind or a theory, that can be used to show how difficult a word is to pronounce.
Does this appear to you as something that can be computed?
As this seems to be a very subjective thing, let me make it more objective: let's say, the words that are hardest for text-to-speech technologies to pronounce.
One approach would be to build a list with two versions of each word: one the correct spelling, and the other the word spelled using the simplest phonetic spelling. Apply a distance function to the two versions (like the Levenshtein distance, http://en.wikipedia.org/wiki/Levenshtein_distance). The greater the distance between the two, the harder the word would be to pronounce.
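A tiny sketch of that idea using Python's difflib as the distance function; the phonetic respellings here are made up for illustration and would really come from a pronunciation dictionary:

```python
from difflib import SequenceMatcher

# word -> simple phonetic respelling (illustrative values only)
respellings = {
    "yacht": "yot",
    "colonel": "kernel",
    "cat": "kat",
}

def difficulty(spelling, phonetic):
    # the further the spelling is from how the word sounds, the "harder" it counts as
    return 1 - SequenceMatcher(None, spelling, phonetic).ratio()

for word, phonetic in sorted(respellings.items(), key=lambda kv: -difficulty(*kv)):
    print(f"{word:8s} {difficulty(word, phonetic):.2f}")
```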
Great problem! Off the top of my head, you could create a system which contains all the letters of the phonetic alphabet, with weights connecting every combination based on difficulty (this is highly subjective, so it may need multiple people testing and averaging the results, etc.). Then take a list of all words from the English dictionary stored on disk and run a script which cycles through each entry, scrapes Wikipedia for the phonetic spelling, and ranks its difficulty. This could take into account the length of the word as well as the difficulty of joining the phonetics, and then order the list by difficulty.
That's what I would try to do :P
To a certain extent...
Speech programs for example use a system of phonetics to try and pronounce words.
For example, "grasp" would be split into:
Gr-A-Sp
However, for foreign words (or words that don't follow this pattern), exception lists have to be kept, e.g. "yacht".
Suggestion
Fortunately, pronunciation as a process depends on two factors:
the phones making up the word and the location of vowels and semivowels, i.e. /a/, /ae/, /e/, /i/, /o/, /u/, /w/, /j/ ...
the length of the word.
The first relates to the mechanics of phone production: the velum, cheeks and tongue have to be repositioned to produce the various sounds associated with individual phones (nasal, etc.). This makes some words more difficult to pronounce, as the movement required may be considerable. Refer to books about phonetics for the articulation position of each phone.
Algorithm
A weighted graph (or spanning tree) over phones, with the weights being the difficulty of pronouncing two consecutive phones, e.g. /l/ and /r/, or /sh/ and /s/ (see the sketch below).
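A minimal sketch of that scoring idea; the phone-pair weights are invented placeholders and would need to come from real testing:

```python
# difficulty weight for pronouncing two consecutive phones (placeholder values)
PAIR_WEIGHTS = {
    ("s", "th"): 3.0, ("th", "s"): 3.0,
    ("l", "r"): 2.5, ("r", "l"): 2.5,
}
DEFAULT_WEIGHT = 1.0

def word_difficulty(phones):
    """Sum the pairwise weights over consecutive phones; longer words naturally score higher."""
    return sum(PAIR_WEIGHTS.get(pair, DEFAULT_WEIGHT)
               for pair in zip(phones, phones[1:]))

print(word_difficulty(["s", "i", "k", "s", "th", "s"]))   # "sixths"
print(word_difficulty(["k", "a", "t"]))                   # "cat"
```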
good luck.
