Thank you guys on this website, you helped me with TF-IDF. It helped me a lot to write a TF-IDF function in Java. I made TF, but I have one question. The wiki says IDF is calculated from how many documents contain the term, but I am confused.
For example, take the string "JosAH is great. JosAH rocks". The TF would be 2/5, and for IDF there are 2 documents and each document contains the term JosAH. So:
Do we just check whether the term occurs in the other documents, or do we count how many times it occurs in them?
I'm not entirely sure what you're asking here. Anyway, the purpose of IDF (inverse document frequency) is to dampen the score of very frequent terms and boost the score of infrequent terms.
In your collection of two documents, the IDF of "JosAH" will be 0, since it occurs in all documents.
The document frequency is "the number of documents in the collection that contain a term" (from Introduction to Information Retrieval), so in your words it is the former option: "just see if that term occurs".
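To make that concrete, here is a minimal Java sketch of TF and IDF under those definitions (the class and method names are just illustrative):

    import java.util.Arrays;
    import java.util.List;

    public class TfIdfSketch {

        // TF: occurrences of the term in the document / total tokens in the document
        static double tf(List<String> doc, String term) {
            long count = doc.stream().filter(term::equals).count();
            return (double) count / doc.size();
        }

        // IDF: log(N / df), where df counts each document at most once, no matter
        // how often the term occurs inside it. Real code should smooth the
        // denominator, e.g. log(N / (1 + df)), to avoid dividing by zero.
        static double idf(List<List<String>> docs, String term) {
            long df = docs.stream().filter(d -> d.contains(term)).count();
            return Math.log((double) docs.size() / df);
        }

        public static void main(String[] args) {
            List<List<String>> docs = Arrays.asList(
                    Arrays.asList("josah", "is", "great", "josah", "rocks"),
                    Arrays.asList("josah", "writes", "java"));
            // df("josah") = 2 of 2 documents, so idf = log(2/2) = 0,
            // and the whole TF-IDF score is 0, as described above.
            System.out.println(tf(docs.get(0), "josah") * idf(docs, "josah"));
        }
    }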
Given a standard LDA model with a few thousand topics and a few million documents, trained with Mallet's collapsed Gibbs sampler:
When inferring topics for a new document: why not just skip the sampling and simply use the model's term-topic counts to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes into account the topic mixture of the new document, which in turn influences how the topics are composed (beta, the term-frequency distributions). However, since the topics are kept fixed when inferring a new document, I don't see why this should be relevant.
An issue with sampling is its probabilistic nature: the topic assignments inferred for a document sometimes vary greatly across repeated invocations. I would therefore like to understand the theoretical and practical value of sampling versus just using a deterministic method.
Thanks, Ben
Just using the term-topic counts of the last Gibbs sample is not a good idea. Such an approach doesn't take the topic structure into account: if a document has many words from one topic, it's likely to have even more words from that topic [1].
For example, say two words have equal probabilities in two topics. The topic assignment of the first word in a given document affects the topic probability of the other word: the other word is more likely to be in the same topic as the first one. The relation also works the other way around. The complexity of this situation is why we use methods like Gibbs sampling to estimate values for this sort of problem.
As for your comment on topic assignments varying, that can't be helped, and could be taken as a good thing: if a word's topic assignment varies, you can't rely on it. What you're seeing is that the posterior distribution over topics for that word has no clear winner, so you should take any particular assignment with a grain of salt :)
[1] assuming the document-topic prior (alpha) encourages sparsity, as is usually chosen for topic models.
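To illustrate the coupling, here is a sketch of the collapsed-Gibbs resampling step for one token of a new document, with the topic-word distributions held fixed. All names are illustrative; this is not Mallet's API:

    import java.util.Random;

    public class GibbsStepSketch {
        static final Random rng = new Random();

        // phiWord[k]        = P(word | topic k), fixed from the trained model
        // docTopicCounts[k] = topic counts for this document, with the token
        //                     being resampled already removed
        // alpha             = symmetric document-topic prior
        static int sampleTopic(double[] phiWord, int[] docTopicCounts, double alpha) {
            int K = phiWord.length;
            double[] p = new double[K];
            double sum = 0.0;
            for (int k = 0; k < K; k++) {
                // The doc-topic count is what couples this token to the rest of
                // the document: topics the document already uses get more weight.
                p[k] = (docTopicCounts[k] + alpha) * phiWord[k];
                sum += p[k];
            }
            // Draw from the unnormalized discrete distribution.
            double u = rng.nextDouble() * sum;
            for (int k = 0; k < K; k++) {
                u -= p[k];
                if (u <= 0) return k;
            }
            return K - 1;
        }
    }

Dropping the (docTopicCounts[k] + alpha) factor is exactly the deterministic shortcut you describe, and it is what loses the topic structure of the document.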
The real issue is computational complexity. If each of the N tokens in a document can take one of K possible topics, there are K^N possible configurations of topics. With two topics and a document the size of this answer, you have more possibilities than there are atoms in the universe.
Sampling from this search space is, however, quite efficient, and usually gives consistent results if you average over three to five consecutive Gibbs sweeps. You get to do something computationally impossible, and what it costs you is some uncertainty.
As was noted, you can get a "deterministic" result by setting a fixed random seed, but that doesn't actually solve anything.
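For concreteness, inference with averaging and a fixed seed looks roughly like this in Mallet (the parameter values are illustrative, and model and newDocs are assumed to already exist):

    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.topics.TopicInferencer;
    import cc.mallet.types.InstanceList;

    // `model` is a trained ParallelTopicModel; `newDocs` is an InstanceList
    // built with the same pipe as the training data.
    TopicInferencer inferencer = model.getInferencer();
    inferencer.setRandomSeed(42);  // reproducible runs; hides, not removes, the variance

    // 100 sampling iterations, keeping every 10th sample after 50 burn-in sweeps;
    // the returned topic distribution is averaged over the kept samples.
    double[] topicDist = inferencer.getSampledDistribution(newDocs.get(0), 100, 10, 50);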
Using http://wiki.apache.org/solr/TermVectorComponent I can get indexed terms and their frequencies for any document stored in my index. How can I get the same information for a piece of text without storing it in my index? I just want Solr to process the text and return the information, without having to store the document in my index.
AFAIK this isn't possible without storing the data in Solr.
If you are looking to do text analysis (I understand this is broader than what you asked for), I would recommend the alternatives below:
MAUI - does keyphrase and terminology extraction.
Gensim - does topic modelling.
Kea - does keyword extraction.
I've also come across some Python scripts that do term frequency analysis. Have a look at Mincemeat, particularly the example, which does a term frequency calculation.
From what you ask, I conclude that you actually need a search library, not a full search engine (service). That library is Lucene. Perhaps this will help for starters: How to extract Document Term Vector in Lucene 3.5.0. You could keep the index in RAM for the sake of computing the necessary bits and then get rid of the index, as sketched below.
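A minimal sketch of that approach against Lucene 3.5 (field and class names are just for illustration):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class InMemoryTermFreqs {
        public static void main(String[] args) throws IOException {
            String text = "solr is great and solr is fast";

            RAMDirectory dir = new RAMDirectory();  // throwaway in-memory index
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                    Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

            Document doc = new Document();
            // Term vectors must be enabled to read back per-document frequencies.
            doc.add(new Field("text", text, Field.Store.NO,
                    Field.Index.ANALYZED, Field.TermVector.YES));
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open(dir);
            TermFreqVector tfv = reader.getTermFreqVector(0, "text");
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " -> " + freqs[i]);
            }
            reader.close();
        }
    }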
I wrote an application in Java several years ago that did heavy text analysis based on Lucene. I had to custom-write the search functions to find words within a certain distance of each other. You can import your text documents into the software and have it count the term frequencies, or you can take the code and tailor it to your needs.
Free download:
http://www.minoesoftware.com/download.php
Source:
https://github.com/danspiteri/MINOE/blob/master/src/minoe/SearchFiles.java
If you are using Solr 4 and you are not storing the text, you can use a Solr pivot facet on the text field. But then, obviously, you will get the terms after analyzer processing:
http://192.168.0.202:8080/solr/fr_00_0425_sem/select?q=renault&wt=xml&facet=true&facet.pivot=uniqueKey,yourText
This is a pretty heavy query; I hope you don't have too many documents that match...
I am working on a fuzzy query using Solr, which goes over a repository of data that may contain misspelled or abbreviated words. For example, the repository could have a name with the word "Hlth" (an abbreviated form of the word 'Health').
If I do a fuzzy search for Name:'Health'~0.35, I get results with the word 'Health' but not 'Hlth'.
If I do a fuzzy search for Name:'Hlth'~0.35, I get records with the names 'Health' and 'Hlth'.
I would like to get the first query to work. In my business use case, I would have to use the clean data to query for all the misspelled or abbreviated words.
Could someone please throw some light on why the first fuzzy search is not working, and whether there are any other ways of achieving the same?
You are using the fuzzy query in the wrong way. According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
So you need to write your query like this: Health~2
You write: "I wanted to match Parkway with Pkwy"
Parkway and Pkwy have an edit distance of 3. You could achieve this by substituting ~3 for ~2 in the first response, but Solr fuzzy matching is not recommended for values greater than 2, for performance reasons.
I think the best way to approach your problem would be to generate a context-specific dictionary of synonyms and do query-time expansion, for example along these lines:
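A sketch of what that could look like, with made-up synonym entries and an illustrative field type name (placing the synonym filter only in the query analyzer gives you query-time expansion):

    # synonyms.txt (illustrative entries)
    hlth, health
    pkwy, parkway

    <!-- schema.xml: expand synonyms only at query time -->
    <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>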
Using phonetic filters may solve your problem.
Please consider looking at the following:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
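For example, a field type along these lines (the name and analyzer chain are illustrative, and whether a given encoder actually conflates "Hlth" and "Health" is worth verifying in Solr's analysis screen):

    <fieldType name="text_phonetic" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- inject="true" keeps the original token alongside its phonetic code -->
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
      </analyzer>
    </fieldType>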
Hope this helps.
I am confused here and want to clear up my doubt. It may be a stupid question, but I want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
What does the ^ mean here?
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to: "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on boosting, please see the Boosting Ranking Terms section in the Solr Relevancy Cookbook. This slide deck about boosting from the Lucene Revolution conference earlier this year also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher (more relevant) in this query, because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of the highest boost value); any documents matching term:Next will be scored lower than term:NeXT but higher than term:next.
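As an aside on the two-token filter described in the quoted cookbook passage, a minimal sketch of such a filter could look like this (the class name is made up, and this is one possible implementation, not the cookbook's own code):

    import java.io.IOException;
    import java.util.Locale;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Emits each token, then a lowercased copy at the same position whenever
    // lowercasing actually changes the token.
    public final class DuplicateLowercaseFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);
        private State pendingState;

        public DuplicateLowercaseFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (pendingState != null) {
                // Emit the lowercased copy at the same position as the original.
                restoreState(pendingState);
                pendingState = null;
                String lower = termAtt.toString().toLowerCase(Locale.ROOT);
                termAtt.setEmpty().append(lower);
                posIncAtt.setPositionIncrement(0);
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.toString();
            if (!term.equals(term.toLowerCase(Locale.ROOT))) {
                pendingState = captureState();  // queue the lowercased duplicate
            }
            return true;  // emit the original token first
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pendingState = null;
        }
    }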
When I choose to view the score field in Solr results, I see the score assigned by Solr to every document returned, and a maxscore value that is the score of the topmost returned document.
I need to know whether there is a cut-off to the Solr score or not. I mean, if the maxscore is 6.89343 or 2.34365, does that mean it is 6.89343 out of 10 as a final score? And how can I tell whether I am close to the most correct result?
If possible, I need a simple explanation of the scoring algorithm used by Solr.
The maxscore is the score of the topmost document in the search results.
There is no cutoff for the maxscore; it depends on the scoring calculations and normalization done by Lucene/Solr.
The topmost document has the maxscore, and the scores of the documents below it give you an idea of how far off they are from the top.
For an explanation of the scoring, you can check the link.
If it is indeed a z-score from a normal distribution, then you can calculate the CDF (as it appears here). The CDF will give you a bounded score from 0 to 1. It's hard for me to interpret what the CDF really means in this case, given that the un-normalized score is calculated in several steps, but you can roughly think of it as the probability that you got the right answer, as long as your collection is well populated with the relevant material.
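If you do go that route, the mapping itself is one line with Commons Math. Treating the score as a z-score is an assumption, as noted above; Lucene/Solr scores are not z-scores in general:

    import org.apache.commons.math3.distribution.NormalDistribution;

    // Map a raw score to [0, 1] via the standard normal CDF.
    // This is a rough normalization at best, not a true probability.
    NormalDistribution stdNormal = new NormalDistribution(0, 1);
    double bounded = stdNormal.cumulativeProbability(6.89343);  // very close to 1.0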