Learning how to use Lucene!
I have an index in Lucene which is configured to store term vectors.
I also have a set of documents I have already constructed custom term vectors for (for an unrelated purpose) not using Lucene.
Is there a way to insert them directly into the Lucene inverted index in lieu of the original contents of the documents?
I imagine one way to do this would be to generate bogus text using the term vector with the appropriate number of term occurrences and then to feed the bogus text as the contents of the document. This seems silly because ultimately Lucene will have to convert the bogus text back into a term vector in order to index it.
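For illustration, the bogus-text workaround described above could look roughly like this. It is a plain Python sketch that runs outside Lucene; `bogus_text` is a hypothetical helper, and it assumes terms contain no whitespace and that the field's analyzer only splits on whitespace (no stemming, no stop words), so the counts survive re-analysis.

```python
from collections import Counter


def bogus_text(term_vector):
    """Repeat each term `count` times so that re-tokenizing yields the same counts."""
    parts = []
    for term, count in term_vector.items():
        parts.extend([term] * count)
    return " ".join(parts)


# Hypothetical custom term vector: term -> number of occurrences
vector = {"lucene": 3, "index": 2, "payload": 1}
text = bogus_text(vector)

# Splitting on whitespace recovers the original counts
recovered = Counter(text.split())
print(text)
```

Term order and positions in the generated text are meaningless, so phrase and proximity queries against such a field would not be reliable.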
I'm not entirely sure what you want to do with these term vectors ultimately (score? just retrieve?) but here's one strategy I might advocate for.
Instead of focusing on faking out the text attribute of term vectors, consider looking into payloads, which attach arbitrary metadata to each token. During analysis, text is converted to tokens. This includes emitting a number of attributes about each token. There are standard attributes like position, term character offsets, and the term string itself. All of these can be part of the uninverted term vector. Another attribute is the payload, which is arbitrary metadata you can attach to a term.
You can store any token attribute uninverted as a "term vector" including payloads, which you can access at scoring time.
To do this you need to:
Configure your field to store term vectors, including term vectors with payloads
Customize analysis to emit payloads that correspond to your terms. You can read more here
Use IndexReader.getTermVector to pull back Terms. From that you can get a TermsEnum. You can then use that to get a DocsAndPositionsEnum, which has an accessor for the current payload
If you want to use this in scoring, consider a custom query or custom score query
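One way to get custom weights into payloads is to serialize each term as `term|weight` and let the analysis chain split them apart. The sketch below only builds that field text in Python; it assumes the field's analyzer is configured with Lucene's delimited payload token filter using `|` as the delimiter and a float encoder, which is an assumption about your setup. `to_delimited_payloads` is a hypothetical helper.

```python
def to_delimited_payloads(term_weights, delimiter="|"):
    """Serialize term -> weight pairs as 'term|weight' tokens separated by spaces."""
    tokens = []
    for term, weight in term_weights.items():
        # A term containing the delimiter or whitespace would be split incorrectly
        if delimiter in term or any(ch.isspace() for ch in term):
            raise ValueError(f"term not representable: {term!r}")
        tokens.append(f"{term}{delimiter}{weight}")
    return " ".join(tokens)


# Hypothetical custom term vector with float weights
field_text = to_delimited_payloads({"lucene": 2.5, "index": 1.0})
print(field_text)
```

The resulting string would be fed as the field value, so the weight travels with each token instead of being faked through repeated occurrences.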
Related
In my understanding, when creating a Field in Lucene we can specify an IndexableFieldType. FieldType is the concrete implementation of IndexableFieldType. Using FieldType we can control, among other things:
Index Options: These help us control whether we want the field to be searchable or not. We also use the enum IndexOptions, with values like DOCS, DOCS_AND_FREQS, DOCS_AND_FREQS_AND_POSITIONS and DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, to control the information we want to store in postings.
Term Vectors: We also have the option to store term vectors. Along with term vectors, we can also store positions, offsets and payloads.
My question is: if we are already storing offsets and positions with IndexOptions, then why do we need to store them in the term vector again? I think I don't really understand what term vectors are for. Any explanation of the need for term vectors and what exactly they are would also be helpful. There is no clear explanation available on the web.
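As a toy illustration of the two structures being contrasted here: postings are keyed by term and list the documents containing it, while a term vector is keyed by document and lists the terms inside it. The sketch below uses plain Python dictionaries with made-up documents; it is not Lucene's on-disk format.

```python
from collections import Counter, defaultdict

# Made-up corpus: doc id -> text
docs = {
    0: "solr uses lucene",
    1: "lucene stores term vectors",
    2: "solr facets",
}

postings = defaultdict(dict)  # term -> {doc_id: frequency}  (inverted view)
term_vectors = {}             # doc_id -> {term: frequency}  (per-document view)

for doc_id, text in docs.items():
    counts = Counter(text.split())
    term_vectors[doc_id] = dict(counts)
    for term, freq in counts.items():
        postings[term][doc_id] = freq

# "Which documents contain this term?" is cheap with postings
docs_with_lucene = sorted(postings["lucene"])
# "Which terms does this document contain?" is cheap with term vectors
terms_in_doc_1 = sorted(term_vectors[1])
print(docs_with_lucene, terms_in_doc_1)
```

Both views hold the same information, but each makes only one direction of lookup cheap, which is why features that start from a known document (highlighting, "more like this") tend to read term vectors.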
Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to prune the size of the index, but the size of the actual term dictionary. What would be the simplest way to do this in Solr, or programmatically in SolrJ?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.
You can get this information from the Schema Browser page by clicking "Load Term Info", from the Luke admin handler https://wiki.apache.org/solr/LukeRequestHandler, and also from the stats component https://cwiki.apache.org/confluence/display/solr/The+Stats+Component.
To prune the index, you could facet on the field and collect the terms with low frequency. Then fetch the matching docs and update each document without those terms (this could be difficult because it depends on the analyzers and tokenizers of your field). You can also use the Lucene libraries to open the index and do it programmatically.
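The "collect the terms with low frequency" step can be sketched as follows, assuming you have already pulled term counts for the field (for example from a facet response) into a dictionary. The counts here are made up and `rare_terms` is a hypothetical helper.

```python
def rare_terms(term_counts, min_df=2):
    """Return, in sorted order, the terms whose document count is below min_df."""
    return sorted(term for term, df in term_counts.items() if df < min_df)


# Made-up facet counts: term -> number of documents containing it
counts = {"lucene": 120, "solr": 95, "xyzzy": 1, "qwertz": 1, "facet": 7}
to_prune = rare_terms(counts, min_df=2)
print(to_prune)
```

The resulting list could serve either as a stop word list for re-indexing or as the set of terms to strip when updating documents.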
You can check the number and distribution of your terms with the Admin UI under the collection's Schema Browser screen. You need to click Load Term Info.
Or you can use Luke which allows you to look inside the Lucene index.
It is not clear what you mean by 'remove'. For example, you can add the terms to the stop words in the analyzer chain if you want to avoid indexing them.
For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated using all the documents that appear in the results list.
I assume that the only problem with this approach would be when no query is made, i.e., when one searches for *:*. Then, no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ASCII range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr 1.4, one needed to use true instead of count and false instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side. For a given result set based on a filter, we can exclude this filter for the facet using the local params syntax ({!tag} and {!ex}). On the client side, we can then compute the counts relative to the complete index (or to another subset defined by a filter). This would probably not work for result sets built by a query parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the greater flexibility of the JSON Facet API, e.g. sorting on stats facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type term, the result count (df for the current result set) only needs to be divided by the df of that term for the index (docfreq). If the current result set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term fields and similar this might not scale.
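A minimal client-side sketch of that ratio, assuming the facet counts for the current result set and the index-wide document frequencies have already been fetched into two dictionaries. All numbers are made up and `rank_by_interestingness` is a hypothetical helper.

```python
def rank_by_interestingness(result_counts, index_df):
    """Score each term as (count in result set) / (docfreq in whole index)."""
    scored = []
    for term, count in result_counts.items():
        df = index_df.get(term, 0)
        if df > 0:  # skip terms with unknown index-wide frequency
            scored.append((term, count / df))
    # Highest ratio first; ties broken alphabetically for a stable order
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Made-up counts: facet counts for the current query vs. the complete index
result_counts = {"the": 50, "engine": 20, "renault": 18}
index_df = {"the": 1000, "engine": 40, "renault": 20}

ranking = rank_by_interestingness(result_counts, index_df)
print([term for term, _ in ranking])
```

With these numbers the common word "the" drops to the bottom even though it has the highest raw count, which is the effect the question is after.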
We are consolidating all our collected content on a record in a single content field, which is the main source for SOLR. The problem is that for some records the content field has only 100K characters, for others 10M or more.
As a result, a search on any term will push 10M character records to the top of the result list.
We would like to limit/counterbalance that by introducing something like a "relative term frequency", e.g. the number of occurrences divided by the total number of words in the content field.
Since we don't know what terms people will search on, (I think) we cannot calculate this at indexing time.
Any suggestions/ideas on how to do this?
You can start with the Custom Similarity class.
This would allow you to modify the above parameters and scoring factors.
You need to check the tf (term frequency) method and customize it.
The custom Similarity class can be referenced from the schema.xml file.
Check the Lucene DefaultSimilarity class for reference, which is the actual implementation.
Also check Changing Similarity
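To make the arithmetic concrete, here is a Python sketch of the "relative term frequency" the question proposes, next to a square-root tf of the kind that, as far as I recall, DefaultSimilarity uses. The function names and document sizes are made up, and this is not the Similarity API itself, only the numbers a custom tf would need to produce.

```python
import math


def sqrt_tf(freq):
    """Square-root damping of the raw term frequency."""
    return math.sqrt(freq)


def relative_tf(freq, num_terms):
    """Occurrences divided by the total number of terms in the field."""
    return freq / num_terms


# Made-up documents: (occurrences of the search term, total terms in content field)
long_doc = (200, 2_000_000)
short_doc = (30, 20_000)

# Raw or damped counts alone favor the long document...
print(sqrt_tf(long_doc[0]) > sqrt_tf(short_doc[0]))
# ...while the relative frequency favors the short, more focused one
print(relative_tf(*short_doc) > relative_tf(*long_doc))
```

Because the total number of terms in a field is known at indexing time regardless of what users search for, the division can be applied per term at query time without knowing the search terms in advance.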
Using http://wiki.apache.org/solr/TermVectorComponent I can get indexed terms and their frequencies for any document stored in my index. How can I get the same information for a text, without storing the text in my index? I just want SOLR to process the text and return the information, but without having to store the document in my index.
AFAIK this isn't possible without storing data in SOLR.
If you are looking to do text analysis (I understand this is broader than what you ask for), I would recommend the below alternatives:
MAUI - does keyphrase and terminology extraction.
Gensim - does topic modelling
Kea - keyword extraction
I've also come across some Python scripts that do term frequency analysis. Have a look at Mincemeat, particularly the example, which does term frequency calculation.
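If plain term counts are all you need, a few lines of standard-library Python may be enough. This is a minimal sketch with a deliberately naive tokenizer (lowercase, runs of letters and digits); it will not match whatever analyzer chain your Solr field uses, so the terms can differ from what TermVectorComponent would report. `term_frequencies` is a hypothetical helper.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens in a piece of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


tf = term_frequencies("Solr wraps Lucene. Lucene builds the index; Solr serves it.")
for term, count in sorted(tf.items()):
    print(term, count)
```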
From what you ask for I conclude that you actually need a search library, not a full search engine (service). That library is Lucene. Perhaps, this will help for starters: How to extract Document Term Vector in Lucene 3.5.0. You could store the index in RAM for the sake of computing necessary bits and then get rid of the index.
I wrote an application in Java several years ago that did heavy text analysis based on Lucene. I had to custom-write the search functions to find words within a certain distance of each other. You can import your text documents into the software and have it count the term frequencies, or you can take the code and tailor it to your needs.
Free download:
http://www.minoesoftware.com/download.php
Source:
https://github.com/danspiteri/MINOE/blob/master/src/minoe/SearchFiles.java
If you are using Solr 4 and you are not storing the text, you can use a Solr pivot facet on the text field. But then, obviously, you will get the terms as they are after analyzer processing:
http://192.168.0.202:8080/solr/fr_00_0425_sem/select?q=renault&wt=xml&facet=true&facet.pivot=uniqueKey,yourText
This is a pretty heavy query, I hope you don't have too many documents that match...
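For reference, the same request can be assembled programmatically. The sketch below only builds the URL; the base URL and core name are placeholders, `pivot_url` is a hypothetical helper, and the field names are the ones from the example above.

```python
from urllib.parse import urlencode


def pivot_url(base, query, key_field, text_field):
    """Build a select URL that pivots on the unique key and the text field."""
    params = {
        "q": query,
        "wt": "xml",
        "facet": "true",
        "facet.pivot": f"{key_field},{text_field}",
    }
    # Keep the comma unescaped so the pivot field list stays readable
    return f"{base}/select?" + urlencode(params, safe=",")


# Placeholder host and core name
url = pivot_url("http://localhost:8983/solr/mycore", "renault", "uniqueKey", "yourText")
print(url)
```

Narrowing `q` (or adding a filter) before pivoting keeps the number of matched documents, and so the cost of the pivot, under control.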