I was reading about inverted indexes (used by text search engines such as Solr, Elasticsearch, etc.) and, as I understand it (taking "Person" as an example):
The attribute to Person relationship is inverted:
John -> PersonId(1), PersonId(2), PersonId(3)
London -> PersonId(1), PersonId(2), PersonId(5)
I can now search the person records for 'John who lives in London'
Doesn't this solve all the problems? Why do we have the forward (or regular database) index at all? In other words, in what cases is regular indexing useful? Please explain. Thanks.
The point that you're missing is that there is no real technical distinction between a forward index and an inverted index. "Forward" and "inverted" in this case are just descriptive terms to distinguish between:
A list of words contained in a document.
A list of documents containing a word.
The concept of an inverted index only makes sense if the concept of a regular (forward) index already exists. In the context of a search engine, a forward index would be the term vector; a list of terms contained within a particular document. The inverted index would be a list of documents containing a given term.
When you understand that the terms "forward" and "inverted" are really just relative terms used to describe the nature of the index you're talking about - and that really an index is just an index - your question doesn't really make sense any more.
Here's an explanation of inverted index, from Elasticsearch:
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
Inverted indexing is for fast full-text search. A regular (forward) index is less efficient to query, because the engine has to look through all entries to find a term, but it is very fast to build.
You can say this:
Forward index: fast indexing, less efficient queries
Inverted index: fast queries, slower indexing
But it always depends on context. Compare it with MySQL: MyISAM has fast reads, while InnoDB has fast inserts/updates and slower reads.
Read more here: https://www.found.no/foundation/indexing-for-beginners-part3/
In forward index, the input is a document and the output is words contained in the document.
{
doc1: [word1, word2, word3],
doc2: [word4, word5]
}
In the reverse/inverted index, the input is a word, and the output is all the documents in which the words are contained.
{
word1: [doc1, doc10, doc3],
word2: [doc5, doc3]
}
Search engines make use of reverse/inverted index to get us documents from keywords.
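To make the two shapes concrete, here is a minimal Python sketch (not how Solr or Elasticsearch store things internally). It inverts a small forward index and answers an "all of these words" query by intersecting posting lists. The sample data and the function names `invert` and `search_all` are made up for illustration.

```python
from collections import defaultdict

# Forward index: document -> words it contains (illustrative data).
forward_index = {
    "doc1": ["john", "london"],
    "doc2": ["john", "paris"],
    "doc3": ["mary", "london"],
}

def invert(forward):
    """Build word -> sorted list of documents from a forward index."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(docs) for word, docs in inverted.items()}

def search_all(inverted, *words):
    """Return the documents that contain every given word."""
    postings = [set(inverted.get(w, ())) for w in words]
    return sorted(set.intersection(*postings)) if postings else []

inverted_index = invert(forward_index)
print(inverted_index["john"])
print(search_all(inverted_index, "john", "london"))
```

The forward index is still what you would read when you already have a document and want its terms, which is why both shapes tend to coexist.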
Situation
I have a Rails application using Postgresql.
Texts are added to the application (ranging in size from a few words to, say, 5,000 words).
The texts get parsed, first automatically, and then with some manual revision, to associate each word/position in the text with specific information (verb/noun/etc, base word (running ==> run), definition_id, grammar tags)
Given a lemma (base word, ex. "run"), or a part of speech (verb/noun), or grammar tags, or a definition_id (or a combination), I need to be able to find all the other text positions in the database that contain the same information.
Conflict
I can't do a full-text search because, for example, if I click "left" in "I left Nashville", I don't want "turn left at the light" to appear. I just want "leave" as a verb, as well as other forms of "leave" as a verb.
Also, I might want just "left" with a specific definition_id (eg "Left" used as "The political party", not used as "the opposite of the right").
In short, I am looking for some advice on which of the following 3 routes I should take (or if there's a 4th or 5th route that I haven't considered).
Solutions
There are three options I can think of:
Option 1: TextPosition
A TextPosition table to store each word position, with columns for each of the above attributes.
This would make searching very easy, but there would be MANY records (one per position). Maybe that's not a problem? Is storing this many records a bad idea for some specific reason?
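As a rough illustration of Option 1, the sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL. The table name, column names, index names and sample rows are all hypothetical, and a real Rails schema would look different. It only shows that a composite index on (lemma, pos) lets the "leave as a verb" lookup skip the other sense of "left".

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_positions (
    id            INTEGER PRIMARY KEY,
    text_id       INTEGER NOT NULL,
    position      INTEGER NOT NULL,
    surface       TEXT NOT NULL,
    lemma         TEXT NOT NULL,
    pos           TEXT NOT NULL,
    definition_id INTEGER
);
CREATE INDEX idx_positions_lemma_pos ON text_positions (lemma, pos);
CREATE INDEX idx_positions_definition ON text_positions (definition_id);
""")

# (text_id, position, surface, lemma, pos, definition_id), hypothetical rows.
rows = [
    (1, 1, "left", "leave", "verb", 10),    # "I left Nashville"
    (2, 1, "left", "left", "adverb", 20),   # "turn left at the light"
    (3, 0, "leaving", "leave", "verb", 10),
]
conn.executemany(
    "INSERT INTO text_positions "
    "(text_id, position, surface, lemma, pos, definition_id) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

def find_positions(conn, lemma, pos):
    """Return (text_id, position, surface) for every match of lemma + part of speech."""
    cur = conn.execute(
        "SELECT text_id, position, surface FROM text_positions "
        "WHERE lemma = ? AND pos = ? ORDER BY text_id, position",
        (lemma, pos),
    )
    return cur.fetchall()

matches = find_positions(conn, "leave", "verb")
print(matches)
```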
Option 2: JSON on the Text object
A JSON column on the Text object, to store all word positions in a large array of hashes, or a hash of hashes.
This would add zero records, but, a) Building a query to search all texts with certain information would probably be difficult, b) That query would probably be slow, and c) It could take up more storage space than a separate table (TextPosition).
Option 3: TWO JSON columns: one on the Text object, and one on each dictionary object
A JSON in each text object, as in option 2, but only to render the text (not to search), containing all the information about each position in that same text.
Another JSON in each "dictionary object" (definition, base word, grammar concept, grammar tag), just for searching (not to render the text). This column would track the matches of this particular object across ALL texts. It would be an array of hashes, where each hash would be {text_id: x, text_index: y}.
With this option, the search would be "easier", but it would still not be ideal: to find all the text positions that contain a certain attribute, I would have to do the following:
Find the record for that attribute
Extract the text_ids / indexes from the record
Find the texts with those IDs
Extract the matching line from each text, using the index that comes with each text_id within the JSON.
If it was a combination of attributes that I were looking for, I would have to do those 4 steps for each attribute, and then find the intersection between the sets of matches for each attribute (to end up only having the positions that contain both).
Furthermore, when updating a position (for example, if a person indicates that an attribute is wrongly associated and that it should actually be another), I would have to update both JSONs.
Also, will storing 2 JSON columns actually bring any tangible benefit over a TextPosition table? It would probably take up MORE storage space than using a TextPosition table, and for what benefit?
Conclusion
In sum, I am looking for some advice on which of those 3 routes I should follow. I hope the answer is "option 1", but if so, I would love to know what drawbacks/obstacles could come up later when there are a ton of entries.
Thanks, Michael King
Text parsing and searching make my brain hurt. But anytime I have something with the complexity of what you are talking about, ElasticSearch is my tool of choice. You can do some amazingly complex indexing and searching with it.
So my answer is 4) ElasticSearch.
I'm using Apache Solr to run search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of the 4 results is a document containing only 2 of those words, multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is correct: it ranks results with an algorithm called TF-IDF.
So, in your case, the order can be explained by the TF-IDF formula.
One possible solution is to ignore the TF-IDF score and count each matching term as one, so that a document with 5 matching terms gets a score of 5, one with 4 matches gets 4, and so on. A constant score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
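To see why the two approaches rank differently, here is a toy Python sketch. The scoring is a simplified TF-IDF (raw term frequency times ln(N/df) + 1), which is an assumption for illustration and not Lucene's actual similarity. The four documents are invented to mirror the situation in the question.

```python
import math
from collections import Counter

# Invented corpus: A repeats two query terms, B contains all five once.
docs = {
    "A": "peak oil peak oil peak oil peak oil".split(),
    "B": "julian cribb epa peak oil".split(),
    "C": "julian cribb".split(),
    "D": "epa".split(),
}
query = ["julian", "cribb", "epa", "peak", "oil"]

def tfidf_scores(docs, query):
    """Simplified TF-IDF: sum of tf * (ln(N / df) + 1) over the query terms."""
    n = len(docs)
    df = Counter(t for words in docs.values() for t in set(words))
    scores = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        scores[doc_id] = sum(
            tf[t] * (math.log(n / df[t]) + 1.0) for t in query if df[t]
        )
    return scores

def constant_scores(docs, query):
    """Each matching query term contributes exactly 1, like text:term^=1."""
    return {d: sum(1 for t in query if t in words) for d, words in docs.items()}

def ranked(scores):
    """Document ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

print(ranked(tfidf_scores(docs, query)))
print(ranked(constant_scores(docs, query)))
```

With this data the repeated "peak oil" document wins under the simplified TF-IDF, while the document containing all five terms wins once every clause has a constant score.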
Another solution, which requires some scripting, goes like this. First, run a query in which all 5 terms are mandatory, e.g. +Julian +Cribb +EPA +peak +oil. Then do the same for every combination of 4 terms out of 5 (if I'm not mistaken, that requires 5 additional queries), and so on, down to a single mandatory clause. You then have the full results, and you only need to normalise them, or just concatenate them if you decide that 5-match docs are always better than 4-match docs. Cons of this solution: a lot of queries, you need to run them programmatically (a script would help), and the normalisation isn't obvious. Pros: you keep both TF-IDF and the idea of matched terms.
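A small script along these lines could generate the query strings for the multi-query approach. It only builds the strings and does not talk to Solr. The function name `mandatory_clause_queries` is made up for this sketch.

```python
from itertools import combinations

def mandatory_clause_queries(terms):
    """Build '+a +b ...' queries for every non-empty subset, largest subsets first."""
    queries = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            queries.append(" ".join("+" + term for term in combo))
    return queries

queries = mandatory_clause_queries(["Julian", "Cribb", "EPA", "peak", "oil"])
for q in queries[:6]:
    print(q)
print(len(queries))
```

For 5 terms this produces 1 + 5 + 10 + 10 + 5 = 31 queries. A document that matches all 5 terms also matches every smaller combination, so results need de-duplicating when they are merged.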
Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to reduce the size of the index, but the size of the actual term dictionary. What would be the simplest way to do this in Solr, or programmatically in SolrJ?
More exactly: I wish to eliminate these terms (tokens) from an existing index (term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing an entire collection, 3) removing the terms from the stop word list.
You can get this information from the Schema Browser page by clicking "Load Term Info", from the Luke request handler https://wiki.apache.org/solr/LukeRequestHandler, and also from the stats component https://cwiki.apache.org/confluence/display/solr/The+Stats+Component.
To prune the index, you could facet on the field and collect the terms with low frequency. Then fetch the matching docs and update each document without those terms (this could be difficult, because it depends on the analyzers and tokenizers of your field). Alternatively, you can use the Lucene libraries to open the index and do it programmatically.
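The idea can be sketched in plain Python on already-tokenized documents held in memory. This does not touch a Solr or Lucene index. The sample data and the helper names `rare_terms` and `prune` are hypothetical, and real tokens depend on your analyzer chain.

```python
from collections import Counter

# Already-tokenized documents (illustrative data).
docs = {
    "d1": ["solr", "index", "zyzzyva"],
    "d2": ["solr", "index"],
    "d3": ["solr", "lucene"],
}

def rare_terms(docs, min_df):
    """Terms whose document frequency is below min_df, in sorted order."""
    df = Counter(t for words in docs.values() for t in set(words))
    return sorted(t for t, n in df.items() if n < min_df)

def prune(docs, terms_to_drop):
    """Return a copy of docs with the given terms removed from every document."""
    drop = set(terms_to_drop)
    return {d: [w for w in words if w not in drop] for d, words in docs.items()}

to_drop = rare_terms(docs, min_df=2)
pruned = prune(docs, to_drop)
print(to_drop)
print(pruned)
```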
You can check the number and distribution of your terms with the AdminUI under the collection's Schema Browser screen. You need to Load Term Info:
Or you can use Luke which allows you to look inside the Lucene index.
It is not clear what you mean by 'remove'. If you want to avoid indexing them, you can, for example, add them to the stopwords in the analyzer chain.
For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated using all the documents that participate in the results list.
I assume that the only problem with this approach would be when no query is made, i.e. when one searches for "*:*". Then no term will prevail over the others in terms of interestingness. Please correct me if I am wrong here.
Anyway, is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first)
index - to return the constraints sorted in their index order (lexicographic by indexed term). For terms in the ascii range, this will be alphabetically sorted.
The default is count if facet.limit is greater than 0, index otherwise.
Prior to Solr1.4, one needed to use true instead of count and false
instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness score can be computed on the client side: for a given result set based on a filter, we can exclude this filter for the facet using the local params syntax (!tag & !ex, see Local Params). On the client side, we can then compute relative counts compared to the complete index (or another subpart of a filter). This would probably not work for result sets built by a query parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the greater flexibility of the JSON Facet API (json.facet), e.g. sorting on stats facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort option. At least for facets of type terms, the result count (df for the current result set) only needs to be divided by the df of that term for the index (docfreq). If the current result set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cached query on the complete index. However, for term fields and similar this might not scale.
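On the client side, the ratio described above could be computed roughly like this. It is a sketch that assumes you already have two sets of facet counts as plain dicts: one for the current result set and one for the complete index. The numbers and the function name `interestingness` are invented.

```python
def interestingness(result_counts, index_counts):
    """Rank terms by df in the result set divided by df in the whole index."""
    scores = {}
    for term, count in result_counts.items():
        total = index_counts.get(term, 0)
        if total:  # skip terms with no index-wide count to avoid dividing by zero
            scores[term] = count / total
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical facet counts for the current result set and for the full index.
result_counts = {"the": 90, "solr": 40, "facet": 12}
index_counts = {"the": 1000, "solr": 50, "facet": 12}

ranking = interestingness(result_counts, index_counts)
print(ranking)
```

A common word such as "the" has a high count in the results but an even higher count in the index, so it drops to the bottom of the ranking.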
I am confused here and want to clear up my doubt. It may be a stupid question, but I want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
What does this ^ mean here?
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on boosting, please see the Boosting Ranking Terms section in the Solr Relevancy Cookbook. This slide deck about boosting from the Lucene Revolution conference earlier this year also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.
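The client-side expansion quoted in the question (text:NeXT ==> (text:NeXT^10 OR text:next)) could be sketched like this. The function name `expand_case_sensitive` and the default boost of 10 are assumptions taken from that example, not part of any Solr API.

```python
def expand_case_sensitive(field, term, boost=10):
    """Expand a term with upper-case letters into a boosted original plus a lowercased variant."""
    lowered = term.lower()
    if lowered == term:
        # Nothing to expand: the term is already lowercase.
        return f"{field}:{term}"
    return f"({field}:{term}^{boost} OR {field}:{lowered})"

print(expand_case_sensitive("text", "NeXT"))
print(expand_case_sensitive("text", "next"))
```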