How does Solr store documents?

I know Solr uses Lucene and Lucene uses an inverted index. But from the Lucene examples I have seen so far, I am not sure I understand how it works in combination with Solr.
Given the following document:
<doc>
<field name="id">9885A004</field>
<field name="name">Canon PowerShot SD500</field>
<field name="manu">Canon Inc.</field>
<field name="inStock">true</field>
</doc>
From the examples I have seen so far, I would think that Lucene has to treat each field as a document. It would then say: the word Canon appears in field name and field manu.
Is the index broken down this much? Or does the index only say: "the word Canon appears in the document with id such and such"?
How does this work exactly when using Lucene with Solr?
What would this document look like in the index? (supposing each field has indexed="true")

I wrote a blog post a few years ago that explains this in detail [1].
Short answer to this question:
"From the examples I have seen so far, I would think that Lucene has to treat each field as a document."
Absolutely NOT.
Lucene's unit of information is the document, which is composed of a map of field -> value(s).
A Solr document is just a slightly different representation, because Solr incorporates a schema in which the fields are described.
So in Solr you can just add fields to a document without describing their type and other properties (these are stored in the schema), while in Lucene you need to define them explicitly when creating the document.
[1] https://sease.io/2015/07/exploring-solr-internals-lucene.html
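To make the "document made of fields" picture concrete, here is a toy sketch in Python. It is not Lucene's actual data structure, only an illustration under simplifying assumptions: the `tokenize` helper is a naive stand-in for an analyzer, and postings are plain sets of document ids. The point it shows is that each term is keyed together with its field, so the index can tell that "canon" occurs in both name and manu, while the thing being pointed at is still the whole document.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Naive stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    """Map (field, term) -> set of ids of documents with that term in that field."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for term in tokenize(value):
                index[(field, term)].add(doc_id)
    return index

# The document from the question, keyed by its id.
docs = {
    "9885A004": {
        "name": "Canon PowerShot SD500",
        "manu": "Canon Inc.",
        "inStock": "true",
    }
}
index = build_index(docs)

print(sorted(index[("name", "canon")]))  # ids of documents with "canon" in name
print(sorted(index[("manu", "canon")]))  # ids of documents with "canon" in manu
print(("name", "inc") in index)          # "inc" was only indexed under manu
```

So the answer to "is the index broken down this much?" is roughly: terms are scoped per field, but every posting refers back to one document, never to a field treated as its own document.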

Related

Solr filter on facets

Each of my documents can have one or more entries of a field called Classes, describing some properties of the document, always of the form:
<field name="Classes">"<Description> - <TypeLabel> - <OriginLabel>"</field>
So for instance a document about food might have the two fields:
<field name="Classes">"Yellow orange - Fruit - California"</field>
<field name="Classes">"Small broccoli - Vegetable - Florida"</field>
I am using Solr 5.0 and a schema.xml file, where I have a multiValued "text_en" field Classes that I copy to a "string" field Classes_asString so that I can do faceting on the whole field and treat it as one big label.
With facet.field on Classes_asString I am getting the facet counts that I want, but now I would like to additionally filter these results.
For example, how do I only get facet results that end with "California"?
Or, in another example, how do I only get facet results that have "Vegetable" between the two "-"?
I have seen the option facet.prefix, but this is not applicable in my case. I would appreciate any help or suggestions.
Maybe this scenario is a good place to use child documents:
Index the Classes info as child documents. Each Classes value packs at least 3 pieces of information, so it may be worth giving each one its own document.
Then you should be able to facet on the specific child field, either with a current Solr version if it is supported (not sure), or with the work in this ticket, which is not merged yet.
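As a stopgap while the labels remain single strings, the facet list can also be filtered on the client. The sketch below is that swapped-in approach, not the child-document one. It assumes Solr's default flat JSON facet layout (value, count, value, count, ...), and the helper names `split_label` and `filter_facets` as well as the sample counts are made up for illustration.

```python
def split_label(label):
    """Split 'Description - TypeLabel - OriginLabel' into its three parts."""
    # rsplit from the right keeps any ' - ' inside the description intact.
    description, type_label, origin = label.strip('"').rsplit(" - ", 2)
    return description, type_label, origin

def filter_facets(flat_counts, type_label=None, origin=None):
    """Keep facet entries whose type and/or origin match the given values."""
    kept = {}
    # Pair up the alternating values and counts of the flat facet list.
    for label, count in zip(flat_counts[0::2], flat_counts[1::2]):
        _, label_type, label_origin = split_label(label)
        if type_label is not None and label_type != type_label:
            continue
        if origin is not None and label_origin != origin:
            continue
        kept[label] = count
    return kept

# Made-up facet output for Classes_asString.
facets = [
    '"Yellow orange - Fruit - California"', 3,
    '"Small broccoli - Vegetable - Florida"', 2,
]
print(filter_facets(facets, origin="California"))
print(filter_facets(facets, type_label="Vegetable"))
```

This only sees the values Solr actually returned, so facet.limit has to be large enough to cover all labels of interest.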

Solr dynamicField not searched in query without field name

I'm experimenting with the Example database in Solr 4.10 and not understanding how dynamicFields work. The schema defines
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
If I add a new item with a new field name (say "example_s":"goober" in JSON format), a query like
?q=goober
returns no matches, while
?q=example_s:goober
will find the match. What am I missing?
I would like to see the SearchHandler definition from the solrconfig.xml file that you are using to execute the query above.
A SearchHandler generally defines which fields an unqualified query searches: the df (default field) parameter, or the qf (query fields) parameter when a DisMax parser is used.
Check that your dynamic field example_s is present in that field list in solrconfig.xml; otherwise, you can pass it as a parameter when sending the query to the search handler.
Hope this helps you resolve your problem.
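A minimal sketch of passing the field list at request time rather than in solrconfig.xml. It assumes the eDisMax parser (which reads qf), a hypothetical core named collection1, and a catch-all field named text; `build_query_url` is an illustrative helper, not part of any Solr client.

```python
from urllib.parse import urlencode

def build_query_url(base_url, terms, query_fields):
    """Build a /select URL that searches `terms` across `query_fields`."""
    params = {
        "q": terms,
        "defType": "edismax",          # qf is read by the DisMax/eDisMax parsers
        "qf": " ".join(query_fields),  # space-separated list of fields to search
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)

# Hypothetical host and core name.
url = build_query_url(
    "http://localhost:8983/solr/collection1", "goober", ["text", "example_s"]
)
print(url)
```

With the plain lucene parser, the equivalent knob would be df, which takes a single field name.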
If you are using the default schema, here's what's happening:
You are probably using the default endpoint (/select), so the search type and parameters come from its definition. That means the default (lucene) query parser is used and the field searched is text.
The text field is an aggregate and is populated by copyField instructions from other fields.
Your dynamic field definition for *_s allows you to index text under any name ending in _s, such as example_s. It's indexed (so you can search against it directly) and stored (so you can see it when you ask for all fields). It is not, however, searched as general text. Note that (unlike Elasticsearch) Solr strings have to be matched fully and completely. If you have some multi-word text in a string field, there is barely any point searching it. "goober" is one word, so it's not a very good example for understanding the difference here.
The easiest solution for you is to add another copyField instruction:
<copyField source="*_s" dest="text"/>, then all your *_s dynamic fields would also be searchable. But note that the analyzers applied at search time will not be those of the *_s definition, but those of the text field's type, which is not string but text_general, defined elsewhere in the file.
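The difference between the two field types, and what the copyField buys you, can be modelled with a toy sketch. This is not how Solr is implemented: `text_match` is a crude stand-in for text_general analysis (real analysis does more than lowercasing and whitespace splitting), and the "Goober Peas" value is made up to show the multi-word case.

```python
def string_match(stored_value, query):
    """A `string` field is indexed as one token, so only the whole value matches."""
    return stored_value == query

def text_match(stored_value, query):
    """Rough stand-in for an analyzed text field: lowercase, whitespace tokens."""
    return query.lower() in stored_value.lower().split()

def search(doc, query, default_field="text"):
    """Unqualified queries only look at the default field."""
    return text_match(doc.get(default_field, ""), query)

doc = {"example_s": "Goober Peas"}
before = search(doc, "goober")  # nothing has been copied into `text` yet

# Simulate <copyField source="*_s" dest="text"/>: copy every *_s value into `text`.
doc["text"] = " ".join(v for k, v in doc.items() if k.endswith("_s"))
after = search(doc, "goober")

print(before, after)
print(string_match(doc["example_s"], "Goober"))  # partial value, string field
```

The unqualified query only starts matching once the value has been copied into the analyzed default field, and a partial value never matches the string field itself.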
As to Solr vs. Elasticsearch, they err on opposite sides of magic. Solr makes you configure the system and makes it very easy to see the exact current configuration. Elasticsearch hides all of the configuration, but you have to rediscover it the second you want to move away from the default behaviour. In the end, the result is probably similar and meets somewhere in the middle.

How to query a specific document by id

From a previous query I already have the document ID (the uniqueKey in this schema is 'track_id') of the document I'm interested in.
Then I would like to query a sequence of words on that document while highlighting the match.
I can't seem to be able to combine the search parameters in a successful way (all my google searches return purple links :\ ), although I've already tried many combinations these past few days. I also know the field where the matches will be if that's any use in terms of improving match speed.
I'm guessing it should be something like this:
/select?q=track_id:{key_i_already_have} AND/&/{part_I_dont_know} word1 word2 word3
Currently, since I can't combine these two search parameters, I'm only querying the words and thus getting several results from several documents.
Thanks in advance.
From Solr 4 on, you can use realtime get, which is much faster than searching the index by id.
http://localhost:8983/solr/get?ids=id1,id2,id3
For index updates to be visible (searchable), some kind of commit must reopen a searcher to a new point-in-time view of the index. The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher. This is primarily useful when using Solr as a NoSQL data store and not just a search index.
You could apply a filter query (fq) on the id. It restricts the search to that one document, so the main query then matches the keywords only within it and highlights them.
Your query will look like:
/select?fq=track_id:DOC_ID&q=word1 word2 word3
Just make sure your unique key field ("id" in this example, track_id in your schema) is defined with type string in schema.xml so that filter queries can be applied to it.
<field name="id" type="string" indexed="true" stored="true" required="true" />
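Putting the pieces together, a request that filters on the known id, searches the words, and asks for highlighting could be built like this. It is a sketch under assumptions: the core name tracks and the field name lyrics are hypothetical, `build_highlight_query` is an illustrative helper, and df is used to point the words at the field you already know contains the matches.

```python
from urllib.parse import urlencode

def build_highlight_query(base_url, doc_id, words, field):
    """Build a /select URL restricted to one document, with highlighting on."""
    params = {
        "q": " ".join(words),        # the words to look for
        "df": field,                 # search them in the known field
        "fq": "track_id:" + doc_id,  # restrict results to the known document
        "hl": "true",                # turn highlighting on
        "hl.fl": field,              # field to generate snippets from
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)

# Hypothetical host, core and field names.
url = build_highlight_query(
    "http://localhost:8983/solr/tracks", "DOC_ID", ["word1", "word2", "word3"], "lyrics"
)
print(url)
```

Highlighting generally needs the field to be stored, so that is worth checking in the schema as well.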

Know indexing time for a document in Solr

Is it possible to know the indexing time of a document in Solr? Just as there is an implicit "score" field that automatically gets added to a document, is there a field that stores the indexing time?
I need it to know the date when a document got indexed.
Thanks
Solr does not automatically add a create date to documents. You could certainly index one with the document though, using Solr's DateField. In earlier versions of Solr (< 4.2), there was a commented-out timestamp field in the example schema.xml, which looked like:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Also, I think it bears noting that there is no implicit "score" field. Scores are calculated at query time, rather than being tied to the document. Different queries will generate different scores for the same document. There are norms stored with the document that are factored into scores, but they aren't really fields.
femtoRgon gives you a correct solution, but you must be careful with partial document updates.
If you do not do partial document updates, you can stop reading now ;-)
If you partially update a document, Solr merges the existing stored values with your partial document, so the timestamp will not be updated. The solution is to not store the timestamp; then Solr cannot merge the old value. The drawback is that you cannot retrieve the timestamp with your search results.
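If the timestamp needs to stay stored, one alternative (my own suggestion, not something the answers above propose) is to set it explicitly in every partial update. The sketch below builds such a payload using the atomic update "set" modifier. It assumes the unique key is id and the date field is named timestamp as in the example schema; `atomic_update` is a hypothetical helper and the fixed date is only there to make the output reproducible.

```python
import json
from datetime import datetime, timezone

def atomic_update(doc_id, changes, now=None):
    """Build a JSON atomic-update payload that also refreshes the timestamp."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    doc = {"id": doc_id}
    for field, value in changes.items():
        doc[field] = {"set": value}  # "set" replaces the field's current value
    # Solr date fields expect ISO 8601 in UTC with a trailing Z.
    doc["timestamp"] = {"set": now.strftime("%Y-%m-%dT%H:%M:%SZ")}
    return json.dumps([doc])

fixed = datetime(2015, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
payload = atomic_update("9885A004", {"inStock": False}, now=fixed)
print(payload)
```

The payload would be POSTed to the core's /update handler; the trade-off is that every client doing partial updates has to remember to include the timestamp.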

How to view non-stored fields per document?

I have a field like this:
<field name="status" type="string" indexed="true" stored="false" required="false" />
Using LukeRequestHandler I can view only statistics of the indexed terms; I can view indexed terms per document only if stored="true". TermsComponent can show only term frequencies, not terms per document.
Is it possible to look inside the inverted index without setting stored="true" and reindexing Solr?
In order to view the indexed terms for a single document, you need to use the full Luke application, not the LukeRequestHandler. You would need to copy the index folder from your Solr data directory to another location, then open it in Luke.
There is, however, a workaround within Solr itself: do a search that returns just the one document, and facet on the field you want to examine. Every term in the index for that field on that document will be an entry in the facet output. Here is a full sample URL for this kind of search:
http://localhost:8983/solr/core/select?q=id:1234&facet.field=status&facet.limit=-1&facet.mincount=1&facet=true&facet.method=enum
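Reading the terms back out of the response is then a matter of walking the facet list. A small sketch, assuming the request is made with wt=json and Solr's default flat facet layout (term, count, term, count, ...); the sample response below is made up, and `terms_for_document` is an illustrative helper.

```python
import json

def terms_for_document(response_text, field):
    """Return the indexed terms reported by faceting on `field`."""
    response = json.loads(response_text)
    flat = response["facet_counts"]["facet_fields"][field]
    # The flat layout alternates term, count, term, count, ...
    return [term for term, count in zip(flat[0::2], flat[1::2]) if count > 0]

# Made-up response for a query that matched exactly one document.
sample = json.dumps({
    "response": {"numFound": 1, "docs": [{"id": "1234"}]},
    "facet_counts": {"facet_fields": {"status": ["active", 1]}},
})
print(terms_for_document(sample, "status"))
```

Because the query matches a single document, every term with a non-zero count belongs to that document's status field.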
If you decide to go the Luke route, you can step through your index (or search for an individual document) and view just one document.
The official Luke page is here, but it only supports up through 4.0-ALPHA:
http://code.google.com/p/luke/
You can find Luke for versions beyond 4.0-ALPHA here:
https://java.net/projects/opengrok/downloads
There is an effort underway to absorb Luke into the Lucene/Solr source code as a module, so it will always be current and released at the same time as each Lucene/Solr version.
