Solr not including fields with empty values in results

The data indexed in Solr contains some fields with empty values. When I run q=*:*, the results do not include those fields. What parameter do I need to pass with the query to get fields with empty values in the result?
EDIT:
I am indexing data from a CSV file whose entries are as follows:
id, dob, name
1,,name1
2,,name2
Now when I search for the top 10 records I get only two fields. I want to get all fields, even if no value is stored for them.

The field should have stored="true".
Check the definition of the dob field in your schema.xml file; it should have stored="true":
<field name="dob" type="text_general" indexed="true" stored="true"/>
Reindex the documents and query again; it should work.
Hope this helps.

If an item doesn't have a piece of data, Solr doesn't store the field. You should be able to force storage of an empty string by setting the field attributes required="true" default="".
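If changing the schema is not an option, the gap can also be closed on the client side. The following is a minimal sketch, not a Solr feature: fill_missing_fields is a hypothetical helper, and the docs list is hand-written to imitate what Solr would return for the CSV in the question.

```python
def fill_missing_fields(docs, fields, placeholder=""):
    """Return copies of result docs in which every expected field is present.

    Fields that Solr omitted get `placeholder`; fields not listed in
    `fields` are dropped from the copy.
    """
    return [{name: doc.get(name, placeholder) for name in fields} for doc in docs]


# Hand-written stand-in for a Solr response: "dob" is absent from each doc.
docs = [{"id": "1", "name": "name1"}, {"id": "2", "name": "name2"}]
filled = fill_missing_fields(docs, ["id", "dob", "name"])
print(filled)
```

The original docs are left unmodified, so the raw response stays available if it is needed later.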

What do you mean by empty fields? Are these fields set as indexed=true? Are you sending an empty value, say spaces, as data when you index these fields? It looks like you are not sending even placeholder blank data for this field, which is why this is happening. For example, if I send data in the format {"id":"change.me","title":" "}, where my title field is blank, it gets indexed. But if you try to send data like {"id":"change.me","title":}, Solr will return an error.
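The difference between the two payloads can be checked without Solr: the first is valid JSON and the second is not, so it would be rejected by any JSON parser. A small sketch using Python's standard json module (is_valid_json is a helper name introduced here for illustration):

```python
import json


def is_valid_json(payload):
    """Return True if `payload` parses as JSON, False otherwise."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


blank_title = '{"id":"change.me","title":" "}'   # title is a single space
missing_value = '{"id":"change.me","title":}'    # no value after the colon

print(is_valid_json(blank_title))
print(is_valid_json(missing_value))
```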

Add wt=csv to the query and you can export a well-formed CSV file.
Specify the fields you want returned using fl=.
Example:
select?fl=id,foo,bar&q=field:value&rows=1000&start=0&wt=csv
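A request like this can be assembled programmatically so that the parameter values are URL-encoded correctly. This is a sketch only: build_select_url is a hypothetical helper, and the host, port and core name (localhost:8983, mycore) are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, fields, rows=10, start=0, wt="csv"):
    """Build a Solr /select URL with an explicit field list (fl)."""
    params = {
        "q": query,
        "fl": ",".join(fields),  # comma-separated list of fields to return
        "rows": rows,
        "start": start,
        "wt": wt,                # response writer, e.g. csv or json
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL; replace with your own host and core.
url = build_select_url("http://localhost:8983/solr/mycore", "*:*", ["id", "dob", "name"])
print(url)
```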

Related

Apache solr date field in views

I have a custom date field, field_last_archived_date, in one of my content types.
There is a corresponding entry in the Apache Solr field list called dm_field_last_archived_date.
I am facing two problems.
First, when I try to sort on this field in a Solr view, I get the error "cannot sort on multivalued field."
Second, when I try to use this field as an exposed filter to provide a date range, I'm not sure what date format should be given. I have tried formats like "2011-10-01T23:59:59Z", "2011-10-01 23:59:59", a plain Unix timestamp, etc., but all of them throw the error "Invalid Date String:'OctoberAMCECESTAM+02:001_SunAMCESTE_1nd+02008601'".
Any idea what I am doing wrong here?
Thanks...
dm_field_last_archived_date is a multivalued field, and Solr does not support sorting on multivalued fields.
To confirm this behavior, apply the sort to a single-valued field.
You can check whether a field is multivalued in the Solr schema file, where the definition looks like:
<field name="yourFieldName" type="tint" indexed="true" stored="true" omitNorms="true" multiValued="true" default="defaultValue"/>
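On the date format part of the question: Solr date fields expect UTC timestamps in the form YYYY-MM-DDThh:mm:ssZ, which matches the first format the asker tried. The sketch below only shows how such strings and a range clause can be produced; to_solr_date and solr_date_range are hypothetical helper names, and it does not explain the garbled error message in the question.

```python
from datetime import datetime, timezone


def to_solr_date(dt):
    """Format a datetime as a UTC string of the form 2011-10-01T23:59:59Z."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def solr_date_range(field, start, end):
    """Build an inclusive range clause such as field:[start TO end]."""
    return f"{field}:[{to_solr_date(start)} TO {to_solr_date(end)}]"


clause = solr_date_range(
    "dm_field_last_archived_date",
    datetime(2011, 10, 1, 0, 0, 0),
    datetime(2011, 10, 1, 23, 59, 59),
)
print(clause)
```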

Solrj indexing mechanism

I have a question about the indexing mechanism when using Solr from Java. If I create documents and want to search only on the field "name", will Solr index all fields, or only the field "name" in each document?
If your schema tells Solr to store only the field "name", then only that field will be stored.
If you instruct Solr to store everything you send to it (as in schemaless mode) and you send 400 fields, each of those fields will be stored.
Only the fields you are going to query need to be indexed; fields you want to keep but never search can be limited to just stored. If you don't need the content of a field returned but just want to search on it, you can set stored to false and indexed to true.
In the schema.xml where you define the fields in use, you need to set indexed=true for all the fields you want to search on.
In your case it would look something like this:
<field name="name" type="string" indexed="true" stored="true" />

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement full-text search over some PDF files. The indexing part works out of the box using bin/post. I can see search results in the admin UI for some queries, though without the matched text and its context.
Now I am reading this post for the highlighting part. It is for an older version of Solr, from before the managed schema was available. Before I fully understand what it is doing, I have some questions:
The author defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are two fields needed? Can I define a single field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by": ["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the JSON returned by some queries). But even if I define the fields as described, I cannot see them as keys in the JSON search results.
The _text_ field seems to be a concatenation of other fields. Does it contain the full text? It does not seem to be accessible by default.
To be brief, using The Elements of Statistical Learning as an example, how do I highlight the relevant text for the query "SVM"? And if I change the file name to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get to the questions, let me briefly describe how Solr works. At its core Solr uses Lucene, which, simply put, is a matching engine. It builds an inverted index of the documents: for each term it keeps a list of the documents that contain it, which is what makes search so fast. Getting to your questions:
Solr itself does not convert your PDF to text; the update processor configured in the handler does. This can be configured in solrconfig.xml, or you can write your own handler.
Coming back to why there are two fields: simply put, the first one (content) is a stored field that keeps the data as is. The second one is a copyField target that receives data for each document as configured in schema.xml.
We do this because we can then choose the indexing strategy. For example, we can add a lowercase filter factory to the text field so that everything is indexed in lower case; then searches for "Sam" and "sam" return the same results. We can also remove commonly occurring words such as "a" and "the", which would otherwise needlessly increase the index size. A large index uses a lot of memory, so when you are dealing with millions of records you want to be careful about which fields to index to make better use of resources.
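The effect of such an analysis chain can be imitated in a few lines. This is a toy sketch, not Solr's analyzer: it splits on whitespace, lowercases, and drops a small stop-word set (STOP_WORDS and analyze are names made up for this illustration).

```python
STOP_WORDS = {"a", "the"}  # illustrative stop-word list, far smaller than a real one


def analyze(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


print(analyze("Sam read the book"))
print(analyze("sam") == analyze("Sam"))
```

Because the same function is applied to both the indexed text and the query, "Sam" and "sam" end up as the same token, while the original casing is lost. That loss is why the display value is kept in a separate stored field.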
The field "text" is a copyField target: data from the fields named in the schema is copied into it. A general search then does not need to fire a separate query for each field, because everything is copied into "text". This is why it is multiValued: it can store an array of values. content is stored but not indexed, and text is the opposite, because when you return results to the end user you show what was originally saved, not the stripped-down data produced by the filters applied to the text field (stop-word removal, case filters, stemming and so on).
This is the reason you do not see the "text" field in the search results: it is only used by Solr for searching.
For highlighting see this.
For more, there are some great blogs by Yonik and Joel.
Hope this helps. :)

Solr index vs stored

I am a little confused about the behaviour of the indexed and stored attributes of Solr fields.
For example, if I have the following in schema.xml:
<field name="test1" type="text" indexed="false"
stored="false" required="false" />
Will the field test1 not be stored in the Solr document even if I create a document with that field, set a value for it, and commit the document to Solr? Since it has the stored=false attribute, does that mean the value of the field is lost and not persisted?
That is correct. Typically you will want your field to be either indexed or stored or both. If you set both to false, that field will not be available in your Solr docs (either for searching or for displaying). See Alexandre's answer for the special cases when you will want to set both to false.
As stated here: indexed=true makes a field searchable (and sortable and facetable). For example, if you have a field named test1 with indexed=true, then you can search it like q=test1:foo, where foo is the value you are searching for. If indexed=false for field test1, then that query will return no results, even if you have a document in Solr with test1's value being foo.
stored=true means you can retrieve the field when you search. If you want to explicitly retrieve the value of a field in your query, use the fl param, like fl=test1 (the default is fl=*, meaning retrieve all stored fields). The value will be returned only if stored=true for test1; otherwise it will not be returned.
The main point of having both set to false is to explicitly skip that particular field.
This is useful, for example, if you have a storing/indexing dynamicField mapping and you want to ignore one particular name that would otherwise fall under the dynamicField's pattern.
Alternatively you could use dynamicField to ignore a whole set of fields with the same prefix/suffix that come from a third party. For example, Tika will send you a whole bunch of metadata fields which you may just want to ignore. See this defined in Solr's example schema.xml and used in solrconfig.xml.
In the later versions of Solr, you could also use IgnoreFieldUpdateProcessorFactory (see full list for others) instead, which will get rid of those fields even earlier in the indexing process.
Quoting from this response in the Solr's mail thread:
"indexed" and "stored" are independent, orthogonal attributes - you can use
any of the four combinations of true and false. "indexed" is used for search
or query, the "lookup" portion of processing a query request. Once the
search/query/lookup is complete and a set of documents is selected, "stored"
is the set of fields whose values are available for display or return with
the Solr response.
Part of the reason for the separation is that Solr/Lucene "analyzes" or
transforms the input data into a more efficient form for faster and more
relevant search/lookup. Unfortunately, that analyzed/transformed data is
frequently no longer suitable for display and human consumption. In other
words the analysis/transformation is not bidirectional/reversible. Setting
"stored=true" guarantees that the original data can be retrieved in its
original form.
If both are false, you lose the data in that field. If only indexed is true, the data is searchable but cannot be displayed. If only stored is true, you will not be able to search on that field, but it can be displayed (in this case you can write a copyField rule to copy the content of that field to the default searchable field). If both are true, you can search and display.
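The four combinations can be illustrated with a toy in-memory model. This is a deliberately simplified sketch and not how Solr or Lucene is implemented: ToyIndex, its schema format (field name mapped to an (indexed, stored) pair), and the sample document are all made up for illustration.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model: 'indexed' feeds an inverted index, 'stored' keeps the raw value."""

    def __init__(self, schema):
        self.schema = schema               # field name -> (indexed, stored)
        self.inverted = defaultdict(set)   # (field, token) -> set of doc ids
        self.stored = {}                   # doc id -> {field: original value}

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            indexed, stored = self.schema.get(field, (False, False))
            if indexed:
                for token in value.lower().split():
                    self.inverted[(field, token)].add(doc_id)
            if stored:
                self.stored[doc_id][field] = value

    def search(self, field, term):
        """Return the stored fields of every doc whose `field` contains `term`."""
        doc_ids = sorted(self.inverted.get((field, term.lower()), set()))
        return [self.stored[doc_id] for doc_id in doc_ids]


schema = {
    "title": (True, False),    # searchable, not returned
    "body": (False, True),     # returned, not searchable
    "both": (True, True),      # searchable and returned
    "ignored": (False, False), # dropped entirely
}
index = ToyIndex(schema)
index.add(1, {"title": "Tennis racket", "body": "Tennis gear", "both": "Tennis", "ignored": "x"})

print(index.search("title", "tennis"))  # matches, but "title" itself is not returned
print(index.search("body", "tennis"))   # no match: "body" is not indexed
```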
indexed=true means that this field can be used in a search.
For example, if I define the item field as follows and try to use the field in a search,
<field name="item" type="text_general" uninvertible="true" indexed="false" stored="true"/>
fq=item:"Tennis" will raise an error.
stored=true means that this field can be retrieved in the list of fields returned after a query.
For example, if the item field is defined as follows,
<field name="item" type="text_general" uninvertible="true" indexed="true" stored="false"/>
you will be able to search with fq=item:"Tennis" correctly, but the item field will not be returned in the results.
Regards

Solr schema.xml field confusion

I'm new to Solr, so I really need someone to help me understand the fields below. What is the meaning of a field with stored=false and indexed=false? See the two examples below: what are the differences? If the field is not stored, what is the use of it?
<field name="test1" type="text" indexed="false"
stored="false" required="false" />
How about this one?
<field name="test2" type="text" indexed="false"
stored="false" required="false" multiValued="true" />
Thanks a lot!
You can find the best explanation on the Solr wiki.
If you want a field to be searchable, set the indexed attribute to true.
indexed=true : True if this field should be "indexed". If (and only if) a field is indexed, then it is searchable, sortable, and facetable.
If you want to retrieve the field in the search results, set the stored attribute to true.
stored=true : True if the value of the field should be retrievable during a search.
If you want to store multiple values in a single field, set the multiValued attribute to true.
multiValued=true : True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document.
It's easier than it seems:
indexed: you can search on it
stored: you can show it within your search results
In fact, there might be fields that you don't use for search but just want to show within the results. On the other hand, there might be fields that you want to use for search but don't need to show within the results. Setting stored=false matters when you don't need to show a certain field, since it improves performance. If you make all your fields stored and you have a lot of fields, Solr can become slow at returning the results.
Of course, having both false doesn't make a lot of sense, since the field would become totally useless.
The only difference between your two fields is multiValued=true, which means that the second field can contain multiple values. That is, the content of the field is not just a single text entry but a list of text entries.

Resources