Stored fields in Solr are getting displayed in queries, why? - solr

I am new to Solr. I have created a new core and copied the default schema.xml to the conf/ folder. The only change I have made is trivial:
<field name="id" type="string" indexed="true" stored="false" required="true" multiValued="false" />
As you can see, I set the id field to stored="false". As I understand it, the id field should no longer be displayed when I run a query. But that is not what happens. I have tried restarting the Solr instance and re-running the request that indexes the file:
curl 'http://localhost:8983/solr/TwitterCore/update/json?commit=true'
--data-binary @$(echo TwitterData_Core_Conf/TwitterText_en_demo.json)
-H 'Content-type:application
According to the Solr Wiki, this should have re-indexed my file. However, when I run my query again, I still see the id.
An example of a returned document (this is not the complete JSON node; I copied only some parts):
"text": [
"RT #FollowTrainTV: Moonseternity just joined #FollowTrainTV - Watch them stream on http://t.co/oMcOGA51kT"
],
"lang": [
"en"
],
"id": "0a8edfea-68f7-4b05-b370-27b5aba640b7", // I dont want to see this
"_version_": 1512067627994841000
Maybe someone can give me detailed steps on re-indexing.

When you change schema.xml and restart the Solr server, the changes apply only to documents indexed afterwards. This means you have to clear the index and re-index all documents. (Changes to query-time analysis, such as the query tokenizer, are the exception: they take effect immediately after a restart, but that is not the case here.) After re-indexing, the id field should no longer be visible.
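The clear-and-re-index cycle described above can be sketched in Python with only the standard library. This is a rough sketch under stated assumptions, not a definitive procedure: the host, core name (TwitterCore) and /update/json path are taken from the question, the delete uses a delete-by-query of *:*, and the helper names build_delete_all_request and build_index_request are hypothetical. The requests are only constructed here; actually sending them with urllib.request.urlopen needs a running Solr instance.

```python
import json
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed host, taken from the question
CORE = "TwitterCore"                     # assumed core name, taken from the question


def build_delete_all_request(core: str = CORE) -> urllib.request.Request:
    """Build a request that deletes every document in the core and commits."""
    body = json.dumps({"delete": {"query": "*:*"}}).encode("utf-8")
    return urllib.request.Request(
        f"{SOLR_URL}/{core}/update/json?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_index_request(docs: list, core: str = CORE) -> urllib.request.Request:
    """Build a request that adds the given documents (a JSON array) and commits."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{SOLR_URL}/{core}/update/json?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Construct (but do not send) the two requests in the order the answer describes.
delete_request = build_delete_all_request()
index_request = build_index_request([{"id": "example-1", "text": ["hello"], "lang": ["en"]}])
```

Sending delete_request first and index_request second mirrors the order in the answer: empty the index, then index the documents again under the new schema.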
Another remark: You don't have to test your queries with curl. When you connect to http://localhost:8983/solr with your web-browser you should find an admin interface there. There you can select a core and test your queries.

Refer to the documentation at https://lucene.apache.org/solr/guide/6_6/docvalues.html, which says:
Non-stored docValues fields will be also returned along with other stored fields when all fields are specified to be returned (e.g. “fl=*”) for search queries depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true".
The string field type has docValues="true", which is why the id field appears in the search response.
You can either add the useDocValuesAsStored="false" parameter to the field or you can use a different fieldType, say text_general.
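The first option can be illustrated with a small Python sketch that adds the attribute to a field definition. It is only an illustration: the one-field schema string is a made-up minimal example, and disable_docvalues_as_stored is a hypothetical helper, not part of Solr. In practice you would edit schema.xml by hand and re-index.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal schema containing only the field from the question.
SCHEMA_SNIPPET = """<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="false" required="true" multiValued="false" />
</schema>"""


def disable_docvalues_as_stored(schema_xml: str, field_name: str) -> str:
    """Return the schema with useDocValuesAsStored="false" set on the named field."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") == field_name:
            field.set("useDocValuesAsStored", "false")
    return ET.tostring(root, encoding="unicode")


updated_schema = disable_docvalues_as_stored(SCHEMA_SNIPPET, "id")
print(updated_schema)
```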

Related

Incorrect field reading during ranking

Solr version 5.1.0
Documents contain a docValues field "ts" holding a timestamp that is used during ranking.
<field name="ts" type="long" docValues="true" indexed="true" stored="true" multiValued="false"/>
If I request the document directly in the Solr Admin UI, I see that it contains the correct value:
"ts": 1575624481951
But when I added logging to the ranking method, I saw that the "ts" value for the same document is 0.
LeafReader reader = context.reader();
NumericDocValues timeDV = DocValues.getNumeric(reader, "ts");
long timestamp = timeDV.get(doc);
LOG.info("ts: " + timestamp);
Log:
ts: 0
The problem was that the document was not being deleted from Solr correctly.
It could be reproduced with the following sequence of actions:
First, the document was added to Solr without the "ts" field.
After some actions in the app, the document was added again, this time with the "ts" field.
When Solr tried to rank this document, it did not have this field.
I added more logging and saw that the first version of the document was on one shard and the second version (with the "ts" field) was on another shard.
I am not sure why this happened because, as far as I know, Solr should put the same document on the same shard.
In any case, it was fixed by deleting the document from the index before adding the second version.

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It was written for an older version of Solr, when the managed schema was not available. Before I fully understand what it is doing, I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are there two fields needed? Can I define a field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the JSON returned by some queries). But even if I define the fields as he described, I cannot see them as keys in the JSON search results.
The _text_ field seems to be a concatenation of other fields. Does it contain the full text? It does not seem to be accessible by default, though.
To be brief, using The Elements of Statistical Learning as an example: how do I highlight the relevant text for the query "SVM"? And if I rename the file to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get to the questions, here is a brief overview of how Solr works. At its core Solr uses Lucene, which, simply put, is a matching engine. It builds an inverted index over the documents: for each term it keeps a list of the documents that contain it, which is what makes lookups so fast. Now to your questions:
Solr itself does not convert your PDF to text; the update processor configured in the request handler does that. This can be configured in solrconfig.xml, or you can write your own handler.
Back to why there are two fields. Simply put, the first one (content) is a stored field, which stores the data as it is. The second one (text) is a copyField target, which receives data copied from other fields of each document, as configured in schema.xml.
We do this because we can then choose the indexing strategy. For example, adding a lowercase filter factory to the text field means everything is indexed in lower case, so searching for "Sam" and "sam" returns the same results. We can also remove common words such as "a" and "the", which would otherwise increase the index size unnecessarily. A large index uses a lot of memory when you are dealing with millions of records, so you want to be careful about which fields you index in order to make better use of resources.
The "text" field is a copyField target: data from the fields named in the schema is copied into it. When searching in general, you then do not need to fire a separate query for each field, because everything has been copied into "text". This is also why it is multiValued: it can store an array of values. content is stored but not indexed, and text is indexed but not stored, because what you return to the end user should be what you saved, not the stripped-down data produced by the filters applied to the text field (stop-word removal, case folding, stemming and so on).
This is why you do not see the "text" field in the search results: it is only used internally by Solr.
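To make the inverted index and the effect of the lowercase and stop-word filters concrete, here is a toy sketch in Python. It is not how Lucene is implemented; the functions analyze, build_inverted_index and search are hypothetical names, and the analysis chain is a deliberately simplified stand-in (whitespace split, lowercase, drop "a" and "the").

```python
from collections import defaultdict

STOP_WORDS = {"a", "the"}  # the two stop words mentioned in the answer


def analyze(text: str) -> list:
    """Toy analysis chain: split on whitespace, lowercase, drop stop words."""
    tokens = (token.lower() for token in text.split())
    return [token for token in tokens if token not in STOP_WORDS]


def build_inverted_index(docs: dict) -> dict:
    """Map each analyzed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {1: "Sam reads the paper", 2: "a paper about sam", 3: "The weather"}
index = build_inverted_index(docs)
print(search(index, "Sam") == search(index, "sam"))
```

Because the same analysis runs at index time and at query time, "Sam" and "sam" hit the same entry, and the stop words never take up space in the index. The original text is not recoverable from the index alone, which is why a separate stored field is kept for display.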
For highlighting see this.
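As a rough illustration of a highlighting request, the sketch below builds a select URL with the standard hl, hl.fl and hl.snippets parameters. The core name "pdfs" and the helper build_highlight_url are hypothetical, and it assumes the content field is stored, as in the field definitions quoted in the question. The URL is only constructed, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_highlight_url(base_url: str, core: str, query: str, field: str = "content") -> str:
    """Build a /select URL that asks Solr to return highlighted snippets for one field."""
    params = {
        "q": query,
        "hl": "true",       # turn highlighting on
        "hl.fl": field,     # field to take the snippets from
        "hl.snippets": 3,   # up to three snippets per document
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


highlight_url = build_highlight_url("http://localhost:8983/solr", "pdfs", "SVM")
print(parse_qs(urlsplit(highlight_url).query))
```

When highlighting is on, the snippets come back in a separate highlighting section of the response, keyed by document id, rather than inside the documents themselves.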
For more, there are some great blogs by Yonik and Joel.
Hope this helps. :)

Is there a way to view search document fields that are only indexed but not stored via the solr admin panel using the query tool?

I want to view the indexed but not stored fields of a solr search document in the solr admin query tool, is there any provision for this?
Example Field Configuration:
<field name="product_data" type="string" indexed="true" stored="false" multiValued="false" docValues="true" />
If you're using schema version 1.6, Solr will automagically fetch the values from the stored docValues, even if the field itself is set as stored="false". Include the field name in fl to get the values.
However, if you're looking for the actual tokens indexed for a document / field / value, using the Analysis page is usually the preferred way as it allows you to tweak the value and see the response quickly. The Luke Request Handler / Tool is useful if you want to explore the actual indexed tokens.
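Naming the field explicitly in fl, as suggested above, could look like the following sketch. The core name "products" and the helper build_select_url are hypothetical; only the product_data field name comes from the question. The same q and fl values can be typed into the query tool of the admin panel instead.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url: str, core: str, query: str, fields: list) -> str:
    """Build a /select URL that requests an explicit list of fields via fl."""
    params = {"q": query, "fl": ",".join(fields), "wt": "json"}
    return f"{base_url}/{core}/select?{urlencode(params)}"


select_url = build_select_url(
    "http://localhost:8983/solr", "products", "*:*", ["id", "product_data"]
)
print(parse_qs(urlsplit(select_url).query)["fl"])
```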

SOLR 4.3 Modification of schema.xml not considered by the server

I have the following error: [doc=testIngestID411] unknown field 'dateImport'
At the beginning I did not have the field 'dateImport' in my solr schema. I decided to add it after launching solr a few times.
1. I added this field to schema.xml:
<filed name="dateImport" type="string" indexed="true" stored="true" required="true"/>
after the other pre-existing fields.
2. I removed all my existing documents using:
<delete><query>*:*</query></delete>
3. Stopped SOLR (using ctrl+c or by killing the jar process)
4. Restarted SOLR (using java -jar start.jar)
Then, when I try to insert a document with a field named dateImport, I get:
"unknown field 'dateImport'"
Extra information:
If I modify a field that already existed (i.e. one that was there the first time I launched this SOLR core), the modification is picked up. For instance, if I change a field that was not required to required=true (and restart SOLR), then I cannot add a document without specifying this field.
Also I have noticed, using the web admin interface:
On the left there is a tab called "Schema"; this schema contains all modifications (like the field dateImport). Above this tab there is another tab named "Schema Browser". The field 'dateImport' DOES NOT appear here :( .
What can I do to get this new field working??
Thank you
Change <filed ... to <field ...
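A misspelled element name like this is easy to miss by eye. As a hedged sketch, a few lines of Python can flag element names that are not on an expected list and suggest the closest known name. The KNOWN_TAGS list is an assumption and is not complete, the sample schema is made up around the line from the question, and find_suspect_tags is a hypothetical helper.

```python
import difflib
import xml.etree.ElementTree as ET

# Assumed, incomplete list of element names commonly found in schema.xml.
KNOWN_TAGS = ["schema", "fields", "field", "dynamicField", "copyField",
              "types", "fieldType", "analyzer", "tokenizer", "filter",
              "uniqueKey"]

# Made-up sample schema containing the misspelled element from the question.
SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <filed name="dateImport" type="string" indexed="true" stored="true" required="true"/>
  </fields>
</schema>"""


def find_suspect_tags(schema_xml: str) -> list:
    """Return (tag, suggestion) pairs for element names not in KNOWN_TAGS."""
    suspects = []
    for element in ET.fromstring(schema_xml).iter():
        if element.tag not in KNOWN_TAGS:
            close = difflib.get_close_matches(element.tag, KNOWN_TAGS, n=1)
            suspects.append((element.tag, close[0] if close else None))
    return suspects


suspects = find_suspect_tags(SCHEMA)
print(suspects)
```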

Solr-Sunburnt-Nutch. content field missing in results

I'm using solr-sunburnt with Django. I have used Nutch to crawl and index my site, and I copied the Nutch schema.xml to Solr.
The problem I'm facing is that when I send a query, the results do not have the content field in them.
The results are the same whether I query from sunburnt or from Solr directly (from the browser, :8983/solr/select).
What do I need to do to get the content field in my results?
P.S. I'm a noob when it comes to searching and solr. :)
Thanks for the hint aitchnyu22.
So the reason the content field is not returned in the results is that it was not stored in the first place.
It is not stored because the schema.xml file that is copied from Nutch into Solr has the stored parameter of the content field set to false by default.
Once you change this to true and re-index from scratch, the content field should appear in your results.
So the field should be
<field name="content" type="text" stored="true" indexed="true"/>
Does this have to be set to true for Nutch, Solr or both?
Of course it should be the same in both locations, but which component actually uses this flag?
