Why does solr add new document while updating?


curl 'http://localhost/solr/collection/update?commit=true' \
  -H 'Content-type:application/json' \
  -d '[
  {
    "id":"11111",
    "price":{"set":1000}
  }
]'
If id:11111 exists, the price value is updated. That's fine.
If id:11111 doesn't exist, a new document is created in the Solr index. This behavior is not desirable. I would expect an error with a message like: "the document you tried to update does not exist".
I cannot understand what is wrong.
Solr version: 4.8.0.
Part of schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

The /update request handler actually updates the index for new and existing documents and handles deletion as well.
During indexation:
A document is considered new if it has no identifier or if its id does not match any of the indexed documents. If no id is generated during indexing and the uniqueKey field is required, the document is rejected.
A document whose identifier matches an indexed document is merged with its stored version: all stored fields are loaded from the index and overridden by the field values from the request, and the resulting document replaces the previous one (but in the end it is the same operation).
In other words, an update request, if it is not a delete, always ends up in the same add operation. By the way, the XML format recognized by solr.UpdateRequestHandler contains the elements <add>, <doc> and <field> regardless of the operation (add or replace).
Recent versions of Solr provide more options for updating parts of documents (see atomic updates and in-place updates).
What you describe is the expected behavior. Since the id field is required, Solr will throw an error for a document missing this field. In your situation, the document is indexed in both cases because the id is given in both cases.
With this configuration you would have to ensure that the id field is empty for what you consider a new document, either client side when preparing the request, or server side using an update processor or by changing the request handler implementation. Maybe it would be even simpler to prevent the indexing of any new docs?
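One more option that may be worth trying is Solr's optimistic concurrency: as far as I know, sending "_version_":1 with an update tells Solr that the document must already exist, and the request is rejected with a version conflict (HTTP 409) otherwise. This relies on the update log being enabled and a _version_ field being present in the schema, which is an assumption about your setup. The sketch below only builds the request body for the document from the question; the function names are made up for illustration, and send_update is not called here.

```python
import json
import urllib.request

# URL taken from the question; adjust host, port and collection to your setup.
SOLR_UPDATE_URL = "http://localhost/solr/collection/update?commit=true"


def build_update_only_payload(doc_id, field, value):
    """Build an atomic 'set' update that should fail if the document is missing.

    "_version_": 1 asks Solr to require that the document already exists
    (assumes the update log and the _version_ field are configured).
    """
    return [{
        "id": doc_id,
        field: {"set": value},
        "_version_": 1,
    }]


def send_update(payload, url=SOLR_UPDATE_URL):
    """POST the payload as JSON; urllib raises HTTPError on a 4xx/5xx reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


payload = build_update_only_payload("11111", "price", 1000)
print(json.dumps(payload))
```

If the conflict is reported as expected, the client can catch the HTTP error and treat it as "document does not exist" instead of silently creating a new one.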

That is how the current implementation of atomic updates seems to work. I concur that it might be desirable to get an error. You should raise the issue on the user mailing list and see what the committers think; if they agree that an error should be raised, they'll ask you to open a JIRA issue.
Oh, just noticed the 4.8 version, that is quite old, can you by any chance test the behaviour in current versions?

Related

Reindexing Solr Data with different field type

I am facing an issue while reindexing Solr data.
I have indexed some documents specifying a wrong field type on the managed-schema file.
Now, instead of the wrong field definition, I would like to use:
<field name="documentDate" type="date" indexed="true" stored="true"/>
To do this I have:
deleted all the previous wrong indexed documents;
updated the managed-schema
reloaded the core
After these steps I tried to reindex documents, but this fails; looking at logs:
org.apache.solr.common.SolrException: Exception writing document id 2ecde3eb2b5964b2c44362f752f7b90d to the index; possible analysis error: cannot change DocValues type from NUMERIC to SORTED_SET for field "documentDate".
How is this possible? I have removed all the documents storing the field documentDate.. How can I solve this issue?

Maybe try deleting the data folder in your core.
You can add new fields to your schema without deleting the data folder, but when you modify an existing field then (in my experience) you have to delete the data folder and build a fresh index.

Duplicate SOLR Document Issue While Using Overwrite=True

I am having an issue with temporary duplicate documents in my SOLR collection that are causing my user rankings system to be incorrect.
I am using SOLR version 4.8.1 so it is one of the latest builds. I am using XML to update the SOLR collection as described in the SOLR documentation:
<add overwrite="true" commitWithin="#COMMIT_WITHIN.GLOBAL_VALUE#">
<doc>
<field name="END_USER_ID">#END_USER_ID#</field>
<field name="TARGET_REGION_ID">#TARGET_REGION_ID#</field>
<field name="POPULARITY_RANK">#POPULARITY_RANK#</field>
<field name="VISIBILITY_SCORE">#VISIBILITY_SCORE#</field>
<field name="POPULARITY_VISIBILITY_SCORES_ID">#POPULARITY_VISIBILITY_SCORES_ID#</field>
<cfif #POP_VIS_SCORES_LAST_MODIFIED_DATETIME# NEQ "">
<field name="POPULARITY_VISIBILITY_SCORES_DATE_MODIFIED">#POP_VIS_SCORES_LAST_MODIFIED_DATETIME#</field>
</cfif>
</doc>
</add>
As you can see from the code above, I am using the overwrite parameter (to have newer documents replace previously added documents with the same uniqueKey) in conjunction with the commitWithin parameter (to add the document within a certain time period). The uniqueKey in this case should be END_USER_ID and the time period should be 15 seconds; I have checked to make sure that the uniqueKey is defined in the appropriate schema.xml file and that multiValued is set to false for END_USER_ID.
So on my rankings page, there are several calls to our local SOLR server. For example:
http://localhost:8983/solr/pop_vis_scores/select/?q=TARGET_REGION_ID:50%20AND%20-POPULARITY_RANK:0&version=4.8&start=0&rows=1&indent=off&stats=true&stats.field=POPULARITY_RANK&sort=POPULARITY_RANK%20ASC&fl=[docid],END_USER_ID,POPULARITY_RANK&timeAllowed=8000
From my observations, when the commitWithin is set to 15000 milliseconds, the updated SOLR document is available right away but a duplicate SOLR document exists that reflects the older data. When the commitWithin is set to 500 milliseconds, it seems like the problem does not exist. Having said that, I would theorize the problem is still there but users cannot act quickly enough to see the duplicate documents. When I have thousands of users playing this game, I theorize that this problem may in fact still exist on a larger scale. In addition, it would be nice to set that commitWithin back to 15 seconds when the player base of the game increases.
Has anybody faced a similar issue before, and if so, how would you go about solving it? Does anybody have any recommendations? Thanks in advance!

I assumed that when a SOLR document gets added to the collection within that given 15 second time window that the old document would get deleted at the same time as the new one would be inserted into the collection. It appears that this assumption was incorrect. I was able to exclude the user id from my queries to get more accurate statistical values when it came to rankings. For anybody experiencing a similar situation that I was in, I recommend not assuming that SOLR documents get deleted and updated at the same time.

Solr dynamicField not searched in query without field name

I'm experimenting with the Example database in Solr 4.10 and not understanding how dynamicFields work. The schema defines
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
If I add a new item with a new field name (say "example_s":"goober" in JSON format), a query like
?q=goober
returns no matches, while
?q=example_s:goober
will find the match. What am I missing?

I would like to see the SearchHandler from the solrconfig.xml file that you are using to execute the above query.
In the SearchHandler we generally have a default query field, i.e. the qf parameter (used by the dismax/edismax parsers; the standard lucene parser uses df instead).
Check that your dynamic field example_s is present in that query field list in solrconfig.xml; otherwise you can pass it as a parameter when sending the query to the search handler.
Hope this will help you in resolving your problem.

If you are using the default schema, here's what's happening:
You are probably using default end-point (/select), so you get the definition of search type and parameters from that. Which means, it is default (lucene) search and the field searched is text.
The text field is an aggregate and is populated by copyField instruction from other fields.
Your dynamic field definition for *_s allows you to index the text with any name ending in _s, such as example_s. It's indexed (so you could search against it directly) and stored (so you can see it when you ask for all fields). It will not however search it as a general text. Notice that (differently from ElasticSearch), Solr strings have to be matched fully and completely. If you have some multi-word text in it, there is barely any point searching it. "goober" is one word so it's not a very good example to understand the difference here.
The easiest solution for you is to add another copyField instruction:
<copyField source="*_s" dest="text"/>
Then all your *_s dynamic fields would also be searchable. But notice that the search analyzers will not be the ones from the *_s definition, but the ones from the text field's definition, which is not string but text_general, defined elsewhere in the file.
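To make the string vs. text_general difference concrete, here is a toy simulation in Python. It is not Solr's real analysis chain: text_general is approximated by lowercasing and splitting on whitespace, and the function names are invented for this sketch.

```python
def string_field_matches(stored_value, query):
    """'string' type: the whole value is one term and must match exactly."""
    return stored_value == query


def text_field_matches(stored_value, query):
    """Crude stand-in for text_general: lowercase, then split on whitespace."""
    tokens = stored_value.lower().split()
    return query.lower() in tokens


value = "Goober Peas"
print(string_field_matches(value, "goober"))       # False: not the full value
print(string_field_matches(value, "Goober Peas"))  # True: exact match
print(text_field_matches(value, "goober"))         # True: matches one token
```

This is why copying *_s into the tokenized text field makes individual words findable, while the original string field only matches the complete value.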
As to Solr vs. ElasticSearch, they both err on the different sides of magic. Solr makes you configure the system and makes it very easy to see the exact current configuration. ElasticSearch hides all of the configuration, but you have to rediscover it the second you want to change away from the default behaviour. In the end, the result is probably similar and meets somewhere in the middle.

Extending Solr Tutorial with custom fields/core

After standing up a basic jetty Solr example. I've tried to make my own core to represent the data my company will be seeing. I made a directory structure with conf and data directories and copied core.properties, schema.xml, and solrconfig.xml from the collection1 example.
I've edited core.properties to change the core name, and I've added 31 fields (most of type text_general, indexed, stored, not required or multivalued) to the schema.
I'm pretty sure I've set it up correctly as I can see my core in the admin page drop down and interact with it. The problem is, when I feed a document designed for the new fields, I cannot get a successful query for any of the values. I believe the data is fed as I got the same command line response:
"POSTing file incidents.xml...
1 file indexed. ....
COMMITting..."
I thought the indexing process might just need more time, but when I copy a field node out of an example doc (e.g. <field name="name">Apple 60 GB iPod with Video Playback Black</field> from ipod_video.xml) into a copy of my file (incidents2.xml), searches on any of those strings instantly succeed.
The best example of my issue is both files have the field:
<field name="Brand" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="Brand">APPLE</field>
However, only the second document (with the aforementioned name field) is returned with a query for apple.
Thanks for reading this far; my questions are:
1) Is there a way to dump the analysis/tokenization phase of document ingestion? Either I don't understand it or the Analysis tab isn't designed for this. The debugQuery=true parameter gives relevance score data but no explanation of why a document was excluded.
2) Once I solve my overall issue, we would like to have large text fields included in the index. Can I wrap long-form text in CDATA blocks in Solr?
Thanks again.

To debug any query issues in Solr, there's a few useful things to check. You might also want to add the output of your analysis page and the field you're having issues with from your schema.xml to your question. It's also a good idea to have a smaller core to work with (use three or four fields just to get started and get it to work) when trying to debug any indexing issues.
Are the documents actually in the index? - Perform a search for *:* (q=*:*) to make sure that there are documents present in the index. *:* is a shortcut that means "give me all documents regardless of value". If no documents are returned, there is no content in the index, and any attempt to search it will give zero results.
Check the logs - Make sure that SolrLogging is set up, so you get any errors thrown in your log. That way you can see if there's anything in particular going wrong when the query or indexing is taking place, something which would result in the query never being performed or any documents being added to the index.
Use the Analysis page - If you have documents in the index, but they're not returned for the queries you're making, select the field you're querying at the analysis page and add both the value given when indexing (in the index column) and the value used when querying (in the query field). The page will then generate all the steps taken both when indexing and querying, and show you the token stream at each step. If the tokens match, they will be highlighted with a different background color, and depending on your setting, you might require all tokens present on the query side to be present on the indexing side (i.e. every token AND-ed together). Start with searching for a single token on the query side for that reason.
If you still don't have any hits, but do have the documents in the index, be more specific. :-)
And yes, you can use CDATA.
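For the first check above (are there any documents at all?), a small script can be handier than typing the URL by hand. This is a rough sketch: the base URL with port 8983 and the core name "incidents" are assumptions, so substitute your own, and count_documents needs a running Solr, which is why it is not called here.

```python
import json
import urllib.parse
import urllib.request


def match_all_url(base_url, core):
    """Build a /select URL for the match-all query *:* that returns no rows."""
    params = urllib.parse.urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def count_documents(base_url, core):
    """Return numFound for *:* (requires a reachable Solr instance)."""
    with urllib.request.urlopen(match_all_url(base_url, core)) as response:
        return json.load(response)["response"]["numFound"]


# Hypothetical base URL and core name, for illustration only.
url = match_all_url("http://localhost:8983/solr", "incidents")
print(url)
```

If count_documents returns 0, the problem is on the indexing side, and there is no point in debugging the query yet.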

Know indexing time for a document in Solr

Is it possible to know the indexing time of a document in solr. Like there is a implicit field for "score" which automatically gets added to a document, is there a field that stores value of indexing time?
I need it to know the date when a document got indexed.
Thanks

Solr does not automatically add a create date to documents. You could certainly index one with the document though, using Solr's DateField. In earlier versions of Solr (< 4.2), there was a commented-out timestamp field in the example schema.xml, which looked like:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Also, I think it bears noting that there is no implicit "score" field. Scores are calculated at query time, rather than being tied to the document. Different queries will generate different scores for the same document. There are norms stored with the document that are factored into scores, but they aren't really fields.

femtoRgon gave you a correct solution, but you must be careful with partial document updates.
If you do not do partial document updates, you can stop reading now ;-)
If you partially update your document, Solr will merge the existing stored values with your partial document, and the timestamp will not be updated. The solution is to not store the timestamp; then Solr will not be able to merge this value. The drawback is that you cannot retrieve the timestamp with your search results.
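Another way around this, if you want to keep the timestamp stored, might be to set it explicitly in every partial update. The sketch below is only an illustration of that idea: it assumes the timestamp field from the schema snippet above, and the id and price values are made-up examples.

```python
from datetime import datetime, timezone


def solr_timestamp(moment=None):
    """Format a datetime in the UTC ISO-8601 form Solr date fields expect."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def atomic_update_with_timestamp(doc_id, field, value, moment=None):
    """Atomic 'set' update that also refreshes the 'timestamp' field."""
    return [{
        "id": doc_id,
        field: {"set": value},
        "timestamp": {"set": solr_timestamp(moment)},
    }]


# Fixed moment so the output is reproducible; omit it to use the current time.
fixed = datetime(2015, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(atomic_update_with_timestamp("11111", "price", 1000, fixed))
```

The resulting list can be sent as the JSON body of an /update request; the trade-off is that every client doing partial updates has to remember to include the timestamp.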
