Duplicate SOLR Document Issue While Using Overwrite=True

I am having an issue with temporary duplicate documents in my SOLR collection that are causing my user rankings system to be incorrect.
I am using SOLR version 4.8.1, so it is one of the latest builds. I am using XML to update the SOLR collection, as described in the SOLR documentation:
<add overwrite="true" commitWithin="#COMMIT_WITHIN.GLOBAL_VALUE#">
  <doc>
    <field name="END_USER_ID">#END_USER_ID#</field>
    <field name="TARGET_REGION_ID">#TARGET_REGION_ID#</field>
    <field name="POPULARITY_RANK">#POPULARITY_RANK#</field>
    <field name="VISIBILITY_SCORE">#VISIBILITY_SCORE#</field>
    <field name="POPULARITY_VISIBILITY_SCORES_ID">#POPULARITY_VISIBILITY_SCORES_ID#</field>
    <cfif #POP_VIS_SCORES_LAST_MODIFIED_DATETIME# NEQ "">
      <field name="POPULARITY_VISIBILITY_SCORES_DATE_MODIFIED">#POP_VIS_SCORES_LAST_MODIFIED_DATETIME#</field>
    </cfif>
  </doc>
</add>
As you can see from the code above, I am using the overwrite parameter (to have newer documents replace previously added documents with the same uniqueKey) in conjunction with the commitWithin parameter (to add the document within a certain time period). The uniqueKey in this case should be END_USER_ID and the time period should be 15 seconds; I have checked to make sure that the uniqueKey is defined in the appropriate schema.xml file and that multiValued is set to false for END_USER_ID.
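For illustration, the same update expressed with SolrJ might look roughly like this (a sketch only, assuming SolrJ 4.x; the field values are placeholders and the core name is taken from the query URL further down):
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpsertScore {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/pop_vis_scores");
        SolrInputDocument doc = new SolrInputDocument();
        // END_USER_ID is the uniqueKey, so re-adding the same id replaces the previous
        // document once the pending commit happens.
        doc.addField("END_USER_ID", 12345);
        doc.addField("TARGET_REGION_ID", 50);
        doc.addField("POPULARITY_RANK", 7);
        doc.addField("VISIBILITY_SCORE", 0.82);
        doc.addField("POPULARITY_VISIBILITY_SCORES_ID", 998877);
        // The second argument is commitWithin in milliseconds, mirroring the XML attribute.
        server.add(doc, 15000);
    }
}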
So on my rankings page, there are several calls to our local SOLR server. For example:
http://localhost:8983/solr/pop_vis_scores/select/?q=TARGET_REGION_ID:50%20AND%20-POPULARITY_RANK:0&version=4.8&start=0&rows=1&indent=off&stats=true&stats.field=POPULARITY_RANK&sort=POPULARITY_RANK%20ASC&fl=[docid],END_USER_ID,POPULARITY_RANK&timeAllowed=8000
From my observations, when commitWithin is set to 15000 milliseconds, the updated SOLR document is available right away, but a duplicate SOLR document that reflects the older data also exists. When commitWithin is set to 500 milliseconds, the problem does not appear to occur. Having said that, I suspect the problem is still there and users simply cannot act quickly enough to see the duplicate documents. With thousands of users playing the game, the problem may well show up on a larger scale. In addition, it would be nice to set commitWithin back to 15 seconds as the player base of the game grows.
Has anybody faced a similar issue before, and if so, how did you go about solving it? Does anybody have any recommendations? Thanks in advance!

I assumed that when a SOLR document is added to the collection within that 15 second window, the old document would be deleted at the same moment the new one is inserted into the collection. That assumption turned out to be incorrect. I was able to exclude the user's id from my queries to get more accurate statistical values for the rankings. For anybody in a similar situation, I recommend not assuming that SOLR documents are deleted and updated at the same time.
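As a sketch of that workaround (assuming SolrJ 4.x and a placeholder user id), the ranking query can simply exclude the current player's END_USER_ID so that a lingering duplicate cannot skew the statistics:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RankingQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/pop_vis_scores");
        long currentUserId = 12345L; // hypothetical: the player whose stale duplicate should be ignored
        SolrQuery query = new SolrQuery("TARGET_REGION_ID:50 AND -POPULARITY_RANK:0 AND -END_USER_ID:" + currentUserId);
        query.setRows(1);
        query.setSort("POPULARITY_RANK", SolrQuery.ORDER.asc);
        query.set("stats", true);                     // stats=true
        query.set("stats.field", "POPULARITY_RANK");  // stats.field=POPULARITY_RANK
        QueryResponse response = server.query(query);
        System.out.println(response.getResults().getNumFound());
        System.out.println(response.getFieldStatsInfo().get("POPULARITY_RANK"));
    }
}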

Related

Why does solr add new document while updating?

curl 'http://localhost/solr/collection/update?commit=true' \
  -H 'Content-type:application/json' \
  -d '[
    {
      "id": "11111",
      "price": {"set": 1000}
    }
  ]'
If id:11111 exists, the price value is updated. That is fine.
If id:11111 doesn't exist, a new document is created in the Solr index. This behavior is not desirable; I expect an error with some text like: "the document you tried to update does not exist".
I cannot understand what is wrong.
Solr version: 4.8.0.
Part of schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>
The /update request handler actually updates the index for new and existing documents and handles deletion as well.
During indexation:
A document is considered new if it has no identifier or if its id does not match any of the indexed documents. If no id is generated during indexing and if the uniqueKey field is required, the document is rejected.
A document that has an identifier matching an indexed document is merged with its stored version: all stored fields are loaded from the index and overridden by the field values from the request, and the resulting document replaces the previous one (but in the end it is the same operation).
In other words, an update request - if not a delete - always ends up in the same add operation. By the way, the XML schema recognized by solr.UpdateRequestHandler contains the elements <add>, <doc> and <field> regardless of the operation (add or replace).
Recent versions of Solr provide more options for updating parts of documents (see atomic updates and in-place updates).
What you describe is the expected behavior. Since the id field is required, Solr will throw an error for a document missing this field. In your situation, the document is indexed in both cases because the id is given in both cases.
With this configuration you would have to ensure that the id field is empty for what you consider a new document, either client side when preparing the request, or server side using an update processor or a modified request handler implementation. Maybe it would be even simpler to prevent the indexing of any new docs?
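One more avenue that may be worth testing (not mentioned in the answers above, so treat it as a suggestion to verify): Solr 4.x supports optimistic concurrency, and sending _version_=1 with an update tells Solr that a document with that id must already exist; if it does not, the request is rejected with a version-conflict error instead of creating a new document. A minimal SolrJ sketch using the values from the question:
import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class ConditionalUpdate {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost/solr/collection");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "11111");
        // Same atomic "set" operation as the curl example above.
        doc.addField("price", Collections.singletonMap("set", 1000));
        // _version_ = 1 means "apply only if a document with this id already exists".
        doc.addField("_version_", 1L);
        try {
            server.add(doc);
            server.commit();
        } catch (SolrException e) {
            // Raised as a version conflict when id:11111 is not in the index.
            System.err.println("Update rejected: " + e.getMessage());
        }
    }
}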
That is how the current implementation of atomic updates seems to work. I concur it might be desirable to get an error... You should raise the issue on the user mailing list and see what the committers think; maybe they agree with you that an error should be raised, and they'll ask you to open a JIRA then.
Oh, I just noticed the 4.8 version; that is quite old. Can you by any chance test the behaviour in a current version?

Extending Solr Tutorial with custom fields/core

After standing up a basic Jetty Solr example, I tried to make my own core to represent the data my company will be seeing. I made a directory structure with conf and data directories and copied core.properties, schema.xml, and solrconfig.xml from the collection1 example.
I've edited core.properties to change the core name, and I've added 31 fields (most of type text_general, indexed, stored, not required or multiValued) to the schema.
I'm pretty sure I've set it up correctly as I can see my core in the admin page drop down and interact with it. The problem is, when I feed a document designed for the new fields, I cannot get a successful query for any of the values. I believe the data is fed as I got the same command line response:
"POSTing file incidents.xml...
1 file indexed. ....
COMMITting..."
I thought the indexing process just took more time, but when I copy a field node out of an example doc (e.g. <field name="name">Apple 60 GB iPod with Video Playback Black</field> from ipod_video.xml) into a copy of my file (incidents2.xml), searches on any of those strings instantly succeed.
The best example of my issue is that both files have the field (schema definition followed by the value in the documents):
<field name="Brand" type="text_general" indexed="true" stored="true" required="false" multiValued="false"/>
<field name="Brand">APPLE</field>
However, only the second document (with the aforementioned name field) is returned with a query for apple.
Thanks for reading this far; my questions are:
1) Is there a way to dump the analysis/tokenization phase of document ingestion? Either I don't understand it or the Analysis tab isn't designed for this. The debugQuery=true parameter gives relevance score data but no explanation of why a document was excluded.
2) Once I solve my overall issue, we would like to have large text fields included in the index. Can I wrap long-form text in CDATA blocks in Solr?
Thanks again.
To debug any query issues in Solr, there are a few useful things to check. You might also want to add the output of the Analysis page and the definition of the field you're having issues with from your schema.xml to your question. It's also a good idea to have a smaller core to work with (use three or four fields just to get started and get it to work) when trying to debug any indexing issues.
Are the documents actually in the index? - Perform a search for *:* (q=*:*) to make sure that there are documents present in the index. *:* is a shortcut that means "give me all documents regardless of value". If no documents are returned, there is no content in the index, and any attempt to search it will give zero results.
Check the logs - Make sure that SolrLogging is set up, so you see any errors thrown in your log. That way you can tell whether anything in particular goes wrong when the query or indexing takes place, something which would result in the query never being performed or in no documents ever being added to the index.
Use the Analysis page - If you have documents in the index, but they're not returned for the queries you're making, select the field you're querying on the Analysis page and add both the value given when indexing (in the index column) and the value used when querying (in the query field). The page will then show all the steps taken both when indexing and querying, and the token stream at each step. If the tokens match, they will be highlighted with a different background color; depending on your settings, you might require all tokens present on the query side to be present on the indexing side (i.e. every token AND-ed together). For that reason, start by searching for a single token on the query side.
If you still don't have any hits, but do have documents in the index, be more specific. :-)
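A minimal SolrJ sketch of the first and last checks (assuming SolrJ 4.x; the core name "incidents" and the single-token query are guesses based on the question):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DebugQueries {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/incidents");
        // Are there any documents at all?
        long total = server.query(new SolrQuery("*:*")).getResults().getNumFound();
        System.out.println("Documents in index: " + total);
        // Query the problem field directly with a single token, so the result is easy
        // to compare against the Analysis page output for that field.
        long hits = server.query(new SolrQuery("Brand:apple")).getResults().getNumFound();
        System.out.println("Brand:apple hits: " + hits);
    }
}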
And yes, you can use CDATA.

How to query a specific document by id

From a previous query I already have the document ID (the uniqueKey in this schema is 'track_id') of the document I'm interested in.
I would now like to search for a sequence of words within that document, while highlighting the matches.
I can't seem to combine the search parameters successfully (all my Google searches return purple links :\ ), although I've tried many combinations these past few days. I also know the field where the matches will be, if that's any use in terms of improving match speed.
I'm guessing it should be something like this:
/select?q=track_id:{key_i_already_have} AND/&/{part_I_dont_know} word1 word2 word3
Currently, since I can't combine these two search parameters, I'm only querying the words and thus getting several results from several documents.
Thanks in advance.
From Solr 4 you can use realtime get, which is much faster than searching the index by id.
http://localhost:8983/solr/get?ids=id1,id2,id3
For index updates to be visible (searchable), some kind of commit must reopen a searcher to a new point-in-time view of the index. The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher. This is primarily useful when using Solr as a NoSQL data store and not just a search index.
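A rough SolrJ equivalent of that realtime get request (a sketch, assuming SolrJ 4.x; the core name is a placeholder):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RealtimeGet {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/tracks");
        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/get");      // the realtime get handler
        query.set("ids", "id1,id2,id3");      // comma-separated uniqueKey values
        QueryResponse response = server.query(query);
        System.out.println(response.getResults());
    }
}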
You may try applying a filter query on the id. It will restrict your search to that document, then search within it for all the keywords and highlight them.
Your query will look like:
/select?fq=track_id:DOC_ID&q=word1 word2 word3
Just make sure your "id" field in schema.xml is defined as type string so that you can apply filter queries on it.
<field name="id" type="string" indexed="true" stored="true" required="true" />

Know indexing time for a document in Solr

Is it possible to know the indexing time of a document in Solr? Just as there is an implicit "score" field which automatically gets added to a document, is there a field that stores the indexing time?
I need it to know the date when a document was indexed.
Thanks
Solr does not automatically add a create date to documents. You could certainly index one with the document though, using Solr's DateField. In earlier versions of Solr (< 4.2), there was a commented-out timestamp field in the example schema.xml, which looked like:
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
Also, I think it bears noting that there is no implicit "score" field. Scores are calculated at query time, rather than being tied to the document. Different queries will generate different scores for the same document. There are norms stored with the document that are factored into scores, but they aren't really fields.
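A minimal SolrJ sketch of the approach above (assuming SolrJ 4.x, a placeholder core name, and that the timestamp field shown above has been enabled in schema.xml):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class IndexTimestamp {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        server.add(doc);
        server.commit();
        // The timestamp is filled in by Solr at index time (default="NOW") and stored with the document.
        SolrDocument stored = server.query(new SolrQuery("id:doc1")).getResults().get(0);
        System.out.println(stored.getFieldValue("timestamp"));
    }
}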
femtoRgon gives you a correct solution, but you must be careful with partial document updates.
If you do not do partial document updates, you can stop reading now ;-)
If you partially update your document, Solr will merge the existing stored values with your partial document, and the timestamp will not be updated. The solution is to not store the timestamp; then Solr will not be able to merge this value. The drawback is that you cannot retrieve the timestamp with your search results.

How to update Solr documents on the Solr server side with custom handler / plugin

I have a core with millions of records.
I want to add a custom handler which scans the existing documents and updates one of the fields based on a condition (age > 12, for example).
I prefer doing it on the Solr server side to avoid sending millions of documents to the client and back.
I was thinking of writing a Solr plugin which will receive a query and update some fields on the matching documents (like the delete-by-query handler).
I was wondering whether there are existing solutions or better alternatives.
I searched the web for a while and couldn't find examples of Solr plugins which update documents (I don't need to extend the update handler).
I've written a plug-in which uses the following code; it works fine but isn't as fast as I need.
Currently I do:
// docList, updateRequestProcessor, solrQueryRequest and new_value come from the surrounding handler code.
AddUpdateCommand addUpdateCommand = new AddUpdateCommand(solrQueryRequest);
DocIterator iterator = docList.iterator();
SolrIndexSearcher indexReader = solrQueryRequest.getSearcher();
while (iterator.hasNext()) {
    // Load the stored document for the current hit.
    Document document = indexReader.doc(iterator.nextDoc());
    // Re-add a document carrying only the id and the new field value through the update chain.
    SolrInputDocument solrInputDocument = new SolrInputDocument();
    addUpdateCommand.clear();
    addUpdateCommand.solrDoc = solrInputDocument;
    addUpdateCommand.solrDoc.setField("id", document.get("id"));
    addUpdateCommand.solrDoc.setField("my_updated_field", new_value);
    updateRequestProcessor.processAdd(addUpdateCommand);
}
But this is very expensive, since the update handler will fetch the document again even though I already hold it at hand.
Is there a safe way to update the Lucene document and write it back while taking into account all the Solr-related code such as caches, extra Solr logic, etc.?
I was thinking of converting it to a SolrInputDocument and then just adding the document through Solr, but I would first need to convert all the fields.
Thanks in advance,
Avner
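As a rough sketch of the conversion mentioned in the question (an illustration only: it can see stored fields only, and it handles just string and numeric values; binary fields and anything not stored would need extra care):
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;
import org.apache.solr.common.SolrInputDocument;

public class DocConverter {
    // Convert a stored Lucene document into a SolrInputDocument.
    // Fields that are indexed but not stored cannot be recovered this way.
    public static SolrInputDocument toSolrInputDocument(Document document) {
        SolrInputDocument solrInputDocument = new SolrInputDocument();
        for (IndexableField field : document.getFields()) {
            Object value = field.numericValue() != null ? field.numericValue() : field.stringValue();
            solrInputDocument.addField(field.name(), value);
        }
        return solrInputDocument;
    }
}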
I'm not sure whether the following is going to improve the performance, but thought it might help you.
Look at SolrEntityProcessor
Its description sounds very relevant to what you are searching for.
This EntityProcessor imports data from different Solr instances and cores.
The data is retrieved based on a specified (filter) query.
This EntityProcessor is useful in cases you want to copy your Solr index
and slightly want to modify the data in the target index.
In some cases Solr might be the only place where all data is available.
However, I couldn't find an out-of-the-box feature to embed your logic. So, you may have to extend the following class.
SolrEntityProcessor and the link to sourcecode
You probably know these already, but a couple of other points:
1) Make the entire process exploit all the available CPU cores. Make it multi-threaded.
2) Use the latest version of Solr.
3) Experiment with two Solr apps on different machines with minimal network delay. This would be a tough call: same machine with two processes vs. two machines with more cores but network overhead.
4) Tweak the Solr cache in a way that applies to your use-case and particular implementation.
5) A couple more resources: Solr Performance Problems and SolrPerformanceFactors
Hope it helps. Let me know the stats either way; I'm curious, and your info might help somebody later.
To point out where to put custom logic, I would suggest having a look at the SolrEntityProcessor in conjunction with Solr's ScriptTransformer.
The ScriptTransformer lets you process each entity after it is extracted from the source of a data import, manipulate it, and add custom field values before the new entity is written to Solr.
A sample data-config.xml could look like this:
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <script>
    <![CDATA[
      function calculateValue(row) {
          row.put("CALCULATED_FIELD", "The age is: " + row.get("age"));
          return row;
      }
    ]]>
  </script>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://localhost:8080/solr/your-core-name"
            query="*:*"
            wt="javabin"
            transformer="script:calculateValue">
      <field column="ID" name="id" />
      <field column="AGE" name="age" />
      <field column="CALCULATED_FIELD" name="update_field" />
    </entity>
  </document>
</dataConfig>
As you can see, you may perform any data transformation you like, as long as it is expressible in JavaScript. So this would be a good place to express your logic and transformations.
You say one constraint may be age > 12. I would handle this via the query attribute of the SolrEntityProcessor. You could write query="age:{12 TO *]" so that only records with an age greater than 12 would be read for the update.
