How to re-index documents in Solr without knowing last modified time?

How to handle the following scenario in Solr's DataImportHandler? We do a full import of all our documents once daily (the full indexing takes about 1 hour to run). All our documents are in two classes, say A and B. Only 3% of the documents belong to class A and these documents get modified often. We re-index documents in class A every 10 mins via deltaQuery by using the modified time. All fine till here.
Now, we also want to re-index ALL documents in class A once every hour (because we have a view_count column in a different table and the document modified time does not change when we update the view_count). How to do this?
Update (short-term solution): For now we decided to not use the modified time in the delta at all and simply re-index all documents in class A every 10 mins. It takes only 3 mins to index class A docs so we are OK for now. Any solution will be of help though.

Rather than using separate query and deltaQuery parameters in your DIH DB config, I chose to follow the suggestion found here, which allows you to use the same query logic for both full and partial updates by passing different parameters to Solr to perform either a full import or a delta import.
In both cases you would pass ?command=full-import, but for a full import you would add &clean=true as a URL parameter, and for a delta you would pass &clean=false. The clean parameter affects the number of records returned by the query as well as telling Solr whether or not to flush the index and start over.
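As a rough illustration, a minimal SolrJ sketch of how the two modes could be triggered on a schedule; the core URL, handler path, and use of a recent SolrJ client are assumptions, and plain curl from cron against the same URLs works just as well. The shared SQL in data-config.xml can branch on ${dataimporter.request.clean} to decide whether to return all rows or only recently modified ones.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class DihTrigger {
        // Fire a DIH full-import; clean=true wipes the index first (nightly rebuild),
        // clean=false keeps existing docs (frequent partial run).
        static void runImport(HttpSolrClient solr, boolean clean) throws Exception {
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/dataimport");
            q.set("command", "full-import");
            q.set("clean", Boolean.toString(clean));
            q.set("commit", "true");
            solr.query(q);
        }

        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
            runImport(solr, true);   // full rebuild
            runImport(solr, false);  // delta-style run using the same query
            solr.close();
        }
    }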

I found one can use ExternalFileField to store the view count and use a function query to sort the results based on that field. (I asked another question about this on SO: ExternalFileField in Solr 3.6.) However, I found that these fields cannot be returned in the Solr result set, which meant I needed to do a DB call to get the values for the fields. I don't want to do that.
Found an alternate solution: When trying to understand Mike Klostermeyer's answer, I found that command=full-import can also take an additional query param: entity. So now I set up two top-level entities inside the <document> tag in data-config.xml - the first one will only index docs in class A and the second one will only index docs in class B. For class A documents we do a delta import based on last modified time every 5 mins and a full import every hour (to get the view_count updated). For class B documents, we only do one full import every day and no delta imports.
This essentially gives three different execution plans to run at different time intervals.
There is also one caveat though: need to pass query param clean=false every time I run the import for an entity; otherwise the docs in the other entity get deleted after indexing completes.
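To make the three execution plans concrete, here is a hedged SolrJ sketch of the commands each schedule could issue. The entity names classA and classB, the core URL, and the helper class itself are illustrative assumptions; the same requests can be made with curl from cron.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class EntityImports {
        private final HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Every 5 minutes: delta import of recently modified class A docs.
        public void deltaClassA() throws Exception { run("delta-import", "classA"); }

        // Every hour: full re-import of class A to pick up view_count changes.
        public void fullClassA() throws Exception { run("full-import", "classA"); }

        // Once a day: full re-import of class B.
        public void fullClassB() throws Exception { run("full-import", "classB"); }

        private void run(String command, String entity) throws Exception {
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/dataimport");
            q.set("command", command);
            q.set("entity", entity);  // "classA"/"classB" stand in for the two <entity> names
            q.set("clean", "false");  // without this, the other entity's docs are deleted
            q.set("commit", "true");
            solr.query(q);
        }
    }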
One thing I don't like about the approach is the copy-pasting of all the queries and sub-entities from one top entity to the other. The only difference between the queries in the top-level entities is whether the doc is in class A or class B.

Related

Are documents removed from a couchbase view if the data changes?

My understanding is that Couchbase views are built incrementally, but I can't seem to find an answer to whether a document can exist in a view multiple times. For example, say I want to create a view based on an updatedAt timestamp, that is changed every time I update this document type.
If the view is built incrementally, that seems to imply that if document id "1234" is updated several times and that updatedAt timestamp changed each time, I'd end up with several entries in the view for the same document, when what I want is just one entry, for the latest value.
It does seem like Couchbase is limiting it to a single copy of any given document id within the view, but I can't find firm confirmation of that anywhere. I want to make sure I'm not designing something for a production system around a behavior that might not work the way it seems to on a small scale.
Yes. When a view index is refreshed, any documents modified since the last refresh have their associated rows removed from the view, and the map function is invoked again to emit the new row(s).
A single document can generate multiple view rows, but only if the view's map function calls emit multiple times.

Solr near real time search: impact of reindexing frequently the same documents

We want to use SolR in a Near Real Time scenario. Say for example we want to filter / rank our results by number of views.
SolR SoftCommit was made for this use case but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented as a full delete and re-addition of the document in Lucene.
It seems to me that having the same docs in the tlog many times over is inefficient, and it might also be problematic during the merge process (is the doc marked as deleted and added n times?).
Any advice / good practice?
Two things you could use to support this scenario:
In-place updates: only that field is updated, not the whole doc. Check the conditions you need to meet in order to use them (a rough sketch follows below).
ExternalFileField type: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
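A minimal sketch of the first option, assuming nb_view is defined as a single-valued, non-indexed, non-stored docValues numeric field so that the atomic "inc" below qualifies as an in-place update; the core URL, field name, and document id are assumptions.

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ViewCountUpdate {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");  // unique key of the document to touch
            // Atomic "inc" on the counter; with a docValues-only field Solr can
            // apply this without rewriting the whole document.
            doc.addField("nb_view", Collections.singletonMap("inc", 1));

            solr.add(doc);
            solr.commit();  // or rely on soft autoCommit for near-real-time visibility
            solr.close();
        }
    }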

Updating documents with SOLR

I have a community website with 20,000 members who use a search form every day to find other members. The results are sorted by the most recently connected members. I'd like to use Solr for my search (right now it's MySQL), but I'd first like to know whether it's good practice to update the document of every member who logs in, in order to change their login date and time. There would be around 20,000 document updates a day; I don't really know if that is too much updating and could hurt performance. Thank you for your help.
20k updates/day is not unreasonable at all for Solr.
OTOH, for very frequently updated fields (imagine one user logging in multiple times a day, so you might want to update it each of those times), you can use External Fields to keep that field stored outside the index (in a text file) and still use it for sorting in Solr.
Generally speaking, Solr is not meant to be used for this purpose; using your database is still better.
However, if you want to use Solr, you will deal with it much as you would with the database, i.e. every user document should have a unique field, id for example. When a user logs in, you can issue an update for that user document's last_login_date field by its id. You can learn more about Partial Update from this link.
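For illustration only, a minimal SolrJ sketch of such a partial (atomic) update of last_login_date. The core name and document id are assumptions, as is the requirement that the other fields be stored (or docValues) so they survive the atomic update.

    import java.util.Collections;
    import java.util.Date;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class LoginTouch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/members").build();

            // Partial (atomic) update: only last_login_date is replaced; the other
            // fields are preserved as long as they are stored.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "member-123");
            doc.addField("last_login_date", Collections.singletonMap("set", new Date()));

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }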

List all document keys in a Solr index for the purpose of database synchronisation

I need to synchronize a Solr index with a database table. At any given time, the Solr index may need to have documents added or removed. The nature of the database prevents the Data Import Handler's Delta Import functionality from being able to detect changes.
My proposed solution was to retrieve a list of all primary keys of the database table and all unique keys of the Solr index (which contain the same integer value) and compare these lists. I would use SolrJ for this.
However, to get all Solr documents requires the infamous approach of hard-coding the maximum integer value as the result count limit. Using this approach seems to be frowned upon. Does my situation have cause to ignore this advice, or is there another approach?
You can execute two queries to list all keys from Solr in one batch: first with rows=0, which gives you the number of hits, then a second query with that number as the rows parameter. It's not a very optimal solution, but it works.
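A hedged SolrJ sketch of that two-query approach; the core URL and the unique key field name id are assumptions, and requesting only the key field keeps the response size down.

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class KeyLister {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

            // First query: rows=0, just to learn how many documents exist.
            SolrQuery count = new SolrQuery("*:*");
            count.setRows(0);
            long numFound = solr.query(count).getResults().getNumFound();

            // Second query: fetch only the unique key field for every document.
            SolrQuery all = new SolrQuery("*:*");
            all.setFields("id");
            all.setRows((int) numFound);

            Set<String> solrKeys = new HashSet<>();
            for (SolrDocument d : solr.query(all).getResults()) {
                solrKeys.add(String.valueOf(d.getFieldValue("id")));
            }
            // solrKeys can now be diffed against the primary keys read from the database.
            solr.close();
        }
    }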
A second possibility is to store an update date in the Solr index, and fetch only the documents changed since the last synchronisation.

Best way to create a Lucene Index with fields that can be updated frequently, and filtering the results by this field

I use Lucene to index my documents and search. Actually I have 800k documents indexed in Lucene. Those documents have some fields:
Id: a numeric field used to identify the documents
Name: a textual field, stored and analyzed
Description: like Name
Availability: a numeric field used to filter results. This field can be updated frequently, every day.
My question is: what is the best way to create a filter for availability?
1 - Add this information to the index and use a Lucene filter.
With this approach I have to update the document (remove and add, because Lucene 3.0.2 cannot update fields in place) every time the availability changes. What is the cost of reindexing?
2 - Don't add this information to the index, and filter the results with a DB select.
This approach means a lot of selects, because I would need to query the database for every id to check its availability.
3 - Create a separate index with id and availability.
I don't know if this is a good solution, but I could keep the static information in one index and the frequently updated information in another. I think that is better than updating the whole document just because a few fields changed.
I would stay away from option 2: if you can handle the search entirely in Lucene, instead of searching in Lucene plus the DB, do it. I deal with this case in my project (Lucene search + DB search), but only because there is no way around it.
The cost of an update is internally:
delete the doc
insert new doc (with new field).
I would just try approach 1 (as it is the simplest); if the performance is good enough, stick with it, and if not, look for ways to optimize it or try 3.
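For reference, a minimal sketch of approach 1. It uses a current Lucene API rather than 3.0.2 (whose constructors and field classes differ), but IndexWriter.updateDocument has worked the same way since that era: it deletes any document matching the key term and adds the replacement, so all fields must be supplied again. The index path, field names, and values are assumptions.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class AvailabilityUpdate {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/my-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                // The replacement document must carry all fields again (Name,
                // Description, ...), since the old document is deleted wholesale.
                Document doc = new Document();
                doc.add(new StringField("id", "42", Field.Store.YES));
                doc.add(new IntPoint("availability", 3));   // filterable numeric field
                doc.add(new StoredField("availability", 3));

                // Internally: delete any doc whose "id" term matches, then add the new one.
                writer.updateDocument(new Term("id", "42"), doc);
                writer.commit();
            }
        }
    }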
Answer provided from lucene-groupmail:
How often is "frequently"? How many updates do you expect to do in a day? And how quickly must those updates be reflected in the search results?
800K documents isn't all that many. I'd go with the simple approach first and monitor the results, *then* go to a more complex solution if you see a problem arising. Just update (delete/add) the documents when the value changes.
Well, the cost to reindex is just about what the cost to index it originally is. The old version of the document is marked deleted and the new one is added. It's essentially the same cost as indexing a new document. This leaves some gaps in your index, that is, the deleted docs are still in there, but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that, say, once daily (or even weekly).
HTH
Erick
