I am working on some improvements to our reindexing process. We have custom logic to figure out which documents have been modified and need to be reindexed, so at the end I can generate a delete query along the lines of "delete all documents where fieldId is in this list".
So instead of deleting and adding 50k documents every time, we only re-index a tiny percentage of them.
Now I am thinking about the edge case where the list of fieldIds is extremely large, say 30-40,000 ids. If that happens, is there an upper limit on request length that I should worry about? Or would it hurt performance and exacerbate the situation instead of making it better?
I read some articles on Google advising to make it a POST request instead.
I am using the latest SolrNet build, which is built on Solr 4.0.
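For illustration, here is a rough sketch of what I mean, using plain HTTP from Python rather than SolrNet; the host, core name ("mycore") and field name ("fieldId") are placeholders, and I'm assuming the core accepts JSON on /update (older setups may need /update/json):

import requests

# Ids of the documents our custom logic flagged as modified; in practice this
# list can grow to tens of thousands of entries.
modified_ids = ["101", "102", "103"]

# Build a single delete-by-query; sending it in the POST body means the id list
# never has to fit into a URL, so URL-length limits are not a concern.
delete_query = "fieldId:({})".format(" OR ".join(modified_ids))

resp = requests.post(
    "http://localhost:8983/solr/mycore/update",
    params={"commit": "true"},
    json={"delete": {"query": delete_query}},
    timeout=60,
)
resp.raise_for_status()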
I would revisit that logic, because deleting the documents and then re-indexing them is not the best solution. Firstly, it is an expensive operation; secondly, your index will be empty or incomplete for a while until you re-index the documents again, which means that if you query the index in the middle of the operation you could get zero or partial results.
I would advise just indexing again with the same document id (the uniqueKey defined in the Solr schema.xml). Solr is smart enough to overwrite the document when it is indexed with the same id, so you don't have to worry about the hassle of deleting old documents. You might also 'optimize' the index from time to time to physically get rid of 'deleted' documents.
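Something like the following (a rough sketch using the plain HTTP update API rather than SolrNet; host, core name and field names are placeholders):

import requests

# Re-add only the changed documents with their existing uniqueKey ("id" here);
# Solr replaces the old version of each document automatically, no delete needed.
changed_docs = [
    {"id": "doc-101", "title": "Updated title", "fieldId": "101"},
    {"id": "doc-102", "title": "Another update", "fieldId": "102"},
]

resp = requests.post(
    "http://localhost:8983/solr/mycore/update",
    params={"commit": "true"},
    json=changed_docs,
    timeout=60,
)
resp.raise_for_status()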
Currently I am facing an issue where a MongoDB collection might have billions of records; it contains documents generated by rapid events happening in the system. These events get logged in the collection.
Since we have 2-3 compound indexes on the same collection, searches definitely become slow.
The way out is that our customer has agreed that if we index only the last N months of data in MongoDB, read efficiency can increase, instead of having 2-3 years of data indexed while we perform reads.
My thoughts on solution 1: we can use TTL indexes and set an expiry. After this expiry the data gets deleted from the main collection; we can somehow back up the expired records. This way we only keep the data we actually need in the main collection.
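A rough sketch of what I have in mind for solution 1, assuming the events live in a collection called "events" with a date field "createdAt" (both names are placeholders):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
events = client["mydb"]["events"]

# TTL index: the TTL monitor removes a document once createdAt is older than
# roughly N months (createdAt must be a BSON date for TTL to apply).
N_MONTHS_IN_SECONDS = 90 * 24 * 3600
events.create_index([("createdAt", ASCENDING)], expireAfterSeconds=N_MONTHS_IN_SECONDS)

Backing up the expired records would still need a separate archival step before they age out.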
My thoughts on solution 2: I can remove all the indexes and create them again based on a time frame; for example, drop the current indexes and recreate them with a condition that they only cover the past N months of data. This way I can maintain a limited index. But I am not sure how feasible that is.
Question: I need more help on how I can achieve selective indexing. It also needs to be rolling, since records get added every day and so does the indexing.
If you're on Mongo 3.2 or above, you should be able to use a partial index to create the "selective index" that you want (see https://docs.mongodb.com/manual/core/index-partial/#index-type-partial). You'll just need to be sure that your queries share the same partial filter expression that the index has.
(I suspect that there might also be issues with the indexes you currently have, and that reducing index size won't necessarily have a huge impact on search duration. Mongo indexes are stored in a B-tree, so the time to navigate the tree to find a single item is going to scale relative to the log of the number of items. It might be worth examining the explain output for the queries that you have to see what mongo is actually doing.)
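A hedged sketch of both points, with made-up collection and field names, and assuming your MongoDB version accepts a date comparison in the partial filter expression:

import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
events = client["mydb"]["events"]

# Partial index: only documents newer than the cutoff are indexed. The cutoff is
# fixed at index build time, so a rolling window means rebuilding it periodically.
cutoff = datetime.datetime(2024, 1, 1)
events.create_index(
    [("eventType", ASCENDING), ("createdAt", ASCENDING)],
    partialFilterExpression={"createdAt": {"$gte": cutoff}},
)

# The query has to include a condition at least as strict as the partial filter,
# otherwise the planner won't consider this index.
query = {"eventType": "login", "createdAt": {"$gte": cutoff}}

# Check what Mongo is actually doing: look for IXSCAN vs COLLSCAN in the plan.
plan = events.find(query).explain()
print(plan["queryPlanner"]["winningPlan"])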
I'm running a lot of Solr document updates, which results in hundreds of thousands of deleted documents and a significant increase in disk usage (hundreds of GB).
I'm able to remove all deleted documents by doing an optimize:
curl http://localhost:8983/solr/core_name/update?optimize=true
But this takes hours to run and requires a lot of RAM and disk space.
Is there a better way to remove deleted documents from the SOLR index or to update a document without creating a deleted one?
Thanks for your help!
Lucene uses an append-only strategy, which means that when a new version of an old document is added, the old document is marked as deleted and a new one is inserted into the index. This allows Lucene to avoid rewriting the whole index as documents are added, at the cost of old documents physically remaining in the index until a merge or an optimize happens.
When you issue expungeDeletes, you're telling Solr to perform a merge if the number of deleted documents exceeds a certain threshold, in effect forcing an optimize behind the scenes where Solr deems it necessary.
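For reference, expungeDeletes can be sent along with a commit, in the same style as the curl call in the question (host and core name are placeholders):

import requests

# Merge away segments that are dominated by deleted documents, without forcing
# a full optimize of the whole index.
resp = requests.get(
    "http://localhost:8983/solr/core_name/update",
    params={"commit": "true", "expungeDeletes": "true"},
    timeout=600,
)
resp.raise_for_status()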
How you can work around this depends on more specific information about your use case; in the general case, just leaving the standard settings for merge factors etc. in place should be good enough. If you're not seeing any merges, you might have disabled automatic merging (given your index size, seeing hundreds of thousands of deleted documents seems excessive for an indexing process taking 2m30s). In that case, make sure to enable it properly and tweak its values again. There are also changes introduced in 7.5 to the TieredMergePolicy that allow even more detailed control (and possibly better defaults) over the merge process.
If you're re-indexing your complete dataset each time, indexing to a separate collection/core and then switching an alias over or renaming the core when finished before removing the old dataset is also an option.
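If you're on SolrCloud, the alias switch is a single Collections API call once the new collection is fully built; a sketch with placeholder names:

import requests

# Point the alias the application queries ("products") at the freshly indexed
# collection ("products_v2"); the old collection can be dropped afterwards.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATEALIAS",
        "name": "products",
        "collections": "products_v2",
    },
    timeout=60,
)
resp.raise_for_status()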
I am using the Solr search engine. I defined a schema initially and imported data from a SQL DB into Solr using DIH. I now have a new column in the SQL DB whose value is populated from some of the previous columns, and I have to index this new column into Solr.
My question is: do I perform an update for all records, or do I delete all records from Solr and rebuild the index again using DIH? I am asking because I have read that if we perform an update on any document, Solr first deletes the existing document and then adds it again.
The answer regarding speed is, as always, "it depends". But it's usually easier to just reindex. It doesn't require all fields to be stored in Solr and it's something you'll have to support anyway - so it doesn't require any additional code.
It also offers a bit more flexibility with regard to the index, since, as you note, if you are going to do partial updates, the actual implementation is delete+add internally (since there might be fields that depend on the field you're changing, update processors, distribution across the cluster, etc.), which requires all fields to be stored. This can have a huge impact on index size and might not be necessary, especially if you have all the content in the DB for other uses anyway.
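For comparison, this is roughly what the partial (atomic) update path looks like; field, id and core names are made up, and it assumes the updateLog is enabled:

import requests

# Only the new field is sent with a "set" operation; Solr still rewrites each
# whole document internally, which is why the other fields must be stored.
partial_updates = [
    {"id": "doc-1", "newColumn": {"set": "value derived from other columns"}},
    {"id": "doc-2", "newColumn": {"set": "another derived value"}},
]

resp = requests.post(
    "http://localhost:8983/solr/mycore/update",
    params={"commit": "true"},
    json=partial_updates,
    timeout=60,
)
resp.raise_for_status()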
So with regard to speed you're probably just going to have to try it (document sizes, speed of the DB, field sizes, etc. are going to affect that in each case), but usually the speed of a reindex isn't the most important part.
If you update your index, don't forget to optimize it afterwards (via the admin console, for example) to get rid of all the deleted documents.
We have Solr storing 3 billion records on 23 machines; each machine has 4 shards, and only 230 million documents have a field like aliasName. Currently the queryCache, documentCache, and filterCache are disabled.
Problem: We are trying to get results with a query like q=aliasName:[* TO *] AND firstname:ash AND lastName:Coburn, and it returns the matching documents in 4.3 seconds. Basically we want only those matching firstname and lastName records where aliasName is not empty.
I am thinking of enabling the filter query fq=aliasName:[* TO *], but I am not sure it will make things faster, since firstname and lastName are mostly different in each query. How much memory should we allocate for the filter cache to perform well? It should not impact the other existing queries, like q=firstname:ash AND lastName:something.
Please don't worry about I/O operations, as we are using flash drives.
I'd really appreciate a reply if you have worked on a similar issue and can suggest the best solution.
According to solr documentation...
filterCache
This cache stores unordered sets of document IDs that match the key (usually queries)
URL: https://wiki.apache.org/solr/SolrCaching#filterCache
So I think it comes down to two things:
What is the percentage of documents that have aliasName populated? In my opinion, if most documents have this field populated, then the filter cache might be useless. But if it is only a small percentage of documents, the filter cache will have a huge performance impact and use less memory.
What kind of id are you using? I assume that the documentation refers to Lucene document ids, not Solr ids, but maybe smaller Solr ids could result in a smaller cache size as well (I am not sure).
In the end you will have to run a trial and see how it goes; maybe try it on a couple of nodes first and see if there is a performance improvement.
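For clarity, this is the query shape being discussed: the stable aliasName clause goes into fq so its document set can be cached and reused, while the per-user name terms stay in q (host and collection name are placeholders):

import requests

resp = requests.get(
    "http://localhost:8983/solr/people/select",
    params={
        "q": "firstname:ash AND lastName:Coburn",
        "fq": "aliasName:[* TO *]",  # identical on every request, so one filterCache entry
        "wt": "json",
    },
    timeout=30,
)
print(resp.json()["response"]["numFound"])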
I'm using DSE Search 3.2.1. We have removed some unneeded indexes and fields and posted the schema.xml document to all of the nodes. Do we need to do anything else to have it discontinue indexing data? Do we need to run a reindex, or a full reindex?
I'm pretty sure, from what I see in Solr, that you need to reindex after changing the fields in your Solr schema.xml. After you post it, you'll need to reload the core. If querying still works after that you might be OK, but I would guess you're going to need to run a reindex to be safe.
If you don't reindex, the existing Solr index field values will remain, occupying space and responding to queries. And fresh inserts or updates will not have the deleted fields. As Ben said, that might be okay.
A Solr reindex will delete all of the old field values.
Ideally, if you change anything in schema.xml and want the changes to take effect, you have to do a re-index. But whether to re-index really depends on the application use case and the number of records you have. If the reason for removing the fields was lack of usage, then there is no need to re-index, since no one is going to search on them; the old index data will take some space, but that should be fine. Also, be careful when re-indexing, because the cost highly depends on the number of documents you have. If you have somewhere around 10M or more, I would NOT recommend re-indexing, as it is a CPU- and I/O-bound operation. If the number of documents is smaller, you can surely go ahead and do it.