I have a Solr core with 100K–1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such task with Lucene library and access the Solr index directly (with less overhead).
If needed, I can shut down the core, run my code, and reload the core afterwards (hoping it will take less time than doing it through Solr).
It would be great to hear whether someone has already done such a thing and what the major pitfalls are along the way.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Still, the number of documents you mention isn't that large and should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if it makes any difference, but you need to be absolutely sure you use the same analyzers as Solr does.
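To illustrate the multi-threaded route, here is a minimal SolrJ sketch, assuming a recent SolrJ with ConcurrentUpdateSolrClient (older versions used StreamingUpdateSolrServer); the core URL and the loadDocumentsWithNewField() helper are hypothetical placeholders:

    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkFieldUpdate {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL -- adjust to your setup.
            String coreUrl = "http://localhost:8983/solr/mycore";

            // ConcurrentUpdateSolrClient queues documents and sends them from a pool
            // of background threads, which is usually much faster than issuing one
            // HTTP request per document.
            try (SolrClient client = new ConcurrentUpdateSolrClient.Builder(coreUrl)
                    .withQueueSize(10000)
                    .withThreadCount(4)
                    .build()) {
                for (SolrInputDocument doc : loadDocumentsWithNewField()) {
                    client.add(doc); // re-adding a doc with the same uniqueKey replaces it
                }
                client.commit();     // one commit at the end, not per document
            }
        }

        // Hypothetical helper: yields every document with the new/changed field set,
        // e.g. streamed from the original data source.
        private static Iterable<SolrInputDocument> loadDocumentsWithNewField() {
            return Collections.emptyList();
        }
    }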
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.
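For illustration, a minimal sketch of how such an external field might be set up; the field name "popularity", the schema snippet in the comment, and the data-directory path are assumptions, not something from the question:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WriteExternalFieldFile {
        public static void main(String[] args) throws IOException {
            // Assumed schema.xml declaration (names are hypothetical):
            //   <fieldType name="extPopularity" class="solr.ExternalFileField"
            //              keyField="id" defVal="0" valType="float"/>
            //   <field name="popularity" type="extPopularity" indexed="false" stored="false"/>
            //
            // The values live in a plain text file named external_<fieldname> in the
            // core's data directory, one "key=value" line per document, so they can be
            // refreshed without touching the Lucene index itself.
            Map<String, Float> popularityById = new LinkedHashMap<>();
            popularityById.put("doc-1", 3.5f);
            popularityById.put("doc-2", 0.7f);

            StringBuilder lines = new StringBuilder();
            for (Map.Entry<String, Float> e : popularityById.entrySet()) {
                lines.append(e.getKey()).append('=').append(e.getValue()).append('\n');
            }

            // Hypothetical data directory -- point this at your core's data dir.
            Path dataDir = Paths.get("/var/solr/data/mycore/data");
            Files.write(dataDir.resolve("external_popularity"),
                        lines.toString().getBytes(StandardCharsets.UTF_8));
        }
    }

To my understanding, the new values are picked up when a new searcher is opened; newer Solr versions also ship an ExternalFileFieldReloader listener for this.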
For statistical purposes I have to save and analyse all search queries made to a server running Solr (version 8.3.1). Maybe it's just because I haven't worked with Solr until today, but I couldn't find a simpler way to access these queries other than crawling the logs.
I've only found one article to help me, in which the following is stated:
I think that solr by him self doesn't store the queries (correct me if I'm wrong, about this) but you can accomplish what you want by processing the solr log (its the only way I think).
(source: https://lucene.472066.n3.nabble.com/is-it-possible-to-save-the-search-query-td4018925.html)
Is there any more convenient way to do this?
I actually found a good way to achieve this in another SO question. Well, at least kind of.
Note: it is only useful if you have enough resources on the same server or another server to properly handle a second Solr core.
Link to original answer
That SO question is about Elasticsearch, but its methodology can also be applied to this case with a second Solr core that indexes the queries made. (One can also add additional fields such as when the query was last searched, total search count, ...)
Search auto-complete functionality is also achievable with this solution.
In short:
The basic idea is to use a second Solr core (or instance) to provide the means necessary for quickly saving the queries (instead of a DB, for instance).
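A minimal SolrJ sketch of that idea, assuming a dedicated core named query_log and the default dynamic field suffixes *_s and *_dt; all names here are assumptions:

    import java.time.Instant;
    import java.util.UUID;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class QueryLogger {
        // Hypothetical core dedicated to storing raw user queries.
        private final SolrClient queryCore =
                new HttpSolrClient.Builder("http://localhost:8983/solr/query_log").build();

        // Record a single user query in the second core.
        public void log(String userQuery) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("query_s", userQuery);                       // the raw query string
            doc.addField("searched_at_dt", Instant.now().toString()); // when it was searched
            queryCore.add(doc, 5000); // commitWithin 5s: let Solr batch the commits
        }
    }

Facets or other queries against that core then give you the statistics, and it can also back a search auto-complete.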
Remark: I'm not going to accept this as the best answer, because it is a rather specialized solution to the question I originally asked. But I nonetheless felt it could be useful for any programmer trying to achieve this while also thinking about search auto-completion.
I'm using Alfresco 4.1.6 and SOLR 1.4.
For search, I use fts_alfresco_language and the searchService.query method.
And in my query I search by PATH, TYPE and some custom properties such as address, telephone, email, or similar.
I now have over 2 million documents, and we can see that search performance is worse than it was at the beginning.
I read that in Solr 1.4, using PATH in the query is a bad idea, and that it is better to avoid it and use only TYPE and the property key and value.
But I have 2 questions...
Why does PATH increase the response time? Shouldn't it help? I have over 1000 main folders at the root of the repository. If I specify the folder Solr should search in, why doesn't that filter the results, and why do I get a worse response time than when I don't specify it? Or is there another way to tell Solr the main folder, so that the results are reduced before the rest of the query runs?
When I search by custom properties, I use 3 or 4 properties, all indexed. Do these combined lookups have a higher overhead than a single one? Would it be better to search by only one property instead of all 3? Or maybe use ORs instead of ANDs to get results faster? How does Solr handle this?
Thanks!
First, let me start with this: I'm not sure what you want from this question because it's vague. You're not asking how to make your query better; you're asking why a bad practice (with bad performance) is performing badly for you.
Do some research on how to structure your ECM system; the first thing that makes your ECM any good is a proper content model. There are books out there that will help you.
If you're structuring your content with folders (PATH) and these are important to you, then you need to add them as metadata to your content. If you haven't done that, then you should start with that.
A good Content Model will be able to find content wherever it's placed within your ECM system.
Sure it's easy to migrate a filesystem to an ECM system and just leave it there, but you've done only half the work.
Path queries are slow in general because they use a loop pattern, which is expensive. This has been greatly improved in newer SOLR versions, but it still isn't as fast as normal metadata querying.
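To make the difference concrete, here is a sketch of a metadata-only lookup using the Alfresco Java API; the content type my:document and the property my:telephone are hypothetical names standing in for your own content model:

    import org.alfresco.service.cmr.repository.StoreRef;
    import org.alfresco.service.cmr.search.ResultSet;
    import org.alfresco.service.cmr.search.SearchParameters;
    import org.alfresco.service.cmr.search.SearchService;

    public class MetadataQueryExample {

        private SearchService searchService; // injected by Spring in a real Alfresco module

        public ResultSet findByMetadata(String telephone) {
            SearchParameters sp = new SearchParameters();
            sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
            sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);

            // Metadata-only query: a type plus an indexed custom property.
            sp.setQuery("TYPE:\"my:document\" AND =my:telephone:\"" + telephone + "\"");

            // The slower alternative that walks the folder hierarchy would add
            // something like: PATH:"/app:company_home/cm:MyFolder//*"

            return searchService.query(sp);
        }
    }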
As part of a refactoring project I'm moving our querying end to Elasticsearch. The goal is to eventually refactor the indexing end to ES as well, but that is quite involved, and since the indexing part is running stably it has lower priority.
This leads to a situation where a Lucene index is created/populated using Solr and queried using Elasticsearch. To my understanding this should be possible, since ES and Solr both create Lucene-compatible indexes.
Just to be sure: besides some housekeeping in ES to point it to the correct index, is there any unforeseen trouble I should be aware of when doing this?
You are correct: a Lucene index is part of an Elasticsearch index. However, you need to consider that an Elasticsearch index also contains Elasticsearch-specific metadata, which will have to be recreated. The trickiest part of that metadata is the mapping, which will have to precisely match the Solr schema for every field you care about, and that might not be easy for some data types. Moreover, Elasticsearch expects to find certain internal fields in the index. For example, it wouldn't be able to function without the _uid field indexed and stored for every record.
In the end, even if you overcome all these hurdles, you might end up with a fairly brittle solution, and you will not be able to take advantage of many advanced Elasticsearch features. I would suggest looking into migrating the indexing portion first.
Have you seen the ElasticSearch Mock Solr Plugin? I think it might help you in the migration process.
Consider the following situation. We have a database which stores writers and books in two separate tables. One book obviously stores the reference to the writer who wrote the book.
For Solr I have to denormalize this structure into one big document where every book contains the details of its associated writer. This index is then used for querying books.
Now a user of the system decides to update a writer record. Because many books can be associated with that writer, I have to update every document in Solr that has embedded data from this writer record. This is very painful because, as far as I know, I have to delete and re-add every affected document.
Is there any better way of doing this? I need near-real-time updates of the index when any of the referenced data gets modified.
This would be a perfect use case for nested documents. As far as I know, Lucene does support nested documents but Solr doesn't; I'm not totally sure about the current state of this feature.
This feature is available in Elasticsearch, though. You might want to have a look at it; there's an article I just wrote that may be interesting if you want to know what's so cool about Elasticsearch in my opinion. Your question just reminded me that I didn't mention the nested documents feature in my article, which is really cool too. You can use the nested type in your mapping. If you want to know more you can have a look at this article; by the way, it contains exactly the books/authors example.
Elasticsearch also helps when updating documents. You don't need to reindex the whole document; you can send only the changes through a script. Because it stores the source document that was indexed, it internally retrieves it, applies the script, and reindexes it. That's how Lucene works internally, since its index segments are write-once. With Solr 4, which will be released soon, you can update documents by providing only the changes, but as far as I know this works only if all your fields are stored; fields that are not stored cannot be retrieved from the index.
As for near-real-time updates, Elasticsearch uses the Lucene near-real-time API and automatically refreshes the index reader every second. Solr 3 doesn't use those APIs yet, but Solr 4 does.
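For the Solr 4 atomic-update route mentioned above, a minimal SolrJ sketch: after querying for the ids of all books that embed the changed writer, you send only the id and the changed field with a "set" instruction. The core name, field names and ids here are hypothetical, and the client class assumes a current SolrJ (with Solr 4-era SolrJ the equivalent was HttpSolrServer):

    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class WriterNameUpdate {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()) {
                // Atomic update: only the uniqueKey plus the fields being changed are sent.
                // {"set": value} tells Solr to replace the field value; internally Solr
                // still rewrites the whole document, which is why the fields must be stored.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "book-42");
                doc.addField("writer_name", Collections.singletonMap("set", "New Writer Name"));

                solr.add(doc);
                solr.commit();
            }
        }
    }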
For keeping such denormalized documents up to date in Solr you can use the DataImportHandler and delta imports. The example at https://wiki.apache.org/solr/DataImportHandler#Delta-Import_Example shows how this would work. Obviously you would then need to give Solr access to your database.
I am working on a project right now that has a Solr index of counts and IDs. I am currently researching whether it is possible to increment/decrement on Solr directly, instead of having to retrieve the data, increment it with PHP, and then reinsert it into Solr.
I have spent an hour googling variations of this to no avail. Any information would be most appreciated.
Thanks.
No, as far as I know it's not possible. You could certainly implement this in Solr as a request handler that retrieves the document from the underlying Lucene index, updates the field, then writes it back to the index and commits, but doing this too frequently will probably kill your performance. This is not really what Lucene/Solr were designed for. Consider using something like Redis instead for this particular feature, and leave Lucene/Solr for full-text search, where they really shine.
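If you go the Redis route, the counter operations are single atomic commands; here is a minimal sketch using the Jedis client (the key naming scheme and connection details are assumptions):

    import redis.clients.jedis.Jedis;

    public class CounterStore {
        public static void main(String[] args) {
            // Keep the fast-changing counters in Redis, keyed by the same ids that
            // live in the Solr index, and leave Solr for the full-text side.
            try (Jedis redis = new Jedis("localhost", 6379)) {
                long newValue = redis.incr("counter:item-123"); // atomic increment
                redis.decr("counter:item-123");                 // atomic decrement
                System.out.println("current count: " + newValue);
            }
        }
    }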