I have set up Solr with multiple cores. Each core has its own schema (with a common unique id).
Core0: Id, name
Core1: Id, type
I am looking for a way to generate the result set as Id, name, type. Is there any way to do this?
I tried the solution suggested here, but it did not work:
Search multiple SOLR core's and return one result set
The reason for not having a merged schema is as follows:
We have thousands of documents (PDF files) that need to be extracted and inserted into Solr for search on a daily basis, which means there will eventually be millions of documents.
The second requirement is that, at any point in time, there can be a request to perform some processing on all historical PDF files and add the processed information to Solr for each file, so that it shows up in search.
Since extraction takes a long time, it would be very difficult (long update time) to perform a historical update for all files. So I thought that if I could use another core to keep the processed info, it would be quicker. Please suggest an alternative solution if there is one.
Related
I have roughly 50M documents and 90 fields (20 stored + 70 non-stored) in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Of these 90 fields, there are 3-4 (all stored) which are updated very frequently. Updating these fields normally would require populating all the fields again, which is a heavy task. If I use atomic/partial updates, we have to supply the non-stored fields again.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate indexes/collections, i.e. one for stored fields and one for non-stored fields, the relation between the documents being the id of the doc. We kept the frequently updated fields in the stored index. By doing this we were able to leverage atomic updates. Also, to overcome the limitation of join queries in cloud mode, we sharded and replicated the stored-fields collection across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with 3 additional ZooKeeper instances. Considering the number of docs, the only area of concern is: will join queries eventually degrade search performance? If so, what other options can I consider?
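For reference, a SolrCloud cross-collection join can be expressed with the join query parser, provided the "from" collection is single-sharded and replicated on every node that hosts the "to" collection. This is a minimal sketch only; the collection names, field names and query values below are assumptions for illustration, not the poster's actual setup:

```python
import requests

# Hedged sketch: query the sharded "stored_fields" collection and restrict it
# with a join from the unsharded, fully replicated "nonstored_fields" collection.
# Collection, field and value names are illustrative assumptions.
params = {
    "q": '{!join from=id to=id fromIndex=nonstored_fields}body_text:"solar panels"',
    "fq": "category:reports",          # filter on the stored-fields side
    "fl": "id,title,updated_at",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/stored_fields/select", params=params)
print(resp.json()["response"]["numFound"])
```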
Thinking about joins makes Solr more like a relational database. I found an article on this from the Lucidworks team, Solr and Joins. Even they say that if your solution includes the use of joins, you need to rethink it.
I think I have a solution for you. First of all, forget two collections. You create one collection, and you are going to have two Solr documents for every single logical document. One document will have the stored fields and the other the non-stored fields. At update time you update the document that has the stored fields, and perform search-related operations on the other document.
Now all you need to do at query time is merge both documents into a single document, which can be done by writing a service layer over Solr.
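As a rough sketch of such a service layer (the collection name, the shared entity_id field and the merge policy are all assumptions, not from the post), the merge could look like this:

```python
import requests
from collections import defaultdict

SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"  # assumed collection name

def merged_results(query, rows=50):
    """Fetch both Solr documents per logical document and merge them by a shared key."""
    params = {"q": query, "rows": rows * 2, "fl": "*", "wt": "json"}
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    merged = defaultdict(dict)
    for doc in docs:
        # "entity_id" is the assumed field both document halves share.
        merged[doc["entity_id"]].update(doc)
    return list(merged.values())
```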
I have an issue with partial/atomic updates and index operations running in the background on fields I did not modify. This is different from the question, but maybe the use of nested documents is worth thinking about.
I was checking the use of nested documents to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So as long as you are not able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField...> directives), the use of nested documents does not seem to be a valid approach.
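For completeness, an in-place update is only possible on a single-valued, non-indexed, non-stored docValues field that is not the target of a copyField. A minimal sketch of what such an update request could look like, assuming a hypothetical field named popularity defined that way:

```python
import requests

# Hedged sketch: "popularity" is assumed to be defined with
# indexed="false" stored="false" docValues="true", so Solr can update it
# in place without reindexing the whole parent/child block.
update = [{"id": "doc-42", "popularity": {"set": 7}}]
requests.post(
    "http://localhost:8983/solr/mycollection/update?commit=true",  # assumed collection
    json=update,
)
```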
If you look at Lucidworks' Time Based Partitioning or Large Scale Log Analytics with Solr, multiple Solr "collections" are created, partitioned on time.
My questions are:
Why not, in such cases, just create multiple shards based on time?
In the case of multiple collections, how would a query spanning multiple collections/time ranges be done?
There is not much difference between multiple shards with implicit routing and multiple collections. When you issue a query, you can (optionally) specify which shards or which collections to search.
Alternatively, you can set up an alias containing multiple collections, thus hiding the logistics from the search client. This makes it easy to create custom views over the full data set, such as an alias for each year, one for everything and one for the last quarter. If you later decide to slice your data differently, e.g. make a collection for each week instead of each month, the change will be transparent to the client application. Aliases do not work for shards, so that is one reason to prefer collections.
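As a concrete sketch (the collection names below are made up), an alias spanning several time-based collections can be created with the Collections API and queried like a single collection; a query can also name multiple collections explicitly:

```python
import requests

BASE = "http://localhost:8983/solr"

# Point an alias at the monthly collections it should cover (names are assumptions).
requests.get(f"{BASE}/admin/collections", params={
    "action": "CREATEALIAS",
    "name": "logs_last_quarter",
    "collections": "logs_2023_10,logs_2023_11,logs_2023_12",
})

# Clients query the alias as if it were one collection.
alias_resp = requests.get(f"{BASE}/logs_last_quarter/select",
                          params={"q": "level:ERROR", "wt": "json"})

# Alternatively, a query can list collections explicitly without an alias.
explicit_resp = requests.get(f"{BASE}/logs_2023_12/select", params={
    "q": "level:ERROR",
    "collection": "logs_2023_11,logs_2023_12",
    "wt": "json",
})
```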
I'm working with Apache Solr and would like to get more detailed information about some query options. I discovered facet queries and was wondering when exactly they bring essential advantages, especially in the case of the following example:
There is a stock of books saved on a Solr server. Besides the common attributes a book ought to have, each has an ISBN. Data about books is provided by third parties, so it's important to check that there are no duplicate ISBNs within the system. In order to check whether a book's ISBN is a duplicate, it has to go through a routed path where, unfortunately, every book is processed individually, without any information about preceding or following processes.
The question is:
a) Should you simply query Solr with the current book's ISBN and check the total result count, or
b) should you send a facet query with f.isbn.facet.mincount=2 and check whether the result contains the current book's ISBN?
In both cases, caching results is not possible, so the number of queries would always equal the number of books processed. I simply don't know how Solr works internally and therefore can't make this decision without further information, especially because the number of queries won't be reduced by either of the above options.
If you're going to do a query - do a query. Lucene is highly optimized for doing queries, so that's what you should do. A facet query is for creating facets (counts) from arbitrary queries - so internally it does the same thing. If you generate a facet and then iterate through that one, Lucene has to look at far more documents than if you're just querying for one single value.
The best strategy to get a performance boost would be to perform these operations in batches: check 500 books in the same batch (i.e. isbn:(123 OR 321 OR 567 OR 765)) and then handle the result in your code. If these updates can arrive from many systems in parallel without going through one single source, you'll have to decide how much time can pass before any duplicates might appear in the streams (this race condition can happen with just one book as well, since two streams can query for a single ISBN and both get a negative result before each adds it separately).
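A minimal sketch of that batch check, assuming a core named books and a stored isbn field (both names are assumptions):

```python
import requests

def existing_isbns(isbns, solr_url="http://localhost:8983/solr/books/select"):
    """Return the subset of the given ISBNs that already exist in the index."""
    # One query for the whole batch instead of one query per book.
    query = "isbn:(" + " OR ".join(f'"{i}"' for i in isbns) + ")"
    params = {"q": query, "fl": "isbn", "rows": len(isbns), "wt": "json"}
    docs = requests.get(solr_url, params=params).json()["response"]["docs"]
    return {d["isbn"] for d in docs}

batch = ["9780131103627", "9781491927649", "9780596007126"]
duplicates = existing_isbns(batch)
new_books = [i for i in batch if i not in duplicates]
```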
I have documents in Solr which consist of fields whose values come from different source systems. The reason I am doing this is that this document is what I want returned from the Solr search, including functionality like hit highlighting. As far as I know, if I use a join across multiple Solr documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name so that the first part of the name is an 8-character hash representing the source system. This way the systems can share common field names outside of the unique source hash, and I can easily clear out all fields that start with the source prefix if needed.
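To illustrate the idea being described (the hash scheme, field names and values below are hypothetical):

```python
import hashlib

def source_prefix(source_name: str) -> str:
    """8-character hash identifying the source system, used as a field-name prefix."""
    return hashlib.sha1(source_name.encode()).hexdigest()[:8]

def prefixed(source_name: str, field: str) -> str:
    # Keeps the dynamic-field suffix (e.g. *_s) intact at the end of the name.
    return f"{source_prefix(source_name)}_{field}"

# Both systems can contribute a "field_1_s" without colliding:
doc = {
    "id": "entity-1",
    prefixed("systemA", "field_1_s"): "value from A",
    prefixed("systemB", "field_1_s"): "value from B",
}

# Clearing one source's fields would mean listing the fields carrying its prefix
# and removing them via an atomic update ({"set": None} removes a field).
```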
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience, the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to turn into a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be kept specific to the indexing process. Usually the subsystems will be performant enough (at least when doing batch retrieval based on the documents that are being updated).
I am uploading many CSV files.
currency.csv file:
code,currency_name,currency_decimals
AUD,Australian Dollar,2
GBP,Pound Sterling,2
...
...
currency_holidays.csv file:
code,holiday_date,holiday_name
AUD,02/01/2012,New Year's Day Observed
AUD,26/01/2012,Australia Day
...
...
NOTE: uniqueKey is set to 'code' in the Solr configuration file.
If I upload these files into a single Solr core, it would overwrite the matching currency records, e.g. AUD. Right?
Is it better to have a core per file, i.e. multiple cores?
This is my previous post:
apache solr csv file same values
What is the best solution? I need help. Hope someone can help out.
Thanks
GM
Some of the points you might want to think about:
If you have completely different entities with nothing in common and no dependencies between them (no joins), it would be better to have them as separate cores.
This would be a much cleaner approach.
There might be fields that share a name but need to be analyzed in different ways, and search may need to behave differently for those fields and their boosts.
This would also be more manageable if the data is huge.
However, if you have a very small dataset and none of the above concerns you, just go with a single core.
For unique keys, you can prefix the ids with the entity type, e.g. currency_aud and holiday_aud, which will help you keep the entities separate and prevent overwriting.
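A minimal sketch of indexing the two CSV files with type-prefixed keys, assuming the schema is changed so that the uniqueKey is an id field rather than code, and noting that the holiday key also needs the date, since one currency has many holidays (core name and field choices are assumptions):

```python
import csv
import requests

SOLR_UPDATE = "http://localhost:8983/solr/finance/update?commit=true"  # assumed core name

def load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

docs = []
for row in load("currency.csv"):
    # e.g. id = "currency_aud" so it cannot collide with holiday documents.
    docs.append({**row, "id": f"currency_{row['code'].lower()}", "type": "currency"})
for row in load("currency_holidays.csv"):
    # The date goes into the key as well, because "holiday_aud" alone would
    # still be overwritten by each of that currency's holidays.
    docs.append({**row,
                 "id": f"holiday_{row['code'].lower()}_{row['holiday_date']}",
                 "type": "holiday"})

requests.post(SOLR_UPDATE, json=docs)
```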