Why are Solr time-series logs stored in different collections based on time instead of different shards based on time?

If you look at Lucidworks' Time Based Partitioning or Large Scale Log Analytics with Solr, multiple Solr "collections" are created, partitioned on time.
My questions are:
- Why not, in such cases, just create multiple shards based on time?
- In the case of multiple collections, how would a query spanning multiple collections/time ranges be done?

There is not much difference between multiple shards with implicit routing and multiple collections. When you issue a query, you can (optionally) specify which shards or which collections to search.
Alternatively you can set up an alias containing multiple collections, thus hiding the logistics from the search client. This makes it easy to create custom views over the full data set, such as an alias for each year, one for everything and one for the last quarter. If you later decide to slice your data differently, e.g. make a collection for each week instead of each month, the change will be transparent to the client application. Aliases do not work for shards, so that is one reason to prefer collections.
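For example (the collection and alias names below are only illustrative), an alias spanning several time-based collections can be created with the Collections API and then queried like any single collection; a query can also list collections explicitly:
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=logs_last_quarter&collections=logs_2022_09,logs_2022_10,logs_2022_11"
curl "http://localhost:8983/solr/logs_last_quarter/select?q=level:ERROR&rows=10"
curl "http://localhost:8983/solr/logs_2022_11/select?q=level:ERROR&collection=logs_2022_10,logs_2022_11"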

Related

store the number of times a document was seen in a given time period

I am parsing documents on the web and storing them in a Solr database. Every day I see thousands of documents, and some of them are repeats.
I'd like to give users an option to see which document was seen the most on a given date, or in a given timespan. The queries of interest correspond to:
- show me which documents were seen the most on 16/10/2022,
- show me which documents were seen the most between 16/10/2022 and 23/10/2022.
When writing Solr queries, you specify the field name to search on. What field type should I use, and in what format should I store the number of times the document was seen on a given date?
How I would try it:
Create a separate collection - very simple collection with fields:
view time
doc id
title or body (whatever you're querying)
... do this for EVERY view.
You can then facet it by whatever gap you want:
# "views" stands for the separate per-view collection described above; view_time is its "view time" field
curl http://localhost:8983/solr/views/query -d 'q=title:abc&rows=0&json.facet={
  per_month: {
    type: range,
    field: view_time,
    start: "2022-01-01T00:00:00Z",
    end:   "2022-12-31T23:59:59Z",
    gap:   "+1MONTH"
  }
}'
This would return all views bucketed by MONTH (you can change the gap to DAY, YEAR, etc.).
But your documents are probably too big to copy into every view record. If you want to normalize this, use a JOIN query: since Solr 8.6 you can do cross-collection joins across multiple shards, and there are good articles and videos describing how to set this up. It's not that hard to do.
The JOIN query would be much faster.
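A rough sketch of such a query, assuming a "views" collection whose doc_id field points at the id field of a "docs" collection (all names here are made up, and the join fields would need suitable configuration such as docValues):
curl http://localhost:8983/solr/docs/select --data-urlencode 'q={!join method=crossCollection fromIndex=views from=doc_id to=id}view_time:[2022-10-16T00:00:00Z TO 2022-10-23T00:00:00Z]' --data-urlencode 'fl=id,title'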
If you don't want to do the JOIN query:
If the views change often, do not store them in the main document store. Partial (atomic) updates in Solr still rewrite the whole document under the hood, so if you're updating views every day you'll effectively re-index every document that's been viewed. That's going to cause a lot of unnecessary disk thrashing.
Other thoughts:
Can you use a database? That is a far better home for views; Solr isn't good as a master record for views.
Another suggestion is to send the views to an analytics engine - a far better solution, since you can get rich analytics about the actual users. An analytics engine does a lot that simply recording views does not - especially filtering out false positives (like bots!). It's not fun to maintain an accurate view count if you have a high-trafficked site.
In the past I've used an analytics engine to collect the data and then exported that data from the analytics engine into Solr. This way the view logic is handled by the software component that knows views best (an analytics engine like Google Analytics or the Salesforce marketing engine), and an hourly process updates the views in Solr using one of the above tactics.

Indexing architecture for a frequently updated Solr index?

I have roughly 50M documents with 90 fields (20 stored + 70 non-stored) in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Of these 90 fields, 3-4 fields (all stored) are updated very frequently. Updating these fields normally would require re-populating all the fields again, which is a heavy task. If I use atomic/partial updates, we have to supply the non-stored fields again.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate indexes/collections, i.e. one for the stored fields and one for the non-stored fields, the relation between the documents being the id of the doc. We kept the frequently updated fields in the stored index. By doing this we were able to leverage atomic updates. Also, to overcome the limitation of join queries in cloud mode, the stored collection was sharded & replicated across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with an additional 3 instances of ZooKeeper. Considering the number of docs, the only area of concern is: will join queries eventually degrade search performance? If so, what other options can I consider?
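As a sketch of what such a query looks like with this layout (collection and field names are hypothetical), you search the non-stored collection and join back to the stored one on the shared id:
curl http://localhost:8983/solr/stored_fields/select --data-urlencode 'q={!join from=id to=id fromIndex=nonstored_fields}body:analytics' --data-urlencode 'fl=id,price,stock'
This relies on the "from" collection (the non-stored one) being single-sharded and replicated on every node, as described above.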
Thinking in terms of joins makes Solr more like a relational database. There is an article on this from the Lucidworks team, "Solr and Joins"; even they say that if your solution requires joins, you need to rethink it.
I think I have a solution for you. First of all, forget two collections. You create one collection, and you have two Solr documents for every single logical document: one document holds the stored fields and the other holds the non-stored fields. At update time you update the document that has the stored fields, and you perform the search-related operations on the other document.
Now all you need to do at query time is merge both documents into a single result, which can be done by writing a service layer over Solr.
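A minimal sketch of that layout (ids and field names are invented for illustration): two documents per logical record in one collection, tied together by a shared key and distinguished by a type field:
curl http://localhost:8983/solr/mycollection/update?commit=true -H 'Content-Type: application/json' -d '[
  {"id": "42-stored", "group_id": "42", "doc_type": "stored", "views": 10},
  {"id": "42-search", "group_id": "42", "doc_type": "search", "body": "full text to be searched ..."}
]'
The service layer then searches the "search" documents, collects their group_id values, fetches the matching "stored" documents and merges each pair into one result.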
I have an issue with partial/atomic updates triggering index operations in the background on fields I did not modify. This is different from the question, but maybe the use of nested documents is worth thinking about.
I was looking at nested documents as a way to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So as long as you are not able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField...> directives), the use of nested documents does not seem to be a valid approach.
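For reference, a minimal sketch of indexing such a block (names are invented); changing either the header or the text later means re-sending the whole block:
curl http://localhost:8983/solr/mycollection/update?commit=true -H 'Content-Type: application/json' -d '[
  {"id": "doc-1",
   "title": "header data",
   "_childDocuments_": [
     {"id": "doc-1-text", "content": "large text body that is expensive to analyze ..."}
   ]}
]'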

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search those fields (e.g. title and content), but also the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would make the search a lot easier. The downside, though, is that when e.g. a user updates their name, I have to go through all of that user's blog posts and update their documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations on how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The absolutely most performant and easiest solution would be to flatten everything to a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than the documents are updated. And even if one of the values that is identical across a large set of documents changes, re-indexing starting from the most recent documents (for a blog) and working backwards will appear quite performant to most users. This assumes that you actually have to search the values and don't just need them for display - in the latter case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
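Flattened, each blog post simply carries a copy of the parent values (field names below are made up):
curl http://localhost:8983/solr/blogs/update?commit=true -H 'Content-Type: application/json' -d '[
  {"id": "post-1", "title": "My first post", "content": "...", "user_id": "u7", "user_name": "Alice", "category_name": "Travel"}
]'
If the user renames themselves, you re-index that user's posts with the new user_name, most recent first.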
Another option is to divide this into a multi-search problem: one collection for blog posts, one collection for users and one collection for categories. You then search each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to the Solr cluster for you.
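As a rough sketch (collection and field names are assumptions, and the /export handler used here needs docValues on the requested and sort fields), a streaming innerJoin of blog posts with their users could look like:
curl http://localhost:8983/solr/blogs/stream --data-urlencode 'expr=innerJoin(
  search(blogs, q="content:solr", fl="id,title,user_id", sort="user_id asc", qt="/export"),
  search(users, q="*:*", fl="id,name", sort="id asc", qt="/export"),
  on="user_id=id")'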
The reason why I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure, and a flat structure lets you fully leverage the features available. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the queries and feature set you want (if it is possible at all). If the documents are flat, it just works.

search multiple cores and merge results

I have set up Solr with multiple cores. Each core has its own schema (with a common unique id).
Core0: Id, name
Core1: Id, type.
I am looking for a way to generate the resultset as Id, name, type. Is there any way?
I tried the solution suggested in "Search multiple SOLR core's and return one result set" but it did not work.
The reason for not having a merged schema is as follows:
We have thousands of documents (pdf files) which need to be extracted and inserted into Solr for search on a daily basis, which means there will be millions of documents at some point.
The second requirement is that at any point in time there can be a request to perform some processing on all historical pdf files and add the processed information to Solr for each file, so that the processed info comes up in search.
Now, since extraction takes a long time, it would be very difficult (long update time) to perform a historical update for all files. So I thought that if I use another core to keep the processed info, it would be quicker. Please suggest if there is an alternative solution.

Sorting by recent access in Lucene / Solr

In my Solr queries, I want to sort most recently accessed documents to the top ("accessed" meaning opened by user action). No other search criteria has weight for me: of the documents with text matching the query, I want them in order of recent use. I can only think of two ways to do this:
1) Include a 'last accessed' date field in each doc to have Solr sort upon. Trie Date fields can be sorted very quickly, I'm told. The problem of course is keeping the field up to date, which would require storing each document's text so I can delete and re-add any document with an updated 'last accessed' field. Mutable fields would obviate this, but Lucene/Solr still doesn't offer mutable fields.
2) Alternatively, store the mutable 'last accessed' dates and keep them updated in another db. This would require Solr to return the full list of matching documents, which could be upwards of hundreds of thousands of documents. This huge list of document ids would then be matched up against dates in the db and then sorted. It would work OK for uncommon search terms, but not for broad, common search terms.
So the trade-off is between 1) index size plus a processing cost every time a document is accessed, and 2) big query overhead, especially for unfocused search terms.
Do I have any alternatives?
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html
You should be able to do this with the atomic update functionality.
http://wiki.apache.org/solr/Atomic_Updates
This functionality is available as of Solr 4.0. It allows you to update a single field in a document without having to reindex the entire document. I only know about this functionality from the documentation. I have not used it myself, so I can't say how well it works or if there are any pitfalls.
Definitely use option 1, using SOLR queries and updating the lastAccessed field as needed.
Since Solr 4.0, partial document updates are supported in several flavours: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
For your application it seems that a simple atomic update would be sufficient.
With respect to performance, this should work very well for large collections and fast document updates.
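A hedged example of such an update (collection name and document id are placeholders; lastAccessed is the field discussed above):
curl http://localhost:8983/solr/mycollection/update?commit=true -H 'Content-Type: application/json' -d '[
  {"id": "doc-123", "lastAccessed": {"set": "2012-10-16T12:34:56Z"}}
]'
Note that atomic updates still require the remaining fields to be stored so Solr can rebuild the document internally; only the update request itself is partial.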

Resources