How can we implement archiving in SOLR?

I am new to Apache SOLR and I want to implement archiving in SOLR, since my data is growing day by day. I am not sure whether SOLR supports data archiving or not.
If anybody has any suggestions on this, please share them.

This question is pretty general so it's a bit hard to give a cut and dried answer, but if one thinks about archiving for a moment, there are two parts to it.
1. Removing old data
2. Storing the old data in an alternate location.
The first part is fairly easy in Solr so long as you can identify a query that will select the "old" documents. For example, if you have a field called 'indexed_date' that records when you sent the data to Solr and you want to delete everything before Jan 1, 2014, you might do this:
curl http://localhost:8983/solr/update --data '<delete><query>indexed_date:[* TO 2014-01-01T00:00:00Z]</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
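Note that the delete only becomes visible to searchers after a commit. A minimal follow-up (same endpoint as above, shown purely as an illustration; you could also just append commit=true to the delete request) is:
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'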
The second part requires more thought. The first question is: why would you want to move the data out of Solr to some other location? The answer more or less has to be that you think you might need it again. But ask yourself what the use case for that is, and how you would service that use case. Are you planning on putting the data back into Solr at some later point if you want it? Is Solr the only place where this data was stored, and you need it for record keeping/audit only?
You will have to determine the second half of "archiving" based on your needs, but here are some things to think about:
The data behind fields that are stored="false" is already lost; you cannot completely reconstruct the documents that went into creating them.
Fields for which stored="true" can be retrieved in XML/JSON/CSV with a regular query and then written out to the long-term storage of your choice (a sketch of such an export follows below).
Many systems use Solr as an index into the primary sources rather than as a primary source itself. In that case there may be no need to archive the data at all: simply remove the data that is too old to be relevant in the search results. Of course, make sure that your business team understands and agrees with this strategy before you do it! :)
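To illustrate the export side, here is a sketch of pulling the stored fields of the "old" documents out as CSV before deleting them. The core name, field list and page size are assumptions; for large result sets you would page through with the cursorMark parameter rather than one huge rows value.
curl 'http://localhost:8983/solr/mycore/select' --data-urlencode 'q=indexed_date:[* TO 2014-01-01T00:00:00Z]' --data-urlencode 'fl=id,title,body,indexed_date' --data-urlencode 'sort=id asc' --data-urlencode 'wt=csv' --data-urlencode 'rows=100000' > archived_docs.csv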
EDIT: I happened to look back at this, and when I re-read it I realized I had left something out; there is also a new development.
What I Left Out
The above delete-by-query strategy has the drawback that deleted documents remain in the index (just marked as deleted), potentially wasting as much as 50% of your space (or more if you've run "optimize" in the past!). Here's a good article by Erick Erickson about deletes and their space consequences:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
New Development
If time is the criterion for deletion and you followed the best practice I mentioned above about not having Solr be the single source of truth (i.e. Solr is just an index into a primary source, not the data store), then you may very well want to use the new Time Routed Aliases feature, which maintains a set of temporally bounded collections and deletes the oldest collections. The great thing about deleting a collection, rather than deleting by query, is that there is no merging to do: the segments for that index disappear as a whole, so there are no deleted documents hanging around wasting space.
http://lucene.apache.org/solr/guide/7_4/time-routed-aliases.html
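As a rough sketch of what setting one up looks like: a time routed alias is created through the Collections API. The alias name, config set, routing field and interval below are made up, and the exact parameters (including router.autoDeleteAge, which in later releases controls dropping the oldest collections) should be checked against the reference guide for your Solr version.
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=logs&router.name=time&router.field=timestamp_dt&router.start=NOW/DAY&router.interval=%2B1DAY&create-collection.collection.configName=logs_conf&create-collection.numShards=2'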
Self Promotion Disclaimer: Along with David Smiley, I helped write this feature

Related

Solr near real time search: impact of reindexing frequently the same documents

We want to use SolR in a Near Real Time scenario. Say for example we want to filter / rank our results by number of views.
SolR SoftCommit was made for this use case but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented as a full delete and re-addition of the document in Lucene.
It seems to me that having the same docs in the tlog many times is inefficient, and it might also be problematic during the merge process (is the doc marked n times as deleted and added?).
Any advice / good practice?
Two things you could use for supporting this scenario:
In-place updates: only that field is updated, not the whole doc. Check out the conditions you need to meet to be able to use them (see the sketch after this list).
ExternalFileField: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
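For illustration, here is a sketch of what an in-place update of the view counter could look like; the field, core and document names are assumptions. The documented conditions are that the field is single-valued, numeric, docValues="true" and neither indexed nor stored (and the _version_ field must likewise be a non-indexed, non-stored docValues field). The schema entry would look something like:
<field name="nb_view" type="pint" indexed="false" stored="false" docValues="true"/>
and the update itself is a normal atomic-update request, with visibility governed by your soft-commit settings:
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: application/json' --data '[{"id":"doc1","nb_view":{"inc":1}}]'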

How to handle frequently changing multivalue string fields in SOLR?

I have a SOLR (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disc. The core is quite fixed/frozen and performs quite well, if everything is warmed up.
The problem is a multi-valued string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes way too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside SOLR and to inject them somehow using custom code. I checked quite a few approaches in various JIRA issues and blogs, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't get what he's talking about.
I would appreciate any solution which allows me to update my category field without having to re-warm my caches afterwards.
I'm not sure that JIRA issue will help you: it covers an advanced topic and, most importantly, it is still unresolved, so not yet available.
Partial document updates are not useful here because a) they require everything in your schema to be stored and b) behind the scenes they reindex the whole document again.
From what you say it seems you have one monolithic index: have you considered splitting the index using sharding or SolrCloud? That way each "portion" would be smaller and the autowarm shouldn't be a big problem.
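If you go the SolrCloud route, a sharded collection is created through the Collections API; a minimal sketch (collection name, shard count and config set are placeholders):
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=documents&numShards=4&replicationFactor=1&collection.configName=documents_conf'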

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past but they weren't the right fit for all our use cases.
2. On the other hand, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents when they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once; it never goes back to modify existing ones, thus it only marks documents as deleted and deletes them for real when a merge happens.
Search servers on top of lucene try to work around this problem by exposing a single endpoint that's able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
Elasticsearch works around it storing the source documents by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is by the way one of the features that make elasticsearch similar to NoSQL databases. The elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
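As an illustration of the first option, a partial-document update in Elasticsearch looks roughly like this (index, id and field names are made up; older Elasticsearch versions use a /index/type/id/_update URL instead):
curl -X POST 'http://localhost:9200/myindex/_update/1' -H 'Content-Type: application/json' -d '{"doc": {"category": "archived"}}'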

How to reindex all docs in Solr data

I am going to change some field types in the schema, so it seems I must re-index all the docs currently in the Solr index for this change to take effect.
The question is: how do I "re-index" all the docs?
One solution that I can think of is to "query" all docs through the search interface and dump a large file in XML or JSON, then convert it to the input XML format for Solr, and load it back to Solr again to make the schema change happen.
Is there some better way can do this more efficiently? Thanks for your suggestion.
First of all, dumping the results of a query may not give you the original data if you have fields that are indexed and not stored. In general, it is best to keep a copy of the input to SOLR in a form that you can easily use to rebuild indexes from scratch if you need to. In that case, just run a delete query by posting <delete><query>*:*</query></delete> then <commit/> and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
But you may be able to get away with just running <optimize/> after you restart SOLR with the new schema file. It would be good to have a backup where you can test that it works for your configuration.
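If you want a physical copy of the index before experimenting, the replication handler can take a backup; a sketch, assuming the handler is enabled in solrconfig.xml and using placeholder core and path names:
curl 'http://localhost:8983/solr/mycore/replication?command=backup&location=/backups&name=pre_schema_change'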
There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.
The idea of dumping all the results of a query could give you incomplete or invalid data since you might not surface all of the data within your index.
While the idea of keeping a copy of your index in a form in which you can re-insert it would work well in a situation where the data doesn't change, it becomes more complicated when you've added a new field to the schema. In such a situation, you'll need to collect all the data from the source, format the data to match the new schema and then insert it.
If the number of documents in Solr is big and you need to keep the Solr server available for querying, an indexing job can re-add/re-index documents in the background.
It is helpful to introduce a new field that keeps a last-indexed timestamp for each document, so that in case of any indexing/re-indexing issues it is possible to identify the documents still waiting to be reindexed.
To improve query latency, it is possible to play with configuration parameters so that the caches are kept warm after every commit.
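For example, cache autowarming is configured in solrconfig.xml; a sketch that carries part of the filter and query-result caches over to the new searcher after each commit (the sizes here are illustrative only):
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>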
There is a PHP script that does exactly this: fetch and reinsert all your Solr documents, reindexing them.
For optimizing, call from the command line:
curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body='<optimize/>'

Efficiently sorting and paging with Solr when index is changing

I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort_order=sort_path+desc&rows=N&start:12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that solr ranges combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path desc and then reversing the returned chunk to get the previous N sections. I am going to test the performance of this solution, but it seems viable.
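Spelled out as an actual request, the "next page" query might look like the sketch below (core name, page size and the id tie-breaker are assumptions; the tie-breaker keeps the ordering stable when several sections share the same sort_path):
curl 'http://localhost:8983/solr/sections/select' --data-urlencode 'q=*:*' --data-urlencode 'fq=sort_path:["/A/B/C" TO *]' --data-urlencode 'sort=sort_path asc,id asc' --data-urlencode 'rows=20'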
This is not really a Solr-specific problem, but a general problem with pagination of any external data source, because the data source has an independent state from the (web) application. For example, it also happens on relational databases. Here's a good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution: "Repeat the query for each new request" since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get a lot fewer documents (and therefore a slower rate of inserts/deletes), and you can use the highlighting parameters to get snippets of the sections that matched the user query.
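For instance, a highlighted search over whole documents might look like this sketch (core and field names are placeholders):
curl 'http://localhost:8983/solr/legal/select' --data-urlencode 'q=body:"force majeure"' --data-urlencode 'hl=true' --data-urlencode 'hl.fl=body' --data-urlencode 'hl.snippets=3'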
Another option would be decreasing your commit rate, but this could result in less-than-ideal document freshness.
