On a Solr installation with 2+ shards, when is the data returned by the LukeRequestHandler distributed across the shards? I ask because I want to be able to detect new (previously unseen) dynamic fields within a short amount of time after they are added.
Example desired sequence of events:
Assume dynamic field *_s
Query Luke and receive list of dynamic fields
Add document with field example_s
Query Luke and receive same list as before but with additional example_s in result (this currently doesn't happen)
Query collection for example_s:* and match the document added above
I am aware that newly added documents become immediately searchable even before being hard committed, but I am looking for a way to have that info appear in Luke too.
Info on the following would be useful:
Does Luke query all shards at request time, or just one? It would appear to only query one at random.
Exactly when does knowledge of previously unseen dynamic fields become distributed across all shards (equivalently, available to Luke)?
Can I configure the delay/trigger for this supposed Luke propagation in order to minimize the delay between addition of a document with a new dynamic field on an arbitrary shard and the moment it becomes visible in Luke responses on every other shard?
See https://issues.apache.org/jira/browse/SOLR-8127
Never.
As indicated by responses on the linked ticket, the Luke request handler isn't at a high enough level to understand multiple shards. Luke provides information about an index, not a collection, and certainly not a cluster.
You need to query each shard (core) directly. This can be done by using the exact core path, e.g. /solr/collection_shard1_replica1/admin/luke
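For illustration, a minimal SolrJ sketch of that per-core approach (SolrJ 6+ Builder API assumed; the core URLs below are placeholders for your actual replica cores): it runs a Luke request against each core and unions the field names, which is about as close to a cluster-wide Luke view as you can get.

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    public class ClusterWideLuke {
        public static void main(String[] args) throws Exception {
            // Querying one core per shard is typically enough, since replicas
            // of the same shard converge to the same field set.
            String[] cores = {
                "http://localhost:8983/solr/collection_shard1_replica1",
                "http://localhost:8983/solr/collection_shard2_replica1"
            };
            Set<String> fields = new HashSet<>();
            for (String coreUrl : cores) {
                try (HttpSolrClient client = new HttpSolrClient.Builder(coreUrl).build()) {
                    LukeRequest luke = new LukeRequest();
                    luke.setNumTerms(0); // field list only, skip per-term statistics
                    LukeResponse resp = luke.process(client);
                    fields.addAll(resp.getFieldInfo().keySet());
                }
            }
            System.out.println("Fields across all shards: " + fields);
        }
    }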
I have roughly 50M documents, with 90 fields (20 stored + 70 non-stored) defined in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Of these 90 fields, there are 3-4 fields (all stored) which are updated very frequently. Updating these fields the normal way would require re-populating all the fields again, which is a heavy task. If I use atomic/partial updates, we still have to re-index the non-stored fields.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate indexes/collections: one for stored fields and one for non-stored fields, with the document id as the relation between them. We kept the frequently updated fields in the stored index, which lets us leverage atomic updates. To overcome the limitation of join queries in SolrCloud, we sharded and replicated the stored-fields collection across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with an additional 3 ZooKeeper instances. Considering the number of documents, our only concern is: will join queries eventually degrade search performance? If so, what other options can I consider?
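For concreteness, a hedged SolrJ sketch of the kind of cross-collection join this setup implies (the collection, field, and URL names are assumptions, not from the post): the query runs against the sharded stored-fields collection and joins to the single-shard, fully replicated non-stored collection on the shared id.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class JoinQueryExample {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/stored_collection").build()) {
                // Join on the shared document id: the text match runs against
                // the non-stored collection, results come from the stored one.
                SolrQuery q = new SolrQuery(
                    "{!join from=id to=id fromIndex=nonstored_collection}body_text:solr");
                q.addFacetField("category"); // faceting, as in the question
                q.setHighlight(true);        // highlighting, as in the question
                QueryResponse resp = client.query(q);
                System.out.println("Matches: " + resp.getResults().getNumFound());
            }
        }
    }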
Using joins makes Solr behave more like a relational database. There is an article on this from the Lucidworks team: Solr and Joins. Even they say that if your solution requires joins, you probably need to rethink your design.
I think I have a solution for you. First of all, forget two collections. Create one collection, and have two Solr documents for every logical document: one holds the stored fields and the other holds the non-stored fields. At update time you update the document with the stored fields, and you run search-related operations against the other document.
Then, at query time, all you need to do is merge both documents into a single document, which can be done by writing a service layer over Solr.
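A minimal sketch of that merge step, assuming the two twin documents have already been fetched (the "stored fields win on conflict" rule and the method shape are my assumptions):

    import org.apache.solr.common.SolrDocument;

    public class DocMerger {
        /** Merge the twin documents into one result; stored fields win on conflict. */
        public static SolrDocument merge(SolrDocument stored, SolrDocument searchable) {
            SolrDocument merged = new SolrDocument();
            for (String f : searchable.getFieldNames()) {
                merged.setField(f, searchable.getFieldValue(f));
            }
            for (String f : stored.getFieldNames()) {
                merged.setField(f, stored.getFieldValue(f)); // stored overrides
            }
            return merged;
        }
    }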
I had an issue with partial/atomic updates triggering index operations in the background on fields I did not modify. This is different from the question, but maybe the use of nested documents is worth thinking about.
I was evaluating nested documents to separate the document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So as long as you are not able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField...> directives), the use of nested documents does not seem to be a valid approach.
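For reference, a hedged sketch of what an in-place update looks like from SolrJ (in-place updates exist as of Solr 6.5; the view_count field is my assumption and must be a single-valued, non-indexed, non-stored docValues field, otherwise Solr falls back to a regular atomic update):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class InPlaceUpdateExample {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                Map<String, Object> op = new HashMap<>();
                op.put("inc", 1); // "inc"/"set" happen in place only on docValues-only fields
                doc.addField("view_count", op);
                client.add(doc);
                client.commit();
            }
        }
    }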
I want to remove one specific value from a multivalued field in a large index, where I need to query first which documents contain that value, i.e.:
retrieve IDs of the documents containing the specific value.
partially update these documents (using remove).
Solr version is 5.1. I could update if necessary, but the change logs do not indicate any relevance to this issue.
I've tried the following query (in a few variations) on the /select endpoint through the Solr web interface (http://localhost:8983/solr/#/core/documents), trying to remove the value from all the documents:
{"id":"*", "field": {"remove":"value"} }
The server response is "success", but no document is updated.
What I could do is to query for field:value, extract the document IDs, and (programmatically) generate update queries for these IDs, similar to what has been indicated in this answer. But I would expect that there should be a more straight-forward solution.
The examples presented in the partial updates documentation and other related web pages are not really applicable here, because they assume that the IDs of the updated documents are known in advance.
Most other discussions about similar issues refer to old Solr versions, before partial updates were introduced (in Solr 4).
As far as I know, there is no "update by query" functionality in Solr at the moment, so fetch-then-update is still the suggested way.
Batching these updates (one select, then one batched update request) should however work as expected, reducing the number of requests made to Solr; a sketch follows.
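A hedged SolrJ sketch of that fetch-then-batch-update pattern, reusing the field and value names from the question (the core URL, paging, and commit policy are assumptions):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class RemoveValueEverywhere {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/core").build()) {
                // 1) Select the ids of documents containing the value.
                SolrQuery q = new SolrQuery("field:value");
                q.setFields("id");
                q.setRows(1000); // page with cursorMark for larger result sets
                List<SolrInputDocument> updates = new ArrayList<>();
                for (SolrDocument d : client.query(q).getResults()) {
                    SolrInputDocument u = new SolrInputDocument();
                    u.addField("id", d.getFieldValue("id"));
                    Map<String, Object> op = new HashMap<>();
                    op.put("remove", "value"); // atomic update: drop one value
                    u.addField("field", op);
                    updates.add(u);
                }
                // 2) Send all atomic updates in a single batched request.
                client.add(updates);
                client.commit();
            }
        }
    }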
I want to set up a SolrCloud cluster for over 10 million news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:
Add a prefix such as ED2001! to the document ID, where ED identifies the newspaper source and 2001 is the year of the article's publication date; i.e., I want all news articles from a specific newspaper source published in a specific year to land in one shard.
Create the collection with router.name set to compositeId.
Add documents?
Query Collection?
Practically, I have some questions:
How do I add documents based on this plan? Do I have to specify special parameters when updating the collection/core?
Is this called "custom sharding"? If not, what is "custom sharding"?
Is auto sharding a better choice for my case, since there's a shard-splitting feature for when a shard grows too big?
Can I query without the _route_ parameter?
EDIT # 2015/9/2:
This is how I think SolrCloud will behave: "The number of news articles from a specific newspaper source in a specific year tends to be around a fixed number, e.g. every year ED has around 80,000 articles, so each shard's size won't grow dramatically. For the next year's ED articles, I only have to prefix the document IDs with 'ED2016!'; SolrCloud will create a new shard for me (which contains all the ED2016 articles), and the leader will later spread replicas of this new shard to the other nodes (one replica per node other than the leader?)". Am I right? If yes, there seems to be no need for shard splitting.
Answer-1: If you have the schema (structure) of the documents, you can define it in the schema.xml configuration, or you can use Solr's schemaless mode for indexing. Schemaless mode automatically identifies the fields in your documents and indexes them; its configuration is a little different from schema-based configuration. Afterwards, you send the documents to Solr for indexing using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whatever language suits you best.
Answer-2: What you have mentioned in your plan, the use of compositeId, is called custom sharding, because you are deciding which shard a particular document should go to.
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you need to index now and in the future. As the index size increases, you can split the shards and scale Solr horizontally.
Answer-4: I went through the Solr documentation and did not find _route_ mentioned anywhere as a mandatory parameter. In some situations, though, supplying it may improve query performance because it avoids the latency of querying all the shards.
Answer-5: Auto-sharding means routing a document to a shard based on the hash range assigned when the shards were created. It does not create new shards automatically just because you specify a new compositeId prefix, so once the index grows large enough you might need to split a shard yourself (a sketch follows). Check here for more.
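To make the shard-splitting point concrete, a hedged SolrJ sketch of the Collections API SPLITSHARD call (SolrJ 7+ Builder style assumed; collection and shard names are placeholders): splitting is an explicit request you issue, not something Solr triggers on its own.

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class SplitShardExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                // Split shard1 of the "news" collection into two sub-shards.
                CollectionAdminRequest.splitShard("news")
                    .setShardName("shard1")
                    .process(client);
            }
        }
    }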
This is actually a guide to answer my own question:
I kinda understand some concepts:
"custom sharding" IS NOT "custom hashing".
By default, Solr splits the hash range evenly across the shards.
The compositeId router applies "custom hashing", because it changes the default hashing behavior by prefixing shard_key/num-of-bits.
The implicit router applies "custom sharding", since we have to manually specify which shards our docs are sent to.
The compositeId router is still auto sharding, since it is Solr that sees the shard_key prefix and routes the docs to specific shards.
The compositeId router requires the numShards parameter (possibly because Solr needs to distribute the hash ranges across the shards).
So obviously my strategy doesn't work, since I always have to add the new year's news articles to Solr and there is no way to predict the number of shards in advance. The implicit router therefore seems like a possible choice for me (we create the shards we need and add docs to the shards we intend); a sketch follows.
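A hedged SolrJ sketch of that implicit-router plan (SolrJ 7+ Builder style assumed; collection, config, and shard names are placeholders): shards are named up front, a new shard is added per year with CREATESHARD, and each document is routed with the _route_ parameter.

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class ImplicitRouterExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                // Create the collection with explicitly named shards (implicit router).
                CollectionAdminRequest
                    .createCollectionWithImplicitRouter("news", "myconf", "ED2015,ED2016", 1)
                    .process(client);

                // New year, new shard: CREATESHARD works only with the implicit router.
                CollectionAdminRequest.createShard("news", "ED2017").process(client);

                // Index a document into a named shard via the _route_ parameter.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "ED-2017-00001");
                doc.addField("title", "Example headline");
                UpdateRequest update = new UpdateRequest();
                update.add(doc);
                update.setParam("_route_", "ED2017");
                update.process(client, "news");
                client.commit("news");
            }
        }
    }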
We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past, but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has that stored-fields limitation, correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: Lucene never goes back to modify existing segments, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers built on top of Lucene try to work around this problem by exposing a single endpoint that deletes the old document and reindexes the new one automatically, but there must be some way to retrieve the old document. Solr can do that only if you store all the fields.
Elasticsearch works around it by storing the source document by default, in a special field called _source. That is exactly the document you sent to the search engine in the first place, while indexing. This is, by the way, one of the features that makes Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (this still deletes the old document and indexes the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
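As a hedged illustration of the first update mode (partial document merge), here is a plain-HTTP sketch; the index/type/id and field names are my assumptions, and the URL shape matches older, pre-7.x Elasticsearch where mapping types still existed:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class EsPartialUpdate {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:9200/myindex/mytype/1/_update");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            // Only the fields under "doc" change; Elasticsearch merges them into
            // the stored _source, deletes the old document and indexes the merge.
            String body = "{\"doc\":{\"title\":\"new title\"}}";
            try (OutputStream os = conn.getOutputStream()) {
                os.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }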
In my Solr queries, I want to sort most recently accessed documents to the top ("accessed" meaning opened by user action). No other search criteria has weight for me: of the documents with text matching the query, I want them in order of recent use. I can only think of two ways to do this:
1) Include a 'last accessed' date field in each doc to have Solr sort upon. Trie Date fields can be sorted very quickly, I'm told. The problem of course is keeping the field up to date, which would require storing each document's text so I can delete and re-add any document with an updated 'last accessed' field. Mutable fields would obviate this, but Lucene/Solr still doesn't offer mutable fields.
2) Alternatively, store the mutable 'last accessed' dates and keep them updated in another db. This would require Solr to return the full list of matching documents, which could be upwards of hundreds of thousands of documents. This huge list of document ids would then be matched up against dates in the db and then sorted. It would work OK for uncommon search terms, but not for broad, common search terms.
So the trade-off is between 1) index size plus a processing cost every time a document is accessed, and 2) big query overhead, especially for unfocused search terms.
Do I have any alternatives?
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-WorkingwithExternalFiles
http://blog.mikemccandless.com/2012/01/tochildblockjoinquery-in-lucene.html
You should be able to do this with the atomic update functionality.
http://wiki.apache.org/solr/Atomic_Updates
This functionality is available as of Solr 4.0. It allows you to update a single field in a document without having to reindex the entire document. I only know about this functionality from the documentation. I have not used it myself, so I can't say how well it works or if there are any pitfalls.
Definitely use option 1, using SOLR queries and updating the lastAccessed field as needed.
Since SOLR 4.0, partial document updates are supported in several flavours: https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
For your application it seems that a simple atomic update would be sufficient.
With respect to performance, this should work very well for large collections and fast document updates.
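A minimal sketch of that atomic update on the lastAccessed field (the field name comes from the answer above; the core URL, document id, and commit policy are assumptions): only the changed field is sent, and Solr rebuilds the rest of the document internally.

    import java.util.Date;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class TouchLastAccessed {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-42");
                Map<String, Object> op = new HashMap<>();
                op.put("set", new Date()); // atomic "set": replace just this field
                doc.addField("lastAccessed", op);
                client.add(doc);
                client.commit(); // or rely on autoCommit / commitWithin
            }
        }
    }

Note that atomic updates still require the document's other fields to be stored so Solr can reconstruct the full document behind the scenes.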