Solr cache and request handler - solr

Due to my project's security requirements, I have created a custom request handler (e.g. "/new") to serve requests coming from one set of users, and I keep the default "/select" request handler to serve requests from another set of users. This distinction is made so that each group searches over a different set of fields (qf). My query string (say, q=car) sent to the /new handler fetches 100 results, and the same query (q=car) sent to /select gives 50 results. Will the query results for each request handler be cached separately, or be taken from the same cache?
In short, is each Solr request handler tied to its own query cache?

Of all the caches in Solr, the most important one regarding queries is the filterCache. If it is set up properly, and if the queries make use of fq, it usually has a great impact.
It is my understanding that the filterCache is shared among all request handlers.
The other caches (documentCache, queryResultCache, etc.) are much less important.
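As an illustration (a minimal sketch, assuming a local Solr with a collection named mycollection and a /new handler already configured; the URL and collection name are placeholders): the same fq clause sent through either handler should be answered from the shared filterCache, while q results land in the queryResultCache keyed by the parsed query (which reflects qf) plus sort and filters, so handlers with different qf keep separate entries there.
import requests

SOLR = "http://localhost:8983/solr/mycollection"  # placeholder instance and collection

# The same fq sent through different handlers is served by the same filterCache
# entry, because that cache lives on the index searcher, not on the handler.
for handler in ("/select", "/new"):
    resp = requests.get(SOLR + handler, params={
        "q": "car",
        "fq": "category:vehicles",  # placeholder filter, cached once and reused by both handlers
        "wt": "json",
    })
    print(handler, resp.json()["response"]["numFound"])
After running queries through both handlers, the hit and insert counts for each cache can be inspected on the core's Plugins / Stats page in the admin UI.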

Related

How can I perform solr atomic updates without a distributed update processor?

I'm running a solr cluster in kubernetes, and my organization manages our own shards rather than letting solr distribute documents automatically. Therefore we've replaced DistributedUpdateProcessorFactory with NoOpDistributingUpdateProcessorFactory in the solrconfig.xml. This is happening in my application code, but I can reproduce the behavior with simple curls from the shell:
# send an update to a document in shard 0 that adds a value to each field
curl -X POST --header "Content-Type:application/json" --data "[{'id': 'my_doc_id', 'my_field1': {add: ['foo']}, 'my_field2': {add: ['bar']}]" "http://solr-0.solr-headless.default.svc.cluster.local:8983/solr/my_collection/update?commit=false"
# commit the update to the shard (along with updates to other docs I might have sent)
curl -X GET "http://solr-0.solr-headless.default.svc.cluster.local:8983/solr/my_collection/update?commit=true"
Where my_field1 and my_field2 are both multivalued string fields (I basically want to store a list of strings that I pull out of some data I'm processing).
The response solr gives is:
{
  "responseHeader":{
    "status":400,
    "QTime":3},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"RunUpdateProcessor has received an AddUpdateCommand containing a document that appears to still contain Atomic document update operations, most likely because DistributedUpdateProcessorFactory was explicitly disabled from this updateRequestProcessorChain",
    "code":400}}
So my question is, why are atomic updates prohibited without a distributed update processor, and is there a way to get around this and send atomic updates to individual shards explicitly?
My guess is that since atomic updates require reading the document first, changing the fields and then re-indexing the document, the RunUpdateProcessor only ever receives the already-merged document, without any of the update/add commands present.
That transformation is the responsibility of DistributedUpdateProcessorFactory: in a normal setup it retrieves the current document from the node that holds it, applies the update, and then passes the merged document on to RunUpdateProcessor.
You can implement this functionality yourself by using the optimistic concurrency feature with _version_, or by creating a custom replacement for DistributedUpdateProcessorFactory that resolves an atomic update request by fetching the document from the index (see the sketch below).
The old wiki says that the DistributedUpdateProcessor is a no-op for single core instances, but that doesn't seem to be the case any longer.
But if you're running without a cluster setup, DistributedUpdateProcessorFactory wouldn't do anything distribution-related - i.e. as long as you've configured your Solr instances to be standalone, it should work just fine. This might be an architectural and Solr configuration issue rather than an issue with the update processor itself.
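For illustration only, here is a minimal Python sketch of the read-merge-reindex approach with optimistic concurrency (not the author's code; it assumes the fields involved are stored so the document can be reconstructed from a query, and it reuses the shard URL from the question):
import requests

SHARD = "http://solr-0.solr-headless.default.svc.cluster.local:8983/solr/my_collection"

def add_values(doc_id, field, new_values):
    # 1. Fetch the current document, including its _version_.
    resp = requests.get(SHARD + "/select", params={
        "q": 'id:"%s"' % doc_id,
        "fl": "*,_version_",
        "wt": "json",
    })
    docs = resp.json()["response"]["docs"]
    if not docs:
        raise RuntimeError("document not found: %s" % doc_id)
    doc = docs[0]

    # 2. Apply the "add" locally on the multivalued field.
    doc[field] = list(doc.get(field, [])) + list(new_values)

    # 3. Re-index the whole document. Because _version_ is included, Solr
    #    rejects the write with HTTP 409 if another writer changed the document
    #    in the meantime (optimistic concurrency); retry on 409 as needed.
    update = requests.post(SHARD + "/update?commit=false", json=[doc])
    update.raise_for_status()
Note that this only works if every field you need to preserve is stored (or has docValues returned as stored), and stored copyField targets should be stripped from the fetched document before re-indexing to avoid duplicated values.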

Salesforce SOQL query - Jersey ReadTimeout error

I'm having a problem on a batch job that has a simple SOQL query that returns a lot of records. More than a million.
The query, as it is, cannot be optimized much further according to SOQL best practices. (At least, as far as I know. I'm not an SF SOQL expert.)
The problem is that I'm getting -
Caused by: javax.ws.rs.ProcessingException: java.net.SocketTimeoutException: Read timed out
I tried bumping up the Jersey read timeout value from 30 seconds to 60 seconds, but it still times out.
Any recommendation on how to deal with this issue? Any recommended value for the readtimeout parameter for a query that returns that much data?
The query is like this:
SELECT Id, field1, field2__c, field3__c, field3__c FROM Object__c
WHERE field2__c = true AND (not field3 like '\u0025Some string\u0025')
ORDER BY field4__c ASC
In no specific order...
Batches written in Apex time out after 2 minutes so maybe set same in your Java application
Run your query in Developer Console using the query plan feature (you probably will have to put real % in there, not \u0025). Pay attention which part has "Cost" column > 1.
what are field types? Plain checkbox and text or some complex formulas?
Is that text static or changes depending on what your app needs? would you consider filtering out the string in your code rather than SOQL? Counter-intuitive to return more records than you really need but well, might be an option.
would you consider making a formula field with either whole logic or just the string search and then asking SF to index the formula. Or maybe making another field (another checkbox?) with "yes, it contains that text" info, set the value by workflow maybe (essentially prepare your data a bit to efficiently query it later)
read up about skinny tables and see if it's something that could work for you (needs SF support)
can you make an analytic snapshot of your data (make a report, make SF save results to helper object, query that object)? Even if it'd just contain lookups to your original source so you'll access always fresh values it could help. Might be a storage killer though
have you considered "big objects" and async soql
I'm not proud of it but in the past I had some success badgering the SF database. Not via API but if I had a nightly batch job that was timing out I kept resubmitting it and eventually 3rd-5th time it managed to start. Something in the query optimizer, creation of cursor in underlying Oracle database, caching partial results... I don't know.
what's in the ORDER BY? Some date field? If you need records updated since X first then maybe replication API could help getting ids first.
does it make sense to use LIMIT 200 for example? Which API are you using, SOAP or REST? It might be that returning smaller chunks (SOAP: batch size, REST API: special header) would help it finish faster - see the sketch after this list.
when all else fails (but do contact SF support, make sure you exhausted the options) maybe restructure the whole thing. Make SF push data to you whenever it changes, not pull. There's "Streaming API" (CometD implementation, Bayeux protocol, however these are called) and "Change Data Capture" and "Platform Events" for nice event bus-driven architecture decisions, replaying old events up to 3 days back if the client was down and couldn't listen... But that's a totally different topic.
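To illustrate the "smaller chunks over REST" idea from the list above, here is a rough Python sketch (the instance URL, access token and API version are placeholders; the Sforce-Query-Options header is optional, since the server-side cursor exposed through nextRecordsUrl pages the results either way):
import requests

INSTANCE = "https://yourInstance.my.salesforce.com"   # placeholder org URL
ACCESS_TOKEN = "00D...SESSION_TOKEN"                  # placeholder, obtained via OAuth beforehand
QUERY_URL = INSTANCE + "/services/data/v57.0/query"   # API version is a placeholder

def query_in_chunks(soql, batch_size=200):
    # Each round-trip only has to serialize batch_size records, so individual
    # HTTP requests stay short even if the full result set has millions of rows.
    headers = {
        "Authorization": "Bearer " + ACCESS_TOKEN,
        # Optional hint: ask the server for smaller result pages (200-2000).
        "Sforce-Query-Options": "batchSize=%d" % batch_size,
    }
    resp = requests.get(QUERY_URL, params={"q": soql}, headers=headers).json()
    while True:
        for record in resp["records"]:
            yield record
        if resp.get("done"):
            break
        # Follow the server-side cursor to the next page of results.
        resp = requests.get(INSTANCE + resp["nextRecordsUrl"], headers=headers).json()
The client-side read timeout then only needs to cover one page at a time instead of the whole million-row result.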

Does Solr store recent queries?

For example I fired queries-
q=id:SOURCE-*
q=sourceName:abc
q=sourceName:xyz
q=id:DB-*
Is there any way to fetch these last 4 queries fired on Solr?
Solr does have a query cache that holds the previous queries and the resulting document ids. Your main issue would be how to use it, as it is mostly for internal use, but you can look into the source code and maybe you will find a way.
One idea might be to use the Solr logging system. You can set the log level to INFO, and that should be enough to retrieve every query.
In addition to the logging options [...], there is a way to configure which request parameters (such as parameters sent as part of queries) are logged with an additional request parameter called logParamsList. See the section on Common Query Parameters for more information.
For example, with logParamsList=q only the q parameter will be logged.
N.B. Logging every query can potentially impact performance depending on the query rate and the volume of data generated.
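As a rough illustration of mining the log for recent queries (a sketch only: the log path is a placeholder and the regular expression assumes the default request-log format, where each search is logged with a params={...} section):
import re
from urllib.parse import unquote

LOG_FILE = "/var/solr/logs/solr.log"          # placeholder, adjust to your installation
PARAM_RE = re.compile(r"params=\{([^}]*)\}")

def recent_queries(path=LOG_FILE, limit=4):
    # Scan the request log for /select entries and pull out their q parameter.
    queries = []
    with open(path, encoding="utf-8") as log:
        for line in log:
            if "path=/select" not in line:
                continue
            match = PARAM_RE.search(line)
            if not match:
                continue
            params = dict(p.split("=", 1) for p in match.group(1).split("&") if "=" in p)
            if "q" in params:
                queries.append(unquote(params["q"]))
    return queries[-limit:]   # the most recent queries

if __name__ == "__main__":
    for q in recent_queries():
        print(q)
Combined with logParamsList=q this keeps the log compact while still recording every query that was fired.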

When is Luke data distributed across Solr cores?

On a Solr installation with 2+ shards, when is the data returned by the LukeRequestHandler distributed across the shards? I ask because I want to be able to detect new (previously unseen) dynamic fields within a short amount of time after they are added.
Example desired sequence of events:
Assume dynamic field *_s
Query Luke and receive list of dynamic fields
Add document with field example_s
Query Luke and receive same list as before but with additional example_s in result (this currently doesn't happen)
Query collection for example_s:* and match the document added above
I am aware that newly added documents become immediately searchable even before being hard committed, but I am looking for a way to have that info appear in Luke too.
Info on the following would be useful:
Does Luke query all shards at request time, or just one? It would appear to only query one at random.
Exactly when does knowledge of previously unseen dynamic fields become distributed across all shards (equivalently, available to Luke)?
Can I configure the delay/trigger for this supposed Luke propagation in order to minimize the delay between addition of a document with a new dynamic field on an arbitrary shard and the moment it becomes visible in Luke responses on every other shard?
See https://issues.apache.org/jira/browse/SOLR-8127
Never.
As indicated by responses on the linked ticket, the Luke request handler isn't at a high enough level to understand multiple shards. Luke provides information about an index, not a collection, and certainly not a cluster.
You need to query each shard directly. This can be done by using the exact core path /solr/collection_shard1_replica1/admin/luke
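For example, a small Python sketch that unions the concrete dynamic-field names reported by each core's Luke handler (the core URLs are placeholders; in practice they could be discovered via the Collections API CLUSTERSTATUS or the CoreAdmin STATUS action):
import requests

# Placeholder core URLs - one per shard replica you want to inspect.
CORES = [
    "http://solr-0:8983/solr/my_collection_shard1_replica_n1",
    "http://solr-1:8983/solr/my_collection_shard2_replica_n2",
]

def dynamic_fields_in_use():
    # Luke only describes the local index, so a field like example_s shows up
    # only on the core(s) that actually indexed a document containing it;
    # querying every core and taking the union gives the collection-wide view.
    seen = set()
    for core in CORES:
        luke = requests.get(core + "/admin/luke",
                            params={"numTerms": 0, "wt": "json"}).json()
        for name, info in luke["fields"].items():
            if "dynamicBase" in info:   # field was instantiated from a dynamic field pattern
                seen.add(name)
    return seen
Keep in mind that Luke reports what the currently open searcher sees, so a new field only appears after a commit has opened a new searcher on that core.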

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query [1]), but to me that still seems strange, because I would have to fetch the x entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains. The first string is the SLD and the second the TLD (no subdomains). The application can be used to perform a request like http://[...]/available/stackoverflow.com and will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is that these are pretty big quantities, and I have to figure out a way to keep costs down and add/remove 2*~100000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
[1] 'A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
from google.appengine.ext import db

class UnavailableDomain(db.Model):
    pass
Then you will populate your datastore like:
UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:
free_domains = ['stackoverflow.com', 'monkey.com']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching up the deletes into something like 200 per RPC, if your free_domains list is really big
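A minimal sketch of that batching idea, using the same old-style db API as the snippets above (BATCH_SIZE is just a reasonable guess, not a hard limit):
from google.appengine.ext import db

BATCH_SIZE = 200  # keep each delete RPC reasonably small

def delete_free_domains(free_domains):
    # Build the keys directly from the domain names - no fetch needed - and
    # delete them in chunks. Deleting a key that does not exist is a no-op,
    # which matches the requirement that missing entities be ignored.
    keys = [db.Key.from_path('UnavailableDomain', name) for name in free_domains]
    for start in range(0, len(keys), BATCH_SIZE):
        db.delete(keys[start:start + BATCH_SIZE])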
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could utilise both to:
Create a pipeline for the overall task that you will run via cron every 24hrs
The 'overall' pipeline would start a mapper that filters your entities and yields the delete operations
After the delete mapper completes, the 'overall' pipeline could call an 'import' pipeline to start running your entity creation part.
The pipeline API can then send you an email to report on its status.
