Does Solr store recent queries?

For example, I fired these queries:
q=id:SOURCE-*
q=sourceName:abc
q=sourceName:xyz
q=id:DB-*
Is there any way to fetch these last 4 queries fired on Solr?

Solr does have a query result cache that holds previous queries and the document IDs of their results. Your main issue would be how to use it, as it is mostly for internal use, but you can look into the source code and maybe find a way.

One idea might be to use the Solr logging system. Set the log level to INFO and every query will be logged.
In addition to the logging options [...], there is a way to
configure which request parameters (such as parameters sent as part of
queries) are logged with an additional request parameter called
logParamsList. See the section on Common Query Parameters for more
information.
For example, with logParamsList=q, only the q parameter will be logged.
N.B. Logging every query can potentially impact performance depending on the query rate and the volume of data generated.
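Once queries are being logged, recovering the recent ones is a matter of parsing the log file. The sketch below assumes log lines in the shape Solr's INFO request log typically uses; the exact format varies with the Solr version and logging configuration, and the sample lines here are made up.

```python
import re
from urllib.parse import parse_qs

# Hypothetical sample lines shaped like Solr's INFO request log;
# the exact format depends on the Solr version and log4j config.
log_lines = [
    "... o.a.s.c.S.Request [core] webapp=/solr path=/select params={q=id:SOURCE-*} hits=3 status=0 QTime=2",
    "... o.a.s.c.S.Request [core] webapp=/solr path=/select params={q=sourceName:abc} hits=1 status=0 QTime=1",
]

def recent_queries(lines):
    """Extract the q parameter from each logged /select request."""
    queries = []
    for line in lines:
        m = re.search(r"params=\{([^}]*)\}", line)
        if m:
            params = parse_qs(m.group(1))
            queries.extend(params.get("q", []))
    return queries

print(recent_queries(log_lines))  # → ['id:SOURCE-*', 'sourceName:abc']
```

Tailing the last N matches of this pattern over solr.log would give you the "last 4 queries" from the question.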

Related

How to access snowflake query profile overview statistics via SQL?

In Snowflake SnowSight UI, in the Query Profile view, there is a section called Profile Overview where you can see the breakdown of the total execution time. It contains statistics like Processing, Local Disk I/O, Remote Disk I/O, Synchronization etc.
Full list here
https://docs.snowflake.com/en/user-guide/ui-snowsight-activity.html#profile-overview
I want to access those statistics programmatically instead of having to navigate to that section for each query that I want to analyze. The only system view I know that provides query statistics is the QUERY_HISTORY however it doesn't contain those stats.
https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html
Question is, can I get those stats in any of the system views? If so, where and how?
It is possible to programmatically access the query profile using GET_QUERY_OPERATOR_STATS:
Returns statistics about individual query operators within a query. You can run this function for any query that was executed in the past 14 days.
For example, you can use this information to determine which operators are consuming the most resources. As another example, you can use this function to identify joins that have more output rows than input rows, which can be a sign of an “exploding” join (e.g. an unintended Cartesian product).
These statistics are also available in the query profile tab in Snowsight. The GET_QUERY_OPERATOR_STATS() function makes the same information available via a programmatic interface.
The GET_QUERY_OPERATOR_STATS function is a table function. It returns rows with statistics about each query operator in the query:
set query_id = '<query_id>';
select *
from table(get_query_operator_stats($query_id));
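Once the rows are fetched (e.g. via a Snowflake connector), the "exploding join" check mentioned in the docs can be done in plain code. The rows below are made-up samples, flattened for the sketch; in the real function, row counts live inside the OPERATOR_STATISTICS column.

```python
# Made-up sample rows mirroring GET_QUERY_OPERATOR_STATS output
# (row counts flattened out of the OPERATOR_STATISTICS column).
rows = [
    {"operator_id": 0, "operator_type": "Result",    "input_rows": 10,   "output_rows": 10},
    {"operator_id": 1, "operator_type": "Join",      "input_rows": 1000, "output_rows": 50000},
    {"operator_id": 2, "operator_type": "TableScan", "input_rows": 0,    "output_rows": 1000},
]

def exploding_joins(operator_rows, factor=10):
    """Flag join operators whose output greatly exceeds their input,
    a possible sign of an unintended Cartesian product."""
    return [
        r["operator_id"]
        for r in operator_rows
        if r["operator_type"] == "Join" and r["output_rows"] > factor * r["input_rows"]
    ]

print(exploding_joins(rows))  # → [1]
```

The factor-of-10 threshold is arbitrary; tune it to your workload.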
2023 update: GET_QUERY_OPERATOR_STATS()
See Lukasz's answer: https://stackoverflow.com/a/74824120/132438
https://docs.snowflake.com/en/sql-reference/functions/get_query_operator_stats.html
Bad news: There's no programmatic way to get this.
Good news: This is a frequent request, so we might eventually have news.
In the internal tracker I left a note to update this answer once there is progress we can report.
You can do it via https://github.com/Snowflake-Labs/sfsnowsightextensions#get-sfqueryprofile. Doing it at scale (scraping-style) will likely yield a ~60%-80% success rate. Please don't abuse it.
Inspired by a clever customer who did that to get what is now offered by https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
Completely unsupported as it says so on the repo homepage.
Just FYI, there is an upcoming feature called GET_QUERY_STATS (currently in private preview, https://docs.snowflake.com/en/LIMITEDACCESS/get_query_stats.html) that will do just this and obviate the need for Get-SFQueryProfile once it ships.

Salesforce SOQL query - Jersey read timeout error

I'm having a problem on a batch job that has a simple SOQL query that returns a lot of records. More than a million.
The query, as it is, cannot be optimized much further according to SOQL best practices. (At least, as far as I know. I'm not an SF SOQL expert.)
The problem is that I'm getting:
Caused by: javax.ws.rs.ProcessingException: java.net.SocketTimeoutException: Read timed out
I tried bumping up the Jersey read timeout value from 30 seconds to 60 seconds, but it still times out.
Any recommendation on how to deal with this issue? Any recommended value for the readtimeout parameter for a query that returns that much data?
The query is like this:
SELECT Id, field1, field2__c, field3__c, field3__c FROM Object__c
WHERE field2__c = true AND (not field3 like '\u0025Some string\u0025')
ORDER BY field4__c ASC
In no specific order...
Batches written in Apex time out after 2 minutes, so maybe set the same limit in your Java application.
Run your query in Developer Console using the query plan feature (you will probably have to put a real % in there, not \u0025). Pay attention to which part has a "Cost" column value > 1.
What are the field types? Plain checkbox and text, or some complex formulas?
Is that text static, or does it change depending on what your app needs? Would you consider filtering out the string in your code rather than in SOQL? It's counter-intuitive to return more records than you really need, but it might be an option.
Would you consider making a formula field with either the whole logic or just the string search, and then asking SF to index the formula? Or maybe making another field (another checkbox?) with "yes, it contains that text" info, with the value set by a workflow (essentially preparing your data a bit so you can query it efficiently later).
Read up about skinny tables and see if that's something that could work for you (needs SF support).
Can you make an analytic snapshot of your data (make a report, have SF save the results to a helper object, query that object)? Even if it just contained lookups to your original source, so you always access fresh values, it could help. It might be a storage killer though.
Have you considered "big objects" and async SOQL?
I'm not proud of it, but in the past I had some success badgering the SF database. Not via the API, but if I had a nightly batch job that was timing out, I kept resubmitting it, and eventually the 3rd-5th time it managed to start. Something in the query optimizer, creation of the cursor in the underlying Oracle database, caching of partial results... I don't know.
What's in the ORDER BY? Some date field? If you need records updated since X first, then maybe the replication API could help you get the IDs first.
Does it make sense to use LIMIT 200, for example? Which API are you using, SOAP or REST? It might be that returning smaller chunks (SOAP: batch size; REST API: a special header) would help it finish faster.
When all else fails (but do contact SF support first and make sure you've exhausted the options), maybe restructure the whole thing. Make SF push data to you whenever it changes rather than pulling it. There's the Streaming API (a CometD implementation, the Bayeux protocol, however these are called), Change Data Capture, and Platform Events for nice event-bus-driven architecture decisions, with replay of old events up to 3 days back if the client was down and couldn't listen... But that's a totally different topic.
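The smaller-chunks suggestion can be sketched as follows. Salesforce's REST query API accepts an Sforce-Query-Options header to shrink each returned chunk, and a queryMore-style loop follows nextRecordsUrl until done. Endpoint and session values here are placeholders and no real call is made; the fake pages stand in for successive HTTP responses.

```python
# Placeholder session id; a real call would use an OAuth access token.
def query_request_headers(session_id, batch_size=200):
    """Headers for a Salesforce REST query with a reduced chunk size."""
    return {
        "Authorization": f"Bearer {session_id}",
        "Sforce-Query-Options": f"batchSize={batch_size}",
    }

def drain(pages):
    """Walk a queryMore-style paginated result: each page carries 'records',
    'done', and (when not done) a 'nextRecordsUrl' to fetch next."""
    records = []
    for page in pages:
        records.extend(page["records"])
        if page["done"]:
            break
    return records

# Fake responses standing in for two successive HTTP pages.
pages = [
    {"records": [1, 2], "done": False, "nextRecordsUrl": "/services/data/v58.0/query/01g-2000"},
    {"records": [3], "done": True},
]
print(drain(pages))  # → [1, 2, 3]
```

Smaller chunks won't reduce total work, but each individual response returns sooner, which is what matters for a per-request read timeout.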

Solrcache and requesthandler

Due to my project's security requirements, I have created a custom request handler (e.g. "/new") to serve requests coming from a particular set of users, and I have the default "/select" request handler to serve requests from another set of users. This distinction is made so that they search over different sets of fields (qf). My query string (say, q=car) sent to the /new handler fetches 100 results, and the same (q=car) sent to /select gives 50 results. Will the query results for each request handler be handled separately or be taken from the same cache?
In short, is each Solr request handler tied to its own query cache?
Of all the caches in Solr, the most important regarding queries is the filterCache. If properly set up, and if the queries make use of fq, it usually has a great impact.
It is my understanding that the filterCache is shared among all request handlers.
The other caches (documentCache, queryResultCache, etc.) are much less important.
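The sharing point can be illustrated with a toy model: the filterCache is keyed by the filter query itself, not by the request handler, so /new and /select reusing the same fq hit the same entry. This is a simulation of the idea, not Solr's actual implementation.

```python
# Toy model: cache keyed by the fq string, independent of the handler.
filter_cache = {}

def run_filter(handler, fq, compute):
    """Return the doc-set for fq, computing it only on a cache miss."""
    if fq not in filter_cache:
        filter_cache[fq] = compute(fq)
    return filter_cache[fq]

calls = []
def compute(fq):
    calls.append(fq)   # track actual evaluations
    return {1, 2, 3}   # pretend doc-id set matching fq

run_filter("/new", "type:car", compute)
run_filter("/select", "type:car", compute)
print(len(calls))  # → 1 (the second handler reused the cached entry)
```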

How to stop certain results from being stored in FilterCache in Solr

I have been using some filters in multi-faceting queries in Solr. Right now the filters use only one value, but now I have to expand them to multiple values, and I think I have to use OR for that. I haven't done any performance checking, but I am wondering if there is a way to stop my filter queries from being stored in the filterCache? I don't want to cache results from filter queries with more than two values. I guess ideally I should rely on the caching algorithm doing a good job, but I am just wondering.
Taken from here.
To tell Solr not to cache a filter, we use the same powerful local params DSL that adds metadata to query parameters and is used to specify different types of query syntaxes and query parsers. For a normal query that does not have any localParam metadata, simply prepend a local param of cache=false.
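A quick sketch of that local-params syntax: prepending {!cache=false} to one fq keeps that filter out of the filterCache while other filters are cached as usual. This just builds the query string; the field names and values are illustrative.

```python
from urllib.parse import urlencode

# Two filters: the multi-valued one is marked uncacheable, the other cached.
params = [
    ("q", "car"),
    ("fq", "{!cache=false}color:(red OR blue OR green)"),  # not cached
    ("fq", "in_stock:true"),                               # cached as usual
]
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the /select URL; only the filter carrying the local param is excluded from the cache.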

How does the ids parameter work in a Solr query? And will it help me in debugging a sharding issue?

We have a large number of Solr shards and are trying to set up multiple levels of aggregators. My understanding is that one aggregator should have no more than 200 cores associated with it. Our current plan has our first level of aggregators working on 100 cores each. We will then have another level of aggregators collecting these together. So far this is not working for us, and when we query our second level we get 500 Internal Server Errors. Digging into this, we find our level-one aggregators are giving NPEs. We've also found there's some translation of the query going on. For instance, we give our level-two aggregator something like this:
http://l2agghostname:8080/solr/core-00/select?q=*
It sends the following to the level 1:
http://l1agghostname:8080/solr/core-00/select?ids=a6_370573660942_76697809790_0,a7_370573660942_76697809790_4&wt=xml&q=*
That suggests it's receiving the IDs to return, but I'm not sure exactly what that "ids" parameter is supposed to do. If I plug that same query directly into the level-one aggregator I get the same error; however, if I give it only one doc ID, like so:
http://l1agghostname:8080/solr/core-00/select?ids=a6_370573660942_76697809790_0&wt=xml&q=*
Then it will return the information!
This seems very weird, but I'm also not sure I should be spending time trying to understand how this ids parameter works. Am I following a red herring?
PS: http://l1agghostname:8080/solr/core-00/select?q=* does return results as expected.
What we eventually found was that Solr was not able to handle searches like this. Solr would send the initial query through the stack; the aggregators would collect the matching IDs and send them back; Solr is then supposed to retrieve the documents by those IDs. But once the IDs were returned, it could not recall the path used to collect them, so it sent the retrieval queries to the wrong machines and errors occurred.
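That two-phase flow is what the ids parameter belongs to: phase one asks each shard for matching IDs, and phase two goes back to the same shards with ids=<comma-separated list> to fetch the documents. The toy simulation below uses made-up shard contents; the failure described above corresponds to phase two being routed to the wrong shards.

```python
# Made-up shard contents for the simulation.
shards = {
    "shard1": {"a6_1": {"id": "a6_1", "name": "doc one"}},
    "shard2": {"a7_4": {"id": "a7_4", "name": "doc two"}},
}

def phase1_collect_ids(query):
    """Phase 1: each shard reports which of its doc IDs match.
    Here every doc 'matches' for simplicity."""
    return {name: list(docs) for name, docs in shards.items()}

def phase2_fetch(ids_by_shard):
    """Phase 2: re-query each shard with its own ids=... list.
    Routing an ID to the wrong shard would fail, as in the question."""
    docs = []
    for name, ids in ids_by_shard.items():
        docs.extend(shards[name][i] for i in ids)
    return docs

results = phase2_fetch(phase1_collect_ids("q=*"))
print(sorted(d["id"] for d in results))  # → ['a6_1', 'a7_4']
```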