Why does my repeated Solr query have a different QTime? - solr

I have a SolrCloud setup and fire a simple query in a loop. The number and type of documents and the query itself never change, but the QTime and the corresponding request time vary from 6 ms to 1000+ ms within the same loop. What could be causing this?

This is because of caching. When you start the service, not all caches are loaded into memory (this depends on the autowarming configuration).
Query time also depends on the load on the server machine. If GC is running on the service because of other activity, the query time will vary again.
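A minimal sketch of how to observe this yourself, assuming a local Solr node and a placeholder collection name: fire the same query repeatedly and compare Solr's reported QTime with the end-to-end request time.
import requests

SOLR_URL = "http://localhost:8983/solr/my_collection/select"  # placeholder collection

for i in range(20):
    resp = requests.get(SOLR_URL, params={"q": "*:*", "rows": 10, "wt": "json"})
    qtime = resp.json()["responseHeader"]["QTime"]    # server-side time in ms
    total_ms = resp.elapsed.total_seconds() * 1000    # approximate client-side round trip
    print(f"run {i}: QTime={qtime} ms, request={total_ms:.1f} ms")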

Related

Snowflake as backend for high demand API

My team and I have been using Snowflake daily for the past eight months to transform/enrich our data (with DBT) and make it available in other tools.
While the platform seems great for heavy, long-running queries on large datasets and for powering analytics tools such as Metabase and Mode, it just doesn't seem to behave well when we need to run really small queries ("grab me one row of table A") behind a high-demand API. Snowflake sometimes takes as much as 100 ms or even 300 ms on an XLARGE-2XLARGE warehouse to fetch one row from a fairly small table (200k computed records/aggregates); added to the network latency, that makes for a very poor setup when we want to use it as a backend to power a high-demand analytics API.
We've tested multiple setups with Node.js + Fastify as well as Python + FastAPI, with connection pooling (10-20-50-100) and without (one connection per request, not ideal at all), deployed in the same AWS region as our Snowflake deployment. Yet we weren't able to sustain anything close to 50-100 requests/sec with 1 s latency (acceptable); we were only able to get 10-20 requests/sec with latency as high as 15-30 s. Both languages/frameworks behave well on their own, even when just acquiring/releasing connections; what actually takes the longest and demands a lot of I/O is running the queries and waiting for a response. We've yet to try a Golang setup, but it all seems to boil down to how quickly Snowflake can return results for such queries.
We'd really like to use Snowflake as the database powering a read-only REST API that is expected to receive something like 300 requests/second, while keeping response times in the neighborhood of 1 s. (But we are also ready to accept that it was just not meant for that.)
Is anyone using Snowflake in a similar setup? What is the best tool/config to get the most out of Snowflake in such conditions? Should we spin up many servers and hope that we'll get to a decent request rate? Or should we just copy transformed data over to something like Postgres to be able to have better response times?
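As an illustration of the kind of pooled setup described above, here is a rough Python sketch using the snowflake-connector-python driver and a simple queue-based pool; the account, credentials, table, and pool size are placeholders rather than the asker's actual configuration.
import queue
import snowflake.connector

POOL_SIZE = 20  # illustrative
pool = queue.Queue(maxsize=POOL_SIZE)
for _ in range(POOL_SIZE):
    pool.put(snowflake.connector.connect(
        account="my_account",          # placeholder credentials
        user="api_user",
        password="...",
        warehouse="API_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    ))

def fetch_one_row(record_id):
    # Borrow a connection, run the small lookup, and hand the connection back.
    conn = pool.get()
    try:
        cur = conn.cursor()
        cur.execute("SELECT * FROM aggregates WHERE id = %s", (record_id,))  # placeholder table
        return cur.fetchone()
    finally:
        pool.put(conn)
Even with the pool, each call still pays Snowflake's per-query compile and scheduling overhead, which is what the answers below focus on.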
I don't claim to have the authoritative answer on this, so people should feel free to correct me, but:
At the end of the day, you're trying to use Snowflake for something it's not optimized for. First, I'm going to run SELECT 1; to demonstrate the lower bound of latency you can ever expect to receive. The result takes 40 ms to return. Looking at the breakdown, that is 21 ms for the query compiler and 19 ms to execute it. The compiler is designed to come up with really smart ways to process huge, complex queries, not to compile small, simple queries quickly.
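For what it's worth, a hedged sketch of reproducing that measurement from the Python connector (connection parameters are placeholders); the client-side wall time also includes network round trips, so expect it to be somewhat higher than the server-reported compile + execute breakdown.
import time
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="me", password="...")  # placeholders
cur = conn.cursor()

start = time.perf_counter()
cur.execute("SELECT 1")
cur.fetchone()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"SELECT 1 round trip: {elapsed_ms:.0f} ms")   # compile + execute + network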
After it has its query plan, it must find worker node(s) to execute it on. A virtual warehouse (VW) is a collection of worker nodes (servers/cloud VMs), with each VW size being a function of how many worker nodes it has, not necessarily the VM size of each worker (e.g. EC2 instance size). So the compiled query gets sent off to a different machine to be run, where a worker process is spun up. Like the query planner, the worker process is not likely optimized to run small queries quickly, so spin-up and tear-down of that process may be involved (at least relative to, say, a PostgreSQL worker process).
Putting my SELECT 1; example aside in favor of a "real" query, let's talk caching. First, Snowflake does not buffer tables in memory the way a typical RDBMS does; RAM is reserved for computation. This makes sense, since in traditional usage you're dealing with tables many GBs to TBs in size, and a typical LRU cache would purge that data before it was ever accessed again anyway. This means a trip to an SSD disk must occur, and this is where your performance starts to depend on how homogeneous or heterogeneous your API queries are. If you're lucky you get a cache hit on SSD; otherwise it's off to S3 to get your table files. Table files are not redundantly cached across all worker nodes, so while the query planner will try to schedule a computation on the node most likely to have the needed files in cache, there is no guarantee that a subsequent query will benefit from the cache left by the first query if it is assigned to a different worker node. The likelihood of this happening increases when you're firing hundreds of queries per second at the VW.
Lastly, this could be the bulk of your problem, but I have saved it for last since I am the least certain about it: a small query can run on a subset of the workers in a virtual warehouse, so the VW can run different queries concurrently on different nodes. BUT I am not sure whether a given worker node can process more than one query at once. If it cannot, your concurrency is limited by the number of nodes in the VW, e.g. a VW with 10 worker nodes can run at most 10 queries in parallel, and what you're seeing is queries piling up at the query planner stage while it waits for worker nodes to free up.
Maybe for this type of workload the new Snowflake feature Search Optimization Service could help you speed up performance ( https://docs.snowflake.com/en/user-guide/search-optimization-service.html ).
I have to agree with Danny C that Snowflake is NOT designed for very low (sub-second) latency on single queries.
To demonstrate this consider the following SQL statements (which you can execute yourself):
create or replace table customer as
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
limit 500000;
-- Execution time 840ms
create or replace table customer_ten as
select *
from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
limit 10;
-- Execution time 431ms
I just ran this on an XSMALL warehouse, and it demonstrates that currently (November 2022) Snowflake can copy HALF A MILLION rows in 840 milliseconds, yet still takes 431 ms to copy just 10 rows.
Why is Snowflake so slow compared to, for example, Oracle 11g on premises?
Well, here's what Snowflake has to complete:
Compile the query and produce an efficient execution plan (plans are not currently cached as they often lead to a sub-optimal plan being executed on data which has significantly increased in volume)
Resume a virtual warehouse (if suspended)
Execute the query and write results to cloud storage
Synchronously replicate the data to two other data centres (typically a few miles apart)
Return OK to the user
Oracle, on the other hand, needs to:
Compile the query (if the query plan is not already cached)
Execute the query
Write results to local disk
If you REALLY want sub-second query performance on SELECT, INSERT, UPDATE and DELETE on Snowflake, it's coming soon. Just check out Snowflake Unistore and Hybrid Tables Explained.
Hope this helps.

Google Cloud SQL Performance?

I switched from a join- and view-based strategy on my old dedicated server to a multiple-small-queries strategy for Google Cloud. For me this is also easier to maintain, and on my dev machine there was no noticeable performance difference. But on App Engine with Cloud SQL it is really slow.
For example, if I want to query the last 50 articles it takes 4-5 seconds; on my dev machine, 160 ms. For each article there are at least 12 queries, 15 on average. That is ~750 queries in total, and when I monitor Cloud SQL I notice that it always caps at ~200 queries per second. The CPU only peaks at 20%; I have just a db-n1-standard-1 with SSD. 200 queries per second also means that if I want to get the last 100 articles it will take 8-9 seconds, and so on.
I already tried setting the App Engine instance class to F4 to see if this would change anything. It didn't; the numbers were the same. I haven't tried increasing the DB instance because I can't see that it is at its limit.
What to do?
Software: I use Go with an unlimited MySQL connection pool.
EDIT: I even changed to the db-n1-standard-2 instance and there was no difference :(
EDIT2: I tried some changes over the weekend (1500 IOPS, 4 cores, etc.) but nothing showed the expected improvements, and the usage graphs were already indicating that there is no "hardware" limit. I managed to isolate the slow query though... it was a super simple one where I query the country name via country ISO2 and language ISO3; both keys are indexed and it still takes 50 ms EACH. So I just cached ALL countries in Memcache and was done.
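Sketched in Python purely for illustration (the asker's app is in Go, and the table, keys, and client libraries here are placeholders), the fix in EDIT2 amounts to loading the small lookup table once and serving it from Memcache instead of paying a 50 ms Cloud SQL round trip per article:
import pymysql                                  # placeholder driver choice
from pymemcache.client.base import Client       # placeholder Memcache client

conn = pymysql.connect(host="10.0.0.5", user="app", password="...", database="blog")  # placeholders
cache = Client(("127.0.0.1", 11211))

def warm_country_cache():
    # Load the whole (small) countries table once and keep it in Memcache.
    with conn.cursor() as cur:
        cur.execute("SELECT iso2, lang_iso3, name FROM countries")   # hypothetical schema
        for iso2, lang, name in cur.fetchall():
            cache.set(f"country:{iso2}:{lang}", name)

def country_name(iso2, lang):
    value = cache.get(f"country:{iso2}:{lang}")
    return value.decode() if value else None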
Google Cloud SQL uses GCE VM instances so things that apply to GCE apply to Cloud SQL.
When you create a db-n1-standard-1 instance, your network throughput is capped at 250 MB/s by your CPU, but your read/write disk throughputᵃ and IOPSᵇ are capped by the storage capacity and type, which is:
Write: 4.8ᵃ | 300ᵇ
Read: 4.8ᵃ | 300ᵇ
(ᵃ disk throughput in MB/s, ᵇ IOPS)
You can check what, if anything, is the limiting factor on your instance's details page:
https://console.cloud.google.com/sql/instances/[INSTANCE_NAME]
If you want to increase the performance of your instance, raise the number of its vCPUs and its storage capacity/type, as suggested above.

How can I use change detection when indexing a SQL view with an Azure Search indexer?

We have Azure Search targeting a table now, but we need to change it to target a view instead. After some reading I realized we cannot use SQL integrated change tracking to do incremental indexer builds. Does that mean the indexer needs a full rebuild each time? The view contains several million rows and each rebuild takes around half an hour. Questions:
Is there a better way to do this to minimize the data latency?
During the indexer rebuild, would it impact the search calls?
Thanks.
You can use the high watermark change detection policy (documented here) with a SQL view. This ensures that when your indexer runs, only rows that have changed according to some high watermark column are indexed.
During the indexer rebuild, would it impact the search calls?
Maybe. Indexing does consume some resources and may impact search latency. It depends on your pricing tier, topology (number of replicas and partitions), and workload. If you use a change detection policy and the number of changed rows indexed during each run of the indexer is relatively small, it probably won't have much of an impact.
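For reference, a hedged sketch of what such a data source definition can look like when created through the REST API; the service name, key, connection string, view name, watermark column, and API version below are all placeholders.
import requests

SERVICE = "https://my-search-service.search.windows.net"    # placeholder
API_KEY = "<admin-api-key>"                                  # placeholder

datasource = {
    "name": "articles-view-ds",
    "type": "azuresql",
    "credentials": {"connectionString": "<sql-connection-string>"},
    "container": {"name": "dbo.ArticlesView"},               # the SQL view
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "RowVersion",             # a column guaranteed to increase on update
    },
}

resp = requests.put(
    f"{SERVICE}/datasources/{datasource['name']}?api-version=2020-06-30",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=datasource,
)
resp.raise_for_status()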
To minimize the data latency when using a view, you can use the high water mark change detection policy (High water mark change detection). But:
Real-time data synchronization isn't possible with an indexer. An indexer can reindex your table at most every five minutes. If data updates need to be reflected in the index sooner, we recommend pushing updated rows directly.
Change detection will not reload the data for you. It just keeps track of the records updated after the last run. You will have to set up a schedule to reload the data, or if you want it to be near real time you can use the search service to push new data directly into the index, although that has quota limits.
If you need to upload a large set of records, change tracking will do it efficiently. Instead of using a schedule, you can also run the indexer on demand via the Run Indexer API; this will reload all updated data for you.
You can track the status of an indexer run via Get Indexer Status.
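A hedged sketch of that push approach using the azure-search-documents Python SDK; the endpoint, index name, key, and document fields are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://my-search-service.search.windows.net",   # placeholder
    index_name="articles-index",                                # placeholder
    credential=AzureKeyCredential("<admin-api-key>"),           # placeholder
)

# Push updated rows straight into the index instead of waiting for the indexer schedule.
results = client.merge_or_upload_documents(documents=[
    {"id": "42", "title": "Updated title", "updatedAt": "2023-01-01T00:00:00Z"},  # illustrative document
])
print([r.succeeded for r in results])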

Google AppEngine server instance clock synchronization

I just came across the following paragraph in the AppEngine documentation for Query Cursors:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
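In code, the pattern the docs describe looks roughly like this; a sketch with the google-cloud-datastore Python client, where the kind and property names are illustrative.
from google.cloud import datastore

client = datastore.Client()

def fetch_updates(cursor=None):
    # Query entities ordered by their last-modified timestamp, resuming from the
    # cursor returned by the previous call; an empty page means nothing changed.
    query = client.query(kind="Article")        # hypothetical kind
    query.order = ["updated_at"]                # hypothetical timestamp property
    iterator = query.fetch(start_cursor=cursor)
    page = next(iterator.pages)
    entities = list(page)
    return entities, iterator.next_page_token   # feed the token back in next time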
For this to work reliably, there would have to be some sort of guarantees about the clock synchronization across different server instances. Otherwise you could get the following scenario:
Server instance 1 (fast clock) saves an update with time-stamp 1000.
Client asks for updates and finds this one update.
Server instance 2 (slow clock) saves another update with time-stamp 950.
Client asks for updates and does NOT find this update as time-stamp didn't increase.
As far as I understood, there never were any such clock synchronization guarantees. Did this change???
Update:
I just realized that even if the clocks were sync'ed perfectly, this approach might miss results due to the eventual consistency of queries. If a later update ends up getting committed before an earlier update and makes it into a simultaneous query while the earlier one doesn't, it will hide the earlier update. Or am I missing something?
The only docs that I found on clocks and Google Cloud Platform are here and here.
According to the first link, instances are synced using an NTP service, and it's done for you.

Sunspot with Solr 3.5. Manually updating indexes for real time search

I'm working with Rails 3 and Sunspot/Solr 3.5. My application uses Solr to index user-generated content and makes it searchable by other users. The goal is to allow users to search this data as soon as possible after it was uploaded. I don't know if this qualifies as real-time search.
My application has two models
Posts
PostItems
I index posts by including data from post items, so that when a user searches based on a description provided in a post_item record, the corresponding post object is returned in the search.
Users frequently update post_items, so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item is available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index!
which, according to this documentation, instantly updates the index and commits. This works, but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break Solr, and frequent manual index calls are not the way to go.
Any suggestions on the right way to do this? Are there alternatives other than switching to Elasticsearch?
Try this gem: https://github.com/bdurand/sunspot_index_queue
You will then be able to batch reindex, let's say, every minute, and it definitely will not break the index.
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. The conclusion was that Solr was built/optimized for indexing huge documents (Word/PDF content) in large numbers (billions?), with the index updated once a day or every couple of days while nobody is searching.
It was the wrong choice for a consumer Rails application where documents are small, smaller in number (in the millions), updates are random and continuous, and search needs to be somewhat real-time (a delay of 5-10 sec is fine).
Some of the tricks we applied to tune the server:
removed all explicit commits (i.e., the ! calls) from Rails code,
use Solr auto-commit every 5/20 seconds,
have master/slave configuration,
run index optimization (on master) every hour,
and more.
and we still see high CPU usage on the slaves when the commit triggers. As a result some searches take a long time (> 60 seconds at times).
Also, I doubt the batch-indexing sunspot_index_queue gem can remedy the high CPU issue.
