I don't understand how some parameters of Query Store in MS SQL Server work

According to the official documentation, sys.database_query_store_options exposes the options that adjust Query Store workflow and performance.
From the documentation:
"flush_interval_seconds - The period for regular flushing of Query Store data to disk in seconds. Default value is 900 (15 min)"
"interval_length_minutes - The statistics aggregation interval in minutes. Arbitrary values are not allowed. Use one of the following: 1, 5, 10, 15, 30, 60, and 1440 minutes. The default value is 60 minutes."
And now I have a problem:
If Query Store flushes data to disk every 15 minutes, why do I see a query in the QS tables within seconds of its execution?
As I understand it, the QS tables are 'permanent' and stored in the database (on disk), so how does the flush_interval_seconds parameter actually work?
The same goes for interval_length_minutes: I saved the QS output 1 minute after the last query execution and again after 61 minutes, and the two were more or less the same, so what about this aggregation?

flush_interval_seconds - the period for regular flushing of Query Store data to disk, in seconds. This means flushing from memory to disk so that the information is not lost after a server restart. Before the flush, you are simply reading the same information from memory.
interval_length_minutes - the aggregation interval for query runtime statistics. The lower it is, the finer the granularity of the runtime statistics becomes.
Neither option sets a period after which the information becomes available.
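To see this in practice, you can read the current settings from the view and, if you want Query Store data persisted to disk (or aggregated) more often, lower the intervals. A minimal sketch, assuming a database named YourDatabase; the option names are the documented QUERY_STORE settings:

SELECT actual_state_desc,
       flush_interval_seconds,
       interval_length_minutes
FROM sys.database_query_store_options;

ALTER DATABASE [YourDatabase]            -- hypothetical database name
SET QUERY_STORE (
    DATA_FLUSH_INTERVAL_SECONDS = 300,   -- persist in-memory QS data every 5 minutes
    INTERVAL_LENGTH_MINUTES = 15         -- finer-grained runtime statistics buckets
);

Lowering either value trades some write overhead for durability and granularity; it still does not change when the data becomes visible, since the QS views are served from memory until the flush happens.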

Related

Best way to handle time-consuming queries in InfluxDB

We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store that in cache.
The most time-consuming part of the query is probably the comparisons and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In cases like this, where you let the user select a range in days, I would suggest keeping another table that stores the result (min, max and avg) of each day as a document. This table can be populated by a job that runs after the end of each day.
You can also consider changing the granularity from one document per day to one per week or per month, based on how you plot the values, and add more fields, such as tagname in your case.
The reason this is superior to using a cache: with a cache you store the result of one specific query, so every different combination still has to be computed in real time. Here, the cumulative results are already available, and there is a much smaller dataset to compute over.
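As a rough illustration of that idea in generic SQL (all table and column names here are made up for the example):

-- One pre-aggregated row per day and tagname
CREATE TABLE daily_rollup (
    day        DATE,
    tagname    VARCHAR(64),
    min_value  DOUBLE PRECISION,
    max_value  DOUBLE PRECISION,
    avg_value  DOUBLE PRECISION,
    PRIMARY KEY (day, tagname)
);

-- End-of-day job: fold yesterday's raw points into one row per tag
INSERT INTO daily_rollup (day, tagname, min_value, max_value, avg_value)
SELECT CAST(measured_at AS DATE), tagname,
       MIN(value), MAX(value), AVG(value)
FROM raw_measurements                    -- hypothetical source table
WHERE measured_at >= CURRENT_DATE - 1
  AND measured_at <  CURRENT_DATE
GROUP BY CAST(measured_at AS DATE), tagname;

A report over an 8-week range then scans at most 56 pre-aggregated rows per tag instead of the raw series.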
Based on your query, I assume you are using InfluxDB v1.x. You could try Continuous Queries (CQs), which are InfluxQL queries that run automatically and periodically on real-time data and store their results in a specified measurement.
In your case, you could create a CQ for each report and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' // note that the time filter is not here
GROUP BY time(1h), "tagname" // here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate // here you can pass the user's time filter
Since InfluxDB runs these aggregates continuously at the specified interval, you are effectively trading space for time: the report query only scans pre-computed rows.

Two question about Time Travel storage-costs in snowflake

I have read the Snowflake documentation a lot. Snowflake incurs storage costs when data is updated.
"tables-storage-considerations.html" mentions:
As an extreme example, consider a table with rows associated with
every micro-partition within the table (consisting of 200 GB of
physical storage). If every row is updated 20 times a day, the table
would consume the following storage:
Active 200 GB | Time Travel 4 TB | Fail-safe 28 TB | Total Storage 32.2 TB
My first question: if a periodic task runs 20 times a day and each run updates exactly one row in every micro-partition, does the table still consume 32.2 TB of total storage?
"data-time-travel.html" mentioned that:
Once the defined period of time has elapsed, the data is moved into
Snowflake Fail-safe and these actions can no longer be performed.
So my second question is: why does Fail-safe cost 28 TB and not 24 TB (i.e. 28 TB reduced by the 4 TB already counted under Time Travel)?
https://docs.snowflake.com/en/user-guide/data-cdp-storage-costs.html
https://docs.snowflake.com/en/user-guide/tables-storage-considerations.html
https://docs.snowflake.com/en/user-guide/data-time-travel.html
First question: yes. What matters is that the micro-partition changes, not how many rows within it change; updating a single row still rewrites the whole micro-partition.
Question 2: Fail-safe retains 7 days of data, and the table churns 4 TB per day: 4 TB × 7 = 28 TB. Fail-safe is counted separately from Time Travel, so the 4 TB sitting in Time Travel is not deducted from it.
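If you want to verify the breakdown on your own tables, Snowflake exposes per-table byte counts in the ACCOUNT_USAGE schema (the table name below is just an example):

SELECT table_name,
       active_bytes      / POWER(1024, 4) AS active_tb,
       time_travel_bytes / POWER(1024, 4) AS time_travel_tb,
       failsafe_bytes    / POWER(1024, 4) AS failsafe_tb
FROM snowflake.account_usage.table_storage_metrics
WHERE table_name = 'MY_TABLE';           -- hypothetical table name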

Can snowflake work as an operational data store against which I can write rest APIs

I am researching the Snowflake database and have a data aggregation use case where I need to expose the aggregated data via a REST API. While the data ingestion and aggregation seem well defined, is Snowflake a system that can be used as an operational data store serving high-throughput APIs?
Or is this an anti-pattern for the system?
Updating based on your recent comment.
Here are some quick test results I ran on large tables we have in production. *Table names changed for display.
vLookupView records = 175,760,316
vMainView records = 179,035,026
SELECT
    LP.REGIONCODE
  , SUM(M.VALUE)
FROM DBO.vLookupView AS LP
INNER JOIN DBO.vMainView AS M
    ON LP.PK = M.PK
GROUP BY LP.REGIONCODE;
Results:
SQL SERVER
Production box - 2:04 minutes
Snowflake:
By Warehouse (compute) size
XS - 17.1 seconds
Small - 9.9 seconds
Medium - 7.1 seconds
Large - 5.4 seconds
Extra Large - 5.4 seconds
When I added a WHERE condition
WHERE M.ENTEREDDATE BETWEEN '1/1/2018' AND '6/1/2018'
the results were:
SQL SERVER
Production box - 5 seconds
Snowflake:
By Warehouse (compute) size
XS - 12.1 seconds
Small - 3.9 seconds
Medium - 3.1 seconds
Large - 3.1 seconds
Extra Large - 3.1 seconds

Paginated searching... does performance degrade heavily after N records?

I just tried the following query on YouTube:
http://www.youtube.com/results?search_query=test&search=tag&page=100
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder: does performance start to degrade with paginated searches after N records (I'm specifically wondering about ROW_NUMBER() in SQL Server, or similar features in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50
Yes. High offsets are slow and inefficient.
The only way to find the records at an offset, is to compute all records that came before and then discard them.
(I don't know ROW_NUMBER(), but in MySQL this would be LIMIT, so:
SELECT * FROM table LIMIT 1999,20
)
...in the above example, the first 1999 records have to be fetched and then discarded before the 20 you want can be returned. Generally the engine can't skip ahead or use indexes to jump right to the correct location in the data, because there will normally be a WHERE clause filtering the results.
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are over a 'small' set of known tags, so caching is quite feasible. An arbitrary search query would have far too many variations to cache, making it impractical.)
(Alternatively, it might be using some other implementation that does allow arbitrary offsets.)
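For reference, a sketch of the SQL Server equivalents of that LIMIT; the table and column names are made up, and both forms still read and discard the skipped rows:

-- ROW_NUMBER() approach
SELECT *
FROM (
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.id) AS rn
    FROM my_table AS t                   -- hypothetical table
) AS numbered
WHERE rn BETWEEN 2000 AND 2019;

-- OFFSET/FETCH (SQL Server 2012 and later)
SELECT *
FROM my_table
ORDER BY id
OFFSET 1999 ROWS FETCH NEXT 20 ROWS ONLY;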
Other places talking about similar things:
http://sphinxsearch.com/docs/current.html#conf-max-matches
Back-of-the-envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
...
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
...
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well; if indexes can be used, the difference is less pronounced and harder to see. But in a production system running lots of queries, even a 1 or 2 ms difference is huge.)
Update (to show an indexed query):
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
...
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
...
10 rows in set (1.70 sec)
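The usual workaround for exactly this cost is keyset (seek) pagination: instead of an offset, remember the last sort value from the previous page and seek past it with an indexed comparison. A sketch against the same table, assuming an index on imagetaken; the placeholder date is illustrative:

-- Next page: seek past the last imagetaken value already shown
SELECT gridimage_id, imagetaken
FROM gridimage_search
WHERE imagetaken > '2010-06-01 00:00:00'   -- last value from the previous page
ORDER BY imagetaken
LIMIT 10;

If imagetaken is not unique you need a tiebreaker column in both the ORDER BY and the comparison so rows aren't skipped; the trade-off is that you can only step page by page rather than jump to an arbitrary page number.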
It's effectively a TOP clause, designed to limit the number of physical reads the database has to perform, which in turn limits how long the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google have to limit the volume returned so other users aren't affected by generic searches. What's faster, returning 1,000 results or 82,000,000,000 results?

Measure transaction log throughput?

After reading Kim Tripp's article on transaction log throughput and discovering that I have gazillions of VLFs, I'm planning to restructure the logs as she outlined. I want to measure the resulting increase in log throughput to see if the fragmentation makes a difference on my servers, but I'm at a loss as to how to do so. I couldn't find anything in the BOL or Google on measuring log throughput, and the best strategy I've been able to cobble together is to see if the average wait time per task for LOGBUFFER and WRITELOG waits decreases.
SELECT wait_type,
       (wait_time_ms - signal_wait_time_ms) * 1.0 / waiting_tasks_count AS [Wait (ms) per Task]
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('LOGBUFFER', 'WRITELOG');
Is there something more definitive, perhaps akin to the perfmon database throughput counters (http://technet.microsoft.com/en-us/library/ms189883.aspx)?
select * from sys.dm_os_performance_counters
where counter_name in ('Log Flushes/sec'
,'Log Bytes Flushed/sec'
,'Log Flush Waits/sec'
,'Log Flush Wait Time')
and instance_name = '<dbname>';
These being performance counters, you need to compute the actual value from the raw value. For the 'Log Flush Wait Time' counter, which is of type 65792 (i.e. NumberOfItems64), this is easy: the raw value is the value. The other ones are of type 272696576 (i.e. RateOfCountsPerSecond64), for which the value is computed by dividing the delta of two consecutive raw values by the number of seconds that passed between taking the two samples.
The easier alternative is to fire up Perfmon.exe and look at the corresponding performance counters.
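A minimal sketch of that delta computation for one of the per-second counters (the database name is illustrative, and the 10-second sample interval is arbitrary):

DECLARE @v1 BIGINT, @v2 BIGINT, @t1 DATETIME2 = SYSUTCDATETIME();

-- First sample of the raw counter value
SELECT @v1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Log Bytes Flushed/sec'
  AND instance_name = 'MyDatabase';      -- hypothetical database name

WAITFOR DELAY '00:00:10';                -- sample interval

-- Second sample, then divide the delta by the elapsed seconds
SELECT @v2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Log Bytes Flushed/sec'
  AND instance_name = 'MyDatabase';

SELECT (@v2 - @v1) / DATEDIFF(SECOND, @t1, SYSUTCDATETIME()) AS [Log Bytes Flushed/sec];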
