Presto running slower than SQL Server

I configured the SQL Server connector in Presto and tried a few simple queries like:
Select count(0) from table_name
or,
Select sum(column_name) from table_name
Both of the above queries run in SQL Server in about 300 ms, but in Presto each takes over 3 minutes.
This is the EXPLAIN ANALYZE output of the second query (it seems to do a table scan and fetch a huge amount of data before doing the sum). Why couldn't it push down the sum operator to SQL Server itself?
Query Plan
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fragment 1 [SINGLE]
Cost: CPU 2.98ms, Input: 1 row (9B), Output: 1 row (9B)
Output layout: [sum]
Output partitioning: SINGLE []
- Aggregate(FINAL) => [sum:double]
Cost: ?%, Output: 1 row (9B)
Input avg.: 1.00 lines, Input std.dev.: 0.00%
sum := "sum"("sum_4")
- LocalExchange[SINGLE] () => sum_4:double
Cost: ?%, Output: 1 row (9B)
Input avg.: 0.06 lines, Input std.dev.: 387.30%
- RemoteSource[2] => [sum_4:double]
Cost: ?%, Output: 1 row (9B)
Input avg.: 0.06 lines, Input std.dev.: 387.30%
Fragment 2 [SOURCE]
Cost: CPU 1.67m, Input: 220770667 rows (1.85GB), Output: 1 row (9B)
Output layout: [sum_4]
Output partitioning: SINGLE []
- Aggregate(PARTIAL) => [sum_4:double]
Cost: 0.21%, Output: 1 row (9B)
Input avg.: 220770667.00 lines, Input std.dev.: 0.00%
sum_4 := "sum"("total_base_dtd")
- TableScan[sqlserver:sqlserver:table_name:ivpSQLDatabase:table_name ..
Cost: 99.79%, Output: 220770667 rows (1.85GB)
Input avg.: 220770667.00 lines, Input std.dev.: 0.00%
total_base_dtd := JdbcColumnHandle{connectorId=sqlserver, columnName=total_base_dtd, columnType=double}

Both example queries are aggregate queries that produce a single-row result.
Currently, in Presto it is not possible to push down an aggregation to the underlying data store. Conditions and column selection (narrowing projections) are pushed down, but aggregations are not.
As a result, when you query SQL Server from Presto, Presto needs to read all the data (from the given column) to do the aggregation, so there is a lot of disk and network traffic. Also, it may be that SQL Server can optimize away certain aggregations and skip reading the data entirely (I am guessing here).
Presto is not suited to being a frontend to some other database. It can be used as such, but this has some implications. Presto shines when it is put to work as a big data query engine (over S3, HDFS or other object stores) or as a federated query engine, where you combine data from multiple data stores / connectors.
Edit: there is ongoing work in Presto to improve pushdown, including aggregate pushdown. You can track it at https://github.com/prestosql/presto/issues/18

Presto doesn't support aggregate pushdown, but as a workaround you can create views in the source database (SQL Server in your case) and query those views from Presto.
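For example (a sketch only: the view name below is made up for illustration, and the Presto-side catalog/schema names depend on how your connector is configured), you can pre-aggregate inside a SQL Server view so that Presto only has to read the already-reduced result:
-- On SQL Server: the SUM runs inside SQL Server, close to the data.
CREATE VIEW dbo.vw_table_name_sum AS
SELECT SUM(total_base_dtd) AS total_base_dtd_sum
FROM dbo.table_name;
-- From Presto: this fetches one pre-aggregated row instead of ~220M rows.
SELECT total_base_dtd_sum
FROM sqlserver.dbo.vw_table_name_sum;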

Related

Indexed query to decimate time series results

The context here is that I'm scoping a design that slices time-series data at user-defined intervals - too many permutations to simply roll up the data (e.g., a second, hourly table, rolled up during (or after) data ingest). I am not a database expert, so I am hoping to learn whether there are standard approaches to this that rely on table indexes, rather than duplicating the data in a new table/collection or otherwise encoding it a priori by file structure, etc. Especially if there is a particular DB suited to or supporting this load. A prototype of the backend is in Mongo, but we can easily pivot to a more suitable store at this time.
QUESTION: In any mainstream database or similar, is it possible to query a ~time series over a given time window, (efficiently) returning only data points at a consistent interval? (What DB and example query, specifically?)
My data is a few tens of GB today, but growing if we're successful. I'd expect indexes against the timestamp to continue to fit ~OK in memory. A custom file-based schema such as Parquet might work, but an off-the-shelf DB would be ideal. By consistent interval, I mean some "skip" meaningful to a human, such as
"every nth value", if the data could be assumed to be at a reliable cadence
or perhaps "first sample per hour", if not and the timestamps were epoch times
E.g., if my data is
ts      value
1001    1
1002    2
1003    3
... continuous ...
1010    10
1011    11
... etc ...
... large data set
and query had the parameters:
- skip_value:10
- ts:
- after:1000
- before:2000
the returned set would be:
[ (1001, 1), (1011,11), (1021,21) .... (1991,991) ]
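As a sketch of the "every nth value" variant (an illustration only: the question does not name a database, so this assumes one with standard window-function support such as PostgreSQL or SQL Server, and a hypothetical table samples(ts, value)):
SELECT ts, value
FROM (
    -- Number the rows inside the requested time window, ordered by timestamp.
    SELECT ts, value,
           ROW_NUMBER() OVER (ORDER BY ts) AS rn
    FROM samples                     -- hypothetical table name
    WHERE ts > 1000 AND ts < 2000    -- the after / before parameters
) numbered
WHERE (rn - 1) % 10 = 0              -- skip_value: 10, keep every 10th row
ORDER BY ts;
An index on ts serves the range predicate, but the skip itself is computed over the scanned window, so the database still reads every row in the window rather than jumping straight to every 10th one.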

Can Snowflake work as an operational data store against which I can write REST APIs

I am researching the Snowflake database and have a data aggregation use case where I need to expose the aggregated data via a REST API. While the data ingestion and aggregation seem to be well defined, is Snowflake a system that can be used as an operational data store for servicing high-throughput APIs?
Or is this an anti-pattern for this system?
Updating based on your recent comment.
Here are some quick test results I did on large tables we have in production. (Table names changed for display.)
vLookupView records = 175,760,316
vMainView records = 179,035,026
SELECT
LP.REGIONCODE
, SUM(M.VALUE)
FROM DBO.vLookupView AS LP
INNER JOIN DBO.vMainView AS M
ON LP.PK = M.PK
GROUP BY LP.REGIONCODE;
Results:
SQL SERVER
Production box - 2:04 minutes
Snowflake:
By Warehouse (compute) size
XS - 17.1 seconds
Small - 9.9 seconds
Medium - 7.1 seconds
Large - 5.4 seconds
Extra Large - 5.4 seconds
When I added a WHERE condition
WHERE M.ENTEREDDATE BETWEEN '1/1/2018' AND '6/1/2018'
the results were:
SQL SERVER
Production box - 5 seconds
Snowflake:
By Warehouse (compute) size
XS - 12.1 seconds
Small - 3.9 seconds
Medium - 3.1 seconds
Large - 3.1 seconds
Extra Large - 3.1 seconds

MongoDB and Arctic

I intend to analyse multiple data sets on the same time series (daily EOD). I will need to use computed columns: use column A + column B to create column C (storing the net result of the calculation in column C). Is this functionality available using the MongoDB / Arctic database?
I would also intend to search the data, for example: what happens when the advance decline thrust pushes over 70 while the cumulative TICK was below -100,000 in the past 'n' days?
Two data sets: cumulative TICK and the advance decline thrust (which uses advancers / decliners data). They would be stored in the database, and then I would want the capability to search for the above condition. Is this achievable with the MongoDB / Arctic database structure?
Just looking for some general information before I move to a DB format. Currently everything I have created is in Excel / VBA, and it has already outgrown that!
Any information greatly appreciated.
Note: I will use the same database for weekly, monthly, yearly and 1 minute, 3 minute, 5 minute, 60 minute TICK/TIME based bars - not feeding live, but updated EOD.
Yes, this can be done with Arctic. Arctic can store pandas DataFrames, and an operation like the one you mentioned is trivial in pandas. Arctic is just a store, so you'd want to read the data out of Arctic (data is stored in symbols in Arctic), perform your transform, and then write the data back. Any of the storage engines (VersionStore, TickStore, or ChunkStore) should work for this.

Lucene query performance

I have a Lucene index which has close to 480M documents. The size of the index is 36 GB. I ran around 10,000 queries against the index. Each query is a boolean AND query with 3 term queries inside; that is, the query has 3 operands which MUST occur. Executing such 3-word queries gives the following latency percentiles.
50th = 16 ms
75th = 52 ms
90th = 121 ms
95th = 262 ms
99th = 76010 ms
99.9th = 76037 ms
Is the latency expected to degrade when the number of docs is as high as 480M? All the segments in the index are merged into one segment. Even when the segments are not merged, the latencies are not very different. Each document has 5-6 stored fields, but as mentioned above, these latencies are for boolean queries that don't access any stored fields; they just do a posting-list lookup on 3 tokens.
Any ideas on what could be wrong here?

Performance effect of Synonyms over a linked server in SQL Server

On localserver (a SQL Server 2008 R2), I have a synonym called syn_view1 pointing to the linked server remoteserver.remotedb.dbo.view1
This SLOW query takes 20 seconds to run.
select e.column1, e.column2
from syn_view1 e
where e.column3 = 'xxx'
and e.column4 = 'yyy'
order by e.column1
This FAST query takes 1 second to run.
select e.column1, e.column2
from remoteserver.remotedb.dbo.view1 e
where e.column3 = 'xxx'
and e.column4 = 'yyy'
order by e.column1
The only difference in the two queries is really the presence of the synonym.
Obviously, the synonym has an impact on the performance of the query.
The execution plan for the SLOW query is :
Plan Cost % Subtree cost
4 SELECT
I/O cost: 0.000000 CPU cost: 0.000000 Executes: 0
Cost: 0.000000 0.00 3.3521
3 Filter
I/O cost: 0.000000 CPU cost: 0.008800 Executes: 1
Cost: 0.008800 0.26 3.3521
2 Compute Scalar
I/O cost: 0.000000 CPU cost: 3.343333 Executes: 1
Cost: 0.000000 0.00 3.3433
1 Remote Query
I/O cost: 0.000000 CPU cost: 3.343333 Executes: 1
Cost: 3.343333 99.74 3.3433
And for the FAST query:
Plan Cost % Subtree cost
3 SELECT
I/O cost: 0.000000 CPU cost: 0.000000 Executes: 0
Cost: 0.000000 0.00 0.1974
2 Compute Scalar
I/O cost: 0.000000 CPU cost: 0.197447 Executes: 1
Cost: 0.000000 0.00 0.1974
1 Remote Query
I/O cost: 0.000000 CPU cost: 0.197447 Executes: 1
Cost: 0.197447 100.00 0.1974
My understanding is that in the SLOW query, the server fetches all the data from the remote server and then applies the filter (without using an index), whereas in the FAST query the server fetches the filtered data from the remote server, thus using the remote indexes.
Is there any way to use the synonym while keeping the query fast?
Maybe a setting on the linked server, or on the local database server?
Thanks for the help!
I would dump the data, without the ORDER BY, into a temp table on the local server, then select from the temp table with the ORDER BY. ORDER BY is almost always the killer.
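A rough T-SQL sketch of that workaround (the temp table name is made up; the synonym and column names come from the question):
-- Pull the filtered rows into a local temp table, with no ORDER BY,
-- so the remote side only has to do the filtering.
SELECT e.column1, e.column2
INTO #stage
FROM syn_view1 e
WHERE e.column3 = 'xxx'
  AND e.column4 = 'yyy';
-- Sort locally from the temp table.
SELECT column1, column2
FROM #stage
ORDER BY column1;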
The accepted answer for this post on dba.stackexchange.com notes that performance gotchas may occur in queries over linked servers due to limited access rights on the linked server, which restrict the visibility of the table statistics to the local server. This can affect the query plan, and thus performance.
Excerpt:
And this is why I got different results. When running as sysadmin I got the full distribution statistics which indicated that there are no rows with order ID > 20000, and the estimate was one row. (Recall that the optimizer never assumes zero rows from statistics.) But when running as the plain user, DBCC SHOW_STATISTICS failed with a permission error. This error was not propagated, but instead the optimizer accepted that there were no statistics and used default assumptions. Since it did get cardinality information, it learnt that the remote table has 830 rows, whence the estimate of 249 rows.
