Oracle Query with Unknown WHERE clause

We have a table with 100 columns. The end user is allowed to write any query trying to search on practically any column.
Essentially, they construct the query dynamically from a screen and the WHERE condition can have any number or combination of columns.
Example
select * from my_tab where col1=x
select * from my_tab where col1=x and col2=y and ....col10 =q
select * from my_tab where col10=a and col20=4 and col30=r
While this is possible syntactically, the biggest problem is performance, because you cannot have indexes for every possible combination of columns.
I know this seems to be the "query from hell", but still:
What other approaches (both technical and non-technical) could address this problem?
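For context, one classic pattern for this situation (not from the original question; the bind variable names are illustrative) is a single static statement with one optional bind variable per searchable column, which keeps the SQL injection-safe and plan-cache friendly, though it usually forces a full scan:

SELECT *
FROM my_tab
WHERE (:p_col1 IS NULL OR col1 = :p_col1)
  AND (:p_col2 IS NULL OR col2 = :p_col2)
  -- ...one such predicate per searchable column; unused filters are bound as NULL

The usual alternative is to concatenate only the predicates the user actually filled in, as dynamic SQL with bind variables, which at least gives the optimizer a chance to pick a suitable index for each combination that has one.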


MSSQL select query with prioritized OR

I need to build one MSSQL query that selects one row that is the best match.
Ideally, we have a match on street, zip code and house number.
Only if that delivers no results is a match on just street and zip code sufficient.
I have this query so far:
SELECT TOP 1 * FROM realestates
WHERE
(Address_Street = '[Street]'
AND Address_ZipCode = '1200'
AND Address_Number = '160')
OR
(Address_Street = '[Street]'
AND Address_ZipCode = '1200')
MSSQL currently gives me the result where the Address_Number is NOT 160, so it seems like the 2nd clause (where only street and zipcode have to match) is taking precedence over the 1st. If I switch around the two OR clauses, same result :)
How could I prioritize the first OR clause, so that MSSQL stops looking for other results if we found a match where the three fields are present?
The problem here isn't the WHERE clause (though it is a "problem"), it's the lack of an ORDER BY. You have a TOP (1), but nothing that tells the data engine which row is the "top" row, so an arbitrary row is returned. You need to provide logic in the ORDER BY clause to tell the data engine which is the "first" row. With the rudimentary logic you have in your question, this would likely be:
SELECT TOP (1)
{Explicit Column List}
FROM realestates
WHERE Address_Street = '[Street]'
AND Address_ZipCode = '1200'
ORDER BY CASE Address_Number WHEN '160' THEN 1 ELSE 2 END;
You can't prioritize anything in the WHERE clause. It always results in ALL the matching rows. What you can do is use TOP or FETCH to limit how many results you will see.
However, in order for this to be effective, you MUST have an ORDER BY clause. SQL tables are unordered sets by definition. This means without an ORDER BY clause the database is free to return rows in any order it finds convenient. Mostly this will be the order of the primary key, but there are plenty of things that can change this.
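For illustration, here is the same idea in the FETCH form (a sketch against the realestates table above, keeping the placeholder column list):

SELECT {Explicit Column List}
FROM realestates
WHERE Address_Street = '[Street]'
  AND Address_ZipCode = '1200'
ORDER BY CASE Address_Number WHEN '160' THEN 1 ELSE 2 END
OFFSET 0 ROWS FETCH NEXT 1 ROWS ONLY; -- ORDER BY is mandatory for OFFSET/FETCH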

Does SQL Server optimize what data needs to be read?

I've been using BigQuery / Spark for a few years, but I'm not very familiar with SQL Server.
If I have a query like
with x AS (
SELECT * FROM bigTable1
),
y AS (SELECT * FROM bigTable2)
SELECT COUNT(1) FROM x
Will SQL Server be "clever" enough to ignore the pointless data fetching?
Note: Due to the configuration of my environment I don't have access to the query planner for troubleshooting.
Like most leading professional DBMSs, SQL Server has a statistical (cost-based) optimizer that will indeed eliminate data sources that are never used and cannot affect the results.
Note, however, that this does not apply to certain kinds of errors: if bigTable1 or bigTable2 does not exist (or you cannot access it), the query will still throw a compile error, even though it would never actually read those tables.
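If you later get access to an environment where diagnostics are allowed, one quick sanity check (a sketch reusing the tables from the question) is to look at the I/O statistics:

SET STATISTICS IO ON;
WITH x AS (SELECT * FROM bigTable1),
     y AS (SELECT * FROM bigTable2)
SELECT COUNT(1) FROM x;
-- the Messages pane should report logical reads for bigTable1 only;
-- y is never referenced, so bigTable2 is eliminated from the plan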
SQL Server arguably has the most advanced optimizer of all professional RDBMSs (IBM DB2, Oracle...).
Before optimizing, the algebrizer transforms the query, which is only a declarative "demand" (not execution code), into a mathematical form known as an "algebraic tree". This object is a formula of the relational algebra that underpins relational DBMSs (a mathematical theory developed by Edgar F. Codd in the early 1970s).
The very first optimization step happens here, by simplifying the formula the way you would simplify a polynomial expression in x and y (for example: 2x - 3y = 3x² - 5x + 7 <=> y = (7x - 3x² - 7) / 3).
Query example (from Chris Date's "A Cure for Madness"):
With the table:
CREATE TABLE T_MAD (TYP CHAR(4), VAL VARCHAR(16));
INSERT INTO T_MAD VALUES
('ALFA', 'ted'),('ALFA', 'chris'),('ALFA', 'michael'),
('NUM', '123'),('NUM', '4567'),('NUM', '89');
This query will fail:
SELECT * FROM T_MAD
WHERE VAL > 1000;
It fails because VAL is a string column: values like 'ted' cannot be converted to a number for the comparison in the WHERE clause.
But our table distinguishes ALFA values from NUM values, so we can add a restriction on the TYP column like this:
SELECT * FROM T_MAD
WHERE TYP = 'NUM' AND VAL > 1000;
Now the query gives the right result...
But so do all of these queries:
SELECT * FROM T_MAD
WHERE VAL > 1000 AND TYP = 'NUM';
SELECT * FROM
(
SELECT * FROM T_MAD WHERE TYP = 'NUM'
) AS T
WHERE VAL > 1000;
SELECT * FROM
(
SELECT * FROM T_MAD WHERE VAL > 1000
) AS T
WHERE TYP = 'NUM';
The last one is particularly interesting, because its subquery is exactly the first query, the one that fails on its own...
So why doesn't that subquery fail here?
In fact, the algebrizer rewrites all of these queries into a simpler form that leads to a similar (not to say identical) formula...
Just look at the query execution plans, which turn out to be strictly identical!
Note that DBMSs like MySQL, MariaDB or PostgreSQL will fail on the last one... An optimizer like this takes a very large team of developers and researchers, which open/free projects cannot easily mimic!
Second, the optimizer has heuristic rules that apply essentially at the semantic level. The execution plan is simplified when contradictory conditions appear in the query text...
Just have a look over those two queries :
SELECT * FROM T_MAD WHERE 1 = 2;
SELECT * FROM T_MAD WHERE 1 = 1;
The first one returns no rows, while the second returns all the rows of the table... So what does the optimizer do? The query execution plan gives the answer:
The operator labelled "Analyse de constante" ("Constant Scan" in the English version of SSMS) in the first plan means that the optimizer never accesses the table... This is similar to what happens with your last subquery above...
Note that every constraint (PK, FK, UNIQUE, CHECK) can help the optimizer simplify the query execution plan and improve performance!
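As a minimal sketch of that point, reusing the T_MAD table above (the constraint name is made up): once a trusted CHECK constraint exists, the optimizer can prove that some predicates are contradictions and answer without touching the table.

ALTER TABLE T_MAD ADD CONSTRAINT CHK_TYP CHECK (TYP IN ('ALFA', 'NUM'));
-- the optimizer can now deduce that no row can satisfy TYP = 'BETA',
-- so the plan is a Constant Scan with no table access at all
SELECT * FROM T_MAD WHERE TYP = 'BETA';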
Third, the optimizer uses statistics, histograms computed on the data distribution, to predict how many rows will be manipulated at each step of the query execution plan...
There is much more to say about the SQL Server query optimizer, such as the fact that it works in reverse compared to the other optimizers, a technique that has let it suggest missing indexes for some 18 years, which the other RDBMSs still cannot do!
PS: sorry for using the French version of SSMS... I work in France and help professionals optimize their databases!

flask-sqlalchemy slow paginate count

I have a Postgres 10 database behind my Flask app. I'm trying to paginate filtered results on a table with millions of rows. The problem is that the paginate method counts the total number of query results in a totally inefficient way.
Here's an example with a dummy filter:
paginate = Buildings.query.filter(Buildings.height > 10).paginate(1, 10)
Under the hood it performs two queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
SELECT * FROM buildings where height > 10
) AS anon_1
-- count returns 200,000 rows
The problem is that the count on the raw SELECT without the subquery is quite fast (~30 ms), but the paginate method wraps it in a subquery that takes ~30 s.
The query plan on a cold database:
Is there an option of using default paginate method from flask-sqlalchemy in performant way?
EDIT:
To give a better understanding of my problem, here are the real filter operations used in my case, but with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1,10)
So the SQL the ORM produces is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
The filter columns in that query are already covered by these indexes:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count function, which is very slow.
So it looks like a disk-reading problem. The solutions would be to get faster disks, get more RAM so that it can all be cached, or, if you already have enough RAM, use pg_prewarm to get all the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one I/O request outstanding at a time.
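A hedged sketch of those suggestions (the extension must be available in your installation, and the value for effective_io_concurrency is only a starting point to tune for your storage):

-- load the table and its trigram index into the buffer cache ahead of need
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('public.buildings');
SELECT pg_prewarm('public.ix_trgm_buildings_address');
-- allow the bitmap heap scan to keep several I/O requests in flight
ALTER SYSTEM SET effective_io_concurrency = 200;
SELECT pg_reload_conf();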
Your actual query seems to be more complex than the one you show, based on the Filter: entry and the Rows Removed by Index Recheck: entry, in combination with the lack of lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary btree index on "height").

Snowflake Performance issues when querying large tables

I am trying to query a table which has 1 TB of data clustered by Date and Company. A simple query is taking a long time.
Posting the query and query profile:
SELECT
sl.customer_code,
qt_product_category_l3_sid,
qt_product_brand_sid,
sl.partner_code,
sl.transaction_id,
dollars_spent,
units,
user_pii_sid,
promo_flag,
media_flag
FROM
cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
WHERE
transaction_date_id >= (to_char(current_date - (52*7) , 'yyyymmdd') )
AND sl.partner_code IN ('All Retailers')
AND qt_product_category_l3_sid IN (SELECT DISTINCT qt_product_category_l3_sid
FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
WHERE qt_product_category_l1_sid IN (246))
AND qt_product_brand_sid IN (SELECT qt_product_brand_sid
FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
WHERE qt_product_major_brand_sid IN (246903, 430138))
(query profile screenshot not included)
"simple query" I am not sure there is such a thing. A naive query, sure.
select * from really_large_table where column1 = value;
will perform really badly if you only care about one or two of the columns, as Snowflake has to load all the data. You get an improvement in the ratio of column data to row data by using
select column1, column2 from really_large_table where column1 = value;
Now only two columns of data need to be read from the data store.
Maybe you are looking for data where the value is > 100 because you think that should not happen. Then
select column1, column2 from really_large_table where column1 > 100 limit 1;
will perform much better than
select column1, column2 from really_large_table order by column1 desc limit 50;
But if you are already doing the minimum work needed to get a correct answer, your next option is to increase the warehouse size, which for I/O-bound work gives a linear improvement, though some aggregation steps don't scale as linearly.
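Resizing is a one-line operation (the warehouse name here is hypothetical):

ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';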
Another thing to look for is that sometimes your calculation produces too much intermediate state and "spills externally" (the exact wording may differ), which is much like running out of RAM and swapping to disk.
We have also seen memory pressure when doing too much work in a JavaScript UDF, which slowed things down.
But most of these can be spotted by looking at the query profile and looking at the hotspots.
99% of the time was spent scanning the table. The filters in the query do not match your clustering keys, so the clustering won't help much. Depending on how much historical data you have in this table, and whether you will continue to read a year's worth of data, you might be better off re-clustering (or creating a materialized view clustered) by qt_product_brand_sid or qt_product_category_l3_sid, depending on which one filters the data down faster.
A bigger change would be converting the transaction date to a true date field instead of a varchar.
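A hedged sketch of that conversion (the new column name is illustrative; you would backfill once, then maintain the column on load):

ALTER TABLE cdw_dwh.public.qi_sg_promo_media_sales_lines_fact
  ADD COLUMN transaction_date DATE;
UPDATE cdw_dwh.public.qi_sg_promo_media_sales_lines_fact
  SET transaction_date = TO_DATE(transaction_date_id, 'YYYYMMDD');
-- the filter then becomes: WHERE transaction_date >= DATEADD(week, -52, CURRENT_DATE)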
Second, you have an IN clause with a single value. Use = instead.
For the other IN clauses, I would suggest rewriting the query to pull those subqueries out as CTEs and then joining to them (see the sketch after the query below).
Use this query:
SELECT
sl.customer_code,
sl.qt_product_category_l3_sid,
sl.qt_product_brand_sid,
sl.partner_code,
sl.transaction_id,
sl.dollars_spent,
sl.units,
sl.user_pii_sid,
sl.promo_flag,
sl.media_flag
FROM
cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl,
cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_cat,
cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand prod_brand
WHERE
sl.transaction_date_id >= (to_char(current_date - (52*7), 'yyyymmdd'))
AND sl.partner_code = 'All Retailers'
AND sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
AND prod_cat.qt_product_category_l1_sid = 246
AND sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
AND prod_brand.qt_product_major_brand_sid IN (246903, 430138)
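For completeness, the CTE formulation suggested above could look like this sketch (assuming the same column layout as the original query):

WITH prod_cat AS (
    SELECT DISTINCT qt_product_category_l3_sid
    FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
    WHERE qt_product_category_l1_sid = 246
),
prod_brand AS (
    SELECT DISTINCT qt_product_brand_sid
    FROM cdw_dwh.PUBLIC.qi_sg_prompt_category_major_brand
    WHERE qt_product_major_brand_sid IN (246903, 430138)
)
SELECT sl.customer_code, sl.qt_product_category_l3_sid, sl.qt_product_brand_sid,
       sl.partner_code, sl.transaction_id, sl.dollars_spent, sl.units,
       sl.user_pii_sid, sl.promo_flag, sl.media_flag
FROM cdw_dwh.public.qi_sg_promo_media_sales_lines_fact sl
JOIN prod_cat ON sl.qt_product_category_l3_sid = prod_cat.qt_product_category_l3_sid
JOIN prod_brand ON sl.qt_product_brand_sid = prod_brand.qt_product_brand_sid
WHERE sl.transaction_date_id >= to_char(current_date - 52*7, 'yyyymmdd')
  AND sl.partner_code = 'All Retailers';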
Apparently, performance is an area of focus for Snowflake R&D. After struggling to make complex queries perform on big data, we got a 100x improvement with Exasol, with no tuning whatsoever.

MDX query is very slow and returns memory exception in SSRS

I'm trying to get a detailed list of all records where my total amount is more than 100k from the following Multidimensional Expressions (MDX) query:
with member [measures].[total] as
[Measures].[m1] + [Measures].[m2] + [Measures].[m3]
select non empty
[measures].[total] on columns,
non empty filter ([dim1].[h1].allmembers
* [dim1].[h2].allmembers
* [Loss Date].[Date].[Year].allmembers
* [Dim1].[h3].allmembers
, [measures].[total]>100000 and [Measures].[Open File Count]>0) on rows
from [Monthly Summary]
where ([1 Date - Month End].[Month End Date].[Month].&[20120331])
Although I get fast results from an equivalent stored procedure and the final result is fewer than 1000 rows, my MDX query runs forever in SSMS and returns a memory exception in SSRS. Any idea how to optimize or enhance it?
You could use Having instead of Filter, since it is applied after the Non Empty and you may get better performance (see this excellent blog post by Chris Webb). This would be the new version of the query:
with member [measures].[total] as
[Measures].[m1] + [Measures].[m2] + [Measures].[m3]
select non empty
[measures].[total] on columns,
non empty
[dim1].[h1].allmembers
* [dim1].[h2].allmembers
* [Loss Date].[Date].[Year].allmembers
* [Dim1].[h3].allmembers
having [measures].[total]>100000 and [Measures].[Open File Count]>0 on rows
from [Monthly Summary]
where ([1 Date - Month End].[Month End Date].[Month].&[20120331])
I would recommend a couple of changes.
First, do you really want the All member of each of your dimensions to be returned in the query? They will be included if they meet the condition in the filter. Also, I have found changing a WHERE clause to a subselect to perform better in some cases; you need to test whether it changes performance. Next, you can reduce the number of members you're filtering by applying a NonEmpty function first, inside the Filter function. Also, in some cases the polymorphic operator (*) performs worse than the CrossJoin function or a tuple set of your members. Finally, the NON EMPTY on columns is unnecessary when you have only one item on that axis. I've combined all of these suggestions below:
with member [measures].[total] as
[Measures].[m1] + [Measures].[m2] + [Measures].[m3]
select
[measures].[total] on columns,
filter (
nonempty(
([dim1].[h1].[h1].members,
[dim1].[h2].[h2].members,
[Loss Date].[Date].[Year].members,
[Dim1].[h3].[h3].members)
, [measures].[m1])
, [measures].[total]>100000 and [Measures].[Open File Count]>0) on rows
from
(select [1 Date - Month End].[Month End Date].[Month].&[20120331] on columns
from [Monthly Summary])
See this for a bit of explanation of NON EMPTY versus NonEmpty: http://blogs.msdn.com/b/karang/archive/2011/11/16/mdx-nonempty-v-s-nonempty.aspx. In some cases putting a NonEmpty function inside a Filter function can produce a performance hit, sometimes not; you need to test.
The problem might be in a dimension or cube design (storage engine problem) and not in the query (formula engine problem). You can diagnose using the techniques here: http://www.microsoft.com/en-us/download/details.aspx?id=661. The whitepaper was written for SSAS 2005, but still applies to later versions of SSAS.
To reduce memory, order your dimensions.
Instead of:
[dim1].[h1].allmembers
* [dim1].[h2].allmembers
* [Loss Date].[Date].[Year].allmembers
* [Dim1].[h3].allmembers
Use
[dim1].[h1].allmembers
* [dim1].[h2].allmembers
* [Dim1].[h3].allmembers
* [Loss Date].[Date].[Year].allmembers
Under the covers, SSAS will inner join the dim1 members and outer join the Loss Date members.
