I have a table Transactions which has a columnstore index and stores 74,445,000 rows.
I have a query like the one below:
SELECT
CustomerNumber,
MONTH(CreationDate) AS m,
YEAR(CreationDate) AS y,
CurrencyID AS c
FROM Transactions
I am mulling over whether doing a JOIN to a Date Dimension table, which contains the month and year for every date, might be better than the above query, which uses SQL date functions.
Can anyone verify this assumption and/or point to a resource that provides details?
Any alteration of the original column will have an impact on query performance. For the calculation of month and year in your query you should get very efficient Batch Execution Mode, which will make the alternatives look quite pale.
Plus, if your join is done on an integer/bigint column, then you might also get Segment Elimination, which should improve query performance, but joining on any string column will make the query take an incredibly long time compared to the int data types.
In other words: unnecessary headaches.
There are no real alternatives, since Columnstore Indexes do not support computed columns yet (as of SQL Server 2016).
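For reference, the dimension-join alternative being weighed would look roughly like the sketch below; DimDate, DateValue, CalendarMonth and CalendarYear are illustrative names, not objects from the original schema:

SELECT
    t.CustomerNumber,
    d.CalendarMonth AS m,
    d.CalendarYear  AS y,
    t.CurrencyID    AS c
FROM Transactions AS t
JOIN dbo.DimDate  AS d
    ON d.DateValue = CAST(t.CreationDate AS date)   -- better: join on an int date key, if one exists

As noted above, the inline MONTH()/YEAR() version already runs in Batch Mode against the columnstore, so this join only pays off if it is done on an integer key that also enables segment elimination.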
I have following table:
CREATE TABLE public.shop_prices
(
shop_name text COLLATE pg_catalog."default",
product text COLLATE pg_catalog."default",
product_category text COLLATE pg_catalog."default",
price text COLLATE pg_catalog."default"
)
and for this table I have a dataset covering 18 months. Each file contains about 15M records. I have to do some analysis, such as in which month a shop increased or decreased its prices. I imported two months into a table and ran the following query just to test:
select shop, product from shop_prices group by shop, product limit 10
I waited more than 5 minutes, but got no result or response; it was still running. What is the best way to store these datasets and run efficient queries? Is it a good idea to create a separate table for each dataset?
Using explain analyze select shop_name, product from shop_prices group by shop_name, product limit 10 you can see how Postgres plans and executes the query and how long the execution takes. You'll see it needs to read the whole table (with time-consuming disk reads) and then aggregate it in memory, probably spilling to disk, before returning the results. On the next run you might find the same query is very snappy, because the data is now cached, especially if the number of shop_name + product combinations is very limited. The point being that a simple query like this can be deceiving.
You will get faster execution by creating an index on the columns you are using (create index shop_prices_shop_prod_idx on public.shop_prices (shop_name, product)).
You should definitely change the price column type to numeric (or float/float8) if you plan to do any numerical calculations on it.
Having said all that, I suspect this table is not what you will be using as it does not have any timestamp to compare prices between months to begin with.
I suggest you complete the table design and then think about indexes to improve performance. You might even want to consider table partitioning (a rough sketch is below): https://www.postgresql.org/docs/current/ddl-partitioning.html
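For illustration only, a monthly range-partitioned layout could look roughly like this; it assumes you add a proper date column (price_date here is a made-up name) and switch price to numeric:

CREATE TABLE public.shop_prices
(
    shop_name        text,
    product          text,
    product_category text,
    price            numeric,
    price_date       date NOT NULL
) PARTITION BY RANGE (price_date);

-- one partition per month, created as needed
CREATE TABLE public.shop_prices_2020_01 PARTITION OF public.shop_prices
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');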
You will probably be doing all sorts of queries on this data so there is no simple solution to them all.
By all means come back with more specific questions, including the complete table description and the output of the explain analyze statement for the queries you are trying out, and you will get some good advice.
Best regards,
Bjarni
What is your PostgreSQL version?
First there is a typo: column shop should be shop_name.
Second, your query looks strange because it has only a LIMIT clause without any ORDER BY or WHERE clause: do you really want "random" rows from this query?
Can you try to post EXPLAIN output for the SQL statement:
explain select shop_name, product from shop_prices group by shop_name, product limit 10;
Can you also check if any statistics have been computed for this table with:
select * from pg_stats where tablename='shop_prices';
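If that returns no rows, statistics have most likely never been gathered for the table; running a plain ANALYZE (standard PostgreSQL) will populate them:

analyze public.shop_prices;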
The following query is faster (1:05):
SELECT DATEPART(DW,DATEFROMPARTS(
FLOOR(20180501/10000)
,FLOOR(20180501-FLOOR(20180501/10000)*10000)/100
,FLOOR(20180501-FLOOR(20180501/100)*100)))
GO 1000
than this one (1:10):
SELECT DATEPART(DW,CAST(CAST(20180501 AS nvarchar) AS DATE))
GO 1000
Why?
I have a table with roughly 2 billion records, so the difference becomes important. There is far more logic behind the hardcoded date. Otherwise, if there is a better approach, in terms of performance, for executing the same logic, feel free to correct me.
The date column is always an integer, but it does not always have the same format. Two formats appear: YYYYMMDD and YYYYMM. I know, a bit of a mess.
Thanks!
Delete duplicate rows when the first day of the month (YYYYMM01) is a Monday
If you want to speed up the delete, create a temporary table (or a permanent one if this is a recurring operation) with a column of the same data type as your table's "date" column, holding all first Mondays of each month across XX years. Make sure the data is in the same format as you mentioned in your question, and be sure that this column has a (clustered) index. Now use this table in your query as the filter, without doing any conversions, which will allow SQL Server to take advantage of any indexes that exist on the existing table's "date" column. A rough sketch follows.
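A sketch of that idea, with all object names made up and the int YYYYMMDD format assumed to match the existing "date" column:

-- helper table holding the relevant Monday dates as int YYYYMMDD values
CREATE TABLE dbo.FirstMondays
(
    date_int int NOT NULL PRIMARY KEY CLUSTERED   -- e.g. 20180501
);

-- after populating it for the years you need, filter without converting the big table's column
DELETE t
FROM dbo.BigTable AS t
JOIN dbo.FirstMondays AS fm
    ON fm.date_int = t.[date]   -- no conversion on t.[date], so its indexes stay usable
-- ...combined with whatever duplicate-detection logic you already use (ROW_NUMBER(), etc.)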
What is the correct way to model data in a star schema such that a BI tool (such as PowerBI) can select a date range crossing multiple days?
I've currently got fact tables that have separate date and time dimensions. My time resolution is to the second, date resolution is to the day.
It's currently very easy to do aggregation providing the data of interest is in the same day, or even multiple complete days, but it becomes a lot more complicated when you're asking for, say, a 12 hour rolling window that crosses the midnight boundary.
Yes, I can write a SQL statement to first pull out all rows for the entirety of the days in question, and then by storing the actual date time as a field in the fact table I can further filter down to the actual time range I'm interested in, but that's not trivial (or possible in some cases) to do in BI reporting tools.
However this must be a frequent scenario in data warehouses... So how should it be done?
An example would be: give me the count of ordered items from the fact_orders table between 2017/Jan/02 1600 and 2017/Jan/03 0400.
Orders are stored individually in the fact_orders table.
In my actual scenario I'm using Azure SQL database, but it's more of a general design question.
Thank you.
My first option would be (as you mention in the question) to include a calculated column (Date + Time) in the SQL query and then filter the time part inside the BI tool.
If that doesn't work, you can create a view in the database to achieve the same effect. The easiest is to take the full joined fact + dimensions SQL query that you'd like to use in the BI tool and add the date-time column in the view.
Be sure to still filter on the Date field itself to allow index use! So for your sliding window, your parameters would be something like
WHERE [Date] BETWEEN '2017-01-02' AND '2017-01-03'
  AND [DateTime] BETWEEN '2017-01-02 16:00' AND '2017-01-03 04:00'
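A minimal sketch of such a view, assuming the fact table carries DateKey/TimeKey surrogate keys and the dimensions expose the actual calendar date and time of day (all names here are illustrative):

CREATE VIEW dbo.vw_fact_orders_dt AS
SELECT f.OrderKey,
       f.OrderQuantity,
       d.CalendarDate AS [Date],
       -- TimeOfDay is assumed to be a time column in dim_time
       DATEADD(SECOND,
               DATEDIFF(SECOND, CAST('00:00:00' AS time), t.TimeOfDay),
               CAST(d.CalendarDate AS datetime2(0))) AS [DateTime]
FROM dbo.fact_orders AS f
JOIN dbo.dim_date    AS d ON d.DateKey = f.DateKey
JOIN dbo.dim_time    AS t ON t.TimeKey = f.TimeKey;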
If that doesn't perform well enough due to data volumes, you might want to set up and maintain a separate table or materialized view (depending on your DB and ETL options) that does a Cartesian join of the time dimension with a small range of the Date dimension (only the last week or whatever period you are interested in partial day reports), then join the fact table to that.
The DateTimeWindow table/view would be indexed on the DateTime column and have only two extra columns: DateKey and TimeKey. Inner join that to the fact table using both keys and you should get exactly the window you want when the BI tool supplies a datetime range.
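A rough sketch of that bridge table, again with illustrative names and an arbitrary one-week window:

CREATE TABLE dbo.DateTimeWindow
(
    [DateTime] datetime2(0) NOT NULL,
    DateKey    int          NOT NULL,
    TimeKey    int          NOT NULL
);
CREATE CLUSTERED INDEX IX_DateTimeWindow_DateTime ON dbo.DateTimeWindow ([DateTime]);

INSERT INTO dbo.DateTimeWindow ([DateTime], DateKey, TimeKey)
SELECT DATEADD(SECOND,
               DATEDIFF(SECOND, CAST('00:00:00' AS time), t.TimeOfDay),
               CAST(d.CalendarDate AS datetime2(0))),
       d.DateKey,
       t.TimeKey
FROM dbo.dim_date AS d
CROSS JOIN dbo.dim_time AS t
WHERE d.CalendarDate >= DATEADD(DAY, -7, CAST(GETDATE() AS date));   -- keep only the recent range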
That is not easily modeled. A solution would be to build an additional dimension with date + time. Of course this could mean you have to severely limit the granularity of the time dimension.
10 years at hour granularity: 365 * 10 * 24 = 87,600 rows
10 years at minute granularity: 365 * 10 * 24 * 60 = 5,256,000 rows
You could use just this dimension, or (better) add it and not show it to all users. It would mean an additional key in the fact table: if the fact table is not gigantic, no big deal.
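A rough sketch of such a combined dimension at hour granularity (dim_date and its CalendarDate/DateKey columns are assumed from the existing model; everything else is illustrative):

CREATE TABLE dbo.dim_datetime
(
    datetime_key  int IDENTITY(1,1) PRIMARY KEY,
    full_datetime datetime2(0) NOT NULL UNIQUE,
    date_key      int          NOT NULL,   -- points at the existing date dimension
    hour_of_day   tinyint      NOT NULL
);

INSERT INTO dbo.dim_datetime (full_datetime, date_key, hour_of_day)
SELECT DATEADD(HOUR, h.n, CAST(d.CalendarDate AS datetime2(0))),
       d.DateKey,
       h.n
FROM dbo.dim_date AS d
CROSS JOIN (SELECT TOP (24) CAST(ROW_NUMBER() OVER (ORDER BY object_id) - 1 AS int) AS n
            FROM sys.all_objects
            ORDER BY object_id) AS h;   -- generates hours 0..23

At minute granularity the same cross join against a 1,440-row minutes table yields the ~5.2 million rows estimated above.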
I have a single database of over 9,000 tables, built by ingesting an enormous dataset. Each table can vary in size, but the columns of each table are the same (day, month, year, time, measurement, altitude).
How can I combine all of these tables into a single one? I've read that using a single table will be much more efficient to query for, say, all measurements within a specific month. Rebuilding the database itself would be a pain and would take too much time.
Assuming you have the list of tables, execute the following SQL statement for each table:
INSERT INTO Dataset_all SELECT * FROM Dataset_123;
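Rather than typing 9,000 statements, they can be generated from the catalog; a rough sketch in PostgreSQL-style SQL (the Dataset_ naming prefix and the public schema are assumptions, and the concatenation syntax may need adjusting for your DBMS):

SELECT 'INSERT INTO Dataset_all SELECT * FROM ' || quote_ident(table_name) || ';'
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name LIKE 'Dataset%'
  AND table_name <> 'Dataset_all';

Run the generated statements as a script to load everything into the combined table.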
I have a large database and use a query like this:
...WHERE (DATETIME > '30.10.2014 00:00:00' AND DATETIME < '03.11.2014 00:00:00')
My query is already ordered by the field DATETIME, so is it possible to break off the query as soon as the upper bound DATETIME < '03.11.2014 00:00:00' is reached, so that Oracle doesn't need to check the remaining rows? They aren't needed, and this would save time.
Thanks!
You basically have 3 options here (ordered from best to worst):
If you create the table partitioned by the DATETIME column, the optimizer will scan only the relevant partitions (those covering the range 30.10.2014 00:00:00 to 03.11.2014 00:00:00) instead of accessing the entire table.
Create the table as an IOT (Index Organized Table) — this way the table will be stored as an ordered B-Tree.
Create an index on the DATETIME column. This way, when accessing the column, you will scan the ordered index. This option has a major disadvantage: if the data being inserted into this table is "real-time sequential data" (I mean that the DATETIME column values are always increasing [sysdate, for example] and not random [date of birth, for example]), there will always be a hot block on your B-Tree index. This will cause contention and probably many wait events (depending on the data insertion rate, of course). The way to "solve" this is to create the index as a reverse key index, but then you will have a big problem with range queries (like the one you've presented here), because the data will be scattered across the index instead of being stored sequentially.
So, my best advice for situations like this: work with partitions; this is the best way to work efficiently with large amounts of data in Oracle databases.
If there's not enough data in this table to justify partitions, then the table is not that big and you can consider options #2 and #3.
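For illustration, an interval-partitioned version of such a table could look roughly like this (the column list, the monthly interval and all names are only assumptions):

CREATE TABLE big_table
(
    id       NUMBER,
    datetime DATE NOT NULL,
    payload  VARCHAR2(100)
)
PARTITION BY RANGE (datetime)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2014-01-01')
);

With that in place, a predicate such as datetime >= DATE '2014-10-30' AND datetime < DATE '2014-11-03' is pruned to the October and November partitions instead of scanning the whole table.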
Best regards.