The following query is faster (1:05):
SELECT DATEPART(DW, DATEFROMPARTS(
    FLOOR(20180501 / 10000)
    , FLOOR(20180501 - FLOOR(20180501 / 10000) * 10000) / 100
    , FLOOR(20180501 - FLOOR(20180501 / 100) * 100)))
GO 1000
than this one (1:10):
SELECT DATEPART(DW,CAST(CAST(20180501 AS nvarchar) AS DATE))
GO 1000
Why?
I have a table with roughly 2 billion records, so the difference becomes important. There is far more logic behind the hardcoded date. Otherwise, if there is a better approach, in terms of performance, for executing the same logic, feel free to correct me.
The date column is always an integer, but it does not always have the same format. Two formats occur: YYYYMMDD and YYYYMM. I know, a bit of a mess.
Thanks!
Delete duplicate rows when the first day of the month (YYYYMM01) is a Monday
If you want to speed up the delete, create a temporary table (or a permanent one if this is a recurring operation) with a column of the same data type as your table's "date" column, containing all first Mondays of each month across XX years. Make sure the data is in the same format as you mentioned in your question, and be sure that column has a (clustered) index. Now use this table as the filter in your query, without doing any conversions, which lets SQL Server take advantage of any indexes on the existing table's "date" column. A rough sketch follows below.
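For illustration, here is a rough sketch of that idea, assuming a hypothetical table dbo.BigTable with an integer DateInt column in YYYYMMDD format (all object names are illustrative, and the weekday test assumes the default DATEFIRST setting):
CREATE TABLE #FirstMondays (DateInt int NOT NULL PRIMARY KEY CLUSTERED);

-- Build the first day of every month for the years of interest,
-- keep only the Mondays, and store them in the same integer format.
WITH Months AS (
    SELECT DATEFROMPARTS(y.n, m.n, 1) AS FirstOfMonth
    FROM (VALUES (2015), (2016), (2017), (2018)) AS y(n)
    CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) AS m(n)
)
INSERT INTO #FirstMondays (DateInt)
SELECT YEAR(FirstOfMonth) * 10000 + MONTH(FirstOfMonth) * 100 + 1
FROM Months
WHERE DATEPART(WEEKDAY, FirstOfMonth) = 2;   -- 2 = Monday under the default DATEFIRST 7

-- The delete now filters on the integer column directly, with no conversion,
-- so an index on dbo.BigTable(DateInt) can be used.
DELETE t
FROM dbo.BigTable AS t
JOIN #FirstMondays AS fm ON fm.DateInt = t.DateInt;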
What is the correct way to model data in a star schema such that a BI tool (such as PowerBI) can select a date range crossing multiple days?
I've currently got fact tables that have separate date and time dimensions. My time resolution is to the second, date resolution is to the day.
It's currently very easy to do aggregation providing the data of interest is in the same day, or even multiple complete days, but it becomes a lot more complicated when you're asking for, say, a 12 hour rolling window that crosses the midnight boundary.
Yes, I can write a SQL statement that first pulls out all rows for the entirety of the days in question, and then, by storing the actual date-time as a field in the fact table, further filters down to the actual time range I'm interested in, but that's not trivial (or even possible in some cases) to do in BI reporting tools.
However, this must be a frequent scenario in data warehouses... so how should it be done?
An example would be: give me the count of ordered items from the fact_orders table between 2017/Jan/02 1600 and 2017/Jan/03 0400.
Orders are stored individually in the fact_orders table.
In my actual scenario I'm using Azure SQL database, but it's more of a general design question.
Thank you.
My first option would be (as you mention in the question) to include a calculated column (Date + Time) in the SQL query and then filter the time part inside the BI tool.
If that doesn't work, you can create a view in the database to achieve the same effect. The easiest is to take the full joined fact + dimensions SQL query that you'd like to use in the BI tool and add the date-time column in the view.
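A rough sketch of such a view, assuming hypothetical table and column names (fact_orders, dim_date, dim_time, DateKey, TimeKey) with date- and time-typed attribute columns:
CREATE VIEW dbo.vw_orders AS
SELECT
    f.ItemCount,                 -- measures and keys from the fact table (hypothetical)
    d.[Date],
    t.[Time],
    -- the combined column the BI tool can filter on
    CAST(d.[Date] AS datetime) + CAST(t.[Time] AS datetime) AS [DateTime]
FROM fact_orders AS f
JOIN dim_date AS d ON d.DateKey = f.DateKey
JOIN dim_time AS t ON t.TimeKey = f.TimeKey;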
Be sure to still filter on the Date field itself to allow index use! So for your sliding window, your parameters would be something like
WHERE [Date] BETWEEN '2017-01-02' AND '2017-01-03'
  AND [DateTime] BETWEEN '2017-01-02 16:00' AND '2017-01-03 04:00'
If that doesn't perform well enough due to data volumes, you might want to set up and maintain a separate table or materialized view (depending on your DB and ETL options) that does a Cartesian join of the time dimension with a small range of the Date dimension (only the last week or whatever period you are interested in partial day reports), then join the fact table to that.
The DateTimeWindow table/view would be indexed on the DateTime column and have only two extra columns: DateKey and TimeKey. Inner join that to the fact table using both keys and you should get exactly the window you want when the BI tool supplies a datetime range.
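Using the same hypothetical names as above, a minimal sketch of that helper table:
-- Cartesian join of the time dimension with a narrow slice of the date dimension.
CREATE TABLE dbo.DateTimeWindow (
    [DateTime] datetime NOT NULL,
    DateKey    int      NOT NULL,
    TimeKey    int      NOT NULL,
    CONSTRAINT PK_DateTimeWindow PRIMARY KEY CLUSTERED ([DateTime], DateKey, TimeKey)
);

INSERT INTO dbo.DateTimeWindow ([DateTime], DateKey, TimeKey)
SELECT CAST(d.[Date] AS datetime) + CAST(t.[Time] AS datetime), d.DateKey, t.TimeKey
FROM dim_date AS d
CROSS JOIN dim_time AS t
WHERE d.[Date] >= DATEADD(DAY, -7, CAST(GETDATE() AS date));   -- e.g. only the last week

-- Inner join to the fact table on both keys; the BI tool supplies the datetime range.
SELECT COUNT(*) AS OrderedItems
FROM fact_orders AS f
JOIN dbo.DateTimeWindow AS w
    ON w.DateKey = f.DateKey
   AND w.TimeKey = f.TimeKey
WHERE w.[DateTime] BETWEEN '2017-01-02 16:00' AND '2017-01-03 04:00';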
That is not easily modeled. A solution would be to build an additional dimension with date + time. Of course, this could mean you have to severely limit the granularity of the time dimension:
10 year hour granularity: 365 * 10 * 24 = 87600 rows
10 year minute granularity: 365 * 10 * 24 * 60 = 5256000 rows
You could use just this dimension, or (better) add it and not show it to all users. It would mean an additional key in the fact table; if the fact table is not gigantic, no big deal.
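A rough sketch of such a combined dimension at hour granularity (all names are illustrative):
-- One row per hour across the covered range, keyed by an integer surrogate.
CREATE TABLE dbo.dim_date_hour (
    DateHourKey int      NOT NULL PRIMARY KEY,   -- e.g. 2017010216 for 2017-01-02 16:00
    DateHour    datetime NOT NULL,
    [Date]      date     NOT NULL,
    [Hour]      tinyint  NOT NULL
);
-- The fact table then carries DateHourKey as an additional key, and a 12-hour
-- window becomes a simple range filter on DateHour (or on DateHourKey itself).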
I have a table Transactions which has a columnstore index and stores 74,445,000 rows.
I have a query like below
SELECT
CustomerNumber,
MONTH(CreationDate) AS m,
YEAR(CreationDate) AS y,
CurrencyID AS c
FROM Transactions
I am mulling over whether doing a JOIN to a Date Dimension table that contains the month and year for all dates may be better than the above query, which uses SQL date functions.
Can anyone verify this assumption and/or point to a resource that provides details?
Any alteration of the original column will have an impact on query performance. For the calculation of month & year in your query, you should get a very efficient Batch Execution Mode, which will make the alternatives look quite pale.
Plus, if your join is done on an integer/bigint column, then you might be able to get Segment Elimination, which should improve your query performance; but going for any string column will make the query last an incredibly long time compared with the int data types.
In other words: unnecessary headaches.
There are no big alternatives since Columnstore Indexes do not support computed columns yet (SQL Server 2016).
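If you did go the dimension route, a hedged sketch of what the integer-key join might look like, assuming Transactions also carried an integer CreationDateKey in YYYYMMDD form and a DimDate table keyed the same way (both are assumptions, not part of your schema):
-- Joining on an int key keeps Batch Execution Mode and allows segment elimination.
SELECT
    t.CustomerNumber,
    d.[Month]    AS m,
    d.[Year]     AS y,
    t.CurrencyID AS c
FROM Transactions AS t
JOIN dbo.DimDate AS d
    ON d.DateKey = t.CreationDateKey;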
I have a large database and use a query like this:
...WHERE (DATETIME > '30.10.2014 00:00:00' AND DATETIME < '03.11.2014 00:00:00')
My query is already ordered by the field DATETIME, so is it possible to stop ("break") the query the first time the condition DATETIME < '03.11.2014 00:00:00' is no longer met, so that Oracle doesn't need to check the remaining rows? They aren't needed, and this would save time.
Thanks!
You've got basically 3 options here (ordered from best to worst):
If you create the table partitioned by the DATETIME column, the optimizer will scan only the relevant partitions (in the range 30.10.2014 00:00:00 to 03.11.2014 00:00:00) instead of accessing the entire table.
Create the table as an IOT (Index Organized Table) — this way the table will be stored as an ordered B-Tree.
Create an index on the DATETIME column. This way, when accessing the column, you will scan the ordered index. This option has a major disadvantage — if the data which is being inserted into this table is "real time sequential data" (I mean that DATETIME column values are always increasing [sysdate for example] and not random [date of birth for example]), there will always be a Hot Block on your B-Tree index. This will cause contentions and probably many wait events (dependent on the data insertion rates of course). The way to "solve" this is to create this index reversed, but then you will have a big problem performing queries on ranges (like the query you've presented here), because the data will be scattered across the index and not stored sequentially.
So, my best advice for situations like this: work with partitions. This is the best way to work efficiently with big amounts of data on Oracle Databases.
If there's not enough data in this table to justify partitions, then the table is not that big and you can consider options #2 and #3.
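For illustration, option #1 could be declared roughly like this on Oracle 11g+ with interval partitioning (the table name and extra columns are illustrative):
-- One partition per day is created automatically as data arrives.
CREATE TABLE events (
    id        NUMBER,
    datetime  DATE,
    payload   VARCHAR2(100)
)
PARTITION BY RANGE (datetime)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2014-01-01')
);

-- A range query like yours then touches only the daily partitions in that range:
SELECT *
FROM events
WHERE datetime > DATE '2014-10-30'
  AND datetime < DATE '2014-11-03';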
Best regards.
Can I specify a range so that all rows having a value in the CreatedDate column earlier than one month before GETDATE() are placed in one partition and the rest in another, so that I can query the 2nd partition for the latest data and the 1st one for archived data?
No, you can't. A partition function must be deterministic. Deterministic functions always return the same result any time they are called with a specific set of input values.
Unfortunately, GETDATE() is a nondeterministic function, so you can't use it in a partition function.
See http://shannonlowder.com/2010/08/partitioning/ for more details
@Ismail
There are alternatives:
Create a bit column LastMonth and a partition function based on that column. You need to update the field before you start using your data. You don't have to do it daily; it may be better to update the flag column (or change your partition function) once per period of your choosing (week/month/quarter). A rough sketch follows below.
I haven't tried this approach; you may need to run some maintenance on the table to get full performance after updating the column.
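A rough sketch of the bit-column idea (all object names are illustrative, and the table itself would be created or rebuilt on the partition scheme):
-- Partition 1 holds LastMonth = 0 (archive), partition 2 holds LastMonth = 1 (fresh).
CREATE PARTITION FUNCTION pfLastMonth (bit)
AS RANGE LEFT FOR VALUES (0);

CREATE PARTITION SCHEME psLastMonth
AS PARTITION pfLastMonth ALL TO ([PRIMARY]);

-- Periodic maintenance demotes aged rows; updated rows move between partitions.
UPDATE dbo.MyTable
SET LastMonth = 0
WHERE LastMonth = 1
  AND CreatedDate < DATEADD(MONTH, -1, GETDATE());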
Another idea that might work is to make a partition for every month and switch filegroups when a new month starts. For example, if you want your latest data on the fast disk F: and history on S:, you would have PartitionJan on S: and PartitionFebruary on F:; when March starts, move PartitionFebruary to S: and start using PartitionMarch on F:.
I have a column in my table, say updateStamp. I'd like an approach for updating that field with a new sequential number upon row update.
The database has a lot of traffic, mostly reads, but multiple concurrent updates could also happen in batches. Therefore the solution should cause minimal locking.
The reason for this requirement is that I need a solution that lets clients iterate over the table forwards, and if a row is updated, it should come up in the result set again.
So, the query would then be like:
SELECT *
FROM mytable
WHERE updateStamp > @lastReturnedUpdateStamp
ORDER BY updateStamp
Unfortunately, timestamps do not work here because multiple updates could happen at the same time.
The timestamp (deprecated) or rowversion (current) data type is the only one I'm aware of that is updated on every write operation on the row.
It's not a time stamp per se - it doesn't store date, time in hours, seconds etc. - it's really more of a RowVersion (hence the name change) - a unique, ever-increasing number (binary) on the row.
It's typically used to check for any modifications between the time you have read the row, and the time you're going to update it.
Since it's not really date/time information, you will most likely have to add another column for that human-readable information. You can add a LastModified DATETIME column to your table with a DEFAULT GETDATE() constraint, so a value is filled in automatically on insert. To keep it up to date, you'll have to write an AFTER UPDATE trigger that sets the LastModified column whenever an update occurs.
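A minimal sketch of that column and trigger, assuming mytable has an id primary key column (the constraint and trigger names are illustrative):
ALTER TABLE mytable ADD LastModified datetime NOT NULL
    CONSTRAINT DF_mytable_LastModified DEFAULT (GETDATE());
GO

CREATE TRIGGER trg_mytable_LastModified
ON mytable
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
       SET LastModified = GETDATE()
      FROM mytable AS t
      JOIN inserted AS i ON i.id = t.id;   -- "id" is an assumed primary key column
END;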
SQL Server 2011 (a.k.a. "Denali") will bring us SEQUENCES, which would be the perfect fit in your case here - but alas, that's still at least a year from official release...