Looping Dates in Teradata

I have a very large query which returns about 800,000 rows on a monthly basis.
I have been asked to run it for the whole year, but obviously that will not all fit into a single Excel file. Unfortunately the detail is necessary for the analysis being undertaken by the business analyst, so I can't roll it up to aggregates.
SELECT * FROM foo WHERE foo.start_date >= 20150101 AND foo.start_date < 20150201
I was wondering: is it possible in Teradata to set up a loop that runs the first query, increments the time period by a month, exports the result as a text file, and then runs the query again?
I would be using Teradata SQL Assistant with its export functionality for this.
I could run it for the year and split it in MS Access afterwards, but I would like to see whether there is a way to implement it directly in Teradata.
Thank you for your time
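As far as I know, SQL Assistant itself has no loop construct, but if BTEQ is available, one common workaround is a script with one export per month. A minimal sketch, reusing the table and dates from the question (the logon details and file names are placeholders):

.LOGON mytdpid/myuser,mypassword;
.EXPORT REPORT FILE = foo_201501.txt
SELECT * FROM foo WHERE foo.start_date >= 20150101 AND foo.start_date < 20150201;
.EXPORT RESET
.EXPORT REPORT FILE = foo_201502.txt
SELECT * FROM foo WHERE foo.start_date >= 20150201 AND foo.start_date < 20150301;
.EXPORT RESET
.LOGOFF;

Repeating the EXPORT/SELECT pair for each month (or generating the script externally) gives one text file per month without manually re-running the query.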

Related

Azure SQL Database Partitioning

I currently have an Azure SQL Database (Standard 100 DTUs, S3) and I want to partition a large table by splitting a datetime2 value into YYYYMM. Each table has at least the following columns:
Guid (uniqueidentifier type)
MsgTimestamp (datetime2 type) << partition using this.
I've been looking through the Azure documentation and SO but can't find anything that clearly says how to create a partition on a 'datetime2' in the desired format, or even whether it's supported on this database tier.
I also tried the link below, but I cannot find the option to create a partition under the Storage menu in SQL Server Management Studio.
https://www.sqlshack.com/database-table-partitioning-sql-server/
In addition, would these partitions have to be created daily as the clock goes past 12 am, or is this done automatically?
UPDATE
I suspect I may have to create the partitions manually using the first link below and then, at the beginning of each month, use the second link to create the next month's partition in advance.
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-partition-function-transact-sql?view=sql-server-ver15
Context
I currently connect to a real-time feed that delivers up to 600 rows a minute, and I have a backlog of around 370 million rows covering about 3 years of data.
Correct.
You can create partitions based upon datetime2 columns. Generally, you'd just do that on the start of month date, and you'd use a RANGE RIGHT (so that the start of the month is included in the partition).
And yes, at the end of every month, the normal action is to:
Split the partition function to add a new partition.
Switch out the oldest monthly partition into a separate table for archiving purposes (presuming you want a rolling window of months).
And another yes, we all wish the product had options to do this for you automatically.
I was one of the technical reviewers on the following whitepaper by Ron Talmage, back in 2008, and 99% of the advice in it is still current:
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd578580(v=sql.100)
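For illustration, a hedged sketch of that monthly split-and-switch pattern in T-SQL (the function, scheme, table and boundary names here are made up, not taken from the question):

-- RANGE RIGHT so the 1st of each month falls into the partition to its right.
CREATE PARTITION FUNCTION pfMonthly (datetime2(7))
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- At the end of each month: add next month's boundary ...
ALTER PARTITION SCHEME psMonthly NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfMonthly() SPLIT RANGE ('2023-04-01');

-- ... and, if you archive, switch the oldest partition out to a table with an identical structure.
ALTER TABLE dbo.Messages SWITCH PARTITION 1 TO dbo.Messages_Archive;

The table itself would be created ON psMonthly(MsgTimestamp), so new rows land in the right monthly partition automatically; only the boundary maintenance is a monthly job.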

Selecting data across day boundaries from Star schema with separate date and time dimensions

What is the correct way to model data in a star schema such that a BI tool (such as PowerBI) can select a date range crossing multiple days?
I've currently got fact tables that have separate date and time dimensions. My time resolution is to the second, date resolution is to the day.
It's currently very easy to do aggregation provided the data of interest falls within the same day, or even spans multiple complete days, but it becomes a lot more complicated when you ask for, say, a 12-hour rolling window that crosses the midnight boundary.
Yes, I can write a SQL statement that first pulls out all rows for the days in question and then, by storing the actual datetime as a field in the fact table, filters further down to the time range I'm interested in, but that is not trivial (or even possible in some cases) to do in BI reporting tools.
However, this must be a frequent scenario in data warehouses, so how should it be done?
An example would be: give me the count of ordered items from the fact_orders table between 2017/Jan/02 16:00 and 2017/Jan/03 04:00.
Orders are stored individually in the fact_orders table.
In my actual scenario I'm using Azure SQL database, but it's more of a general design question.
Thank you.
My first option would be (as you mention in the question) to include a calculated column (Date + Time) in the SQL query and then filter the time part inside the BI tool.
If that doesn't work, you can create a view in the database to achieve the same effect. The easiest is to take the full joined fact + dimensions SQL query that you'd like to use in the BI tool and add the date-time column in the view.
Be sure to still filter on the Date field itself to allow index use! So for your sliding window, your parameters would be something like
WHERE Date between 2017/Jan/02 AND 2017/Jan/03
AND DateTime between 2017/Jan/02 1600 and 2017/Jan/03 0400
If that doesn't perform well enough due to data volumes, you might want to set up and maintain a separate table or materialized view (depending on your DB and ETL options) that does a Cartesian join of the time dimension with a small range of the Date dimension (only the last week, or whatever period you need partial-day reports for), and then join the fact table to that.
The DateTimeWindow table/view would be indexed on the DateTime column and have only two extra columns: DateKey and TimeKey. Inner join that to the fact table using both keys and you should get exactly the window you want when the BI tool supplies a datetime range.
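A hedged sketch of the calculated-column/view idea (the fact and dimension table and column names here are assumptions, not the asker's actual schema):

-- View that combines the date and time dimension values into one datetime2 column.
CREATE VIEW dbo.vw_fact_orders_datetime
AS
SELECT f.OrderKey,
       d.[Date],
       t.[Time],
       DATEADD(SECOND,
               DATEDIFF(SECOND, CAST('00:00:00' AS time), t.[Time]),
               CAST(d.[Date] AS datetime2)) AS [DateTime]
FROM fact_orders AS f
JOIN dim_date AS d ON d.DateKey = f.DateKey
JOIN dim_time AS t ON t.TimeKey = f.TimeKey;
GO

-- 12-hour window crossing midnight; keep the Date predicate so indexes can still be used.
SELECT COUNT(*) AS OrderedItems
FROM dbo.vw_fact_orders_datetime
WHERE [Date] BETWEEN '2017-01-02' AND '2017-01-03'
  AND [DateTime] BETWEEN '2017-01-02 16:00' AND '2017-01-03 04:00';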
That is not easily modeled. One solution would be to build an additional dimension with date + time. Of course, this could mean you have to severely limit the granularity of the time dimension.
10-year hour granularity: 365 * 10 * 24 = 87,600 rows
10-year minute granularity: 365 * 10 * 24 * 60 = 5,256,000 rows
You could use just this dimension, or (better) add it and not expose it to all users. It would mean an additional key in the fact table; if the fact table is not gigantic, that is no big deal.
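A hedged sketch of populating such a combined dimension at minute granularity (dim_date, dim_time and their HourOfDay/MinuteOfHour/SecondOfMinute columns are assumptions):

-- Cross join the existing date and time dimensions, keeping one row per minute.
INSERT INTO dim_datetime (DateKey, TimeKey, [DateTime])
SELECT d.DateKey,
       t.TimeKey,
       DATEADD(MINUTE, t.HourOfDay * 60 + t.MinuteOfHour, CAST(d.[Date] AS datetime2))
FROM dim_date AS d
CROSS JOIN dim_time AS t
WHERE t.SecondOfMinute = 0;   -- minute granularity: roughly 5.3 million rows over 10 years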

SQL/pivot Calculated fields, is this possible

I'm using Microsoft SQL Server and Excel. I'm having trouble getting my head around how to calculate some fields so that the whole thing runs faster. I have a large data set that gets dropped into Excel and put into a pivot table.
The table, at its simplest, contains a number of fields similar to the below.
Date user WorkType Count TotalTime
My issue is that I need to calculate an average in a particular way. Each user may have several work types on any given day. The formula I have is, for each Date & User, Sum(TotalTime)/Sum(Count), to get me the following:
Date user Average
Currently I dump a select query into Excel, apply a formula to a column to get my averages, then construct the pivot table using the personal details and the averages.
The calculation over the 20,000-plus rows, however, takes about 5-7 minutes.
So my question is: is it possible to do that type of calculation in either SQL or the pivot table to cut down the processing time? I'm not very confident with pivot tables, and I'm fairly inexperienced at SQL compared to the people here. I can manage bits of this, but pulling it all together with the condition of matching Date and User is beyond me right now.
I could parse the recordset into an array and do my calculations there before it gets written to the spreadsheet, but I feel there should be a better way to achieve the same end.
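For what it's worth, a minimal sketch of the per-Date, per-User average done in SQL (the table name is hypothetical, and it assumes TotalTime is numeric; with a time datatype you need conversions like the ones in the final query later in this thread):

SELECT [Date],
       [user],
       SUM(TotalTime) * 1.0 / NULLIF(SUM([Count]), 0) AS [Average]
FROM dbo.WorkData
GROUP BY [Date], [user];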
Pre-calculated aggregates in SQL can go very wrong in an Excel Pivot.
Firstly, you can accidentally take an average of an average.
Secondly, once users start re-arranging the pivot you can get very strange sub-totals and totals.
Try to ensure you do all of your aggregation in one place.
If possible, try to use SQL with SSRS; you can base a report on a parameterised stored procedure. That way you push all of the hard work onto the SQL box, and you restrict users from pivoting things around improperly.
For anyone interested, here are the results of my searching. Thank you for the help that pointed me in the right direction. It seems quite clumsy with the conversions due to the time datatype, but it's working.
SELECT Raw_Data.ID,
       Raw_Data.fldDate,
       CONVERT(datetime,
               DATEADD(second,
                       SUM(DATEDIFF(second, 0, Raw_Data.TotalHandle)) / SUM(Raw_Data.[Call Count]),
                       0),
               108) AS avgHandled
FROM Raw_Data
WHERE Raw_Data.fldDate BETWEEN '05-12-2016' AND '07-12-2016'
GROUP BY Raw_Data.fldDate, Raw_Data.ID

Import data from Oracle datawarehouse to SQL Server via SSIS 2008

I have an Oracle data warehouse which contains a huge amount of data (around 11 million rows), and I want to extract this data to a SQL Server database on a daily basis.
SSIS Package
I have created a package to import data from Oracle to SQL Server using a Slowly Changing Dimension transformation; however, it only handles around 600 rows per second.
I need my package to just insert new records, without updating or doing anything to old records, as the data volume is huge.
Is there any way to do this faster with other data flow items?
You could try to utilize a Merge Join in SSIS; this should allow for a comparison where only new records are inserted. Also, I don't like using just a datetime to determine which data does and does not get inserted, though I guess it depends on your source data. It sounds like there is no sequential ID field in the Oracle source data? If there is one, I'd use it in combination with the datetime to decide what to insert. This could be done in SQL or SSIS.
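Expressed in T-SQL rather than in the data flow (assuming, purely for illustration, that the Oracle rows are first landed in a staging table; all names here are placeholders), the insert-only comparison looks roughly like this:

-- Insert only the rows whose ID is not already in the target.
INSERT INTO dbo.TargetTable (ID, Col1, Col2)
SELECT s.ID, s.Col1, s.Col2
FROM dbo.StagingTable AS s
LEFT JOIN dbo.TargetTable AS t ON t.ID = s.ID
WHERE t.ID IS NULL;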
600 rows per second is not too bad in your case.
If we assume those 11 million rows were collected over just one year, the number of new records is only about 30K per day, which takes roughly a minute to load.
The biggest problem is identifying the records to insert.
Ideally you have a timestamp or a sequential ID to identify the latest inserted records.
If your ID is not sequential, you can extract ONLY the ID column from the Oracle table into SSIS, compare it to the existing dataset, and then request only the newest records from Oracle.
If you don't have such fields, you can extract all 11 million records, generate a hash on both sides, and compare the hash values to know which rows are new and need inserting.
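A hedged sketch of the high-water-mark variant (table and column names are assumptions):

-- 1. On the SQL Server side, find the highest ID already loaded:
SELECT MAX(SourceID) AS LastLoadedID FROM dbo.TargetTable;

-- 2. Feed that value into the Oracle source query of the data flow so only newer rows are pulled,
--    i.e. something along the lines of: SELECT * FROM source_table WHERE id > <last loaded id> ORDER BY id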

Is it possible to show the records from an ADOQuery whilst opening it?

I have an ADOQuery linked to a DBGrid by a DataSource.
The ADOQuery and the DataSource are in a DataModule and the connection is in another form.
Is there any way to make my application show rows while the query is fetching the records?
Like SQL Server Management Studio does.
The SELECT takes about 7 minutes to finish executing.
I'm using Delphi 2007.
A difficult challenge. If I need to run massive queries, I normally break the query into chunks. I create a stored procedure that takes parameters @ChunkNumber, @ChunkSize and @TotalChunks, and the query then only returns records from (@ChunkNumber - 1) * @ChunkSize + 1 to @ChunkNumber * @ChunkSize. In your Delphi code, simply run a loop like this (pseudo-code):
for Chunk := 1 to TotalChunks do
begin
  // fetch one chunk from the stored procedure and render it immediately
  DataTableResults := sp_SomeProcedure(@ChunkNumber = Chunk, @ChunkSize = ChunkSize);
  RenderTableToClient(DataTableResults);
end;
This way, let's say you have 10,000 records and a chunk size of 100: you will make 100 SP calls, and you can render each chunk as it is received, so the user sees the table building up.
The limitation is when the query needs to process all records in one hit first, e.g. a GROUP BY. SQL Server supports OFFSET, so you can combine the two approaches to get something useful.
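For example, a hedged sketch of such a chunking procedure using OFFSET/FETCH (the table and ordering column are placeholders):

CREATE PROCEDURE dbo.sp_SomeProcedure
    @ChunkNumber INT,
    @ChunkSize   INT
AS
BEGIN
    SELECT *
    FROM dbo.SomeLargeTable
    ORDER BY ID    -- OFFSET requires a stable ORDER BY
    OFFSET (@ChunkNumber - 1) * @ChunkSize ROWS
    FETCH NEXT @ChunkSize ROWS ONLY;
END;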
I have queries that run over about 800K records and take about 10 minutes, and this is how I handle them. What I also do is chunk up the source tables and then run the queries per chunk: e.g. if a users table has 1M records and you want a query showing the total pages accessed per hour, you could chunk up the users and run the query for each chunk only.
Sorry I don't have specific code examples, but I hope this suggestion leads you in a positive direction.
