Oracle db stop query if date outside range - database

I have a large database and use a query like this:
...WHERE (DATETIME > '30.10.2014 00:00:00' AND DATETIME < '03.11.2014 00:00:00')
My query is already ordered by the DATETIME field, so is it possible to stop the query once DATETIME < '03.11.2014 00:00:00' stops being true, so that Oracle doesn't need to check the remaining rows? They aren't needed, and skipping them would save time.
Thanks!

You basically have 3 options here (ordered from best to worst):
If you create the table partitioned by the DATETIME column, the optimizer will scan only the relevant partitions (those covering the range 30.10.2014 00:00:00 to 03.11.2014 00:00:00) instead of accessing the entire table.
Create the table as an IOT (Index Organized Table) — this way the table will be stored as an ordered B-Tree.
Create an index on the DATETIME column. This way, when accessing the column, you will scan the ordered index. This option has a major disadvantage: if the data being inserted into this table is "real-time sequential data" (I mean that the DATETIME column values are always increasing [sysdate, for example] rather than random [date of birth, for example]), there will always be a hot block on your B-Tree index. This will cause contention and probably many wait events (depending on the data insertion rate, of course). The way to "solve" this is to create the index as a reverse key index, but then you will have a big problem performing range queries (like the query you've presented here), because the data will be scattered across the index rather than stored sequentially.
So, my best advice for situations like this: work with partitions; it is the best way to work efficiently with large amounts of data in Oracle databases.
If there is not enough data in this table to justify partitions, then the table is not that big and you can consider options #2 and #3.
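As a minimal sketch of option #1, assuming Oracle 11g or later (interval partitioning); the table name and the non-DATETIME columns are made up for illustration:
CREATE TABLE events (
  event_id  NUMBER,
  datetime  DATE NOT NULL,
  payload   VARCHAR2(4000)
)
PARTITION BY RANGE (datetime)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))   -- one partition per day, created automatically
(
  PARTITION p_initial VALUES LESS THAN (TO_DATE('01.01.2014', 'DD.MM.YYYY'))
);

-- A range query like yours then only reads the partitions covering the range:
SELECT *
  FROM events
 WHERE datetime > TO_DATE('30.10.2014 00:00:00', 'DD.MM.YYYY HH24:MI:SS')
   AND datetime < TO_DATE('03.11.2014 00:00:00', 'DD.MM.YYYY HH24:MI:SS');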
Best regards.

Related

Database design for IoT application

Our application shows near-real-time IoT data (up to 5 minute intervals) for our customers' remote equipment.
The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database.
The table looks something like this:
Measurements: (DeviceId, Property, Value, DateTime).
Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:
"Device latest value" (95% of queries): looking at the latest value only
"Device daily snapshot" (5% of queries): looking at a single representative value for each day
We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.
The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.
[EDIT #1: To make it less opinion-based]
What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years worth of "historic" IoT values?
One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
[EDIT #2: In response to feedback]
In our test database, seeded with 50 million records, and with the following index applied:
CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurement (DeviceId ASC, DateTime DESC)
a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:
SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
SELECT MAX(DateTime)
FROM Measurements m2
WHERE m2.DeviceId = m.DeviceId)
This is a very broad question - and as such, it's unlikely you'll get a definitive answer.
However, I have been in a similar situation, and I'll run through my thinking and eventual approach. In summary though - I did option B but in a way to mirror option A: I used a filtered index to 'mimic' the separate smaller table.
My original thinking was to have two tables: one with the 'latest data only' for most reporting, and one with all historical values. An alternative was to have two tables: one with all records, and one with just the latest.
Inserting a new reading would therefore typically need to update at least two rows, if not more (depending on how it's stored).
Instead, I went for a slightly different route:
Put all the data into one table
On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
This filtered index is similar to a second copy of the table with just the latest rows only
The insert process therefore has two steps in a transaction:
'Unflag' the last Latest_Flag for that device, etc
Insert the new row
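A minimal sketch of that two-step insert, assuming stored-procedure parameters for the new reading (the @-variables are illustrative):
BEGIN TRANSACTION;

-- Step 1: unflag the current 'latest' row for this device/property
UPDATE Measurements
   SET Latest_Flag = 0
 WHERE DeviceId = @DeviceId
   AND Property = @Property
   AND Latest_Flag = 1;

-- Step 2: insert the new reading; Latest_Flag defaults to 1
INSERT INTO Measurements (DeviceId, Property, Value, DateTime)
VALUES (@DeviceId, @Property, @Value, @DateTime);

COMMIT TRANSACTION;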
It still makes the writes a bit slower (as it needs to do several row updates as well as index updates) but fundamentally it does the pre-calculation for later reads.
When reading from the table, however, you need to then specify WHERE Latest_Flag = 1. Alternatively, you may want to put it into a view or similar.
For the filtered index, it may be something like
CREATE INDEX ix_measurements_deviceproperty_latest
ON Measurements (DeviceId, Property)
INCLUDE (Value, DateTime, Latest_Flag)
WHERE (Latest_Flag = 1)
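Reads of the latest values then become a cheap seek on that filtered index, for example (@DeviceId being the device you're after):
SELECT DeviceId, Property, Value, DateTime
  FROM Measurements
 WHERE Latest_Flag = 1
   AND DeviceId = @DeviceId;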
Note: another version of this can be done with a trigger, e.g. when a new row is inserted, the trigger invalidates (sets Latest_Flag = 0) any previous rows. That means you don't need the two-step inserts, but you do then rely on business/processing logic living inside triggers.
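A rough sketch of that trigger variant, under the same assumed column names:
CREATE TRIGGER trg_Measurements_Latest
ON Measurements
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Invalidate previously-latest rows for the device/property pairs just inserted
    UPDATE m
       SET m.Latest_Flag = 0
      FROM Measurements m
      JOIN inserted i
        ON i.DeviceId = m.DeviceId
       AND i.Property = m.Property
     WHERE m.Latest_Flag = 1
       AND m.DateTime < i.DateTime;
END;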

How to store 300M records in PostgreSQL to run efficient queries

I have following table:
CREATE TABLE public.shop_prices
(
shop_name text COLLATE pg_catalog."default",
product text COLLATE pg_catalog."default",
product_category text COLLATE pg_catalog."default",
price text COLLATE pg_catalog."default"
)
and for this table I have a dataset covering 18 months. Each file contains about 15M records. I have to do some analysis, like finding in which month a shop increased or decreased its prices. I imported two months into a table and ran the following query just to test:
select shop, product from shop_prices group by shop, product limit 10
I waited more than 5 minutes, but got no result or response; it was still working. What is the best way to store these datasets and run efficient queries? Is it a good idea to create a separate table for each dataset?
Using explain analyze select shop_name, product from shop_prices group by shop_name, product limit 10 you can see how Postgres is planning and executing the query and how long the execution takes. You'll see it needs to read the whole table (with time-consuming disk reads) and then sort it in memory - which will probably need to be spilled to disk - before returning the results. On the next run you might discover the same query is very snappy if the number of shop_name+product combinations is very limited and the data is already cached after that explain analyze. The point being that a simple query like this can be deceiving.
You will get faster execution by creating an index on the columns you are using (create index shop_prices_shop_prod_idx on public.shop_prices(shop_name, product)).
You should definitely change the price column type to numeric (or float/float8) if you plan to do any numerical calculations on it.
Having said all that, I suspect this table is not what you will be using as it does not have any timestamp to compare prices between months to begin with.
I suggest you complete the table design and then consider which indexes will improve performance. You might even want to consider table partitioning: https://www.postgresql.org/docs/current/ddl-partitioning.html
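As a rough sketch of such a completed design, assuming PostgreSQL 11 or later and an added price_date column (the column name and partition bounds are mine, not yours):
CREATE TABLE public.shop_prices
(
    shop_name        text,
    product          text,
    product_category text,
    price            numeric,
    price_date       date NOT NULL
) PARTITION BY RANGE (price_date);

-- one partition per month, for example
CREATE TABLE public.shop_prices_2020_01 PARTITION OF public.shop_prices
    FOR VALUES FROM ('2020-01-01') TO ('2020-02-01');

CREATE INDEX shop_prices_shop_prod_idx
    ON public.shop_prices (shop_name, product);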
You will probably be doing all sorts of queries on this data so there is no simple solution to them all.
By all means come back with more specific questions, including a complete table description and the explain analyze output for the queries you are trying out, and you will get some good advice.
Best regards,
Bjarni
What is your PostgreSQL version?
First there is a typo: column shop should be shop_name.
Second, your query looks strange because it has only a LIMIT clause without any ORDER BY or WHERE clause: do you really want to get "random" rows from this query?
Can you try to post EXPLAIN output for the SQL statement:
explain select shop_name, product from shop_prices group by shop_name, product limit 10;
Can you also check if any statistics have been computed for this table with:
select * from pg_stats where tablename='shop_prices';

Why is this arithmetic calculation faster than casting nvarchar?

The following query is faster (1:05):
SELECT DATEPART(DW,DATEFROMPARTS(
FLOOR(20180501/10000)
,FLOOR(20180501-FLOOR(20180501/10000)*10000)/100
,FLOOR(20180501-FLOOR(20180501/100)*100)))
GO 1000
Than (1:10):
SELECT DATEPART(DW,CAST(CAST(20180501 AS nvarchar) AS DATE))
GO 1000
Why?
I have a table with roughly 2 billion records, so the difference becomes important. There is far more logic behind the hardcoded date. Otherwise, if there is a better approach, in terms of performance, for executing the same logic, feel free to correct me.
The date column is always an integer, but does not always have the same format. Two formats are used: YYYYMMDD and YYYYMM. I know, a bit of a mess.
Thanks!
The actual task: delete duplicate rows when the first day of the month (YYYYMM01) is a Monday.
If you want to speed up the delete, create a temporary table (or a permanent one if this is a recurring operation) with a column of the same data type as your table's "date" column, containing all first Mondays of each month across XX years. Make sure the data is in the same format you mentioned in your question, and be sure that this column has a (clustered) index. Now use this table in your query as the filter, without doing any conversions, which allows SQL Server to take advantage of any indexes that exist on the existing table's "date" column.
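A minimal sketch of that idea, assuming the big table is called dbo.BigTable with an int DateKey column in YYYYMMDD format (all of these names are made up, and the duplicate-detection logic is omitted):
-- Lookup table of first Mondays, stored as int YYYYMMDD to match the fact table's format
CREATE TABLE dbo.FirstMondays (DateKey INT NOT NULL PRIMARY KEY CLUSTERED);

-- Populate it for the years you need (assumes an English language setting for DATENAME)
;WITH d AS (
    SELECT CAST('2000-01-01' AS DATE) AS dt
    UNION ALL
    SELECT DATEADD(DAY, 1, dt) FROM d WHERE dt < '2030-12-31'
)
INSERT INTO dbo.FirstMondays (DateKey)
SELECT CONVERT(INT, CONVERT(CHAR(8), dt, 112))
  FROM d
 WHERE DAY(dt) = 1
   AND DATENAME(WEEKDAY, dt) = 'Monday'
OPTION (MAXRECURSION 0);

-- The delete can then join on the lookup table with no conversion on the big table's column
DELETE t
  FROM dbo.BigTable t
  JOIN dbo.FirstMondays fm
    ON fm.DateKey = t.DateKey;   -- add the duplicate-detection conditions here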

Query performance from date functions VS Date dimension table

I have a table Transactions which has a columnstore index and stores 74,445,000 rows.
I have a query like the one below:
SELECT
CustomerNumber,
MONTH(CreationDate) AS m,
YEAR(CreationDate) AS y,
CurrencyID AS c
FROM Transactions
I am wondering whether doing a JOIN on a date dimension table, which contains the month and year for all dates, may be better than the above query, which uses SQL date functions.
Can anyone verify this assumption and/or point to a resource that provides details?
Any alteration of the original column will have an impact on query performance. For the calculation of month & year in your query, you should get a very efficient Batch Execution Mode, which will make the alternatives look quite pale.
Plus, if your join is done on an integer/bigint column, then you might be able to get Segment Elimination, which should improve your query performance, but joining on any string column will make the query last an incredibly long time in comparison with the int data types.
In other words: unnecessary headaches.
There are no big alternatives since Columnstore Indexes do not support computed columns yet (SQL Server 2016).
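For reference, if Transactions carried an integer date key, the date dimension variant discussed above might look like the sketch below; DimDate, CreationDateKey, and the dimension's column names are assumptions, not objects from the question.
SELECT
    t.CustomerNumber,
    d.MonthNumber AS m,
    d.YearNumber  AS y,
    t.CurrencyID  AS c
FROM Transactions t
JOIN DimDate d
    ON d.DateKey = t.CreationDateKey;   -- integer join: a candidate for segment elimination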

SSIS upserting 10^8 rows - process by batch?

I have to gather a large volume of data from various SQL Server tables (around 300 million rows) and upsert it into a single fact table in my data warehouse.
1/ What is the best strategy for importing all these rows?
2/ Is it good practice to import in batches? How big should a batch be? Is 10k rows OK?
The way that I designed this was for data movement between 3 different layers:
Landing Area
Staging area (where most of the look ups and key substitutions happened)
Data Warehouse
We created bulk tables in the landing area without any sort of keys or anything on them. We would simply land the data in that area and then move it further along the system.
The way I designed the package was to create 2 very simple tables in SQL Server with 4 columns each. The first table I called ToBeProcessed and the 2nd (quite obviously) Processed.
The columns that I had were
1) dbo.ToBeProcessed
(
    ID INT IDENTITY(1,1),
    BeginDate DATETIME,
    EndDate DATETIME,
    Processed VARCHAR(1)
)
2) dbo.Processed
(
    ID INT IDENTITY(1,1),
    ProcessedEndDate DATETIME,
    TableName VARCHAR(24),
    CompletedDateTime DATETIME
)
What I did was populate the ToBeProcessed table with date ranges spanning a week each. For example, the 1st row would be from 01/01/2014 to 01/07/2014, the next row from 01/08/2014 to 01/15/2014, and so on. This makes sure that you don't overlap any piece of data that you are pulling in.
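A small sketch of populating ToBeProcessed with those weekly ranges (the date window and the 'N' flag value are just examples):
DECLARE @start DATETIME = '2014-01-01';
DECLARE @stop  DATETIME = '2015-01-01';

WHILE @start < @stop
BEGIN
    INSERT INTO dbo.ToBeProcessed (BeginDate, EndDate, Processed)
    VALUES (@start, DATEADD(DAY, 6, @start), 'N');   -- e.g. 01/01/2014 - 01/07/2014

    SET @start = DATEADD(DAY, 7, @start);            -- next week starts the following day
END;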
On the SSIS side you would want to create a For Each Loop container and parse through all the dates in the 1st table one by one. You can parameterize your Data Flow task with the variables you create to store the dates from the For Each Loop container. Every time a week's worth of data gets processed, you simply insert the end date into your 2nd table.
This makes sure that you have a record of the data you have processed. The reason for doing this is that if the package fails for any reason, you can restart from the point of failure without re-pulling all the data that you have already processed (I think in your case you may want to turn the T-Logs off if you are not working in a production environment).
As for upserting, I think using a MERGE statement could be an option, but it all depends on your processing time frames. If you are looking to turn this around over a weekend, I would suggest using a stored proc on the data set and making sure that your log tables can grow comfortably with that amount of data.
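If you do try the MERGE route, a bare-bones version might look like the following; the fact table, staging table, and column names are placeholders:
MERGE dbo.FactTable AS tgt
USING dbo.StagingBatch AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED THEN
    UPDATE SET tgt.Amount      = src.Amount,
               tgt.UpdatedDate = src.UpdatedDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Amount, UpdatedDate)
    VALUES (src.BusinessKey, src.Amount, src.UpdatedDate);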
This is a brief summary of the quick and dirty way that worked for me. That does not mean it's the best method out there, but it certainly got the job done for me. Let me know if you have any questions.
