Need to optimize Teradata query

I am trying to optimize the following Teradata query. Can anyone please help with this? It is taking a lot of time to retrieve records.
select top 100 I.item_sku_nbr,L.loc_nbr,MIS.MVNDR_PRTY_ID from
QA_US_MASTER_VIEWS.item I,
qa4_US_MASTER_VIEWS.location L,
qa4_US_MASTER_VIEWS.item_str IST,
qa4_US_MASTER_VIEWS.mvndr_item_str MIS
where MIS.str_LOC_ID = L.loc_id and
mis.str_loc_id = IST.str_loc_id and
IST.str_loc_id = L.loc_id and
MIS.ITEM_STAT_CD = IST.ITEM_STAT_CD and
IST.ITEM_ID = I.ITEM_ID and
MIS.ITEM_ID = IST.ITEM_ID and
I.ITEM_STAT_CD = 100 and
IST.curr_rmeth_cd = 2 and
MIS.curr_dsvc_typ_cd = 3 and
MIS.OK_TO_ORD_FLG = 'Y' and
MIS.EFF_END_DT = DATE '9999-12-31' and
IST.EFF_END_DT = DATE '9999-12-31' and
MIS.ACTV_FLG ='Y' and
IST.ACTV_FLG ='Y' and I.ACTV_FLG='Y'
Explain plan for QA_US_MASTER.LOCATION:
1) First, we lock QA_US_MASTER.LOCATION in view
qa4_US_MASTER_VIEWS.Location for access.
2) Next, we do an all-AMPs RETRIEVE step from QA_US_MASTER.LOCATION
in view qa4_US_MASTER_VIEWS.Location by way of an all-rows scan
with no residual conditions into Spool 1 (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
high confidence to be 10,903 rows (1,613,644 bytes). The
estimated time for this step is 0.01 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.01 seconds.
Explain plan for qa4_US_MASTER_VIEWS.item_str :
1) First, we lock QA_US_MASTER.item_str in view
qa4_US_MASTER_VIEWS.item_str for access.
2) Next, we do an all-AMPs RETRIEVE step from QA_US_MASTER.item_str
in view qa4_US_MASTER_VIEWS.item_str by way of an all-rows scan
with no residual conditions into Spool 1 (group_amps), which is
built locally on the AMPs. The input table will not be cached in
memory, but it is eligible for synchronized scanning. The result
spool file will not be cached in memory. The size of Spool 1 is
estimated with low confidence to be 1,229,047,917 rows (
325,697,698,005 bytes). The estimated time for this step is 4
minutes and 51 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 4 minutes and 51 seconds.
Explain plan for QA_US_MASTER.ITEM:
1) First, we lock QA_US_MASTER.ITEM in view qa4_US_MASTER_VIEWS.item
for access.
2) Next, we do an all-AMPs RETRIEVE step from QA_US_MASTER.ITEM in
view qa4_US_MASTER_VIEWS.item by way of an all-rows scan with no
residual conditions into Spool 1 (group_amps), which is built
locally on the AMPs. The size of Spool 1 is estimated with high
confidence to be 1,413,284 rows (357,560,852 bytes). The
estimated time for this step is 0.40 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.40 seconds.
Explain plan for QA_US_MASTER.MVNDR_ITEM_STR:
1) First, we lock QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str for access.
2) Next, we do an all-AMPs RETRIEVE step from
QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str by way of an all-rows scan with
no residual conditions into Spool 1 (group_amps), which is built
locally on the AMPs. The input table will not be cached in memory,
but it is eligible for synchronized scanning. The result spool
file will not be cached in memory. The size of Spool 1 is
estimated with high confidence to be 1,316,279,746 rows (
327,753,656,754 bytes). The estimated time for this step is 6
minutes and 4 seconds.
3) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 6 minutes and 4 seconds.
Explain plan for Whole query:
1) First, we lock QA_US_MASTER.ITEM in view QA_US_MASTER_VIEWS.item
for access, we lock QA_US_MASTER.LOCATION in view
qa4_US_MASTER_VIEWS.location for access, we lock
QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str for access, and we lock
QA_US_MASTER.item_str in view qa4_US_MASTER_VIEWS.item_str for
access.
2) Next, we execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from QA_US_MASTER.LOCATION in
view qa4_US_MASTER_VIEWS.location by way of an all-rows scan
with no residual conditions into Spool 3 (all_amps)
(compressed columns allowed), which is duplicated on all AMPs.
The size of Spool 3 is estimated with high confidence to be
1,013,979 rows (20,279,580 bytes). The estimated time for
this step is 0.03 seconds.
2) We do an all-AMPs RETRIEVE step from QA_US_MASTER.ITEM in
view QA_US_MASTER_VIEWS.item by way of an all-rows scan with
a condition of ("(QA_US_MASTER.ITEM in view
QA_US_MASTER_VIEWS.item.ITEM_STAT_CD = 100) AND
(QA_US_MASTER.ITEM in view QA_US_MASTER_VIEWS.item.ACTV_FLG =
'Y')") into Spool 4 (all_amps) (compressed columns allowed)
fanned out into 14 hash join partitions, which is duplicated
on all AMPs. The size of Spool 4 is estimated with low
confidence to be 30,819,363 rows (678,025,986 bytes). The
estimated time for this step is 0.81 seconds.
3) We do an all-AMPs JOIN step from Spool 3 (Last Use) by way of an
all-rows scan, which is joined to QA_US_MASTER.item_str in view
qa4_US_MASTER_VIEWS.item_str by way of an all-rows scan with a
condition of
("(QA_US_MASTER.item_str in view
qa4_US_MASTER_VIEWS.item_str.CURR_RMETH_CD = 2) AND
((QA_US_MASTER.item_str in view
qa4_US_MASTER_VIEWS.item_str.EFF_END_DT = DATE '9999-12-31') AND
(QA_US_MASTER.item_str in view
qa4_US_MASTER_VIEWS.item_str.ACTV_FLG = 'Y'))"). Spool 3 and
QA_US_MASTER.item_str are joined using a dynamic hash join, with a
join condition of ("QA_US_MASTER.item_str.STR_LOC_ID = LOC_ID").
The input table QA_US_MASTER.item_str will not be cached in memory.
The result goes into Spool 5 (all_amps) (compressed columns
allowed), which is built locally on the AMPs into 14 hash join
partitions. The size of Spool 5 is estimated with no confidence
to be 69,133,946 rows (2,419,688,110 bytes). The estimated time
for this step is 1 minute and 8 seconds.
4) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an
all-rows scan, which is joined to Spool 5 (Last Use) by way of an
all-rows scan. Spool 4 and Spool 5 are joined using a hash join
of 14 partitions, with a join condition of ("(ITEM_ID = ITEM_ID)
AND (ACTV_FLG = ACTV_FLG)"). The result goes into Spool 6
(all_amps) (compressed columns allowed), which is redistributed by
the hash code of (QA_US_MASTER.item_str.STR_LOC_ID,
QA_US_MASTER.item_str.ITEM_STAT_CD, QA_US_MASTER.item_str.ITEM_ID,
QA_US_MASTER.ITEM.ITEM_ID, QA_US_MASTER.LOCATION.LOC_ID) to all
AMPs into 33 hash join partitions. The size of Spool 6 is
estimated with no confidence to be 36,434,893 rows (1,603,135,292
bytes). The estimated time for this step is 9.11 seconds.
5) We do an all-AMPs RETRIEVE step from QA_US_MASTER.MVNDR_ITEM_STR
in view qa4_US_MASTER_VIEWS.mvndr_item_str by way of an all-rows
scan with a condition of ("(QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str.CURR_DSVC_TYP_CD = 3) AND
((QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str.EFF_END_DT = DATE '9999-12-31')
AND ((QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str.ACTV_FLG = 'Y') AND
(QA_US_MASTER.MVNDR_ITEM_STR in view
qa4_US_MASTER_VIEWS.mvndr_item_str.OK_TO_ORD_FLG = 'Y')))") into
Spool 7 (all_amps) (compressed columns allowed) fanned out into 33
hash join partitions, which is redistributed by the hash code of (
QA_US_MASTER.MVNDR_ITEM_STR.ITEM_ID,
QA_US_MASTER.MVNDR_ITEM_STR.STR_LOC_ID,
QA_US_MASTER.MVNDR_ITEM_STR.ITEM_STAT_CD,
QA_US_MASTER.MVNDR_ITEM_STR.ITEM_ID,
QA_US_MASTER.MVNDR_ITEM_STR.STR_LOC_ID) to all AMPs. The input
table will not be cached in memory, but it is eligible for
synchronized scanning. The size of Spool 7 is estimated with no
confidence to be 173,967,551 rows (5,914,896,734 bytes). The
estimated time for this step is 2 minutes and 23 seconds.
6) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an
all-rows scan, which is joined to Spool 7 (Last Use) by way of an
all-rows scan. Spool 6 and Spool 7 are joined using a hash join
of 33 partitions, with a join condition of ("(STR_LOC_ID =
STR_LOC_ID) AND ((ITEM_STAT_CD = ITEM_STAT_CD) AND ((ITEM_ID =
ITEM_ID) AND ((ACTV_FLG = OK_TO_ORD_FLG) AND ((ACTV_FLG = ACTV_FLG)
AND ((EFF_END_DT = EFF_END_DT) AND ((ACTV_FLG = ACTV_FLG) AND
((OK_TO_ORD_FLG = ACTV_FLG) AND ((ITEM_ID = ITEM_ID) AND
(STR_LOC_ID = LOC_ID )))))))))"). The result goes into Spool 2
(all_amps) (compressed columns allowed), which is built locally on
the AMPs. The size of Spool 2 is estimated with no confidence to
be 12,939,628 rows (336,430,328 bytes). The estimated time for
this step is 4.00 seconds.
7) We do an all-AMPs STAT FUNCTION step from Spool 2 by way of an
all-rows scan into Spool 10, which is redistributed by hash code
to all AMPs. The result rows are put into Spool 1 (group_amps),
which is built locally on the AMPs. This step is used to retrieve
the TOP 100 rows. Load distribution optimization is used.
If this step retrieves less than 100 rows, then execute step 8.
The size is estimated with no confidence to be 100 rows (3,200
bytes).
8) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 10 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 1 (group_amps), which is built locally on the AMPs.
This step is used to retrieve the TOP 100 rows. The size is
estimated with no confidence to be 100 rows (3,200 bytes).
9) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

There's no ORDER BY in your query, so you just want 100 random rows?
In Teradata, TOP is applied after the full result set has been created. You should move the TOP into a derived table, like:
select I.item_sku_nbr,L.loc_nbr,MIS.MVNDR_PRTY_ID from
QA_US_MASTER_VIEWS.item I,
qa4_US_MASTER_VIEWS.location L,
(SELECT TOP 100 * FROM qa4_US_MASTER_VIEWS.item_str) IST,
qa4_US_MASTER_VIEWS.mvndr_item_str MIS
where MIS.str_LOC_ID = L.loc_id and
mis.str_loc_id = IST.str_loc_id and
IST.str_loc_id = L.loc_id and
MIS.ITEM_STAT_CD = IST.ITEM_STAT_CD and
IST.ITEM_ID = I.ITEM_ID and
MIS.ITEM_ID = IST.ITEM_ID and
I.ITEM_STAT_CD = 100 and
IST.curr_rmeth_cd = 2 and
MIS.curr_dsvc_typ_cd = 3 and
MIS.OK_TO_ORD_FLG = 'Y' and
MIS.EFF_END_DT = DATE '9999-12-31' and
IST.EFF_END_DT = DATE '9999-12-31' and
MIS.ACTV_FLG ='Y' and
IST.ACTV_FLG ='Y' and I.ACTV_FLG='Y'
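One caveat (my addition, not part of the original answer): the derived table picks 100 arbitrary item_str rows before the other predicates and joins are applied, so the final result may come back with fewer than 100 rows. A sketch that keeps the item_str filters from the original query inside the derived table, so the 100 rows at least satisfy them:
select I.item_sku_nbr,L.loc_nbr,MIS.MVNDR_PRTY_ID from
QA_US_MASTER_VIEWS.item I,
qa4_US_MASTER_VIEWS.location L,
-- pre-filter item_str with its own predicates, then take 100 rows
(SELECT TOP 100 *
 FROM qa4_US_MASTER_VIEWS.item_str
 WHERE curr_rmeth_cd = 2
 AND EFF_END_DT = DATE '9999-12-31'
 AND ACTV_FLG = 'Y') IST,
qa4_US_MASTER_VIEWS.mvndr_item_str MIS
where MIS.str_LOC_ID = L.loc_id and
MIS.str_loc_id = IST.str_loc_id and
IST.str_loc_id = L.loc_id and
MIS.ITEM_STAT_CD = IST.ITEM_STAT_CD and
IST.ITEM_ID = I.ITEM_ID and
MIS.ITEM_ID = IST.ITEM_ID and
I.ITEM_STAT_CD = 100 and
I.ACTV_FLG = 'Y' and
MIS.curr_dsvc_typ_cd = 3 and
MIS.OK_TO_ORD_FLG = 'Y' and
MIS.EFF_END_DT = DATE '9999-12-31' and
MIS.ACTV_FLG = 'Y'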

Related

Why is the cost for the Sort operator in this execution plan so high?

The Sort operator says it's only got 100 rows to sort. How could that possibly be more expensive than reading 1.9 million rows? I must be reading this wrong or misunderstanding something.
Also, how is the Estimated Number of Rows Per Execution in the Sort operator only 100? If the Index Seek operator estimates the Number of Rows Per Execution to be 1.9 million, how do only 100 rows get piped over to the Sort operator?
Here is the query:
DECLARE @PageIndex INT = 1000;
DECLARE @PageCount INT = 1000;
SELECT ID
FROM dbo.Table1
WHERE DateCreated >= '2021-10-27'
AND DateCreated < '2021-10-28'
ORDER BY ID
OFFSET @PageIndex * @PageCount ROWS FETCH NEXT @PageCount ROWS ONLY
The Sort operator (as opposed to "Top N Sort") will sort the entirety of its input in its open method before returning any rows.
SQL Server estimates that the seek will output 1.9 million rows that go into the sort.
The costing is therefore for sorting 1.9 million rows.
You are doing
OFFSET 1000000 ROWS FETCH NEXT 1000 ROWS ONLY
The actual output rows from the sort will be at least 1,001,000 (maybe more in a parallel plan) and the TOP operator discards the first million for the offset and then stops requesting rows after it has received the 1000 to be returned.
The estimate of 100 is just a guess as SQL Server has no idea what the value of the variables will be at runtime when the plan is compiled.
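Not from the original answer, but to make that last point concrete: because @PageIndex and @PageCount are local variables, the plan is compiled with a generic guess instead of their actual values. A recompile hint (standard T-SQL) lets the optimizer see the runtime values, so the estimates reflect the real offset; it does not remove the need for the full sort:
DECLARE @PageIndex INT = 1000;
DECLARE @PageCount INT = 1000;

SELECT ID
FROM dbo.Table1
WHERE DateCreated >= '2021-10-27'
AND DateCreated < '2021-10-28'
ORDER BY ID
OFFSET @PageIndex * @PageCount ROWS FETCH NEXT @PageCount ROWS ONLY
OPTION (RECOMPILE); -- plan is compiled with the actual variable values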

How to speed up LIMIT & OFFSET query in a big database

I have a table which can contain up to billions of rows:
CREATE TABLE "Log4DataUsb" (
"Time" integer primary key not null ,
"Microseconds" integer ,
"Current" integer ,
"Voltage" integer )
Usually a user will want to query the data within a specific range, for example Time <= 123456789 and Time >= 0. Because this may return billions of rows, I want to segment the rows and only return a batch each time, like LIMIT 10000, then LIMIT 10000 OFFSET X, until it reaches the end of this time-range query.
I notice that as the number of rows goes up, this query can be quite slow; executing the queries below takes seconds even though I just want to move to the next batch.
SELECT * FROM Log4DataUsb WHERE Time <= 123456789 AND Time >= 0 LIMIT 10000
SELECT * FROM Log4DataUsb WHERE Time <= 123456789 AND Time >= 0 LIMIT 10000 OFFSET 10000
If the database is supposed to have 2 billion rows in total, is there any way to significantly improve the query performance?
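No answer is recorded here, but a common approach for this pattern (a sketch, not from the original post) is keyset pagination: because Time is the integer primary key, you can remember the last Time value returned and seek past it instead of using OFFSET, which stays fast no matter how deep you page:
-- first batch
SELECT * FROM Log4DataUsb
WHERE Time >= 0 AND Time <= 123456789
ORDER BY Time
LIMIT 10000;

-- next batch: :last_time is the largest Time value from the previous batch
SELECT * FROM Log4DataUsb
WHERE Time > :last_time AND Time <= 123456789
ORDER BY Time
LIMIT 10000;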

SQL Server - poor performance during Insert transaction

I have a stored procedure which executes a query and returns the row into variables, like below:
SELECT @item_id = I.ID, @label_ID = SL.label_id
FROM tb_A I
LEFT JOIN tb_B SL ON I.ID = SL.item_id
WHERE I.NUMBER = @VAR
I have an IF to check whether @label_ID is null. If it is null, it goes to the INSERT statement; otherwise it goes to the UPDATE statement. Let's focus on the INSERT, where I know I'm having problems. The INSERT part is like below:
IF @label_ID IS NULL
BEGIN
    INSERT INTO tb_B (item_id, label_qrcode, label_barcode, data_leitura, data_inclusao)
    VALUES (@item_id, @label_qrcode, @label_barcode, @data_leitura, GETDATE())
END
So, tb_B has a PK on the ID column and an FK on the item_id column, which references the ID column of tb_A.
I ran SQL Server Profiler and saw that the duration of this stored procedure is sometimes around 2,300 ms, while the normal average is 16 ms.
I looked at the execution plan, and the biggest cost is in the "Clustered Index Insert" operator, shown below:
Estimated Execution Plan
Actual Execution Plan
Details
More details about the tables:
tb_A Storage:
Index space: 6.853,188 MB
Row count: 45988842
Data space: 5.444,297 MB
tb_B Storage:
Index space: 1.681,688 MB
Row count: 15552847
Data space: 1.663,281 MB
Statistics for INDEX 'PK_tb_B'.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name Updated Rows Rows Sampled Steps Density Average Key Length String Index
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
PK_tb_B Sep 23 2018 2:30AM 15369616 15369616 5 1 4 NO 15369616
All Density Average Length Columns
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6.506343E-08 4 id
Histogram Steps
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 0 1 0 1
8192841 8192198 1 8192198 1
8270245 65535 1 65535 1
15383143 7111878 1 7111878 1
15383144 0 1 0 1
Statistics for INDEX 'IDX_tb_B_ITEM_ID'.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Name Updated Rows Rows Sampled Steps Density Average Key Length String Index
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IDX_tb_B_ITEM_ID Sep 23 2018 2:30AM 15369616 15369616 12 1 7.999424 NO 15369616
All Density Average Length Columns
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6.50728E-08 3.999424 item_id
6.506343E-08 7.999424 item_id, id
Histogram Steps
RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 2214 0 1
16549857 0 1 0 1
29907650 65734 1 65734 1
32097131 131071 1 131071 1
32296132 196607 1 196607 1
32406913 98303 1 98303 1
40163331 7700479 1 7700479 1
40237216 65535 1 65535 1
47234636 6946815 1 6946815 1
47387143 131071 1 131071 1
47439431 31776 1 31776 1
47439440 0 1 0 1
PK_tb_B Index fragmentation
IDX_tb_B_Item_ID
Are there any best practices I can apply to make this execution duration stable?
Hope you can help me!
Thanks in advance...
The problem is probably the data type of the clustered index. Clustered indexes store the table data ordered by the key values, and by default the primary key is created with a clustered index. That is often the best place for it,
but not always. If you have, for example, a clustered index over an NVARCHAR column, every INSERT needs to find the right place for the new record. If your table has one million rows ordered alphabetically and the new record starts with A, the clustered index has to make room among the records from B to Z to fit the new record into the A group; if the new record starts with Z, fewer records are affected, but that still isn't ideal. If you don't have a column that lets you insert new records sequentially, you can create an identity column for this, or use another column that is logically sequential for any transaction entered regardless of the system, for example a datetime column that records the time at which the insert occurs.
If you want more info, please check this Microsoft documentation.
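To make that concrete, here is a minimal sketch of the pattern described above (table and column names are hypothetical, not the poster's schema): an ever-increasing identity column carries the clustered index, so new rows always land at the end, while the column used for lookups gets a nonclustered index:
CREATE TABLE dbo.LabelExample
(
    id            INT IDENTITY(1, 1) NOT NULL,
    label_qrcode  NVARCHAR(200) NOT NULL,
    data_inclusao DATETIME NOT NULL DEFAULT (GETDATE()),
    CONSTRAINT PK_LabelExample PRIMARY KEY CLUSTERED (id) -- sequential key: inserts append at the end
);

-- the lookup column is indexed without carrying the clustered index
CREATE NONCLUSTERED INDEX IX_LabelExample_label_qrcode
    ON dbo.LabelExample (label_qrcode);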

MS SQL: adding rows whenever the level changes by more than 1, so that every row has a difference of 1 between start_level and end_level

(This is my first Stack Overflow question, so please let me know how I could pose it better if anything is unclear.)
I have a table of around 500 people (users) who are going up stairs from floor x (x = 0, max(y) = 50). A person can climb zero, one, or many levels in a single go; each go corresponds to a single row of the table, along with the time taken in seconds.
I want to find the average time taken to go from floor a to floor a+1, where a is any floor number. To do so, I intend to split every row of the table into rows where start_level + 1 = end_level. The duration will be divided equally, as shown in the EXPECTED OUTPUT table for user b.
GIVEN TABLE INPUT
start_level  end_level  duration  user
1            1          10        a
1            2          5         a
2            5          27        b
5            6          3         c
EXPECTED OUTPUT
start_level  end_level  duration  user
1            1          10        a
1            2          5         a
2            3          27/3      b
3            4          27/3      b
4            5          27/3      b
5            6          3         c
Note: level jumps are in integers only.
After getting the expected output, I can simply compute sum(duration)/count(distinct users) grouped by start_level to get the average time taken to climb one floor up from each floor.
Any help is appreciated.
You can use a Numbers table to "create" the incremental steps. Here's my setup:
CREATE TABLE #floors
(
[start_level] INT,
[end_level] INT,
[duration] DECIMAL(10, 4),
[user] VARCHAR(50)
)
INSERT INTO #floors
([start_level],
[end_level],
[duration],
[user])
VALUES (1,1,10,'a'),
(1,2,5,'a'),
(2,5,27,'b'),
(5,6,3,'c')
Then, using a Numbers table and some LEFT JOIN/COALESCE logic:
-- Create a Numbers table
;WITH Numbers_CTE
AS (SELECT TOP 50 [Number] = ROW_NUMBER()
OVER(
ORDER BY (SELECT NULL))
FROM sys.columns)
SELECT [start_level] = COALESCE(n.[Number], f.[start_level]),
[end_level] = COALESCE(n.[Number] + 1, f.[end_level]),
[duration] = CASE
WHEN f.[end_level] = f.[start_level] THEN f.[duration]
ELSE f.[duration] / ( f.[end_level] - f.[start_level] )
END,
f.[user]
FROM #floors f
LEFT JOIN Numbers_CTE n
ON n.[Number] BETWEEN f.[start_level] AND f.[end_level] - 1
AND f.[end_level] - f.[start_level] > 1
Here are the logical steps:
LEFT JOIN the Numbers table for cases where end_level >= start_level + 2 (this has the effect of giving us multiple rows - one for each incremental step)
new start_level = If the LEFT JOIN "completes": take Number from the Numbers table, else: take the original start_level
new end_level = If the LEFT JOIN "completes": take Number + 1, else: take the original end_level
new duration = If end_level = start_level: take the original duration (to avoid divide by 0), else: take the average duration over end_level - start_level
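For the follow-up the asker mentions (sum(duration)/count(distinct users) per start_level), a sketch that wraps the query above in a CTE and aggregates; it reuses the #floors setup and the same join (upper bound end_level - 1), and is only meant to illustrate that final step:
;WITH Numbers_CTE
AS (SELECT TOP 50 [Number] = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.columns),
Steps
AS (SELECT [start_level] = COALESCE(n.[Number], f.[start_level]),
           [duration] = CASE
                          WHEN f.[end_level] = f.[start_level] THEN f.[duration]
                          ELSE f.[duration] / (f.[end_level] - f.[start_level])
                        END,
           f.[user]
    FROM #floors f
    LEFT JOIN Numbers_CTE n
      ON n.[Number] BETWEEN f.[start_level] AND f.[end_level] - 1
     AND f.[end_level] - f.[start_level] > 1)
-- the asker's proposed metric: total duration divided by distinct users, per start_level
SELECT [start_level],
       SUM([duration]) / COUNT(DISTINCT [user]) AS avg_duration
FROM Steps
GROUP BY [start_level]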

Fill per-second data in q/KDB+

I have a CSV file with some high-frequency stock price data, and I'd like to derive per-second price data from the table.
In each file there are columns named date, time, symbol, price, volume, and so on.
There are some seconds with no trading, so data is missing for those seconds.
I'm wondering how I could fill the missing data in q to get complete per-second data from 9:30 to 16:00. If the price is missing for a second, just use the most recent price as the price for that second.
I was considering writing a loop, but I don't know exactly how to do that.
Simplifying a little, I'll assume you have some random timestamps in your dataset like this:
time price
--------------------------------------
2015.01.20D22:42:34.776607000 7
2015.01.20D22:42:34.886607000 3
2015.01.20D22:42:36.776607000 4
2015.01.20D22:42:37.776607000 8
2015.01.20D22:42:37.886607000 7
2015.01.20D22:42:39.776607000 9
2015.01.20D22:42:40.776607000 4
2015.01.20D22:42:41.776607000 9
so there are some missing seconds there. I'm going to call this table t. So if you do a by-second type of query, obviously the seconds that are missing are still missing:
q)select max price by time.second from t
second | price
--------| -----
22:42:34| 7
22:42:36| 4
22:42:37| 8
22:42:39| 9
22:42:40| 4
22:42:41| 9
To get missing seconds, you have to join a list of nulls. In this case we know the data goes from 22:42:34 to 22:42:41, but in reality you'll have to find the min/max time and use that to create a temporary "null" table to join against:
q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N)
second price
--------------
22:42:34
22:42:35
22:42:36
22:42:37
22:42:38
22:42:39
22:42:40
22:42:41
Then left join:
q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second price
--------------
22:42:34 7
22:42:35
22:42:36 4
22:42:37 8
22:42:38
22:42:39 9
22:42:40 4
22:42:41 9
You can use fills or whatever your favourite filling heuristic is after that.
q)fills `second xasc asc ([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second price
--------------
22:42:34 7
22:42:35 7
22:42:36 4
22:42:37 8
22:42:38 8
22:42:39 9
22:42:40 4
22:42:41 9
(Note the sort on second before fills!)
By the way, for larger tables this will be much faster than a loop; loops in q are generally a bad idea.
EDIT
You could use a comma join too; both tables need to be keyed on the second column:
t,t1
(where t1 is the null-filled table keyed on second)
I haven't tested it, but I suspect it would be slightly faster than the lj version.
Using aj which is one of the most powerful features of KDB:
q)data
sym time price size
----------------------------
MS 10:24:04 93.35974 8
MS 10:10:47 4.586986 1
APPL 10:50:23 0.7831685 1
GOOG 10:19:52 49.17305 0
The in-memory table needs to be sorted by sym and time, with the g# attribute applied to the sym column:
q)data:update `g#sym from `sym`time xasc data
q)meta data
c | t f a
-----| -----
sym | s g
time | v
price| f
size | j
Creating a rack table intervalized per second per sym :
q)rack: `sym`time xasc (select distinct sym from data) cross ([] time:{x[0]+til `int$x[1]-x[0]}(min;max)#\:data`time)
Using aj to join the data :
q)aj[`sym`time; rack; data]
