My team is using Flink SQL to build some of our pipelines.
For simplicity, let's assume there are two independent pipelines:
The first one listens to events on input_stream, stores enriched data in feature_jdbc_table, and then emits a feature_updates event indicating that the feature was updated.
The second one listens to feature_updates and, when certain events come in, fetches data from feature_jdbc_table to do calculations.
Here are the source/sink definitions:
CREATE TABLE input_stream (
user_id STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE feature_updates (
event_name STRING
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE feature_jdbc_table (
user_id STRING,
checked_at TIMESTAMP(3)
) WITH (
'connector' = 'jdbc',
-- ...
)
And here is how the sink statements look:
INSERT INTO feature_jdbc_table
SELECT user_id, event_time
FROM input_stream
-- NEXT SQL --
INSERT INTO feature_updates
SELECT 'my_feature_was_updated'
FROM input_stream
The issue is that we often run into a race condition: the feature_updates event is created before the feature_jdbc_table record is committed.
Is there a way to emit feature_updates events only after the feature_jdbc_table records are committed?
I poked around with the idea of reading data from feature_jdbc_table when emitting feature_updates, but it seems that would require configuring CDC for the database; otherwise Flink treats feature_jdbc_table as a "batch" source rather than a streaming one. And configuring CDC seems to be very involved.
The following code illustrates what I've tried:
INSERT INTO feature_jdbc_table
SELECT user_id, event_time
FROM input_stream
-- NEXT SQL --
INSERT INTO feature_updates
SELECT 'my_feature_was_updated'
FROM feature_jdbc_table
I also tried to introduce a fixed delay (not great, but at least something) with tumbling windows.
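For reference, the tumbling-window delay attempt looked roughly like this (a sketch only; the 10-second window and the grouping are illustrative, and this only delays the event rather than guaranteeing the JDBC commit happened):

```sql
-- Sketch: buffer input events in a 10-second tumbling window before
-- emitting the update event, so the JDBC sink has (hopefully) committed.
INSERT INTO feature_updates
SELECT 'my_feature_was_updated'
FROM input_stream
GROUP BY TUMBLE(event_time, INTERVAL '10' SECOND), user_id;
```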
We have a Primary-Standby setup on PostgreSQL 12.1. A DELETE query runs on the primary server (it takes time but finishes completely); however, the same query does not finish on the standby and seems to run forever:
DELETE FROM EVENTS
WHERE Event_Type_ID != 2
AND last_update_time <= '2020-11-04'
AND Event_ID NOT IN ( SELECT DISTINCT Event_ID FROM Event_Association )
AND Event_ID NOT IN ( SELECT DISTINCT Event_ID FROM EVENTS WHERE Last_Update_Time > '2020-11-14');
The execution plan is as follows (replacing delete with select query for the same):
https://explain.depesz.com/s/GZp7
There is an index on EVENTS.Event_ID and Event_Association.Event_ID; however, the DELETE query still won't finish on the standby server.
The EVENTS table has more than 2 million rows and the Event_Association table has more than 300,000 rows.
Can someone help me resolve this issue?
Thanks
Is there any way to calculate the number of reads per second on a Postgres table?
What I really need is to know whether a table has any reads at the moment. (If not, I can safely drop it.)
Thank you
To figure out if the table is currently in use, run:
SELECT pid
FROM pg_locks
WHERE relation = 'mytable'::regclass;
That will return the process ID of all backends using it.
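If you also want to see what those backends are running, you can join against pg_stat_activity:

```sql
-- Show the session and current query of every backend holding a lock
-- on mytable.
SELECT a.pid, a.usename, a.state, a.query
FROM pg_locks AS l
JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.relation = 'mytable'::regclass;
```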
To measure whether a table is used at all, run this query:
SELECT seq_scan + idx_scan + n_tup_ins + n_tup_upd + n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'mytable';
Then repeat the query in a day. If the numbers haven't changed, nobody has used the table.
Audit SELECT activity
My suggestion is to wrap mytable in a view (called the_view_to_use_instead in the example) that invokes a logging function on every select, and then to use the view for selecting, i.e.
select <whatever you need> from the_view_to_use_instead ...
instead of
select <whatever you need> from mytable ...
So here it is
create table audit_log (table_name text, event_time timestamptz);
create function log_audit_event(tname text) returns void language sql as
$$
insert into audit_log values (tname, now());
$$;
create view the_view_to_use_instead as
select mytable.*
from mytable, log_audit_event('mytable') as ignored;
Every time someone queries the_view_to_use_instead an audit record with a timestamp appears in table audit_log. You can then examine it in order to find out whether and when mytable was selected from and make your decision. Function log_audit_event can be reused in other similar scenarios. The average number of selects per second over the last 24 hours would be
select count(*)::numeric/86400
from audit_log
where event_time > now() - interval '86400 seconds';
I have a stored procedure that performs a bulk insert of a large number of DNS log entries. I wish to summarise this raw data in a new table for analysis. The new table takes a given log entry for FQDN and Record Type and holds one record only with a hitcount.
Source table might include 100 rows of:
FQDN, Type
www.microsoft.com,A
Destination table would have:
FQDN, Type, HitCount
www.microsoft.com, A, 100
The SP establishes a unique ID made up of [FQDN] +'|'+ [Type], which is then used as the primary key in the destination table.
My plan was to have the SP fire a trigger that did an UPDATE ... IF @@ROWCOUNT = 0 ... INSERT. However, that of course failed because the trigger receives all the [inserted] rows as a single set, so it always throws a key-violation error.
I'm having trouble getting my head around a solution and need some fresh eyes and better skills to take a look. The bulk insert SP works just fine and the raw data is exactly as desired. However trying to come up with a suitable method to create the summary data is beyond my present skills/mindset.
I have several tens of TB of data to process, so I don't see the summary as something we could compute dynamically with a SELECT COUNT, which is why I started down the trigger route.
The relevant code in the SP is driven by a cursor over a list of compressed log files that need to be decompressed and bulk-inserted, and is as follows:
-- Bulk insert to a view because bulk insert cannot populate the UID field
SET @strDynamicSQL = 'BULK INSERT [DNS_Raw_Logs].[dbo].[vwtblRawQueryLogData] FROM ''' + @strTarFolder + '\' + @strLogFileName + ''' WITH (FIRSTROW = 1, FIELDTERMINATOR = '' '', ROWTERMINATOR = ''0x0a'', ERRORFILE = ''' + @strTarFolder + '\' + @strErrorFile + ''', TABLOCK)'
--PRINT @strDynamicSQL
EXEC (@strDynamicSQL)
-- Update the [UID] field after the bulk insert
UPDATE [DNS_Raw_Logs].[dbo].[tblRawQueryLogData]
SET [UID] = [FQDN] + '|' + [Type]
FROM [tblRawQueryLogData]
WHERE [UID] IS NULL
I know that the UPDATE ... IF @@ROWCOUNT = 0 ... INSERT solution is wrong because it assumes the input data is a single row. I'd appreciate help on a way to do this.
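For what it's worth, a set-based version of the trigger would presumably look something like the sketch below (the summary table name tblSummary and its columns are assumptions on my part):

```sql
CREATE TRIGGER trgSummarize ON dbo.tblRawQueryLogData
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Aggregate the entire inserted set first, then upsert once per UID.
    WITH NewHits AS (
        SELECT [FQDN] + '|' + [Type] AS [UID], [FQDN], [Type],
               COUNT(*) AS HitCount
        FROM inserted
        GROUP BY [FQDN], [Type]
    )
    MERGE dbo.tblSummary AS t
    USING NewHits AS s ON t.[UID] = s.[UID]
    WHEN MATCHED THEN
        UPDATE SET t.HitCount = t.HitCount + s.HitCount
    WHEN NOT MATCHED THEN
        INSERT ([UID], [FQDN], [Type], HitCount)
        VALUES (s.[UID], s.[FQDN], s.[Type], s.HitCount);
END;
```

Note that BULK INSERT does not fire triggers unless the FIRE_TRIGGERS option is specified.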
Thank you
First, at that scale, make sure you understand columnstore tables. They are very highly compressed and fast to scan.
Then write a query that reads from the raw table and returns the summarized data:
create or alter view DnsSummary
as
select FQDN, Type, count(*) HitCount
from tblRawQueryLogData
group by FQDN, Type
Then if querying that view directly is too expensive, write a stored procedure that loads a table after each bulk insert. Or make the view into an indexed view.
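If you go the indexed-view route, note that SQL Server requires WITH SCHEMABINDING, two-part object names, and COUNT_BIG(*) for an aggregating view; roughly:

```sql
CREATE VIEW dbo.DnsSummary
WITH SCHEMABINDING
AS
SELECT FQDN, [Type], COUNT_BIG(*) AS HitCount
FROM dbo.tblRawQueryLogData
GROUP BY FQDN, [Type];
GO
-- The unique clustered index is what materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_DnsSummary
    ON dbo.DnsSummary (FQDN, [Type]);
```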
Thanks for the answer, David; it's obvious when someone else looks at it!
I ran the view-based solution with 14M records (about 4 hours' worth) and it took 40 seconds to return, so I think I'll modify the SP to drop and re-create the summary table each time it runs the bulk insert.
The source table also includes a timestamp for each entry. I would like to grab the earliest and latest times associated with each UID and add that to the summary.
My current summary query (courtesy of David) looks like this:
SELECT [UID], [FQDN], [Type], COUNT([UID]) AS [HitCount]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT([UID]) DESC
And returns:
UID, FQDN, Type, HitCount
www.microsoft.com|A, www.microsoft.com, A, 100
If I wanted to grab the earliest and latest times, I think I'm looking at nesting three queries: one for the earliest time (SELECT TOP N ... ORDER BY ... ASC), one for the latest time (SELECT TOP N ... ORDER BY ... DESC), and one for the hit count. Is there a more efficient way of doing this, before I try to wrap my head around that route?
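No nesting should be needed: MIN and MAX are ordinary aggregates and can sit alongside the COUNT. Assuming the timestamp column is called [QueryTime] (the actual name isn't shown in the question):

```sql
SELECT [UID], [FQDN], [Type],
       COUNT(*)         AS [HitCount],
       MIN([QueryTime]) AS [EarliestHit],  -- hypothetical column name
       MAX([QueryTime]) AS [LatestHit]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT(*) DESC;
```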
We have a SQL Server database which has table consisting of tickers. Something like
Ticker | description
-------+-------------
USDHY | High yield
USDIG | Investment grade ...
Now we have a lot of other tables holding data corresponding to these tickers (time series). We want to be able to create a report showing which of these tickers are queried frequently and which are not queried at all. This would allow us to selectively run some procedures on a regular basis for the tickers which are used most frequently, and ignore the others.
Is there some way to achieve this in SQL, any report which could generate this statistic over a period of time, say n months?
Any help is much appreciated
It seems there are no answers so far. As I mentioned, one possibility is to use Extended Events, like below:
CREATE EVENT SESSION [TestTableSelectLog]
ON SERVER
ADD EVENT sqlserver.sp_statement_completed (
WHERE [statement] LIKE '%SELECT%TestTable%' --Capture all selects from TestTable
AND [statement] NOT LIKE '%XEStore%' --filter extended event queries
AND [statement] NOT LIKE '%fn_xe_file_target_read_file%'),
ADD EVENT sqlserver.sql_statement_completed (
WHERE [statement] LIKE '%SELECT%TestTable%'
AND [statement] NOT LIKE '%XEStore%'
AND [statement] NOT LIKE '%fn_xe_file_target_read_file%')
ADD TARGET package0.event_file (SET FILENAME=N'C:\Temp\TestTableSelectLog.xel');--log to file
ALTER EVENT SESSION [TestTableSelectLog] ON SERVER STATE=START;--start capture
You can then select from the file using sys.fn_xe_file_target_read_file:
CREATE TABLE TestTable
(
Ticker varchar(10),
[Description] nvarchar(100)
)
SELECT * FROM TestTable
SELECT *, CAST(event_data AS XML) AS 'event_data_XML'
FROM sys.fn_xe_file_target_read_file('C:\Temp\TestTableSelectLog*.xel', NULL, NULL, NULL)
The SELECT statement should be captured.
Extended Events can be also configured from GUI (Management/Extended Events/Sessions in Management Studio).
I have two processes that work with data in the same table.
One process inserts about 20,000 records daily into the target table, one by one (pure ADO.NET).
The second process periodically (every 15 minutes) calls a stored procedure that:
1. Detects the duplicates in those 20,000 records by looking at all the records 7 days back, and marks them as such.
2. Marks all records that are not duplicates with a 'ToBeCopied' flag.
3. Selects a number of columns from the records marked 'ToBeCopied' and returns the set.
Sometimes these two processes overlap (due to delays in data processing), and I suspect that if the first process inserts new records while the second process is somewhere between steps 1 and 2, those records will be marked 'ToBeCopied' without having gone through the duplicate sifting.
This means that the stored procedure is now returning some duplicates.
This is my theory, but in practice I have not been able to replicate it.
I am using LINQ to SQL to insert duplicates (40-50 or so a second), and while this is running I manually call the stored procedure and store its results.
It appears that when the stored procedure is running, the inserting pauses, such that at the end no duplicates have made it into the final result set.
I am wondering if LINQ to SQL or SQL Server has a default mechanism that prevents concurrent access and pauses the inserting while the selecting or updating takes place.
What do you think?
EDIT 1:
The 'duplicates' are not identical rows. They are 'equivalent' given the business/logical entities these records represent. Each row has a unique primary key.
P.S. Selecting the result set takes place with NOLOCK. Trying to reproduce on SQL Server 2008. Problem is alleged to occur on SQL Server 2005.
What do I think?
Why do you have duplicates in the database? Data purity begins in the client at the app drawing board, which should have a data model that simply does not allow for duplicates.
Why do you have duplicates in the database? Check constraints should prevent this from happening if the client app misbehaves.
If you have duplicates, the reader must be prepared to handle them.
You cannot detect duplicates in two stages (look, then mark); it has to be a single atomic mark. In fact, you can do almost nothing in a database in two 'look then mark' stages. All 'look for records, then mark the records found' processes fail under concurrency.
NOLOCK will give you inconsistent reads: records will be missing or read twice. Use SNAPSHOT isolation instead.
Linq-To-SQL has no pixie dust to replace bad design.
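For the SNAPSHOT isolation suggestion above, note that it has to be enabled at the database level before sessions can request it (sketch; MyDb stands in for the actual database name):

```sql
-- One-time, database-level switch.
ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Then, in each session that needs consistent reads:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
```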
Update
Consider this for instance:
A staging table with a structure like:
CREATE TABLE T1 (
id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
date DATETIME NOT NULL DEFAULT GETDATE(),
data1 INT NULL,
data2 INT NULL,
data3 INT NULL);
Process A is doing inserts at leisure into this table. It does not do any validation; it just dumps raw records in:
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
...
Process B is tasked with extracting from this staging table and moving cleaned-up data into a table T2. It has to remove duplicates, which, by business rules, means records with the same values in data1, data2 and data3. Within a set of duplicates, only the first record by date should be kept:
set transaction isolation level snapshot;
declare @maxid int;
begin transaction
-- Snap the current MAX(id)
--
select @maxid = MAX(id) from T1;
-- Extract the cleaned rows into T2 using ROW_NUMBER() to
-- filter out duplicates
--
with cte as (
SELECT date, data1, data2, data3,
ROW_NUMBER() OVER
(PARTITION BY data1, data2, data3 ORDER BY date) as rn
FROM T1
WHERE id <= @maxid)
MERGE INTO T2
USING (
SELECT date, data1, data2, data3
FROM cte
WHERE rn = 1
) s ON s.data1 = T2.data1
AND s.data2 = T2.data2
AND s.data3 = T2.data3
WHEN NOT MATCHED BY TARGET
THEN INSERT (date, data1, data2, data3)
VALUES (s.date, s.data1, s.data2, s.data3);
-- Delete the processed rows up to @maxid
--
DELETE FROM T1
WHERE id <= @maxid;
COMMIT;
Assuming Process A only inserts, this procedure would safely process the staging table and extract the de-duplicated rows. Of course, this is just a skeleton; a true ETL process would have error handling via BEGIN TRY/BEGIN CATCH and transaction-log size control via batching.
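Batching the cleanup delete, for instance, could be sketched like this (illustrative only; pulling the delete out of the main transaction changes the atomicity guarantees, and the batch size of 5000 is arbitrary):

```sql
DECLARE @maxid INT = 100000;  -- hypothetical watermark snapped earlier
WHILE 1 = 1
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        -- Delete processed rows in chunks to keep the transaction log small.
        DELETE TOP (5000) FROM T1 WHERE id <= @maxid;
        IF @@ROWCOUNT = 0
        BEGIN
            COMMIT;
            BREAK;
        END
        COMMIT;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK;
        THROW;  -- rethrow to the caller
    END CATCH
END
```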
When are you calling SubmitChanges on your data context? I believe that this happens within a transaction.
As for your problem, what you are saying sounds plausible. Would it maybe make more sense to do your load into a staging table (if it's slow) and then do an
INSERT INTO ProductionTable SELECT * FROM StagingTable
once your load is complete?