Logstash: multiple tracking_columns for the same pipeline - sql-server

I have a Logstash pipeline that fetches data from an MS SQL view that joins tables A and B and puts the denormalised data into ES.
Initially, INSERTs or UPDATEs could happen only in table A. Therefore, to make Logstash pick up only the records inserted or updated since the last iteration of the polling loop, I defined the tracking_column field, which refers to the updatedDate timestamp column in table A:
jdbc {
#Program Search
jdbc_connection_string => "jdbc:sqlserver://__DB_LISTNER__"
jdbc_user => "admin"
jdbc_password => "admin"
jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
jdbc_driver_library => "/usr/share/logstash/drivers/mssql-jdbc-6.2.2.jre8.jar"
sql_log_level => "info"
tracking_column => "updated_date_timestamp"
use_column_value => true
tracking_column_type => "timestamp"
schedule => "*/1 * * * *"
statement => "select *, updateDate as updated_date_timestamp from dbo.MyView where (updateDate > :sql_last_value and updateDate < getdate()) order by updateDate ASC"
last_run_metadata_path => "/usr/share/logstash/myfolder/.logstash_jdbc_last_run"
}
Now UPDATEs can also happen in table B. With this new requirement, I am unsure how to configure Logstash to track changes in table B as well within the same pipeline. Can I define multiple tracking_columns for the same pipeline?
Two other options I have in mind, but am not sure about, are:
Generate a composite value from the updateDate fields of tables A and B and reference it from tracking_column. But I am not sure what the SQL query should look like in that case.
Create another pipeline that tracks changes for table B only. The drawback I see with this approach is that the existing and new pipelines will do duplicate work on their initial iterations, since each has to process all the records from the DB view.
Please advise how I should proceed from here.

I found this ES discussion that suggests using a function that selects the greatest of the provided dates in the SQL query. SQL Server has a GREATEST function, but it is not recognised by the SQL Server version I am currently using. As a workaround, I used the iif() function to compare the dates, so my SQL query looks like this:
select *, iif(A.updatedDate>B.updatedDate, A.updatedDate, B.updatedDate) as updated_date_timestamp from dbo.MyView where (iif(A.updatedDate>B.updatedDate, A.updatedDate, B.updatedDate) > :sql_last_value and iif(A.updatedDate>B.updatedDate, A.updatedDate, B.updatedDate) < getdate()) order by iif(A.updatedDate>B.updatedDate, A.updatedDate, B.updatedDate) ASC, id ASC
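To illustrate what the composite tracking value does, here is a minimal Python sketch of the same selection logic. It is not Logstash code: the row data and the names `composite_timestamp` and `changed_since` are made up for illustration, and it assumes both timestamps are always present.

```python
from datetime import datetime

# Hypothetical rows from the view: each carries the update time of its A and B side.
rows = [
    {"id": 1, "a_updated": datetime(2021, 1, 1, 10, 0), "b_updated": datetime(2021, 1, 1, 9, 0)},
    {"id": 2, "a_updated": datetime(2021, 1, 1, 8, 0), "b_updated": datetime(2021, 1, 1, 11, 0)},
    {"id": 3, "a_updated": datetime(2021, 1, 1, 7, 0), "b_updated": datetime(2021, 1, 1, 7, 30)},
]


def composite_timestamp(row):
    """Return the later of the two update times, like iif(A > B, A, B)."""
    return max(row["a_updated"], row["b_updated"])


def changed_since(rows, last_value, now):
    """Rows whose composite timestamp lies strictly between last_value and now, oldest first."""
    hits = [r for r in rows if last_value < composite_timestamp(r) < now]
    return sorted(hits, key=lambda r: (composite_timestamp(r), r["id"]))


last_value = datetime(2021, 1, 1, 9, 30)  # stands in for :sql_last_value
now = datetime(2021, 1, 1, 12, 0)         # stands in for getdate()

batch = changed_since(rows, last_value, now)
changed_ids = [r["id"] for r in batch]
# The tracking value carried into the next poll is the composite timestamp of the last row.
new_last_value = composite_timestamp(batch[-1]) if batch else last_value
```

Row 2 is picked up even though its A side is older than the last tracked value, because its B side changed later. One caveat on the SQL side: when the comparison inside iif() involves NULL, iif() returns its second (false) branch, so rows where B.updatedDate is NULL may need a separate guard such as ISNULL.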

Related

How to get from which table a particular view is created in Snowflake through query

I am new to Snowflake. I am building an application in which I need to display information (name and attributes) about the tables from which a particular view is created.
Example:
Snowflake views with their source tables
So far I have tried the query from the following site, but it did not give the source table of a particular view.
https://dataedo.com/kb/query/snowflake/list-views-with-their-scripts
So, is there a query that returns the source table of a particular view (with the attributes involved)?
The view dependencies could be extracted using GET_OBJECT_REFERENCES:
SELECT REFERENCED_DATABASE_NAME,
REFERENCED_SCHEMA_NAME,
REFERENCED_OBJECT_NAME,
REFERENCED_OBJECT_TYPE,
*
FROM TABLE(get_object_references(database_name=>'<db_name>',
schema_name=>'<schema_name>',
object_name=>'<view_name>'));
The column list can be queried using INFORMATION_SCHEMA.COLUMNS.
You can use the sample query below to pull column-level metadata:
select
-- refr_tab.referenced_database_name,refr_tab.referenced_schema_name, refr_tab.referenced_schema_name,referenced_object_name
ref_cols.*
from
table(
get_object_references(
database_name => 'ex1_gor_y',
schema_name => 'public',
object_name => 'y_view_f'
)
) refr_tab,
information_schema.columns ref_cols
where
refr_tab.referenced_object_type = 'TABLE'
and refr_tab.referenced_schema_name = ref_cols.table_schema
and refr_tab.referenced_object_name = ref_cols.table_name
and ref_cols.table_name = 'Y_TAB_A';

Joining continuous queries in Flink SQL

I'm trying to join two continuous queries, but keep running into the following error:
Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.\nPlease check the documentation for the set of currently supported SQL features.
Here's the table definition:
CREATE TABLE `Combined` (
`machineID` STRING,
`cycleID` BIGINT,
`start` TIMESTAMP(3),
`end` TIMESTAMP(3),
WATERMARK FOR `end` AS `end` - INTERVAL '5' SECOND,
`sensor1` FLOAT,
`sensor2` FLOAT
)
and the insert query
INSERT INTO `Combined`
SELECT
a.`MachineID`,
a.`cycleID`,
MAX(a.`start`) `start`,
MAX(a.`end`) `end`,
MAX(a.`sensor1`) `sensor1`,
MAX(m.`sensor2`) `sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
a.`MachineID` = m.`MachineID` AND
a.`cycleID` = m.`cycleID` AND
a.`start` = m.`timestamp`
GROUP BY a.`MachineID`, a.`cycleID`, SESSION(a.`start`, INTERVAL '1' SECOND)
In the source tables Aggregated and MachineStatus, the start and timestamp columns are time attributes with a watermark.
I've tried casting the input rows of the join to timestamps, but that didn't fix the issue and would mean that I cannot use SESSION, which is supposed to ensure that only one data point gets recorded per cycle.
Any help is greatly appreciated!
I investigated this a little further and noticed that the GROUP BY statement doesn't make sense in that context.
Furthermore, the SESSION can be replaced by a time window, which is the more idiomatic approach.
INSERT INTO `Combined`
SELECT
a.`MachineID`,
a.`cycleID`,
a.`start`,
a.`end`,
a.`sensor1`,
m.`sensor2`
FROM `Aggregated` a, `MachineStatus` m
WHERE
a.`MachineID` = m.`MachineID` AND
a.`cycleID` = m.`cycleID` AND
m.`timestamp` BETWEEN a.`start` AND a.`start` + INTERVAL '0' SECOND
To understand the different ways to join dynamic tables, I found the Ververica SQL training extremely helpful.

BigQuery or SQL Server SPLIT query

I have searched around and cannot find much on this topic. I have a table that receives logging information. As a result, the column I am interested in contains multiple values that I need to search against. The column is formatted in a PHP URL style, i.e.
/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32
This makes all searches end up with really long regexes to get the data, followed by join statements to combine it.
Is there a way in BigQuery, or SQL Server that I can pull the information from that column and put it into new columns?
Example:
The information I would like extracted begins after the ? and ends at the &. The string can sometimes be longer, and contains additional headers.
Thanks,
The following is for BigQuery Standard SQL and addresses this part of your question:
Is there a way in BigQuery, ... that I can pull the information from that column and put it into new columns?
#standardSQL
CREATE TEMP FUNCTION parseColumn(kv STRING, column_name STRING) AS (
IF(SPLIT(kv, '=')[OFFSET(0)]= column_name, SPLIT(kv, '=')[OFFSET(1)], NULL)
);
WITH `project.dataset.table` AS (
SELECT '/test/test.aspx?extra=abc&DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url UNION ALL
SELECT '/test/test.aspx?DS_Vendor=55192&DS_ProdVer=4.30.100.0&more=123&DS_ProdLang=DE&DS_Product=MTE&DS_OfficeBits=64'
)
SELECT
MIN(parseColumn(kv, 'DS_Vendor')) AS DS_Vendor,
MIN(parseColumn(kv, 'DS_ProdVer')) AS DS_ProdVer,
MIN(parseColumn(kv, 'DS_ProdLang')) AS DS_ProdLang,
MIN(parseColumn(kv, 'DS_Product')) AS DS_Product,
MIN(parseColumn(kv, 'DS_OfficeBits')) AS DS_OfficeBits
FROM `project.dataset.table`,
UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS kv
GROUP BY url
with the result as below:
Row  DS_Vendor  DS_ProdVer  DS_ProdLang  DS_Product  DS_OfficeBits
1    55039      7.90.100.0  EN           MTT         32
2    55192      4.30.100.0  DE           MTE         64
This also handles the following part of the question:
The string can sometimes be longer, and contains additional headers.
One example using BigQuery (with standard SQL):
SELECT REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
This returns the parts of the URL as an ARRAY<STRING>. To go one step further, you can get back an ARRAY<STRUCT<key STRING, value STRING>> with a query of this form:
SELECT
ARRAY(
SELECT AS STRUCT
SPLIT(part, '=')[OFFSET(0)] AS key,
SPLIT(part, '=')[OFFSET(1)] AS value
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS part
) AS keys_and_values
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
...or with the keys and values as top-level columns:
SELECT
SPLIT(part, '=')[OFFSET(0)] AS key,
SPLIT(part, '=')[OFFSET(1)] AS value
FROM (
SELECT '/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32' AS url
)
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(url, r'[?&]([^?&]+)')) AS part
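For comparison, the same key/value extraction can be sketched outside the database. The Python below is only an illustration of the idea behind the queries above, using the standard library's `urllib.parse`; the helper names `query_pairs` and `pick_columns` are hypothetical, and the sample URLs are taken from the question and the first answer.

```python
from urllib.parse import parse_qsl, urlsplit

url = ("/test/test.aspx?DS_Vendor=55039&DS_ProdVer=7.90.100.0"
       "&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32")
longer_url = ("/test/test.aspx?extra=abc&DS_Vendor=55039&DS_ProdVer=7.90.100.0"
              "&DS_ProdLang=EN&DS_Product=MTT&DS_OfficeBits=32")


def query_pairs(url):
    """Split the part after '?' into (key, value) pairs, one per '&'-separated item."""
    return parse_qsl(urlsplit(url).query)


def pick_columns(url, wanted):
    """Return one value per wanted key, or None when the key is absent from the URL."""
    found = dict(query_pairs(url))
    return {name: found.get(name) for name in wanted}


pairs = query_pairs(url)
columns = pick_columns(longer_url, ["DS_Vendor", "DS_Product", "DS_OfficeBits"])
```

As in the SQL versions, extra keys such as `extra=abc` are simply ignored when only named columns are requested.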

SSIS Package how to successfully do a foreach or for loop to auto increment a value for field insert?

First of all I have never attempted something like this in SSIS and I am very new to SSIS package development.
I need to build a component in my package that will run through a table of data (say 80 rows) and set a field titled DisplayOrder to an auto-incremented number. The catch is that one of the records HAS to be set to 0, with the rest of the records set to the auto-incremented number.
As for code, I am not even sure what code or screenshots to attach to this question.
I finally figured it out and there is no need for a loop.
Create a SQL Task to clear the main table.
Script Used
DELETE FROM [Currency].[ExchangeRates]
Create a SQL Task to clear the linked table.
Script Used
DELETE FROM [Currency].[CurrencyList]
Load the values into the main table.
Actions Used
Load values from XML Source
Dump values to [ExchangeRates] Table
Create a SQL Task to load the Values from the main table to the linked table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder) SELECT [er].[TargetCurrency] AS [CurrencyCode], [er].[TargetName] AS [CurrencyName], [er].[ID] AS [ExchangeRateID], ROW_NUMBER() OVER (ORDER BY [ER].[TargetName]) AS [DisplayOrder] FROM [Currency].[ExchangeRates] AS [er] ORDER BY [CurrencyName]
Create a SQL Task to load a new record to the main table for use as DisplayOrder 0.
Script Used
INSERT INTO [Currency].[ExchangeRates] ([Title], [Link], [Description], [PubDate], [BaseCurrency], [TargetCurrency], [TargetName], [ExchangeRate]) VALUES ('1 USD = 1 USD','http://www.floatrates.com/usd/usd/','1 U.S. Dollar = 1 U.S. Dollar',(SELECT TOP 1 [PubDate] FROM [Currency].[ExchangeRates]),'USD','USD','United States Dollar','1')
Create a SQL Task to reference the newly created record from the main table.
Script Used
INSERT INTO [Currency].[CurrencyList] (CurrencyCode, CurrencyName, ExchangeRateID, DisplayOrder) SELECT [er].[TargetCurrency] AS [CurrencyCode], [er].[TargetName] AS [CurrencyName], [er].[ID] AS [ExchangeRateID], 0 AS [DisplayOrder] FROM [Currency].[ExchangeRates] AS [er] WHERE [er].[TargetCurrency] = 'USD'
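The ordering rule these tasks implement (every record numbered by name, one pinned record forced to 0) can be sketched in a few lines of Python. This is only an illustration of the logic, not part of the SSIS package; the function name `assign_display_order` and the sample currencies are assumptions.

```python
def assign_display_order(currencies, pinned_code="USD"):
    """Give the pinned currency DisplayOrder 0 and number the rest 1..n by name."""
    others = sorted(
        (c for c in currencies if c["code"] != pinned_code),
        key=lambda c: c["name"],
    )
    # Equivalent of ROW_NUMBER() OVER (ORDER BY TargetName) for the non-pinned rows.
    order = {c["code"]: position for position, c in enumerate(others, start=1)}
    # Equivalent of the final insert that hard-codes 0 AS DisplayOrder.
    if any(c["code"] == pinned_code for c in currencies):
        order[pinned_code] = 0
    return order


# Hypothetical sample data standing in for the ExchangeRates rows.
currencies = [
    {"code": "JPY", "name": "Japanese Yen"},
    {"code": "USD", "name": "United States Dollar"},
    {"code": "EUR", "name": "Euro"},
    {"code": "GBP", "name": "British Pound"},
]
display_order = assign_display_order(currencies)
```

The SQL tasks reach the same result by numbering the rows before the USD record exists and inserting that record afterwards with a literal 0.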

Correct method of deleting over 2100 rows (by ID) with Dapper

I am trying to use Dapper to support the data access for my server app.
My server app has another application that drops records into my database at a rate of 400 per minute.
My app pulls them out in batches, processes them, and then deletes them from the database.
Since data continues to flow into the database while I am processing, I don't have a good way to say delete from myTable where allProcessed = true.
However, I do know the PK values of the rows to delete. So I want to do a delete from myTable where Id in @listToDelete
The problem is that if my server goes down for even 6 minutes, then I have over 2100 rows to delete.
Since Dapper takes my @listToDelete and turns each item into a parameter, my call to delete fails (causing my data purging to get even further behind).
What is the best way to deal with this in Dapper?
NOTES:
I have looked at Table-Valued Parameters, but from what I can see they are not very performant. This piece of my architecture is the bottleneck of my system and it needs to be very, very fast.
One option is to create a temp table on the server and then use the bulk load facility to upload all the IDs into that table at once. Then use a join, EXISTS or IN clause to delete only the records that you uploaded into your temp table.
Bulk loads are a well-optimized path in SQL Server and it should be very fast.
For example:
Execute the statement CREATE TABLE #RowsToDelete(ID INT PRIMARY KEY)
Use a bulk load to insert keys into #RowsToDelete
Execute DELETE FROM myTable where Id IN (SELECT ID FROM #RowsToDelete)
Execute DROP TABLE #RowsToDelete (the table will also be automatically dropped if you close the session)
(Assuming Dapper) code example:
conn.Open();
var columnName = "ID";
conn.Execute(string.Format("CREATE TABLE #{0}s({0} INT PRIMARY KEY)", columnName));
using (var bulkCopy = new SqlBulkCopy(conn))
{
bulkCopy.BatchSize = ids.Count;
bulkCopy.DestinationTableName = string.Format("#{0}s", columnName);
var table = new DataTable();
table.Columns.Add(columnName, typeof (int));
bulkCopy.ColumnMappings.Add(columnName, columnName);
foreach (var id in ids)
{
table.Rows.Add(id);
}
bulkCopy.WriteToServer(table);
}
// or do other things with your table instead of deleting here
conn.Execute(string.Format(@"DELETE FROM myTable where Id IN
(SELECT {0} FROM #{0}s)", columnName));
conn.Execute(string.Format("DROP TABLE #{0}s", columnName));
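The same four steps (create a temp table, load the keys in bulk, delete with a subquery, drop the temp table) can be tried end to end with a small, self-contained Python sketch. It uses the standard library's `sqlite3` with an in-memory database as a stand-in for SQL Server, so it demonstrates only the shape of the technique, not SqlBulkCopy performance; the function name `delete_by_ids` and the table name `rows_to_delete` are assumptions.

```python
import sqlite3


def delete_by_ids(conn, ids):
    """Delete rows of myTable whose Id is in ids, staging the keys in a temp table."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE rows_to_delete (id INTEGER PRIMARY KEY)")
    # One bound parameter per statement execution, so the number of ids is not limited
    # by a per-statement parameter cap.
    cur.executemany("INSERT INTO rows_to_delete (id) VALUES (?)", ((i,) for i in ids))
    cur.execute("DELETE FROM myTable WHERE Id IN (SELECT id FROM rows_to_delete)")
    deleted = cur.rowcount
    cur.execute("DROP TABLE rows_to_delete")
    conn.commit()
    return deleted


# Demo: 5000 rows in the table, 3000 of them already processed and ready to purge.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO myTable (Id, payload) VALUES (?, ?)",
    ((i, "row") for i in range(1, 5001)),
)
deleted = delete_by_ids(conn, range(1, 3001))
remaining = conn.execute("SELECT COUNT(*) FROM myTable").fetchone()[0]
```

Because the delete statement itself carries no parameters, the size of the key list no longer matters to it; only the staging step scales with the number of ids.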
To get this code working, I went dark side.
Dapper turns my list into parameters, and SQL Server can't handle that many parameters in one statement (I had never needed even double-digit parameter counts before), so I had to go with dynamic SQL.
So here was my solution:
string listOfIdsJoined = "("+String.Join(",", listOfIds.ToArray())+")";
connection.Execute("delete from myTable where Id in " + listOfIdsJoined);
Before everyone grabs their torches and pitchforks, let me explain.
This code runs on a server whose only input is a data feed from a Mainframe system.
The list I am dynamically creating is a list of longs/bigints.
The longs/bigints are from an Identity column.
I know constructing dynamic SQL is bad juju, but in this case, I just can't see how it leads to a security risk.
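Dapper expects a list of objects that expose the parameter as a property, so in the case above a list of objects with an Id property will work.
connection.Execute("delete from myTable where Id in (@Id)", listOfIds.AsEnumerable().Select(i=> new { Id = i }).ToList());
This will work. Note that Dapper then runs the statement once per list element, so it avoids the parameter limit at the cost of one delete per Id.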

Resources