I want to continuously poll a table from my database and push to Kafka.
I am using apache camel for this.
My routes are as follows:-
from(timer:every 1 sec).
to(sql:select first 1000 * from myTable where id > myId).
to(updateMyId).
to(kafka:url).end()
The problem is that it doesn't take the updated myId in the next iteration. The route is static and if initially myId = 1, then it keeps polling from 1.
How do I dynamically update myId?
Try like this.set myId property as it gets updated.
from(timer:every 1 sec).
to(sql:select first 1000 * from myTable where id >:#${property.myId}).
to(updateMyId).
to(kafka:url).end();
Or move following logic to a bean.
to(sql:select first 1000 * from myTable where id >:#${property.myId}).
to(updateMyId).
Related
Consider the code below in a unit test, where I add a new Tag object in a pre-populated SQLite database.
#Test // Line 1
public void add() {
Tag tagToAdd = new Tag("Tall");
Tag addedTag = this.tagDao.add(tagToAdd);
assertNotNull(addedTag);
assertEquals(3L, addedTag.getId()); // Line 6
assertEquals(tagToAdd.getTag(), addedTag.getTag());
List<Tag> tags = this.tagDao.get();
assertEquals(3, tags.size());
}
On line 6, I expect the ID of the Tag to be 3, because the field is an AUTOINCREMENT and the test is initialized with a database already containing 2 Tags. This works fine every time I run the test and the ID is always 3.
Now, I am integrating flyway to the project. Every time I run the test, the AUTOINCREMENT starts from the value of the last run, so the Tag ID increments by 1 every run, and the test fails.
Any idea on how I can get flyway to always reset the database to a brand new state, and reset the AUTOINCREMENT value ? I could write a query to do it manually, but this is not maintainable.
What I have tried so far ?
Integrate #FlywayTest, as this executes flyway task clean
Defined a FlywayMigrationStrategy bean, which contains flyway.clean()
Set spring.flyway.clean-on-validation-error to true in my application.properties (that said, there was no change in my sql, so not sure if this changed anything)
-- Edit
My 1st migration script contains the below.
DROP TABLE IF EXISTS Tag;
CREATE TABLE Tag(
id INTEGER PRIMARY KEY AUTOINCREMENT,
tag VARCHAR(255) NOT NULL UNIQUE,
createdDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
modifiedDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
If I understood everything correctly - you have a database and a table in this database which is created once and the same table is used for tests every time - you just delete rows from the table (without removing it) when tests are completed (or before starting next tests) and flyway just inserts two tags into this table every time you run the tests.
If that's right - you can just reset sequence in SQLite to set it back to 1 so next inserted row will be inserted with this id. You can do it by running the following query:
UPDATE `sqlite_sequence` SET `seq` = 1 WHERE `name` = 'tags_table_name';
Alternatively, you can set seq to 0 - this value is incorrect so SQLite will use next available correct value (if there are no rows in the table - it will be one, if there are some values - it will first available number).
Yet another possibility is just to delete your table after tests and recreate it before running next tests - as it is a database and table just for tests - it should work correctly. This way you have your sequence counter set back to value 1 each time. I would actually go this way until you have really good reason not to delete the table.
We've setup a stream on a table that is continuously loaded via snowpipe.
We're consuming this data with a task that runs every minute where we merge into another table. There is a possibility of duplicate keys so we use a ROW_NUMBER() window function, ordered by the file created timestamp descending where row_num=1. This way we always get the latest insert
Initially we used a standard task with the merge statement but we noticed that in some instances, since snowpipe does not guarantee loading in order of when the files were staged, we were updating rows with older data. As such, on the WHEN MATCHED section we added a condition so only when the file created ts > existing, to update the row
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the matched clause would interfere with the not matched clause.
My theory was that the extra clause added a bit of time to the task run where some runs were skipped or the next run happened almost immediately after the last one completed. The idea being that the missing rows were caught up in the middle and the offset changed before they could be consumed
As such, we changed the task to call a stored procedure which uses an explicit transaction. We did this because the docs seem to suggest that using a transaction will lock the stream. However even with this we can see that new inserts are still missing. We're talking very small numbers e.g. 8 out of 100,000s
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
I am not fully sure what is going wrong here, but the file created time related recommendation is given by Snowflake somewhere. It suggest that the file created timestamp is calculated in cloud service and it may be bit different than you think. There is another recommendation related to snowpipe and data ingestion. The queue service takes a min to consume the data from pipe and if you have lot of data being flown inside with in a min, you may end up this issue. Look you implementation and simulate if pushing data in 1min interval solve that issue and don't rely on file create time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating row when nothing has changed.
My Use case-
Collect events for a particular duration and then group them based on the key
Objective
After processing, user can save data of particular duration based on the key
How i am planning to do
1)Receive events from Kafka
2)Create data stream of events
3)associate a table with it and collect data for a particular duration by running a SQL query
4)associate a new table with step-2 output and group collected data according to the key
5)save the data in DB
Solution i tried-
I am able to-
1)receive events from Kafka,
2)setup a data stream(lets say sensorDataStream)-
DataStream<SensorEvent> sensorDataStream
= source.flatMap(new FlatMapFunction<String, SensorEvent>() {
#Override
public void flatMap(String catalog, Collector<SensorEvent> out) {
// create SensorEvent(id, sensor notification value, notification time) creation
});
3)associate a table(lets say table1) with data stream and after running SQL query like-
SELECT id, sensorNotif, notifTime FROM SENSORTABLE WHERE notifTime > t1_Timestamp AND notifTime < t2_Timestamp.
Here t1_Timestamp and t2_Timestamp is predefined epoch time and will change based on some predefined conditions
4)I am able to print this sql query result by using following query on the console-
tableEnv.toAppendStream(table1, Row.class).print();
5)Created a new table(lets say table2) by using table1 and following type of sql query-
Table table2 = tableEnv.sqlQuery("SELECT id AS SensorID, COUNT(sensorNotif) AS SensorNotificationCount FROM table1 GROUP BY id);
6)Collecting and print data by using -
tableEnv.toRetractStream(table2 , Row.class).print();
Problem
1)I am not able to see output of step 6 on the console.
I did some experiment and found that If i skip table1 setup step(that means no sensor data clubbing for a duration) and directly associate my senserDataStream with table2 then i can see the output of step-6 but as this is RetractStream so i can see data in the form of and if new event is coming then this retract stream will invalidate data and print newly calculated data.
Suggestion i would like to have
1)How can i merge step 5 and step 6(means table1 and table2). I already merged these tables but as data is not visible on console so i have doubt? Am i doing something wrong? Or data is merged but not visible?
2)My plan is to --
2.a)filter data in 2 pass, in first pass filter data for a particular interval and in second pass group this data
2.b)Save 2.a output in DB
Will this approach work(i have doubt because i am using data stream and table1 out put is append stream but table2 output is retract stream)?
I am using Dapper on ADO.NET. So at present I am doing the following:
using (IDbConnection conn = new SqlConnection("MyConnectionString")))
{
conn.Open());
using (IDbTransaction transaction = conn.BeginTransaction())
{
// ...
However, there are various levels of transactions that can be set. I think this is the various settings.
My first question is how do I set the transaction level (where I am using Dapper)?
My second question is what is the correct level for each of the following cases? In each of these cases we have multiple instances of a web worker (Azure) service running that will be hitting the DB at the same time.
I need to run monthly charges on subscriptions. So in a transaction I need to read a record and if it's due for a charge create the invoice record and mark the record as processed. Any other read of that record for the same purpose needs to fail. But any other reads of that record that are just using it to verify that it is active need to succeed.
So what transaction do I use for the access that will be updating the processed column? And what transaction do I use for the other access that just needs to verify that the record is active?
In this case it's fine if a conflict causes the charge to not be run (we'll get it the next day). But it is critical that we not charge someone twice. And it is critical that the read to verify that the record is active succeed immediately while the other operation is in its transaction.
I need to update a record where I am setting just a couple of columns. One use case is I set a new password hash for a user record. It's fine if other access occurs during this except for deleting the record (I think that's the only problem use case). If another web service is also updating that's the user's problem for doing this in 2 places simultaneously.
But it's key that the record stay consistent. And this includes the use case of "set NumUses = NumUses + #ParamNum" so it needs to treat the read, calculation, write of the column value as an atomic action. And if I am setting 3 column values, they all get written together.
1) Assuming that Invoicing process is an SP with multiple statements your best bet is to create another "lock" table to store the fact that invoicing job is already running e.g.
CREATE TABLE InvoicingJob( JobStarted DATETIME, IsRunning BIT NOT NULL )
-- Table will only ever have one record
INSERT INTO InvoicingJob
SELECT NULL, 0
EXEC InvoicingProcess
ALTER PROCEDURE InvoicingProcess
AS
BEGIN
DECLARE #InvoicingJob TABLE( IsRunning BIT )
-- Try to aquire lock
UPDATE InvoicingJob WITH( TABLOCK )
SET JobStarted = GETDATE(), IsRunning = 1
OUTPUT INSERTED.IsRunning INTO #InvoicingJob( IsRunning )
WHERE IsRunning = 0
-- job has been running for more than a day i.e. likely crashed without releasing a lock
-- OR ( IsRunning = 1 AND JobStarted <= DATEADD( DAY, -1, GETDATE())
IF NOT EXISTS( SELECT * FROM #InvoicingJob )
BEGIN
PRINT 'Another Job is already running'
RETURN
END
ELSE
RAISERROR( 'Start Job', 0, 0 ) WITH NOWAIT
-- Do invoicing tasks
WAITFOR DELAY '00:01:00' -- to simulate execution time
-- Release lock
UPDATE InvoicingJob
SET IsRunning = 0
END
2) Read about how transactions work: https://learn.microsoft.com/en-us/sql/t-sql/language-elements/transactions-transact-sql?view=sql-server-2017
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql?view=sql-server-2017
You second question is quite broad.
I'm trying to use apache camel (sql component) to update the DB. Problem is that the DB just doesn't get updated. The sql:update is working fine when i hard code the query, but when i try to use ${body[0][id]} it doesn't update required field. Any feedback on what might be going wrong?
from("direct:updateSql")
.to("sql:select * from mytable limit 1")
.log("update mytable set status = '100' where id = '${body[0][id]}'")
.to("sql:update mytable set status = '100' where id = '${body[0][id]}'")
.end();
Note that status and id are integer fields but if i remove the ' from the .to() then i throws some errors.
According to the docs you must use :# as the placeholder syntax
http://camel.apache.org/sql-component
So try with
.to("sql:update mytable set status = 100 where id = :#${body[0][id]}")
Below code worked too.
from("direct:updateSql")
.to("sql:select * from mytable limit 1")
.setHeader("id", simple("${body[0][id]}"))
.to("sql:update mytable set status_id = 100 where id = :#id")
.end();