Kafka MicrosoftSqlServerSource Connect - Duplicate entries in topic - sql-server

We have a requirement to set up the Kafka MicrosoftSqlServerSource connector.
This is to capture all the transactions (inserts/updates) performed on one of the sales tables in an Azure SQL database.
To support this source connector, we first enabled CDC at both the database and the table level.
We also created a view over the source table, which is the input for the source connector (table.types = VIEW in the connector configuration).
Once we completed the setup at both the connector and database level, we could see messages flowing into the topic (created automatically along with the connector) whenever new updates/inserts happened at the table level.
One strange behavior we observed while testing is that when we stopped testing, the last message received in the topic kept getting duplicated until a new message arrived.
Could you please help us understand whether this is expected system behavior, or did we miss some configuration that has resulted in these duplicate entries?
Please guide us on how we can tackle this duplication issue.
Attaching the snapshot
Connector Summary
Connector Class = MicrosoftSqlServerSource
Max Tasks = 1
kafka.auth.mode = SERVICE_ACCOUNT
kafka.service.account.id = **********
topic.prefix = ***********
connection.host = **************8
connection.port = 1433
connection.user = ***************
db.name = **************88
table.whitelist = item_status_view
timestamp.column.name = ProcessedDateTime
incrementing.column.name = SalesandRefundItemStatusID
table.types = VIEW
schema.pattern = dbo
db.timezone = Europe/London
mode = timestamp+incrementing
timestamp.initial = -1
poll.interval.ms = 10000
batch.max.rows = 1000
timestamp.delay.interval.ms = 30000
output.data.format = JSON

What you're describing is controlled by
mode = timestamp+incrementing
poll.interval.ms = 10000
It should save the last timestamp, then query only for timestamps greater than the last... If you are getting greater than or equal to, then that is certainly a bug that should be reported.
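For reference, in timestamp+incrementing mode the JDBC-based source connector polls with a query roughly of this shape (column and view names are taken from the configuration above; the exact SQL is an assumption and varies by connector version):
SELECT *
FROM dbo.item_status_view
WHERE ProcessedDateTime < ?   -- upper bound: now minus timestamp.delay.interval.ms
  AND (
        (ProcessedDateTime = ? AND SalesandRefundItemStatusID > ?)  -- same timestamp, higher incrementing id
        OR ProcessedDateTime > ?                                    -- or a strictly newer timestamp
      )
ORDER BY ProcessedDateTime, SalesandRefundItemStatusID ASC;
If ProcessedDateTime is a low-precision datetime, the last row can keep satisfying these bounds and be re-emitted on every poll, which is why the column type matters.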
Or you should read the docs:
A timestamp column must use datetime2 and not datetime. If the timestamp column uses datetime, the topic may receive numerous duplicates
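If the column behind the view is currently datetime, changing it to datetime2 follows from that recommendation. A minimal sketch, assuming the underlying table is named dbo.ItemStatus (a placeholder name):
ALTER TABLE dbo.ItemStatus
ALTER COLUMN ProcessedDateTime datetime2(7) NOT NULL;  -- match the column's existing NULL/NOT NULL setting here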
As an alternative, you could use Debezium (running your own connector rather than the Confluent Cloud offering) to truly stream all table operations.

Related

Using ServiceStack OrmLite, is there a way to get an execution plan?

Using ServiceStack OrmLite 6.4 and Azure SQL Server (with SQLServerDialect2012), we have an issue with an enum causing excessive stalling and timeouts.
If we just convert it to a string, it's as quick as it should be.
var results = db.Select(q => q.SomeColumn == enum.value); // -> 3.5 seconds
var results2 = db.Select(q => q.SomeColumn.ToString() == enum.value.ToString()); // -> 0.08 seconds
We are using default settings, so the enum in the DB is defined as a VARCHAR(255).
Both queries give the same result.
To track down the issue we wanted to see what it's actually firing, but all we get is a query with some #1, #2, etc., with no indication of what parameters were used or how they are defined.
All our attempts to get a 1:1 SQL string we can use to manually test the query and see the results have failed... MiniProfiler was the closest, as it shows the parameter values,
but it does not contain the details necessary to recreate the query as it was used and reproduce the issue we have (manually recreating the query gives 80 ms, as above).
Trying to get the execution plan along with the query also fails.
db.ExecuteSql("SET STATISTICS PROFILE ON;");
var results = db.Select(q => q.SomeColumn == enum.value);
db.ExecuteSql("SET STATISTICS PROFILE OFF;");
This only returns data, not any of the extra info I was hoping for.
I have not been able to find any sites or threads that explain how others get any kind of debug info.
What is the correct next step here?
OrmLite's Logging & Introspection page shows how you can view the SQL generated by OrmLite.
E.g. Configuring a debug logger should log the generated SQL and params:
LogManager.LogFactory = new ConsoleLogFactory(debugEnabled:true);

ALTERing a database table inside a test

For a seemingly good reason, I need to alter something in a table that was created from a fixture while a test runs.
Of the many things I tried, here's what I have right now. Before I even get to the ALTER query, I first want to make sure I have access to the database where the temporary tables sit.
public function testFailingUpdateValidation()
{
    // I will run SQL against a connection, so let's get its exact name
    // from one of its tables; first get the instance of that table
    // (the setup was blindly copied and edited from some docs)
    $config = TableRegistry::getTableLocator()->exists('External/Products')
        ? []
        : ['className' => 'External/Products'];
    $psTable = TableRegistry::getTableLocator()
        ->get('External/Products', $config);

    // get the connection name to work with from the table
    $connectionName = $psTable->getConnection()->configName();

    // get the connection instance to run the query
    $db = \Cake\Datasource\ConnectionManager::get($connectionName);

    // run the query
    $statement = $db->query('SHOW TABLES;');
    $statement->execute();
    debug($statement->fetchAll());
}
The output is empty.
########## DEBUG ##########
[]
###########################
I'm expecting it to contain at least the name of the table I got the connection name from. Once I have a working connection, I'd run an ALTER query on a specific column.
How do I do that? Please help.
Thought I'd clarify what I'm up to, in case that's needed:
This is part of an integration test. The methods involved are already tested individually. The code inside the method I'm testing takes a bunch of values from one database table and copies them to another, then it immediately validates whether the values in the two tables match. I need to simulate a case where they don't.
The best I could come up with is to change the schema of the table column so that it stores values with low precision, causing my test to fail when it tries to match (validate) the values.
The problem was that I hadn't loaded the fixture for the table I wanted to ALTER. I know it's kind of obvious, but I'm still confused, because then why didn't SHOW TABLES return the other tables that did have their fixtures loaded? It showed me no tables at all, so I thought it wasn't working and didn't even get as far as loading that exact fixture. Once I did load it, it now shows me that fixture's table, along with the others.
Anyway, here's the working code. Hope it helps someone figuring this out.
// get the table instance
$psTable = TableRegistry::getTableLocator()->get('External/Products');
// get the table's corresponding connection name
$connectionName = $psTable->getConnection()->configName();
// get the underlying connection instance by it's name
$connection = \Cake\Datasource\ConnectionManager::get($connectionName);
// get schemas available within connection
$schemaCollection = $connection->getSchemaCollection();
// describe the table schema before ALTERating it
$schemaDescription = $schemaCollection->describe($psTable->getTable());
// make sure the price column is initially DECIMAL
$this->assertEquals('decimal', $schemaDescription->getColumnType('price'));
// keep the initial column data for future reference
$originalColumn = $schemaDescription->getColumn('price');
// ALTER the price column to be TINYINT(2), as opposed to DECIMAL
$sql = sprintf(
    'ALTER TABLE %s CHANGE COLUMN price price TINYINT(2) UNSIGNED NOT NULL',
    $psTable->getTable()
);
$statement = $connection->query($sql);
$statement->closeCursor();
// describe the table schema after ALTERations
$schemaDescription = $schemaCollection->describe($psTable->getTable());
// make sure the price column is now TINYINT(2)
$this->assertEquals('tinyinteger', $schemaDescription->getColumnType('price'));
Note: after this test was done, I had to restore the column back to its initial type using the data from $originalColumn above.
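For reference, that restore is just another ALTER in the same style; a sketch assuming MySQL, with a placeholder table name and DECIMAL(10,2) definition (the real type and precision come from $originalColumn):
ALTER TABLE products
CHANGE COLUMN price price DECIMAL(10,2) NOT NULL;  -- placeholder definition; use the values saved in $originalColumn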

Common strategy in handling concurrent global 'inventory' updates

To give a simplified example:
I have a database with one table, names, which has 1 million records, each containing a common boy's or girl's name, with more added every day.
I have an application server that takes as input an HTTP request from parents using my website 'Name Chooser'. With each request, I need to pick a name from the DB and return it, and then NOT give that name to another parent. The server is concurrent so it can handle a high volume of requests, yet it has to respect "unique name per request" and still be highly available.
What are the major components and strategies for an architecture of this use case?
From what I understand, you have two operations: adding a name and choosing a name.
I have a couple of questions:
Question 1: Do parents only choose names, or do they also add names?
Question 2: If they add names, does that mean that when a name is added it should also be marked as already chosen?
Assuming that you don't want to make all name-selection requests wait for one another (by locking or queueing them):
One solution to resolve concurrency in the case of choosing a name only is to use an Optimistic Offline Lock.
The most common implementation of this is to add a version field to your table and increment that version when you mark a name as chosen. You will need DB support for this, but most databases offer a mechanism for it. MongoDB adds a version field to documents by default; for an RDBMS (i.e. an SQL database) you have to add this field yourself.
You haven't specified what technology you are using, so I will give an example using pseudocode for an SQL DB. For MongoDB you can check how the DB makes these checks for you.
NameRecord {
    id,
    name,
    parentID,
    version,
    isChosen,

    function chooseForParent(parentID) {
        if (this.isChosen) {
            throw Error/Exception;
        }
        this.parentID = parentID;
        this.isChosen = true;
        this.version++;
    }
}

NameRecordRepository {
    function getByName(name) { ... }

    function save(record) {
        var oldVersion = record.version - 1;
        var query = "UPDATE records SET .....
                     WHERE id = {record.id} AND version = {oldVersion}";
        var rowsCount = db.execute(query);
        if (rowsCount == 0) {
            throw ConcurrencyViolation;
        }
    }
}

// somewhere else in an object or module or whatever...
function chooseName(parentID, name) {
    var record = NameRecordRepository.getByName(name);
    record.chooseForParent(parentID);
    NameRecordRepository.save(record);
}
Before this object is saved to the DB, a version comparison must be performed. SQL provides a way to execute a query based on some condition and return the count of affected rows. In our case, we check whether the version in the database is the same as the old one before updating. If it's not, that means someone else has updated the record in the meantime.
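As a concrete sketch of that check (T-SQL flavored, using the table and columns from the pseudocode above):
UPDATE records
SET parentID = @parentID,
    isChosen = 1,
    version  = version + 1
WHERE id = @id
  AND version = @oldVersion;

-- @@ROWCOUNT is 0 when the version no longer matches, i.e. someone else updated the row first
SELECT @@ROWCOUNT AS affected_rows;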
In this simple case you can even remove the version field and use the isChosen flag in your SQL query like this:
var query = "UPDATE records SET .....
             WHERE id = {record.id} AND isChosen = false";
When adding a new name to the database, you will need a unique constraint on the name column, which will take care of the concurrency issues there.
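A minimal sketch of that constraint (the constraint name is arbitrary):
ALTER TABLE names
ADD CONSTRAINT uq_names_name UNIQUE (name);
With this in place, a concurrent duplicate insert fails with a constraint violation that the application can catch and handle.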

JDBC batched updates: good key retrieval strategy

I insert a lot of data into a table with an auto-generated key using the batch update functionality of JDBC. Because JDBC doesn't say anything about combining batch updates with getGeneratedKeys(), I need some database-independent workaround.
My ideas:
1. Somehow pull the next handed-out sequence values from the database before inserting, and then use the keys manually. But JDBC hasn't got a getTheNextFutureKeys(howMany). So how can this be done? Is pulling keys, e.g. in Oracle, also transaction safe, so that only one transaction can ever pull a given set of future keys?
2. Add an extra column with a fake id that is only valid during the transaction.
3. Use all the other columns as a secondary key to fetch the generated key. This isn't really 3NF-conformant...
Are there better ideas or how can I use idea 1 in a generalized way?
Partial answer
Is pulling keys, e.g. in Oracle, also transaction safe?
Yes, getting values from a sequence is transaction safe, by which I mean even if you roll back your transaction, a sequence value returned by the DB won't be returned again under any circumstances.
So you can prefetch the IDs from a sequence and use them in the batch insert.
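A hedged sketch of such a prefetch against Oracle (ids_seq is a placeholder sequence; the CONNECT BY trick works on reasonably recent Oracle versions, otherwise fetch NEXTVAL in a loop):
-- fetch the next 1000 sequence values in one round trip
SELECT ids_seq.NEXTVAL AS prefetched_id
FROM dual
CONNECT BY LEVEL <= 1000;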
I'd never run into this before, so I dived into it a little.
First of all, there is a way to retrieve the generated ids from a JDBC statement:
String sql = "INSERT INTO AUTHORS (LAST, FIRST, HOME) VALUES " +
             "('PARKER', 'DOROTHY', 'USA')";
int rows = stmt.executeUpdate(sql, Statement.RETURN_GENERATED_KEYS);
ResultSet rs = stmt.getGeneratedKeys();
if (rs.next()) {
    ResultSetMetaData rsmd = rs.getMetaData();
    int colCount = rsmd.getColumnCount();
    do {
        for (int i = 1; i <= colCount; i++) {
            String key = rs.getString(i);
            System.out.println("key " + i + " is " + key);
        }
    } while (rs.next());
} else {
    System.out.println("There are no generated keys.");
}
See this: http://download.oracle.com/javase/1.4.2/docs/guide/jdbc/getstart/statement.html#1000569
Also, theoretically it could be combined with JDBC batch updates, although this combination seems to be rather non-trivial; on this, please refer to this thread.
I suggest trying this, and if you do not succeed, falling back to pre-fetching from the sequence.
getGeneratedKeys() will also work with a batch update, as far as I remember.
It returns a ResultSet with all newly created IDs, not just a single value.
But that requires that the ID is populated through a trigger during the INSERT operation.
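For context, the classic Oracle pattern where the ID is filled in by a trigger looks roughly like this (sequence, trigger, table, and column names are placeholders):
CREATE SEQUENCE authors_seq;

CREATE OR REPLACE TRIGGER authors_before_insert
BEFORE INSERT ON authors
FOR EACH ROW
WHEN (NEW.id IS NULL)
BEGIN
  -- populate the primary key from the sequence during the INSERT
  SELECT authors_seq.NEXTVAL INTO :NEW.id FROM dual;
END;
/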

How to use MS Sync Framework to filter client-specific data?

Let's say I've got a SQL 2008 database table with lots of records associated with two different customers, Customer A and Customer B.
I would like to build a fat client application that fetches all of the records that are specific to either Customer A or Customer B based on the credentials of the requesting user, then stores the fetched records in a temporary local table.
Thinking I might use the MS Sync Framework to accomplish this, I started reading about row filtering when I came across this little chestnut:
Do not rely on filtering for security. The ability to filter data from the server based on a client or user ID is not a security feature. In other words, this approach cannot be used to prevent one client from reading data that belongs to another client. This type of filtering is useful only for partitioning data and reducing the amount of data that is brought down to the client database.
So, is this telling me that the MS Sync Framework is only a good option when you want to replicate an entire table between point A and point B?
Doesn't that seem to be an extremely limiting characteristic of the framework? Or am I just interpreting this statement incorrectly? Or is there some other way to use the framework to achieve my purposes?
Ideas anyone?
Thanks!
No, it is only a security warning.
We use filtering extensively in our semi-connected app.
Here is some code to get you started:
// helper
void PrepareFilter(string tablename, string filter)
{
    SyncAdapters.Remove(tablename);

    var ab = new SqlSyncAdapterBuilder(this.Connection as SqlConnection);
    ab.TableName = "dbo." + tablename;
    ab.ChangeTrackingType = ChangeTrackingType.SqlServerChangeTracking;
    ab.FilterClause = filter;

    var cpar = new SqlParameter("@filterid", SqlDbType.UniqueIdentifier);
    cpar.IsNullable = true;
    cpar.Value = DBNull.Value;
    ab.FilterParameters.Add(cpar);

    var nsa = ab.ToSyncAdapter();
    nsa.TableName = tablename;
    SyncAdapters.Add(nsa);
}

// usage
void SetupFooBar()
{
    var tablename = "FooBar";
    var filter = "FooId IN (SELECT BarId FROM dbo.GetAllFooBars(@filterid))";
    PrepareFilter(tablename, filter);
}
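The dbo.GetAllFooBars(@filterid) used in the filter clause above is assumed to be a table-valued function you define yourself; a minimal sketch of what it might look like (column names are placeholders):
CREATE FUNCTION dbo.GetAllFooBars (@filterid uniqueidentifier)
RETURNS TABLE
AS
RETURN
(
    -- return the ids that the given client/filter id is allowed to sync
    SELECT fb.BarId
    FROM dbo.FooBar AS fb
    WHERE fb.ClientId = @filterid
);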
