Increase Throughput of Debezium Kafka Connector (SQL Server)

I have started using Debezium recently to capture change data in real time and sink it to a target database.
Instead of Kafka, I use Azure Event Hubs with Kafka Connect to connect to SQL Server, and the Confluent JDBC sink connector to write the changed data to the target database, which is also SQL Server.
I understand Debezium works asynchronously to limit its impact on the source database's performance, but is there any way I can increase the streaming throughput?
I recently spun up the Event Hubs namespace with a minimum of 10 throughput units and auto-inflate up to 20, so I expect Debezium + Kafka Connect + Event Hubs to ingest 10-20 MB/second, with egress of 20-40 MB/second.
However, the real performance is far worse. I manually imported 10k records, less than 6 MB, into the source database, so I expected Debezium and the sink connector to capture the changes and write them to the target database within a few seconds.
Instead of the data arriving all at once, the sink connector trickles the updates into the target database.
The following is my configuration. Please let me know if I need to change it to improve performance. Any help would be very much appreciated.
Kafka Connect:
bootstrap.servers=sqldbcdc.servicebus.windows.net:9093
group.id=connect-cluster-group
# connect internal topic names, auto-created if not exists
config.storage.topic=connect-cluster-configs
offset.storage.topic=connect-cluster-offsets
status.storage.topic=connect-cluster-status
# internal topic replication factors - auto 3x replication in Azure Storage
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
rest.advertised.host.name=connect
offset.flush.interval.ms=10000
connections.max.idle.ms=180000
metadata.max.age.ms=180000
auto.register.schemas=false
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
# required EH Kafka security settings
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://sqldbcdc.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**************************=";
producer.security.protocol=SASL_SSL
producer.sasl.mechanism=PLAIN
producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://sqldbcdc.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**************************=";
consumer.security.protocol=SASL_SSL
consumer.sasl.mechanism=PLAIN
consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://sqldbcdc.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**************************=";
plugin.path=C:\kafka\libs
SQL Connector:
{
  "name": "sql-server-connection",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",
    "database.hostname": "localhost",
    "database.port": "1433",
    "database.user": "sa",
    "database.password": "******",
    "database.dbname": "demodb",
    "database.server.name": "dbservername",
    "table.whitelist": "dbo.portfolios",
    "database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3"
  }
}
Sink Connector:
{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "portfolios",
    "connection.url": "jdbc:sqlserver://localhost:1433;instance=NEWMSSQLSERVER;databaseName=demodb",
    "connection.user": "sa",
    "connection.password": "*****",
    "batch.size": 2000,
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": false,
    "transforms.unwrap.delete.handling.mode": "none",
    "auto.create": "true",
    "insert.mode": "upsert",
    "delete.enabled": true,
    "pk.fields": "portfolio_id",
    "pk.mode": "record_key",
    "table.name.format": "replicated_portfolios"
  }
}
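For context, a rough sketch of the settings that usually govern batching throughput in this kind of pipeline. The property names below are standard Kafka Connect worker overrides and Debezium SQL Server connector options, but the values are purely illustrative and would need to be tuned against the actual workload:
# Kafka Connect worker: batching for the sink's consumer and the source's producer (illustrative values)
consumer.max.poll.records=2000
consumer.fetch.min.bytes=65536
producer.batch.size=262144
producer.linger.ms=50
On the Debezium source side, "max.batch.size" and "max.queue.size" (set in the source connector JSON) control how many change events are read per batch, and the JDBC sink's batch.size of 2000 can only be reached if the consumer actually delivers that many records per poll, which is why consumer.max.poll.records matters here.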

Related

How do I keep track of progress of a Cloudant replication?

If I have an IBM Cloudant database with around one million documents in it and I set up a replication process to copy this data to another region, how can I tell how far the replication job has progressed? I know when it starts and when it has finished, but nothing in between. Is there a way to track progress?
If the target database was empty at the start of the replication, and there's no other changes being written to the target, then it's a case of waiting until the sequence token of the target database matches the sequence token of the source.
You can find the current sequence token of a database by using the GET /<database name> endpoint on the source & the target e.g.
curl $URL/sourcedb
{
  "update_seq": "23600-g1AAAARXeJyd0",
  "db_name": "sourcedb",
  ...
}
In the above example, there are 23600 changes that the replicator needs to work through. The same command can be run against the target database to see the progress of replication.
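To automate that comparison, here is a minimal sketch in Python. It assumes the target database started empty, treats the numeric prefix of update_seq as an approximate change counter (as in the example above), and uses placeholder URLs and credentials:
import requests

SOURCE = "https://source.cloudant.com/sourcedb"  # placeholder URLs
TARGET = "https://target.cloudant.com/targetdb"
AUTH = ("apikey", "apipassword")                 # placeholder credentials

def seq_prefix(db_url):
    """Return the numeric prefix of the database's update_seq."""
    info = requests.get(db_url, auth=AUTH).json()
    return int(info["update_seq"].split("-")[0])

source_seq = seq_prefix(SOURCE)
target_seq = seq_prefix(TARGET)
print(f"~{target_seq}/{source_seq} changes applied "
      f"({100.0 * target_seq / max(source_seq, 1):.1f}% complete)")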
Alternatively, there is an API endpoint that allows you to view the replication job's progress: GET _scheduler/docs/_replicator/<replication id> where the replication id is the _id of the document in the _replicator database that was created to initiate the replication.
It returns an object like this:
{
  "database": "_replicator",
  "doc_id": "e0330b1936f6194da22af8fa663c5be8",
  "id": null,
  "source": "https://source.cloudant.com/sourcedb/",
  "target": "https://target.cloudant.com/targetdb/",
  "state": "completed",
  "error_count": 0,
  "info": {
    "revisions_checked": 1005,
    "missing_revisions_found": 1005,
    "docs_read": 1005,
    "docs_written": 1005,
    "changes_pending": 376,
    "doc_write_failures": 0,
    "checkpointed_source_seq": "1011-g1AAAAfLeJy91FFKwzAYwPGigo_uBvqq0JmkbZqCsomoj3oDzZcvZYxtFbc96w30BnoDvYHeQG-gN9AbzCYpbntbhfQlhdJ"
  },
  "start_time": "2020-11-17T09:55:01Z",
  "last_updated": "2020-11-17T09:55:58Z"
}
This includes the status of the replication, how many documents have been processed, and the last checkpointed sequence token, which should be enough to estimate the progress of the replication job.
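The same scheduler document can also be polled programmatically; a sketch in Python, with the account URL and credentials as placeholders and the document id taken from the example above:
import requests

ACCOUNT = "https://target.cloudant.com"       # placeholder account URL
DOC_ID = "e0330b1936f6194da22af8fa663c5be8"   # _id of the _replicator document
AUTH = ("apikey", "apipassword")              # placeholder credentials

job = requests.get(f"{ACCOUNT}/_scheduler/docs/_replicator/{DOC_ID}", auth=AUTH).json()
info = job.get("info") or {}
print(job["state"],
      info.get("docs_written"), "docs written,",
      "checkpointed at source seq", str(info.get("checkpointed_source_seq", "")).split("-")[0])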
The full details of the API call are here.

Read Flink latency tracking metric in Datadog

I'm following this doc https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/metrics/#end-to-end-latency-tracking and enabled metrics.latency.interval in flink-conf.yaml as shown below:
metrics.latency.interval: 60000
metrics.latency.granularity: operator
Now, I have the following questions:
How can I find out which metrics (a list of metric names) are enabled? I didn't see any in the metrics UI.
Datadog is my reporter. Will the latency metrics be sent to Datadog just like the other system metrics listed here https://docs.datadoghq.com/integrations/flink/#data-collected? If yes, what are their names? If not, is there anything I need to do to get them into Datadog?
I'm new to both Flink and Datadog. Many thanks!
You can access these metrics via the REST API:
http://{job_manager_address}:8081/jobs/{job_id}/metrics
which will return:
[
  {
    "id": "latency.source_id.3d28eee20f19966ad0843c8183e96045.operator_id.9c9bbdbebfd61a4aaac39e2c417a4f21.operator_subtask_index.7.latency_min"
  },
  {
    "id": "latency.source_id.bca0e5ddee87a6f64a26077804c63e69.operator_id.197249262ed30764bb323b65405e10b4.operator_subtask_index.14.latency_p75"
  },
  {
    "id": "latency.source_id.bca0e5ddee87a6f64a26077804c63e69.operator_id.b9d4ed4c91fec482427d3584100b1c90.operator_subtask_index.12.latency_median"
  }
]
The first entry, for example, is the latency from source_id 3d28eee20... to operator_id 9c9bbdbe... for subtask index 7.
However, I don't know the exact meaning of latency_p75 or latency_min. Maybe someone else can help us both.
@monstero has explained where to find the latency metrics: they are job metrics.
The latency metrics are histogram metrics. latency_p75, for example, is the 75th percentile latency, meaning that 75% of the time the latency was less than the reported value.
In all, you can access the min, max, mean, median, stddev, p75, p90, p95, p98, p99, and p999.
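To actually read the values and not just the ids, the same endpoint accepts a get query parameter with a comma-separated list of metric ids. A minimal sketch in Python, with the JobManager address and job id as placeholders:
import requests

BASE = "http://localhost:8081"          # placeholder JobManager REST address
JOB_ID = "replace-with-your-job-id"

# List the metric ids registered for the job and keep only the latency histograms.
all_metrics = requests.get(f"{BASE}/jobs/{JOB_ID}/metrics").json()
latency_ids = [m["id"] for m in all_metrics if m["id"].startswith("latency.")]

# Passing ids via ?get= returns their current values.
values = requests.get(f"{BASE}/jobs/{JOB_ID}/metrics",
                      params={"get": ",".join(latency_ids[:10])}).json()
for metric in values:
    print(metric["id"], metric.get("value"))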

Getting an error while ingesting nested Avro into an MS SQL table using kafka-connect-jdbc in Kafka

As part of a POC, I am trying to ingest Avro messages (with Schema Registry enabled) from Kafka topics into a JDBC sink (MS SQL database). But I am facing some issues while ingesting nested Avro data into an MS SQL table. I am using the kafka-connect-jdbc sink to ingest Avro data into an MS SQL table from the Kafka Avro console producer.
Details are below.
Kafka Avro Producer CLI Command
kafka-avro-console-producer --broker-list server1:9092,server2:9092,server3:9092 --topic testing25 --property schema.registry.url=http://server3:8081 --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"tradeid","type":"int"},{"name":"tradedate", "type": "string"}, {"name":"frontofficetradeid", "type": "int"}, {"name":"brokerorderid","type": "int"}, {"name":"accountid","type": "int"}, {"name": "productcode", "type": "string"}, {"name": "amount", "type": "float"}, {"name": "trademessage", "type": { "type": "array", "items": "string"}}]}'
JDBC-Sink.properties
name=test-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=testing25
connection.url=jdbc:sqlserver://testing;DatabaseName=testkafkasink;user=Testkafka
insert.mode=upsert
pk.mode=record_value
pk.fields=tradeid
auto.create=true
tranforms=FlattenValueRecords
transforms.FlattenValueRecords.type=org.apache.kafka.connect.transforms.Flatten$Value
transforms.FlattenValueRecords.field=trademessage
connect-avro-standalone.properties
bootstrap.servers=server1:9092,server2:9092,server3:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://server3:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://server3:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/usr/share/java
So after running the JDBC sink and the producer, when I try to insert data from the CLI I get this error:
ERROR WorkerSinkTask{id=test-sink-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. (org.apache.kafka.connect.runtime.WorkerSinkTask:584)
org.apache.kafka.connect.errors.ConnectException: null (ARRAY) type doesn't have a mapping to the SQL database column type
I understand that it is failing on the array data type, as SQL Server has no such data type. So I researched and found that we can use Kafka Connect's SMT (Single Message Transform) functionality (Flatten) to flatten nested values.
But this does not seem to be working in my case; the transform values passed in the JDBC sink are doing nothing. In fact, I tested other transformations as well, like InsertField$Value and InsertField$Key, but none of them work. Please let me know if I am doing anything wrong in running these transformations in Kafka Connect.
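For reference, a minimal sketch of how a Flatten transform is typically declared on a sink connector; the alias and delimiter below are illustrative, and the chain is registered under the transforms key:
transforms=flattenValue
transforms.flattenValue.type=org.apache.kafka.connect.transforms.Flatten$Value
# Flatten concatenates nested field names with this delimiter (the default is ".")
transforms.flattenValue.delimiter=_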
Any help would be appreciated.
Thanks

DocumentDB Sql Injection?

I'm trying to offload some client-specific query building to the client. I don't think I'm in danger of SQL injection with DocumentDB, since it doesn't have UPDATE or DELETE statements, but I'm not positive. Additionally, I don't know whether these will be added in the future.
Here is an example of my problem.
IceCreamApp wants to find all flavors where the name is like "choco". A flavor document looks like this:
{
  "name": "Chocolate",
  "price": 1.50
}
The API knows about the DocumentDB instance and how to request data from it, but it doesn't know the entity structure of any of the clients' entities. So doing this on the API:
_documentClient.CreateDocumentQuery("...")
.Where((d) => d.name.Contains(query));
would throw an error (d is dynamic, and name isn't necessarily a common property).
I could build this on the client and send it.
Client search request:
{
  "page": 1,
  "pageSize": 10,
  "query": "CONTAINS(name, 'choco')"
}
Without sanitization this would be a big no-no for SQL. But does it, or will it ever, matter for DocumentDB? How safe is it to run unsanitized client queries?
As the official document Announcing SQL Parameterization in DocumentDB says:
Using this feature, you can now write parameterized SQL queries. Parameterized SQL provides robust handling and escaping of user input, preventing accidental exposure of data through "SQL injection". Let's take a look at a sample using the .NET SDK; in addition to plain SQL strings and LINQ expressions, we've added a new SqlQuerySpec class that can be used to build parameterized queries.
DocumentDB is not susceptible to the most common kinds of injection attacks that lead to "elevation of privileges" because queries are strictly read-only operations. However, it might be possible for a user to gain access to data they shouldn't be accessing within the same collection by crafting malicious SQL queries. SQL parameterization support helps prevent these sorts of attacks.
Here's an official sample that queries a "Books" collection with a single user-supplied parameter for the author name:
POST https://contosomarketing.documents.azure.com/dbs/XP0mAA==/colls/XP0mAJ3H-AA=/docs HTTP/1.1
x-ms-documentdb-isquery: True
x-ms-date: Mon, 18 Aug 2014 13:05:49 GMT
authorization: type%3dmaster%26ver%3d1.0%26sig%3dkOU%2bBn2vkvIlHypfE8AA5fulpn8zKjLwdrxBqyg0YGQ%3d
x-ms-version: 2014-08-21
Accept: application/json
Content-Type: application/query+json
Host: contosomarketing.documents.azure.com
Content-Length: 50
{
  "query": "SELECT * FROM books b WHERE (b.Author.Name = @name)",
  "parameters": [
    { "name": "@name", "value": "Herman Melville" }
  ]
}
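Applied to the flavor search in the question, the client could send a parameterized query spec instead of a raw predicate. A sketch, using the property name from the question's documents and a hypothetical @q parameter:
{
  "query": "SELECT * FROM c WHERE CONTAINS(c.name, @q)",
  "parameters": [
    { "name": "@q", "value": "choco" }
  ]
}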

Apache Drill 1.2 and SQL Server JDBC

Apache Drill 1.2 adds the exciting feature of including JDBC relational sources in your query. I would like to include Microsoft SQL Server.
So, following the docs, I copied the SQL Server jar sqljdbc42.jar (the most recent MS JDBC driver) into the proper 3rd-party directory.
I successfully added the storage.
The configuration is:
{
  "type": "jdbc",
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url": "jdbc:sqlserver://myservername",
  "username": "myusername",
  "password": "mypassword",
  "enabled": true
}
as "mysqlserverstorage"
However, running queries fails. I've tried:
select * from mysqlserverstorage.databasename.schemaname.tablename
(of course, I used real existing tables instead of the placeholders here)
Error:
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: From line 2, column 6 to line 2, column 17: Table 'mysqlserverstorage.databasename.schemaname.tablename' not found [Error Id: f5b68a73-973f-4292-bdbf-54c2b6d5d21e on PC1234:31010]
and
select * from mysqlserverstorage.`databasename.schemaname.tablename`
Error:
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: Exception while reading tables [Error Id: 213772b8-0bc7-4426-93d5-d9fcdd60ace8 on PC1234:31010]
Has anyone had success in configuring and using this new feature?
Success has been reported using a storage plugin configuration, such as
{
  type: "jdbc",
  enabled: true,
  driver: "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  url: "jdbc:sqlserver://172.31.36.88:1433;databaseName=msdb",
  username: "root",
  password: "<password>"
}
on pre-release Drill 1.3 and using sqljdbc41.4.2.6420.100.jar.
Construct your query as:
select * from storagename.schemaname.tablename
This will work with sqljdbc4.x, as it does for me.
