Docker, Debezium not streaming data from mssql to elasticsearch - sql-server

I followed this examples to stream data from mysql to elasticsearch
https://github.com/debezium/debezium-examples/tree/master/unwrap-smt#elasticsearch-sink
The example itself works great on my local machine.
But in my case I want to stream data from mssql (which is on another server, not docker) to elasticsearch.
So in the "docker-compose-es.yaml" file i removed "mysql" part and removed the mysql links.
And created my own connectors/sink for elastic and mssql:
{
"name": "Test-connector",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.hostname": "192.168.1.234",
"database.port": "1433",
"database.user": "user",
"database.password": "pass",
"database.dbname": "Test",
"database.server.name": "MyServer",
"table.include.list": "dbo.TEST_A",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "dbhistory.testA"
}
}
{
"name": "elastic-sink-test",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "TEST_A",
"connection.url": "http://localhost:9200/",
"transforms": "unwrap,key",
"transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
"transforms.unwrap.drop.tombstones": "false",
"transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.key.field": "SQ",
"key.ignore": "false",
"type.name": "TEST_A",
"behavior.on.null.values": "delete"
}
}
When adding these the kafka connect I/O is working hard and has over 40GB input see image below:
In the kafka logs it looks like its going through all the tables. Here is one of the table logs:
2021-06-17 10:20:10,414 - INFO [data-plane-kafka-request-handler-5:Logging#66] - [Partition MyServer.dbo.TemplateGroup-0 broker=1] Log loaded for partition MyServer.dbo.TemplateGroup-0 with initial high watermark 0
2021-06-17 10:20:10,509 - INFO [data-plane-kafka-request-handler-3:Logging#66] - Creating topic MyServer.dbo.TemplateMeter with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(1))
2021-06-17 10:20:10,516 - INFO [data-plane-kafka-request-handler-3:Logging#66] - [KafkaApi-1] Auto creation of topic MyServer.dbo.TemplateMeter with 1 partitions and replication factor 1 is successful
2021-06-17 10:20:10,526 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(MyServer.dbo.TemplateMeter-0)
2021-06-17 10:20:10,528 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [Log partition=MyServer.dbo.TemplateMeter-0, dir=/kafka/data/1] Loading producer state till offset 0 with message format version 2
The database is only 2GB. I'm not sure why it has so high input.
No test_a index was created in elasticsearch when running this command:
curl http://localhost:9200/_aliases?pretty=true
Does anyone know how I troubleshoot from here or point me to the right direction?
Thanks in advance!

how I troubleshoot from here
docker compose logs?
Modify the log4j.properties of Kafka Connect and/or Elasitcsearch processes to get more logs?
Use a regular Kafka consumer to see if data is actually read into the TEST_A topic?
in the "docker-compose-es.yaml" ....
If Debezium is running in a container, then Elasticsearch is not available at localhost:9200
Change that value to http://elastic:9200, like shown in the es-sink.json

Related

Flink - unable to recover after yarn node termination

We are running flink on yarn. We were performing Disaster recovery Testing and as part of that, we manually terminated one of the nodes that had a flink application running. Once the instance was brought back up, the application went in for multiple attempts and each attempt had the following error :
AM Container for appattempt_1602902099413_0006_000027 exited with exitCode: -1000
Failing this attempt.Diagnostics: Could not obtain block: BP-986419965-xx.xx.xx.xx-1602902058651:blk_1073743332_2508
file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
org.apache.hadoop.hdfs.BlockMissingException:
Could not obtain block: BP-986419965-10.61.71.85-1602902058651:blk_1073743332_2508 file=/user/hadoop/.flink/application_1602902099413_0006/application_1602902099413_0006-flink-conf.yaml1528536851005494481.tmp
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1053)at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1036)at
org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1015)at
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:647)at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:926)at
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:982)at
java.io.DataInputStream.read(DataInputStream.java:100)at
org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:90)at
org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:64)at
org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:125)at
org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:369)at
org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)at
org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)at
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)at
org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)at
java.security.AccessController.doPrivileged(Native Method)at
javax.security.auth.Subject.doAs(Subject.java:422)at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)at
org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)at
org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)at
java.util.concurrent.FutureTask.run(FutureTask.java:266)at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)at
java.util.concurrent.FutureTask.run(FutureTask.java:266)at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)at
java.lang.Thread.run(Thread.java:748)For
more detailed output, check the application tracking page: http://<>.compute.internal:8088/cluster/app/application_1602902099413_0006 Then click on links to logs of each attempt.
Could someone let us know what content is being stored in HDFS and if this could be redirected to S3?
Adding checkpoint related settings :
StateBackend rocksDbStateBackend = new RocksDBStateBackend("s3://Path", true);
streamExecutionEnvironment.setStateBackend(rocksDbStateBackend)
streamExecutionEnvironment.enableCheckpointing(10000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
streamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000);
streamExecutionEnvironment.getCheckpointConfig().setCheckpointTimeout(60000);
streamExecutionEnvironment.getCheckpointConfig().setMaxConcurrentCheckpoints(60000);
streamExecutionEnvironment.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
streamExecutionEnvironment.getCheckpointConfig().setPreferCheckpointForRecovery(true);
I had the same issue,
I fixed with setting for hdfs-site
{
"Classification": "hdfs-site",
"Properties": {
"dfs.client.use.datanode.hostname": "true",
"dfs.replication": "2",
"dfs.namenode.replication.min": "2",
"dfs.namenode.maintenance.replication.min": "2"
}
}
I think hdfs lost data on node terminated, so i replication data on multi node.

How do I properly extract/convert JSON data into objects?

I have structure below coming via webhook and I'm having trouble understanding if there is built in action in Logic App to get nicely formatted object array of rows which is inside table, where each item will have name and associated value with it.
"SearchResults": {
"tables": [
{
"name": "PrimaryResult",
"columns": [
{
"name": "TimeGenerated",
"type": "datetime"
},
{
"name": "ResourceGroup",
"type": "string"
},
{
"name": "ActivityStatusValue",
"type": "string"
},
{
"name": "d_resource",
"type": "dynamic"
},
{
"name": "c_title",
"type": "dynamic"
},
{
"name": "c_details",
"type": "dynamic"
}
],
"rows": [
[
"2020-06-18T16:30:07.89Z",
"USEASTPROD",
"Updated",
"aueglbwvhypap07",
"Remote disk disconnected",
"We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
],
[
"2020-06-18T16:30:07.89Z",
"USEASTPROD",
"Updated1",
"agggggypap07",
"Remote disk disconnected",
"We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
]
]
}
]
}
I'd like it to be an array where columns are entity name and each value from rows is it's value like below
"rows": [
[
"timeGenerated" :"2020-06-18T16:30:07.89Z",
"ResourceGroup": "USEASTPROD",
"ActivityStatusValue":"Updated",
"d_resource" : "aueglbwvhypap07",
"c_title" : "Remote disk disconnected",
"c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
],
[
"timeGenerated" :"2020-06-18T16:30:07.89Z",
"ResourceGroup": "USEASTPROD",
"ActivityStatusValue":"Updated1",
"d_resource" : "agggggypap07",
"c_title" : "Remote disk disconnected",
"c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
]
]
For this requirement, we can just use liquid with integration account in logic app to implement it. Please refer to the steps below:
1. We need to create an integration account on azure portal first and link it to your logic app. You can refer to this tutorial.
2. In my logic app, I initialize a variable and store your data(add a {} around the data) to simulate your situation.
3. Then use "Parse JSON" action to parse the string above.
4. Create a liquid map in local (I named it as testRow.liquid) with the code below:
{% assign rows = content.SearchResults.tables[0].rows %}
{
"rows": [
{% for item in rows %}
{
"timeGenerated": "{{item[0]}}",
"ResourceGroup": "{{item[1]}}",
"ActivityStatusValue": "{{item[2]}}",
"d_resource": "{{item[3]}}",
"c_title": "{{item[4]}}",
"c_details": "{{item[5]}}"
},
{% endfor %}
]
}
Upload the liquid map(testRow.liquid) to your integration account, for this step you can refer to this tutorial.
5. Then use "Transform JSON to JSON" action", choose the Body from "Parse JSON" action above and use the map which you upload.
6. After running the logic app, we can get the result as below:
The whole result json is:
{
"rows": [
{
"timeGenerated": "6/18/2020 4:30:07 PM",
"ResourceGroup": "USEASTPROD",
"ActivityStatusValue": "Updated",
"d_resource": "aueglbwvhypap07",
"c_title": "Remote disk disconnected",
"c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
},
{
"timeGenerated": "6/18/2020 4:30:07 PM",
"ResourceGroup": "USEASTPROD",
"ActivityStatusValue": "Updated1",
"d_resource": "agggggypap07",
"c_title": "Remote disk disconnected",
"c_details": "We're sorry, your virtual machine is unavailable because of connectivity loss to the remote disk. An unexpected problem is preventing us from automatically recovering your virtual machine."
}
]
}
By the way:
The format of the sample you provided is invalid in json, we must use {} instead of [] in each of the item.
In liquid map, it will convert your datetime to another format automatically.
Hope it helps~

Can't connect to Neptune anymore

I have created a Neptune instance in my AWS and a Load Balancer to access it from my local machine to play around.
I'm basically redirecting all connections on the :80 at my LB to :8182 in my Neptune.
So I can easily query it through the browser. In fact, this is the output for the /status:
// 20191211170323
// http://my-lb/status
{
"status": "healthy",
"startTime": "Mon Dec 09 20:06:21 UTC 2019",
"dbEngineVersion": "1.0.2.1.R2",
"role": "writer",
"gremlin": {
"version": "tinkerpop-3.4.1"
},
"sparql": {
"version": "sparql-1.1"
},
"labMode": {
"ObjectIndex": "disabled",
"Streams": "disabled",
"ReadWriteConflictDetection": "enabled"
}
}
Problem is when I try to connect with it through Gremlin Console or Java code I'm getting the following errors:
gremlin> :remote connect tinkerpop.server conf/remote-neptune.yaml
ERROR org.apache.tinkerpop.gremlin.driver.Handler$GremlinResponseHandler - Could not process the response
io.netty.handler.codec.http.websocketx.WebSocketHandshakeException: Invalid handshake response getStatus: 403 Forbidden
at io.netty.handler.codec.http.websocketx.WebSocketClientHandshaker13.verify(WebSocketClientHandshaker13.java:226)
at io.netty.handler.codec.http.websocketx.WebSocketClientHandshaker.finishHandshake(WebSocketClientHandshaker.java:276)
at org.apache.tinkerpop.gremlin.driver.handler.WebSocketClientHandler.channelRead0(WebSocketClientHandler.java:69)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
And my remote-neptune.yaml is as simple as:
hosts: [my-lb]
port: 80
connectionPool: { enableSsl: false}
serializer: { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
I have updated my AWS credentials although I don't think that's related since I'm accessing it through the LB.
And the weirdest part is that this same scenario was working like a week ago :/
Any ideas?
Thanks!
Looks like the problem has auto resolved, but just sharing a few things to watch out for in case this happens again in the future. If you see connection issues, your first line of operation should be to check if its a network connectivity issue. (You mentioned that you were going to check if something changed with regards to security groups, so do update if that was indeed that case). To check if it indeed is a SG issue - log into your client instance, and do a simple telnet call to the DB endpoint.
telnet <endpoint> <port>
If it responds with "Connected", then you can be sure that your SGs are correct, and now you are dealing with an Application layer problem.
As called out in comments, some of the possible culprits could be:
You previously had a setup without IAM Auth in Neptune (not on ALB) and now you enabled IAM Auth. (Emphasis - I'm referring to IAM Auth on the database, and not some other component in between).
Gremlin client-server mismatches.
Some explicit settings on the ALB that could hinder the requests.
And a few others. To summarize, try to classify if it is a L2/L3 issue or an L7 issue and start investigating based off that.

Training Job in Sagemaker gives error in locating file in S3 to docker image path

I am trying to use scikit_bring_your_own/container/decision_trees/train mode, running in AWS CLI, I had no issues. Trying to replicate in Creating Sagemaker Training Job , facing issue in loading data from S3 to docker image path.
In CLI command we used specify the docker run -v $(pwd)/test_dir:/opt/ml --rm ${image} train from where the input needs to referred.
In training job, mentioned the S3 bucket location and output path for model artifacts.
Error entered in the exception as in train - "container/decision_trees/train"
raise ValueError(('There are no files in {}.\n' +
'This usually indicates that the channel ({}) was incorrectly specified,\n' +
'the data specification in S3 was incorrectly specified or the role specified\n' +
'does not have permission to access the data.').format(training_path, channel_name))
Traceback (most recent call last):
File "/opt/program/train", line 55, in train
'does not have permission to access the data.').format(training_path, channel_name))
So not understanding is there any tweaking required or any access missing.
kindly help
If you set the InputDataConfig in the CreateTrainingJob API like this
"InputDataConfig": [
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataDistributionType": "FullyReplicated",
"S3DataType": "S3Prefix",
"S3Uri": "s3://<bucket>/a.csv"
}
},
"InputMode": "File",
},
{
"ChannelName": "eval",
"DataSource": {
"S3DataSource": {
"S3DataDistributionType": "FullyReplicated",
"S3DataType": "S3Prefix",
"S3Uri": "s3://<bucket>/b.csv"
}
},
"InputMode": "File",
}
]
SageMaker download the data specified above from S3 to the /opt/ml/input/data/channel_name directory in the Docker container. In this case, the algorithm container should be able to find the input data under
/opt/ml/input/data/train/a.csv
/opt/ml/input/data/eval/b.csv
You can find more details in https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html

Elasticsearch ver 1.4.2 polling configuration is not working. why?

Hi' I am trying to poll table data every 5 sec but it does not work
POST /_river/mytest_river/_meta
{
"type":"jdbc",
"jdbc":
{
"driver":"com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url":"jdbc:sqlserver://[my_ip];databaseName=mega",
"user":"sa","password":"******",
"sql":"SELECT [OrderID],[CustomerName],[UserFullName],[Status] FROM [Orders_Table]",
"poll":"5s",
"index": "mega",
"type": "orders_table"
}
}
What is wrong with my configuration?
You need to use "schedule" parameter as described here:
Time scheduled execution of JDBC river

Resources