I have a project using Debezium, mostly based on this example, which is then connected to Apache Pulsar.
I have changed a few configurations. The file now looks like this:
database.history=io.debezium.relational.history.MemoryDatabaseHistory
connector.class=io.debezium.connector.mysql.MySqlConnector
offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore
offset.storage.file.filename=offset.dat
offset.flush.interval.ms=5000
name=mysql-dbz-connector
database.hostname={ip}
database.port=3308
database.user={user}
database.password={pass}
database.dbname=database
database.server.name=test
table.whitelist=database.history_table,database.project_table
snapshot.mode=schema_only
schemas.enable=false
include.schema.changes=false
pulsar.topic=persistent://public/default/{0}
pulsar.broker.address=pulsar://{ip}:6650
database.history=io.debezium.relational.history.MemoryDatabaseHistory
As you may understand, what I'm trying to do is to monitor modifications to history_table and project_table in the database and then write the change payloads to Apache Pulsar.
My problem is as follows: regardless of the snapshot mode I use, once an offset has been written I can't restart Debezium without getting an error on the next database update.
Encountered change event for table database.history_table whose schema isn't known to this connector
It only happens with an existing offset.dat file. I think this is because the schema is null within the offset.dat file. Take this one for example:
¨Ìsrjava.util.HashMap⁄¡√`—F
loadFactorI thresholdxp?#wur[B¨Û¯T‡xpG{"schema":null,"payload":["mysql-dbz-connector",{"server":"test"}]}uq~U{"ts_sec":1563802215,"file":"database-bin.000005","pos":79574,"server_id":1,"event":1}x
I first suspected the schemas.enable=false and include.schema.changes=false parameters that I used to make the JSON more concise, but changing their values doesn't change anything in the offset.dat file.
The problem lies in the line database.history=io.debezium.relational.history.MemoryDatabaseHistory. With MemoryDatabaseHistory the history does not survive a restart, so after restarting the connector no longer knows the table schemas. You should use FileDatabaseHistory instead.
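For example (the file name below is only an illustration, use whatever path suits your deployment):
database.history=io.debezium.relational.history.FileDatabaseHistory
database.history.file.filename=dbhistory.dat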
I am referring to this document to stream data from MySQL (installed on my local machine) to Pub/Sub using the Debezium connector.
My properties file looks like this:
databaseName=testdb
databaseUsername=root
databaseAddress=localhost
databasePort=3306
gcpProject=GCP_project_name
databasePassword=password
whitelistedTables=instance-name.testdb.testtab
singleTopicMode=true
gcpPubsubTopicPrefix=debeziumTest
databaseManagementSystem=mysql
I have already created a topic in Pub/Sub with the name "debeziumTest".
The issue is that when I run
sudo mvn exec:java -pl cdc-embedded-connector -Dexec.args="/path/to/properties-file"
it runs without any errors, but no data is uploaded to Pub/Sub.
Based on the documentation, table updates are pushed to a topic that matches this format: ${PREFIX}${DB_INSTANCE}.${DATABASE}.${TABLE}
In your case I believe you should create a topic with the name "debeziumTestinstance-name.testdb.testtab".
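If you go that route, creating the topic with the gcloud CLI would look roughly like this (the topic name is simply the one derived from the format above):
gcloud pubsub topics create debeziumTestinstance-name.testdb.testtab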
This may not be the only problem based on the warnings I see in the logs you shared.
The problem seems to be with your whitelistedTables.
According to the documentation, you should use the format ${instance}.${database}.${table}; for your given example it should be whitelistedTables=testdb.databaseName.testTab (if testTab is your table name).
I'm getting started with Kapacitor and have been trying to run the first guide in the Kapacitor documentation, but with data I already have. I managed to define a task, but I can neither enable it nor can I run a backfill. I came across this question, which is similar to my problem, but the answer there didn't help. In contrast to the error message there I get empty strings for database, retention policy, and/or measurement.
In the Kapacitor config I set up an InfluxDB connection to the local host instance named localhost (which has a database mydb and the measurements weather.current.clouds and weather.current.visibility with the default retention policy autogen) and created the following weathertest.tick script:
dbrp "mydb"."autogen"
var clouds = batch
    |query('select mean(value) / 100.0 as val from "mydb"."autogen"."weather.current.clouds"')
        .period(1h)
        .every(1h)
        .groupBy(time(1m), *)
        .fill(0)

var vis = batch
    |query('select mean(value) / 10000.0 as val from "mydb"."autogen"."weather.current.visibility"')
        .period(1h)
        .every(1h)
        .groupBy(time(1m), *)
        .fill(0)

clouds
    |join(vis)
        .as('c', 'v')
    |eval(lambda: 100 * (1 - "c.val") * "v.val")
        .as('pcent')
    |influxDBOut()
        .cluster('localhost')
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('testmetric')
        .tag('host', 'myhost.local')
        .tag('key', 'weather.current.lightidx')
This is what I came up with after hours of trial and (especially) error. As given in the title, when I try to enable my task with kapacitor enable weathertest, I get the error message enabling task weathertest: batch query is not allowed to request data from ""."". The same thing happens when I try to record as in the "Backfill" example. Also, in that example there are a start and a stop date for limiting the time frame. The time format given there is wrong and is not understood by Kapacitor: instead of e.g. 2015-10-01 I have to put in 2015-10-01T00:00Z to at least get past the time format error.
In the Kapacitor logs there is not a single line regarding these errors; only when I try to remove a recording do I get something like remove /var/lib/kapacitor/replay/1f5...750.brpl: no such file or directory, and that does show up in the logs. There are lots of info lines in the logs showing successful POSTs to/from InfluxDB for the _internal database with HTTP response 204.
Does anyone have an idea what I may be doing wrong?
OK, after the weekend I tried again. Without any change, the previously failing steps now accepted my script; however, this time I was able to find error messages in the log. The node mentioned there was the eval node, and the messages pointed towards a type mismatch. When I changed the line
|eval(lambda: 100 * (1 - "c.val") * "v.val")
to
|eval(lambda: 100.0 * (1.0 - "c.val") * "v.val")
the error messages were gone and the command kapacitor show weathertest showed a rather sane content now.
Furthermore, during my tests I redefined, recorded, replayed and deleted the tasks and recordings over and over again, and I may have forgotten to redefine the task after making changes to the TICK script (I'm not really sure). After changing the above, redefining the task and replaying it (roughly the cycle sketched below), I finally found the expected data in the InfluxDB instance.
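For completeness, the define/record/replay cycle looks roughly like this (the recording ID is a placeholder, and depending on the Kapacitor version the define step may also need -type batch and -dbrp mydb.autogen):

kapacitor define weathertest -tick weathertest.tick
kapacitor record batch -task weathertest -past 1d
kapacitor replay -task weathertest -recording <recording-id>
kapacitor show weathertest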
I am using the DataSet API with Flink and I am trying to partition Parquet files by a key in my POJO, e.g. date. The end goal is to write my files out using the following file structure:
/output/
    20180901/
        file.parquet
    20180902/
        file.parquet
Flink provides a convenience class to wrap AvroParquetOutputFormat as shown below, but I don't see any way to provide a partitioning key.
HadoopOutputFormat<Void, Pojo> outputFormat =
    new HadoopOutputFormat(new AvroParquetOutputFormat(), Job.getInstance());
I'm trying to figure out the best way to proceed. Do I need to write my own version of AvroParquetOutputFormat that extends Hadoop's MultipleOutputs type, or can I leverage the Flink APIs to do this for me?
The equivalent in Spark would be:
df.write.partitionBy('date').parquet('base path')
You can use the BucketingSink<T> sink to write data into partitions that you define by supplying an instance of the Bucketer interface. See DateTimeBucketer for an example.
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/DateTimeBucketer.java
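A rough sketch of a date-based Bucketer (this assumes a hypothetical Pojo.getDate() getter that returns the partition value as a String such as "20180901"; also note that BucketingSink is a DataStream sink, not a DataSet output format):

import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

// Puts each record into a sub-directory named after its date, e.g. /output/20180901
public class DateBucketer implements Bucketer<Pojo> {
    @Override
    public Path getBucketPath(Clock clock, Path basePath, Pojo element) {
        // getDate() is an assumed getter on your POJO
        return new Path(basePath, element.getDate());
    }
}

Given a DataStream<Pojo> called stream, you would then attach the sink like this:

BucketingSink<Pojo> sink = new BucketingSink<>("/output");
sink.setBucketer(new DateBucketer());
stream.addSink(sink);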
I'm trying to execute the code below. Sometimes it works fine, but sometimes it does not.
@db.transactional
def _add_data_to_site(self, key):
    site = models.Site.get_by_key_name('s:%s' % self.site_id)
    if not site:
        site = models.Site()
    if key not in site.data:
        site.data.append(key)
    site.put()
    memcache.delete_multi(['', ':0', ':1'],
                          key_prefix='s%s' % self.site_id)
I'm getting the error:
File "/base/data/home/apps/xxxxxxx/1-7-1.366398694339889874xxxxxxx.py", line 91, in _add_data_to_site
site.put()
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 1070, in put
return datastore.Put(self._entity, **kwargs)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 579, in Put
return PutAsync(entities, **kwargs).get_result()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1569, in __put_hook
self.check_rpc_success(rpc)
File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1224, in check_rpc_success
raise _ToDatastoreError(err)
BadRequestError: cross-group transaction need to be explicitly specified, see TransactionOptions.Builder.withXG
So, my question is:
If I'm changing only one entity (models.Site), why am I getting a cross-group transaction error?
As mentioned in the logs: "Cross-group transaction need to be explicitly specified".
Try specifying it by using
@db.transactional(xg=True)
Instead of:
@db.transactional
Does this work if you specify parent=None in your get_by_key_name() query?
Essentially, in order to use a transaction, all entities in the transaction must share the same parent (i.e. you query using one parent, and create a new entity with the same parent), or you must use an XG transaction. You're seeing a problem because you didn't specify a parent.
You may need to create artificial entities to behave as parents in order to do what you're trying to do.
I had the same issue. By stepping through the client code, I made the following two observations:
1) Setting a parent of (None) seems to still indicate a parent of that kind, even if there's no specific record elected as that parent.
2) Your transaction will include all ReferenceProperty properties as well.
Therefore, theoretically, you will get the cross-group transaction exception if you haven't declared a parent (by either omitting it or setting it to None) on any of the kinds you're affecting when there are at least two of them (because if you're using kind A and kind B, it looks like you're using two different entity groups, one for A records and one for B records), as well as on any of the kinds referred to by ReferenceProperty properties.
To fix this, you must at least create a kind without any properties that can be set as the parent of all of your previously parentless records, as well as the parent of all ReferenceProperty properties that they declare.
If that's not sufficient, then set the flag for the cross-group transaction.
Also, the text of the exception, for me, was: "cross-groups transaction need to be explicitly specified" (plural "groups"). I have version 1.7.6 of the Python AppEngine client.
Please upvote this answer if it fits your scenario.
A cross group transaction error refers to the entity groups being used, not the kind used (here Site).
When it occurs, it's because you are attempting a transaction on entities with different parents, hence putting them in different entity groups.
SHAMELESS PLUG:
You should stop using db and move your code to ndb, especially since it seems you're in the development phase.
I need to aggregate a number of inbound CSV files in memory, resequencing them if necessary, on Mule ESB CE 3.2.1.
How could I implement this kind of logic?
I tried with message-chunking-aggregator-router, but it fails on startup because the XSD schema does not allow such a configuration:
<message-chunking-aggregator-router timeout="20000" failOnTimeout="false">
    <expression-message-info-mapping correlationIdExpression="#[header:correlation]"/>
</message-chunking-aggregator-router>
I've also tried to attach my own correlation IDs to inbound messages and then process them with a custom-aggregator, but I've found that Mule internally uses a key made up of:
Serializable key=event.getId()+event.getMessage().getCorrelationSequence();//EventGroup:264
The internal ID is different every time (even if the correlation sequence is correct), so Mule does not use only the correlation sequence as I expected, and the same message is processed many times.
Finally, I could rewrite a custom aggregator, but I would like to use a more established technique.
Thanks in advance,
Gabriele
UPDATE
I've tried the message-chunk-aggregator, but it doesn't fit my requirement, as it admits duplicates.
Let me detail the scenario I need to cover:
1. Mule polls (on an SFTP location).
2. File 1 "FIXEDPREFIX_1_of_2.zip" is detected and kept in memory somewhere (as an open SFTPStream; that's OK). Some correlation info is maintained for grouping: group, sequence, group size.
3. File 1 "FIXEDPREFIX_1_of_2.zip" is detected again, but cannot be inserted because it would be a duplicate.
4. File 2 "FIXEDPREFIX_2_of_2.zip" is detected and correctly added.
5. Since the group size has been reached, Mule routes a MessageCollection with the correct set of messages.
Regarding point 2, I'm lucky enough to be able to get this info from the filename and put it into the MuleMessage correlation* properties, so that subsequent components can use them.
I did, but duplicates are processed all the same.
Thanks again
Gabriele
Here is the right router to use with Mule 3: http://www.mulesoft.org/documentation/display/MULE3USER/Routing+Message+Processors#RoutingMessageProcessors-MessageChunkAggregator
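As a rough sketch only (the attribute values are just the ones from your earlier snippet; check the Mule 3 schema for the exact options), the aggregator would sit in a flow like this:

<flow name="aggregateCsvFlow">
    <!-- SFTP inbound endpoint and correlation enrichment go here -->
    <message-chunk-aggregator timeout="20000" failOnTimeout="false"/>
    <!-- the aggregated message collection continues from here -->
</flow>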