Can YugabyteDB be used for storage & streaming?

We have a GoLang backend service used to:
Store data in YugabyteDB using the YCQL driver
Publish the same data to Kafka
Step 2 was necessary so that consumers can stream the data through Kafka.
Can YugabyteDB help stream data once a new row is created in a table, to avoid maintaining that state in Kafka?
If yes, does YugabyteDB support streaming with a push model?
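For reference, the dual write described above looks roughly like the sketch below. It is a minimal sketch only (shown in Java for illustration, although the actual service is in Go); the keyspace, table, topic, column names, and datacenter are assumptions, and error handling is omitted. YCQL is Cassandra wire-compatible, which is why a Cassandra driver is used here.

```java
import java.net.InetSocketAddress;
import java.util.Properties;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DualWrite {
    public static void main(String[] args) {
        // YCQL is Cassandra wire-compatible, so a Cassandra driver session works against YugabyteDB.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .build()) {

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String deviceId = "device-42";
                String payload = "{\"device_id\":\"device-42\",\"reading\":21.5}";

                // Step 1: store the row in YugabyteDB.
                PreparedStatement insert = session.prepare(
                        "INSERT INTO demo.readings (device_id, payload) VALUES (?, ?)");
                session.execute(insert.bind(deviceId, payload));

                // Step 2: publish the same data to Kafka for downstream consumers.
                producer.send(new ProducerRecord<>("readings", deviceId, payload));
                producer.flush();
            }
        }
    }
}
```

CDC on the database side would remove the need for this second write, which is what the question is really asking about.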

The CDC feature is actively being worked on; see https://github.com/yugabyte/yugabyte-db/issues/9019. Support for step 2, pushing changes into Kafka, is also in the works.

Related

Data Sync between Ignite Clusters

We have two Apache Ignite Clusters (Cluster_A and Cluster_B) of version 2.13.0.
We are writing data into Cluster_A tables. We want to sync/copy the data into Cluster_B tables from Cluster_A.
Is there any efficient way?
In general, it's possible to leverage CDC replication using Kafka to transfer updates from one cluster to another, but it's worth mentioning that this requires running and maintaining a separate Kafka cluster to store the updates.
On the other hand, GridGain Enterprise has built-in Data Center Replication functionality for cross-data-center replication cases. It doesn't require any 3rd-party installations: GridGain stores updates in a persistent and reliable manner using native persistence. It's also possible to establish active-active replication out of the box.
Another advantage is that GridGain DR has dedicated functionality to transfer the entire state of a cluster.
To get more info about how to configure DCR, follow the link.

Flink - Postgres CDC connector - custom query

I am working on a Flink application that uses a Postgres DB as a source to read certain configuration data, convert it into a data stream, and then join it with an incoming real-time data stream.
I have tried using the Postgres CDC connector, and I am able to read a single table, deserialize it into a POJO, and use it further.
However, my requirement is to read from multiple tables using a join condition in the CDC source itself and then convert the result into a data stream. Can we write a custom query in the source? I could not find a way to do this yet; the only solution I could think of is to create multiple sources separately and then join those before finally joining with the incoming real-time data. Can someone help here?
Regards,
Swapnil
Could you solve your problem the other way around, by reading your incoming real-time data stream and then performing a lookup against the Postgres DB via the JDBC connector? The CDC connector is meant for monitoring changes happening in tables and sending each change into Flink; I don't think it's possible to perform any joining in the CDC connector upfront.
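If the lookup route works for you, in the SQL/Table API it looks roughly like the sketch below. This is a minimal sketch under assumptions: the Kafka source, table names, columns, and connection details are placeholders, and flink-connector-jdbc plus the Postgres JDBC driver must be on the classpath.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class LookupJoinJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Real-time input (Kafka here is an assumption; use whatever source you already have).
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  device_id STRING," +
            "  payload STRING," +
            "  proc_time AS PROCTIME()" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'events'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'format' = 'json'" +
            ")");

        // Configuration data read on demand from Postgres via the JDBC connector.
        tEnv.executeSql(
            "CREATE TABLE device_config (" +
            "  device_id STRING," +
            "  threshold DOUBLE" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://localhost:5432/config'," +
            "  'table-name' = 'device_config'," +
            "  'username' = 'flink'," +
            "  'password' = 'secret'" +
            ")");

        // Processing-time lookup join: each incoming event is enriched with the current config row.
        tEnv.executeSql(
            "SELECT e.device_id, e.payload, c.threshold " +
            "FROM events AS e " +
            "JOIN device_config FOR SYSTEM_TIME AS OF e.proc_time AS c " +
            "ON e.device_id = c.device_id").print();
    }
}
```

Compared with joining two CDC streams, the lookup join queries Postgres per key as events arrive, which fits slowly changing configuration data.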

Using NATS Streaming Server as the primary data store for IoT position data?

I have a Mosquitto broker which receives positioning information from remote devices.
I need to store this data somewhere to be processed by other micro-services.
At present, there is a Node.js process which subscribes to the broker, and writes to the Postgres database in batches.
Devices -> Mosquitto -> DB writer -> (source-of-truth) Postgres
(source-of-truth) -> Service A
-> Service B
But the problem I see is that any other service which needs to process this position data now has to query the Postgres database.
Constraint: This is for on-premise deployments so ideally we want to maintain as little as possible. One VM with a database, and perhaps a link to a customer-maintained database.
An alternative to the database as the source of truth for the sensor data is a Kafka-like event log / event-sourcing approach. Then there would be one subscriber to the broker, and all microservices could read from it, and pick up where they left off if they go down.
Because it is on-premise I want something more lightweight than Kafka, and have found NATS Streaming Server.
Now, the NATS event log can be persisted by configuring it with a data store. It currently supports a simple file store and a SQL store.
Now, if I used the SQL store, it seems like a waste to store raw messages to the database, read them from the database, and then store them again, and it would be bad for performance too. The SQL store interface also has its own batching implemented. I'm not sure how much I trust the file store as the source of truth either.
So, is this a viable approach?
You can consume messages "in batches" in NATS Streaming by creating your subscription with MaxInflight and ManualAckMode. The server will not send more than MaxInflight messages without receiving the corresponding acks from the clients.
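A minimal sketch of such a subscription, shown with the Java NATS Streaming client (cluster ID, client ID, subject, durable name, and the 30-second ack wait are assumptions):

```java
import java.time.Duration;

import io.nats.streaming.Message;
import io.nats.streaming.StreamingConnection;
import io.nats.streaming.StreamingConnectionFactory;
import io.nats.streaming.SubscriptionOptions;

public class PositionConsumer {
    public static void main(String[] args) throws Exception {
        StreamingConnectionFactory cf =
                new StreamingConnectionFactory("test-cluster", "service-a");
        StreamingConnection sc = cf.createConnection();

        SubscriptionOptions opts = new SubscriptionOptions.Builder()
                .durableName("service-a")          // resume where this service left off after a restart
                .maxInFlight(100)                  // server sends at most 100 unacknowledged messages
                .manualAcks()                      // acknowledge explicitly once processing is done
                .ackWait(Duration.ofSeconds(30))   // unacked messages are redelivered after 30s
                .build();

        sc.subscribe("positions", (Message msg) -> {
            try {
                process(msg.getData());
                msg.ack();                         // ack only after successful processing
            } catch (Exception e) {
                // no ack: the server redelivers after the ack wait expires
            }
        }, opts);

        Thread.sleep(Long.MAX_VALUE);              // keep the consumer running
    }

    private static void process(byte[] data) {
        // e.g. batch into Postgres or feed the micro-service's own pipeline
    }
}
```

The durable name is what lets each micro-service pick up where it left off if it goes down, which is the property asked about above.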
If you need to do transformations before storing, I understand your process. However, if you just don't trust the FileStore or SQLStore from the NATS Streaming server, why would you be using NATS Streaming in the first place? That is, the stores have been implemented by the same people (including me) who wrote the NATS Streaming server ;-)

Real-time streaming of SQL Server (RDS) transactions to NoSQL

I have a situation where I want to stream all the Updates, Deletes and Inserts from my AWS RDS SQL Server to a NoSQL DB such as DynamoDB or RethinkDB.
What I am trying to achieve is to divide my users into critical and non-critical databases, reducing the load on my RDS server, and to use technologies like RethinkDB or DynamoDB Streams to send the other (non-critical) set of data to the front end.
I have thought of various ways to do this:
The most obvious is to just asynchronously make an entry in both databases, though I could end up in a situation where one of the entries fails.
The second is to use RabbitMQ or a queuing service such as AWS SQS to queue the second entry and make sure that it gets inserted.
The third (which is what I want to achieve) is for a Node.js service to somehow listen to SQL Server change streams and push the content to the NoSQL store.
What can be done in a situation like this?
The benefit I am looking for is to store a dataset in NoSQL that can be served to over 100k users, since they all want to see the same data in real time with only some WHERE-clause changes. This in turn will reduce RDS server transactions to a minimum of reads and writes.
You can use one of the two approaches below:
AWS DMS
Or a combination of EMR, Amazon Kinesis, and Lambda (with custom scripts)

Near real-time data ingestion from SQL Server to HDFS in Cloudera

We have PLC data in SQL Server which gets updated every 5 minutes.
We have to push the data to HDFS in the Cloudera distribution at the same time interval.
Which tools are available for this?
I would suggest using the Confluent Kafka connectors for this task (https://www.confluent.io/product/connectors/).
The idea is as follows:
SQLServer --> [JDBC-Connector] --> Kafka --> [HDFS-Connector] --> HDFS
All these connectors are already available via the Confluent website.
I'm assuming your data is being written to some directory in the local FS. You may use a streaming engine for this task. Since you've tagged this with apache-spark, I'll give you the Spark Streaming solution.
Using Structured Streaming, your streaming consumer will watch your data directory. Spark reads and processes data in configurable micro-batches (stream wait time), which in your case would be 5 minutes. You can save the data in each micro-batch as text files, using your Cloudera Hadoop cluster for storage.
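A minimal sketch of that approach in Java (the paths, app name, and output format are assumptions; the checkpoint location is what lets the query restart where it left off):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class PlcIngest {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("plc-ingest")
                .getOrCreate();

        // Watch the landing directory; each new file is picked up by the stream.
        Dataset<Row> plc = spark.readStream()
                .format("text")
                .load("file:///data/plc/incoming");

        // Write micro-batches to HDFS every 5 minutes, matching the source update interval.
        StreamingQuery query = plc.writeStream()
                .format("text")
                .option("path", "hdfs:///data/plc/raw")
                .option("checkpointLocation", "hdfs:///data/plc/_checkpoints")
                .trigger(Trigger.ProcessingTime("5 minutes"))
                .start();

        query.awaitTermination();
    }
}
```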
Let me know if this helped. Cheers.
You can also google the tool named Sqoop. It is open-source software.
