I am working on a Flink application with a Postgres DB as a source: it reads certain configuration data, converts it into a data stream, and then joins it with an incoming real-time data stream.
I have tried using the Postgres CDC connector, and I am able to read a single table, deserialize it into a POJO, and use it further.
However, my requirement is to read from multiple tables using a join condition in the CDC source itself and then convert the result into a data stream. Can we write a custom query in the source? I could not find a way to do this yet; the only solution I can think of is to create multiple sources separately and join them before finally joining with the incoming real-time data. Can someone help here?
Regards,
Swapnil
Could you solve your problem the other way around, by reading your incoming real-time data stream and then performing a lookup against the Postgres DB via the JDBC connector? The CDC connector is meant for monitoring changes happening in tables and sending each change into Flink. I don't think there is a way to perform any joining in the CDC connector upfront.
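If the lookup route works for you, here is a minimal sketch of the pattern. To keep it short, it uses plain JDBC inside a RichMapFunction rather than the Flink JDBC table connector (which also supports lookup joins in the Table/SQL API). The connection URL, the config/config_detail tables, and the String-keyed input are placeholders; in a real job you would enrich your deserialized POJO instead, and the Postgres JDBC driver needs to be on the classpath.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Enriches each incoming record with configuration data looked up from Postgres.
// The multi-table join you asked about lives in the SQL query, not in the source.
public class ConfigLookup extends RichMapFunction<String, Tuple2<String, String>> {

    private transient Connection conn;
    private transient PreparedStatement stmt;

    @Override
    public void open(Configuration parameters) throws Exception {
        conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/configdb", "user", "secret");
        stmt = conn.prepareStatement(
                "SELECT c.value, d.detail "
              + "FROM config c JOIN config_detail d ON d.config_id = c.id "
              + "WHERE c.key = ?");
    }

    @Override
    public Tuple2<String, String> map(String configKey) throws Exception {
        stmt.setString(1, configKey);
        try (ResultSet rs = stmt.executeQuery()) {
            // Emit the key together with the joined configuration, or null if no row matches.
            return rs.next()
                    ? Tuple2.of(configKey, rs.getString("value") + "/" + rs.getString("detail"))
                    : Tuple2.of(configKey, null);
        }
    }

    @Override
    public void close() throws Exception {
        if (stmt != null) stmt.close();
        if (conn != null) conn.close();
    }
}
```

If the configuration data is small and changes rarely, you can also load it once in open() and refresh it periodically instead of querying Postgres per record.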
Related
There is a huge ETL process on the Snowflake side that updates a table. I need to stream the changes made to it to other consumers/processors outside of Snowflake: not by querying the table for updates, but by streaming (a push pattern, not pull).
I see there is a Kafka connector and examples of streaming data into Snowflake. Is there a way to stream it out of Snowflake? Maybe using CDC + Streams + Tasks to push to some queue?
(Submitting on behalf of a Snowflake User)
We have a database that stores raw data from all our local sources. My team has its own environment in which we have full permissions to create standardized feeds and/or tables/views, etc., that are ready to be consumed through Power BI. A few additional details:
The final 'feed' tables are derived through SQL statements, and most pull from more than one table of our 'raw' data.
Raw table data is updated daily.
My question is: what is the best operation to keep the tables fully updated, and what is the standard workflow for this? Our current understanding is that one of these processes is the best:
Using COPY INTO <stage> then COPY INTO <table>.
Using STREAMS to add incremental data.
Using PIPES (which may be the same as STREAMS).
Or simplifying our feeds to single-table sources and using a materialized view.
We'd ideally like to avoid views to improve consumption speed at the Power BI level.
Tasks have been recommended; they seem like a good fit since they only need to update the final table once per day.
(https://docs.snowflake.net/manuals/sql-reference/sql/create-task.html)
Any other recommendations??? THANKS!
We have a similar scenario where our raw data lake tables are updated in real time from files in S3. Those raw tables are loaded via Snowpipe using the auto-ingest feature.
In turn, we have a data mart which contains facts about the raw data. To update the data mart, we have created streams on top of the raw tables to track changes. We then use tasks that run at a given frequency (every five minutes in our case) to update the data mart from the changed data in the raw tables. Using streams allows us to limit processing to only the changed data, without having to track last-update dates, etc.
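For concreteness, a rough sketch of that stream + task wiring is below. All object names (RAW_EVENTS, RAW_EVENTS_STREAM, REFRESH_MART, MART_FACTS, ETL_WH) and the connection details are placeholders; the statements are issued here through the Snowflake JDBC driver, but the same DDL can just as well be run in a worksheet.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class SetupStreamAndTask {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "ETL_USER");      // placeholder credentials and context
        props.put("password", "***");
        props.put("warehouse", "ETL_WH");
        props.put("db", "ANALYTICS");
        props.put("schema", "PUBLIC");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:snowflake://myaccount.snowflakecomputing.com", props);
             Statement st = conn.createStatement()) {

            // Track changes on the raw table.
            st.execute("CREATE OR REPLACE STREAM RAW_EVENTS_STREAM ON TABLE RAW_EVENTS");

            // Merge only the changed rows into the mart every five minutes,
            // and only when the stream actually has data.
            st.execute(
                "CREATE OR REPLACE TASK REFRESH_MART "
              + "  WAREHOUSE = ETL_WH "
              + "  SCHEDULE = '5 MINUTE' "
              + "WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM') "
              + "AS "
              + "  MERGE INTO MART_FACTS f "
              + "  USING RAW_EVENTS_STREAM s ON f.ID = s.ID "
              // For an append-only raw table the stream only ever contains inserts;
              // updates/deletes would additionally need METADATA$ACTION handling.
              + "  WHEN MATCHED THEN UPDATE SET f.AMOUNT = s.AMOUNT "
              + "  WHEN NOT MATCHED THEN INSERT (ID, AMOUNT) VALUES (s.ID, s.AMOUNT)");

            // Tasks are created suspended; resume to start the schedule.
            st.execute("ALTER TASK REFRESH_MART RESUME");
        }
    }
}
```

The WHEN SYSTEM$STREAM_HAS_DATA condition keeps the task from spending warehouse credits on runs where nothing has changed.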
I have a SQL Server database where millions of rows are inserted, deleted, or updated every day. I'm supposed to propose an ETL solution to transfer data from this database to a data warehouse. At first I tried to work with CDC and SSIS, but the company I work for wants a more real-time solution. I've done some research and discovered stream processing. I've also looked for Spark and Flink tutorials, but I didn't find anything.
My question is: which stream processing tool should I choose, and how do I learn to work with it?
Open Source Solution
You can use the Confluent Kafka integration tooling to track insert and update operations using a load timestamp. This automatically provides you with the real-time data that gets inserted or updated in the database. If you have soft deletes in your database, those can also be tracked using the load timestamp and an active/inactive flag.
If there are no such flags, then you need to provide some logic to determine which partitions might have been updated that day and send those entire partitions into the stream, which is definitely resource-intensive.
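The load-timestamp approach boils down to remembering the highest timestamp seen so far and only polling rows newer than it; Kafka Connect does that offset bookkeeping for you, but a bare-bones sketch of the idea looks roughly like this (the table and column names SOURCE_TABLE, LOAD_TS and ACTIVE_FLAG are placeholders, as is the connection string):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class LoadTimestampPoller {
    public static void main(String[] args) throws Exception {
        // The offset that must survive restarts; Kafka Connect persists this automatically.
        Timestamp lastSeen = Timestamp.valueOf("1970-01-01 00:00:00");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=source;user=etl;password=secret")) {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT ID, PAYLOAD, LOAD_TS, ACTIVE_FLAG "
              + "FROM SOURCE_TABLE WHERE LOAD_TS > ? ORDER BY LOAD_TS");
            while (true) {
                ps.setTimestamp(1, lastSeen);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastSeen = rs.getTimestamp("LOAD_TS");
                        boolean softDeleted = "N".equals(rs.getString("ACTIVE_FLAG"));
                        // A real pipeline would publish the row (or a delete marker for
                        // soft-deleted rows) to a Kafka topic instead of printing it.
                        System.out.printf("id=%d deleted=%b%n", rs.getLong("ID"), softDeleted);
                    }
                }
                Thread.sleep(5_000); // poll interval
            }
        }
    }
}
```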
Paid Solution
There is a paid tool called Striim CDC which can provide real-time change data to your system.
I have a Mosquitto broker which receives positioning information from remote devices.
I need to store this data somewhere to be processed by other micro-services.
At present, there is a Node.js process which subscribes to the broker, and writes to the Postgres database in batches.
Devices -> Mosquitto -> DB writer -> (source-of-truth) Postgres
(source-of-truth) -> Service A
(source-of-truth) -> Service B
But the problem I see is that any other service which needs to process this position data now needs to query the Postgres database.
Constraint: This is for on-premise deployments, so ideally we want to maintain as little as possible: one VM with a database, and perhaps a link to a customer-maintained database.
An alternative to the database as the source of truth for the sensor data is a Kafka-like event log / event-sourcing approach. Then there would be one subscriber to the broker, and all microservices could read from it, and pick up where they left off if they go down.
Because it is on-premise, I want something more lightweight than Kafka, and I have found NATS Streaming Server.
Now, the NATS event log can be persisted by configuring it with a data store. It currently supports a simple file store and a SQL store.
Now, if I used the SQL store, it seems like a waste to store raw messages in the database, read them from the database, and then store them again, plus it's bad for performance. The SQL store interface also has its own batching implemented. I'm not sure how much I trust the file store as the source of truth either.
So, is this a viable approach?
You can consume messages in batches in NATS Streaming by creating your subscription with MaxInflight and ManualAckMode. The server will not send more than MaxInflight messages without receiving the corresponding message acks from the clients.
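Here is a rough sketch of such a subscription with the Java NATS Streaming client (the Node client exposes equivalent options); the cluster id, client id, subject and durable name are placeholders:

```java
import io.nats.streaming.Message;
import io.nats.streaming.StreamingConnection;
import io.nats.streaming.StreamingConnectionFactory;
import io.nats.streaming.SubscriptionOptions;

public class BatchedPositionConsumer {
    public static void main(String[] args) throws Exception {
        StreamingConnectionFactory cf =
                new StreamingConnectionFactory("test-cluster", "position-writer");
        StreamingConnection sc = cf.createConnection();

        SubscriptionOptions opts = new SubscriptionOptions.Builder()
                .durableName("position-writer") // resume where we left off after a restart
                .maxInFlight(100)               // server sends at most 100 unacked messages
                .manualAcks()                   // we ack only after the data is persisted
                .build();

        sc.subscribe("positions", (Message m) -> {
            try {
                // Buffer m.getData() here and flush to Postgres in batches;
                // ack only once the write has actually succeeded.
                m.ack();
            } catch (Exception e) {
                // Not acking means the server redelivers after the ack wait expires.
                e.printStackTrace();
            }
        }, opts);

        Thread.currentThread().join(); // keep the subscriber running
    }
}
```

With a durable name, a service that goes down can pick up where it left off when it reconnects, which is the behavior you described wanting from the event log.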
If you need to do transformation before storing, I understand your process. However, if you just don't trust the FileStore or SQLStore of the NATS Streaming server, why would you be using NATS Streaming in the first place? After all, the stores have been implemented by the same people (including me) who wrote the NATS Streaming server ;-)
I'm currently running a text search in SQL Server, which is becoming a bottleneck, and I'd like to move things to Elasticsearch for obvious reasons; however, I know that I have to denormalize data for best performance and scalability.
Currently, my text search includes some aggregation and joins across multiple tables to get the final output. The joined tables aren't that big (up to 20 GB per table) but change (inserts, updates, deletes) irregularly (two of them once a week, the other one on demand x times a day).
My plan is to use Apache Kafka together with Kafka Connect to read CDC from my SQL Server, join this data in Kafka, and persist it to Elasticsearch. However, I cannot find any material explaining how deletes are handled when the data is persisted to Elasticsearch.
Is this even supported by the default driver? If not, what are the possibilities? Apache Spark, Logstash?
I am not sure whether this is already possible in Kafka Connect, but it seems this can be handled with NiFi.
Hopefully I understand the need. Here is the documentation for deleting Elasticsearch records with one of the standard NiFi processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-elasticsearch-5-nar/1.5.0/org.apache.nifi.processors.elasticsearch.DeleteElasticsearch5/