Flink file sink connector differences (Table API vs DataStream API) - apache-flink

In the DataStream API there is a withInactivityInterval setting, which controls how long a part file may stay inactive before it is closed.
But there is no such option in the Table API, so if the stream feeding the table pauses for a few seconds, the Table API closes the current file and starts a new one after the pause. Is there any way to avoid this?
And how can we set a file suffix in the Table API?
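For reference, a minimal sketch of the DataStream API configuration being described, including the inactivity interval and a part-file suffix. Paths, durations, and sizes are placeholders, and the Duration/MemorySize setters assume a recent Flink release (older releases take long milliseconds/bytes):

import java.time.Duration;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class FileSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> stream = env.fromElements("a", "b", "c");  // placeholder source

        FileSink<String> sink = FileSink
            .forRowFormat(new Path("/tmp/output"), new SimpleStringEncoder<String>("UTF-8"))
            .withRollingPolicy(
                DefaultRollingPolicy.builder()
                    .withRolloverInterval(Duration.ofMinutes(15))
                    .withInactivityInterval(Duration.ofMinutes(5))  // keep the part file open across short pauses
                    .withMaxPartSize(MemorySize.ofMebiBytes(128))
                    .build())
            .withOutputFileConfig(
                OutputFileConfig.builder()
                    .withPartPrefix("part")
                    .withPartSuffix(".json")  // the suffix asked about is configurable here
                    .build())
            .build();

        stream.sinkTo(sink);
        env.execute("file-sink-example");
    }
}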

I have submitted a PR; if nothing goes wrong, it should be supported in version 1.15:
https://github.com/apache/flink/pull/18359

I confirmed that the current Table API implementation really does not support setting an inactivity interval.
I created a Jira issue and will follow up there; thanks for your feedback:
https://issues.apache.org/jira/browse/FLINK-25484
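For context, here is a hedged sketch of the Table API filesystem sink and the rolling-policy options it currently exposes. The commented-out inactivity-interval option is the one the PR / FLINK-25484 adds, so its name and availability depend on running a Flink version that includes that change:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TableFileSinkExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Filesystem sink with the rolling-policy options available today.
        tEnv.executeSql(
            "CREATE TABLE fs_sink (" +
            "  id BIGINT," +
            "  payload STRING" +
            ") WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 'file:///tmp/output'," +
            "  'format' = 'json'," +
            "  'sink.rolling-policy.file-size' = '128MB'," +
            "  'sink.rolling-policy.rollover-interval' = '15 min'," +
            "  'sink.rolling-policy.check-interval' = '1 min'" +
            // "  ,'sink.rolling-policy.inactivity-interval' = '5 min'" +  // proposed in FLINK-25484 / the PR above
            ")");
    }
}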

Related

Is there a way to delete data in TDengine?

I read the document and can only find the retention policy of TDengine. Is there a way to delete a range of data?
Currently TDengine 2.x does not support deleting a specified range of data. The only way to remove data is to set the "keep" option in the config, which drops out-dated data once it has been stored longer than keep.
Delete will be supported in the next release; you can clone TDengine, check out the develop branch, and build TDengine yourself.
The grammar will look like:
delete from stb where ts > timestamp and tag = tagvalue
delete from tb

Tools for compare data before writing into db

I am working on a project that needs to scrape data from a website. I am using pyspider, and it runs automatically every 24 hours (scraping the data once a day). The problem is that before writing a new data entry into the DB, I want to compare the new data with the data already in the DB.
Is there a tool/lib I can use?
I am running my project on AWS; what's the best tool to use for this there?
My idea is to set up rules for updating/inserting data into the DB; when new data conflicts with a rule, I want to be able to view the data/scrape log (where the tool labels it as pending) and wait for an admin to take further action.
Thanks in advance.
List of data compare, synchronization and migration tools: https://dbmstools.com/categories/data-compare-tools
Visit that list; it might be helpful.
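If no off-the-shelf tool fits, the pending-review idea can also be implemented by hand with a compare-then-write step before each insert. A minimal JDBC sketch, where the table, columns, rule, and connection details are all made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CompareBeforeWrite {

    // Hypothetical rule: price changes larger than 50% need manual review.
    static boolean violatesRules(double oldPrice, double newPrice) {
        return Math.abs(newPrice - oldPrice) / oldPrice > 0.5;
    }

    public static void main(String[] args) throws Exception {
        String itemId = "item-123";   // scraped record (placeholders)
        double newPrice = 19.99;

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/scraper", "user", "password")) {

            // 1. Look up the existing row for this key.
            PreparedStatement select = conn.prepareStatement(
                "SELECT price FROM items WHERE item_id = ?");
            select.setString(1, itemId);
            ResultSet rs = select.executeQuery();

            if (!rs.next()) {
                // 2a. No existing row: plain insert.
                PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO items (item_id, price) VALUES (?, ?)");
                insert.setString(1, itemId);
                insert.setDouble(2, newPrice);
                insert.executeUpdate();
            } else if (violatesRules(rs.getDouble("price"), newPrice)) {
                // 2b. Conflicts with the rule: park it as 'pending' for an admin.
                PreparedStatement pending = conn.prepareStatement(
                    "INSERT INTO pending_changes (item_id, new_price, status) VALUES (?, ?, 'pending')");
                pending.setString(1, itemId);
                pending.setDouble(2, newPrice);
                pending.executeUpdate();
            } else {
                // 2c. Passes the rule: update in place.
                PreparedStatement update = conn.prepareStatement(
                    "UPDATE items SET price = ? WHERE item_id = ?");
                update.setDouble(1, newPrice);
                update.setString(2, itemId);
                update.executeUpdate();
            }
        }
    }
}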

Snowflake Snowpipe - Email Alert Mechanism

I am planning to use Snowpipe to load data from Kafka, but the support team monitoring the pipe jobs needs an alert mechanism.
How can I implement an alert mechanism for Snowpipe via email/slack/etc?
The interface Snowflake provides between the database and its surroundings is mainly cloud storage. There is no out-of-the-box integration with messaging apart from cloud storage events.
All other integration and messaging must be provided by client solutions.
Snowflake also provides scheduled tasks that can be used for monitoring purposes, but the interface limitations are the same as described above.
Snowflake is database as a service and relies on other (external) cloud services for a complete systems solution.
This is different from installing your own copy of database software on your own compute resource, where you can install any software alongside with the database.
Please correct my understanding if anything I say is incorrect. I believe Snowpipe is great for continuous data loading, but there is hardly any way to track all the errors in the source file. As mentioned in the previous suggestions, we could build a visualization querying COPY_HISTORY and/or PIPE_USAGE_HISTORY, but that doesn't give you ALL the errors in the source file; it only gives you summary information related to the errors.
PIPE_USAGE_HISTORY will tell you nothing about the errors in the source file.
The only function that can be helpful for returning all errors is the VALIDATE table function in the Information Schema, but it only validates loads done with COPY INTO.
There is a similar function for pipes called VALIDATE_PIPE_LOAD. Snowflake says "This function returns details about ANY errors encountered during an attempted data load into Snowflake tables", but according to the documentation the ERROR output column only reports the first error in the source file.
So here is my question: if any of you have successfully used Snowpipe for loading in a real-time production environment, how are you doing error handling and alerting?
I think that, compared to Snowpipe, using COPY INTO within a stored procedure, having a shell script call that stored procedure, and scheduling the script with an enterprise scheduler like Autosys/Control-M is a more streamlined solution.
Using external functions, streams, and tasks for alerting may be an elegant solution, but again I am not sure it solves the problem of error tracking.
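For what it's worth, it is easy to check what VALIDATE_PIPE_LOAD actually returns for your pipe. A hedged sketch using the Snowflake JDBC driver, where the account, credentials, and pipe name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PipeLoadValidation {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/?db=MYDB&schema=PUBLIC";

        try (Connection conn = DriverManager.getConnection(url, "monitor_user", "password");
             Statement stmt = conn.createStatement()) {

            // Ask the pipe what went wrong for loads attempted in the last hour.
            ResultSet rs = stmt.executeQuery(
                "SELECT * FROM TABLE(VALIDATE_PIPE_LOAD(" +
                "  PIPE_NAME => 'MYDB.PUBLIC.MY_PIPE'," +
                "  START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())))");

            while (rs.next()) {
                // Per the docs, ERROR holds only the first error found in each file.
                System.out.println(rs.getString("FILE") + ": " + rs.getString("ERROR"));
            }
        }
    }
}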
Both email and Slack alerts can be implemented via external functions.
EDIT (2022-04-27): Snowflake now officially supports Error Notifications for Snowpipe (currently in Public Preview, for AWS only).
"Monitoring" & "alert mechanism" are a very broad terms. What do you want to monitor? What should be triggering the alerts? The answer can only be as good as the question, so adding more details would be helpful.
As Hans mentioned in his answer, any solution would require the use of systems external to Snowflake. However, Snowflake can be the source of the alerts by leveraging external functions or notification integrations.
Here are some options:
If you want to monitor Snowpipe's usage or performance:
You could simply hook up a BI visualization tool to Snowflake's COPY_HISTORY and/or PIPE_USAGE_HISTORY. You could also use Snowflake's own visualization tool, called Snowsight.
If you want to be alerted about data loading issues:
You could create a data test against COPY_HISTORY in DBT, and schedule it to run on a regular basis in DBT Cloud.
Alternatively, you could create a task that calls a procedure on a schedule. Your procedure would check COPY_HISTORY first, then call an external function to report failures.
Some notes about COPY_HISTORY:
Please be aware of the limitations described in the documentation (in terms of the privileges required, etc.)
Because COPY_HISTORY is an INFORMATION_SCHEMA function, it can only operate on one database at a time.
To query multiple databases at once, UNION could be used to combine the results.
COPY_HISTORY can be used for alerting only, not for diagnostics. Diagnosing data load errors is another topic entirely (the VALIDATE_PIPE_LOAD function is probably a good place to start).
If you want to be immediately notified of every successful data load performed by Snowpipe:
Create an external function to send notifications/alerts to your service(s) of choice.
Create a stream on the table that Snowpipe loads into.
Add a task that runs every minute, but only when the stream contains data, and have it call your external function to send out the alerts/notifications.
EDIT: This solution does not provide alerting for errors - only for successful data loads! To send alerts for errors, see the solutions above ("If you want to be alerted about data loading issues").
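To make the "check COPY_HISTORY, then alert" idea concrete: instead of a Snowflake task plus external function, the same check can also run outside Snowflake as a small poller that uses the Snowflake JDBC driver and a Slack incoming webhook. This is a hedged alternative sketch, not the in-database approach described above; the account, credentials, table name, and webhook URL are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SnowpipeErrorAlerter {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/?db=MYDB&schema=PUBLIC";
        String webhook = "https://hooks.slack.com/services/XXX/YYY/ZZZ";

        try (Connection conn = DriverManager.getConnection(url, "monitor_user", "password");
             Statement stmt = conn.createStatement()) {

            // Look at the last hour of loads into the target table and pick out failures.
            ResultSet rs = stmt.executeQuery(
                "SELECT file_name, status, error_count, first_error_message " +
                "FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(" +
                "  TABLE_NAME => 'MY_TABLE'," +
                "  START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP()))) " +
                "WHERE error_count > 0");

            while (rs.next()) {
                String message = String.format(
                    "Snowpipe load issue: file=%s status=%s errors=%d first_error=%s",
                    rs.getString("file_name"), rs.getString("status"),
                    rs.getLong("error_count"), rs.getString("first_error_message"));
                postToSlack(webhook, message);
            }
        }
    }

    static void postToSlack(String webhook, String text) throws Exception {
        // Naive JSON escaping, good enough for a sketch.
        String payload = "{\"text\": \"" + text.replace("\"", "'") + "\"}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhook))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}

Scheduling the poller (cron, Airflow, a scheduler of your choice) is left to the surrounding infrastructure, which matches the earlier point that Snowflake relies on external services for this kind of integration.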

How to save streaming data to InfluxDB?

I am trying to save data as it arrives in a streaming fashion (with the least amount of delay) to my database which is InfluxDB. Currently I save it in batches.
Current setup - interval based
Currently I have an Airflow instance where I read the data from a REST API every 5 minutes and then save it to InfluxDB.
Desired setup - continuous
Instead of saving data every 5 minutes, I would like to establish a connection via a WebSocket (I guess) and save the data as it arrives. I have never done this before and I am confused about how it is actually done. Some questions I have are:
Once I write the code for it, do I keep it running like a daemon?
Do I need to use something like Telegraf for this, or is that not really the case (example article)?
Instead of Airflow (since it is for batch processing), do I need to use something like Apache Beam or Spark?
As you can see, I am quite lost on where to start and what to read. Any advice on direction and/or guidance for a setup would be very appreciated.
If I understand correctly, you want to code a Java service that processes the incoming data, so one solution is to implement a WebSocket client with, for example, Jetty.
From there you receive the data (in JSON format, for example) and write it to the database with the influxdb-java library, which lets you create and manage the data.
I don't know Airflow or how you produce the data, so maybe there are built-in tools (InfluxDB sinks) that can save you some work in your context.
I hope this gives you some guidelines to start digging further.
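To make that concrete, here is a rough sketch along those lines: a Jetty (9-style API) WebSocket client that writes each incoming message to InfluxDB via influxdb-java. The endpoint URL, database, and measurement are placeholders, and the message parsing is deliberately reduced to a stub:

import java.net.URI;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketConnect;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
import org.eclipse.jetty.websocket.api.annotations.WebSocket;
import org.eclipse.jetty.websocket.client.WebSocketClient;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

@WebSocket
public class StreamingInfluxWriter {

    private final InfluxDB influxDB;

    public StreamingInfluxWriter(InfluxDB influxDB) {
        this.influxDB = influxDB;
    }

    @OnWebSocketConnect
    public void onConnect(Session session) {
        System.out.println("Connected to " + session.getRemoteAddress());
    }

    @OnWebSocketMessage
    public void onMessage(String message) {
        // Stub: treat the message as a single number; replace with real JSON parsing.
        double value = Double.parseDouble(message.trim());
        influxDB.write(Point.measurement("ticks")
            .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
            .addField("value", value)
            .build());
    }

    public static void main(String[] args) throws Exception {
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "password");
        influxDB.setDatabase("mydb");
        influxDB.enableBatch(100, 1, TimeUnit.SECONDS);  // small batches smooth out write load

        WebSocketClient client = new WebSocketClient();
        client.start();
        client.connect(new StreamingInfluxWriter(influxDB), URI.create("ws://example.com/feed"));
        // Keep the process alive like a daemon; in practice run it under systemd, Docker, etc.
        Thread.currentThread().join();
    }
}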

Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?

The Kafka Connector can make use of a primary key and a timestamp to determine which rows need to be processed.
I'm looking for a way to reset the Connector so that it will process from the beginning of time.
Because the requirement is to run in distributed mode, the easiest thing to do is to update the connector name to a new value. This will prompt a new entry to be made into the connect-offsets topic as it looks like a totally new connector. Then the connector should start reading again as if nothing has been written to Kafka yet. You could also manually send a tombstone message to the key in the connect-offsets topic associated with that particular connector, but renaming is much easier than dealing with that. This method applies to all source connectors, not only the JDBC one described here.
I got a bit tired of renaming the connector every time during development, so I started using the tombstone method. This method can be used with any source connector.
First check the format of the key/value of the connector:
kafka-console-consumer --bootstrap-server localhost:9092 --topic kafka-connect-offsets --from-beginning --property print.key=true
["demo",{"query":"query"}] {"timestamp_nanos":542000000,"timestamp":1535768081542}
["demo",{"query":"query"}] {"timestamp_nanos":171831000,"timestamp":1540435281171}
["demo",{"query":"query"}] {"timestamp_nanos":267775000,"timestamp":1579522539267}
Create the tombstone message by sending the key without any value:
echo '["demo",{"query":"query"}]#' | kafka-console-producer --bootstrap-server localhost:9092 --topic kafka-connect-offsets --property "parse.key=true" --property "key.separator=#"
Now restart or recreate the connector and it will start reading from the beginning and producing messages again.
Be very careful with this in production unless you really know what you're doing. There's some more information here: https://rmoff.net/2019/08/15/reset-kafka-connect-source-connector-offsets/
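If you prefer to send the tombstone programmatically instead of via the console producer, here is a minimal sketch with the plain Java Kafka producer. The topic name and key are taken from the example above; the key bytes must match exactly what Connect wrote:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConnectOffsetTombstone {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // The key must be byte-for-byte identical to the one stored in the offsets topic.
        String key = "[\"demo\",{\"query\":\"query\"}]";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is the tombstone that clears the stored offset.
            producer.send(new ProducerRecord<>("kafka-connect-offsets", key, null));
            producer.flush();
        }
    }
}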
