Consuming time-based file paths in Flink based on current time - apache-flink

I have a list of time-stamped S3 objects,
e.g. s3://01-02-20:10:00:00, s3://01-02-20:10:00:01,
and so on.
I want to consume all files from S3 that fall within the last 5 minutes into Flink as a DataSource and have checkpointing work as expected.
Can we do this with the current file-based source? Is this even possible?

There is no source function available for this requirement out of the box; you need to implement a RichSourceFunction yourself and filter out the file paths you need.
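For illustration only, here is a minimal sketch of such a source, assuming the AWS SDK v1 S3 client and a placeholder bucket name (neither is given in the question). It filters on each object's last-modified time; you could just as well parse the timestamp embedded in the key. Exactly-once checkpointing would additionally require implementing CheckpointedFunction to remember which paths have already been emitted.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: emits the paths of S3 objects modified within the last 5 minutes.
public class RecentS3PathsSource extends RichSourceFunction<String> {

    private final String bucket;               // placeholder bucket name
    private transient AmazonS3 s3;             // built in open() so the source stays serializable
    private volatile boolean running = true;

    public RecentS3PathsSource(String bucket) {
        this.bucket = bucket;
    }

    @Override
    public void open(Configuration parameters) {
        s3 = AmazonS3ClientBuilder.defaultClient();
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            Instant cutoff = Instant.now().minus(Duration.ofMinutes(5));
            // Note: pagination and de-duplication of already-emitted keys are omitted in this sketch.
            for (S3ObjectSummary summary : s3.listObjectsV2(bucket).getObjectSummaries()) {
                if (summary.getLastModified().toInstant().isAfter(cutoff)) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect("s3://" + bucket + "/" + summary.getKey());
                    }
                }
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

You would then attach it with env.addSource(new RecentS3PathsSource("your-bucket")) and continue the pipeline from there.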

Related

Logic App: how to check again after 5 min if the file is not present on SFTP

I have a scenario where a Logic App will be scheduled to run at 11 am every day and will move a file from one SFTP to another SFTP, which I have done.
I want to add a condition: if the file is not present on the SFTP the first time, it should check again after 5 minutes, for 3 retry attempts.
Thanks in advance.
I want to add a condition: if the file is not present on the SFTP the first time, it should check again
In order to check whether the file is present or not, you can use the List files in folder action.
Then you can check for the file's existence by looping through the files inside that folder using the DisplayName variable.
I want to add a condition: if the file is not present on the SFTP the first time, it should check again after 5 minutes, for 3 retry attempts.
For the above requirement, when you want to retry you can use the Until action of the Control connector, and set its count to 3 from Change Limits.
In the next step you can use the Delay action and set its interval to 5 minutes.
That way, every 5 minutes the flow checks whether the file is present on the SFTP, for up to 3 attempts.

ADF (Azure Data Factory): multiple wildcard filtering

I have a condition where I have more than 2 types of files which I have to filter out. I can filter out one type using a wildcard, something like *.csv, but can't do something like *.xls, *.zip.
I have a pipeline which should convert csv, avro, and dat files into .parquet format. But the folder also has .zip, Excel, and PowerPoint files, and I want them to be filtered out. Instead of using 3-4 activities, is there any way I can use an (or) condition to filter out multiple extensions using the wildcard option of Data Factory?
Based on my test, dynamic content can't accept multiple wildcards or a regular expression.
You have to use multiple activities to match the different types of your files, or you could consider a workaround using a Lookup activity + ForEach activity.
1. The Lookup activity loads all the file names from the specific folder (Child Items).
2. Check the file format in the ForEach activity condition (using the endswith built-in function).
3. If the file format matches the filter condition, go into the True branch and configure it as the dynamic path of the dataset in the Copy activity.

How to download more than 100MB of data into CSV from a Snowflake database table

Is there a way to download more than 100MB of data from Snowflake into Excel or CSV?
I'm able to download up to 100MB through the UI by clicking the 'download or view results' button.
You'll need to consider using what we call "unload", a.k.a. COPY INTO <location>,
which is documented here:
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-location.html
Other options might be to use a different type of client (a Python script or similar; see the sketch at the end of this answer).
I hope this helps...Rich
.....EDITS AS FOLLOWS....
Using the unload (COPY INTO <location>) isn't quite as overwhelming as it may appear, and if you can use the SnowSQL client (instead of the web UI) you can "grab" the files from what we call an "INTERNAL STAGE" fairly easily. Example as follows.
CREATE TEMPORARY STAGE my_temp_stage;
COPY INTO @my_temp_stage/output_filex
FROM (select * FROM databaseNameHere.SchemaNameHere.tableNameHere)
FILE_FORMAT = (
TYPE='CSV'
COMPRESSION=GZIP
FIELD_DELIMITER=','
ESCAPE=NONE
ESCAPE_UNENCLOSED_FIELD=NONE
date_format='AUTO'
time_format='AUTO'
timestamp_format='AUTO'
binary_format='UTF8'
field_optionally_enclosed_by='"'
null_if=''
EMPTY_FIELD_AS_NULL = FALSE
)
overwrite=TRUE
single=FALSE
max_file_size=5368709120
header=TRUE;
ls @my_temp_stage;
GET @my_temp_stage file:///tmp/;
This example:
Creates a temporary stage object in Snowflake, which will be discarded when you close your session.
Takes the results of your query and loads them into one (or more) CSV files in that internal temporary stage, depending on the size of your output. Notice how I didn't create another database object called a FILE FORMAT; it's considered a best practice to do so, but you can do these one-off extracts without creating that separate object if you don't mind the command being so long.
Lists the files in the stage, so you can see what was created.
Pulls the files down using GET. In this case it was run on my Mac and the file(s) were placed in /tmp; if you are using Windows you will need to modify it a little bit.
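The answer mentions using a different type of client; purely as an illustration, here is a minimal Java sketch of the same unload using the Snowflake JDBC driver. The account URL, credentials, warehouse and table name are placeholders, and it assumes the JDBC driver accepts the GET command for downloading staged files the same way SnowSQL does.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

// Hypothetical sketch: run the unload and download from a JDBC client (requires the snowflake-jdbc driver).
public class SnowflakeUnload {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "myUser");              // placeholder credentials
        props.put("password", "myPassword");
        props.put("db", "databaseNameHere");
        props.put("schema", "SchemaNameHere");
        props.put("warehouse", "myWarehouse");    // placeholder warehouse

        try (Connection con = DriverManager.getConnection(
                 "jdbc:snowflake://myaccount.snowflakecomputing.com/", props); // placeholder account
             Statement stmt = con.createStatement()) {

            stmt.execute("CREATE TEMPORARY STAGE my_temp_stage");
            stmt.execute("COPY INTO @my_temp_stage/output_filex "
                       + "FROM (SELECT * FROM databaseNameHere.SchemaNameHere.tableNameHere) "
                       + "FILE_FORMAT = (TYPE='CSV' COMPRESSION=GZIP FIELD_OPTIONALLY_ENCLOSED_BY='\"') "
                       + "HEADER=TRUE");
            // assumed: the driver executes GET just like SnowSQL, writing the staged files to /tmp
            stmt.execute("GET @my_temp_stage file:///tmp/");
        }
    }
}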

How to delete a single value from graphite's whisper data?

I need to delete selected values from a Graphite Whisper data set. It is possible to overwrite a single value just by sending a new value, or to delete the whole set by deleting the .wsp file, but what I need to do is delete just one (or several) selected values, i.e. reset them to the same state as if they had never been written (undefined, so Graphite returns nulls). Overwriting doesn't do that.
How to do it? (Programmatically is ok)
See also:
How to cleanup the graphite whisper's data?
Removing spikes from Graphite due to erroneous data
Graphite (Whisper) usually ships with the whisper-update utility.
You can use it to modify the contents of a .wsp file:
whisper-update.py [options] path timestamp:value [timestamp:value]*
If the timestamp you want to modify is recent (as defined by carbon), you may want to wait or shut down your carbon-cache daemons.

Conditional ETL in Camel based on matching .md5

I looked through the docs for a way to use Camel for ETL just as in the site's examples, except with these additional conditionals based on an MD5 match.
Like the Camel example, myetl/myinputdir would be monitored for any new file, and if one is found, the file ${filename} would be processed.
Except it would first wait for ${filename}.md5 to show up, which would contain the correct md5. If ${filename}.md5 never showed up, it would simply ignore the file until it did.
And if ${filename}.md5 did show up but the md5 didn't match, it would be processed but with an error condition.
I found suggestions to use crypto for matching, but I have not figured out how to ignore the file until the matching .md5 file shows up. Really, these two files need to be processed as a matched pair for everything to work properly, and they may not arrive in the input directory at the exact same millisecond. Or alternatively, the md5 file might show up a few milliseconds before the data file.
You could use an aggregator to combine the two files based on their file name. If your files are suitably named, then you can use the file name (without extension) as the correlation ID. Continue the route once completionSize equals 2. If you set groupExchanges to true then in your next route step you have access to both the file to compute the hash value for and the contents of the md5 file to compare the hash value against. Or if the md5 or content file never arrived within completionTimeout you can trigger whatever action is appropriate for your scenario.
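As a rough sketch of that idea (not a drop-in solution), a Camel route in Java could look like the following. The endpoint URIs are placeholders, the grouped-exchange aggregation strategy stands in for the groupExchanges option mentioned above, and the correlation key is the file name with its extensions stripped, so data.csv and data.csv.md5 should land in the same group.

import org.apache.camel.builder.AggregationStrategies;
import org.apache.camel.builder.RouteBuilder;

// Hypothetical sketch: hold each file until its partner arrives, pairing the data file with its .md5 twin.
public class Md5PairingRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:myetl/myinputdir")
            // correlate on the file name without extensions ("data.csv" and "data.csv.md5" both become "data")
            .aggregate(simple("${file:name.noext}"), AggregationStrategies.groupedExchange())
                .completionSize(2)           // complete once both the data file and its .md5 have arrived
                .completionTimeout(60_000)   // or give up after a minute and handle the lone file
            // the grouped exchanges are handed on together, so the next step can compute
            // the hash of the data file and compare it against the contents of the .md5 file
            .to("direct:verifyChecksum");
    }
}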
