The Use Case:
Store versions of Large Datasets (CSV/Snowflake Tables) and query across versions
The Delta Lake documentation says that unless we run the vacuum command, historical information is retained in a Delta table, and that log files are deleted every 30 days.
Additional documentation states that we need both the log files and the data files to time travel.
Does this imply that we can only time travel back 30 days?
But isn't Delta a file format? How would it automatically delete its logs?
If so, what other open-source options are there for querying across dataset versions?
Just set the data and log retention settings to a very long period.
alter table delta.`/path/to/table` set TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 36500000 days',
  'delta.deletedFileRetentionDuration' = 'interval 36500000 days'
)
spark.sql("alter table delta.`{table_path}` set TBLPROPERTIES ("
"'delta.logRetentionDuration'='interval {log_retention_days} days', "
"'delta.deletedFileRetentionDuration'='interval {data_rentention_days} days');".format(
table_path="path/to/table",
log_retention_days=36000000,
data_rentention_days=36000000))
Databricks open-sourced the Delta Lake project in April 2019 (the open-source Delta Lake is still catching up on some functionality, such as data skipping); see the Delta Lake docs and GitHub repo for details.
Delta is not a file format - it is a storage layer on top of Parquet data files and JSON metadata (transaction log) files.
Data files are not deleted automatically; a vacuum operation has to be performed to delete older files that are no longer referenced (not active).
So without running the vacuum operation you can time travel indefinitely, as all data remains available. On the other hand, if you run vacuum with a 30-day retention, you can only access the last 30 days of data.
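For example, a vacuum with an explicit retention period looks like this (the table path and retention value are placeholders):
VACUUM delta.`/path/to/table` RETAIN 720 HOURS  -- 720 hours = 30 days of history kept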
Yes, it solves querying across dataset versions. Each version can be identified by a timestamp. Sample queries to access the data of a specific version:
Scala:
val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2020-10-01")
  .load("/path/to/my/table")
Python:
df = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2020-10-01") \
    .load("/path/to/my/table")
SQL:
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-10-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-10-01 01:30:00.000"
(Note: I am using the open-source Delta Lake in production for multiple use cases.)
FYI, I am using Azure Synapse Analytics (serverless SQL) and a Gen2 Data Lake for all of this work.
Here is my issue: I am working with the Azure cost data export feature, which creates a new folder for each month and a new folder for each day (this is in an Azure Gen2 Data Lake). In that folder it puts a CSV with all the cost data for that month. Ultimately I only care about the last folder in each month. Data continues to trickle in, so usually the last folder date is several days after the end of the month.
I have my data partitioned, so if I could pass back the latest month and dates, it would greatly reduce the amount of data that is being scanned. I have been able to test this by hardcoding some values in a where clause and it works great (a sketch of that hardcoded query is at the end of this post).
My problem is that I can't figure out how to get the max folder without actually scanning all the files in the first place.
Here is some example code that works; it gives me back a result that I can use, though it does a full scan of the data lake. It pulls back nearly 6 GB of data, versus about 400 MB with the hardcoded values. The current size isn't a problem, but I know this is only going to continue to grow.
with usethesefiles as (
    select distinct
        files.filepath(1) as file_path_one,
        files.filepath(2) as file_path_two
    from openrowset(
            bulk 'azurecosts/azuredailyexport/*/*/*/*.csv',
            data_source = 'mydatalake_dfs_core_windows_net',
            format = 'CSV',
            parser_version = '2.0',
            string_delimiter = '"',
            header_row = true
        ) as files
)
select
    file_path_one,
    max(file_path_two) as file_path_two
from usethesefiles
group by file_path_one
This returns the following data:
file_path_one        file_path_two
20230201-20230228    202302101419
20230101-20230131    202302051419
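For reference, the hardcoded query I tested looks roughly like this; the openrowset call mirrors the one above, and the two folder values are just the latest ones from the output above:
select *
from openrowset(
        bulk 'azurecosts/azuredailyexport/*/*/*/*.csv',
        data_source = 'mydatalake_dfs_core_windows_net',
        format = 'CSV',
        parser_version = '2.0',
        string_delimiter = '"',
        header_row = true
    ) as files
where files.filepath(1) = '20230201-20230228'  -- month folder
  and files.filepath(2) = '202302101419'       -- latest daily export folder in that month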
WooCommerce, after each status change on an order, adds comments of type "order_note" to the database.
How can I remove all of them, or disable some of them?
There are 36,000 rows in my database now.
Some of the order status note comments are not important for us.
36,000 rows is actually not that big of a deal. WooCommerce has many performance and database-structure imperfections you should keep in mind; this is probably not one of them.
Anyway...
WooCommerce stores its order notes inside the wp_comments table, with the comment_type set to order_note.
You can safely delete these rows as you wish. For example, if you want to delete order notes from 2021 and earlier (and keep only those from 2022), you can run this query:
DELETE FROM `wp_comments` WHERE `comment_type` = 'order_note' AND `comment_date` <= '2021-12-31';
If you want to delete order notes for specific order IDs (e.g. for order 12345 and older), you can do it in a similar way:
DELETE FROM `wp_comments` WHERE `comment_type` = 'order_note' AND `comment_post_ID` <= 12345;
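If you want to preview how many rows a given condition will remove before actually deleting anything, the same WHERE clause works with a plain count, e.g.:
SELECT COUNT(*) FROM `wp_comments` WHERE `comment_type` = 'order_note' AND `comment_post_ID` <= 12345;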
You can implement this SQL query as a PHP script using $wpdb, e.g. to automatically delete order notes that were created last year or earlier:
global $wpdb;
// Delete all order notes created last year and earlier
$delete_before = date( 'Y-m-d', strtotime( 'last year December 31st' ) );
$wpdb->query($wpdb->prepare("DELETE FROM `wp_comments` WHERE `comment_type` = 'order_note' AND `comment_date` <= %s;", $delete_before));
You can implement such a script as a function and trigger it automatically, either with wp_schedule_event() or as a standard cron job.
I have scheduled an unload of data from Snowflake to S3 every hour. The data gets uploaded to this path: My_bucket/year=2021/month= /day= /hour= /data.csv
The year, month, day and hour get dynamically updated in the path on every hourly run.
Data is not necessarily present every hour; in that case no folder or path gets created at all.
I need to have a folder for every hour in S3, irrespective of whether data is flowing in,
like hour=1, hour=2, hour=3 and so on for all 24 hours, every time the query runs.
There should be a CSV file if data is present in the table, and even if no data is present, the path for that hour should still exist with an empty file.
So how should I modify my SQL query?
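For context, the hourly unload is a COPY INTO roughly along these lines (the stage name, the hardcoded path and the hour filter are placeholders here, not my exact statement):
copy into @my_s3_stage/year=2021/month=05/day=26/hour=01/data.csv
from (
    select *
    from myTable
    where load_time >= dateadd(hour, -1, current_timestamp())  -- placeholder filter for the current hour
)
file_format = (type = csv);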
Hi, you can achieve this by using a workaround.
Snowflake will not upload any file if the query returns zero rows.
This workaround will copy a file containing only the header row if no data is available.
copy into <s3_path>
from (
    -- this row acts as the header, whether data is present or not
    select 'customer_id', 'store_id', 'metro_id', 'message_type'
    union
    select to_char(customer_id), to_char(store_id), to_char(metro_id), message_type
    from myTable
)
OVERWRITE = TRUE
FILE_FORMAT = (type = csv)
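After the copy runs, you can confirm that a file exists even for an hour with no data by listing the target location, e.g. (the stage name and path here are placeholders):
list @my_s3_stage/year=2021/month=05/day=26/hour=01/;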
I am new to Snowflake SQL and am writing a query for a historical dashboard.
The requirement is to get the data of the last 5 years from the current date, plus the whole of the first month of that 5th year back. For example, if today is 26-05-2021, we would need to get the data from
01-05-2016 to 25-05-2021.
Using my present Snowflake SQL query I get the data of the last 5 years; is there something built into Snowflake to also get the delta of the remaining days?
select * from table where portion_start >= trunc(add_months(sysdate(),-12*5),'YEAR')
I believe this is what you are looking for:
select *
from table
where portion_start >= dateadd(year,-5,date_trunc(month,current_date()))
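As a quick sanity check of the formula, assuming today is 26-05-2021 as in the example:
select dateadd(year, -5, date_trunc(month, '2021-05-26'::date));
-- returns 2016-05-01, so portion_start >= 01-05-2016, which includes the whole first month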
I have 2 very large (billions of rows) splayed tables, Trades and StockPrices, on a remote server. I want to do an asof join:
h:hopen `:RemoteServer:Port
h"aj[`Stock`Date`Time;
select from Trades where Date within 2014.04.01 2014.04.13;
StockPrices
]"
But I just get this error (I'm using Studio for KDB+):
An error occurred during execution of the query.
The server sent the response:
splay
Studio Hint: Possibly this error refers to nyi op on splayed table
So what would be the correct way to do such a join?
Also, performance and efficiency are a concern with such big tables: what should I be doing to ensure the query doesn't take hours and doesn't consume too much of the server's system resources?
You need to map the splayed StockPrices table into memory. This can be done by using a select query:
q)(`::6060)"aj[`sym`time;select from trade;quote]" / bad
'splay
q)(`::6060)"aj[`sym`time;select from trade;select from quote]" / good
sym time prx bid ask
-------------------------------------------
aea 01:01:16.347 637.7554 866.0131 328.1476
aea 01:59:14.108 819.5301 115.053 208.1114
aea 02:42:44.724 69.38325 641.8554 333.3092
This page may be useful for looking up errors from Kdb+: http://code.kx.com/q/ref/error-list/
Regarding optimising the performance of aj, see http://code.kx.com/q/ref/joins/#aj-aj0-asof-join
Also, if there isn't an overlap of data between days, it may be faster to run the query on a day by day basis, possibly in parallel.
If there is an overlap of data across days, combining the date & time columns into a single timestamp column would speed up the lookup.