Credit usage within a task executing a copy statement - snowflake-cloud-data-platform

We have some tasks that execute a COPY statement against an external S3 bucket/prefix. The bucket/prefix contains millions of files. Even when there are no additional files to load, the task still takes 7 minutes, with LIST_EXTERNAL_FILES_TIME in query history showing that this is where the time is spent.
Ignoring the design for a moment :) does the LIST_EXTERNAL_FILES_TIME consume credits? Will it consume credits even when utilizing a SERVERLESS warehouse?
Thanks

I would have assumed it would... but:
show warehouses;
name        state      type
COMPUTE_WH  SUSPENDED  STANDARD
list @citibike_trips;
4.2K files listed
show warehouses;
name        state      type
COMPUTE_WH  SUSPENDED  STANDARD
So it seems the fetch happens in the cloud services layer (CLOUD_COMPUTE), for which you are billed only for the portion of a day's cloud services usage that exceeds 10% of your daily warehouse compute usage.
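To see whether that listing actually shows up on the bill, you can compare daily cloud services usage against the 10% adjustment in ACCOUNT_USAGE. A sketch against the SNOWFLAKE.ACCOUNT_USAGE.METERING_DAILY_HISTORY view (the 30-day window is arbitrary):

```sql
-- Daily cloud services credits vs. the 10% adjustment.
-- CREDITS_ADJUSTMENT_CLOUD_SERVICES is negative: it offsets the portion
-- of cloud services usage that falls under the 10% allowance.
select usage_date,
       credits_used_compute,
       credits_used_cloud_services,
       credits_adjustment_cloud_services,
       credits_used_cloud_services
         + credits_adjustment_cloud_services as billed_cloud_services
from snowflake.account_usage.metering_daily_history
where usage_date >= dateadd('day', -30, current_date())
order by usage_date;
```

If billed_cloud_services stays at zero, the listing time is being absorbed by the 10% allowance.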

Related

TASKS history in Snowflake

Is there an efficient way to see the logs of task runs in Snowflake?
I am using this. Is there a possibility to wipe off the history from here?
select *
from table(information_schema.task_history(
    scheduled_time_range_start => dateadd('hour', -1, current_timestamp()),
    result_limit => 1000,
    task_name => 'TASKNAME'));
Is there an efficient way to see the logs of task runs in Snowflake?
Depending on what you mean by efficient, Snowflake offers a UI to monitor task dependencies and run history.
Run History
Task run history includes details about each execution of a given task. You can view the scheduled time, the actual start time, duration of a task and other information.
Account Level Task History:
Task history displays task information at the account level, and is divided into three sections:
Selection (1) - Defines the set of task history to display, including task types, date range, and other filters
Histogram (2) - Displays a bar graph of task runs over time
Task list (3) - A list of the selected tasks
Is there a possibility to wipe off the history from here?
Task History
This Account Usage view enables you to retrieve the history of task usage within the last 365 days (1 year). The view displays one row for each run of a task in the history. Note that Account Usage views are read-only, so you can't wipe this history; rows simply age out after the retention period.
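Since the INFORMATION_SCHEMA table function only covers the recent past, longer-range questions go to the ACCOUNT_USAGE view instead. A sketch (latency on this view can be up to around 45 minutes, and the task name is a placeholder):

```sql
-- One row per task run over the last year (ACCOUNT_USAGE retention).
select name,
       scheduled_time,
       query_start_time,
       completed_time,
       state,
       error_message
from snowflake.account_usage.task_history
where name = 'TASKNAME'   -- placeholder task name
order by scheduled_time desc
limit 100;
```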

Auto updating access database (can't be linked)

I've got a CSV file that refreshes every 60 seconds with live data from the internet. I want to automatically update my Access database (on a roughly 60-second interval) with the new rows that get downloaded; however, I can't simply link the DB to the CSV.
The CSV always contains exactly 365 days of data, so when another day ticks over, a day of data drops off. If I were to link to the CSV, my DB would only ever have those 365 days of data, whereas I want to append the new data to the existing database.
Any help with this would be appreciated.
Thanks.
As per the comments the first step is to link your CSV to the database. Not as your main table but as a secondary table that will be used to update your main table.
Once you do that you have two problems to solve:
Identify the new records
I assume there is a way to do so by timestamp or ID, so all you have to do is hold on to the last ID or timestamp imported (that will require an additional mini-table to hold the value persistently).
Make it happen every 60 seconds. To get that update on a regular interval you have two options:
A form's 'OnTimer' event is the easy way but requires very specific conditions. You have to make sure the form that triggers the event is only open once. This is possible even in a multi-user environment with some smart tracking.
If having an Access form open to do the updating is not workable, then you have to work with Windows scheduled tasks. You can set up an Access Macro to run as a Windows scheduled task.
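Whichever trigger you use, the append itself can be a single Access SQL statement run on each tick. A sketch assuming hypothetical table and column names (MainTable, LinkedCSV, a monotonically increasing ID column):

```sql
-- Append only rows newer than anything already in MainTable.
-- LinkedCSV is the linked CSV table; DMax finds the highest ID loaded
-- so far, and Nz handles the very first run when MainTable is empty.
INSERT INTO MainTable (ID, ReadingTime, Value)
SELECT ID, ReadingTime, Value
FROM LinkedCSV
WHERE ID > Nz(DMax("ID", "MainTable"), 0);
```

If the CSV has a timestamp instead of an ID, the same pattern works with DMax on the timestamp column.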

Persist Data in SSIS for Next Execution

I have data to load where I only need to pull records since the last time I pulled this data. There are no date fields to save this information in my destination table so I have to keep track of the maximum date that I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separate from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether or not the package failed/succeeded. You are then able to pull the max start time of the last successfully ran execution.
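A minimal sketch of such an execution table and the watermark lookup, in T-SQL (table and column names are assumptions):

```sql
-- One row per package run; the incremental load queries the
-- start time of the last successful run as its watermark.
CREATE TABLE dbo.PackageExecutionLog (
    ExecutionId  int IDENTITY(1,1) PRIMARY KEY,
    PackageName  sysname   NOT NULL,
    StartTime    datetime2 NOT NULL,
    EndTime      datetime2 NULL,
    Succeeded    bit       NULL
);

-- At the start of the package: pull the watermark.
SELECT MAX(StartTime) AS LastSuccessfulStart
FROM dbo.PackageExecutionLog
WHERE PackageName = 'MyLoadPackage'   -- placeholder package name
  AND Succeeded = 1;
```

The package inserts its own row at startup and updates EndTime/Succeeded on completion.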
You can't persist anything in a package between executions.
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day - you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred etc.
Is it viable to have a SQL table with only one row and one column?
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.

Change Data Capture (CDC) cleanup job only removes a few records at a time

I'm a beginner with SQL Server. For a project I need CDC turned on. I copy the CDC data to another (archive) database, and after that the CDC tables can be cleaned immediately, so the retention time doesn't need to be high. I set it to 1 minute, but when the cleanup job runs (after the retention time has already passed) it appears to delete only a few records (the oldest ones). Why didn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records do get deleted. I find this strange because the retention time has long passed.
I set the retention time to 1 minute (I actually wanted 0, but that was not possible) and didn't change the threshold (= 5000). I disabled the schedule, since I want the cleanup job to run immediately after the CDC records are copied to my archive database rather than at a fixed time.
My logic for this idea was that for example there will be updates in the afternoon. The task to copy CDC records to archive database should run at 2:00 AM, after this task the cleanup job gets called. So because of the minimum retention time, all the CDC records should be removed by the cleanup job. The retention time has passed after all?
I just tried to see what happened when I set up a schedule again in the job, like how CDC is meant to be used in general. After the time has passed I checked the CDC table and turns out it also only deletes the oldest record. So what am I doing wrong?
I made a workaround where I made a new job with the task to delete all records in the CDC tables (and disabled the entire default CDC cleanup job). This works better as it removes everything but it's bothering me because I want to work with the original cleanup job and I think it should be able to work in the way that I want it to.
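For reference, the kind of wipe the workaround job performs can also be expressed with the documented cleanup procedure rather than raw deletes. A sketch assuming a capture instance named dbo_MyTable:

```sql
-- Advance the low watermark to the current max LSN for one capture
-- instance, removing all change rows below it.
DECLARE @max_lsn binary(10) = sys.fn_cdc_get_max_lsn();

EXEC sys.sp_cdc_cleanup_change_table
     @capture_instance = 'dbo_MyTable',  -- placeholder capture instance
     @low_water_mark   = @max_lsn,
     @threshold        = 5000;
```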
Thanks,
Kim
Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance. Specifically, cdc.fn_cdc_get_all_changes_<capture_instance> and cdc.fn_cdc_get_net_changes_<capture_instance>. A typical workflow that I've used with these goes something like the following (do this for all of the capture instances). First, you'll need a table to keep processing status. I use something like:
create table dbo.ProcessingStatus (
    CaptureInstance sysname,
    LSN numeric(25,0),
    IsProcessed bit
);

create unique index UQ_ProcessingStatus
    on dbo.ProcessingStatus (CaptureInstance)
    where IsProcessed = 0;
Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this instance and use that (but don't increment it!). Record whatever LSN you're using in the table, with IsProcessed = 0.
Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
Update IsProcessed = 1 for this run.
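The steps above can be sketched in T-SQL. Assumptions: the capture instance is named dbo_MyTable, and for simplicity the sketch keeps LSNs as binary(10) end to end (the status table above would need its LSN column typed accordingly, or a conversion added):

```sql
-- Process one interval of changes for capture instance 'dbo_MyTable'.
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Step 1: current max log sequence number.
SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- Step 2: resume just after the last processed LSN; fall back to the
-- minimum LSN of the capture instance on the very first run.
SELECT @from_lsn = sys.fn_cdc_increment_lsn(MAX(p.LSN))
FROM dbo.ProcessingStatus p
WHERE p.CaptureInstance = 'dbo_MyTable' AND p.IsProcessed = 1;

IF @from_lsn IS NULL
    SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_MyTable');

INSERT INTO dbo.ProcessingStatus (CaptureInstance, LSN, IsProcessed)
VALUES ('dbo_MyTable', @to_lsn, 0);

-- Step 3: read the interval and archive it however you like.
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, 'all');

-- Step 4: mark the run as processed.
UPDATE dbo.ProcessingStatus
SET IsProcessed = 1
WHERE CaptureInstance = 'dbo_MyTable' AND LSN = @to_lsn;
```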
As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".

How to output in Salesforce?

I'm writing an Apex program that reads through a database and processes records. Each time I process a record, I want to output a message. Currently I'm using System.debug to do this, but the debug log is cluttered with so much else that this doesn't seem like the right approach.
What other ways can I generate screen or logfile output in Salesforce?
Keep using System.Debug() but when you want to view only your output messages, just filter by DEBUG. Otherwise the only other option is to create a view and then that is more clutter than what it's worth.
Open the log in raw format under Setup >> Administration Setup >> Monitoring >> Debug Logs. Under Monitored Users, go to Filters and enable all the filter levels. Now use Apex code like:
System.debug('StackOverflow >>1234: ' + e.getMessage());
and search the detailed debug logs for the unique message StackOverflow >>1234. It may also happen that your System.debug was not executed in that specific debug log, so don't forget to check all the recent debug logs. :)
You could think about creating your own Logging__c object, and create a record in it for each record processed. You have to be creative to work around the governor limits though.
If it's not essential that you output the message in between processing each record, then you could build up a collection of Logging__c records as processing continues and then either insert them periodically, or when there's an exception in your process.
Note that if inserting them periodically, you still have to make sure the job's not so large that you're going to hit the DML limit of 150 statements together with the processing work you're doing. Also, if storing the records to all be inserted at the end of processing, bear in mind the heap size limit of 6 MB.
Alternatively, have a look at Batch Apex http://www.salesforce.com/us/developer/docs/apexcode/Content/apex_batch_interface.htm
This allows you to create a class to handle processing a job in asynchronous chunks. You can set the number of records processed in one go. So you could set this small (~20) and then insert a Logging__c record as each job record is processed to stay within the Batch Apex DML limit of 200. You can then monitor Logging__c records in real time to view progress.