How to shard input to Sagemaker ProcessingJob based on S3 folder? - amazon-sagemaker

I'm trying to use Sagemaker ProcessingJob to process a huge s3 bucket on multiple instances.
The s3 bucket is structured to have multiple input files of each job in the same folder, e.g.
job1/
a.jpg
b.json
c.proto
job2/
a.jpg
b.json
c.proto
...
Where a.jpg,b.json,c.proto are required together for processing.
How can i force Sagemaker to shard jobs according to the folder structure, instead of individual files?
I tried looking for appropriate sharding strategy, but found only ShardByS3Key or FullyReplicated.

Related

Reading files from S3 bucket with camel

I am new to Camel and need some guidance. I need to read some files from an S3 bucket. The structure is like so.
S3 Bucket
```
Incoming
+xls
-file1.xls
-file2.xls
-file3.xls
+doc
-file1.doc
-file2.doc
-file3.doc
Processed
+xls
...
+doc
...
When a particular excel file is dropped into the incoming/xls folder (say file1.xls), I need to pick up all the files, do some processing and drop them into a processed folder with the same directory structure.
What components do I need to use for this? I tried reading the documentation but its a little difficult to figure out what components I need. I understand that I will use the camel-aws-s3 plugin but there are not many examples of it out there.
On the https://camel.apache.org/components/latest/aws-s3-component.html there some examples about writing and reading from a S3 Bucket.
Next to reading and writing to S3, you might need some custom processor that uses Apache POI to transform the xsl files

Snowpipe Infastructure & s3 subfolders

I am trying to set up scalable snowpipe infrastructure. I have one AWS lambda function pulling data and putting the raw json files into their corresponding folders below.
Ideally I'd like to set up snowpipe to read in the data from each folder into it's own Snowflake table.
Ex)
The leads json file living in the leads folder is now piped into a
leads_json table within snowflake.
The opportunities json file living in the opportunities folder is now piped into a opportunitie_json table within snowflake.
How do I go about setting up the pipelines and stages to reduce the number of pipelines and stages needed?
Will I need one pipeline and stage per sub folder in the bucket?
I'm going to make use out of the AUTO_INGEST=true feature using SQS notifications.
You will need 1 PIPE for each TABLE that you are loading via Snowpipe. You could have a single STAGE pointing to the top folder of your S3 bucket, if you wish, or you could create 1 per table at a lower level folder. I hope that is answering your question.

Snowpipe: Importing historical data that may be modified without causing re-imports

To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have a S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't reimport files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... However, if I use that, then if those old files get modified, they get imported via Snowflake since the metadata preventing COPY INTO ... from re-importing the S3 files and the metadata for Snowpipe are different, so I can end up with that file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?
If you're not opposed to a scripting solution for this, one solution would be to write a script to pull the set of in scope object names from AWS S3 and feed them to the Snowpipes REST API. The code you'd use for this is very similar to what is required if you're using an AWS Lambda to call the Snowpipe REST API when triggered via an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST STAGE statement to pull them.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we've enabled Snowpipe ingestion after data had already been written there. Even in the scenario where you don't have to worry about a file being updated in place, this can still be an advantage over just falling back to a direct COPY INTO because you don't have to worry if there's any overlap between when the PIPE was first enabled and the set of files you push to the Snowpipe REST API since the PIPE load history will take care of that for you..

Automatically retain latest datastore backup

I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fix location for each kind, where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron-job to perform backups automatically and regularly as described in the docs Scheduled Backups and I can also write a bit of Python code to execute backup tasks which is triggered manually as described in this SO answer. I plan to write a small Python cron-job that would perform the task to find the most recent backup_info file of a given kind and copy/rename it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders during a day, especially if there is more than one backup for a certain kind. For example in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume, it contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
For every kind in the backup, e.g. "Comment". It seems it contains a list of actual backup data for this kind and this backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
Data folder for each backup and kind. Here for kind "Comment", backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) is not possible: Which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions for the problem, one is in GA and one is in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup
and its API for long-running operations allows to get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events would allow to handle just specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). The job allows to specify the path and produces predictable full paths, so existing backup files could be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron-jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). Also the format of the export is different from the old export. It can be imported to BigQuery, too, just like the old [KIND].backup_info files.
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket and if that file exists (file.resourceState === 'not_exists'), is new (file.metageneration === '1') and in fact is one of the [KIND].backup_info files we want, it will be copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite more recent backup file with older file). Copying or moving actual backup payload will break the references inside the [KINDNAME].backup_info files though.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)

How to upload .gz files into Google Big Query?

I have an idea of a 90 GB .csv file that I want to make on my local computer and then upload into Google BigQuery for analysis. I create this file by combining thousands of smaller .csv files into 10 medium-sized files and then combining those medium-sized files into the 90 GB file, which I then want to move to GBQ. I am struggling with this project because my computer keeps crashing from memory issues. From this video I understood that I should first transform the medium-sized .csv files (about 9 GB each) into .gz files (about 500MB each), and then upload those .gz files into Google Cloud Storage. Next, I would create an empty Table (in Google BigQuery / Datasets) and then append all of those files to the created Table. The issue I am having is finding some kind of tutorial about how to do this or and documentation of how to do this. I am new to the Google Platform so maybe this is a very easy job that can be done with 1 click somewhere, but all I was able to find was from the video that I linked above. Where can I find some help or documentation or tutorials or videos on how people do this? Do I have the correct idea on the workflow? Is there some better way (like using some downloadable GUI to upload stuff)?
See the instructions here:
https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile
As Abdou mentions in a comment, you don't need to combine them ahead of time. Just gzip all of your small CSV files, upload them to a GCS bucket, and use the "bq.py load" command to create a new table. Note that you can use a wildcard syntax to avoid listing all of the individual file names to load.
The --autodetect flag may allow you to avoid specifying a schema manually, although this relies on sampling from your input and may need to be corrected if it fails to detect in certain cases.

Resources