how to set filepath for amazon glacier uploads?

Amazon Glacier does not have the concept of filepaths. However, when I upload files to Glacier via client tools like Cloudberry, my uploads do have a path structure.
If I am programmatically uploading an archive to Amazon Glacier, how can I upload it so that it has a filepath and filename in Cloudberry? I think I may need to add something to the 'x-amz-archive-description' field described here: http://docs.aws.amazon.com/amazonglacier/latest/dev/api-archive-post.html, but I do not know how to format it.
I am using the Amazon Javascript SDK: http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/examples.html. I think I've been able to upload archives fine, though I haven't been able to see them in Cloudberry yet.
UPDATE: After getting it working, I put the code I was using here in case a sample is needed: https://github.com/fschwiet/mysql-glacier-backup

Our Glacier archive description metadata is a simple JSON object with the following fields:
"Path": the full path of the source file, e.g. "c:\myfolder\myfile.txt" for a file copied from the local disk, or "mybucket/myfolder/myfile.txt" for files copied from cloud storage like Amazon S3. The path is UTF-7 encoded.
"UTCDateModified": ISO 8601 UTC date without milliseconds (format: "yyyyMMddTHHmmssZ"). This is the modification date of the original file (not the archive creation date).
"Flags": integer flags value. 1 - compressed, 2 - encrypted.
Thanks,
Andy
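
For reference, here is a minimal sketch of what such an upload might look like. The question uses the Amazon JavaScript SDK, but the same archiveDescription parameter is exposed by boto3, which is used here for brevity; the vault name, file path, and flag value are placeholders, and whether Cloudberry recognizes exactly this JSON layout is an assumption based on the description above.

```python
# Sketch: upload an archive with a Cloudberry-style description (assumption).
# Assumes boto3 is configured with credentials and a vault named
# "my-backups" already exists (both placeholders).
import json
from datetime import datetime, timezone

import boto3

glacier = boto3.client("glacier")

description = json.dumps({
    "Path": "c:\\myfolder\\myfile.txt",  # full path of the source file
    "UTCDateModified": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
    "Flags": 0,                          # 1 = compressed, 2 = encrypted
})

with open("myfile.txt", "rb") as f:
    response = glacier.upload_archive(
        vaultName="my-backups",
        archiveDescription=description,
        body=f,
    )

print(response["archiveId"])
```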

I've been zipping the tree up (for ease of restore) and storing all the tree info in the archive, so photos_2012.zip or whatever. The long list of files just wasn't working for me from an ease-of-checking-things-were-actually-backed-up perspective.
It's more costly to restore, because I'll have to pull a whole tree down, but given that my goal is never to need this archive, I'm OK with that.

Related

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading raw text of a tiny subset, 10's of megs tops, of the AWS Common Crawl, as a corpus for information retrieval tests.
The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm looking at sifting through hundreds of GBs of data when all I need is a few dozen megs.
There's some code here, but it requires an S3 account and access (although I do like Python).
Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.
Thanks!
It's quite easy: just randomly choose a single WARC (or WAT or WET) file from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/
take the latest crawl (e.g. April 2019)
navigate to the WARC file list and download it (same for WAT or WET)
decompress the file and randomly select one line (a file path)
prefix the path with https://commoncrawl.s3.amazonaws.com/ (or, since spring 2022, https://data.commoncrawl.org/ - there is a description in the blog post) and download it
You're done, because every WARC/WAT/WET file is a random sample in its own right. If you need more data, just pick more files at random. The sketch below scripts these steps.
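
A scripted version of the steps above, as a rough sketch: it fetches the WET path list for one crawl, picks a random path, and downloads that single file. The crawl label CC-MAIN-2019-18 (the April 2019 crawl) and the output file name are assumptions used for illustration.

```python
# Sketch: download one randomly chosen WET file from a single monthly crawl.
import gzip
import random

import requests

BASE = "https://data.commoncrawl.org/"
# The crawl label below is just an example (April 2019); pick any crawl
# announced on the Common Crawl blog.
paths_url = BASE + "crawl-data/CC-MAIN-2019-18/wet.paths.gz"

resp = requests.get(paths_url)
resp.raise_for_status()
paths = gzip.decompress(resp.content).decode("utf-8").splitlines()

wet_path = random.choice(paths)
print("Downloading", wet_path)

with requests.get(BASE + wet_path, stream=True) as r:
    r.raise_for_status()
    with open("sample.warc.wet.gz", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```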

How to upload .gz files into Google Big Query?

I want to build a 90 GB .csv file on my local computer and then upload it into Google BigQuery for analysis. I create this file by combining thousands of smaller .csv files into 10 medium-sized files, and then combining those medium-sized files into the 90 GB file, which I then want to move to GBQ. I am struggling with this project because my computer keeps crashing from memory issues. From this video I understood that I should first transform the medium-sized .csv files (about 9 GB each) into .gz files (about 500 MB each), then upload those .gz files into Google Cloud Storage, then create an empty table (in Google BigQuery / Datasets), and finally append all of those files to the created table.
The issue I am having is finding a tutorial or documentation on how to do this. I am new to the Google platform, so maybe this is a very easy job that can be done with one click somewhere, but all I was able to find was the video that I linked above. Where can I find some help, documentation, tutorials, or videos on how people do this? Do I have the correct idea of the workflow? Is there some better way (like using some downloadable GUI to upload stuff)?
See the instructions here:
https://cloud.google.com/bigquery/bq-command-line-tool#creatingtablefromfile
As Abdou mentions in a comment, you don't need to combine them ahead of time. Just gzip all of your small CSV files, upload them to a GCS bucket, and use the "bq.py load" command to create a new table. Note that you can use a wildcard syntax to avoid listing all of the individual file names to load.
The --autodetect flag may allow you to avoid specifying a schema manually, although this relies on sampling your input, and the result may need to be corrected if detection fails in certain cases.
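
The answer describes the bq command-line tool; the same load can be expressed with the BigQuery Python client library, sketched below. The bucket, dataset, and table names are placeholders, and the header-row assumption may need adjusting for your files.

```python
# Sketch: load gzipped CSVs from GCS into a new BigQuery table using a
# wildcard URI, so the individual file names never need to be listed.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # let BigQuery infer the schema (verify the result)
    skip_leading_rows=1,    # assumes each CSV part has a header row
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/csv-parts/*.csv.gz",   # wildcard over all gzipped parts
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

table = client.get_table("my_project.my_dataset.my_table")
print(table.num_rows, "rows loaded")
```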

Zip file format for uploading pre-annotated data in Watson Knowledge Studio

I am trying to figure out how to include a pre-annotated model in Watson Knowledge Studio. I have followed the information found here, but it doesn't seem to generalize. As a start, I have tried exporting an annotated set from Knowledge Studio to re-upload (using the "Import corpus documents and include ground truth" option). If I re-upload the exported zip as-is, this works, but if I unzip the folder and then recompress it, I get the following error:
A file could not be imported: The imported ZIP file is not in the expected format. Check whether the file was exported from another project. The type system from the same project must be imported first. (You selected 'Import corpus documents and include ground truth').
I have tried using the zip command in Linux (both with and without the -k flag, which forces MS-DOS-style naming) and also used the compress utility in Windows, but I get the same error each time. This is without making any changes to the contents of the folder.
Any help would be greatly appreciated!
Could you please check the internal structure of the ZIP you created and compare it with the original ZIP? I have seen similar trouble reports before and found that the newly created ZIP contained a root folder in its structure. WKS expects the same folder structure inside the ZIP file as in the export.
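
If that extra root folder is indeed the problem, a sketch like the following re-creates the ZIP with paths relative to the exported folder, so the archive's internal layout matches the original export. The folder and output file names are placeholders.

```python
# Sketch: re-zip the exported corpus without adding a top-level root folder.
# "exported_corpus" is assumed to be the folder created by unzipping the
# WKS export; adjust the names to your setup.
import os
import zipfile

src_dir = "exported_corpus"

with zipfile.ZipFile("corpus_reupload.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            full_path = os.path.join(root, name)
            # Store paths relative to src_dir so no extra root directory
            # appears inside the ZIP.
            arcname = os.path.relpath(full_path, src_dir)
            zf.write(full_path, arcname)
```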

AzureXplorer - local blob storage - unexpected hidden files when manually creating folders

Our C# ListBlobs method, which works fine up in Azure, revealed one extra file per folder locally, named "$$$.$$$", that is not visible in AzureXplorer or ClumsyLeaf. Neither Google nor MSDN has turned up any mention of this, so I was wondering if anyone else has seen it. The workaround for this defect in AzureXplorer is to manually create local blob folders with ClumsyLeaf, which does NOT produce these hidden files, allowing us to continue testing locally without specifically coding around these files.
Windows Azure Blob Storage does not support folders. Any software that fakes folders does so with blobs (files). Folders (or directories) are simulated by using prefixes in blob names, which works because the slash character is a valid character in a blob name.
You can find more information about blob service in the following resources:
http://msdn.microsoft.com/en-us/library/windowsazure/dd179376.aspx
http://msdn.microsoft.com/en-us/library/windowsazure/dd135715.aspx
From the second resource:
The Blob service is based on a flat storage scheme, not a hierarchical scheme. However, you may specify a character or string delimiter within a blob name to create a virtual hierarchy.
That is why, in order to have a "folder", you must have at least one blob (file) within that "folder" (I put "folder" in quotes because it is not a real folder, just part of the name of the blob itself).
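
To make the virtual hierarchy concrete, here is a small sketch using the much newer azure-storage-blob v12 Python package (the question predates it); the connection string, container name, and blob names are placeholders.

```python
# Sketch: a "folder" in blob storage is nothing more than a prefix ending
# in "/" inside a blob's name.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("mycontainer")

# This creates a single blob; "myfolder/" exists only as part of its name.
container.upload_blob(name="myfolder/report.txt", data=b"hello", overwrite=True)

# Listing with a prefix gives the illusion of browsing a folder.
for blob in container.list_blobs(name_starts_with="myfolder/"):
    print(blob.name)
```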

grails file upload

Hey. I need to upload some files (images/pdf/pp) to my SQLS database and later download them again. I'm not sure what the best solution is: store them as bytes, or store them as files (not sure if that is possible). Later I need to databind multiple domain classes together with that file upload.
Any help would be very much appreciated,
JM
Saving files in the file system or in the DB is a general question that has been asked here several times.
Check this: Store images(jpg,gif,png) in filesystem or DB?
I recommend saving the files in the file system and just saving the path in the DB; a minimal sketch of that pattern follows below.
(If you want to work with Google App Engine, though, you have to save the file as a byte array in the DB, since saving files in the file system is not possible with Google App Engine.)
To upload a file with Grails, check this: http://www.grails.org/Controllers+-+File+Uploads
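
Since the question is about Grails, the following is only a language-agnostic illustration of the recommended approach, sketched in Python with SQLite: write the uploaded bytes to the file system and persist only the path (plus the original name) in the database. All directory, table, and file names here are made up; the Grails-specific upload handling is covered by the link above.

```python
# Sketch of the "file on disk, path in DB" pattern (illustrative names only).
import os
import sqlite3
import uuid

UPLOAD_DIR = "uploads"

def save_upload(original_name: str, data: bytes, db: sqlite3.Connection) -> str:
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    # Use a generated name on disk to avoid collisions; keep the original
    # name only in the database row.
    stored_path = os.path.join(UPLOAD_DIR, f"{uuid.uuid4().hex}_{original_name}")
    with open(stored_path, "wb") as f:
        f.write(data)
    db.execute("INSERT INTO attachments (original_name, path) VALUES (?, ?)",
               (original_name, stored_path))
    db.commit()
    return stored_path

db = sqlite3.connect("myapp.db")
db.execute("CREATE TABLE IF NOT EXISTS attachments "
           "(id INTEGER PRIMARY KEY, original_name TEXT, path TEXT)")
print(save_upload("report.pdf", b"%PDF-1.4 ...", db))
```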
