AWS Glacier File Validation - amazon-glacier

I'm new to AWS Glacier and I've uploaded a pretty large encrypted file (500 GB). It took around 4 days. The file matches the original exactly, byte for byte. My only concern is: how do I know it wasn't damaged during the upload? Does anyone know how to verify that?
Thank you!

Q. How can I verify that the file was uploaded correctly?
A. When you upload files, FastGlacier checks file integrity by calculating the sha256 hash and sha256-tree-hash of each part of the file, and the sha256-tree-hash of the entire file when completing the upload. If the hashes do not match, you will see a corresponding error message and the file will not be written.
https://fastglacier.com/faq.aspx
See Computing Checksums in the Glacier API Reference documentation for how x-amz-sha256-tree-hash and x-amz-content-sha256 actually work.
Glacier re-calculates these hashes as the data comes in and will refuse to store anything that doesn't match what the client software sent... so the statement that FastGlacier checks file integrity is not strictly accurate -- it's actually the Glacier service that does the checking, against hashes generated by FastGlacier.
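For reference, the tree hash is straightforward to compute yourself; here is a minimal Python sketch of the documented scheme (SHA-256 over 1 MiB chunks, combined pair-wise into a single root digest):

import hashlib

MIB = 1024 * 1024  # Glacier tree hashes are built from 1 MiB chunks

def sha256_tree_hash(path):
    # Hash each 1 MiB chunk, then combine digests pair-wise until one remains.
    digests = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(MIB)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).digest())
    if not digests:
        return hashlib.sha256(b'').hexdigest()
    while len(digests) > 1:
        pairs = [digests[i:i + 2] for i in range(0, len(digests), 2)]
        digests = [hashlib.sha256(b''.join(p)).digest() if len(p) == 2 else p[0]
                   for p in pairs]
    return digests[0].hex()

If the value this produces for your local file equals the x-amz-sha256-tree-hash Glacier returned when the upload completed, the archive Glacier stored is byte-identical to what you have locally.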

Related

MD5 encrypt selected GDrive files

I'm not a coder myself, but for a task in an AppSheet app project I need to know whether it is basically possible to MD5- or otherwise encrypt actual GDoc, GSheet, image, etc. files.
The goal is that if someone logs into that given GDrive, they would not be able to open the documents without decryption.
Thx
Frank

Server backend: how to generate file paths for uploaded files?

I am trying to create a site where users can upload images, videos and other types of files.
I did some research, and people seem to suggest that saving the files as BLOBs in the database is a bad idea; instead, save the file paths in the database.
My questions are, if I save the file paths in a database:
1. How do I generate the file names?
I thought about computing the MD5 value of the file name, but what if two files have the same name? Should I add the username, a time-stamp, etc. to the file name? Does that even make sense?
2. What is the best directory structure?
If a user uploads images on 12/17/2013 and 12/18/2013, can I just put them in user_ABC/images/ and then create time-stamped sub-directories 20131217, 20131218, etc.? What is the best structure for all this stuff?
3. How do all these come together?
It seems like maintaining this system is such a pain, because the file-system manipulation scripts are tightly coupled with the database operations (I may also need to worry about database transactions: say in one transaction I update the database but fail to modify the file system, so I need to roll back my database?).
And I think this system doesn't scale (what if my machine runs out of disk space and I need to move the files to a second machine? What if my content is on a cluster?).
I think my real question is:
4. Is there any existing framework/design pattern/db that handles this problem?
What is the standard way of handling this kind of problem?
Thanks in advance for your answers.
I actually asked this same question when I was designing a social website for food chefs. I decided to store the URL of the image in a MySQL database along with the recipe. If you plan on storing multiple images for one recipe, as in my example, maybe a comma-separated value would work. When the recipe loaded on the page, I would fetch the image associated with that recipe and display it on the screen.
Since it was a hackathon and wasn't meant for production purposes, I didn't encode the file name into something unique. However, if I were developing for production purposes, I would append a time-stamp to the media file name when storing it on the server and in the database/backend.
I believe what I've proposed is the best way of handling this scenario. Storing the image on the server is not only faster, but it should also take less space. I have found that when converting a standard JPG file of reasonable resolution to base64 encoding, the encoded text representation took 30% more space. There is also the time spent encoding and decoding the file for storage and retrieval when using some BLOB-type data format instead of storing the file directly on the server.
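For what it's worth, that overhead is easy to measure; a quick Python sketch (the 1 MiB sample size is arbitrary):

import base64
import os

raw = os.urandom(1024 * 1024)              # 1 MiB of arbitrary binary data
encoded = base64.b64encode(raw)
print(len(encoded) / float(len(raw)))      # ~1.33, i.e. roughly a third larger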
Using some sort of backend server scripting like PHP, you'll be able to do some pretty neat stuff with the information you have available. Fetch the result from the database, and load it in from the page using HTML.
As far as I know, there isn't a standard way of fetching media from a database yet. Perhaps there will be one day.
There is no standard way to do this; it differs from application to application. The idea is that you need to generate a different path + file name for every upload. Here is one way, sketched in Python (user_id is whatever identifier your application already has):
import hashlib, random, time

hash_id = hashlib.sha1(('%f%d' % (time.time(), random.randint(1, 1000000))).encode()).hexdigest()
path = '/%s/%s/%s' % (user_id, hash_id[:2], hash_id[-2:])
file_name = hash_id
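The idea behind the two-character sub-directories is to keep any single directory from accumulating millions of files; any short prefix of the hash gives you that kind of fan-out.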

Processing a large (>32mb) xml file over appengine

I'm trying to process large (~50 MB) XML files and store them in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but I keep running into limits (i.e. the 32 MB limit).
So I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_apis, Amazon (or Google Compute, I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow stitch the different split parts back together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
update
Tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of downloading the file, hosting it on another server, then using App Engine to access it via HTTP range requests on backends, and then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write
files to buckets in Google Cloud Storage (GCS). This library supports
reading and writing large amounts of data to GCS, with internal error
handling and retries, so you don't have to write your own code to do
this. Moreover, it provides read buffering with prefetch so your app
can be more efficient.
The GCS client library provides the following functionality:
- An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
- A listbucket method for listing the contents of a GCS bucket.
- A stat method for obtaining metadata about a specific file.
- A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs

def read_file(self, filename):
    # Handler method (self is a webapp2 RequestHandler); filename looks like '/bucket/object'.
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())
    gcs_file.seek(-1024, os.SEEK_END)
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard python!
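Since the file in question is XML, the same incremental read can feed a streaming parser; a rough sketch, assuming the cloudstorage library as above (the bucket path and the 'record' tag are placeholders, and handle_record stands in for whatever per-element processing or datastore writes you do):

import xml.etree.ElementTree as ET
import cloudstorage as gcs

def process_large_xml(filename):
    # Stream-parse the file instead of loading all ~50 MB into memory at once.
    gcs_file = gcs.open(filename)            # e.g. '/my-bucket/big-feed.xml' (placeholder)
    try:
        for event, elem in ET.iterparse(gcs_file):
            if elem.tag == 'record':          # hypothetical element name
                handle_record(elem)           # your own per-element logic goes here
                elem.clear()                  # release parsed elements to keep memory flat
    finally:
        gcs_file.close()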

How to update/write to a text file stored in blobstore 2

I need to do 10,000 datastore reads and 3,000 datastore writes every day, which costs me some money.
My current solution is just to upload a text file to GAE and read the text file in every request.
My text file is
productid--- price--- description---xxx----xxxx-xxxx
However, I also need to write/edit/update the text file. Is that possible?
Is there any advice for me? I don't want to use the datastore.
If you are going to use Blobstore to store your files then you won't be able to modify them since blobs are immutable on Google App Engine:
Blobs can't be modified after they're created, though they can be deleted.
You should use the datastore instead, and more specifically ndb.TextProperty, to store your text files, since there is no length limit and you can easily create/update/delete. Since it's necessary to do all these requests per day, there is nothing you can do about the fact that you will have to pay for it. Just make sure that you are following the best practices, and also take a look at Appstats so you'll be able to monitor your reads/writes.
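A minimal sketch of what that could look like (the model and function names here are made up for illustration):

from google.appengine.ext import ndb

class ProductFile(ndb.Model):            # hypothetical model name
    content = ndb.TextProperty()         # unindexed, no 1500-byte limit like StringProperty

def save_products(file_id, text):
    # Creates the entity on first use, overwrites the text on later calls.
    entity = ProductFile.get_or_insert(file_id)
    entity.content = text
    entity.put()

def load_products(file_id):
    entity = ProductFile.get_by_id(file_id)
    return entity.content if entity else None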
You can use the files API to create blobs. As already noted, you can't edit a blob, but you can do essentially the same thing by creating a new blob with the files API, copying/editing the data from the original blob to the new blob, and then replacing the old blob with the new one.
It works, but it is not ideal. The files API seems to cause a fair number of exceptions so you need to make sure to have good error checking in that part of your code.
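A rough sketch of that copy-and-replace pattern, assuming the (since-deprecated) Files API and a whole-file rewrite; edit_fn here is a placeholder for whatever edit you want to apply:

from google.appengine.api import files
from google.appengine.ext import blobstore

def rewrite_blob(old_blob_key, edit_fn):
    # Read the old blob, transform its text, write a new blob, then drop the old one.
    old_text = blobstore.BlobReader(old_blob_key).read()
    new_name = files.blobstore.create(mime_type='text/plain')
    with files.open(new_name, 'a') as f:
        f.write(edit_fn(old_text))
    files.finalize(new_name)
    new_key = files.blobstore.get_blob_key(new_name)
    blobstore.delete(old_blob_key)
    return new_key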

What is a "[CS Format=A]" header for?

I'm trying to identify a type of file whose contents start with "[CS Format=A]".
I've extracted files from blobs in a database I was handed. I do not have access to the software that created this database. There is a column that I assume signifies compression (it's called COMPRESS). Also in said database were the names of the files and their extensions. I've extracted all the files out of the database and everything works, except that anything marked as COMPRESS is not readable as its own file type (i.e. if it was a PDF before it was stored in this DB, now that I've pulled them all back out it is not parsable as a PDF like the other non-COMPRESS PDFs). When I crack them open and look at them, the first 13 bytes are always "[CS Format=A]" (which I swear I've seen somewhere before, but can't for the life of me remember where), followed by binary data. Magic can't tell me what I'm looking at, and Google is not being very helpful with my very strict search term. These were stored in an MSSQL database before I was given the files, most likely SQL Server 2005 by the time it was pulled.
Probably not helpful, but just to make sure... Oracle will decompress automatically on select.
If it's still compressed afterwards, then you're looking at some 3rd-party component, which could be almost anything, but I'd start by testing Mac/Win first before you run through all the 3rd-party compression tools.
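If it comes down to brute force, a quick Python sketch that strips the 13-byte "[CS Format=A]" prefix and tries a few common codecs (purely a guess that the prefix wraps standard compression):

import bz2
import gzip
import lzma
import zlib

def try_decompress(path):
    # Drop the 13-byte prefix and see whether any common codec accepts the rest.
    with open(path, 'rb') as f:
        payload = f.read()[len('[CS Format=A]'):]
    for name, decompress in [('zlib', zlib.decompress), ('gzip', gzip.decompress),
                             ('bz2', bz2.decompress), ('lzma', lzma.decompress)]:
        try:
            return name, decompress(payload)
        except Exception:
            continue
    return None, None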
