I have a large SQL database (~1 TB) that I'm trying to back up.
I can back it up fine, but we want to store it offsite on Amazon S3, where the maximum object size is 5 GB.
I thought I could split it by backing up to multiple files, but it seems the maximum number of files is 64, so I'm still ending up with ~16 GB chunks, which are too big for S3.
Is there any other way to do it?
The maximum object size for S3 is 5 TB, not 5 GB. 5 GB is only the largest object that can be uploaded with a single HTTP PUT.
All cloud providers follow the same pattern: instead of uploading one huge file and storing it as a single blob, they break it apart into blocks that they replicate across many disks. When you ask for data, the provider retrieves it from all these blocks. To the client though, the blob appears as a single object.
Uploading a large file requires blocks too. Instead of uploading a large file with a single upload operation (HTTP PUT), all providers require that you upload individual blocks and finally notify the provider that these blocks constitute one object. This way you can re-upload only a single failed block after a failure, the provider can commit each block while you send the next, and they don't have to track and lock one huge blob (on a large disk) while waiting for you to finish uploading.
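As a concrete illustration of that block protocol, here is a rough sketch using the low-level multipart calls in boto3 (Python); the bucket, key, file path and part size are placeholders, not anything from the question:
import boto3

# Sketch only: upload a large file as numbered parts, then tell S3 to
# assemble them into a single object. Every part except the last must be
# at least 5 MB.
s3 = boto3.client('s3')
bucket, key, path = 'my-backup-bucket', 'backups/db.bak', '/backups/db.bak'
part_size = 100 * 1024 * 1024  # 100 MB per part

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
with open(path, 'rb') as f:
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload['UploadId'], Body=chunk)
        parts.append({'PartNumber': part_number, 'ETag': resp['ETag']})
        part_number += 1

# The final call is the "these blocks constitute one object" notification.
s3.complete_multipart_upload(Bucket=bucket, Key=key,
                             UploadId=upload['UploadId'],
                             MultipartUpload={'Parts': parts})
If any single upload_part call fails, you retry just that part and the rest of the transfer is unaffected.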
In your case, you'll have to use an uploader that understands cloud storage and uses multiple blocks (perhaps something like Cyberduck, or an S3-specific command-line tool), or write a utility that uses Amazon's SDK to upload the backup file in parts.
Amazon's documentation site offers examples for multipart uploads under Uploading Objects Using Multipart Upload API. The high-level examples demonstrate various ways to upload a large file; all of the calls use multipart uploads under the hood. For example, the simplest call:
var client = new AmazonS3Client(Amazon.RegionEndpoint.USEast1);
var fileTransferUtility = new TransferUtility(client);
fileTransferUtility.Upload(filePath, existingBucketName);
will upload the file using multiple parts and use the file's path as its key. The most advanced example allows you to specify the part size, a different key, redundancy options, etc.:
var fileTransferUtilityRequest = new TransferUtilityUploadRequest
{
    BucketName = existingBucketName,
    FilePath = filePath,
    StorageClass = S3StorageClass.ReducedRedundancy,
    PartSize = 6291456, // 6 MB.
    Key = keyName,
    CannedACL = S3CannedACL.PublicRead
};
fileTransferUtilityRequest.Metadata.Add("param1", "Value1");
fileTransferUtilityRequest.Metadata.Add("param2", "Value2");
fileTransferUtility.Upload(fileTransferUtilityRequest);
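If you'd rather script this in Python, boto3's high-level transfer manager is the rough equivalent of TransferUtility; a sketch (the bucket, key and path names are placeholders, not from the question):
import boto3
from boto3.s3.transfer import TransferConfig

# upload_file switches to multipart automatically once the file size
# crosses multipart_threshold, retrying individual parts as needed.
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024)
s3 = boto3.client('s3')
s3.upload_file('/backups/db.bak', 'my-backup-bucket', 'backups/db.bak',
               Config=config)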
Hello, I am new to Azure Logic Apps. Currently I am running into a problem with PDF compression.
My problem is that a few files are already stored in a data lake, and I want to check their size; if a file exceeds 20 MB, I need to compress it and replace the original file in the data lake with the compressed one.
First I fetch the file from the data lake and get its content. Then I get the file's metadata and extract its size. If the size is greater than 20 MB, I compress it.
I am using a list of file sizes for the "True" branch. My files are usually more than 150 MB, and I feel this is the prime factor in the compressor's failure.
You can use a 3rd-party connector to do this, called Compress PDF document, from the Plumsail connector. All you need to do is log in to Plumsail and get the API key for creating the connection. In the Logic App flow, per your requirement, we used When a blob is added or modified (properties only) (V2) and check whether the size is greater than 20 MB using a Condition connector. If yes, we use Plumsail's Compress PDF document, and then you can save the file in the same folder with a different name, or save it in another folder of the same storage. For instance, I'm saving it in a container called documents >20mb. Here are a few screenshots of my Logic App for your reference:
(Screenshots: the Logic App run result, and the storage account contents before and after compression.)
Scenario where the compressed file is still more than 20 MB
If the file is so large that even after compression it is still greater than 20 MB, you can save it in the same folder with a different name and then delete the original once the process is complete.
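If you'd rather do the same check in code instead of a Logic App (for example from an Azure Function or a small job), a hypothetical sketch with the azure-storage-blob Python SDK might look like the following; the connection string, container name and compress_pdf_bytes() are placeholders, and the actual compression could be done with pikepdf, Ghostscript, or Plumsail's API:
from azure.storage.blob import BlobServiceClient

SIZE_LIMIT = 20 * 1024 * 1024  # 20 MB

def compress_pdf_bytes(data):
    # Placeholder: plug in whatever PDF compressor you use.
    raise NotImplementedError

service = BlobServiceClient.from_connection_string('<connection-string>')
container = service.get_container_client('documents')

for props in container.list_blobs():
    if props.name.lower().endswith('.pdf') and props.size > SIZE_LIMIT:
        original = container.download_blob(props.name).readall()
        compressed = compress_pdf_bytes(original)
        # Overwrite in place, or upload under a new name/container instead.
        container.upload_blob(props.name, compressed, overwrite=True)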
I have a 2 GB TensorFlow model that I'd like to add to a Flask project I have on App Engine, but I can't seem to find any documentation stating that what I'm trying to do is possible.
Since App Engine doesn't allow writing to the file system, I'm storing my model's files in a Google Cloud Storage bucket and attempting to restore the model from there. These are the files there:
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
checkpoint
Working locally, I can just use
import tensorflow as tf

with tf.Session() as sess:
    logger.info("Importing model into TF")
    saver = tf.train.import_meta_graph('model.ckpt.meta')
    saver.restore(sess, 'model.ckpt')
The model is loaded into memory using Flask's @before_first_request.
Once it's on App Engine, I assumed I could do this:
blob = bucket.get_blob('blob_name')
filename = os.path.join(model_dir, blob.name)
blob.download_to_filename(filename)
Then do the same restore. But App Engine won't allow it.
Is there a way to stream these files into Tensorflow's restore functions so the files don't have to be written to the file system?
After some tips from Dan Cornilescu and digging into it, I found that TensorFlow builds the MetaGraphDef with a function called ParseFromString, so here's what I ended up doing:
import tensorflow as tf
from google.cloud import storage
from tensorflow import MetaGraphDef

client = storage.Client()
bucket = client.get_bucket(Config.MODEL_BUCKET)
blob = bucket.get_blob('model.ckpt.meta')

# Download the serialized graph and parse it into a MetaGraphDef protobuf
# in memory instead of writing anything to disk.
model_graph = blob.download_as_string()
mgd = MetaGraphDef()
mgd.ParseFromString(model_graph)

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(mgd)
I didn't actually use TensorFlow; the answer is based on the docs and GAE-related knowledge.
In general, using GCS objects as files in GAE (to work around the lack of writable filesystem access) relies on one of two alternate approaches, instead of just passing a filename to be directly read/written by your app code and/or any 3rd-party utility/library it may be using, which can't be done with GCS objects:
using an already-open file-like handle for reading/writing the data from/to GCS, which your app would obtain from either:
the open call from a GCS client library instead of the generic one typically used for a regular filesystem. See, for example Write a CSV to store in Google Cloud Storage or pickling python objects to google cloud storage
some in-memory faking of a file, using something like StringIO or io.BytesIO; see How to zip or tar a static folder without writing anything to the filesystem in python?. The in-memory fake file also gives easy access to the raw data in case it needs to be persisted in GCS (see below, and the sketch after this list).
directly using or producing just the respective raw data, which your app would be entirely responsible for actually reading from/writing to GCS (again using a GCS client library's open calls); see How to open gzip file on gae cloud?
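A minimal sketch of the in-memory approach from the second bullet above (bucket and object names are placeholders):
import io

from google.cloud import storage

# Pull the object's bytes into an in-memory buffer that behaves like a file,
# so libraries expecting a file handle never touch the local filesystem.
client = storage.Client()
bucket = client.get_bucket('my-model-bucket')   # placeholder bucket
blob = bucket.get_blob('model.ckpt.meta')       # placeholder object
fake_file = io.BytesIO(blob.download_as_string())

raw_bytes = fake_file.read()  # or hand fake_file to any file-consuming API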
In your particular case it seems the tf.train.import_meta_graph() call supports passing a MetaGraphDef protocol buffer (i.e. raw data) instead of the filename from which it should be loaded:
Args:
meta_graph_or_file: MetaGraphDef protocol buffer or filename (including the path) containing a MetaGraphDef.
So restoring models from GCS should be possible, something along these lines:
import cloudstorage

with cloudstorage.open('gcs_path_to_meta_graph_file', 'r') as fd:
    meta_graph = fd.read()

# and later (the raw bytes may still need to be parsed into a MetaGraphDef
# first, as shown in the solution above):
saver = tf.train.import_meta_graph(meta_graph)
However, from a quick doc scan, saving/checkpointing the models back to GCS may be tricky; save() seems to want to write the data to disk itself. But I didn't dig too deep.
Locally I am successfully able to (in a task):
Open the csv
Scan through each line (using Scanner.Scan)
Map the parsed CSV line to my desired struct
Save the struct to datastore
I see that Blobstore has a reader that would allow me to read the value directly using a streaming, file-like interface, but that seems to have a limit of 32 MB. I also see there's a bulk upload tool, bulk_uploader.py, but it won't do all the data-massaging I require, and I'd like to limit the writes (and really the cost) of this bulk insert.
How would one effectively read and parse a very large (500 MB+) CSV file without the benefit of reading from local storage?
You will need to look at the following options and see if they work for you:
Looking at the large file size, you should consider using Google Cloud Storage for the file. You can use the command line utilities that GCS provides to upload your file to your bucket. Once uploaded, you can look at using the JSON API directly to work with the file and import it into your datastore layer. Take a look at the following: https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples
If this is a one-time import of a large file, another option could be spinning up a Google Compute Engine VM, writing an app there to read from GCS and pass the data on in smaller chunks to a service running in App Engine Go, which can then accept and persist the data.
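To make the first option more concrete, here is a rough sketch of the incremental-read-plus-batched-write pattern. It's written in Python (the GAE GCS client library and ndb) purely for illustration; the Go storage client exposes an equivalent streaming reader, and the model and field names below are placeholders:
import csv

import cloudstorage
from google.appengine.ext import ndb

class Record(ndb.Model):            # placeholder kind for the parsed rows
    name = ndb.StringProperty()
    value = ndb.StringProperty()

def import_csv(gcs_path, batch_size=500):
    # Stream the object line by line instead of loading 500 MB into memory,
    # and buffer entities so each put_multi() call carries a full batch.
    batch = []
    with cloudstorage.open(gcs_path) as gcs_file:
        for row in csv.reader(iter(gcs_file.readline, '')):
            batch.append(Record(name=row[0], value=row[1]))
            if len(batch) >= batch_size:
                ndb.put_multi(batch)
                batch = []
    if batch:
        ndb.put_multi(batch)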
Not the solution I hoped for, but I ended up splitting the large files into 32 MB pieces, uploading each to blob storage, then parsing each in a task.
It ain't pretty. But it took less time than the other options.
I'm trying to process large (~50 MB) XML files and store them in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even uploading the file directly within my source code, but I keep running into limits (i.e. the 32 MB limit).
So I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_api, Amazon (or Google Compute, I guess), and a security/setup nightmare...
HTTP ranges were another thing I considered, but it'll be painful to somehow connect the different split parts together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
Update
Tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of either downloading the file, hosting it on another server, then using App Engine to access it via HTTP range requests on backends, and then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python, anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os

import cloudstorage as gcs

def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())  # read just the first line
    gcs_file.seek(-1024, os.SEEK_END)         # then jump to the last 1 KB
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard Python!
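Since the original question was about XML rather than CSV, here is a minimal sketch of the same incremental idea applied to a large XML file; it assumes the GCS client library quoted above, and the element name and handle_record() helper are placeholders:
import xml.etree.ElementTree as ET

import cloudstorage as gcs

def process_large_xml(gcs_path):
    # iterparse() consumes the file-like GCS handle element by element,
    # so the whole ~50 MB document never has to sit in memory at once.
    with gcs.open(gcs_path) as gcs_file:
        for _, elem in ET.iterparse(gcs_file, events=('end',)):
            if elem.tag == 'record':      # placeholder element name
                handle_record(elem)       # placeholder: e.g. save to datastore
                elem.clear()              # free memory as you go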
I am trying to upload data to Google App Engine (using GWT). I am using the FileUploader widget and the servlet uses an InputStream to read the data and insert directly to the datastore. Running it locally, I can upload large files successfully, but when I deploy it to GAE, I am limited by the 30 second request time. Is there any way around this? Or is there any way that I can split the file into smaller chunks and send the smaller chunks?
By using the Blobstore you get a 1 GB size limit and a special handler, unsurprisingly called BlobstoreUploadHandler, that shouldn't give you timeout problems on upload.
Also check out http://demofileuploadgae.appspot.com/ (sourcecode, source answer) which does exactly what you are asking.
Also, check out the rest of GWT-Examples.
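For reference, a minimal sketch of the Blobstore upload flow mentioned above, written with the Python API for brevity (the Java BlobstoreService follows the same create-upload-URL / upload-handler pattern); the handler paths are placeholders:
import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class ServeForm(webapp2.RequestHandler):
    def get(self):
        # Blobstore hands you a one-shot URL; the file is streamed into
        # Blobstore before your handler runs, so the request deadline does
        # not apply to the transfer itself.
        upload_url = blobstore.create_upload_url('/upload')
        self.response.write(
            '<form action="%s" method="POST" enctype="multipart/form-data">'
            '<input type="file" name="file"><input type="submit"></form>'
            % upload_url)

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        blob_info = self.get_uploads('file')[0]
        self.redirect('/serve/%s' % blob_info.key())

app = webapp2.WSGIApplication([('/', ServeForm), ('/upload', UploadHandler)])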
Currently, GAE imposes a limit of 10 MB on file uploads (and on response size), as well as 1 MB limits on many other things; so even if you had a network connection fast enough to pump more than 10 MB within a 30-second window, that would be to no avail. Google has said (I heard Guido van Rossum mention it yesterday here at PyCon Italia Tre) that it has plans to overcome these limitations in the future (at least for users of GAE who pay per use to exceed quotas; I'm not sure whether the plans extend to users who are not paying and generally need to accept smaller quotas to get their free use of GAE).
You would need to do the upload to another server; I believe that the 30-second timeout cannot be worked around. If there is a way, please correct me! I'd love to know how!
If your request is running out of request time, there is little you can do. Maybe your files are too big, and you will need to chunk them on the client (with something like Flash or Java, or an upload framework like Plupload).
Once you get the file to the application, there is another issue: the datastore limitations. Here you have two options:
you can use the Blobstore service, which has quite a nice API for handling uploads of up to 50 megabytes
you can use something like bigblobae, which can store virtually unlimited-size blobs in the regular App Engine datastore.
The 30-second response time limit only applies to code execution, so the uploading of the actual file as part of the request body is excluded from it. The timer only starts once the request has been fully sent to the server by the client and your code starts handling the submitted request. Hence it doesn't matter how slow your client's connection is.
See also: Uploading file on Google App Engine using Datastore and 30 sec response time limitation
The closest you could get would be to split it into chunks as you store it in GAE and then when you download it, piece it together by issuing separate AJAX requests.
I would agree with chunking the data into smaller blobs and having two tables: one containing the metadata (filename, size, number of downloads, etc.) and the other containing the chunks, which are associated with the metadata table by a foreign key. I think it is doable...
Or, when you have uploaded all the chunks, you can simply put them together into one blob held in a single table.
But the problem is that you will need a thick client to handle the chunking, like a Java applet, which needs to be signed and trusted by your clients so it can access the local file system.
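A rough sketch of that two-kind chunking model, using GAE's Python ndb purely for illustration (kind, property and size names are placeholders):
from google.appengine.ext import ndb

class FileMetadata(ndb.Model):
    filename = ndb.StringProperty()
    size = ndb.IntegerProperty()
    num_chunks = ndb.IntegerProperty()

class FileChunk(ndb.Model):
    # Parented under the FileMetadata entity; index keeps the chunks ordered.
    index = ndb.IntegerProperty()
    data = ndb.BlobProperty()  # keep each chunk under the 1 MB entity limit

def save_file(filename, payload, chunk_size=900 * 1024):
    meta = FileMetadata(filename=filename, size=len(payload))
    meta_key = meta.put()
    pieces = [payload[i:i + chunk_size]
              for i in range(0, len(payload), chunk_size)]
    for i, piece in enumerate(pieces):
        FileChunk(parent=meta_key, index=i, data=piece).put()
    meta.num_chunks = len(pieces)
    meta.put()
    return meta_key

def load_file(meta_key):
    chunks = FileChunk.query(ancestor=meta_key).order(FileChunk.index).fetch()
    return b''.join(chunk.data for chunk in chunks)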