I have to upload large files to an AWS S3 bucket, each approximately 500,000 lines long and around 30 MB. What would be the best way to do this, taking speed and the number of calls into account?
You can upload large objects using the AWS SDK. For example, if you are using Java, you can upload large files to an Amazon S3 bucket with createMultipartUpload().
To see an example of how to use this method, see:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/s3/src/main/java/com/example/s3/S3ObjectOperations.java
You can try uploading these files to S3 either from the console or programmatically. If you face speed-related issues and the upload takes a long time, you should consider using S3 multipart upload, which uploads a file in multiple parts. There are several factors to consider when you use multipart upload; go through them before using it:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
From my point of view, uploading a 30 MB file won't be an issue; you can try a direct upload first.
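If you do end up needing multipart uploads from code, here is a minimal sketch using Python and boto3's high-level transfer API, which switches to multipart automatically above a configurable threshold (the bucket and file names are placeholders):
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Switch to multipart above 8 MB and upload up to 4 parts in parallel
config = TransferConfig(multipart_threshold=8 * 1024 * 1024,
                        multipart_chunksize=8 * 1024 * 1024,
                        max_concurrency=4)

# upload_file handles splitting the file into parts, retries, and reassembly
s3.upload_file('large_file.csv', 'my-bucket', 'uploads/large_file.csv', Config=config)
For a single 30 MB object this keeps the call count low either way: one PUT for a direct upload, or a handful of part uploads if the threshold kicks in.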
I need to display/stream large video files in React. These files are being uploaded to a private S3 bucket by users via a React form and Flask.
I tried the getObject method, but my file size is too large. The signed URL method required me to download the file.
I am new to the AWS/Python/React setup. What is the best/most efficient/least costly approach to display large video files in React?
AWS offers other streaming-specific services, but if you really want to get them off S3 you could retrieve the files as a torrent, which, with the right client/video player, would allow you to start playing them without having to download the whole file.
Since you mentioned you're using Python, you could do this with the AWS SDK (boto3) like so:
import boto3

s3 = boto3.client('s3')
# Request the .torrent file for the object rather than the object itself
response = s3.get_object_torrent(
    Bucket='my_bucket',
    Key='/some_prefix/my_video.mp4'
)
The response object will have this format:
{
    'Body': StreamingBody()
}
Full docs here.
Then you could use something like webtorrent to stream it on the frontend.
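Before handing it to the frontend, the backend needs to get the torrent metadata out of the StreamingBody. A small continuation of the sketch above (the output filename is a placeholder):
# Save the returned torrent metadata so a BitTorrent-capable client can use it
with open('my_video.torrent', 'wb') as f:
    f.write(response['Body'].read())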
Two things to note about this approach (quoting docs):
Amazon S3 does not support the BitTorrent protocol in AWS Regions launched after May 30, 2016.
You can only get a torrent file for objects that are less than 5 GBs in size.
I've recently been using filepond for an enterprise web application that allows end users to upload a maximum of 1,500 images (medium size, avg 200 KB, max 500 KB).
There is a very low degree of backend processing once an image is uploaded, other than its temporary storage in a database. We later perform asynchronous processing, picking the files up from that temporary storage. But the current challenge we are seeing is that browser request serialization is stretching the upload to as long as 2 hours! We've been able to bring this down to close to 1 hour by increasing the max parallel uploads in filepond, but this is still far from acceptable (the target is 20 min), and we still see the serialization occurring in Chrome DevTools with such a volume of images being uploaded.
With this in mind, I'm currently looking for a filepond plugin to zip the dropped files and then upload a single archive to the backend, without the user having to do that themselves. I couldn't find anything related on filepond's plugins page, and most plugins listed there seem to be related to image transformation. Hopefully the jszip library could do the trick. Am I on the right track? Any further suggestions?
Other things on the radar our team is exploring:
creating multiple DNS endpoints to increase the number of parallel requests by the browser;
researching CDN services alternatives;
Thanks a bunch!
What is the best practice for uploading images and videos into S3 buckets? In my use case, users can upload their videos and images, and I have to store them reliably (without data loss) in an S3 bucket. I read some related posts but could not find a good solution. I am using React JS and have to upload from React JS code. Each video can be more than 200 MB, so I am worried about how to send those videos to the S3 bucket quickly and efficiently. Please suggest a good approach to overcome this problem.
Thanks in advance!!!
S3 will not lose your data. If you receive a 200 response from S3, you can be confident no data will be lost.
As for best practices, you should use PUT Object for files that are smaller than 5 GB. You can also use POST Object to allow your users to upload files directly from the browser. The 5 GB size limit still applies in the case of POST Object.
Once you reach the 5 GB limit, your only choice is to use S3's multipart upload API. Using the multipart upload API, you can upload files of up to 5 TB in size.
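If you go the browser-upload route, one common pattern is to have your backend generate a presigned POST that the React app then submits the file against. A minimal sketch with boto3 on the server side; the bucket name, key, and expiry are placeholders:
import boto3

s3 = boto3.client('s3')

# Generate a presigned POST policy so the browser uploads straight to S3
# and the video bytes never pass through your own server
presigned = s3.generate_presigned_post(
    Bucket='my-upload-bucket',
    Key='videos/my_video.mp4',
    Conditions=[['content-length-range', 0, 5 * 1024 * 1024 * 1024]],  # stay within the 5 GB POST limit
    ExpiresIn=3600
)

# presigned['url'] and presigned['fields'] go back to the frontend,
# which POSTs them as multipart/form-data together with the file
The content-length-range condition keeps uploads within the single-request 5 GB limit mentioned above; anything larger than that would need the multipart upload API instead.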
On my appspot website, I use a third party API to query a large amount of data. The user then downloads the data in CSV. I know how to generate a csv and download it. The problem is that because the file is huge, I get the DeadlineExceededError.
I have tried increasing the fetch deadline to 60 seconds (urlfetch.set_default_fetch_deadline(60)). It doesn't seem reasonable to increase it any further.
What is the appropriate way to tackle this problem on Google App Engine? Is this something where I have to use Task Queue?
Thanks.
DeadlineExceededError means that your incoming request took longer than 60 seconds; it is not about your UrlFetch call.
Deploy the code that generates the CSV file into a different module that you set up with basic or manual scaling. The URL to download your CSV will become http://module.domain.com
Requests can run indefinitely on modules with basic or manual scaling.
Alternatively, consider creating the file dynamically in Google Cloud Storage (GCS) with your CSV content. At that point, the file resides in GCS and you have the ability to generate a URL from which users can download the file directly. There are also other options for different auth methods.
You can see documentation on doing this at
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/
and
https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/functions
Important note: do not use the Files API (which was a common way of dynamically creating files in Blobstore/GCS), as it has been deprecated. Use the Google Cloud Storage Client API referenced above instead.
Of course, you can delete the generated files after they've been successfully downloaded and/or you could run a cron job to expire links/files after a certain time period.
Depending on your specific use case, this might be a more effective path.
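As a rough sketch of that approach, assuming the App Engine Google Cloud Storage Client library (the cloudstorage package) and placeholder bucket/object names:
import cloudstorage as gcs

def write_csv_to_gcs(csv_rows):
    # Write the generated CSV into a GCS object instead of the response body
    filename = '/my-bucket/exports/report.csv'
    with gcs.open(filename, 'w', content_type='text/csv') as gcs_file:
        for row in csv_rows:
            gcs_file.write(row + '\n')
    return filename
Once the object exists, you can point the user at it (for example via the public URL https://storage.googleapis.com/my-bucket/exports/report.csv if you make it readable, or a link you generate with your preferred auth method), so the CSV generation no longer has to finish inside the 60-second request deadline.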
Is it possible to copy images into a static directory under my app engine project domain?
For example, when a user signs up for my app, I want them to supply an image for themselves, and I would copy it to a static directory but rename the image using their username, like:
www.mysite.com/imgs/username.jpg
www.mysite.com/imgs/john.jpg
www.mysite.com/imgs/jane.jpg
but I don't know where to start with this, since the JDO API doesn't really deal with this sort of thing (I think that with JDO, they'd want me to store the image data as a blob associated with my User objects). Can I just upload the images to a static directory like this?
Thanks
No. App Engine has a provision for static files, but only for static files you upload along with your code. If users can upload the data, it is not really "static" in the App Engine context. Depending on how large a picture you want users to be able to upload, you will want to use either the regular datastore (for storing up to 1 MB) or the Blobstore for bigger files (up to 2 GB).
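To make that concrete, here is a minimal sketch of the Blobstore upload/serve flow, shown in Python for brevity (the Java Blobstore API follows the same createUploadUrl/serve pattern); save_blob_key_for_user and lookup_blob_key_for_user are hypothetical stand-ins for your own user/datastore lookup:
import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # The form posts to a URL from blobstore.create_upload_url('/upload');
        # the image lands in Blobstore and we keep its key with the user record
        blob_info = self.get_uploads('file')[0]
        save_blob_key_for_user(self.request.get('username'), blob_info.key())  # hypothetical helper
        self.redirect('/profile')

class ImageHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, username):
        # Serve /imgs/<username>.jpg by streaming the blob stored for that user
        self.send_blob(lookup_blob_key_for_user(username))  # hypothetical helper

app = webapp2.WSGIApplication([
    ('/upload', UploadHandler),
    (r'/imgs/(.+)\.jpg', ImageHandler),
])
The /imgs/username.jpg URLs then behave like the static paths in the question, even though the bytes actually live in Blobstore.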
I'm almost certain you need to use the Blobstore for dynamic uploads. Even if you don't strictly need to, you probably want to for reasons of session independence. As Blobstore operations are expensive relative to a static file, you could have a task queue move the (now static) images into static storage.