I have a large dataset of over 1,000 .csv files, each quite large (approximately 700 MB), and I had scheduled it for upload to DocumentDB with AWS Glue. After 48 hours the job timed out. I know some of the data made it in before the timeout, but some was left out.
I haven't tried anything yet because I am not sure where to go from here. I only want one copy of the data in DocumentDB, and if I simply re-upload everything I will likely end up with roughly 1.5 times the amount of data. I also know the connection was not at fault, because I saw the DocumentDB CPU spike and confirmed that data was arriving.
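One way to make a re-upload safe is to write idempotently, so rows that already landed are overwritten rather than duplicated. This is only a rough sketch, and it assumes each CSV row has a natural key column (here called record_id, a made-up name) that can back a unique index; DocumentDB speaks the MongoDB wire protocol, so pymongo upserts work against it:

import csv
from pymongo import MongoClient, UpdateOne

# Placeholder connection string - add the usual DocumentDB TLS/CA options you already use in Glue.
client = MongoClient("mongodb://user:pass@my-docdb-cluster:27017/")
coll = client["mydb"]["records"]

# A unique index on the natural key makes duplicates impossible.
coll.create_index("record_id", unique=True)

def load_csv(path):
    ops = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Upsert: insert if missing, overwrite if it already made it in before the timeout.
            ops.append(UpdateOne({"record_id": row["record_id"]}, {"$set": row}, upsert=True))
            if len(ops) >= 1000:            # flush in batches
                coll.bulk_write(ops, ordered=False)
                ops = []
    if ops:
        coll.bulk_write(ops, ordered=False)

With something like that in place you can re-run the load over all 1,000+ files and still end up with exactly one copy of each row.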
We're using an AWS Lightsail PostgreSQL database. We've been experiencing errors with our C# application timing out when connecting to the database. As I was trying to debug the issue, I went to look at the metric graphs in AWS. I noticed that many of the graphs have frequent gaps in the data, labeled "No data available". See image below.
This graph (and most of the other metrics) shows frequent gaps in the data. I'm trying to understand whether this is normal or could be a symptom of the problem. If I go back to a 2-week timescale, there does not appear to be any other strange behavior in any of the metric data. For example, I do not see a point in the past where CPU or memory usage went crazy. The issue started happening about a week ago, so I was hoping the metrics would have helped explain why the connections to the PostgreSQL database are failing from C#.
So I guess my question is: are those frequent gaps of "No data available" normal for an AWS Lightsail PostgreSQL database?
Other Data about the machine:
1 GB RAM, 1 vCPU, 40 GB SSD
PostgreSQL database (12.11)
In the last two weeks (the average metrics show):
CPU utilization has never gone over 20%
Database connections have never gone over 35 (usually less than 5) (actually, usually 0)
Disk queue depth never goes over 0.2
Free storage space hovers around 36.5 GB
Network receive throughput is mostly less than 1 kB/s (with one spike to 141 kB/s)
Network transmit throughput is mostly less than 11 kB/s, with all spikes less than 11.5 kB/s
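One way to check whether those gaps are genuinely missing samples (rather than a rendering quirk) is to pull the raw data points behind the graphs with the Lightsail API. A rough sketch with boto3; the region and database name are placeholders:

import boto3
from datetime import datetime, timedelta, timezone

lightsail = boto3.client("lightsail", region_name="us-east-1")   # adjust region

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

resp = lightsail.get_relational_database_metric_data(
    relationalDatabaseName="my-postgres-db",    # placeholder name
    metricName="CPUUtilization",
    period=300,                                 # 5-minute buckets
    startTime=start,
    endTime=end,
    unit="Percent",
    statistics=["Average"],
)

points = sorted(resp["metricData"], key=lambda p: p["timestamp"])
print(len(points), "data points returned")
# Gaps show up as missing timestamps rather than zero values.
for prev, cur in zip(points, points[1:]):
    if (cur["timestamp"] - prev["timestamp"]).total_seconds() > 300:
        print("gap between", prev["timestamp"], "and", cur["timestamp"])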
I would love to view the AWS logs, but they only go back a month, and when trying to view them they are filled with checkpoint starting/complete entries. Each page update only moves me about 2 hours forward in time (and takes ~6 seconds to fetch the logs). That would require roughly 360 page updates, and when I tried, my auth timed out. 😢
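Rather than paging through the console, the log events can also be pulled programmatically and filtered locally (e.g. dropping the checkpoint noise). A rough sketch with boto3; the names are placeholders, and the exact log stream name can be listed with get_relational_database_log_streams:

import boto3
from datetime import datetime, timedelta, timezone

lightsail = boto3.client("lightsail", region_name="us-east-1")   # adjust region

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)
token = None
while True:
    kwargs = dict(
        relationalDatabaseName="my-postgres-db",   # placeholder name
        logStreamName="postgresqllog",             # verify with get_relational_database_log_streams
        startTime=start,
        endTime=end,
        startFromHead=True,
    )
    if token:
        kwargs["pageToken"] = token
    resp = lightsail.get_relational_database_log_events(**kwargs)
    for event in resp.get("resourceLogEvents", []):
        if "checkpoint" not in event["message"]:   # skip the checkpoint spam
            print(event["createdAt"], event["message"])
    next_token = resp.get("nextForwardToken")
    if not next_token or next_token == token:
        break
    token = next_token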
So we never figured out the reason why, but this seems like it was a problem with the AWS Lightsail DB instance itself. We ended up using a snapshot to create a new clone of the DB and pointing the C# servers at the new DB. The latency issues we were having disappeared and the metric graphs looked normal (without the strange gaps).
I wish we were able to figure out the root of the problem. ATM, we are just hoping the problem does not return.
When in doubt, clone everything! 🙃
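For anyone who wants to script that same workaround, the snapshot-and-clone step can also be driven through the API. A rough sketch with boto3; all names and the bundle ID are placeholders, and the snapshot has to reach the available state before the clone can be created:

import boto3

lightsail = boto3.client("lightsail", region_name="us-east-1")   # adjust region

# Snapshot the misbehaving database...
lightsail.create_relational_database_snapshot(
    relationalDatabaseName="my-postgres-db",                     # placeholder
    relationalDatabaseSnapshotName="my-postgres-db-snap-1",      # placeholder
)

# ...then, once the snapshot is available, create a fresh instance from it.
lightsail.create_relational_database_from_snapshot(
    relationalDatabaseName="my-postgres-db-clone",               # placeholder
    relationalDatabaseSnapshotName="my-postgres-db-snap-1",
    bundleId="micro_2_0",                                        # placeholder bundle
    publiclyAccessible=False,
)
# Finally, point the C# connection strings at the clone.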
I'm using rclone to transfer data between a MinIO bucket and a shared storage. I'm migrating a store, and the amount of data is around 200 GB of product pictures. Every single picture has its own folder/path, so there are a lot of folders that need to be created too. rclone is installed on the new server and the storage is connected to the server via SAN. The transfer has been running for over a week and we are at 170 GB right now. Everything works, but it is really slow in my opinion. Is it normal that a transfer out of a bucket into a classic filesystem is that slow?
(Doing the math, the speed is only about 2.3 Mbps. I am honestly not going to pay anything for that speed.)
Perhaps you should break the issue down and diagnose it part by part. Below are several common culprits for slow transfers (generally speaking, for any file transfer):
First of all, networks and file systems are usually not performant with lots of small files, so to isolate the issue, upload a bigger file (1 GB+) to MinIO first. For each step below, test with the big file first.
Is the speed of the source fast enough? Try copying the files from MinIO to local storage or a ramdisk (/tmp is usually tmpfs and therefore stored in RAM; use mount to check).
Is the speed of the destination fast enough? Try dd or another disk performance testing utility.
Is the network latency to the source high? Try pinging or curling the API (with timing).
Is the latency to the destination high? The destination here is SAN storage, so try iostat and look at the device wait times.
Maybe the CPU is the bottleneck? Encoding and decoding take quite a lot of computing power. Try top while a copy is running.
Again, try these steps with the big file and the many small files separately. There is a good chance that the small, fragmented files are the issue. If that is the case, I would look for the concurrency options in rclone - a rough timing sketch follows below.
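To put a number on the small-file overhead, you can time one large object against a batch of small ones straight from the MinIO Python SDK, bypassing rclone entirely. This is only a rough sketch - the endpoint, credentials, bucket name and the bigtest.bin object are placeholders you would substitute:

import os
import time
from minio import Minio

client = Minio("minio.example.com:9000",       # placeholder endpoint
               access_key="ACCESS_KEY",
               secret_key="SECRET_KEY",
               secure=True)

BUCKET = "products"                            # placeholder bucket name

def timed_get(object_name, dest):
    start = time.time()
    client.fget_object(BUCKET, object_name, dest)
    return time.time() - start

# One big object (upload a ~1 GB test object first): measures raw throughput from MinIO.
print("big file:", timed_get("bigtest.bin", "/tmp/bigtest.bin"), "s")

# A batch of small objects: measures per-object overhead (HTTP round trips, directory creation).
start = time.time()
count = 0
for obj in client.list_objects(BUCKET, recursive=True):
    dest = os.path.join("/tmp/small", obj.object_name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    client.fget_object(BUCKET, obj.object_name, dest)
    count += 1
    if count >= 500:
        break
print(count, "small files:", time.time() - start, "s")

If the big file is fast and the small files crawl, per-file overhead is the problem, and raising rclone's --transfers/--checkers (as in the answer below) should help.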
I had the same problem copying hundreds of thousands of small files from an S3-compatible storage to local storage. Originally I was using s3fs + rsync. Very (very) slow, and it was getting stuck on the largest folders. Then I discovered rclone and finished the migration within a few hours with these parameters:
rclone copy source:/bucket /destination/folder --checkers 256 --transfers 256 --fast-list --size-only --progress
Explanation of the options (from https://rclone.org/flags/)
--checkers 256 Number of checkers to run in parallel (default 8)
--transfers 256 Number of file transfers to run in parallel (default 4)
--fast-list Use recursive list if available; uses more memory but fewer transactions
--size-only Skip based on size only, not mod-time or checksum (wouldn't apply in your case if copying to an empty destination)
--progress Show progress during transfer
I want to architect a system that will allow thousands of users to upload images from a tablet to a content management system. In one upload, each user can upload up to 12 images at a time, and there could be up to 20,000 uploads per day. As that is fewer than 240,000 images per day, I've been wondering what the best approach is to avoid bottlenecking during peak times.
I'm thinking along the lines of using a web server farm (IIS) to upload the images through HTTP POST. Each image is less than 200 kB, and I could store the images on a file system. That would be 48 GB per day and only about 17.5 TB per year.
Then I could store the image metadata in a SQL Server DB along with other textual data. At a later time, the users will want to recall the images and other (text) data from the DB to the tablet for further processing.
On a small scale this is no problem, but I'm interested in what everyone thinks is the best approach for uploading/retrieving such a large number of images/records per day?
I've been wondering what the best approach is to avoid bottlenecking during peak times.
Enough hardware. Period.
I'm thinking along the lines of using a web server farm (IIS) to upload the images through HTTP POST.
There is no alternative to that worth mentioning.
This would be 48 GB per day and only about 17.5 TB per year.
Yes. Modern storage is just fantastic ;)
Then I could store the image metadata in SQL Server DB along with other textual data.
Which makes this a pretty small database - which is good. In the end that means the problem comes down to the image storage; the database is not really that big.
On a small scale this is no problem, but I'm interested in what everyone thinks is the best approach for uploading/retrieving such a large number of images/records per day?
I am not sure you are on a large scale yet. Problems will be around:
Number of files. You need to split them across multiple folders, and ideally have the concept of buckets in the database, so you can split them into multiple buckets, each being its own server(s) - good for long-term maintenance (see the sketch after this list).
Backup / restore is a problem, but a lot less so when you use (a) tapes and (b) buckets as described above - the chance of a total failure is tiny. Also, "3-4 copies on separate machines" may work well enough.
Apart from the bucket problem - i.e. you cannot put all those files into one simple folder, that will be seriously unwieldy - you are totally fine. This is not exactly super big. Keep the web level stateless so you can scale it, same on the storage backend, then use the database to tie it all together, and make sure you do FREQUENT database backups (like every 15 minutes).
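Here is a minimal sketch of that folder-splitting / bucket idea, assuming images are identified by a GUID-like id; the two-level hash prefix keeps any single directory small, and the bucket number plus relative path are what you would store in the SQL Server row next to the rest of the metadata (all names here are made up):

import hashlib
import os

NUM_BUCKETS = 16                  # each bucket could later live on its own file server
STORAGE_ROOT = "/data/images"     # placeholder root path

def image_location(image_id: str) -> tuple[int, str]:
    # Map an image id to (bucket number, relative path).
    digest = hashlib.sha1(image_id.encode()).hexdigest()
    bucket = int(digest[:2], 16) % NUM_BUCKETS
    # e.g. "a4/f2/<image_id>.jpg" - at most 256 x 256 directories per bucket
    rel_path = os.path.join(digest[2:4], digest[4:6], image_id + ".jpg")
    return bucket, rel_path

def save_image(image_id: str, data: bytes) -> tuple[int, str]:
    bucket, rel_path = image_location(image_id)
    full_path = os.path.join(STORAGE_ROOT, "bucket%02d" % bucket, rel_path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "wb") as f:
        f.write(data)
    # The (bucket, rel_path) pair goes into the metadata row in the database.
    return bucket, rel_path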
One possible way is to upload from the client directly to Amazon S3. It will scale and receive any amount of files thrown at it. After the upload to S3 is complete, save a link to the S3 object along with the useful metadata to your DB. In this setup you avoid the file-upload bottleneck and only have to be able to save ~240,000 records per day to your DB, which should not be a problem.
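A rough sketch of that flow (the bucket and table names are made up): the server hands the tablet a short-lived presigned POST, the tablet uploads straight to S3, and the server only records the object key and metadata afterwards.

import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-image-uploads"        # placeholder bucket name

def presign_upload(user_id: str) -> dict:
    # Give the client a one-hour presigned POST for a fresh object key.
    key = "uploads/%s/%s.jpg" % (user_id, uuid.uuid4())
    post = s3.generate_presigned_post(
        Bucket=BUCKET,
        Key=key,
        Conditions=[["content-length-range", 0, 200 * 1024]],  # enforce the <200 kB limit
        ExpiresIn=3600,
    )
    # The client POSTs the image bytes to post["url"] together with post["fields"].
    return {"key": key, "url": post["url"], "fields": post["fields"]}

def record_upload(db_cursor, user_id: str, key: str, taken_at) -> None:
    # Hypothetical metadata table; only the S3 key is stored, never the image bytes.
    db_cursor.execute(
        "INSERT INTO ImageMeta (UserId, S3Key, TakenAt) VALUES (?, ?, ?)",
        (user_id, key, taken_at),
    )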
If you want to build a service that adds value and save some (a huge amount, actually) of time on file uploads, consider using existing 3rd-party solutions that are built to solve this particular issue - for example, Uploadcare and some of its competitors.
Let's say I wanted to make an App Engine application that stores a 50,000-word dictionary and also equivalent dictionaries of similar size in 10 other languages.
I got this working locally on my dev server, but when I went to load the first dictionary into the real app server I immediately went over my writes-per-day quota. I had no idea how many dictionary entries had made it into the datastore. So, 24 hours later, I went and tried to bulk download the dictionary to see how many entries I actually had, but doing that I hit the reads-per-day quota and got nothing back for my trouble. I tried enabling billing and setting a daily maximum of $1.00, hit that quota through the bulkloader, and didn't get any data for my trouble or my $1.00.
Anyway, I looked at my datastore viewer and it shows that each of my dictionary words required 8 writes to the datastore.
So, does this mean that this sort of application is inappropriate for App Engine? Should I not be trying to store a dictionary in there? Is there a smarter way to do this? For instance, could I store the dictionary in file form in the blobstore somehow and then work with it programmatically from there?
Thank you for any suggestions
It's likely that the reads will cost you much less than the writes, so the problem is getting the data in, not so much reading it back.
So all you need to do to use your current configuration is slow down the write rate. Then, presumably, you'll be getting each word by its ID (the word itself, I hope!), so reads will be fast and small, exactly as you want.
You could do this: chop your source data into one file per letter. Upload those files with your application and create a task that reads each file in turn and slowly writes that data to the datastore. Once that task completes, its last action is to call the task for the next file.
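A rough sketch of that task chain on the legacy Python runtime using the deferred library; the entity kind, property names and file layout are all made up, and keeping properties unindexed also cuts down the 8-writes-per-word cost you saw in the datastore viewer (index entries are what drive that number up):

# data/a.txt, data/b.txt, ... ship with the app, one "word<TAB>definition" per line.
import string
from google.appengine.ext import deferred, ndb

class DictEntry(ndb.Model):
    # The key id is the word itself, so lookups are cheap gets by key.
    definition = ndb.TextProperty()                 # TextProperty is never indexed
    language = ndb.StringProperty(indexed=False)    # unindexed -> fewer index writes

BATCH = 100         # small batches spread the writes out over time

def load_letter(letter, offset=0):
    with open('data/%s.txt' % letter) as f:
        lines = f.readlines()[offset:offset + BATCH]
    entities = []
    for line in lines:
        word, definition = line.rstrip('\n').split('\t', 1)
        entities.append(DictEntry(id=word, definition=definition, language='en'))
    if entities:
        ndb.put_multi(entities)
        # More of this letter left: continue after a pause to stay under quota.
        deferred.defer(load_letter, letter, offset + BATCH, _countdown=30)
    else:
        # This letter is done: chain to the next one.
        nxt = string.ascii_lowercase.find(letter) + 1
        if nxt < 26:
            deferred.defer(load_letter, string.ascii_lowercase[nxt], 0, _countdown=30)

# Kick it off once, e.g. from a handler or the remote API shell:
# deferred.defer(load_letter, 'a')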
It might take a week to complete, but once it's there it'll be a lot more handy than an opaque blob you have to fetch from the blobstore, reading N words for every single one you are actually interested in, and then unpacking and processing for every word.
You can also use the bulkloader to upload data, not just download it!
We are developing a file processing system where several file-processing applications pick up files from a queue, do the processing, and put the files back on the queue as a response. Right now we use the Windows file system (a folder shared on the network) as the queue: we share one folder and put files in it, and the file-processing server applications pick files up from it and put them back after processing.
We are thinking of moving the whole queue engine from the Windows file system to SQL Server. Is it a good idea to store the files in SQL Server and use SQL Server as the file queue backend? The files are about 1-20 MB in size and our system processes about 10,000 files per day.
You can do that, but I'd prefer a queue - either a remote instance or an in-memory object. I would prefer a real queue because I could pool listeners and have the queue hand off requests to them and manage their life cycle. You'll have to write all that code if you put them in a database.
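To illustrate the listener-pool idea in the simplest possible terms, here is a sketch with an in-memory queue of file paths and a fixed pool of workers (the input folder and the processing step are obviously stand-ins):

import queue
import threading
from pathlib import Path

work = queue.Queue()        # holds file paths, not file contents
results = queue.Queue()
NUM_WORKERS = 8             # the "pool of listeners"

def worker():
    while True:
        path = work.get()
        if path is None:                      # poison pill shuts the worker down
            work.task_done()
            break
        processed = Path(path).read_bytes()   # stand-in for the real processing step
        results.put((path, len(processed)))
        work.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for f in Path("incoming").glob("*.dat"):      # placeholder input folder
    work.put(str(f))

work.join()                                   # wait until every file is processed
for _ in threads:
    work.put(None)

A real message broker gives you the same hand-off semantics across machines, plus the listener life-cycle management mentioned above.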
10,000 files per day means you need to process one every 8.64 seconds, 24 hours a day. What are your typical processing times for a 1-20 MB file?
The processing of the files should be asynchronous.
If you have 50 listeners, each handling one 20 MB file, your total memory footprint will be on the order of 1 GB.
As far as speed goes, the worst case is the 15-minute processing time. That's four files per hour, or 96 per day, per processor. So you'll need at least 105 processors (10,000 / 96 ≈ 104.2) to get through 10,000 in a single day. That's a lot of servers.
You're not thinking about network latency, either. There's transfer time back and forth for each file - four network hops: client to database, database to processor, and the same two back again. Moving 20 MB across each of those hops could add a lot of latency.
I'd recommend that you look into Netty. I'll bet it could help to handle this load.
The file sizes are quite nasty - unless you can, for example, significantly compress the files, the storage requirements in SQL might outweigh any benefits you perceive.
What you might consider is a hybrid solution, i.e. modelling each incoming file in SQL (fileName, timestamp, uploadedBy, processedYN, etc.), queuing the file record in SQL after each upload, and then using SQL to do the queueing/ordering (you can then run audits, reports, etc. out of SQL).
The downside of the hybrid is that if your file system crashes, you have SQL but not your files.
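A rough sketch of the hybrid dequeue step, written against pyodbc just to keep these examples in one language (the table and column names are made up); the READPAST hint lets several workers pull rows concurrently without grabbing the same file:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"   # placeholder connection string
    "DATABASE=FileQueue;UID=queue_user;PWD=secret"
)

DEQUEUE_SQL = """
UPDATE TOP (1) dbo.FileQueue WITH (ROWLOCK, READPAST)
SET status = 'processing', pickedUpAt = SYSUTCDATETIME()
OUTPUT inserted.id, inserted.filePath
WHERE status = 'queued';
"""

def dequeue_one():
    cur = conn.cursor()
    row = cur.execute(DEQUEUE_SQL).fetchone()
    conn.commit()
    if row is None:
        return None                       # nothing waiting
    file_id, file_path = row
    # Only metadata lives in SQL; the bytes stay on the (shared) file system.
    return file_id, file_path

def mark_done(file_id):
    cur = conn.cursor()
    cur.execute("UPDATE dbo.FileQueue SET status = 'done' WHERE id = ?", file_id)
    conn.commit()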