rclone slow transfer from bucket to filesystem - data-transfer

Im using rclone to tranfer data between a minio bucket and a shared storage. Im migrating a store and The amount of data is around 200GB of product pictures. Every single picture have his own folder/path. So there are a lot of folders that needs to create to. Rclone is installed on the new server and the storage is connected to the server via san. The transfer is running over a week and we are at 170GB right now. Everything works fine but it is really slow in my opinion. Is it normal that a transfer out of a bucket into a classic filesystem is that slow?

(Doing the math, the speed is only 2.3Mbps. I am honestly not going to pay anything for that speed.)
Perhaps you should break down the issue and diagnose part by part. Below are several common places to look out for slow transfer (generally speaking for any file transfer):
First of all, network and file systems are usually not performant with lots of small files, so to isolate the issue, upload a bigger file to minio first (1GB+). And for each step, test with big file first.
Is the speed of the source fast enough? Try copying the files from minio to a local storage or Ramdisk (/tmp is usually tmpfs and in turn stored in RAM, use mount to check).
Is the speed of the destination fast enough? Try dd or other disk performance testing utility.
Is the network latency to source high? Try pinging or curling the API (with timing)
Is the network latency to destination high? Try iostat
Maybe the CPU is the bottleneck? As encoding and decoding stuff takes quite a lot of computing power. Try top when a copy is running.
Again, try these steps with the big file and fragmented file separately. The is quite a chance that fragmented files is an issue. If that is the case, I would try to look for concurrency option in rclone.

I had the same problem copying hundreds of thousands of small files from a S3-compatible storage to a local storage. Originally I was using s3fs+rsync. Very (very) slow, and it was getting stuck on the largest folders. Then I discovered rclone, and finished the migration within a few hours with these parameters:
rclone copy source:/bucket /destination/folder --checkers 256 --transfers 256 --fast-list --size-only --progress
Explanation of the options (from https://rclone.org/flags/)
--checkers 256 Number of checkers to run in parallel (default 8)
--transfers 256 Number of file transfers to run in parallel (default 4)
--fast-list Use recursive list if available; uses more memory but fewer transactions
--size-only Skip based on size only, not mod-time or checksum (wouldn't apply in your case if copying to an empty destination)
--progress Show progress during transfer

Related

Transferring millions of documents to an external hard drive

I have 13 million documents on Azure Blob Storage that I can azcopy to my desktop memory within 24 hours. However, as soon as I try to transfer these files to my external hard drive, the time needed to complete the transfer jumps to 60 days. The files aren't large - each 100 kb - so the entire transfer is about 1.3 TB. I have tried:
Zipping the files, transfer, unzip. Problem: Unzipping takes just as long
azcopy directly into the SSD hard drive
robocopy files from internal to external drive
Simple ctrl-c ctrl-v.
Each of the above options take months to complete the transfer. Any ideas on how to speed this up??? Why would azcopy be so much faster for an internal drive than an external one?
There could be several reasons for the performance issue.
You can run a performance benchmark test on specific blob containers or file shares to view general performance statistics and to identify performance bottlenecks. You can run the test by uploading or downloading generated test data.
Use the following command to run a performance benchmark test.
Syntax
azcopy benchmark 'https://<storage-account-name>.blob.core.windows.net/<container-name>'
Optimize the performance of AzCopy with Azure Storage
There are several options for transferring data to and from Azure, depending on your needs: Transfer data to and from Azure
Azcopy Fast Data Transfer is a tool for fast upload of data into Azure – up to 4 terabytes per hour from a single client machine. It moves data from your premises to Blob Storage, to a clustered file system, or direct to an Azure VM. It can also move data between Azure regions.
The tool works by maximizing utilization of the network link. It efficiently uses all available bandwidth, even over long-distance links. On a 10 Gbps link, it reaches around 4 TB per hour, which makes it about 3 to 10 times faster than competing tools we’ve tested. On slower links, Fast Data Transfer typically achieves over 90% of the link’s theoretical maximum, while other tools may achieve substantially less.
For example, on a 250 Mbps link, the theoretical maximum throughput is about 100 GB per hour. Even with no other traffic on the link, other tools may achieve substantially less than that. In the same conditions (250 Mbps, with no competing traffic) Fast Data Transfer can be expected to transfer at least 90 GB per hour. (If there is competing traffic on the link, Fast Data Transfer will reduce its own throughput accordingly, in order to avoid disrupting your existing traffic.)
Fast Data Transfer runs on Windows and Linux. Its client-side portion is a command-line application that runs on-premises, on your own machine. A single client-side instance supports up to 10 Gbps. Its server-side portion runs on Azure VM(s) in your own subscription. Depending on the target speed, between 1 and 4 Azure VMs are required. An Azure Resource Manager template is supplied to automatically create the necessary VM(s).
Your files are very small (e.g. each file is only 10s of KB).
You have an ExpressRoute with private peering.
You want to throttle your transfers to use only a set amount of network bandwidth.
You want to load directly to the disk of a destination VM (or to a clustered file system). Most Azure data loading tools can’t send data direct to VMs. Tools such as Robocopy can, but they’re not designed for long-distance links. We have reports of Fast Data Transfer being over 10 times faster.
You are reading from spinning hard disks and want to minimize the overhead of seek times. In our testing, we were able to double disk read performance by following the tuning tips in Fast Data Transfer’s instructions.

Use SQL Server for file queue storage

We are developing file processing system where several File Processing applications pick up files from queue, do processing and put back file to queue as response. Now we use Windows file system(share a folder on network) as queue. We share one folder and put files in it, the File Processing Servers applications pick up files from it and put back after processing.
We are thinking to move the whole queue engine system from windows file system to SQL Server. Is it good idea to store files into SQL Server and use SQL Server as file queue backend? The files are about 1-20mb in size and our system process about 10 000 files per day.
You can do that, but I'd prefer a queue - either a remote instance or an in-memory object. I would prefer a real queue because I could pool listeners and have the queue hand off requests to them and manage their life cycle. You'll have to write all that code if you put them in a database.
10,000 files per day means you need to process one every 8.64 seconds for 24 hours a day. What are your typical processing times for a 1-20MB file?
The processing of the files should be asynchronous.
If you have 50 listeners, each handling one 20MB file, your total memory footprint will be on the order of 1GB.
As far as speed goes, the worst case is the 15 minutes for processing time. That's four per hour, 96 per day. So you'll need at least 104 processors to get through 10,000 in a single day. That's a lot of servers.
You're not thinking about network latency, either. There's transfer time back and forth for each file. It's four network hops: one from the client to the database, another from the database to the processor, and back again. 20MB could introduce a lot of latency.
I'd recommend that you look into Netty. I'll bet it could help to handle this load.
The file size is quite nasty - unless you can e.g. significantly compress the files, the storage requirements in SQL might outweigh any benefits you perceive.
What you might consider is a hybrid solution, i.e. modelling each incoming file in SQL (fileName, timestamp, uploadedby, processedYN... etc), queuing the file record in SQL after each upload, and then use SQL to do the queueing / ordering (and you can then run audits, reports etc out of SQL)
The downside of the hybrid is that if your file system crashes, you have SQL but not your files.

Memory leak using SQL FileStream

I have an application that uses a SQL FILESTREAM to store images. I insert a LOT of images (several millions images per days).
After a while, the machine stops responding and seem to be out of memory... Looking at the memory usage of the PC, we don't see any process taking a lot of memory (neither SQL or our application). We tried to kill our process and it didn't restore our machine... We then kill the SQL services and it didn't not restore to system. As a last resort, we even killed all processes (except the system ones) and the memory still remained high (we are looking in the task manager's performance tab). Only a reboot does the job at that point. We have tried on Win7, WinXP, Win2K3 server with always the same results.
Unfortunately, this isn't a one-shot deal, it happens every time.
Has anybody seen that kind of behaviour before? Are we doing something wrong using the SQL FILESTREAMS?
You say you insert a lot of images per day. What else do you do with the images? Do you update them, many reads?
Is your file system optimized for FILESTREAMs?
How do you read out the images?
If you do a lot of updates, remember that SQL Server will not modify the filestream object but create a new one and mark the old for deletion by the garbage collector. At some time the GC will trigger and start cleaning up the old mess. The problem with FILESTREAM is that it doesn't log a lot to the transaction log and thus the GC can be seriously delayed. If this is the problem it might be solved by forcing GC more often to maintain responsiveness. This can be done using the CHECKPOINT statement.
UPDATE: You shouldn't use FILESTREAM for small files (less than 1 MB). Millions of small files will cause problems for the filesystem and the Master File Table. Use varbinary in stead. See also Designing and implementing FILESTREAM storage
UPDATE 2: If you still insist on using the FILESTREAM for storage (you shouldn't for large amounts of small files), you must at least configure the file system accordingly.
Optimize the file system for large amount of small files (use these as tips and make sure you understand what they do before you apply)
Change the Master File Table
reservation to maximum in registry (FSUTIL.exe behavior set mftzone 4)
Disable 8.3 file names (fsutil.exe behavior set disable8dot3 1)
Disable last access update(fsutil.exe behavior set disablelastaccess 1)
Reboot and create a new partition
Format the storage volumes using a
block size that will fit most of the
files (2k or 4k depending on you
image files).

File writes per second

I want to log visits to my website with a high visits rate to file. How much writes to log file can I perform per second?
If you can't use Analytics, why wouldn't you use your webserver's existing logging system? If you are using a real webserver, it almost certainly as a logging mechanism that is already optimized for maximum throughput.
Your question is impossible to answer in all other respects. The number of possible writes is governed by hardware, operating system and contention from other running software.
Don't do that, use Google Analytics instead. You'd end up running into many problems trying to open files, write to them, close them, so on and so forth. Problems would arise when you overwrite data that hasn't yet been committed, etc.
If you need your own local solution (within a private network, etc) you can look into an option like AWStats which operates off of crawling through your log files.
Or just analyze the Apache access log files. For example with AWStats.
File writes are not expensive until you actually flush the data to disk. Usually your operating system will cache things aggressively so you can have very good write performance if you don't try to fsync() your data manually (but of course you might lose the latest log entries if there's a crash).
Another problem however is that file I/O is not necessarily thread-safe, and writing to the same file from multiple threads or processes (which will probably happen if we're talking about a Web app) might produce the wrong results: missing or duplicate or intermingled log lines, for example.
If your hard disk drive can write 40 MB/s and your log file lines are approx. 300 bytes in length, I'd assume that you can write 140000 HTTP requests per second to your logfile if you keep it open.
Anyway, you should not do that on your own, since most web servers already write to logfiles and they know very good how to do that, how to roll the files if a maximum limit is reached and how to format the log lines according to some well-known patterns.
File access is very expensive, especially when doing writes. I would recommend saving them to RAM (using whatever cache method suits you best) and periodically writing the results to disk.
You could also use a database for this. Something like:
UPDATE stats SET hits = hits + 1
Try out a couple different solutions, benchmark the performance, and implement whichever works fast enough with minimal resource usage.
If using Apache, I'd recommend using the rotatelogs utility supplied as a part of the standard kit.
We use this to allow rotating the server logs out on a daily basis without having to stop and start the server. N.B. Use the new "||" syntax when declaring the log directive.
The site I'm involved with is one of the largest on the Internet with hit rates peaking in the millions per second for extended periods of time.
Edit: I forgot to say that the site uses standard Apache logging directives and we have not needed to customise the Apache logging code at all.
Edit: BTW Unless you really need it, don't log bytes served as this causes all sorts of issues around the midnight boundary.
Let Apache do it; do the analysis work on the back-end.

Very large Jar files and FAT32

I am doing some analysis for a chunk of a desktop app we're working on.
One requirement is that it be able to do i/o of some legacy file formats, which in practice are running as large as 800Mb each. An import might reasonably be expected to be on the order of 5Gb in size.
Ideally, I'd just stick whatever files I want into a jar file, sign the thing, and send it off for re-import at a some later time.
But our app must support XP Pro (FAT32), which has a max file size limit of around 4Gb, from what I can tell.
Must I break my data up into multiple chunks? (And therefore take on the complexity of keeping track of what's going on?)
There's no other way of storing 5GB of data on FAT32 than splitting it in chunks.
Write a routine that will deal with archives of more than 4GB, i.e. split and merge. Encapsulate it in some util class or util file and call it from your save/load method.
The max file size on fat 32 is 4gb (actually 4Gb - 2bytes), so if you have to use it and your single file is over that you will have to split it.

Resources