Transferring millions of documents to an external hard drive

Transferring millions of documents to an external hard drive - file

I have 13 million documents on Azure Blob Storage that I can azcopy to my desktop memory within 24 hours. However, as soon as I try to transfer these files to my external hard drive, the time needed to complete the transfer jumps to 60 days. The files aren't large - each 100 kb - so the entire transfer is about 1.3 TB. I have tried:
Zipping the files, transfer, unzip. Problem: Unzipping takes just as long
azcopy directly into the SSD hard drive
robocopy files from internal to external drive
Simple ctrl-c ctrl-v.
Each of the above options take months to complete the transfer. Any ideas on how to speed this up??? Why would azcopy be so much faster for an internal drive than an external one?

There could be several reasons for the performance issue.
You can run a performance benchmark test on specific blob containers or file shares to view general performance statistics and to identify performance bottlenecks. You can run the test by uploading or downloading generated test data.
Use the following command to run a performance benchmark test.
Syntax
azcopy benchmark 'https://<storage-account-name>.blob.core.windows.net/<container-name>'
Optimize the performance of AzCopy with Azure Storage
There are several options for transferring data to and from Azure, depending on your needs: Transfer data to and from Azure
Azcopy Fast Data Transfer is a tool for fast upload of data into Azure – up to 4 terabytes per hour from a single client machine. It moves data from your premises to Blob Storage, to a clustered file system, or direct to an Azure VM. It can also move data between Azure regions.
The tool works by maximizing utilization of the network link. It efficiently uses all available bandwidth, even over long-distance links. On a 10 Gbps link, it reaches around 4 TB per hour, which makes it about 3 to 10 times faster than competing tools we’ve tested. On slower links, Fast Data Transfer typically achieves over 90% of the link’s theoretical maximum, while other tools may achieve substantially less.
For example, on a 250 Mbps link, the theoretical maximum throughput is about 100 GB per hour. Even with no other traffic on the link, other tools may achieve substantially less than that. In the same conditions (250 Mbps, with no competing traffic) Fast Data Transfer can be expected to transfer at least 90 GB per hour. (If there is competing traffic on the link, Fast Data Transfer will reduce its own throughput accordingly, in order to avoid disrupting your existing traffic.)
Fast Data Transfer runs on Windows and Linux. Its client-side portion is a command-line application that runs on-premises, on your own machine. A single client-side instance supports up to 10 Gbps. Its server-side portion runs on Azure VM(s) in your own subscription. Depending on the target speed, between 1 and 4 Azure VMs are required. An Azure Resource Manager template is supplied to automatically create the necessary VM(s).
Your files are very small (e.g. each file is only 10s of KB).
You have an ExpressRoute with private peering.
You want to throttle your transfers to use only a set amount of network bandwidth.
You want to load directly to the disk of a destination VM (or to a clustered file system). Most Azure data loading tools can’t send data direct to VMs. Tools such as Robocopy can, but they’re not designed for long-distance links. We have reports of Fast Data Transfer being over 10 times faster.
You are reading from spinning hard disks and want to minimize the overhead of seek times. In our testing, we were able to double disk read performance by following the tuning tips in Fast Data Transfer’s instructions.

Related

rclone slow transfer from bucket to filesystem

Im using rclone to tranfer data between a minio bucket and a shared storage. Im migrating a store and The amount of data is around 200GB of product pictures. Every single picture have his own folder/path. So there are a lot of folders that needs to create to. Rclone is installed on the new server and the storage is connected to the server via san. The transfer is running over a week and we are at 170GB right now. Everything works fine but it is really slow in my opinion. Is it normal that a transfer out of a bucket into a classic filesystem is that slow?

(Doing the math, the speed is only 2.3Mbps. I am honestly not going to pay anything for that speed.)
Perhaps you should break down the issue and diagnose part by part. Below are several common places to look out for slow transfer (generally speaking for any file transfer):
First of all, network and file systems are usually not performant with lots of small files, so to isolate the issue, upload a bigger file to minio first (1GB+). And for each step, test with big file first.
Is the speed of the source fast enough? Try copying the files from minio to a local storage or Ramdisk (/tmp is usually tmpfs and in turn stored in RAM, use mount to check).
Is the speed of the destination fast enough? Try dd or other disk performance testing utility.
Is the network latency to source high? Try pinging or curling the API (with timing)
Is the network latency to destination high? Try iostat
Maybe the CPU is the bottleneck? As encoding and decoding stuff takes quite a lot of computing power. Try top when a copy is running.
Again, try these steps with the big file and fragmented file separately. The is quite a chance that fragmented files is an issue. If that is the case, I would try to look for concurrency option in rclone.

I had the same problem copying hundreds of thousands of small files from a S3-compatible storage to a local storage. Originally I was using s3fs+rsync. Very (very) slow, and it was getting stuck on the largest folders. Then I discovered rclone, and finished the migration within a few hours with these parameters:
rclone copy source:/bucket /destination/folder --checkers 256 --transfers 256 --fast-list --size-only --progress
Explanation of the options (from https://rclone.org/flags/)
--checkers 256 Number of checkers to run in parallel (default 8)
--transfers 256 Number of file transfers to run in parallel (default 4)
--fast-list Use recursive list if available; uses more memory but fewer transactions
--size-only Skip based on size only, not mod-time or checksum (wouldn't apply in your case if copying to an empty destination)
--progress Show progress during transfer

What is the maxium replication rate of Couchbase XDCR

We are currently using Couchbase for data caching and there is talk of doing cross-data center replication with it. However, we will need up to 1000 documents replicated to multiple locations every second. Documents will be between 2 and 64K each.
Is there anyone out there with XDCR experience who can tell me whether this is even feasible, or if we will have to use other means to replicate this data at that speed. The only "benchmark" in the documentation at Couchbase implies that the rate of XDCR is only about 100TPS. (149 ms to replicate 11 documents.)

The replication rate of XDCR is limited by network bandwidth and latency first, then CPU and disk IO. Assuming you have enough bandwidth between the datacenters and your clusters are provisioned properly, Couchbase will replicate hundreds of thousands of documents per second, or more. It's a pretty simple experiment to run, just set up XDCR between two singles node clusters and use one of the load generator tools that come with Couchbase to create some traffic. (cbworkloadgen in the Couchbase bin folder or cbc-pillowfight that comes with libcouchbase.)
There are several config settings you can play with to optimize throughput, such as increasing batch size, changing the optimistic replication threshold, etc.

Why is the CPU not maxed out?

I have an application that I'd like to make more efficient - it isn't taxing any one resource enough that I can identify it as a bottleneck, so perhaps the app is doing something that is preventing full efficiency.
The application pulls data from a database on one SQL Server instance, does some manipulation on it, then writes it to a database on another SQL Server instance - all on one machine. It doesn't do anything in parallel.
While the app is running (it can take several hours), none of the 4 CPU cores are maxed out (they hover around 40-60% utilization each), the disks are almost idle and very little RAM is used.
Reported values:
Target SQL Server instance: ~10% CPU utilization, 1.3GB RAM
Source SQL Server instance: ~10% CPU utilization, 300MB RAM
Application: ~6% CPU utilization, 45MB RAM
All the work is happening on one disk, which writes around 100KB/s during the operation, on average. 'Active time' according to task manager is usually 0%, occasionally flickering up to between 1 and 5% for a second or so. Average response time, again according to task manager, moves betweeen 0ms and 20ms, mainly showing between 0.5 and 2ms.

Databases are notorious for IO limitations. Now, seriously, as you say:
The application pulls data from a database on one SQL Server instance,
does some manipulation on it, then writes it to a database on another
SQL Server instance - all on one machine.
I somehow get the idea this is a end user level mashine, maybe a workstation. Your linear code (a bad idea to get full utilization btw, as you never run all 3 parts - read, process, write - in parallel) will be seriously limited by whatever IO subsystem you have.
But that will not come into play as long as you can state:
It doesn't do anything in parallel.
What it must do is do things in parallel:
One task is reading the next data
One task does the data processing
One task does the data writing
You can definitely max out a lot more than your 4 cores. Last time I did something like that (read / manipulate / write) we were maxing out 48 cores with around 96 or so processing threads running in parallel (and a smaller amount doing the writes). But a core of that is that your application msut start actually using multiple CPU's.
If you do not parallelize:
You only will max out one core max,
YOu basically waste time waiting for databases on both ends. The latency while you wait for data to be read or committed is latency you are not processing anything.
;) And once you fix that you will get IO problems. Promised.

I recommend reading How to analyse SQL Server performance. You need to capture and analyze the wait stats. These will tell you what is the execution doing that prevents it from going max out on CPU. You already have a feeling that the workload is causing the SQL engine to wait rather than run, but only after you understand the wait stats you'll be able to get a feel what is waiting for. Follow the article linked for specific analysis techniques.

Estimating IOPS requirements of a production SQL Server system

We're working on an application that's going to serve thousands of users daily (90% of them will be active during the working hours, using the system constantly during their workday). The main purpose of the system is to query multiple databases and combine the information from the databases into a single response to the user. Depending on the user input, our query load could be around 500 queries per second for a system with 1000 users. 80% of those queries are read queries.
Now, I did some profiling using the SQL Server Profiler tool and I get on average ~300 logical reads for the read queries (I did not bother with the write queries yet). That would amount to 150k logical reads per second for 1k users. Full production system is expected to have ~10k users.
How do I estimate actual read requirement on the storage for those databases? I am pretty sure that actual physical reads will amount to much less than that, but how do I estimate that? Of course, I can't do an actual run in the production environment as the production environment is not there yet, and I need to tell the hardware guys how much IOPS we're going to need for the system so that they know what to buy.
I tried the HP sizing tool suggested in the previous answers, but it only suggests HP products, without actual performance estimates. Any insight is appreciated.
EDIT: Main read-only dataset (where most of the queries will go) is a couple of gigs (order of magnitude 4gigs) on the disk. This will probably significantly affect the logical vs physical reads. Any insight how to get this ratio?

Disk I/O demand varies tremendously based on many factors, including:
How much data is already in RAM
Structure of your schema (indexes, row width, data types, triggers, etc)
Nature of your queries (joins, multiple single-row vs. row range, etc)
Data access methodology (ORM vs. set-oriented, single command vs. batching)
Ratio of reads vs. writes
Disk (database, table, index) fragmentation status
Use of SSDs vs. rotating media
For those reasons, the best way to estimate production disk load is usually by building a small prototype and benchmarking it. Use a copy of production data if you can; otherwise, use a data generation tool to build a similarly sized DB.
With the sample data in place, build a simple benchmark app that produces a mix of the types of queries you're expecting. Scale memory size if you need to.
Measure the results with Windows performance counters. The most useful stats are for the Physical Disk: time per transfer, transfers per second, queue depth, etc.
You can then apply some heuristics (also known as "experience") to those results and extrapolate them to a first-cut estimate for production I/O requirements.
If you absolutely can't build a prototype, then it's possible to make some educated guesses based on initial measurements, but it still takes work. For starters, turn on statistics:
SET STATISTICS IO ON
Before you run a test query, clear the RAM cache:
CHECKPOINT
DBCC DROPCLEANBUFFERS
Then, run your query, and look at physical reads + read-ahead reads to see the physical disk I/O demand. Repeat in some mix without clearing the RAM cache first to get an idea of how much caching will help.
Having said that, I would recommend against using IOPS alone as a target. I realize that SAN vendors and IT managers seem to love IOPS, but they are a very misleading measure of disk subsystem performance. As an example, there can be a 40:1 difference in deliverable IOPS when you switch from sequential I/O to random.

You certainly cannot derive your estimates from logical reads. This counter really is not that helpful because it is often unclear how much of it is physical and also the CPU cost of each of these accesses is unknown. I do not look at this number at all.
You need to gather virtual file stats which will show you the physical IO. For example: http://sqlserverio.com/2011/02/08/gather-virtual-file-statistics-using-t-sql-tsql2sday-15/
Google for "virtual file stats sql server".
Please note that you can only extrapolate IOs from the user count if you assume that cache hit ratio of the buffer pool will stay the same. Estimating this is much harder. Basically you need to estimate the working set of pages you will have under full load.
If you can ensure that your buffer pool can always take all hot data you can basically live without any reads. Then you only have to scale writes (for example with an SSD drive).

Use SQL Server for file queue storage

We are developing file processing system where several File Processing applications pick up files from queue, do processing and put back file to queue as response. Now we use Windows file system(share a folder on network) as queue. We share one folder and put files in it, the File Processing Servers applications pick up files from it and put back after processing.
We are thinking to move the whole queue engine system from windows file system to SQL Server. Is it good idea to store files into SQL Server and use SQL Server as file queue backend? The files are about 1-20mb in size and our system process about 10 000 files per day.

You can do that, but I'd prefer a queue - either a remote instance or an in-memory object. I would prefer a real queue because I could pool listeners and have the queue hand off requests to them and manage their life cycle. You'll have to write all that code if you put them in a database.
10,000 files per day means you need to process one every 8.64 seconds for 24 hours a day. What are your typical processing times for a 1-20MB file?
The processing of the files should be asynchronous.
If you have 50 listeners, each handling one 20MB file, your total memory footprint will be on the order of 1GB.
As far as speed goes, the worst case is the 15 minutes for processing time. That's four per hour, 96 per day. So you'll need at least 104 processors to get through 10,000 in a single day. That's a lot of servers.
You're not thinking about network latency, either. There's transfer time back and forth for each file. It's four network hops: one from the client to the database, another from the database to the processor, and back again. 20MB could introduce a lot of latency.
I'd recommend that you look into Netty. I'll bet it could help to handle this load.

The file size is quite nasty - unless you can e.g. significantly compress the files, the storage requirements in SQL might outweigh any benefits you perceive.
What you might consider is a hybrid solution, i.e. modelling each incoming file in SQL (fileName, timestamp, uploadedby, processedYN... etc), queuing the file record in SQL after each upload, and then use SQL to do the queueing / ordering (and you can then run audits, reports etc out of SQL)
The downside of the hybrid is that if your file system crashes, you have SQL but not your files.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight