How are sparse files handled in Google Cloud Storage?

We have a 200GB sparse file which is about 80GB in actual size (VMware disk).
How does Google calculate the space for this file, 200GB or 80GB?
What would be the best practice to store it in Google Cloud using gsutil (similar to rsync -S)?
Would it be solved by using tar cSf and then uploading via gsutil? How slow could it be?

We have a 200GB sparse file which is about 80GB in actual size (VMware disk).
How does Google calculate the space for this file, 200GB or 80GB?
Google Cloud Storage does not introspect your files to understand what they are, so it's the actual size (80GB) that it takes on disk that matters.
What would be the best practice to store it in Google Cloud using gsutil (similar to rsync -S)?
There's gsutil rsync, but it does not support -S, so that won't be very efficient. Also, Google Cloud Storage does not store files as blocks which can be accessed and rewritten randomly, but as blobs keyed by the bucket name + object name, so you'll essentially be uploading the entire 80GB file every time.
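For reference, the corresponding command would look roughly like this (the bucket and path names are placeholders), and it will re-upload the whole image every time it changes:

gsutil -m rsync -r /data/vm-images gs://my-backup-bucket/vm-images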
One alternative you might consider is to use Persistent Disks which provide block-level access to your files with the following workflow:
One-time setup:
create a persistent disk and use it only for storage of your VM image
Pre-sync setup:
create a Linux VM instance with its own boot disk
attach the persistent disk in read-write mode to the instance
mount the attached disk as a file system
Synchronize:
use ssh+rsync to synchronize your VM image to the persistent disk on the VM
Post-sync teardown:
unmount the disk within the instance
detach the persistent disk from the instance
delete the VM instance
You can automate the setup and teardown steps with scripts so it should be very easy to run on a regular basis whenever you want to do the synchronization.
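As a rough illustration, the whole cycle could be scripted with the gcloud CLI along these lines (a hedged sketch: the instance name, disk name, zone, device path and mount point are all placeholders, the disk must be formatted once before first use, and you may need to adjust permissions on the mount point so your user can write to it):

# One-time setup: a persistent disk dedicated to the VM image
gcloud compute disks create vm-image-disk --size=200GB --zone=us-central1-a

# Pre-sync setup: temporary VM, attach and mount the disk
gcloud compute instances create sync-vm --zone=us-central1-a
gcloud compute instances attach-disk sync-vm --disk=vm-image-disk --zone=us-central1-a
gcloud compute ssh sync-vm --zone=us-central1-a --command='sudo mkdir -p /mnt/images && sudo mount /dev/sdb /mnt/images'   # check the device name with lsblk

# Synchronize: rsync over ssh; -S keeps the file sparse on the destination
rsync -avzS /data/vm-images/disk.vmdk <user>@<sync-vm-external-ip>:/mnt/images/

# Post-sync teardown: unmount, detach, delete the VM (the persistent disk and its data remain)
gcloud compute ssh sync-vm --zone=us-central1-a --command='sudo umount /mnt/images'
gcloud compute instances detach-disk sync-vm --disk=vm-image-disk --zone=us-central1-a
gcloud compute instances delete sync-vm --zone=us-central1-a --quiet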
Would it be solved by using tar cSf and then uploading via gsutil? How slow could it be?
The method above will be limited by your network connection, and would be no different from ssh+rsync to any other server. You can test it out by, say, throttling your bandwidth artificially to another server on your own network to match your external upload speed and running rsync over ssh to test it out.
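For example, a throttled test against another machine on your own network could look like this (the host, path and the 5000 KB/s limit are made-up values; -S tells rsync to handle sparse files efficiently):

rsync -avzS --bwlimit=5000 -e ssh /data/vm-images/disk.vmdk user@test-host:/tmp/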
Something not covered above is pricing, so here are a few pointers to factor into your analysis as well.
Using Google Cloud Storage mode, you'll incur:
Google Cloud Storage pricing: currently $0.026 / GB / month
Network egress (ingress is free): varies by total amount of data
Using the Persistent Disk approach, you'll incur:
Persistent Disk pricing: currently $0.04 / GB / month
VM instance: needs to be up only while you're running the sync
Network egress (ingress is free): varies by total amount of data
The actual amount of data you download should be small, since that is exactly what rsync is designed to minimize; most of the data will be uploaded rather than downloaded, so your network cost should be low. That said, it depends on the actual rsync implementation, which I cannot speak for.
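As a rough back-of-the-envelope example at the prices quoted above (and assuming roughly 80GB of actual data): Cloud Storage would be about 80 GB × $0.026 ≈ $2.08/month, while a persistent disk provisioned at 200GB (persistent disks are billed on provisioned size) would be 200 GB × $0.04 = $8/month, plus the VM instance charges for the hours it runs during each sync.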
Hope this helps.

Related

Transferring millions of documents to an external hard drive

I have 13 million documents on Azure Blob Storage that I can azcopy to my desktop's internal drive within 24 hours. However, as soon as I try to transfer these files to my external hard drive, the time needed to complete the transfer jumps to 60 days. The files aren't large - each about 100 KB - so the entire transfer is about 1.3 TB. I have tried:
Zipping the files, transfer, unzip. Problem: Unzipping takes just as long
azcopy directly into the SSD hard drive
robocopy files from internal to external drive
Simple ctrl-c ctrl-v.
Each of the above options takes months to complete the transfer. Any ideas on how to speed this up? Why would azcopy be so much faster for an internal drive than an external one?
There could be several reasons for the performance issue.
You can run a performance benchmark test on specific blob containers or file shares to view general performance statistics and to identify performance bottlenecks. You can run the test by uploading or downloading generated test data.
Use the following command to run a performance benchmark test.
Syntax
azcopy benchmark 'https://<storage-account-name>.blob.core.windows.net/<container-name>'
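If I recall the flags correctly, the benchmark can also be shaped to resemble a workload of many ~100 KB files (double-check the exact flag names with azcopy benchmark --help):

azcopy benchmark 'https://<storage-account-name>.blob.core.windows.net/<container-name>' --file-count 50000 --size-per-file 100K --mode Upload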
Optimize the performance of AzCopy with Azure Storage
There are several options for transferring data to and from Azure, depending on your needs: Transfer data to and from Azure
Fast Data Transfer is a tool for fast upload of data into Azure – up to 4 terabytes per hour from a single client machine. It moves data from your premises to Blob Storage, to a clustered file system, or directly to an Azure VM. It can also move data between Azure regions.
The tool works by maximizing utilization of the network link. It efficiently uses all available bandwidth, even over long-distance links. On a 10 Gbps link, it reaches around 4 TB per hour, which makes it about 3 to 10 times faster than competing tools we’ve tested. On slower links, Fast Data Transfer typically achieves over 90% of the link’s theoretical maximum, while other tools may achieve substantially less.
For example, on a 250 Mbps link, the theoretical maximum throughput is about 100 GB per hour. Even with no other traffic on the link, other tools may achieve substantially less than that. In the same conditions (250 Mbps, with no competing traffic) Fast Data Transfer can be expected to transfer at least 90 GB per hour. (If there is competing traffic on the link, Fast Data Transfer will reduce its own throughput accordingly, in order to avoid disrupting your existing traffic.)
Fast Data Transfer runs on Windows and Linux. Its client-side portion is a command-line application that runs on-premises, on your own machine. A single client-side instance supports up to 10 Gbps. Its server-side portion runs on Azure VM(s) in your own subscription. Depending on the target speed, between 1 and 4 Azure VMs are required. An Azure Resource Manager template is supplied to automatically create the necessary VM(s).
Fast Data Transfer is a good fit when, for example:
Your files are very small (e.g. each file is only 10s of KB).
You have an ExpressRoute with private peering.
You want to throttle your transfers to use only a set amount of network bandwidth.
You want to load directly to the disk of a destination VM (or to a clustered file system). Most Azure data loading tools can’t send data direct to VMs. Tools such as Robocopy can, but they’re not designed for long-distance links. We have reports of Fast Data Transfer being over 10 times faster.
You are reading from spinning hard disks and want to minimize the overhead of seek times. In our testing, we were able to double disk read performance by following the tuning tips in Fast Data Transfer’s instructions.

How to access terabytes of data sitting in cloud quickly?

We have terabytes of data sitting on Google Cloud disks. Initially, since we were using Google Cloud VMs, we were doing development work in the cloud and were able to access the data.
Now we have bought our own servers where our application is running, and we are bringing the data to our local disks to be accessed by our application. The thing is, transferring the data, especially terabytes of it, over the network using scp is quite slow. Can anyone suggest a way to fix this issue?
What I am thinking is: isn't there a way we can keep a script running on the Google Cloud instance waiting for requests (it sends the requested data over HTTP!), and from the local server we request the data we need, one piece at a time?
I know this again happens over the network, but I think we can scale with this approach (though I could be wrong). It's a kind of client-server (1:1) layout, like the one used for interaction between a frontend and a backend. Any suggestions?
Would that be slow? Slower than bringing the data over using scp?
You could download the full VM disk and mount it on your servers, or download the disk, copy the data off it, and then delete the VM disk. In either case you should follow these steps:
Create a snapshot of your VM which will have all the data.
Build and export the VM image to your servers.
Run the image on your servers according to GCE requirements.
It would take a lot less time, since you're doing it on premises and avoiding network traffic.
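A hedged sketch of those steps with the gcloud CLI (the disk, snapshot, image, zone and bucket names are placeholders):

# 1. Snapshot the disk that holds the data
gcloud compute disks snapshot my-data-disk --snapshot-names=my-data-snapshot --zone=us-central1-a

# 2. Build an image from the snapshot and export it to Cloud Storage
gcloud compute images create my-data-image --source-snapshot=my-data-snapshot
gcloud compute images export --image=my-data-image --destination-uri=gs://my-bucket/my-data-image.tar.gz

# 3. Download the exported image to your own servers and import it into your local hypervisor
gsutil cp gs://my-bucket/my-data-image.tar.gz /var/images/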

Best way to store many small files from memory to cloud storage

I wish to store a lot of images from memory on the cloud. I'm looking for the best cloud storage option for these images where I would have a minimum of http calls to make.
I currently use Google Cloud Storage, but batch upload from memory is not possible. To use Google Cloud Storage I have to either use their client libraries and make one HTTP request per item I want to upload (very slow) or write all my files to disk (also very slow) to upload in batch.
What would be my best option?
Data Transfer Services could be the best choice. Whether you have access to a T1 line or a 10 Gbps network connection, Google offers solutions to meet your unique data transfer needs and get your data onto the cloud quickly and securely.
Fast offline, online, or cloud-to-cloud data transfer.

To save big files, why use Google Storage rather than Java project?

When building a file-based bulletin board with Java, why is it necessary to use Google Cloud Storage, a separate storage service, rather than just creating a folder within the Java project itself?
Is there any limit to how large a Java project can become?
Is there any size limit for a Java project to be uploaded to Google App Engine?
In App Engine, the local filesystem that your application is deployed to is not writable. The flexible environment does allow you to write to a directory on a local disk, but written files are ephemeral: the disk is initialized on each VM startup, so your files will be deleted whenever that happens. You can read more related detail on the "Choosing an App Engine Environment" online documentation page.
You can profit from the facilities offered by Compute Engine, if writing to disc is a must. By default, each Compute Engine instance has a single root persistent disk that contains the operating system, as stated on the "Storage Options" page. For more insight, you may implement the proposed example from the "Running the Java Bookshelf on Compute Engine" page.
The Compute Engine does not handle scaling up of your applications automatically, so you'll have to take care of it yourself, as suggested in the "Autoscaling Groups of Instances" document.
There is a maximum size limit of 5 TB for individual objects stored in Cloud Storage. An application is limited to 10,000 uploaded files per version. Each file is limited to a maximum size of 32 megabytes. You may find related details on the "Quotas | App Engine Documentation | Google Cloud" page.

Is there some kind of persistent local storage in aws sagemaker model training?

I did some experimentation with AWS SageMaker, and the download time of large datasets from S3 is very problematic, especially when the model is still in development and you want some kind of initial feedback relatively fast.
Is there some kind of local storage, or another way to speed things up?
EDIT
I am referring to the batch training service, which allows you to submit a job as a Docker container.
While this service is intended for already-validated jobs that typically run for a long time (which makes the download time less significant), there is still a need for quick feedback:
There's no other way to do the "integration" testing of your job with the SageMaker infrastructure (configuration files, data files, etc.)
When experimenting with different variations of the model, it's important to be able to get initial feedback relatively fast
SageMaker has a few distinct services in it, and each is optimized for a specific use case. If you are talking about the development environment, you are probably using the notebook service. The notebook instance comes with a local EBS volume (5 GB by default) that you can use to copy some data into and run fast development iterations without copying the data from S3 every time. The way to do it is by running wget or aws s3 cp from the notebook cells or from the terminal that you can open from the directory list page.
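For example, from the notebook terminal you could pull a small sample of the dataset onto the instance volume (the bucket and prefix are placeholders; /home/ec2-user/SageMaker is the directory backed by the notebook's EBS volume):

aws s3 cp s3://my-bucket/dataset/sample/ /home/ec2-user/SageMaker/data/sample/ --recursive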
Nevertheless, it is not recommended to copy too much data into the notebook instance, as it will cause your training and experiments to take too long. Instead, you should utilize the second part of SageMaker, which is the training service. Once you have a good sense of the model that you want to train, based on the quick iterations of the small datasets on the notebook instance, you can point your model definition to go over larger datasets in parallel across a cluster of training instances. When you are sending a training job, you can also define how much local storage will be used by each training instance, but you will most benefit from the distributed mode of the training.
When you want to optimize your training job you have a few options for the storage. First, you can define the size of the EBS volume that you want your model to train on, for each one of the cluster instances. You can specify it when you launch the training Job (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html ):
...
"ResourceConfig": {
"InstanceCount": number,
"InstanceType": "string",
"VolumeKmsKeyId": "string",
"VolumeSizeInGB": number
},
...
Next, you need to decide what kind of models you want to train. If you are training your own models, you know how they get their data, in terms of format, compression, source, and other factors that can impact the performance of loading that data into the model input. If you prefer to use the built-in algorithms that SageMaker provides, note that they are optimized to process data in protobuf RecordIO format. See more information here: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
Another aspect that you can benefit from (or learn if you want to implement your own models in a more scalable and optimized way) is the TrainingInputMode (https://docs.aws.amazon.com/sagemaker/latest/dg/API_AlgorithmSpecification.html#SageMaker-Type-AlgorithmSpecification-TrainingInputMode):
Type: String
Valid Values: Pipe | File
Required: Yes
You can use the File mode to read the data files from S3. However, you can also use the Pipe mode which opens up a lot of options to process data in a streaming mode. It doesn't mean only real-time data, using streaming services such as AWS Kinesis or Kafka, but also you can read your data from S3 and stream it to the models, and completely avoid the need to store the data locally on the training instances.
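As a sketch, both the volume size and the input mode are set when you launch the job, e.g. from the AWS CLI (the job name, image URI, role ARN, bucket paths and sizes are placeholders):

aws sagemaker create-training-job \
    --training-job-name my-experiment-001 \
    --algorithm-specification TrainingImage=<ecr-image-uri>,TrainingInputMode=Pipe \
    --role-arn <execution-role-arn> \
    --resource-config InstanceType=ml.m5.xlarge,InstanceCount=2,VolumeSizeInGB=50 \
    --stopping-condition MaxRuntimeInSeconds=3600 \
    --output-data-config S3OutputPath=s3://my-bucket/output/ \
    --input-data-config '[{"ChannelName":"training","DataSource":{"S3DataSource":{"S3DataType":"S3Prefix","S3Uri":"s3://my-bucket/train/","S3DataDistributionType":"FullyReplicated"}}}]'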
Customize your notebook volume size, up to 16 TB, with Amazon SageMaker
Amazon SageMaker now allows you to customize the notebook storage volume when you need to store larger amounts of data.
Allocating the right storage volume for your notebook instance is important while you develop machine learning models. You can use the storage volume to locally process a large dataset or to temporarily store other data to work with.
Every notebook instance you create with Amazon SageMaker comes with a default storage volume of 5 GB. You can choose any size between 5 GB and 16384 GB, in 1 GB increments.
When you create a notebook instance using the Amazon SageMaker console, you can define the storage volume by following the steps in that announcement.
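The same setting is available from the CLI when creating the instance; a minimal sketch (the name, role ARN and size are placeholders):

aws sagemaker create-notebook-instance \
    --notebook-instance-name my-notebook \
    --instance-type ml.t3.medium \
    --role-arn <execution-role-arn> \
    --volume-size-in-gb 50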
