PDF Compression from blob storage - azure-logic-apps

Hello, I am new to Azure Logic Apps. Currently I am running into a problem with PDF compression.
My problem is that there are a few files already stored in a Data Lake, and I want to check their size; if a file exceeds 20 MB, I need to compress it and replace the original file with the compressed file in the Data Lake.
First I am fetching the file from the Data Lake and getting its content. Then I am getting the metadata of the file and extracting the size from it. If the size is greater than 20 MB, I compress the file.
I am using the list of file sizes in the "True" branch. My file sizes are usually more than 150 MB, and I suspect this is the prime factor in the compressor failing.

You can use a third-party connector called Compress PDF document from the Plumsail connector to do this. All you need to do is sign in to Plumsail and get the API key for creating the connection.
In the Logic App flow, per your requirement, we used the When a blob is added or modified (properties only) (V2) trigger and checked whether the blob's size is greater than 20 MB using a Condition. If yes, we use Plumsail's Compress PDF document action, and then, depending on your requirement, you can save the file in the same folder with a different name or save it in another folder of the same storage account. For instance, I am saving it in a container called documents >20mb. Here are a few screenshots of my logic app for your reference:
Result:
In Storage Account before Compression
In Storage Account after Compression
Scenario where the compressed file is still more than 20 MB
If the file is so large that it is still greater than 20 MB even after compression, you can save it in the same folder under a different name and then delete it once the requirement is completed.
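For reference, here is a minimal C# sketch of the same size check done in code rather than in the Logic App Condition, assuming the Azure.Storage.Blobs SDK and placeholder connection string, container and blob names; the Condition in the flow compares the blob's size property against the same threshold expressed in bytes:
// Illustrative only: the ">20 MB" check against a blob's properties.
// Connection string, container and blob names are placeholders.
using Azure.Storage.Blobs;

const long threshold = 20L * 1024 * 1024; // 20 MB in bytes (adjust if you mean decimal MB)
string connectionString = "<storage-connection-string>";

var blob = new BlobClient(connectionString, "documents", "report.pdf");
long size = blob.GetProperties().Value.ContentLength;

if (size > threshold)
{
    // hand the file to the compression step (e.g. the Plumsail action)
    // and overwrite or re-save it afterwards, as described above
}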

Related

How to check record count in a csv file uploaded in azure blob storage?

So I am uploading a 2 GB CSV file to my blob storage, and I want the record count (number of rows) of this file so that I can validate it after it gets loaded to ADW. Is there any way to get the record count (like a column count) in Azure itself?
Thanks in advance
Azure Blobs are not like local files: You'd have to download (or stream) your blob to something that works through the file to perform any calculation you're trying to do.
Alternatively, you could mount your blob storage to something like Databricks (Spark cluster) and write your code there (same basic concept).
Or... you could do your record counts prior to (or during) your upload to blob storage.
Ultimately, how you perform this counting is really up to you. Blob storage is just bulk storage and knows nothing about file formats.
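If you do go the download/stream route from .NET, a minimal sketch (assuming the Azure.Storage.Blobs SDK and placeholder names) could stream the blob and count lines without ever holding the 2 GB file in memory:
// Hypothetical sketch: stream a large CSV blob and count its rows
// without downloading it to a local file first. Names are placeholders.
using Azure.Storage.Blobs;
using System;
using System.IO;

string connectionString = "<storage-connection-string>";
var blob = new BlobClient(connectionString, "uploads", "big-file.csv");

long rowCount = 0;
using (Stream stream = blob.OpenRead())
using (var reader = new StreamReader(stream))
{
    while (reader.ReadLine() != null)
        rowCount++;
}

// Subtract 1 if the file has a header row. Note this counts physical lines;
// quoted fields containing newlines would need a real CSV parser.
Console.WriteLine($"Rows: {rowCount}");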

Split large SQL backup

I have a large SQL database (~1TB) that I'm trying to backup.
I can back it up fine but we want to store it offsite on Amazon S3, where the maximum object size is 5GB.
I thought I could split it by using multiple files, but it seems the maximum is 64 so I'm still ending up with 16GB chunks which are too big for S3.
Is there any other way to do it?
The maximum blob size for S3 is 5TB, not 5GB. 5GB is only the largest object that can be uploaded with a single HTTP PUT.
All cloud providers follow the same pattern: instead of uploading one huge file and storing it as a single blob, they break it apart into blocks that they replicate across many disks. When you ask for data, the provider retrieves it from all these blocks. To the client though, the blob appears as a single object.
Uploading a large file requires blocks too. Instead of uploading a large file with a single upload operation (HTTP PUT) all providers require that you upload individual blocks and finally notify the provider that these blocks constitute one object. This way, you can re-upload only a single failed block in case of failure, the provider can commit each block while you send the next, they don't have to track and lock a large blob (on a large disk) waiting for you to finish uploading etc.
In your case, you'll have to use an uploader that understands cloud storage and uses multiple blocks, perhaps something like Cyberduck, or S3 specific command-line tools. Or write a utility that uses Amazon's SDK to upload the backup file in parts.
Amazon's documentation site offers examples for multipart uploads at Uploading Objects Using Multipart Upload API. The high-level examples demonstrate various ways to upload a large file. All of them use multipart uploads under the hood, e.g. the simplest call:
// TransferUtility switches to multipart upload automatically for large files.
var client = new AmazonS3Client(Amazon.RegionEndpoint.USEast1);
var fileTransferUtility = new TransferUtility(client);
fileTransferUtility.Upload(filePath, existingBucketName);
will upload the file using multiple parts and use the file's path as its key. The most advanced example lets you specify the part size, a different key, redundancy options, etc.:
// Fine-grained control: explicit part size, key, storage class, ACL and metadata.
var fileTransferUtilityRequest = new TransferUtilityUploadRequest
{
    BucketName = existingBucketName,
    FilePath = filePath,
    StorageClass = S3StorageClass.ReducedRedundancy,
    PartSize = 6291456, // 6 MB.
    Key = keyName,
    CannedACL = S3CannedACL.PublicRead
};
fileTransferUtilityRequest.Metadata.Add("param1", "Value1");
fileTransferUtilityRequest.Metadata.Add("param2", "Value2");
fileTransferUtility.Upload(fileTransferUtilityRequest);
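For completeness, here is a rough sketch of the low-level flow the prose above describes (initiate, upload parts, commit), following the same AWS SDK for .NET documentation pattern and continuing with the client and variables from the snippets above; the part size is illustrative and error handling (including AbortMultipartUpload on failure) is omitted:
// Sketch of the explicit initiate / upload-parts / complete flow.
var initResponse = client.InitiateMultipartUpload(new InitiateMultipartUploadRequest
{
    BucketName = existingBucketName,
    Key = keyName
});

var partResponses = new List<UploadPartResponse>();
long contentLength = new FileInfo(filePath).Length;
long partSize = 100 * 1024 * 1024; // 100 MB parts (S3 minimum is 5 MB, except the last part)
long filePosition = 0;

for (int partNumber = 1; filePosition < contentLength; partNumber++)
{
    partResponses.Add(client.UploadPart(new UploadPartRequest
    {
        BucketName = existingBucketName,
        Key = keyName,
        UploadId = initResponse.UploadId,
        PartNumber = partNumber,
        PartSize = partSize,
        FilePosition = filePosition,
        FilePath = filePath
    }));
    filePosition += partSize;
}

var completeRequest = new CompleteMultipartUploadRequest
{
    BucketName = existingBucketName,
    Key = keyName,
    UploadId = initResponse.UploadId
};
completeRequest.AddPartETags(partResponses);
client.CompleteMultipartUpload(completeRequest);
The final CompleteMultipartUpload call is the "notify the provider that these blocks constitute one object" step described earlier; until it succeeds, no single large object exists in the bucket.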

Storing images in MSSQL vs Disk

I am developing an inhouse application in C#.NET, based on an MSSQL database.
Users can upload images linked to a specific entity. These images have usually around 4 to 5 MB and cannot be compressed for storage.
Right now these images are stored in the filesystem where the application is. There are lag issues when printing a subset of images, maybe 30. They are filtered via the DB and loaded via the locationstring I get from the DB, and then
System.Drawing.Image.FromFile(locationstring)
I can (and have to) resize the images for printing (400x255px), for viewing them in the application they need to have the original size and quality.
I am supposed to redesign the way pictures are stored and implement a feature where the user can change the entity an image is linked to (that one I already figured out).
I rename the images when they are stored via the application - they get a GUID as name - and the original names and which entity they belong to are stored in the DB.
I was thinking about storing these images in the DB instead of the filesystem. They are in a directory where the users do not have access, so it wouldn't make any difference to them.
Does it make sense to store the images in the DB instead of the filesystem in this case?
Right now the application DB is MSSQL 2008 R2 but can be switched to MSSQL 2014. The MSSQL server and filesystem are hosted externally and accessed via Citrix. The application is also hosted via Citrix.
You should store the images on disk and store their paths in the database.
If you want to resize an image (using your C# code), put the resized copy into another folder and store its path in the DB as well. A sketch of this approach follows the links below.
For more detail you can see
Storing Images in DB - Yea or Nay?
PHP to store images in MySQL or not?
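As referenced above, here is a minimal sketch of that approach, assuming the GUID-naming scheme from the question and hypothetical folder paths: save the original and a 400x255 print copy to disk, and record only the paths (plus the original file name and entity id) in the database.
// Hypothetical sketch: store the original and a resized print copy on disk,
// keep only paths/metadata in the database. Paths and sizes are placeholders.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

string originalsDir = @"\\fileserver\images\originals";
string printDir = @"\\fileserver\images\print";

string StoreImage(string uploadedFile)
{
    string id = Guid.NewGuid().ToString("N");
    string originalPath = Path.Combine(originalsDir, id + Path.GetExtension(uploadedFile));
    File.Copy(uploadedFile, originalPath);

    // Pre-generate the 400x255 print version so printing 30 images
    // doesn't have to load and resize 30 full-size files on the fly.
    string printPath = Path.Combine(printDir, id + ".jpg");
    using (var source = Image.FromFile(originalPath))
    using (var print = new Bitmap(source, new Size(400, 255)))
    {
        print.Save(printPath, ImageFormat.Jpeg);
    }

    // INSERT id, original file name, entity id, originalPath and printPath
    // into the database here (e.g. with SqlCommand / your data layer).
    return id;
}
Pre-generating the print copies is also a cheap way to attack the printing lag, independent of whether the bytes end up in the DB or on disk.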

Importing and parsing a large CSV file with go and app engine's datastore

Locally I am successfully able to (in a task):
Open the csv
Scan through each line (using Scanner.Scan)
Map the parsed CSV line to my desired struct
Save the struct to datastore
I see that blobstore has a reader that would allow me to read the value directly using a streaming file-like interface -- but that seems to have a limit of 32MB. I also see there's a bulk upload tool -- bulk_uploader.py -- but it won't do all the data massaging I require, and I'd like to limit the writes (and really the cost) of this bulk insert.
How would one effectively read and parse a very large (500mb+) csv file without the benefit of reading from local storage?
You will need to look at the following options and see which works for you:
Looking at the large file size, you should consider using Google Cloud Storage for the file. You can use the command line utilities that GCS provides to upload your file to your bucket. Once uploaded, you can look at using the JSON API directly to work with the file and import it into your datastore layer. Take a look at the following: https://developers.google.com/storage/docs/json_api/v1/json-api-go-samples
If this is like a one time import of a large file, another option could be spinning up a Google Compute VM, writing an App there to read from GCS and pass on the data via smaller chunks to a Service running in App Engine Go, that can then accept and persist the data.
Not the solution I hoped for, but I ended up splitting the large files into 32MB pieces, uploading each to blob storage, then parsing each in a task.
It ain't pretty. But it took less time than the other options.

Uploading Large Amounts of Data from C# Windows Service to Azure Blobs

Can someone please point me in the right direction?
I need to create a Windows timer service that will upload files in the local file system to Azure blobs.
Each file (video) may be anywhere between 2GB and 16GB. Is there a limit on the size? Do I need to split the file?
Because the files are very large, can I throttle the upload speed to Azure?
Is it possible in another application (WPF) to see the progress of the uploaded file? I.e. a progress bar showing how much data has been transferred and at what speed it is transferring?
The upper limit for a block blob, the type you want here, is 200GB. Page blobs, used for VHDs, can go up to 1TB.
Block blobs are so called because upload is a two-step process - upload a set of blocks and then commit that block list. Client APIs can hide some of this complexity. Since you want to control the uploads and keep track of their status you should look at uploading the files in blocks - the maximum size of which is 4MB - and manage that flow and success as desired. At the end of the upload you commit the block list.
Kevin Williamson, who has done a number of spectacular blog posts, has a post showing how to do "Asynchronous Parallel Blob Transfers with Progress Change Notification 2.0."
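As a rough sketch of that block-upload-then-commit flow, using the current Azure.Storage.Blobs SDK (BlockBlobClient); the names, the 4 MB block size and the progress reporting are illustrative, and in practice you would add the retries and parallelism covered in the linked post:
// Sketch: upload a large file as staged blocks, then commit the block list.
// Connection string, container/blob names and paths are placeholders.
using Azure.Storage.Blobs.Specialized;
using System;
using System.Collections.Generic;
using System.IO;

string connectionString = "<storage-connection-string>";
string localPath = @"C:\videos\movie.mp4";

var blob = new BlockBlobClient(connectionString, "videos", "movie.mp4");
const int blockSize = 4 * 1024 * 1024; // 4 MB blocks
var blockIds = new List<string>();

using (var file = File.OpenRead(localPath))
{
    var buffer = new byte[blockSize];
    int read, blockNumber = 0;
    while ((read = file.Read(buffer, 0, blockSize)) > 0)
    {
        // Block ids must be base64 strings of equal length within one blob.
        string blockId = Convert.ToBase64String(BitConverter.GetBytes(blockNumber++));
        using (var chunk = new MemoryStream(buffer, 0, read))
        {
            blob.StageBlock(blockId, chunk); // upload one block
        }
        blockIds.Add(blockId);

        // Progress/throttling hook: report bytes sent so far, sleep to cap speed, etc.
        Console.WriteLine($"Uploaded {file.Position} of {file.Length} bytes");
    }
}

blob.CommitBlockList(blockIds); // nothing is visible as a blob until this commit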
