Snowflake PUT command, AUTO_COMPRESS vs gzip compressed file performance - snowflake-cloud-data-platform

Can someone suggest which of below option will be more performant with PUT command:
Uploading file with AUTO_COMPRESS=true.
Uploading compressed file(gzip) AUTO_COMPRESS=false.

There's no harm to leaving AUTO_COMPRESS=true because if a file is already compressed, the PUT command won't try to double compress it. There is an important caveat to note though. If a file is already compressed, it must be compressed in a supported compression method. You can get a list of supported methods here: https://docs.snowflake.com/en/sql-reference/sql/put.html
Using compression either before or auto_compress is advisable since it will reduce network transfer times and bandwidth consumption. This will use CPU and IO on the server doing the PUT operation. If the server doing the PUT is maxed out (I've seen some cases of VMs on oversubscribed systems for example), it would be better to perform the compression before sending to the machine doing the PUT. This is because there's already a lot of CPU and IO on the PUT operation because it's encrypting the files prior to upload.

Related

rsync on fat32 and ntfs

A little background: I have tried to use rsync to backup my wife's home directory to an external usb drive with the command
rsync -va /home/wife /run/media/wife
but kept getting error messages that mkstemp failed, and that rsync failed to set times, becuase of a read-only filesystem. Worse, it seems that rsync is unable to tell when files don't need syncing, and winds up copying a lot of stuff it doesn't need to, resulting in rediculously slow backup times.
So I tried using rsync -rtvO instead, based on this guy's advice. Okay, no more warnings, but the backups still seem too slow, and esp on big media files that already exists -- i.e. it's still copying stuff unnecessarily.
Is my analysis correct?
Is there a workaround?
Will the problem be fixed if I use an NTFS drive for here backups?
I could of course use a linux filesytem, but on rare occasions she would like to be able to take the drive to work and access it from the Windows machines there.
Try using --modify-window=1
In particular, when transferring to or from an MS Windows FAT
filesystem (which represents times with a 2-second resolution),
--modify-window=1 is useful (allowing times to differ by up to 1 second).
https://download.samba.org/pub/rsync/rsync.html
You could also try using --size-only
skip files that match in size
For rsync to FAT, this is what I use and it seems to work pretty well:
rsync -rtv --modify-window=1 source/ destination/
Source: https://serverfault.com/a/144475/58568

Adding Capability to NFS Server - Compressing/Decompressing Stored/Retrieved Files

I need to build a custom Suse Linux NFS Server that does compression on certain files that are stored on the disk, and decompresses files as they are read from the disk. This needs to be transparent to the remote users of the file system, meaning that if a user saves a 10MB file named XYZZY.tif on /archiveDirectoryOnNFSServer, that when they do a ls -l on that mounted directory, they will see a 10MB file called XYZZY.tif, even though the actual file stored on the disk on the NFS server will be XYZZY.tif.compressed, and it will be 2MB in size.
I'm expecting that I need to build this as a driver that sits below the NFS Server software stack, but, I'm having difficulty finding where to start. Are there existing NFS Servers that provide this level of customization through APIs? Will I need to modify source of an open source NFS Server, and, if so, is there one that would be easiest to start with, and are they modularly structured such that this will be straight forward? I'm having difficult locating relevant content on the internet, and any pointers will be greatly appreciated.
IMO that kind of functionality is absolutely not the NFS server's responsibility (an nfs server should, well, serve files over nfs), but the underlying filesystem's. However, there's not that much choice in Linuxland but you could start by checking out fusecompress and btrfs.
This post is a bit old so you may already be aware of some options here, but there are a couple others (both for server side).
http://zfsonlinux.org/
zfs filesystem has built-in compression. I typically use lzjb as it is the fastest compression algorithm and does a reasonable job (MySQL DB's get 2-4x compression, filesystems with non-compressed data get around 4). you have a choice of algorithm depending on how much CPU time you wish to offer the compression.
if you want different file types compressed then you may consider laying gluster on top of a set of zfs filesystems.
gluster will allow you to store certain file types (by extension) on different underlying filesystems.
in this case, you specify the underlying filesystem as a zfs volume with the particular options you need (for example, .zip and .png go on an uncompressed filesystem, while things you write once and read many like static html files might go on a higher compression--you'll pay once when it's written but reads should be really fast since it scans fewer disk blocks and decompression is very fast)
zfs will manage the nfs mounts if you use it as your nfs server--you wont want this if you lay gluster on top.
it's easy to specify dynamically other attributes per filesystem (atime/noatime, # of copies if you want redundancy other than your normal raid, you can add SSD's as cache devices to get more performance).
in these solutions, you still send the full uncompressed files over the wire, so it doesn't make up for network performance but gives a lot of options if you're trying to speed up Disk IO or get more utilization out of your drives.

Garbage data in file after sudden power loss

I am using a flash with FAT32 system. I am continuosly writing data to a file using file system APIs from rtos(SMX). However, after sudden poweroffs, the file contains garbage values just above the first file entry on system reboot.
I run chkdsk utility, but it doesn't fix any problem.
Any idea how can i get rid of these garbage entries even on unclean power offs?
If you expect sudden power loss, you'll need to disable all caching/buffering on file writes. Of course you'll also need to deal with partially-written files, but that should at least prevent trailing garbage.
I don't know the API you're using, but this might be done by mounting the drive "synchronously" (e.g., mount -o sync in Linux) or by opening individual files with specific options. If you do disable buffering on individual file writes, you may still run the risk of corrupting the FAT, however, and losing all the files.

Premptively getting files into Windows page cache

I have a program written in C that allows the user to scroll through a display of about a zillion small files. Each file needs to undergo a certain amount of processing (read only) before it's displayed to the user. I've implemented a buffer that preprocesses the files in a certain radius around the user's position, so if they're working linearly through them, there's not much delay. For various reasons, I can only actually run my processing algorithm on one file at a time (though I can have multiple files open, and I can read from them) so my buffer loads sequentially.
My processing algorithms are as optimized as they're going to get, but I'm running into I/O problems. At first, my loading process is slow, but when the files have been accessed a few times, it speeds up by about 5x. Therefore I strongly suspect that what's slowing me down is waiting for the Windows page cache to pull my files into memory. I know very little about that sort of thing. If I could ensure my files were in the cache before my processing algorithm needed them, I'd be in business.
My question is: is there a way to persuade/cajole/trick/intimidate Windows into loading my files into the page cache before I actually get around to reading/processing them?
There's only one way to get a file into the file system cache: reading it. That's a chicken-and-egg problem. You can get the egg first by using a helper thread that reads files. It would have to have some kind of smarts about what file is likely to be next. And not read too much.
On a POSIX system, you'd use posix_fadvise:
POSIX_FADV_WILLNEED
        Specifies that the application expects to access the specified data in the near future.
However, that doesn't seem to exist on Windows. What is fadvise/madvise equivalent on windows ? - Stack Overflow has some alternatives.

Read data from damaged media

Is it possible to read damaged media (cd, hdd, dvd,...) even if windows explorer bombs out?
What I mean to ask is, whether there is a set of APIs or something that can access the disk at a very low level (below explorer?) and read whatever can be retrieved even if it is only partial, especially if you can still see the file is there from explorer, but can't do anything with it because it is damaged somehow (scratch on cd, etc)?
The main problem with Windows Explorer is that it doesn't support resuming copying after a read error. Most superficially scratched CDs, for example, will fail on different areas of the disk every time you eject and reinsert them.
Therefore, with a utility that supports resuming copy operations, it is possible to read the entire contents of a damaged CD with by doing "eject/reload/resume" a few times.
In fact, this is what a utility I wrote does, and I've never needed anything fancier to read scratched disks. (It simply uses ReadFile and WriteFile.)
One step lower would be opening the raw partition (i.e. disk image) by passing a string such as "\.\F:" (note: slashes are literal here) to CreateFile. It would allow you to read raw sectors from a drive, but reconstructing files from that data would be hard.
In fact, the "\.\" syntax allows you to open devices in the "\GLOBAL??" branch of the Windows Object Manager namespace as if they were files. It's not unlike calling dd with /dev/x as a parameter. There is also a "\Device" branch, but that's only accessible via DeviceIoControl() (i.e. ioctl()), meaning there's no simple ReadFile()/WriteFile() interface.
Anything lower level than that would be device-specific, I guess; like reading raw CD-ROM data (including ECC bits) the way some CD-burning programs do. You'd have to do some research on the specific media (CD, flash, DVD) and what your hardware allows you to do on them.
Note: The backslashes seem to get lost on the way to the web page; you need to pass "backslash backslash dot backslash DeviceName" to CreateFile. You need to escape them, too, of course.
If you want to do it, do it from the Linux side - see: http://sourceforge.net/projects/monkeycity/ opensource
or ready made app and freeware too: http://www.theabsolute.net/sware/dskinv.html
the first step is dd_rescue. After that, you're free to try anything to reconstruct the data.
And there's GNU ddrescue
GNU ddrescue is a data recovery tool. It copies data from one file or block device (hard disc, cdrom, etc) to another, trying to rescue the good parts first in case of read errors.
Make sure to use the 3-arg version (manual):
ddrescue [options] infile outfile [mapfile]
That is, do use a mapfile even if it's optional, because:
If you use the mapfile feature of ddrescue, the data is rescued very efficiently, (only the needed blocks are read). Also you can interrupt the rescue at any time and resume it later at the same point. The mapfile is an essential part of ddrescue's effectiveness. Use it unless you know what you are doing.
And it's also included in Cygwin and Homebrew.
I don't know what layer exists between Windows Explorer and the Win32 APIs. You can try to write a program with the Win32 File I/O stuff. If that doesn't work, then you have to write your own device driver to get any lower.
I've had some luck from the linux side, or using BartPE (http://www.nu2.nu/pebuilder/), but just seeing the file doesn't always mean the file is going to be recoverable, whether you're trying from Windows or Linux. You're best bet might be to use a trial of a recovery program.
I have had two disks start to disintegrate on me. From the pattern of unreadable sectors I think they had internal flaking of their emulsion. WinXP Explorer just threw up its hands and said the drive didn't even exist.
In both cases I used "GetDataBack for NTFS" from Runtime Software (http://www.runtime.org/). You can download a free trial which will show you what you could get back if you paid for it. When I bought it it was $49, but I see it is now $79.
This program is amazing. It's not necessarily fast as it will reread some sectors over and over, trying to get a consensus value from multiple tries, but when it's done you can get back stuff that you thought was gone forever. I had one drive that it took over 10 hours to analyze, but when it was done I got back over 97% of a 500GB drive. Definitely worth the price.
Another great tool is Beyond Compare. I have rev 2.5.3, but it is currently at 3.?? and costs $30. They have a full-functionality, 30-day trail. It does a great job of copying large quantities of files (and only those that need to be copied) and, unlike Explorer, it doesn't blow up if something fails. It's sort of like a visual rsync for Windows, if you're familiar with that program from the Samba people.
I have no connection with either of the comapnies mentioned other than being a very satisfied customer.
The gold standard for recovering data from a magnetic storage device would have to be SpinRite. It's a commerical app though, so you probably wouldn't learn much from it.
If you have a Linux machine around, I can recommend dvdisaster. It is originally meant for creating error correction files, but it also reads DVDs into an image and ignores read errors; and you can use different drives one after another to get missing sectors filled in the image.

Resources