Adding Capability to NFS Server - Compressing/Decompressing Stored/Retrieved Files - filesystems

I need to build a custom SUSE Linux NFS server that compresses certain files as they are stored on disk and decompresses them as they are read from disk. This needs to be transparent to the remote users of the file system: if a user saves a 10 MB file named XYZZY.tif to /archiveDirectoryOnNFSServer, then an ls -l on that mounted directory should show a 10 MB file called XYZZY.tif, even though the actual file stored on the NFS server's disk is XYZZY.tif.compressed and is only 2 MB in size.
I'm expecting that I need to build this as a driver that sits below the NFS server software stack, but I'm having difficulty finding where to start. Are there existing NFS servers that provide this level of customization through APIs? Will I need to modify the source of an open-source NFS server, and, if so, is there one that would be easiest to start with, and are they modularly structured such that this will be straightforward? I'm having difficulty locating relevant content on the internet, and any pointers will be greatly appreciated.

IMO that kind of functionality is absolutely not the NFS server's responsibility (an NFS server should, well, serve files over NFS) but the underlying filesystem's. However, there isn't that much choice in Linux-land; you could start by checking out fusecompress and btrfs.

This post is a bit old, so you may already be aware of some options here, but there are a couple of others (both server-side).
http://zfsonlinux.org/
The ZFS filesystem has built-in compression. I typically use lzjb, as it is the fastest compression algorithm and does a reasonable job (MySQL DBs get 2-4x compression; filesystems with non-compressed data get around 4x). You have a choice of algorithm depending on how much CPU time you wish to spend on compression.
If you want different file types compressed differently, you might consider layering Gluster on top of a set of ZFS filesystems.
Gluster will allow you to store certain file types (by extension) on different underlying filesystems.
In that case, you specify the underlying filesystem as a ZFS volume with the particular options you need. For example, .zip and .png files go on an uncompressed filesystem, while things you write once and read many times, like static HTML files, might go on a filesystem with higher compression: you pay once when the file is written, but reads should be really fast, since fewer disk blocks are scanned and decompression is very fast.
ZFS will manage the NFS mounts if you use it as your NFS server; you won't want this if you lay Gluster on top.
It's easy to specify other attributes dynamically per filesystem (atime/noatime, the number of copies if you want redundancy beyond your normal RAID; you can also add SSDs as cache devices to get more performance).
With these solutions you still send the full uncompressed files over the wire, so they don't help network performance, but they give you a lot of options if you're trying to speed up disk I/O or get more utilization out of your drives.

Related

Scan a USB for folders which have MP3 files in them using ELM CHAN FatFs?

I am trying to scan a USB MSC (mass storage) device on an STM32 for audio files. These MP3 files are scattered across many folders which are unknown to the application.
First I scan the root folder for directories, and then I scan each folder for MP3 files.
This is very time-consuming for a depth of 8 folders with many files in each folder.
Is there any way to scan just the folders which have MP3 files in them, using a better approach?
Directory structure for testing is something like this:
It is not clear what your problem might be, since you have not provided any code to show how you are scanning, or quantitative information on the file/folder structure ("many files" is rather vague), or even specified the media type used.
However, a solution that might overcome all the variables of filesystem performance, hardware I/O, driver implementation and media type, and make access more deterministic regardless, is to maintain a separate index file or database (a single file in the root directory) that maps each MP3 file to its path. You then need only search the index/database for the MP3 you want, or use it to list all MP3s directly without scanning the file system.
If you keep that file sorted (or keep a separate, sorted index file), then you can use a binary search to find a specific file. Or simply use a real database, though that might be a rather heavyweight solution for this purpose. You might even load the metadata into memory for even faster access, and write it to the filesystem only if it changes.
Either way, the solution I suggest is to isolate your application from the variability of the filesystem/media, and from FAT's general lack of scalability, by maintaining your own "metadata" file(s) indicating what is stored and where. You can then use that to access the files directly, without scanning the file system using findfirst/findnext semantics and recursion, which is always best avoided but is the obvious way to scan a directory tree.
Incidentally, this is precisely how iTunes works, for example. The "iTunes Library.xml" file contains metadata about "songs", including their location. Clearly you need not have anything quite that detailed, but the principle is the same, and there may be merit in using XML or JSON for your application, given a suitable library for updating and accessing such a file.
By doing that, the performance is more directly within your control rather than dependent on the filesystem, media and/or device-driver level. You still have some control/responsibility over the media and its interface (SPI, SDIO, USB or whatever) and the device I/O layer (DMA, interrupts, polled, bit-banged), and while you may have little control over the choice of FAT and the ELM FatFs implementation, you can certainly impact its performance greatly at the device-driver, hardware-interface and physical-media level.
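To make the index idea concrete, here is a minimal sketch using the FatFs API (f_opendir/f_readdir for the one-off build pass, f_printf for writing the index). The function names, the index path "0:/mp3index.txt", and the assumption that LFN and the string functions (f_printf/f_gets) are enabled in ffconf.h are illustrative choices, not requirements of FatFs:

/* Sketch only: build a flat MP3 index once, then read it instead of re-scanning. */
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include "ff.h"

#define INDEX_PATH "0:/mp3index.txt"   /* illustrative location */

/* One full recursive pass to (re)build the index file. */
static FRESULT scan_dir(const char *path, FIL *index)
{
    DIR dir;
    FILINFO fno;
    FRESULT res = f_opendir(&dir, path);
    if (res != FR_OK) return res;

    for (;;) {
        res = f_readdir(&dir, &fno);
        if (res != FR_OK || fno.fname[0] == 0) break;     /* error or end of directory */

        char full[256];
        snprintf(full, sizeof full, "%s/%s", path, fno.fname);

        if (fno.fattrib & AM_DIR) {
            res = scan_dir(full, index);                  /* recurse (watch stack depth) */
            if (res != FR_OK) break;
        } else {
            const char *ext = strrchr(fno.fname, '.');
            if (ext && strcasecmp(ext, ".mp3") == 0)
                f_printf(index, "%s\n", full);            /* one full path per line */
        }
    }
    f_closedir(&dir);
    return res;
}

FRESULT build_mp3_index(void)
{
    FIL index;
    FRESULT res = f_open(&index, INDEX_PATH, FA_WRITE | FA_CREATE_ALWAYS);
    if (res != FR_OK) return res;
    res = scan_dir("0:", &index);
    f_close(&index);
    return res;
}

/* At run time, open INDEX_PATH and read it line by line with f_gets() to get the
 * MP3 paths directly, instead of walking the directory tree again. */

Keep the index sorted, or rebuild it only when the media contents change, and a binary search or an in-memory table then replaces the whole directory walk at playback time.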

Snowflake PUT command, AUTO_COMPRESS vs gzip compressed file performance

Can someone suggest which of the options below will be more performant with the PUT command:
Uploading the file with AUTO_COMPRESS=true.
Uploading an already gzip-compressed file with AUTO_COMPRESS=false.
There's no harm in leaving AUTO_COMPRESS=true, because if a file is already compressed, the PUT command won't try to double-compress it. There is an important caveat to note, though: if a file is already compressed, it must be compressed with a supported compression method. You can get a list of supported methods here: https://docs.snowflake.com/en/sql-reference/sql/put.html
Compressing, whether beforehand or via AUTO_COMPRESS, is advisable since it reduces network transfer times and bandwidth consumption. It will, however, use CPU and I/O on the machine doing the PUT operation. If that machine is maxed out (I've seen cases of VMs on oversubscribed systems, for example), it would be better to perform the compression before sending the files to the machine doing the PUT, because the PUT operation already incurs a lot of CPU and I/O from encrypting the files prior to upload.

mkstemp and hard disk stress

Are temporary files created with mkstemp synced to disk?
Here is what I have:
A program creates a temporary file using mkstemp and sends the fd to another program.
This temporary file is mmapped by both programs and used heavily (up to 400 MB/sec of writes and 400 MB/sec of reads; up to 60 reads and writes per second).
I can't use memfd_create (it may not be supported on target devices).
Let's also assume (and this is almost true) that I can't create this file on tmpfs (like in /tmp).
What I need is a guarantee that such a file will not stress the hard disk. I can't allow it to be written to disk, even if this only happens once every 5 seconds. If I can't get such a guarantee, I will look for another way.
Additional info (not important):
I am writing a Wayland compositor for Android devices. Currently the temporary files (Wayland surfaces, actually) are created on tmpfs, and everything works fine as long as SELinux is not enabled. But if I enable SELinux, it prevents fds from being transferred from the client to the compositor. The only solution I currently know of is to create the temporary files in the app's home directory. But if that approach is dangerous, I will find another.
Are temporary files created with mkstemp synced to disk?
The mkstemp function does not impart any special properties to files it opens that would prevent them from being synced to disk. The filesystem on which they are created might have such a property, but that's independent of file creation. In particular, files created via mkstemp() will persist indefinitely if not removed.
What I need is a guarantee that such a file will not stress the hard disk. I can't allow it to be written to disk, even if this only happens once every 5 seconds. If I can't get such a guarantee, I will look for another way.
As far as I am aware, even tmpfs filesystems do not guarantee that their contents will remain locked in memory, as opposed to being paged out. They are backed by virtual memory. But if the actual file is comparatively small and all its pages are hot, then they are likely to remain in memory only.
With regard to the larger problem,
everything works fine as long as SELinux is not enabled. But if I enable SELinux, it prevents fds from being transferred from the client to the compositor. The only solution I currently know of is to create the temporary files in the app's home directory.
By default, newly created files inherit the SELinux type of their parent directory. Your Wayland clients presumably do not have sufficient privilege to modify the SELinux labels of the files they create, but you should be able to administratively create a directory wherever you like with a label conducive to your needs. For example, you could have a subdirectory of /dev/shm created for the purpose (at every boot) and chcon'ed to have an appropriate label. If the clients create their temp files there, then those files should inherit the SELinux type you choose.
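As a concrete illustration of that pattern, here is a minimal sketch assuming a pre-created, appropriately labeled directory (the path /dev/shm/my-compositor and the function name are hypothetical). Note that nothing in it prevents dirty pages from being written back if the backing filesystem is disk-based; it only shows the mkstemp/unlink/mmap plumbing:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a surface file in a directory that has already been created and
 * labeled administratively (hypothetical path), and return an fd suitable
 * for passing to the compositor over the Wayland socket (SCM_RIGHTS). */
int create_surface_fd(size_t size, void **out_map)
{
    char path[] = "/dev/shm/my-compositor/surface-XXXXXX";  /* hypothetical dir */
    int fd = mkstemp(path);
    if (fd < 0) return -1;

    unlink(path);                        /* keep the fd, drop the name */

    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    *out_map = map;
    return fd;
}

On a typical Linux system /dev/shm is tmpfs, so placing the directory there also keeps the file off the disk, subject to the paging caveat above; on Android the right pre-labeled location may well differ.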

Analyze VMDK (vmware virtual machine disk) files for changes

Is there a good way to analyze VMware delta VMDK files between snapshots to list changed blocks, so one can use a tool to tell which NTFS files are changed?
I do not know of a tool that does this out of the box, but it should not be too difficult to build.
The VMDK file format specification is available, and the format is not that complex. As far as I remember, a VMDK file consists of a lot of 64 KB blocks. At the beginning of the VMDK file there is a directory that records where each logical block is stored in the physical file.
It should be pretty easy to detect whether a logical block is stored in both files and then compare the data in the two versions of the VMDK file.
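As a starting point, here is a rough sketch that walks the grain directory and grain tables of a hosted sparse extent and prints which logical regions the delta actually contains. The field layout follows the published VMware Virtual Disk Format specification for "KDMV" sparse extents, but verify it against the spec version matching your files (ESX "COWD" delta disks differ); the program and type names are purely illustrative:

#define _FILE_OFFSET_BITS 64
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Leading fields of the hosted sparse extent header, per the published spec
 * (full header is 512 bytes, little-endian); verify against your spec version. */
#pragma pack(push, 1)
typedef struct {
    uint32_t magic;            /* 0x564d444b = "KDMV" */
    uint32_t version;
    uint32_t flags;
    uint64_t capacity;         /* virtual size, in 512-byte sectors */
    uint64_t grainSize;        /* grain size in sectors, typically 128 (64 KB) */
    uint64_t descriptorOffset;
    uint64_t descriptorSize;
    uint32_t numGTEsPerGT;     /* entries per grain table, typically 512 */
    uint64_t rgdOffset;
    uint64_t gdOffset;         /* grain directory location, in sectors */
    uint64_t overHead;
} SparseHeader;
#pragma pack(pop)

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s delta.vmdk\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    SparseHeader h;
    if (fread(&h, sizeof h, 1, f) != 1 || h.magic != 0x564d444bu) {
        fprintf(stderr, "not a hosted sparse extent\n"); return 1;
    }

    uint64_t sectorsPerGT = h.grainSize * h.numGTEsPerGT;
    uint64_t numGTs = (h.capacity + sectorsPerGT - 1) / sectorsPerGT;

    uint32_t *gd = malloc(numGTs * sizeof *gd);          /* grain directory */
    uint32_t *gt = malloc(h.numGTEsPerGT * sizeof *gt);  /* one grain table */
    fseeko(f, (off_t)(h.gdOffset * 512), SEEK_SET);
    fread(gd, sizeof *gd, numGTs, f);

    for (uint64_t i = 0; i < numGTs; i++) {
        if (gd[i] == 0) continue;                        /* no grain table here */
        fseeko(f, (off_t)gd[i] * 512, SEEK_SET);
        fread(gt, sizeof *gt, h.numGTEsPerGT, f);
        for (uint32_t j = 0; j < h.numGTEsPerGT; j++) {
            if (gt[j] == 0) continue;                    /* grain absent from this delta */
            uint64_t lba = (i * h.numGTEsPerGT + j) * h.grainSize;
            printf("data present: %llu bytes at virtual offset %llu\n",
                   (unsigned long long)(h.grainSize * 512),
                   (unsigned long long)(lba * 512));
        }
    }
    free(gd); free(gt); fclose(f);
    return 0;
}

The offsets printed are offsets into the virtual disk; to map them to NTFS files you would still need to translate through the guest's partition table and the NTFS cluster allocation (MFT/$Bitmap).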

Read data from damaged media

Is it possible to read damaged media (CD, HDD, DVD, ...) even if Windows Explorer bombs out?
What I mean to ask is whether there is a set of APIs or something that can access the disk at a very low level (below Explorer?) and read whatever can be retrieved, even if it is only partial, especially if you can still see from Explorer that the file is there but can't do anything with it because it is damaged somehow (a scratch on a CD, etc.).
The main problem with Windows Explorer is that it doesn't support resuming copying after a read error. Most superficially scratched CDs, for example, will fail on different areas of the disk every time you eject and reinsert them.
Therefore, with a utility that supports resuming copy operations, it is possible to read the entire contents of a damaged CD by doing "eject/reload/resume" a few times.
In fact, this is what a utility I wrote does, and I've never needed anything fancier to read scratched disks. (It simply uses ReadFile and WriteFile.)
One step lower would be opening the raw partition (i.e. disk image) by passing a string such as "\\.\F:" to CreateFile. That would allow you to read raw sectors from a drive, but reconstructing files from that data would be hard.
In fact, the "\\.\" syntax allows you to open devices in the "\GLOBAL??" branch of the Windows Object Manager namespace as if they were files. It's not unlike calling dd with /dev/x as a parameter. There is also a "\Device" branch, but that's only accessible via DeviceIoControl() (i.e. ioctl()), meaning there's no simple ReadFile()/WriteFile() interface.
Anything lower-level than that would be device-specific, I guess; like reading raw CD-ROM data (including ECC bits) the way some CD-burning programs do. You'd have to do some research on the specific media (CD, flash, DVD) and what your hardware allows you to do with them.
Note: in C/C++ source you need to escape the backslashes, of course, e.g. "\\\\.\\F:".
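To illustrate the resumable-copy approach described above (plain ReadFile/WriteFile, skipping unreadable chunks so that a later pass after eject/reinsert can retry just the gaps), here is a rough sketch; it is not the author's actual utility, and the chunk size and function name are arbitrary choices:

#include <windows.h>
#include <stdio.h>

#define CHUNK (64 * 1024)

int salvage_copy(const char *src, const char *dst)
{
    HANDLE in = CreateFileA(src, GENERIC_READ, FILE_SHARE_READ, NULL,
                            OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    HANDLE out = CreateFileA(dst, GENERIC_WRITE, 0, NULL,
                             CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (in == INVALID_HANDLE_VALUE || out == INVALID_HANDLE_VALUE) return -1;

    LARGE_INTEGER size, pos;
    GetFileSizeEx(in, &size);

    static BYTE buf[CHUNK];
    for (pos.QuadPart = 0; pos.QuadPart < size.QuadPart; pos.QuadPart += CHUNK) {
        LONGLONG remain = size.QuadPart - pos.QuadPart;
        DWORD want = remain < CHUNK ? (DWORD)remain : CHUNK;
        DWORD got = 0, written = 0;

        SetFilePointerEx(in, pos, NULL, FILE_BEGIN);
        if (!ReadFile(in, buf, want, &got, NULL) || got != want) {
            /* Bad area: log it, write zeros, and move on.  A later run
               (after eject/reinsert) can retry only these offsets. */
            fprintf(stderr, "unreadable chunk at %lld\n", pos.QuadPart);
            ZeroMemory(buf, want);
            got = want;
        }
        SetFilePointerEx(out, pos, NULL, FILE_BEGIN);
        WriteFile(out, buf, got, &written, NULL);
    }
    CloseHandle(in);
    CloseHandle(out);
    return 0;
}

A real tool would record the unreadable ranges in a map file (much as ddrescue does, see below) so that subsequent passes touch only the gaps.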
If you want to do it, do it from the Linux side; see this open-source project: http://sourceforge.net/projects/monkeycity/
Or a ready-made app (freeware, too): http://www.theabsolute.net/sware/dskinv.html
The first step is dd_rescue. After that, you're free to try anything to reconstruct the data.
And there's GNU ddrescue
GNU ddrescue is a data recovery tool. It copies data from one file or block device (hard disc, cdrom, etc) to another, trying to rescue the good parts first in case of read errors.
Make sure to use the 3-arg version (manual):
ddrescue [options] infile outfile [mapfile]
That is, do use a mapfile even if it's optional, because:
If you use the mapfile feature of ddrescue, the data is rescued very efficiently, (only the needed blocks are read). Also you can interrupt the rescue at any time and resume it later at the same point. The mapfile is an essential part of ddrescue's effectiveness. Use it unless you know what you are doing.
And it's also included in Cygwin and Homebrew.
I don't know what layer exists between Windows Explorer and the Win32 APIs. You can try to write a program with the Win32 File I/O stuff. If that doesn't work, then you have to write your own device driver to get any lower.
I've had some luck from the Linux side, or using BartPE (http://www.nu2.nu/pebuilder/), but just seeing the file doesn't always mean the file is going to be recoverable, whether you're trying from Windows or Linux. Your best bet might be to use a trial of a recovery program.
I have had two disks start to disintegrate on me. From the pattern of unreadable sectors I think they had internal flaking of their emulsion. WinXP Explorer just threw up its hands and said the drive didn't even exist.
In both cases I used "GetDataBack for NTFS" from Runtime Software (http://www.runtime.org/). You can download a free trial which will show you what you could get back if you paid for it. When I bought it it was $49, but I see it is now $79.
This program is amazing. It's not necessarily fast as it will reread some sectors over and over, trying to get a consensus value from multiple tries, but when it's done you can get back stuff that you thought was gone forever. I had one drive that it took over 10 hours to analyze, but when it was done I got back over 97% of a 500GB drive. Definitely worth the price.
Another great tool is Beyond Compare. I have rev 2.5.3, but it is currently at 3.?? and costs $30. They have a full-functionality, 30-day trial. It does a great job of copying large quantities of files (and only those that need to be copied) and, unlike Explorer, it doesn't blow up if something fails. It's sort of like a visual rsync for Windows, if you're familiar with that program from the Samba people.
I have no connection with either of the companies mentioned, other than being a very satisfied customer.
The gold standard for recovering data from a magnetic storage device would have to be SpinRite. It's a commercial app though, so you probably wouldn't learn much from it.
If you have a Linux machine around, I can recommend dvdisaster. It is originally meant for creating error correction files, but it also reads DVDs into an image and ignores read errors; and you can use different drives one after another to get missing sectors filled in the image.
