How do I quickly fill up a multi-petabyte NAS? - filesystems

My company's product will produce petabytes of data each year at our client sites. I want to fill up a multi-petabyte NAS to simulate a system that has been running for a long time (3 months, 6 months, a year, etc). We want to analyze our software while it's running on a storage system under load.
I could write a script that creates this data (a single script could take weeks or months to execute). Are there recommendations on how to farm out the script (multiple machines, multiple threads)? The NAS has 3 load balanced incoming links... should I run directly on the NAS device?
Are there third-party products that I could use to create load? I don't even know how to start searching for products like this.
Does it matter if the data is realistic? Does anyone know anything about NAS/storage architecture? Can it just be random bits or does the regularity of the data matter? We fanning the data out on disk in this format
x:\<year>\<day-of-year>\<hour>\<minute>\<guid-file-name>.ext

You are going to be limited by the write speed of the NAS/disks - I can think of no way of getting round that.
So the challenge then is simply to write-saturate the disks for as long as needed. A script or set of scripts running on a reasonable machine should be able to do that without difficulty.
To get started, use something like Bonnie++ to find out how fast your disks can write. Then you could use the code from Bonnie as a starting point to saturate the writes - after all, to benchmark a disk Bonnie has to be able to write faster than the NAS.
Assuming you have 3x1GB ethernet connections, the max network input to the box is about 300 MB/s. A PC is capable of saturating a 1GB ethernet connection, so 3 PCs should work. Get each PC to write a section of the tree and voila.
Of course, to fill a petabyte at 300 MB/s will take about a month.
Alternatively, you could lie to your code about the state of the NAS. On Linux, you could write a user-space filesystem that pretended to have several petabytes of data by creating on-the fly metadata (filename, length etc) for a petabytes worth of files. When the product reads, then generate random data. When you product writes, write it to real disk and remember you've got "real" data if it's read again.
Since your product presumably won't read the whole petabyte during this test, nor write much of it, you could easily simulate an arbitrarily full NAS instantly.
Whether this takes more or less than a month to develop is an open question :)

Related

Is 2ms a realistic performance goal for small writes to LMDB on a Raspberry Pi 3?

I'm about to begin work on a new project on a Raspberry Pi 3. It will be controlled with a complex GUI so interactive performance is important.
I decided to use LMDB for persistence, as - outside performance - its special traits make my system a lot simpler and for me there's no downside.
The application will be written in Rust, using the lmdb crate.
The critical path will be a single-threaded part where I will get the current timestamp and write a total of 16 or 20 bytes (not sure yet, is 16 bytes really better?) to the database under a key I already have computed.
For doing so (beginning with timestamp and ending after the write transaction has been commited) I have a performance budget of 2 milliseconds.
As far as I heard writes are LMDB's worst criteria, and those are very small random writes, so this is probably the worst possible application. This thought made me ask this question.
Further information:
as this is a GUI application this path will be called at most 100 times per second, and also never more than 1000 times per hour (this is a rewrite from scratch, and those are 10 times the numbers measured in the current system).
I do not understand much about how LMDB works but AFAIU it is using memory mapped files. I was hoping this means the OS internals will write back the pages aggregated.
Is 2ms a realistic goal for such an application? What would I need to consider to keep myself inside this window? Do I need to cache those writes manually?
It depends on how much data you have in you LMDB.
2ms is definitely achievable for small DB. I have got over 100K writes per second with SSDs and small DB size (<1GB). But if the DB is larger than your memory, you probably should not use LMDB if you are worried about write performance.
To wrap up, if your DB is smaller than memory, you can go ahead and use LMDB, 2ms is definitely enough for a write operation. If your DB is larger than memory, then you better use LSM-based kv-stores like LevelDB or RocksDB.

Detecting content change on a 16 GB pen drive within 8 sec

I have to detect whether the playable media (audio, video and image) has changed on a 16GB pen drive with 30,000 files, within 8 seconds for subsequent insertions. Other files such as pdf or plain text are not to be considered; this is for a media player software.
I tried ls -l and md5 but it takes me 10-11 seconds. Has anyone ever solved this problem before or any strategy you can suggest?
The scenario when content can change is that the user may eject the pen drive, add more songs to it, and re-insert the same pen drive. If there is no content change then I can use the old database and thus save play-time.
I cannot rely on timestamps because renaming a file on a Windows system doesn't change the modification time.
Just check file sizes instead of md5 sums. This should be much faster and less resource-intensive.
I'm assume your hashing the output of ls here in order to trigger a hash change on renames, additions, size changes or timestamps (for the systems that do play nice), since I'm guessing hashing 16GB split over 30,000 files would take much longer than 11 seconds (although most of this advice should work either way)
Your probably going to end up having to write your own code using a lowerlevel API to access the file list. ls is designed to be human readable not for speed. You don't need to query the human readable perms, username, groups, and so on and your going to be incurring a memory copy by piping it to md5.
You could try using the find command which seems faster and can specify just files. It would still be less efficient than a real program without having a pipe. This one is non-recursive (but so is ls -l), you can also specify custom formatting output if you want more than the name:
find . -maxdepth 1 -type f | md5sum
You could also try an alternative hash to MD5. MD5 is a cryptographic hash, it's designed to be secure against deliberate malicious collisions but is slower as a result.
MurmurHash3 is one of the fastest or the newer xxhash. But it will depend on the hardware and size of the data (some hashes are optimized for small keys such as for a hashmap).
You could also try and thread it. Have one thread reading the list of files in from the drive continuously and another hashing them as fast as it can.
If your looking to do that with a standard shell however without writing your own code, it's going to be a pain.
Having said all that, your main bottleneck is probably the speed of the flash memory. All the tricks in the world won't help if your CPU is starved waiting for I/O. I'm not sure that it's a good 'challenge' as it will very a lot depending on drive manufacturer and USB version (unless that has been specified). But maybe doing all that might shave off a few seconds and bring you into your goal. Or just get a faster USB stick.

How to avoid damaging SD card for large writes?

Ok, first a little background to help make my question clear:
I am working on a device that collects certain data from sensors and posts them to a server using a GSM modem. As a GSM connection is not 100% reliable, it would contain a logging mechanism that would write unsent data to an SD card.
We are using Chan's FatFs module for providing us with a file system as we want the log to be readable on a PC.
Now I've been testing the FAT system for boundary conditions, i.e., trying to fill up the card completely.
In the first run I opened the file and set the code to keep writing a string until the drive was full. The program would synch after every write.
I left the code running overnight.
The next day, I examined the SD card. I found that the file was only 150 MB in size. There were about 1.2 million lines written to it. The card could still be read from but not written to or formatted.
Next time I tried the same type of test, but this time I used the f_lseek() function to pre-allocate the file to 1GB. It would then write to that file until that limit was reached. This time the data would be synced after 50 writes. It would then close that file and open another to do the same.
As you can guess another brave little card lost it's mind that day.
So these are what I would like help with :
How to prevent damage to the card while writing large amounts of data?
Does leaving the file open for extended periods have any negative effects?
Since the full code may be too long, here's the main part where the writing happens
for(file_count=3;file_count>=0;--file_count){
ax_log_msg(E_LOG_INFO,"===================================");
ax_log_msg(E_LOG_INFO,file_names[file_count]);
f_open(&file_ptr,file_names[file_count],FA_WRITE|FA_OPEN_ALWAYS);
if(result!=FR_OK){
ax_log_msg(E_LOG_INFO,"\n\rf_open Failed\n\rResult code");
ax_log_msg(E_LOG_INFO,FRESULT_S[result]);
continue;
}
ax_log_msg(E_LOG_INFO,"\n\rf_open Sucessfull");
result=f_lseek(&file_ptr,FILE_SIZE_LIMIT_1GB);
if(result!=FR_OK){
ax_log_msg(E_LOG_INFO,"\n\rf_lseek Failed for preallocation\n\rResult code");
ax_log_msg(E_LOG_INFO,FRESULT_S[result]);
f_close(&file_ptr);
continue;
}
ax_log_msg(E_LOG_INFO,"\n\rf_lseek Sucessfull for preallocation");
f_lseek(&file_ptr,0);
bytes_to_write=sizeof(messages[file_count]);
write_count=0;
while( (f_tell(&file_ptr) < FILE_SIZE_LIMIT_1GB )){
result=f_write(&file_ptr,messages[file_count],bytes_to_write,&bytes_written);
if(result==FR_OK){
++write_count;
if(write_count%50==0){
f_sync(&file_ptr);
}
}else{
ax_log_msg(E_LOG_INFO,"\n\rWrite failed\n\rFRESULT=");
ax_log_msg(E_LOG_INFO,FRESULT_S[result]);
break;
}
}
f_close(&file_ptr);
}
Note :
ax_log_msg() is part of the device firmware to print on console.
FRESULT_S[result] is used to convert the enum result code to a string.
If there is any data missing, please do mention it.
Thank You
You probably need to buffer an entire block of data, perhaps 4 KB, to avoid flashing an entire block with every flush. But, the filesystem or driver should do this for you, as long as you don't call fflush explicitly, which is the real lesson.
Why do you need it to be synced so often? Perhaps a timer would work better than an interval per number of records?
Due to 100,000 write cycles limit per sector it is a really challenging task to extend a flash memory lifespan. One of my cards died over one night after I run writing tests on it. I then counted time periods, and that's indeed easy to perform 100,000 writes (in the same sector) just in one night (without taking into account a calculation it comes through experience).
At that time I was told that there is a smart monitors in some filesystems and they count and keep writes number for every sector in order to writings number per every sector was the same, I guess. I neither took nor tested one.
I now found some extremely popular/highly voted answer/suggestion for Raspberrypi and I quote it here now:
These methods should increase the lifespan of the SD card by minimising the number of read/writes in various ways:
Disable Swap
Swapping is the process of using part of the SD card as volatile memory. This will increase the amount of RAM available, but it will result in a high number of read/writes. It is unlikely to increase performance significantly.
Disable swap with the swapoff command:
sudo swapoff --all
You must also prevent it from coming back after a reboot:
For Raspbian which uses dphys-swapfile to manage a swap file (instead of a "normal" swap partition) you can simply sudo apt-get remove dphys-swapfile to remove it permanently. Best to remove because setting the CONF_SWAPSIZE to 0, as explained in this answer, doesn't seem to work and still creates a 100MB swap file after reboot.
For other distributions that use a swap partition instead of a swap file, remove the appropriate line from /etc/fstab
Disabling Journaling on the Filesystem
Using a journaling filesystem such as ext3 or ext4 WITHOUT a journal is an option to decrease read/writes. The obvious drawback of using a filesystem with journaling disabled is data loss as a result of an ungraceful dismount (i.e. post power failure, kernel lockup, etc.).
You can disable journaling on ext3 by mounting it as ext2
You can disable journaling on ext4 on an unmounted drive like this:
tune4fs -O ^has_journal /dev/sdaX
e4fsck –f /dev/sdaX
sudo reboot
The noatime Mount Flag
Assign the noatime mount flag to partitions residing on the SD card by adding it to the options section of the partition in /etc/fstab.
Reading accesses to the file system will no longer result in an update to the atime information associated with the file. The importance of the noatime setting is that it eliminates the need by the system to make writes to the file system for files which are simply being read. Since writes can be somewhat expensive as mentioned in previous section, this can result in measurable performance gains. Note that the write time information to a file will continue to be updated anytime the file is written to with this option enabled.
Directories in RAM
Highly used directories such as /var/tmp/ and possibly /var/log can be relocated to RAM in /etc/fstab like this:
tmpfs /var/tmp tmpfs nodev,nosuid,size=50M 0 0
This will allow /var/tmp to use 50MB of RAM as disk space. The only issue with doing this is that any drives mounted in RAM will not persist past a reboot. Thus if you mount /var/log and your system encounters an error that causes it to reboot, you will not be able to find out why.
Directories in external Hard Disk
You can also mount some directories on a persistent USB hard disk. More details of this can be found in this question.
The Raspberry Pi can also boot it's root partition from an external drive. This could be via USB or Ethernet and means that the SD card will only be used to delegate to different device during boot. This requires a bit of kernel hacking to accomplish, as I don't think the default kernel supports USB storage. You can find more information at this question, or this external blog post.
Here is one more interesting consideration from another answerer:
Excellent article about flash filesystems.
Important question when talking about flash filesystems is following: What is wear leveling? Wikipedia article. Basically, on flash disks you can write limited number of times until block goes bad. After that, filesystem (if there is no built-in wear leveling management on hardware, as in case of SSDs there usually is) must mark that block as invalid, and avoid using it anymore.
Typical filesystems (for example reiserfs, ntfs, ext3 and so on) are designed for hard disks, that do not have such limitations.
JFFS2
Includes compression and elegant wear leveling protection.
YAFFS2
Single thing that makes the difference: short mount times, after successful umount.
Implements write once property: once data is written to one block, there is no need to rewrite it. This is important for protecting against wear leveling.
LogFS
Not very mature, but already included in Linux kernel tree.
Supports larger filesystems than JFFS2/YAFFS2 without problems.
UBIFS
More mature than LogFS
Write caching support
On scalability: article. On large disks, better performance than with JFFS2
ext4
If no driver or card (for example SSD drives do have internal wear leveling, at least usually) handle wear leveling, then ext4 is not the best idea, as it is not intended for raw flash usage.
What is best one?
Of course, it depends on usage and support. From what I read from the internet, I would recommend UBIFS. Good support for large filesystems, mature development phase, adequate performance and no huge downsides.
Thanks to answerers:
How can I extend the life of my SD card?
Choice of filesystem for GNU/Linux on an SD card

Optimizing data stream to disk in C (also flash memory)

I have a C program running on Linux that acquires data from a USB device (sensor data), does some processing and streams the result to disk. Currently I save to a text file using fputs(), a line looks like this:
timestamp value1 value2 ... valueN
the sample rate being up to 250Hz.
The program should run on a RPi or similar board and possibly write the data to a flash memory (SD card).
I have following questions:
Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
Here some more thoughts:
I can consider using a binary format to decrease size, but I would prefer keeping the text format to simplify later data handling.
Using a hard drive is also an option in the final design, especially if a high acquisition rate is to be carried on over a long time.
The data rate being relatively low I do not expect bandwidth problem with either hard drive or SD card. It is possible that the rate will be higher in the future (kHz or more).
Thanks for your answers.
EDIT 20130128
Thank you for all the answers so far, they give me some good insight. I'll sum it up a bit:
In general I should not have bandwidth issues, however to avoid unnecessary large log files I might consider a binary format. Yes the log should be human readable, if not I'll make an export function or similar. Yes unwind's assumption is correct, about 10 or 15 data values each line.
The mentioned read/write cycles per cell should be enough for some time, at least in the testing phase, considering we don't always write and delete the same cells. I will play around with buffer size in setvbuf() and set the buffering mode to full buffering to see if I can optimize this while keeping a reasonable save interval (a few seconds or more also depending on sample rate).
In the final design I might use a hard drive to avoid most of the problems mentioned here, or a second SD card which can be easily replaced (might be also good to quickly retrieve the data). I will format this with one of the format suggested here (FAT or JFFS2/F2FS).
Following zmo's suggestion I will try to make the system as read only as possible (at least the system partition), I was already considering this.
A Beaglebone, also mentioned by zmo, is my next choice if I'm not happy with the RPi (I read that its USB bus is not always stable, USB is obviously very important for my application).
I have already implemented a UDP port to send data over network, still I would like to keep at least a local copy of that data and maybe only send a subset of or already processed data, as well as "control data".
Should I be optimizing the data stream or let the OS do the job? More specifically, should I be trying to minimize how often data is actually written to disk (also given the use of a flash memory)?
Well, you can usually assume that the OS does a pretty awesome job at buffering and handling output to the hard drive… As long as you don't do unbuffered writes.
Though, from my experience, you should not write logs to a SD Card, because it definitely kills the SD Card faster than you can imagine. On my first projects, I had installed linux on beaglebones, and between 6 months to 12 months after, all my SD Cards had to be replaced…
Since then, I've learned to run read only systems on the SD card and send any kind of regular updates over the network, the trick being to use a ramdisk for /tmp and /var.
In your case, using a hard drive is an easy solution (which will works smoothly), but you can also use a secondary SD Card where you write the logs. Then you'll be able to use a "stupid" filesystem such as a FAT one where you'll write your data aligned, as your data will be the only thing to be written on the SD. What is killing a SDCard is lots of little read/writes that happen a lot with temporary files, and defragmentation of the drive.
I have read about setbuf() and setvbuf(), as I understand they should effectively delay writing until a "block" is filled. Are these appropriate or is there a better way other than perhaps implementing my own buffer?
well, just keep it to full buffering, it will help write your data aligned on the filesystem.
Which output function is best suited for data streaming with the above in mind (fputs() / fprintf() / write())?
they should all behave similarly for your problematic.
Should I be trying to increase randomness (as to use all sectors) when writing to a SD card? If yes what's the best way to achieve this?
the firmware of the sdcard should be taking care of that for you. The only thing would be to use a simpler filesystem like FAT (or JFFS2/F2FS like ivan-voras suggets), because ext2/ext3/ext4 filesystems do automatic defragmentation which basically is moving around inodes to keep everything aligned. Though I'm not sure if it disables that behavior with SDcards and SSDs.
Writing to the SD card often will definitely kill it sooner, but it also means you can attempt to prolong this time by reducing the number of writes. As others have said, the best solution for you would be to write the logs over the network to a server or just another machine which has proper storage (in the simplest case, maybe you can use syslog(3) or just plain NFS).
If you want to continue with the original plan, then using setvbuf(3) to enable block buffered mode and setting a large buffer size (like 128 KiB or 256 KiB) would be best. A large buffer size also means that you will lose unwritten data from the buffer if power goes out, etc.
However, a large buffer only delays the inevitable and you should search for other options. It's not as alarming as Lundin's answer states because there are many cells and you're not writing always to the same one, so if you get the largest SD card you can buy, then using his method you can calculate approximately how many times you can rewrite the entire card before it fails. Using a flash-friendly file system such as F2FS or JFFS2 will be beneficial.
Here're my thoughts:
It might be a good idea to buffer some data in memory before writing to disk, but keep in mind that this might cause data loss in case of power failure.
I think this is highly dependent on the file system and type of storage you use. There is no generic answer but it could prove useful to implement and benchmark it on your specific configuration.
Considering the huge amount of data you're outputting, I'd choose a binary format (unless you want the file to be human readable)
The firmware of the flash drive should already take care of this. Basically this is the cornerstone of all modern SSDs. (SD card controllers should implement it too.)

How to read/write at Maximum Speed from Hard Disk.Multi Threaded program I coded cannot go above 15 mb/ sec

I have a 5 gb 256 Files in csv which I need to read at optimum speed and then write back
data in Binary form .
I made following arrangments to achieve it :-
For each file, there is one corresponding thread.
Am using C function fscanf,fwrite.
But in Resource Monitor,it shows not more then 12 MB/ Sec of Hard Disk and 100 % Acitve Highest Time.
Google says HardDisk can read/write till 100 MB/Sec.
Machine Configuration is :-
Intel i7 Core 3.4. Has 8 Cores.
Please give me your prespective.
My aim to complete this process within 1 Min .
** Using One Thread it took me 12 Mins**
If all the files reside on the same disk, using multiple threads is likely to be counter-productive. If you read from many files in parallel, the HDD heads will keep moving back and forth between different areas of the disk, drastically reducing throughput.
I would measure how long it takes a built-in OS utility to read the files (on Unix, something like dd or cat into /dev/null) and then use that as a baseline, bearing in mind that you also need to write stuff back. Writing can be costly both in terms of throughput and seek times.
I would then come up with a single-threaded implementation that reads and writes data in large chunks, and see whether I can get it to perform similarly the OS tools.
P.S. If you have 5GB of data and your HDD's top raw throughput is 100MB, and you also need to write the converted data back onto the same disk, you goal of 1 minute is not realistic.

Resources