Data distribution in btrfs single profile array: using file instead of block level? - btrfs

I have an array of 3 different drives which I use in the single profile (no RAID). I don't use RAID because the data isn't important enough to justify spending extra money on additional drives.
But what I could not figure out exactly is at what granularity the data is distributed across the 3 drives.
I could find this on the wiki page:
When you have drives with differing sizes and want to use the full
capacity of each drive, you have to use the single profile for the
data blocks, rather than raid0
As far as I understand, this means that files are not placed whole on one of the 3 drives; rather, each file's data blocks are distributed across them.
This is unfortunate, because losing just 1 drive would destroy the whole array. Is it possible to balance a single-profile array at the file level?
I would be fine with the risk of losing all the files stored on 1 drive in the array, but not with losing the whole array if 1 drive fails.

Related

What is the difference between having one data file or multiple data files for a tablespace?

Oracle allows creating a tablespace with multiple data files. What is the difference between one data file of 1 TB and 2 data files of 500 GB each? Is there any performance gain?
Performance? Could be. If you have one large (or two smaller) data files on the same hard disk, that will probably run somewhat slower than having two smaller data files on different hard disks. You know, you and me accessing data at the same time: the HDD head will have to "jump" from one place to another to serve data to both of us. If those were two disks, there's a chance that each disk would serve its data separately, and that would be faster.

Read a file after write and closing it in C

My code does the following:
1. 100 times: open a new file, write 10M of data, close it
2. open the 100 files together, read and merge their data into a larger file
3. do steps 1 and 2 many times in a loop
I was wondering if I can keep the 100 files open without opening and closing them so many times. What I can do is fopen them with w+. After writing, I set the position to the beginning to read; after reading, I set the position to the beginning to write; and so on.
The questions are:
1. If I read after writing without closing, do we always read all the written data?
2. Would this save some overhead? File open and close must have some overhead, but is this overhead large enough to be worth saving?
Based on the comments and discussion, I will explain why I need to do this in my work. It is also related to my other post:
how to convert large row-based tables into column-based tables efficently
I have a calculation that generates a stream of results. So far the results are saved in a row-storage table. This table has 1M columns, and each column could be 10M long. Each column is one attribute the calculation produces. As the calculation runs, I dump and append the intermediate results to the table. The intermediate results could be 2 or 3 double values for each column. I want to dump them early because they already consume >16M of memory, and the calculation needs more memory. This ends up as a table like the following:
aabbcc...zzaabbcc..zz.........aabb...zz
The data for a row is stored together. The problem comes when I want to analyze the data column by column: I have to read 16 bytes, then seek to the next row to read another 16 bytes, and keep going. There are too many seeks; it is much slower than if each column were stored contiguously so I could read it sequentially.
I can make the calculation dump less frequently. But to make the later reads more efficient, I may want to have 4K of data stored together, since I assume each fread fetches 4K by default even if I read only 16 bytes. But this means I would need to buffer 1M * 4K = 4G in memory...
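For illustration, the seek-per-value access pattern that causes the slowdown looks roughly like this in C (the file name, sizes, and helper are placeholders, not my actual code):

    #include <stdio.h>

    #define NUM_COLS   1000000L   /* 1M columns, as described above */
    #define VALUE_SIZE 16L        /* 2 doubles per column per dump */

    /* Read every value of one column from a row-major dump file. */
    static long read_column(const char *path, long col, double *out, long max_rows)
    {
        FILE *f = fopen(path, "rb");
        if (!f)
            return -1;

        long rows = 0;
        for (; rows < max_rows; rows++) {
            /* one seek plus one tiny 16-byte read per row: this is what thrashes the disk */
            /* (use fseeko/_fseeki64 instead of fseek for files larger than 2 GB) */
            if (fseek(f, (rows * NUM_COLS + col) * VALUE_SIZE, SEEK_SET) != 0)
                break;
            if (fread(&out[rows * 2], sizeof(double), 2, f) != 2)
                break;
        }
        fclose(f);
        return rows;
    }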
So I was wondering whether I could merge the fragmented data into larger chunks, as that post suggests:
how to convert large row-based tables into column-based tables efficently
So I wanted to use files as offline buffers. I may need 256 files to get 4K of contiguous data after merging, if each file contains 1M entries of 2 doubles each. This work can be done asynchronously with respect to the main calculation. But I want to ensure the merge overhead is small, so that when it runs in parallel it can finish before the main calculation is done. So I came up with this question.
I guess this is closely related to how column-based databases are constructed. When people create them, do they face similar issues? Is there any description of how this works at creation time?
You can use w+ as long as the maximum number of open files on your system allows it; this is usually 255 or 1024, and can be set (e.g. on Unix by ulimit).
But I'm not too sure this will be worth the effort.
On the other hand, 100 files of 10M each is one gigabyte; you might want to experiment with a RAM disk. Or with a large file system cache.
I suspect that much larger savings might be reaped by analyzing your specific problem structure. Why is it 100 files? Why 10 M? What kind of "merge" are you doing? Are those 100 files always accessed in the same order and with the same frequency? Could some data be kept in RAM and never be written at all?
Update
So, you have several large buffers like,
ABCDEFG...
ABCDEFG...
ABCDEFG...
and you want to pivot them so they read
AAA...
BBB...
CCC...
If you already know the total size (i.e., you know that you are going to write 10 GB of data), you can do this with two files, pre-allocating the output file and using fseek() to write into it. With memory-mapped files, this should be quite efficient. In practice, row Y, column X (of 1,000,000 columns) has been dumped at offset 16*X in file Y.dat; to make each column contiguous, you need to write it at offset 16*(X*R + Y) into largeoutput.dat, where R is the total number of rows.
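A minimal sketch of that scatter-write pivot (the file names, NUM_ROWS, and the pre-created output file are assumptions, not actual code from the question):

    #include <stdio.h>

    #define NUM_COLS   1000000L   /* columns per row, as in the question */
    #define NUM_ROWS   100L       /* number of per-row dump files (assumption) */
    #define VALUE_SIZE 16L        /* 2 doubles */

    /* Scatter one row-major dump file into a column-major output file. */
    int pivot_row_file(long row)
    {
        char name[64];
        snprintf(name, sizeof(name), "%ld.dat", row);   /* hypothetical naming scheme */

        FILE *in  = fopen(name, "rb");
        FILE *out = fopen("largeoutput.dat", "r+b");    /* pre-created at full size */
        if (!in || !out) {
            if (in) fclose(in);
            if (out) fclose(out);
            return -1;
        }

        char value[VALUE_SIZE];
        for (long col = 0; col < NUM_COLS; col++) {
            if (fread(value, 1, VALUE_SIZE, in) != VALUE_SIZE)
                break;
            /* column-major target offset: all NUM_ROWS values of a column are contiguous */
            fseek(out, (col * NUM_ROWS + row) * VALUE_SIZE, SEEK_SET);
            fwrite(value, 1, VALUE_SIZE, out);
        }

        fclose(in);
        fclose(out);
        return 0;
    }

With the output memory-mapped instead of written through fseek()/fwrite(), the scattered 16-byte stores land in the page cache rather than issuing one small write each, which is why memory-mapped files should be quite efficient here.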
Actually, you could write the data this way even during the first calculation. Or you could have two processes communicating via a pipe, one calculating and one writing to both row-column and column-row files, so that you can monitor the performance of each.
Frankly, I think that adding more RAM and/or a fast I/O layer (an SSD, maybe?) could get you more bang for the same buck. Your time costs money too, and the memory will remain available after this one job has been completed.
Yes. You can keep the 100 files open without doing the opening-closing-opening cycle. Most systems do have a limit on the number of open files though.
If I read after writing without closing, do we always read all the written data?
It depends on you. You can fseek to wherever you want in the file and read data from there. It's all up to you and your logic.
Would this save some overhead? File open and close must have some overhead, but is this overhead large enough to be worth saving?
This would definitely save some overhead, such as unnecessary extra I/O operations. Also, on some systems the content you write to a file is not immediately flushed to the physical file; it may be buffered and flushed periodically, or at the time of fclose.
So such overheads are saved, but the real question is: what do you achieve by saving them? How does it fit into the overall picture of your application? That is the call you must make before deciding on the logic.
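To illustrate the open-once pattern the question proposes, here is a minimal sketch (the scratch file name and buffer sizes are placeholders). Note that for a stream opened in update mode, the C standard requires an fflush() or a file-positioning call (fseek, fsetpos, or rewind) when switching from writing to reading, and a positioning call (or hitting end-of-file) when switching from reading to writing; the rewind() calls below serve both purposes.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char out[64] = "intermediate results";
        char in[64]  = {0};

        FILE *f = fopen("scratch.dat", "w+b");   /* hypothetical scratch file */
        if (!f)
            return 1;

        for (int pass = 0; pass < 3; pass++) {
            /* write phase */
            fwrite(out, 1, strlen(out), f);

            /* switch to reading: rewind satisfies the write->read transition rule */
            rewind(f);
            size_t n = fread(in, 1, sizeof(in) - 1, f);
            in[n] = '\0';
            printf("pass %d read back: %s\n", pass, in);

            /* switch back to writing: rewind just repositions; "w+" truncates only
             * at fopen time, so shorter writes in later passes leave stale bytes behind */
            rewind(f);
        }

        fclose(f);
        return 0;
    }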

Read times of really small files on Linux with ext4

I'm working on comparing the performance of ext4 and NTFS on different file operations. As part of this, I'm benchmarking reads of really small files (a few bytes) using C. These are of special interest since such small files are stored within the MFT on Windows, but on ext4 they would have their own disk block. So I should be able to observe a significant difference between the read times of such files on NTFS and ext4. The tests are on a set of 100 files, 4 bytes each. I measure the time the read takes on each file.
The plots of the read times, however, show patterns that I'm not able to explain. Here are 4 graphs:
The first graph was generated by running echo 1 > /proc/sys/vm/drop_caches before reading each file, the second by echo 2 > /proc/sys/vm/drop_caches, the third by echo 3 > /proc/sys/vm/drop_caches, and the fourth without freeing any cache. For the second and third cases, why do the times go up and down? Shouldn't the times be within the same range throughout?
My first approximation was that since these files are really small, a lot of them get stored in one block, and thus multiple files get read in a single block read, although I'm definitely not sure this is right. Is it possible that some sort of prefetching of entire files is going on (although I'd imagine this would not be a good thing to do)? I also thought of checking the actual disk blocks of these files using debugfs, but I'm not sure whether logical block numbers correspond to the actual disk blocks.
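For reference, a minimal sketch of the kind of per-file read timing described above (the directory layout and file names are placeholders), using clock_gettime() around a single open()/read(); dropping caches between runs is done separately via /proc/sys/vm/drop_caches as described:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <time.h>

    int main(void)
    {
        char buf[16];
        char path[64];

        for (int i = 0; i < 100; i++) {
            snprintf(path, sizeof(path), "smallfiles/file%03d", i);  /* hypothetical layout */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);

            int fd = open(path, O_RDONLY);
            if (fd < 0)
                continue;
            ssize_t n = read(fd, buf, sizeof(buf));   /* the files are only 4 bytes */
            close(fd);

            clock_gettime(CLOCK_MONOTONIC, &t1);

            long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
            printf("%s: %zd bytes in %ld ns\n", path, n, ns);
        }
        return 0;
    }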

How do I quickly fill up a multi-petabyte NAS?

My company's product will produce petabytes of data each year at our client sites. I want to fill up a multi-petabyte NAS to simulate a system that has been running for a long time (3 months, 6 months, a year, etc). We want to analyze our software while it's running on a storage system under load.
I could write a script that creates this data (a single script could take weeks or months to execute). Are there recommendations on how to farm out the script (multiple machines, multiple threads)? The NAS has 3 load balanced incoming links... should I run directly on the NAS device?
Are there third-party products that I could use to create load? I don't even know how to start searching for products like this.
Does it matter if the data is realistic? Does anyone know anything about NAS/storage architecture? Can it just be random bits, or does the regularity of the data matter? We fan the data out on disk in this format:
x:\<year>\<day-of-year>\<hour>\<minute>\<guid-file-name>.ext
You are going to be limited by the write speed of the NAS/disks - I can think of no way of getting round that.
So the challenge then is simply to write-saturate the disks for as long as needed. A script or set of scripts running on a reasonable machine should be able to do that without difficulty.
To get started, use something like Bonnie++ to find out how fast your disks can write. Then you could use the code from Bonnie as a starting point to saturate the writes - after all, to benchmark a disk Bonnie has to be able to write faster than the NAS.
Assuming you have 3 x 1 Gbit Ethernet connections, the maximum network input to the box is about 300 MB/s. A PC is capable of saturating a 1 Gbit Ethernet connection, so 3 PCs should work. Get each PC to write a section of the tree and voila.
Of course, to fill a petabyte at 300 MB/s will take about a month (1 PB / 300 MB/s is roughly 3.3 million seconds, i.e. about 39 days).
Alternatively, you could lie to your code about the state of the NAS. On Linux, you could write a user-space filesystem that pretends to have several petabytes of data by creating on-the-fly metadata (filename, length, etc.) for a petabyte's worth of files. When the product reads, generate random data. When your product writes, write it to real disk and remember that you have "real" data if it is read again.
Since your product presumably won't read the whole petabyte during this test, nor write much of it, you could easily simulate an arbitrarily full NAS instantly.
Whether this takes more or less than a month to develop is an open question :)
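A minimal sketch of that "pretend filesystem" idea, assuming libfuse 2.x (all names, sizes, and the build command are placeholders, and the "remember real writes" part is left out):

    /* fakefs.c - pretend every path is a huge existing file; reads return synthetic data.
     * Build (assumption): gcc -Wall fakefs.c `pkg-config fuse --cflags --libs` -o fakefs */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>

    #define FAKE_FILE_SIZE (1024L * 1024 * 1024)  /* pretend every file is 1 GiB */

    static int fake_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
        } else {
            st->st_mode = S_IFREG | 0444;         /* metadata invented on the fly */
            st->st_nlink = 1;
            st->st_size = FAKE_FILE_SIZE;
        }
        return 0;
    }

    static int fake_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                            off_t offset, struct fuse_file_info *fi)
    {
        (void)offset; (void)fi;
        if (strcmp(path, "/") != 0)
            return -ENOENT;
        filler(buf, ".", NULL, 0);
        filler(buf, "..", NULL, 0);
        /* list only a handful of names so 'ls' stays usable; getattr answers for any path */
        for (int i = 0; i < 16; i++) {
            char name[64];
            snprintf(name, sizeof(name), "fake-%06d.ext", i);
            filler(buf, name, NULL, 0);
        }
        return 0;
    }

    static int fake_read(const char *path, char *buf, size_t size, off_t offset,
                         struct fuse_file_info *fi)
    {
        (void)path; (void)fi;
        if (offset >= FAKE_FILE_SIZE)
            return 0;
        if (offset + (off_t)size > FAKE_FILE_SIZE)
            size = FAKE_FILE_SIZE - offset;
        for (size_t i = 0; i < size; i++)         /* random-looking data, generated on read */
            buf[i] = (char)rand();
        return (int)size;
    }

    static struct fuse_operations fake_ops = {
        .getattr = fake_getattr,
        .readdir = fake_readdir,
        .read    = fake_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &fake_ops, NULL);
    }

Mounted somewhere like ./fakefs /mnt/fakenas, any path under the mount point then looks like a large existing file, while reads return synthetic data at essentially no storage cost.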

FAT File system and lots of writes

I am considering using a FAT file system for an embedded data-logging application. The logger will only create one file, to which it continually appends 40 bytes of data every minute. After a couple of years of use this would be over one million write cycles. MY QUESTION IS: Does a FAT system change the File Allocation Table every time a file is appended? How does it keep track of where the end of the file is? Does it just put an EndOfFile marker at the end, or does it store the length in the FAT table? If it does change the FAT table every time I do a write, I would wear out the flash memory in just a couple of years. Is a FAT system the right thing to use for this application?
My other thought is that I could just store the raw data bytes in the memory card and put an EndOfFile marker at the end of my data every time I do a write. This is less desirable though because it means the only way of getting data out of the logger is through serial transfers and not via a PC and a card reader.
FAT updates the directory table when you modify the file (at least, it will if you close the file, I'm not sure what happens if you don't). It's not just the file size, it's also the last-modified date:
http://en.wikipedia.org/wiki/File_Allocation_Table#Directory_table
If your flash controller doesn't do transparent wear levelling, and your flash driver doesn't relocate things in an effort to level wear, then I guess you could cause wear. Consult your manual, but if you're using consumer hardware I would have thought that everything has wear-levelling somewhere.
On the plus side, if the event you're worried about only occurs every minute, then you should be able to speed that up considerably in a test to see whether 2 years' worth of log entries really does trash your actual hardware. That might even be faster than trying to find the relevant manufacturer docs...
No, a flash file system driver is explicitly designed to minimize wear and spread it across the memory cells, taking advantage of the near-zero seek time. Your data rates are low, so it's going to last a long time. Specifying a yearly replacement of the media is a simple way to minimize the risk.
If your only operation is appending to one file it may be simpler to forgo a filesystem and use the flash device as a data tape. You have to take into account the type of flash and its block size, though.
Large flash chips are divided into sub-pages that are a power-of-two multiple of 264 (256+8) bytes in size, pages that are a power-of-two multiple of that, and blocks which are a power-of-two multiple of that. A blank page will read as all FF's. One can write a page at a time; the smallest unit one can write is a sub-page. Once a sub-page is written, it may not be rewritten until the entire block containing it is erased. Note that on smaller flash chips, it's possible to write the bytes of a page individually, provided one only writes to blank bytes, but on many larger chips that is not possible. I think in present-generation chips, the sub-page size is 528 bytes, the page size is 2048+64 bytes, and the block size is 128K+4096 bytes.
An MMC, SD, CompactFlash, or other such card (basically anything other than SmartMedia) combines a flash chip with a processor to handle PC-style sector writes. Essentially what happens is that when a sector is written, the controller locates a blank page, writes a new version of that sector there along with up to 16 bytes of 'header' information indicating what sector it is, etc. The controller then keeps a map of where all the different pages of information are located.
A SmartMedia card exposes the flash interface directly, and relies upon the camera, card reader, or other device using it to perform such data management according to standard methods.
Note that keeping track of the whereabouts of all 4,000,000 pages on a 2 gig card would require either having 12-16 megs of RAM, or else using 12-16 meg of flash as a secondary lookup table. Using the latter approach would mean that every write to a flash page would also require a write to the lookup table. I wouldn't be at all surprised if slower flash devices use such an approach (so as to only have to track the whereabouts of about 16,000 'indirect' pages).
In any case, the most important observation is that flash write times are not predictable, but you shouldn't normally have to worry about flash wear.
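A minimal sketch of the "data tape" approach for the 40-byte-per-minute logger above. The RAM-backed flash stub and all sizes are placeholders; a real implementation must respect the program/erase granularity described in the previous answer, and blank flash reading as all 0xFF is what lets a scan find the end of the log without a FAT entry.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define RECORD_SIZE 40u
    #define LOG_SIZE    (64u * 1024u)   /* size of the reserved log region (placeholder) */

    /* For this sketch the "flash" is a RAM buffer initialised to the erased state (0xFF);
     * on real hardware these calls would go to the flash driver and must honour its
     * minimum program unit. */
    static uint8_t flash[LOG_SIZE];

    static void flash_init(void)                                      { memset(flash, 0xFF, sizeof(flash)); }
    static void flash_read(uint32_t a, void *b, uint32_t n)           { memcpy(b, flash + a, n); }
    static void flash_program(uint32_t a, const void *b, uint32_t n)  { memcpy(flash + a, b, n); }

    /* A blank slot reads as all 0xFF. Note: a record whose payload happens to be all 0xFF
     * would be mistaken for blank; real designs add a small header or sequence byte. */
    static int record_is_blank(const uint8_t *rec)
    {
        for (unsigned i = 0; i < RECORD_SIZE; i++)
            if (rec[i] != 0xFF)
                return 0;
        return 1;
    }

    /* Scan forward after power-up to find where to resume appending. */
    static uint32_t log_find_end(void)
    {
        uint8_t rec[RECORD_SIZE];
        uint32_t addr;
        for (addr = 0; addr + RECORD_SIZE <= LOG_SIZE; addr += RECORD_SIZE) {
            flash_read(addr, rec, RECORD_SIZE);
            if (record_is_blank(rec))
                return addr;
        }
        return LOG_SIZE;                          /* region full */
    }

    /* Append one fixed-size record; returns 0 on success, -1 when the region is full. */
    static int log_append(uint32_t *write_ptr, const uint8_t record[RECORD_SIZE])
    {
        if (*write_ptr + RECORD_SIZE > LOG_SIZE)
            return -1;
        flash_program(*write_ptr, record, RECORD_SIZE);
        *write_ptr += RECORD_SIZE;
        return 0;
    }

    int main(void)
    {
        flash_init();

        uint8_t sample[RECORD_SIZE] = { 0x01 };   /* one minute's worth of data */
        uint32_t end = log_find_end();
        log_append(&end, sample);
        log_append(&end, sample);

        printf("next free offset: %u\n", (unsigned)log_find_end());  /* prints 80 */
        return 0;
    }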
Did you check what happens to the FAT file system consistency in case of a power failure or reset of your device?
When your device experiences such a failure, you should lose only the log entry that you were just writing; older entries must stay valid.
No, FAT is not the right thing if you need to read back the data.
You should further consider what happens if the flash memory is filled with data. How do you get space for new data? You need to define the requirements for this case.
