Why do these 2 ZFS pools have different allocation and capacity, even though they have the same files?

I have two zpools on ZFS:
The top zpool is 8 disks of 2 TB each in raidz3.
The bottom zpool is 4 disks of 4 TB each in raidz3.
The data is EXACTLY the same. I even ran diff -qr /top/zpool/ /bottom/zpool/ to confirm.
Why do the ALLOC and CAP fields differ, if the data is exactly duplicated?

Dealing with ZFS space accounting can be hard.
To be sure that the data is EXACTLY the same, give zfs list -o space a try.
About the difference between ALLOC and CAP, docs.oracle.com says:
ALLOC: The amount of physical space allocated to all datasets and internal metadata. Note that this amount differs from the amount of disk space as reported at the file system level.
CAP (CAPACITY): The amount of disk space used, expressed as a percentage of the total disk space.
For a more detailed answer you need to consider the block size and the average size of the data stored. ALLOC as reported by zpool list is raw pool space, so it includes raidz parity and padding, and that overhead depends on the width of the vdev (an 8-wide vs. a 4-wide raidz3 here). A full explanation can be found in Matt Ahrens' post on the Delphix blog.
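As a rough illustration of why the stripe width matters (a simplification of my own that ignores the block-size and padding effects discussed in Ahrens' post), here is the full-stripe parity fraction for the two layouts:

#include <stdio.h>

int main(void) {
    /* raidz3 stores 3 parity columns per stripe; the data columns are
     * whatever is left of the vdev width. Real allocation also depends
     * on recordsize, ashift and padding, so treat this as a rough guide. */
    int widths[] = { 8, 4 };                 /* disks per raidz3 vdev */
    for (int i = 0; i < 2; i++) {
        int w = widths[i];
        printf("%d-wide raidz3: %d data + 3 parity per full stripe "
               "(%.1f%% of raw space goes to parity)\n",
               w, w - 3, 300.0 / w);
    }
    return 0;
}

With the same files, the 4-wide pool therefore allocates considerably more raw space per block than the 8-wide pool, which shows up in both ALLOC and CAP.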

Related

How to compute the FAT size in FAT File System

I am implementing a FAT file system in C. I am following the specs published by Microsoft (http://read.pudn.com/downloads77/ebook/294884/FAT32%20Spec%20%28SDA%20Contribution%29.pdf).
But I don't understand how to compute the FAT size field of the boot sector. The following code appears on page 14 of the specification document.
RootDirSectors = ((BPB_RootEntCnt * 32) + (BPB_BytsPerSec - 1)) / BPB_BytsPerSec;
TmpVal1 = DskSize - (BPB_ResvdSecCnt + RootDirSectors);
TmpVal2 = (256 * BPB_SecPerClus) + BPB_NumFATs;
If(FATType == FAT32)
    TmpVal2 = TmpVal2 / 2;
FATSz = (TmpVal1 + (TmpVal2 - 1)) / TmpVal2;
If(FATType == FAT32) {
    BPB_FATSz16 = 0;
    BPB_FATSz32 = FATSz;
} else {
    BPB_FATSz16 = LOWORD(FATSz);
    /* there is no BPB_FATSz32 in a FAT16 BPB */
}
From this code I don't understand:
What is TmpVal2?
Why is the number 256 used?
Why is it divided by 2 if the type is FAT32?
I am not sure why the constant of 256 was chosen; however, here are some thoughts on your other questions.
There is a note below the source code segment which states that the math is an approximation.
NOTE: The above math does not work perfectly. It will occasionally set
a FATSz that is up to 2 sectors too large for FAT16, and occasionally
up to 8 sectors too large for FAT32. It will never compute a FATSz
value that is too small, however. Because it is OK to have a FATSz
that is too large, at the expense of wasting a few sectors, the fact
that this computation is surprisingly simple more than makes up for it
being off in a safe way in some cases.
The way I read the code, the calculation is for a FAT16 size, and there is then an adjustment to the calculation if the target is actually FAT32.
The value of the variable TmpVal2 looks to be a unit size: the amount of space calculated in TmpVal1 is divided by the unit size TmpVal2 in order to determine the number of units of disk space. However, in the case of FAT32 the unit size is smaller than in FAT16, so an adjustment is needed.
It appears that FAT16 used a specific size for the File Allocation Table, and as the disk space available for a volume increased with improvements in disk technology, the cluster size was based on the volume size. So with a smaller volume size the cluster size, i.e. the number of disk sectors in an allocation unit, was smaller than the cluster size for a large volume. See FAT16 vs. FAT32 in Microsoft TechNet as well as the tables in the source code on page 13 of the document you reference.
With FAT32, a standard cluster size of 4K was used and the File Allocation Table storage was changed from a fixed size to a variable size and was no longer at a fixed location on the disk.
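One plausible reading of the constants (an interpretation on my part, assuming the common 512-byte sector size; the specification does not spell this out): a FAT16 entry is 2 bytes, so one 512-byte FAT sector describes 512 / 2 = 256 clusters, i.e. 256 * BPB_SecPerClus data sectors, and the + BPB_NumFATs term roughly accounts for the FAT copies themselves. A FAT32 entry is 4 bytes, so each FAT sector describes only half as many clusters, which is why TmpVal2 is halved for FAT32. A small sketch of that reasoning (function names are mine):

#include <stdint.h>

/* Illustrative only: assuming the common 512-byte sector size
 * (BPB_BytsPerSec == 512). A FAT16 entry is 2 bytes and a FAT32 entry
 * is 4 bytes, so one FAT sector describes 256 or 128 clusters. */
uint32_t clusters_per_fat_sector(uint32_t bytes_per_sec, int is_fat32)
{
    return bytes_per_sec / (is_fat32 ? 4u : 2u);   /* 256 for FAT16, 128 for FAT32 */
}

/* Rough FATSz estimate mirroring the spec's approximation: the sectors
 * left after the reserved area hold the data clusters plus NumFATs
 * copies of the FAT, and each FAT sector maps roughly
 * clusters_per_fat_sector() * SecPerClus data sectors. */
uint32_t estimate_fatsz(uint32_t remaining_sectors, uint32_t sec_per_clus,
                        uint32_t num_fats, int is_fat32)
{
    uint32_t unit = clusters_per_fat_sector(512, is_fat32) * sec_per_clus + num_fats;
    return (remaining_sectors + unit - 1) / unit;   /* round up */
}

For FAT16 with 512-byte sectors this reproduces TmpVal2 exactly; for FAT32 it differs slightly, because the spec halves the whole of TmpVal2, NumFATs included, which is part of why the note above describes the result as an approximation that may overshoot.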
This article, File systems (FAT, FAT8, FAT16, FAT32, and NTFS) explained, goes into some detail about the differences between these various file system versions.
The Wikipedia article, File Allocation Table, has quite a bit of technical information with links to other articles.
You may also find the following Stack Overflow questions of interest.
Converting the cluster number stored in FAT table (of FAT12 filesystem) for reading from a floppy disk
Why did Windows use the FAT structure instead of a conventional linked list with a next pointer for each data block of a file?

Basic File System Implementation

I've been given 2k bytes to make an ultra-minimalistic file system, and I thought about making a stripped-down version of FAT16.
My only problem is understanding how to store the FAT in the volume. Let's say I use 2 bytes per block, hence I'd have 1024 blocks. I need a table with 1024 rows, and in each row I'll save the next block of a file.
As each of these blocks can address the other 1023 blocks, I fail to see how this table would not use my entire 2k space. I do not understand how to save this table onto my hard drive using only a few bytes, rather than using 1024 blocks just to write a 1024-row table.
Given that you are allowed to implement a flat filesystem and have such a small space to work with, I would look at something like the Apple DOS 3.3 filesystem rather than a hierarchical filesystem like FAT16. Even the flat filesystem predecessor of FAT16, FAT12, is overly complex for your purposes.
I suggest that you divide your 2 kiB volume up into 256 byte "tracks" with 16 byte "sectors," to use the Apple DOS 3.3 nomenclature. Call them what you like in your own implementation. It just helps you to map the concepts if you reuse the same terms here at the design stage.
You don't need a DOS boot image, and you don't have the seek time of a moving disk drive head to be concerned about, so instead of setting aside tracks 0-2 and putting the VTOC track in the middle of the disk, let's put our VTOC on track 0. The VTOC contains the free sector bitmap, the location of the first catalog sector, and other things.
If we reserve the entirety of track 0 for the VTOC, we would have 112 of our 16-byte sectors left. Those will pack up into only 14 bytes for the bitmap, which suggests that we really don't need the entirety of track 0 for this.
Let's set aside the first two sectors of track 0 instead, and include track 0 in the free sector bitmap. That causes a certain amount of redundancy, in that we will always have the first two sectors mapped as "used," but it makes the implementation simpler, since there are now no special cases.
Let's split Apple DOS 3.3's VTOC concept into two parts: the Volume Label Sector (VLS) and the volume free sector bitmap (VFSB).
We'll put the VLS on track 0 sector 0.
Let's set aside the first 2-4 bytes of the VLS for a magic number to identify this volume file as belonging to your filesystem. Without this, the only identifying characteristic of your volume files is that they are 2 kiB in size, which means your code could be induced to trash an innocent file that happened to be the same size. You want more insurance against data destruction than that.
The VLS should also name this volume. Apple DOS 3.3 just used a volume number, but maybe we want to use several bytes for an ASCII name instead.
The VLS also needs to point to the first catalog sector. We need at least 2 bytes for this. We have 128 sectors, which means we need at least 7 bits. Let's use two bytes: track and sector. This is where you get into the nitty-gritty of design choices. We can now consider moving to 4 kiB volume sizes by defining 256 sectors. Or, maybe at this point we decide that 16-byte sectors are too small, and increase them so we can move beyond 4 kiB later. Let's stick with 16-byte sectors for now, though.
We only need one sector for the VFSB: the 2 kiB volume ÷ 16 bytes per sector = 128 sectors ÷ 8 bits per byte = 16 bytes. But, with the above thoughts in mind, we might consider setting aside a byte in the VLS for the number of VFSB sectors following the VL, to allow for larger volumes.
The Apple DOS 3.3 catalog sector idea should translate pretty much directly over into this new filesystem, except that with only 16 bytes per sector to play with, we can't describe 7 files per sector. We need 2 bytes for the pointer to the next catalog sector, leaving 14 bytes. Each file should have a byte for flags: deleted, read-only, etc. That means we can have either a 13-byte file name for 1 file per catalog sector, or two 6-byte file names for 2 files per catalog sector. We could do 7 single-letter file names, but that's lame. If we go with your 3-character file name idea, that's 3 files per catalog sector after accounting for the flag byte per file, leaving 2 extra bytes to define. I'd go with 1 or 2 files per sector, though.
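Pulling the pieces above together, here is a minimal sketch of these on-disk structures as C structs. The field names and exact widths are my own assumptions; note also that a real catalog entry would still need to point at the file's data sectors, which falls under the "implementation and expansion" mentioned below.

#include <stdint.h>

#define SECTOR_SIZE   16
#define SECTORS_TOTAL 128                 /* 2 kiB / 16 bytes */

/* Volume Label Sector: track 0, sector 0. */
struct vls {
    uint8_t magic[2];       /* identifies the volume format              */
    uint8_t name[11];       /* ASCII volume name                         */
    uint8_t vfsb_sectors;   /* number of free-bitmap sectors that follow */
    uint8_t cat_track;      /* first catalog sector: track...            */
    uint8_t cat_sector;     /* ...and sector                             */
};                          /* 2 + 11 + 1 + 1 + 1 = 16 bytes             */

/* Volume Free Sector Bitmap: track 0, sector 1. One bit per sector,
 * including the two sectors that track 0 itself uses. */
struct vfsb {
    uint8_t bitmap[SECTORS_TOTAL / 8];    /* 16 bytes */
};

/* A catalog sector holding two files with 6-byte names, per the
 * arithmetic above: 2 bytes of next-pointer + 2 * (1 + 6) bytes = 16. */
struct catalog_entry {
    uint8_t flags;          /* deleted, read-only, ...                   */
    uint8_t name[6];
};

struct catalog_sector {
    uint8_t next_track;     /* next catalog sector (e.g. 0xFF = none)    */
    uint8_t next_sector;
    struct catalog_entry files[2];
};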
That's pretty much what you need. The rest is implementation and expansion.
One other idea for expansion: what if we want to use this as a bootable disk medium? Such things usually do need a boot loader, so do we need to move the VLS and VFSB sectors down 1, to leave track 0 sector 0 aside for a boot image? Or, maybe the VLS contains a pointer to the first catalog sector that describes the file containing the boot image instead.

What is a maximum size of SQLite database? [duplicate]

I have read their limits FAQ; they talk about many limits, but not the limit of the whole database.
This is fairly easy to deduce from the implementation limits page:
An SQLite database file is organized as pages. The size of each page is a power of 2 between 512 and SQLITE_MAX_PAGE_SIZE. The default value for SQLITE_MAX_PAGE_SIZE is 32768.
...
The SQLITE_MAX_PAGE_COUNT parameter, which is normally set to 1073741823, is the maximum number of pages allowed in a single database file. An attempt to insert new data that would cause the database file to grow larger than this will return SQLITE_FULL.
So we have 32768 * 1073741823, which is 35,184,372,056,064 (35 trillion bytes)!
You can modify SQLITE_MAX_PAGE_COUNT or SQLITE_MAX_PAGE_SIZE in the source, but this of course will require a custom build of SQLite for your application. As far as I'm aware, there's no way to set a limit programmatically other than at compile time (but I'd be happy to be proven wrong).
The limits page has since been updated; the database size limit is now 281 TB (256 TiB):
Every database consists of one or more "pages". Within a single database, every page is the same size, but different databases can have page sizes that are powers of two between 512 and 65536, inclusive. The maximum size of a database file is 4294967294 pages. At the maximum page size of 65536 bytes, this translates into a maximum database size of approximately 2.8e+14 bytes (281 terabytes, or 256 tebibytes, or 281474 gigabytes or 256,000 gibibytes).
This particular upper bound is untested since the developers do not have access to hardware capable of reaching this limit. However, tests do verify that SQLite behaves correctly and sanely when a database reaches the maximum file size of the underlying filesystem (which is usually much less than the maximum theoretical database size) and when a database is unable to grow due to disk space exhaustion.
The new limit is 281 terabytes. https://www.sqlite.org/limits.html
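For reference, both figures quoted above are just page size times page count; a quick check of the arithmetic:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Figures taken from the quotes above. */
    uint64_t old_limit = 32768ULL * 1073741823ULL;   /* default SQLITE_MAX_PAGE_SIZE * SQLITE_MAX_PAGE_COUNT */
    uint64_t new_limit = 65536ULL * 4294967294ULL;   /* max page size * max page count */

    printf("old limit: %llu bytes (~35 TB)\n", (unsigned long long)old_limit);
    printf("new limit: %llu bytes (~281 TB, i.e. 256 TiB)\n", (unsigned long long)new_limit);
    return 0;
}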
Though this is an old question, let me share my findings for people who reach it.
Although the SQLite documentation states that the maximum size of a database file is ~140 terabytes, your OS imposes its own restrictions on the maximum file size for any type of file.
For example, using a FAT32 disk on Windows, the maximum file size I could achieve for an SQLite db file was 2 GB. (According to the Microsoft site, the limit on a FAT32 system is 4 GB, but my SQLite db size was still restricted to 2 GB.) On Linux, I was able to reach 3 GB (which is where I stopped; it could have grown larger).
NOTE: I wrote a small Java program that starts populating the SQLite db from 0 rows and keeps populating it until a stop command is given.
The maximum number of bytes in a string or BLOB in SQLite is defined by the preprocessor macro SQLITE_MAX_LENGTH. The default value of this macro is 1 billion (1 thousand million or 1,000,000,000). 
The current implementation will only support a string or BLOB length up to 2^31 - 1, or 2147483647.
The default setting for SQLITE_MAX_COLUMN is 2000. You can change it at compile time to values as large as 32767. On the other hand, many experienced database designers will argue that a well-normalized database will never need more than 100 columns in a table.
SQLite does not support joins containing more than 64 tables.
The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 140 terabytes will be reached first.
Max size of DB: 140 terabytes
Please check this URL for more info: https://www.sqlite.org/limits.html
I'm just starting to explore SQLite for a project I'm working on, but it seems to me that the effective size of a database is actually more flexible than the file system would seem to allow.
By utilizing the 'attach' capability, a combined database could be built that exceeds the file system's max file size by up to 125 times... so the effective FAT32 limit would actually be 500 GB (125 x 4 GB)... if the data could be balanced perfectly between the various files.

Binary Search on Large Disk File in C - Problems

This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23 GB file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it and print out the corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory and so I have been trying to use mmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not previously tried all the various improvements because I was wondering if I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitably run into problems. So I squashed the hash codes down to the length of the leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory, and ran the incoming hash codes in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash codes and another with the integer IDs. Since the rows correspond one-to-one, it will line up fine after the result is found. You can also try an approach that puts every Nth hash into another file and then stores the index.
For example, put every 1000th hash key into a new file together with its index, and load that into memory. Then binary search that instead. This will tell you the range of 1000 entries that needs to be further scanned in the file. You could probably use a much smaller stride than that; every 20th record or so would divide the file size down by roughly 20.
In other words, after scanning you only need to touch a few kilobytes of the file on disk.
Another option is to split the file and put it in memory on multiple machines. Then just binary search each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code <- code_block[N - 1]
while (input codes remain) {
    input_code <- next input code
    while (input_code > max_code) {
        code_block <- mmap the next N entries of the file
        max_code <- code_block[N - 1]
    }
    binary search for input_code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
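A sketch of that single-layer index, under some stated assumptions: the big file has fixed-length records of RECLEN bytes, each beginning with a 40-character hash, and the constants and names (RECLEN, K, and so on) are illustrative rather than taken from the question.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASHLEN 40
#define RECLEN  52          /* assumed: 40-char hash + id + newline      */
#define K       1024        /* index every Kth record                    */

struct index_entry {
    char hash[HASHLEN];
    long offset;            /* byte offset of the record in the big file */
};

/* Build the sparse index in one sequential pass over the file. */
size_t build_index(FILE *f, struct index_entry **out)
{
    size_t cap = 1024, n = 0;
    struct index_entry *idx = malloc(cap * sizeof *idx);
    char rec[RECLEN];
    long rec_no = 0;

    while (fread(rec, 1, RECLEN, f) == RECLEN) {
        if (rec_no % K == 0) {
            if (n == cap)
                idx = realloc(idx, (cap *= 2) * sizeof *idx);
            memcpy(idx[n].hash, rec, HASHLEN);
            idx[n].offset = rec_no * (long)RECLEN;
            n++;
        }
        rec_no++;
    }
    *out = idx;
    return n;
}

/* Binary search the index for the largest entry <= target; the caller
 * then only has to scan K records starting at idx[slot].offset. */
size_t find_segment(const struct index_entry *idx, size_t n, const char *target)
{
    size_t lo = 0, hi = n;              /* answer lives in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (memcmp(idx[mid].hash, target, HASHLEN) <= 0)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

The caller then mmaps (or reads) the K records starting at that offset and finishes with a small binary search there, exactly as described above.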
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
Lastly, if you have extra money available you can always buy more than 23 GB of RAM and make this a memory-bound problem again (I just looked at Dell's website - you can pick up a new low-end workstation with 32 GB of RAM for just under $1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
#define _LARGEFILE64_SOURCE
#include <stdint.h>
#include <unistd.h>

/* fd is the open file descriptor and line_len the fixed record length,
 * both assumed to be set up elsewhere. */
void read_hash(int line, char *hashbuf) {
    lseek64(fd, (uint64_t)line * line_len, SEEK_SET);
    read(fd, hashbuf, 40);
}

int read_int(int line) {
    int ret;
    lseek64(fd, (uint64_t)line * line_len + 40, SEEK_SET);
    read(fd, &ret, sizeof(int));
    return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
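For completeness, a binary search over the sorted file using those helpers might look like the sketch below (num_lines is the record count, assumed known; the function name is illustrative):

#include <string.h>

/* Returns the identifier for `target` (a 40-char hash), or -1 if absent.
 * Assumes the file's lines are sorted by hash. */
int lookup(const char *target, long num_lines) {
    char buf[40];
    long lo = 0, hi = num_lines - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        read_hash((int)mid, buf);
        int cmp = memcmp(buf, target, 40);
        if (cmp == 0)
            return read_int((int)mid);
        else if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}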
We don't know the back story, so it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32 GB of RAM doesn't seem so expensive compared to two days of work for a person who makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here is a low-risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8 GB or 6 chunks of 4 GB (depending on the machines you have around; it needs to fit comfortably in memory). On each machine run the same software, but in memory, and put an RPC stub around each. Write an RPC caller to each of your 3 or 6 workers to determine the integer associated with a given hash code.

Basic concepts in file system implementation

I am unclear about file system implementation. Specifically, Operating Systems - Tanenbaum (Edition 3), page 275, states: "The first word of each block is used as a pointer to the next one. The rest of the block is data".
Can anyone please explain to me the hierarchy of the division here? Like, each disk partition contains blocks, blocks contain words? and so on...
I don't have the book in front of me, but I suspect that the quoted sentence isn't really talking about files, directories, or other file system structures. (Note that a partition isn't a file system concept, generally.) I think your quoted sentence is really just pointing out how the data structures stored in disk blocks are chained together. It means just what it says. Each block (usually 4k, but maybe just 512B) looks very roughly like this:
+------------------+------------- . . . . --------------+
| next blk pointer | another 4k - 4 or 8 bytes of stuff |
+------------------+------------- . . . . --------------+
The stuff after the next block pointer depends on what's stored in this particular block. From just the sentence given, I can't tell how the code figures that out.
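As a concrete, purely illustrative picture of that layout, assuming 4 KiB blocks and a 4-byte block number as the link (the sizes and names are assumptions, not a real on-disk format):

#include <stdint.h>

#define BLOCK_SIZE 4096

/* One disk block in a simple linked-allocation scheme: the first word
 * names the next block in the chain, the rest is payload. */
struct linked_block {
    uint32_t next;                                 /* block number of the next block, 0 = end of chain */
    uint8_t  data[BLOCK_SIZE - sizeof(uint32_t)];  /* remaining 4092 bytes of stuff */
};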
With regard to file system structures:
A disk is an array of sectors, almost always 512B in size. Internally, disks are built of platters, which are the spinning disk-shaped things covered in rust, and each platter is divided up into many concentric tracks. However, these details are entirely hidden from the operating system by the ATA or SCSI disk interface hardware.
The operating system divides the array of sectors up into partitions. Partitions are contiguous ranges of sectors, and partitions don't overlap. (In fact this is allowed on some operating systems, but it's just confusing to think about.)
So, a partition is also an array of sectors.
So far, the file system isn't really in the picture yet. Most file systems are built within a partition. The file system usually has the following concepts. (The names I'm using are those from the unix tradition, but other operating systems will have similar ideas.)
At some fixed location on the partition is the superblock. The superblock is the root of all the file system data structures, and contains enough information to point to all the other entities. (In fact, there are usually multiple superblocks scattered across the partition as a simple form of fault tolerance.)
The fundamental concept of the file system is the inode, pronounced "eye-node". Inodes represent the various types of objects that make up the file system, the most important being plain files and directories. An inode might be its own block, but some file systems pack multiple inodes into a single block. Inodes can point to a set of data blocks that make up the actual contents of the file or directory. How the data blocks for a file are organized and indexed on disk is one of the key tasks of a file system. For a directory, the data blocks hold information about the files and subdirectories contained within the directory, and for a plain file, the data blocks hold the contents of the file.
Data blocks are the bulk of the blocks on the partition. Some are allocated to various inodes (ie, to directories and files), while others are free. Another key file system task is allocating free data blocks as data is written to files, and freeing data blocks from files when they are truncated or deleted.
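To make the inode idea concrete, here is a deliberately simplified, illustrative sketch; it is not the layout of any real file system, and the field names and counts are assumptions:

#include <stdint.h>

#define NDIRECT 12   /* number of direct block pointers (illustrative) */

/* A toy inode: just enough to see how an inode points at its data blocks.
 * Real file systems add indirect blocks or extents, timestamps, owners, etc. */
struct inode {
    uint16_t mode;               /* object type and permissions               */
    uint16_t nlink;              /* number of directory entries (hard links)  */
    uint32_t size;               /* size in bytes                             */
    uint32_t direct[NDIRECT];    /* block numbers of the first data blocks    */
    uint32_t indirect;           /* block holding further block numbers       */
};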
There are many many variations on all of these concepts, and I'm sure there are file systems where what I've said above doesn't line up with reality very well. However, with the above, you should be in a position to reason about how file systems do their job, and understand, at least a bit, the differences you run across in any specific file system.
I don't know the context of this sentence, but it appears to be describing a linked list of blocks. Generally speaking, a "block" is a small number of bytes (usually a power of two). It might be 4096 bytes, it might be 512 bytes, it depends. Hard drives are designed to retrieve data a block at a time; if you want to get the 1234567th byte, you'll have to get the entire block it's in. A "word" is much smaller and refers to a single number. It may be as low as 2 bytes (16-bit) or as high as 8 bytes (64-bit); again, it depends on the filesystem.
Of course, blocks and words aren't all there is to filesystems. Filesystems typically implement a B-tree of some sort to make lookups fast (so they don't have to search the whole filesystem to find a file, just walk down the tree). In a filesystem B-tree, each node is stored in a block. Many filesystems use a variant of the B-tree called a B+-tree, which connects the leaves together with links to make traversal faster. The structure described here might be describing the leaves of a B+-tree, or it might be describing a chain of blocks used to store a single large file.
In summary, a disk is like a giant array of bytes which can be broken down into words, which are usually 2-8 bytes, and blocks, which are usually 512-4096 bytes. There are other ways to break it down, such as heads, cylinders, sectors, etc.. On top of these primitives, higher-level index structures are implemented. By understanding the constraints a filesystem developer needs to satisfy (emulate a tree of files efficiently by storing/retrieving blocks at a time), filesystem design should be quite intuitive.
Tracks >> Blocks >> Sectors >> Words >> Bytes >> Nibbles >> Bits
Tracks are concentric rings from inside to the outside of the disk platter.
Each track is divided into slices called sectors.
A block is a group of sectors (1, 2, 4, 8, 16, etc). The bigger the drive, the more sectors that a block will hold.
A word is the number of bits a CPU can handle at once (16-bit, 32-bit, 64-bit, etc), and in your example, stores the address (or perhaps offset) of the next block.
Bytes contain nibbles and bits. 1 Byte = 2 Nibbles; 1 Nibble = 4 Bits.
