What does 128-bit file system mean? - filesystems

In the introduction to ZFS file system, I saw one statement:
ZFS file system is quite scalable, 128 bit filesystem
What does 128-bit filesystem mean? What makes it scalable?

ZFS is a “128-bit” file system, which means 128 bits is the largest address size for any unit within it. That allows capacities and sizes unlikely to become confining anytime in the foreseeable future. For instance, the theoretical limits it imposes include 2^48 entries per directory, a maximum file size of 16 EiB (2^64 bytes, i.e. 16 × 2^60), and a maximum of 2^64 devices per “zpool”. Source: File System Char.
The ZFS 128-bit addressing scheme can store 256 quadrillion zettabytes, which translates into a scalable file system that can exceed thousands of petabytes of storage capacity, whether managed as a single pool or as multiple ZFS RAID-Z arrays. Source: zfs-unlimited-scalability

TL;DR: it can hold much larger files than a 64-bit filesystem such as ext4.
ZFS is a 128-bit file system,[85] so it can address 1.84 × 10^19 times more data than 64-bit systems such as Btrfs. The limitations of ZFS are designed to be so large that they should not be encountered in the foreseeable future.
Some theoretical limits in ZFS are:
2^48 — Number of entries in any individual directory[86]
16 Exbibytes (2^64 bytes) — Maximum size of a single file
16 Exbibytes — Maximum size of any attribute
256 Zebibytes (2^78 bytes) — Maximum size of any zpool
2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
2^64 — Number of devices in any zpool
2^64 — Number of zpools in a system
2^64 — Number of file systems in a zpool
More here.
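To get a feel for those numbers, here is a small sanity check of the quoted limits (a sketch in Python; the figures themselves come from the list above):

    # Rough sanity check of the ZFS limits quoted above.
    # Python integers are arbitrary precision, so values like 2**78 are exact.

    EiB = 2**60   # exbibyte
    ZiB = 2**70   # zebibyte

    max_file_size  = 2**64   # bytes
    max_zpool_size = 2**78   # bytes
    dir_entries    = 2**48

    print(max_file_size / EiB)        # 16.0  -> 16 EiB maximum file size
    print(max_zpool_size / ZiB)       # 256.0 -> 256 ZiB maximum zpool size
    print(f"{dir_entries:,}")         # 281,474,976,710,656 entries per directory
    print(f"{2**128 / 2**64:.3e}")    # ~1.845e+19, the "times more data" figure above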

Related

What determines the maximum size of array?

I know there is an upper limit on array size. What determines the maximum size of an array? Can someone give a simple, detailed explanation?
It depends on the language, the library, and the operating system.
For an in-memory array, which is what most languages offer by default, the upper limit is the address space given to the process. On Windows this is 2 or 3 GB for 32-bit applications; for 64-bit applications it is the smaller of 8 TB and (physical RAM + page file size limit).
For a custom library using disk space to (very slowly) access an array, the limit will probably be the size of the largest storage volume. With RAID and 10+ TB drives that can be a very large number.
Once you know the memory limit for an array, the upper limit on the number of elements is (memory / element size). The actual limit will often be lower if the elements are small, since the array addressing might use 32-bit unsigned integers, which can only address about 4 billion (2^32) elements.
This is for the simple contiguous, typed arrays offered by languages like C++. Languages like PHP, where you can write a['hello'] = 'bobcat'; a[12] = 3.14;, are more like maps and can use much more memory per element, since they store the key alongside each value.
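A rough sketch of that arithmetic (Python; the 2 GB and 8 TB figures are just the Windows address-space limits mentioned above, and the helper function is illustrative, not any particular library's API):

    # Upper bound on element count for a contiguous in-memory array:
    # elements <= usable address space / element size, further capped by the
    # width of the index type the array implementation uses.

    def max_elements(address_space_bytes, element_size, index_bits=64):
        by_memory = address_space_bytes // element_size
        by_index = 2 ** index_bits            # e.g. 2**32 if indices are 32-bit
        return min(by_memory, by_index)

    GiB, TiB = 2**30, 2**40

    # 32-bit Windows process (~2 GiB user address space), 8-byte doubles:
    print(max_elements(2 * GiB, 8))                    # 268,435,456 elements

    # 64-bit process capped at 8 TiB, 1-byte elements, but a 32-bit index:
    print(max_elements(8 * TiB, 1, index_bits=32))     # 4,294,967,296 -> index-limited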

Databases, RAM and performance

I have a 5 GB dictionary, where the key is a word and the value is a 300-dimensional vector of numbers, but I have only 1 GB of RAM (minus about 200 MB used by the server) and a 50 GB SSD. My goal is relatively fast (1-3 s) retrieval of the vector for every word in an input sentence.
What kind of storage system would be best for this kind of task? Is a NoSQL database like MongoDB a good option?
If so, is there a way to calculate the minimum cache memory MongoDB will need, and is this solution feasible with the given hardware?
Thank you.
Assuming single-precision floats (32 bits each) and 32-bit word keys, 5 GB works out to roughly 4.1 million vectors (300 × 4 bytes per vector plus a 4-byte key ≈ 1204 bytes each).
You could store a <word, location> dictionary with these 4.1 million entries in RAM. The value part of the dictionary points to a combination of file and offset within that file, stored on the SSD. If your assumptions differ, the calculation stays similar.
It is probably not practical to store all the vectors in a single file. It might also be sufficient to store the vectors in a database, provided the tablespace resides on the SSD.
Example: You could have 32 files with 130,000 vectors each. Then the highest 5 bits of the word value indicate the file, and the lowest 27 bits are the offset, or vector number, within that file.
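A minimal sketch of that file/offset scheme (Python; the vectors_NN.bin file names, the 1200-byte record layout, and the word_to_id dictionary are assumptions for illustration, not part of the original answer):

    import struct

    VECTOR_DIM = 300
    RECORD_SIZE = VECTOR_DIM * 4                 # 300 single-precision floats per vector
    FILE_BITS = 5                                # highest 5 bits of the id select the file
    OFFSET_MASK = (1 << (32 - FILE_BITS)) - 1    # lowest 27 bits: record number in that file

    def lookup(word_id):
        """Fetch the vector for a 32-bit word id from one of 32 hypothetical vectors_NN.bin files."""
        file_no = word_id >> (32 - FILE_BITS)
        record_no = word_id & OFFSET_MASK
        with open(f"vectors_{file_no:02d}.bin", "rb") as f:
            f.seek(record_no * RECORD_SIZE)
            data = f.read(RECORD_SIZE)
        return struct.unpack(f"<{VECTOR_DIM}f", data)

    # word_to_id would be the ~4.1 million entry dict held in RAM (word -> 32-bit id):
    # vector = lookup(word_to_id["cat"])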

How many bytes is a gigabyte (GB)?

When I convert 1 GB to bytes using online tools, I get different answers. For instance, Google's converter gives 1 GB = 1e+9, while another converter gives 1 GB = 1073741824. I suppose the unit is interpreted differently depending on whether 1 KB = 1024 B or 1 KB = 1000 B (the latter is what Google uses).
How can I find out which unit my machine uses with a small C program or function? Does C have a macro for that? I want to do this because my program may be run on various operating systems.
The two tools are converting two different units.
1 GB = 10^9 bytes while 1 GiB = 2^30 bytes.
Try using google converter with GiB instead of GB and the mystery will be solved.
The following will help you understand the conversion a little better.
Factor  Name  Symbol  Origin                  Derivation (decimal)
2^10    kibi  Ki      kilobinary: (2^10)^1    kilo: (10^3)^1
2^20    mebi  Mi      megabinary: (2^10)^2    mega: (10^3)^2
2^30    gibi  Gi      gigabinary: (2^10)^3    giga: (10^3)^3
2^40    tebi  Ti      terabinary: (2^10)^4    tera: (10^3)^4
2^50    pebi  Pi      petabinary: (2^10)^5    peta: (10^3)^5
2^60    exbi  Ei      exabinary: (2^10)^6     exa: (10^3)^6
Note that the new prefixes for binary multiples are not part of the International System of Units (SI). However, for ease of understanding and recall, they were derived from the SI prefixes for positive powers of ten. As shown in the table, the name of each new prefix is derived from the name of the corresponding SI prefix by retaining the first two letters of the SI prefix and adding the letters bi.
There's still a lot of confusion about the usage of GB and GiB; in fact, GB is very often used where GiB was intended.
Think about the hard drive world:
Your operating system assumes that 1 MB equals 1,048,576 bytes, i.e. 1 MiB. Drive manufacturers consider (correctly) 1 MB to be 1,000,000 bytes. Thus, if a drive is advertised as 6.4 GB (6,400,000,000 bytes), the operating system reports it as about 6,104 MB (6,400,000,000 / 1,048,576), which reads as roughly "6.1 GB", or about 5.96 GiB if you carry the binary conversion all the way up.
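A quick illustration of the two conversions and the drive example above (a Python sketch):

    ADVERTISED = 6_400_000_000    # a "6.4 GB" drive, as the manufacturer counts it

    GB  = 10**9                   # decimal gigabyte
    GiB = 2**30                   # binary gibibyte
    MiB = 2**20                   # binary mebibyte

    print(ADVERTISED / GB)        # 6.4      (decimal GB)
    print(ADVERTISED / MiB)       # ~6103.5  MiB, which gets read as "6.1 GB"
    print(ADVERTISED / GiB)       # ~5.96    GiB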
Take a look at this for more info on prefixes for binary units
and this on metric prefixes.
This is just a confusion of units. There are actually two prefixes: G for 10⁹ and Gi for 2³⁰. Byte counts should usually be measured with the second, so the correct way to write it would be GiB.
The “gibibyte” is a multiple of the unit byte for digital information.
The binary prefix gibi means 2^30, therefore one gibibyte is equal to 1,073,741,824 bytes = 1024 mebibytes.
The unit symbol for the gibibyte is GiB. It is one of the units with binary prefixes defined by the International Electrotechnical Commission (IEC) in 1998.
The gibibyte is closely related to the gigabyte (GB), which is defined by the IEC as 10^9 bytes = 1,000,000,000 bytes; 1 GiB ≈ 1.074 GB.
1024 gibibytes are equal to one tebibyte.
In the context of computer memory, gigabyte and GB are customarily used to mean 1024^3 (2^30) bytes, although not in the context of data transmission and not necessarily for hard drive sizes.

What's the biggest difference between a 64-bit file system and a 32-bit file system?

May I ask what the biggest difference is between a 64-bit file system and a 32-bit file system?
More available inodes? Bigger partition?
There is no hard-and-fast standard for exactly what bit size means for filesystems, but I usually see it refer to the data type that stores block addresses. More bits translates to a larger maximum partition size and the ability to use bigger drives as a single filesystem. It can sometimes also mean larger maximum file size or a larger number of files allowed per directory.
It's not directly analogous to CPU bit size, and you'll find filesystems that are 48 bits and ones that are 128 bits. The bit size of a particular filesystem is usually very low in importance as it doesn't give you any indication of how fast, resilient, or manageable a filesystem is.
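As a rough illustration of how the block-address width sets the maximum filesystem size (a Python sketch; the 4 KiB block size is an assumption, and real filesystems also reserve some address values, so actual limits are a little lower):

    # Maximum addressable filesystem size = number of addressable blocks * block size.
    BLOCK_SIZE = 4096   # 4 KiB, a common default

    def max_fs_size(address_bits, block_size=BLOCK_SIZE):
        return (2 ** address_bits) * block_size

    for bits in (32, 48, 64):
        print(f"{bits}-bit block addresses -> {max_fs_size(bits) / 2**40:,.0f} TiB")
    # 32-bit -> 16 TiB, 48-bit -> 1,048,576 TiB (1 EiB), 64-bit -> 68,719,476,736 TiB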

SCSI Read10 vs Read16

Which case would be considered correct?
Doing reads with a Read 16 command no matter whether the LBAs are 32-bit or 64-bit.
If the max LBA is 32-bit, do a Read 10 command; if the max LBA is 64-bit, do a Read 16 command.
What are the pros and cons of each choice?
I know for a Read Capacity command it is correct to run a 10 and if it returns FFFFFFFFh then run a 16. Why is this the case? The Read Capacity 16 command works for both cases and avoids even needing the Read Capacity 10 at all.
Keep in mind that the reason SCSI has multiple “sizes” of commands like this is, in many cases, because SCSI is a very old protocol. (It was first standardized in 1986, and was in development for some time before that!) Back then, a large SCSI device was in the low hundreds of megabytes; even a 32-bit LBA was considered excessive. The 64-bit LBA commands simply didn't exist until much later.
The question here is really just whether you want to support these old devices. If you do, you will need to use Read (10) for reads on "small" devices, as they may not recognize Read (16). Similarly, the use of Read Capacity (10) before Read Capacity (16) is because older devices won't recognize the larger version.
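A sketch of the selection logic described above (Python; the helper names and return strings are illustrative only, since the actual CDB construction and pass-through I/O depend on your OS and driver interface):

    # Sketch of the command-selection logic only; building the CDBs and issuing
    # them to a device is not shown here.

    MAX_LBA_10   = 0xFFFFFFFF   # largest LBA expressible in a READ (10) CDB
    MAX_COUNT_10 = 0xFFFF       # READ (10) transfer length is a 16-bit field

    def choose_read_command(max_lba, lba, count):
        """Return 'READ(10)' or 'READ(16)' for a transfer of `count` blocks at `lba`."""
        fits_10 = (max_lba <= MAX_LBA_10
                   and lba + count - 1 <= MAX_LBA_10
                   and count <= MAX_COUNT_10)
        # Old/small devices may not implement READ (16), so prefer READ (10)
        # whenever everything still fits in its fields.
        return "READ(10)" if fits_10 else "READ(16)"

    def needs_read_capacity_16(read_capacity_10_max_lba):
        """READ CAPACITY (10) returns 0xFFFFFFFF when the real capacity doesn't fit."""
        return read_capacity_10_max_lba == MAX_LBA_10

    print(choose_read_command(max_lba=0x00FFFFFF, lba=0, count=8))        # READ(10)
    print(choose_read_command(max_lba=0x1FFFFFFFF, lba=2**33, count=8))   # READ(16)
    print(needs_read_capacity_16(0xFFFFFFFF))                             # True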
