SCSI Read10 vs Read16

Which case would be considered correct?
Doing reads with a Read (16) command regardless of whether the LBAs are 32- or 64-bit.
If the max LBA fits in 32 bits, doing a Read (10) command, and if the max LBA requires 64 bits, doing a Read (16) command.
What are the pros and cons of each choice?
I know that for a Read Capacity command it is correct to issue the 10-byte version and, if it returns FFFFFFFFh, then issue the 16-byte version. Why is this the case? The Read Capacity (16) command works in both cases and avoids needing Read Capacity (10) at all.

Keep in mind that the reason that SCSI has multiple "sizes" of commands like this is, in many cases, because SCSI is a very old protocol. (It was first standardized in 1986, and was in development for some time before that!) At the time, a large SCSI device would range in the low hundreds of megabytes — even a 32-bit LBA was considered excessive at the time. The 64-bit LBA commands simply didn't exist until much later.
The question here is really just whether you want to support these old devices. If you do, you will need to use Read (10) for reads on "small" devices, as they may not recognize Read (16). Similarly, the use of Read Capacity (10) before Read Capacity (16) is because older devices won't recognize the larger version.


Limit of work items in OpenCL is not matching the theoretical limit of CL_DEVICE_ADDRESS_BITS. What is happening?

I have been working with OpenCL for a few months now and recently came across an issue regarding the maximum number of work items (global size) in a particular program that I developed.
First of all, let me go through the working sizes checklist:
The kernel I am running is one-dimensional.
The local work group size is set to CL_DEVICE_MAX_WORK_GROUP_SIZE. I also made sure that I can safely use this work group size, since nearly no local memory is used.
The local work group size always divides the global item size.
OK, now on to the problem. My one-dimensional kernel finishes execution with error code -36 (CL_INVALID_COMMAND_QUEUE) if and only if the number of work items (global item size) is above 2^31 (I add the work group size so as to be above 2^31 while remaining a multiple of it). The first thing I do when I run my program is check that CL_DEVICE_ADDRESS_BITS is 64. See below, along with some more information that I extract at the beginning of the program:
Device [3]: GeForce GTX 980
Profile : OpenCL 1.2 CUDA
is available?: 1
Global mem : 4234280960 (4038 MB)
Local mem : 49152 (48 KB)
Compute units: 16
Max work group size: 1024
Work size items: (1024, 1024, 64)
Address space in bits: 64
In case you are wondering, this is an application that computes a lot of small jobs. I already devised a workaround, which is to assign more work to each work item so that there are fewer work items. However, I am interested in why this is happening. As far as I know, it should not happen, since the address space is 64 bits. I have also made sure that all variables in the program are 64 bits wide and that the value is not being truncated anywhere. Just in case, the variable is:
size_t global_item_size;
Which I also print during the execution like this to make sure it contains the appropriate value:
fprintf(stdout, "Work items: %"PRIu64"\n", (uint64_t) global_item_size);
Finally, the kernel receives the number of work items as a parameter inside a struct, which is also 64 bits:
ulong t_work_items;
I cannot think of any reason for this to fail when the address space is 64 bits and all variables that handle the number of work items (2^31 + WORK_GROUP_SIZE) are also 64 bits.
That said, I have not been able to find a report from anybody else (nor anything in the documentation) about executing kernels with such a high number of work items, successfully or otherwise.
In particular, I do not strictly need this to work since, as I said, I can just increase the job size per work item. However, I wanted to measure performance as a function of the job size per work item and the work group size, and came across this.
Thank you for your time reading into this!
EDIT1:
I have also checked CL_DEVICE_MAX_WORK_ITEM_SIZES and got Max work item sizes: (1024, 1024, 64). However, I believe this is not what limits the number of work items, since according to the clEnqueueNDRangeKernel documentation in the OpenCL reference:
The values specified in global_work_size cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in the table of OpenCL Device Queries for clGetDeviceInfo. If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error.
I do not get a CL_OUT_OF_RESOURCES error since my address space is 64 bits. What could be wrong?

What's biggest difference between 64bit file system and 32bit file system

May I ask what the biggest difference is between a 64-bit file system and a 32-bit file system?
More available inodes? Bigger partition?
There is no hard-and-fast standard for exactly what bit size means for filesystems, but I usually see it refer to the data type that stores block addresses. More bits translates to a larger maximum partition size and the ability to use bigger drives as a single filesystem. It can sometimes also mean larger maximum file size or a larger number of files allowed per directory.
It's not directly analogous to CPU bit size, and you'll find filesystems that are 48 bits and ones that are 128 bits. The bit size of a particular filesystem is usually very low in importance as it doesn't give you any indication of how fast, resilient, or manageable a filesystem is.

What does 128-bit file system mean?

In the introduction to ZFS file system, I saw one statement:
ZFS file system is quite scalable, 128 bit filesystem
What does 128-bit filesystem mean? What makes it scalable?
ZFS is a “128 bit” file system, which means 128 bits is the largest address size for any unit within it. This allows capacities and sizes not likely to become confining anytime in the foreseeable future. For instance, the theoretical limits it imposes include 2^48 entries per directory, a maximum file size of 16 EiB (2^64 bytes, i.e. 16 × 2^60 bytes), and a maximum of 2^64 devices per “zpool”. Source: File System Char.
The ZFS 128-bit addressing scheme can store 256 quadrillion zettabytes, which translates into a scalable file system that exceeds thousands of petabytes of storage capacity, while allowing it to be managed as single or multiple ZFS Z-RAID arrays. Source: zfs-unlimited-scalability
TL;DR: it can hold much larger files than a 64-bit filesystem such as ext4.
ZFS is a 128-bit file system,[85] so it can address 1.84 × 10^19 times more data than 64-bit systems such as Btrfs. The limitations of ZFS are designed to be so large that they should not be encountered in the foreseeable future.
Some theoretical limits in ZFS are:
2^48 — Number of entries in any individual directory[86]
16 exbibytes (2^64 bytes) — Maximum size of a single file
16 exbibytes — Maximum size of any attribute
256 zebibytes (2^78 bytes) — Maximum size of any zpool
2^56 — Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
2^64 — Number of devices in any zpool
2^64 — Number of zpools in a system
2^64 — Number of file systems in a zpool

File and networking portability among different byte sizes

In C, the fread function is like this:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
Usually char* arrays are used as buf. People usually assume that char = 8 bit. But what if it isn't true? What happens if files written in 8 bit byte systems are read on 10 bit byte systems? Is there any single standard on portability of files and network streams between systems with bytes of different size? And most importantly, how to write portable code in this regard?
With regard to network communications, the physical access protocols (like ethernet) define how many bits there go in a "unit of information" and it is up to the implementation to map this to an appropriate type. So, for network communications there is no problem with supporting weird architectures.
For file access, stuff gets more interesting if you want to support weird architectures, because there are no standards to refer to and even the method of putting the files on the system may influence how you can access them.
Fortunately, the only systems currently in use that don't support 8-bit bytes are DSP's and similar small embedded systems that don't support a filesystem at all, so the issue is essentially moot.
Systems with byte sizes other than 8 bits are pretty rare these days. But there are machines with other sizes, and files are not guaranteed to be portable to those machines.
If uber-portability is required, then you will have to have some sort of encoding in your file that copes with char != 8 bits.
Do you have something in mind where this may have to run on a DEC 10, a really old IBM mainframe, a DSP, or some such, or are you just asking for the purpose of "I want to know"? If the latter, I would just ignore the case. Machines that don't have 8-bit characters are pretty special, and you will most likely have other problems than bits per char in getting your files onto the system in the first place, as you probably can't plug in a USB stick or transfer them with FTP (although the latter is perhaps the most likely option).

Understanding CPU cache and cache line

I am trying to understand how a CPU cache operates. Let's say we have this configuration (as an example):
Cache size 1024 bytes
Cache line 32 bytes
1024/32 = 32 cache lines altogether.
A single cache line can store 32/4 = 8 ints.
1) According to this configuration, the length of the tag should be 32-5=27 bits, and the size of the index 5 bits (2^5 = 32 addresses for each byte in a cache line).
If the total cache size is 1024 bytes and there are 32 cache lines, where are the tags+indexes stored? (That is another 4*32 = 128 bytes.) Does it mean that the actual size of the cache is 1024+128 = 1152 bytes?
2) If a cache line is 32 bytes in this example, this means that 32 bytes get copied into the cache whenever the CPU needs to get a new byte from RAM. Am I right to assume that the cache line position of the requested byte will be determined by its address?
This is what I mean: if the CPU requested the byte at [FF FF 00 08], then an available cache line will be filled with the bytes from [FF FF 00 00] to [FF FF 00 1F], and our requested single byte will be at position [08].
3) If the previous statement is correct, does it mean that the 5 bits used for the index are technically not needed, since all 32 bytes are in the cache line anyway?
Please let me know if I got something wrong.
Thanks
A cache consists of data and tag RAM, arranged as a compromise between access time, efficiency, and physical layout. You're missing an important stat: the number of ways (sets). You rarely see 1-way caches, because they perform pathologically badly with simple access patterns. Anyway:
1) Yes, tags take extra space. This is part of the design compromise - you don't want them to be a large fraction of the total area, and it's also why the line size isn't just 1 byte or 1 word. Also, all tags for an index are accessed simultaneously, and that can affect efficiency and layout if there's a large number of ways. The size is slightly bigger than your estimate: there are usually also a few extra bits to mark validity and sometimes hints. More ways and smaller lines need a larger fraction taken up by tags, so generally lines are large (32+ bytes) and the number of ways is small (4-16).
2) Yes. Some caches also do a "critical word first" fetch, where they start with the word that caused the line fill, then fetch the rest. This reduces the number of cycles the CPU is waiting for the data it actually asked for. Some caches will "write thru" and not allocate a line if you miss on a write, which avoids having to read the entire cache line first, before writing to it (this isn't always a win).
3) The tags won't store the lower 5 bits as they're not needed to match a cache line. They just index into individual lines.
Wikipedia has a pretty good, if a bit intense, write-up on caches: http://en.wikipedia.org/wiki/CPU_cache - see "Implementation". There's a diagram of how data and tags are split. Me, I think everyone should learn this stuff because you really can improve performance of code when you know what the underlying machine is actually capable of.
The cache metadata is typically not counted as a part of the cache itself. It might not even be stored in the same part of the CPU (it could be in another cache, implemented using special CPU registers, etc).
This depends on whether your CPU will fetch unaligned addresses. If it will only fetch aligned addresses, then the example you gave would be correct. If the CPU fetches unaligned addresses, then it might fetch the range 0xFFFF0008 to 0xFFFF0027.
The index bytes are still useful, even when cache access is aligned. This gives the CPU a shorthand method for referencing a byte within a cache line that it can use in its internal bookkeeping. You could get the same information by knowing the address associated with the cache line and the address associated with the byte, but that's a whole lot more information to carry around.
Different CPUs implement caching very differently. For the best answer to your question, please give some additional details about the particular CPU (type, model, etc) that you are talking about.
This is based on my vague memory, you should read books like "Computer Architecture: A Quantitative Approach" by Hennessey and Patterson. Great book.
Assuming a 32-bit CPU... (otherwise your figures would need more than 4 bytes for the address, though possibly fewer than 8, since some/most 64-bit CPUs don't use all 64 address bits).
1) I believe it's at least 4*32 bytes. Depending on the CPU, the chip architects may have decided to keep track of other info besides the full address. But it's usually not considered part of the cache.
2) Yes, but how that mapping is done differs. See Wikipedia - CPU cache - associativity. There's the simple direct-mapped cache and the more complex associative cache. You want to avoid the case where some code needs two pieces of information but the two addresses map to the exact same cache line.
