regular file system that only allows for writing only one time - filesystems

Hello I have a quick question if someone could answer it for me.
In a write only one time system how can a person implement a read system on it.

There are many ways to solve this, here is a simple overview of one.
A disc layout something like this:
0 root block -- identifies the file system, maybe a pointer to the end of metadata (n)
k first data block -- contains data, possibly stale written to media.
l last data block -- last data block written.
m last meta data block -- contains directories, inode tables, etc...
n first meta data block -- contains initial directories, inodes, etc...
When the media is first accessed, the filesystem needs to determine where these values are. A binary search of the media reveals where l and m and possible n are. There is generally no way to find them otherwise.
Looking at the metadata [m..n] in this order, more recent versions of files are encountered at lower block addresses than older versions. That is, the metadata grows from high to low addresses. If you need to update some metadata, you need only scan forward until the first instance of it; and you can copy that instance, with your modifications, to block [m-1].
Note that not all the metadata for a file needs to be replaced; if it has 5 blocks of data pointers, and you have only changed one, only that block needs replaced. Flexible tree structures help here.
The data doesn't matter so much; if you change a byte in block 3 of a file, the filesystem has to find where the pointer for block 3 is, and rewrite the tree of block pointers to reflect the new blocks. If you look at a tradition (eg v7 unix filesystem) this will be obvious. It works one more complex layouts, just harder to imagine.
If you were to execute code that did something like:
for (i = 0; i < N ; i++) {
write(fd, str+i, 1);
fsync(fd);
}
You might find your write-once media runs out of space pretty quickly; you would be committing at least a disk block per byte, and likely more like 3. 16k is a common disk block size for write only media, so that could be 48k per byte.
More frequently, the filesystem will hold of synchronizing the media until an idle period or only perform periodic snapshots from a log held in another media to prevent the pathological case above.
Many forms of write-once media aren't truly write once, rather each bit can be set or cleared once. A block can be written multiple times, but only transitions in one directed actually take effect. Also, many forms of media distinguish a never written block with a specific read error. These abilities make for a more efficient write-once filesystem, as they allow things like state flags in the metadata.
Even if a true block-based write once media were in use, this can still be followed, just a little less efficiently. For example, if an uninitialized block returned all zeros (or 0xffs), the file system could prefix each block with an identifier, reducing the efficiency of the IO, but permitting detection of uninitialized blocks.

Related

Inserting a block of bytes in the middle of a file without rewriting everything?

Since files are stored as blocks on disk, is it possible to insert a block in the middle of the chain of blocks?
This is because, without such an API, if I wanted to insert a 4kb block in the middle of a file at a certain position, using the traditional read/write apis, I would essentially have to rewrite everything in the file after that position and shfit them by 4kb.
I'm ok with an answer that only works for some OS or some file systems. It doesn't have to be cross-platform or work for every file system.
(I also understand that not all file systems or hardware use 4kb for blocks - answers that work for different numbers are also ok).
I am not sure about filesystems that would allow extending a file in the middle easily. Then again many modern filesystems do not actually have a chain of blocks. The chain of blocks was a thing on the FAT filesystem family. Instead the blocks in modern filesystems are often organized as a tree. Within a tree you can find the block containing any byte position in O(lg n) reads, with the logarithm having such a large base that it can be considered essentially constant.
While the chain would allow for the operation of "insert n blocks in between" with comparative ease, the tree unfortunately does not. This doesn't mean that the tree is the wrong structure - on the contrary, many database systems benefit greatly from the fast random access that it offers.
Note that the tree enables you to have another thing that might be useful instead of holes - Unix file systems have sparse files - any blocks of the file that are known to contain zeroes need not actually use disk space at all - instead these blocks are marked unallocated and considered containing zeroes in the tree structure itself.

C: Reading large files with limited memory

I am working on something that requires reading from and writing to a large file (or equivalent) but is allowed fairly minimal memory to do it (I don't have the exact spec, but let's call the "large" 15GB and the "minimal" 16K). The file is accessed randomly, usually in chunks of 512 Bytes and it is guaranteed that sometimes consecutive reads will be significant distance apart - possibly literally opposite ends of the disk (or a small number of MB from either end). Currently I'm using pread/pwrite to hit the locations I want in the file (I was previously using fseek, but abandoned it in favor of p(wread|write) because reasons.
Accessing the file this way is (perhaps obviously) slow, and I'm looking for ways to optimise/speed up the performance as much as possible (with as limited use (read: NO) as possible of external libraries).
I don't mean to be too cagey about exactly what we're doing, so it might help to think of it as a driver for a file system. At one end of the disk we're accessing the file and directory tables, and at the other raw data - so we need to write file information and then skiup to the data. But even within such zones don't assume anything about the layout. There is no guarantee that multiple files (or even multiple chunks of a single file) will be stored contiguously - or even close together. This also means that we can't make assumptions about the order that data will be read.
A couple of things I have considered include:
Opening Multiple File Descriptors for different parts of the file (but I'm not sure there's any state associated with the FD and whether this would even have an impact)
A few smarts around caching data that I expect to be accessed several times in a short amount of time
I was wondering whether others might have been in a similar boat and/or have opinions (or articles they can link) that discuss different strategies to minimise the impact of reading.
I guess I was always wondering whether pread is even the right choice in this situation....
Any thoughts/opinions/criticisms/etc more than welcome.
NOTE: The program will always run in a single thread (so options don't need to be thread-safe, but equally pushing the read to the background isn't an option either).

Binary Search on Large Disk File in C - Problems

This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23Gb file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it and print out corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory and so I have been trying to usemmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not previously tried all the various improvements before because I was wondering if I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitable run into problems. So I squashed the size of the hash code to the size of the leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory, and ran the incoming hash codes in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash-codes and another with the integer ids. Since the rows are the same then it will line up fine after the result is found. Also you can try an approach that puts every nth hash into another file and then stores the index.
For example, every 1000th hash key put into a new file with the index and then load that into memory. Then binary scan that instead. This will tell you the range of 1000 entries that need to be further scanned in the file. Yes that will do it fine! But probably much less than that. Like probably every 20th record or so will divide that file size down by 20 +- if I am thinking good.
In other words after scanning you only need to touch a few kilobytes of the file on disk.
Another option is to split the file and put it in memory on multiple machines. Then just binary scan each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code <- code_block[N - 1]
while(input codes remain) {
input_code <- next input code
while(input_code > max_code) {
code_block <- mmap the next N entries of the file
max_code <- code_block[N - 1]
}
binary search for input code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
Lastly, if you have extra money available you can always buy more than 23 gb of RAM, and make this a memory bound problem again (I just looked at Dell's website - you pick up a new low-end workstation with 32 GB of RAM for just under $1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
void read_hash(int line, char *hashbuf) {
lseek64(fd, ((uint64_t)line) * line_len, SEEK_SET);
read(fd, hashbuf, 40);
}
int read_int(int line) {
lseek64(fd, ((uint64_t)line) * line_len + 40, SEEK_SET);
int ret;
read(fd, &ret, sizeof(int));
return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
We don't know the back story. So it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32GB of ram doesn't seem so expensive compared to two days of work of person that makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here's is a low risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8GB or 6 chunks of 4GB (depending on the machines you have around, it needs to comfortably fit in memory). On each machine run the same software, but in memory and put an RPC stub around each. Write an RPC caller to each of your 3 or 6 workers to determine the integer associated with a given hash code.

ANSI C hash table implementation with data in one memory block

I am looking for an open source C implementation of a hash table that keeps all the data in one memory block, so it can be easily send over a network let say.
I can only find ones that allocate small pieces of memory for every key-value pair added to it.
Thank you very much in advance for all the inputs.
EDIT: It doesn't necessarily need to be a hash table, whatever key-value pair table would probably do.
The number of times you would serialize such data structure (and sending over network is serializing as well) vs the number of times you would use such data structure (in your program) is pretty low. So, most implementations focus more on the speed instead of the "maybe easier to serialize" side.
If all the data would be in one allocated memory block a lot of operations on that data structure would be a bit expensive because you would have to:
reallocate memory on add-operations
most likeley compress / vacuum on delete-operations (so that the one block you like so much is dense and has no holes)
Most network operations are buffered anyway, just iterate over the keys and send keys + values.
On a unix system I'd probably utilise a shared memory buffer (see shm_open()), or if that's not available a memory-mapped file with the MAP_SHARED flag, see the OS-specific differences though http://en.wikipedia.org/wiki/Mmap
If both shm_open and mmap aren't available you could still use a file on the disk (to some extent), you'd have to care about the proper locking, I'd send an unlock signal to the next process and maybe the seek of the updated portion of the file, then that process locks the file again, seeks to the interesting part and proceeds as usual (updates/deletes/etc.).
In any case, you could freely design the layout of the hashtable or whatever you want, like having fixed width key/seek pairs. That way you'd have the fast access to the keys of your hashtable and if necessary you seek to the data portion, then copy/delete/modify/etc.
Ideally this file should be on a ram disk, of course.
I agree completely with akira (+1). Just one more comment on data locality. Once the table gets larger, or if the satellite data is large enough, there's most certainly cache pollution which slows down any operation on the table additionally, or in other words you can rely on the level-1/2/3 cache chain to serve the key data promptly whilst putting up with a cache miss when you have to access the satellite data (e.g. for serialisation).
Libraries providing hashtables tend to hide the details and make the thing work efficiently (that is normally what programmers want when they use an hashtabe), so normally the way they handle the memory is hidden from the final programmer's eyes, and programmers shouldn't rely on the particular "memory layout", that may change in following version of the library.
Write your own function to serialize (and unserialize) the hashtable in the most convenient way for your usage. You can keep the serialized content if you need it several times (of course, when the hashtable is changed, you need to update the serialized "version" kept in memory).

Basic concepts in file system implementation

I am a unclear about file system implementation. Specifically (Operating Systems - Tannenbaum (Edition 3), Page 275) states "The first word of each block is used as a pointer to the next one. The rest of block is data".
Can anyone please explain to me the hierarchy of the division here? Like, each disk partition contains blocks, blocks contain words? and so on...
I don't have the book in front of me, but I'm suspect that quoted sentence isn't really talking about files, directories, or other file system structures. (Note that a partition isn't a file system concept, generally). I think your quoted sentence is really just pointing out something about how the data structures stored in disk blocks are chained together. It means just what it says. Each block (usually 4k, but maybe just 512B) looks very roughly like this:
+------------------+------------- . . . . --------------+
| next blk pointer | another 4k - 4 or 8 bytes of stuff |
+------------------+------------- . . . . --------------+
The stuff after the next block pointer depends on what's stored in this particular block. From just the sentence given, I can't tell how the code figures that out.
With regard to file system structures:
A disk is an array of sectors, almost always 512B in size. Internally, disks are built of platters, which are the spinning disk-shaped things covered in rust, and each platter is divided up into many concentric tracks. However, these details are entirely hidden from the operating system by the ATA or SCSI disk interface hardware.
The operating system divides the array of sectors up into partitions. Partitions are contiguous ranges of sectors, and partitions don't overlap. (In fact this is allowed on some operating systems, but it's just confusing to think about.)
So, a partition is also an array of sectors.
So far, the file system isn't really in the picture yet. Most file systems are built within a partition. The file system usually has the following concepts. (The names I'm using are those from the unix tradition, but other operating systems will have similar ideas.)
At some fixed location on the partition is the superblock. The superblock is the root of all the file system data structures, and contains enough information to point to all the other entities. (In fact, there are usually multiple superblocks scattered across the partition as a simple form of fault tolerance.)
The fundamental concept of the file system is the inode, said "eye-node". Inodes represent the various types of objects that make up the file system, the most important being plain files and directories. An inode might be it's own block, but some file system pack multiple inodes into a single block. Inodes can point to a set of data blocks that make up the actual contents of the file or directory. How the data blocks for a file is organized and indexed on disk is one of the key tasks of a file system. For a directory, the data blocks hold information about files and subdirectories contained within the directory, and for a plain file, the data blocks hold the contents of the file.
Data blocks are the bulk of the blocks on the partition. Some are allocated to various inodes (ie, to directories and files), while others are free. Another key file system task is allocating free data blocks as data is written to files, and freeing data blocks from files when they are truncated or deleted.
There are many many variations on all of these concepts, and I'm sure there are file systems where what I've said above doesn't line up with reality very well. However, with the above, you should be in a position to reason about how file systems do their job, and understand, at least a bit, the differences you run across in any specific file system.
I don't know the context of this sentence, but it appears to be describing a linked list of blocks. Generally speaking, a "block" is a small number of bytes (usually a power of two). It might be 4096 bytes, it might be 512 bytes, it depends. Hard drives are designed to retrieve data a block at a time; if you want to get the 1234567th byte, you'll have to get the entire block it's in. A "word" is much smaller and refers to a single number. It may be as low as 2 bytes (16-bit) or as high as 8 bytes (64-bit); again, it depends on the filesystem.
Of course, blocks and words isn't all there is to filesystems. Filesystems typically implement a B-tree of some sort to make lookups fast (it won't have to search the whole filesystem to find a file, just walk down the tree). In a filesystem B-tree, each node is stored in a block. Many filesystems use a variant of the B-tree called a B+-tree, which connects the leaves together with links to make traversal faster. The structure described here might be describing the leaves of a B+-tree, or it might be describing a chain of blocks used to store a single large file.
In summary, a disk is like a giant array of bytes which can be broken down into words, which are usually 2-8 bytes, and blocks, which are usually 512-4096 bytes. There are other ways to break it down, such as heads, cylinders, sectors, etc.. On top of these primitives, higher-level index structures are implemented. By understanding the constraints a filesystem developer needs to satisfy (emulate a tree of files efficiently by storing/retrieving blocks at a time), filesystem design should be quite intuitive.
Tracks >> Blocks >> Sectors >> Words >> Bytes >> Nibbles >> Bits
Tracks are concentric rings from inside to the outside of the disk platter.
Each track is divided into slices called sectors.
A block is a group of sectors (1, 2, 4, 8, 16, etc). The bigger the drive, the more sectors that a block will hold.
A word is the number of bits a CPU can handle at once (16-bit, 32-bit, 64-bit, etc), and in your example, stores the address (or perhaps offset) of the next block.
Bytes contain nibbles and bits. 1 Byte = 2 Nibbles; 1 Nibble = 4 Bits.

Resources