I have a binary file which contains data in 128-byte blocks that are spread across the file. Each block starts with a char array of length 8.
How do I reorganize the data in this binary file such that all 128-byte blocks are ordered sequentially and that there is no unused space between these blocks?
Unused/unallocated space is just represented by 0 in this file and strings are null terminated.
I'm quite lost.
Your question is not a complete specification, e.g.:
- what is unused space?
- do the char arrays contain a string terminator or not?
- is the file small enough to read into memory, or is it large?
So I can't write the code for you. However, you can program it easily if you follow these steps:
1. Build a list of pointers to the used 128-byte blocks.
2. Sort that list.
3. Overwrite the file by looping through the sorted list of pointers and writing each block back sequentially.
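For illustration, a compaction pass over a file small enough to read into memory might look like this (a minimal C sketch; it assumes a block is "used" when the first byte of its 8-char name field is non-zero, takes "sorted" to mean ordered by that name field, and data.bin is a hypothetical file name):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE 128

    /* order blocks by their leading 8-char, null-terminated name field */
    static int cmp_blocks(const void *a, const void *b)
    {
        return strncmp((const char *)a, (const char *)b, 8);
    }

    int main(void)
    {
        FILE *f = fopen("data.bin", "r+b");
        if (!f) { perror("fopen"); return 1; }

        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);

        char *buf = malloc(size);
        if (!buf || fread(buf, 1, size, f) != (size_t)size) { fclose(f); return 1; }

        /* compact: slide each used block (first name byte non-zero) downward */
        long used = 0;
        for (long off = 0; off + BLOCK_SIZE <= size; off += BLOCK_SIZE) {
            if (buf[off] != 0) {
                memmove(buf + used, buf + off, BLOCK_SIZE);
                used += BLOCK_SIZE;
            }
        }

        qsort(buf, used / BLOCK_SIZE, BLOCK_SIZE, cmp_blocks); /* step 2 */
        memset(buf + used, 0, size - used);   /* zero out the freed tail */

        rewind(f);                            /* step 3: overwrite in order */
        fwrite(buf, 1, size, f);
        fclose(f);
        free(buf);
        return 0;
    }

For a file too large for memory, the same idea works with an in-memory index of (name, offset) pairs and two passes over the file.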
Example problem:
Given:
2KB block size
Inode consists of:
5 direct pointers
3 indirect pointers
2 doubly indirect pointers
1 triply indirect pointer
Assume a pointer is 8B.
What is the maximum size of a file that an inode can point to in this system? How many distinct blocks must be read to read the entirety of a 128MB file?
I've been seeing inconsistent answers online and am not sure who to trust on this.
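For what it's worth, here is the arithmetic worked out in a small C program. One source of the inconsistency you're seeing is convention: this version counts pointer blocks as reads but not the inode itself, and assumes the 128MB file is laid out through the direct pointers first:

    #include <stdio.h>

    int main(void)
    {
        const long long BLOCK = 2048;           /* 2 KB block size */
        const long long PTRS  = BLOCK / 8;      /* 8 B pointers -> 256 per block */

        /* data blocks reachable from the inode */
        long long direct = 5;
        long long ind    = 3 * PTRS;                /* 768 */
        long long dbl    = 2 * PTRS * PTRS;         /* 131,072 */
        long long tpl    = 1 * PTRS * PTRS * PTRS;  /* 16,777,216 */
        long long max_blocks = direct + ind + dbl + tpl;

        printf("max file size: %lld bytes (~%.2f GiB)\n",
               max_blocks * BLOCK,
               (double)(max_blocks * BLOCK) / (1024.0 * 1024 * 1024));

        /* distinct blocks read for a 128 MB file */
        long long data = 128LL * 1024 * 1024 / BLOCK;  /* 65,536 data blocks */
        long long rem  = data - direct - ind;          /* 64,763 need dbl-indirect */
        long long singles = (rem + PTRS - 1) / PTRS;   /* 253 single-indirect blocks */
        long long reads = data + 3 + 1 + singles;      /* + 3 indirect, 1 dbl-indirect */
        printf("blocks read for a 128 MB file: %lld\n", reads);
        return 0;
    }

This prints a maximum file size of 34,629,756,928 bytes (about 32.25 GiB) and 65,793 block reads; answers that also count reading the inode, or that skip the pointer blocks, will differ by small constants.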
Prepending to a large file is difficult, since it requires pushing all other characters forward. However, could it be done by manipulating the inode as follows?
Allocate a new block on disk and fill with your prepend data.
Tweak the inode to tell it your new block is now the first block, and to bump the former first block to the second position, the former second block to the third position, and so on.
I realize this still requires bumping blocks forward, but it should be more efficient than having to use a temp file.
I also realize the new first block will be a "short" block (not all the data in the block is part of the file), since your prepend data is unlikely to be exactly the same size as a block.
Or, if inode blocks are simply linked, it would require very little work to do the above.
NOTE: my last experience directly manipulating disk data was with a Commodore 1541, so my knowledge may be a bit out of date...
Modern-day operating systems should not allow a user to do that, as inode data structures are specific to the underlying file system.
If your file system/operating system supports it, you could make your file a sparse file by prepending empty data at the beginning, and then writing to the sparse blocks. In theory, this should give you what you want.
YMMV, I'm just throwing around ideas. ;)
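Incidentally, on Linux the sparse-prepend idea has a real counterpart: fallocate() with FALLOC_FL_INSERT_RANGE inserts a hole at the start of a file by shifting the extent mapping rather than copying data. It only works on some file systems (e.g. ext4, XFS), and the offset and length must be multiples of the file system block size. A minimal sketch (bigfile and the 4096-byte block size are assumptions):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* insert a 4096-byte hole at offset 0; existing data shifts
           forward by remapping blocks, not by copying bytes */
        if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, 4096) < 0) {
            perror("fallocate");    /* EOPNOTSUPP if the fs can't do it */
            close(fd);
            return 1;
        }

        /* write the prepend data into the hole; any unwritten remainder
           of the hole reads back as zeros (the "short block" problem) */
        if (pwrite(fd, "new header\n", 11, 0) < 0)
            perror("pwrite");

        close(fd);
        return 0;
    }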
This could work! Yes, userland programs should not be mucking around with inodes. Yes, it necessarily depends on whatever block-tracking scheme is used by the file systems that implement this function. None of this is a reason to reject the proposal out of hand.
Here is how it could work.
For the sake of illustration, suppose we have an inode that tracks blocks by an array of direct pointers to data blocks. Further suppose that the inode carries a starting-offset and an ending-offset that apply to the first and last blocks respectively, so you can have less-than-full blocks at both the beginning and end of a file.
Now, suppose you want to prepend data. It would go something like this.
IF (new data will fit into unused space in first data block)
    write the new data to the beginning of the first data block
    update the starting-offset
    return success indication to caller
try to allocate a new data block
IF (block allocation failed)
    return failure indication to caller
shift all existing data block pointers down by one
write the ID of the newly-allocated data block into the first slot of the array
write as much of the data as will fit into the second block (the old first block)
write the rest of the data into the newly-allocated data block, shifted to the end
starting-offset := (data block size - length of data in the first block)
return success indication to caller
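Translated into a minimal in-memory C sketch, using the made-up inode layout above (a fixed array of direct block pointers plus a starting-offset; prepends longer than one block are rejected for brevity):

    #include <stdlib.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define MAX_BLOCKS 12

    struct inode {                  /* hypothetical inode layout */
        int    nblocks;
        char  *blocks[MAX_BLOCKS];  /* direct pointers to data blocks */
        size_t start;               /* offset of first valid byte in blocks[0] */
    };

    /* Prepend up to BLOCK_SIZE bytes; returns 0 on success, -1 on failure. */
    int prepend(struct inode *ino, const char *data, size_t len)
    {
        if (len <= ino->start) {    /* fits in the first block's unused space */
            ino->start -= len;
            memcpy(ino->blocks[0] + ino->start, data, len);
            return 0;
        }
        if (len > BLOCK_SIZE || ino->nblocks == MAX_BLOCKS)
            return -1;

        char *nb = malloc(BLOCK_SIZE);   /* try to allocate a new block */
        if (!nb)
            return -1;

        /* shift all existing block pointers down by one slot */
        memmove(&ino->blocks[1], &ino->blocks[0],
                ino->nblocks * sizeof ino->blocks[0]);
        ino->blocks[0] = nb;
        ino->nblocks++;

        /* the tail of the new data fills the old first block's gap ... */
        size_t tail = ino->start;
        memcpy(ino->blocks[1], data + (len - tail), tail);

        /* ... and the head goes into the new block, shifted to the end */
        size_t head = len - tail;
        memcpy(nb + (BLOCK_SIZE - head), data, head);
        ino->start = BLOCK_SIZE - head;
        return 0;
    }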
When a file is written, the write function corresponding to .write in fuse_operations is called for each file segment.
This means that for a larger file (e.g. 12720 bytes), the write function could be called 4 times with
1. size=4096, offset=0
2. size=4096, offset=4096
3. size=4096, offset=8192
4. size=432, offset=12288
because it has 4 segments with max segment size of 4096 bytes.
Inside the write function, I'd like to determine when the last segment is being written. I intend to put all the segments into a buffer, and to use the last written segment as the signal that the buffer now contains the entire object, so that it can be put somewhere else (such as an object store). If I knew the size of the object before it's written, I could do a simple equality test, file_size == size + offset, to determine when the last segment is being written.
Apparently, I can't. I can only put the entire object somewhere else (such as an object store) after the file handler is closed.
If the length of your chunk is less than 4096, you know that you're at the end of the file, so go ahead and write out the contents of your buffer! (Caveat: a file whose size is an exact multiple of 4096 ends with a full-sized chunk, so this test would miss it.)
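For the case where you do have to wait for the close, a minimal libfuse sketch of the buffer-and-flush-on-release pattern might look like this (a single global buffer for brevity; real code would hang one buffer per open handle off fi->fh, and store_object() is a hypothetical hand-off to your object store):

    #define FUSE_USE_VERSION 31
    #include <fuse.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>

    static char  *my_buf = NULL;
    static size_t my_len = 0;

    static void store_object(const char *buf, size_t len)
    {
        (void)buf; (void)len;   /* hypothetical hand-off to an object store */
    }

    static int my_write(const char *path, const char *buf, size_t size,
                        off_t offset, struct fuse_file_info *fi)
    {
        (void)path; (void)fi;
        /* grow the buffer to cover this segment and copy it in */
        if ((size_t)offset + size > my_len) {
            char *p = realloc(my_buf, offset + size);
            if (!p)
                return -ENOMEM;
            my_buf = p;
            my_len = offset + size;
        }
        memcpy(my_buf + offset, buf, size);
        return size;            /* report the whole segment as written */
    }

    static int my_release(const char *path, struct fuse_file_info *fi)
    {
        (void)path; (void)fi;
        /* the handle is closed: the buffer holds the complete object */
        store_object(my_buf, my_len);
        free(my_buf);
        my_buf = NULL;
        my_len = 0;
        return 0;
    }

    static const struct fuse_operations my_ops = {
        .write   = my_write,
        .release = my_release,
    };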
If I am trying to scan a string of unknown length, would it be a good approach to read the input one char at a time and build a linked list of chars to create the string? The only problem I am currently facing is that I'm not sure how to handle the input one char at a time without asking the user to enter the string one char at a time, which would be unreasonable. Is there a better approach? I would like to avoid mallocing an arbitrarily large char array just to accommodate most strings.
In my opinion, a linked list of chars is a very bad idea, as it would consume far too much memory for a single string.
Instead, allocate a nominally sized buffer (say 128 bytes) and keep reading characters into it. When the buffer is almost full, allocate another buffer of double the current size, copy the contents of the first buffer into it, and free the first buffer. Continue like this until the string has been read completely.
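A sketch of that doubling approach in C, reading one line of unknown length from a stream:

    #include <stdio.h>
    #include <stdlib.h>

    /* Read one line of unknown length; the caller frees the result. */
    char *read_line(FILE *stream)
    {
        size_t cap = 128, len = 0;      /* nominal starting size */
        char *buf = malloc(cap);
        if (!buf)
            return NULL;

        int c;
        while ((c = fgetc(stream)) != EOF && c != '\n') {
            if (len + 1 == cap) {       /* almost full: double the buffer */
                char *tmp = realloc(buf, cap * 2);
                if (!tmp) { free(buf); return NULL; }
                buf = tmp;
                cap *= 2;
            }
            buf[len++] = (char)c;
        }
        buf[len] = '\0';
        return buf;
    }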
Also, in most of the programs I have written or seen, an upper limit for string size is maintained, and if the input appears to exceed it, the program returns an error. The upper limit is determined by the application context. For example, if the string you are reading is a name, it generally will not exceed some fixed length (say 32 characters); if it does, the name is truncated to fit the buffer. This way, the buffer can be allocated at the upper-limit size from the start.
This is just one idea. There are many other ways to address this besides a linked list.
Ignoring the overkill memory usage of a node-per-char linked list for a moment, suppose you actually built it and got your string into it. Can you actually work with it?
For example:
The non-contiguous buffer means many of the standard functions (e.g. printf(), strlen(), fwrite()) would simply be incompatible or be a pain to work with.
Non-sequential access to the string would be extremely inefficient.
As for a better approach: it really depends on what you're going to do with the string. (For example, if you can process the string as it comes in, maybe you don't even need to hold the entire thing in memory.)
Store it in an array. Start with a fixed-size array and store the input characters in it as you read them. When the array is full and new input arrives, create a larger array of double the size, copy the old array into the new one, and keep adding new chars there. Repeat until you have read all the data. You can also spread out the cost of copying chars from the old array to the new one with the following approach (see the sketch after the list):
1. Initialize a variable old_idx to 0.
2. When a new char arrives (after the old array is full), create a new array of double the size of the old one and write the new char at index old_size (the first free slot). Also copy the element at index old_idx of the old array to index old_idx of the new array.
3. Increment old_idx.
At the end, if old_idx < old_array_size, copy the rest of the old data.
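Here is a sketch of that incremental scheme in C (gb_append() is a hypothetical helper on a growable char buffer; each call migrates at most one old element, so no single append ever copies the whole array):

    #include <stdlib.h>

    struct growbuf {
        char  *old_arr, *new_arr;
        size_t old_size;   /* elements owned by old_arr */
        size_t new_size;   /* capacity of new_arr */
        size_t len;        /* chars stored so far */
        size_t old_idx;    /* next old element to migrate */
    };

    /* Append one char, migrating at most one old element per call.
       Initialize with: struct growbuf g = {0}; */
    int gb_append(struct growbuf *g, char c)
    {
        if (g->len == g->new_size) {          /* full: start a new array */
            /* finish any pending migration so new_arr is whole; with
               pure doubling this loop body never actually runs */
            while (g->old_idx < g->old_size) {
                g->new_arr[g->old_idx] = g->old_arr[g->old_idx];
                g->old_idx++;
            }
            size_t ns = g->new_size ? g->new_size * 2 : 16;
            char *na = malloc(ns);
            if (!na)
                return -1;
            free(g->old_arr);
            g->old_arr  = g->new_arr;  /* current array becomes the old one */
            g->old_size = g->len;
            g->old_idx  = 0;
            g->new_arr  = na;
            g->new_size = ns;
        }
        g->new_arr[g->len++] = c;             /* new char at the first free slot */
        if (g->old_idx < g->old_size) {
            /* migrate one old element per insertion */
            g->new_arr[g->old_idx] = g->old_arr[g->old_idx];
            g->old_idx++;
        }
        return 0;
    }

Because the array doubles, there are exactly old_size insertions between resizes, so the migration finishes just as the next resize is due. Note that while a migration is in flight, the chars at indices old_idx..old_size-1 live only in the old array, so finish the migration before handing the buffer to code expecting one contiguous string.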
The amortized cost of the whole process is quite low. This is essentially how ArrayList in Java works too, except that ArrayList copies the whole array at once when it grows.
The advantages of an array over a linked list are obvious:
1. Smaller memory footprint.
2. Faster linear access (an array's data is laid out contiguously in memory).
As a school assignment I'm tasked with writing a program that opens any text file and performs a number of operations on the text. The text must be loaded using a linked list, meaning a chain of structs each containing a char pointer and a pointer to the next struct. One line per struct.
But I'm having problems actually loading the file. It seems the memory required to hold the text must be allocated before I actually read it, hence I have to open the file several times: once to count the number of lines, then twice per line (once to count the characters in the line, and once to read them). It seems absurd to open a file hundreds of times just to read it into memory.
Obviously there are better ways of doing this, I just don't know them :-)
Examples
Can the point from which fgetc fetches a character be moved without re-opening the file?
Can the number of lines or characters in a file be checked before it is "opened"?
Can I somehow read a line or string from a file and save it to memory without allocating a fixed amount of bytes?
etc
There is no need to open the file more than once, nor to pass through it more than once.
Look at the POSIX getline() function. It reads lines into allocated space. You can use it to read the lines, and then copy the results for your linked list.
There is no need with a linked list to know how many lines there are ahead of time; that's an advantage of lists.
So, the code can be done with a single pass. Even if you can't use getline(), you can use fgets() and monitor whether it reads to end of line each time, and if it doesn't you can allocate (and reallocate) space to hold the line as needed (malloc(), realloc() and eventually free() from <stdlib.h>).
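For example, a single pass with getline(), where each list node simply adopts the buffer its line was read into:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>

    struct node {
        char        *line;
        struct node *next;
    };

    /* Read every line of fp into a linked list in one pass. */
    struct node *load_file(FILE *fp)
    {
        struct node *head = NULL, **tail = &head;
        char  *line = NULL;
        size_t cap  = 0;

        while (getline(&line, &cap, fp) != -1) {
            struct node *n = malloc(sizeof *n);
            if (!n)
                break;
            n->line = line;   /* the node adopts the allocated buffer */
            n->next = NULL;
            *tail = n;
            tail  = &n->next;
            line = NULL;      /* make getline() allocate a fresh buffer */
            cap  = 0;
        }
        free(line);           /* buffer left over from the final read */
        return head;
    }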
Your specific questions are largely immaterial if you adopt anything like the approach I suggest, but:
Using fseek() (or, in extremis, rewind()) will move the read pointer (for fgetc() and all other functions), unless the 'file' does not support seeking (e.g., a pipe provided as standard input).
The number of characters (bytes) can be determined with stat() or fstat() or variants. The number of lines cannot be determined except by reading the file.
Since the file could be from zero bytes to gigabytes in size, there isn't a sensible way of doing fixed size allocations. You are pretty much forced into dynamic memory allocation with malloc() et al. (Behind the scenes, getline() uses malloc() and realloc().)
You cannot count the number of lines in a file without actually traversing it. You could get the total file size, but that's not what's intended here. The idea of using a linked list of lines is that you operate on the file one line at a time; you do not need to read anything in advance. While you haven't read the whole file: read a line, add it to its own node at the end of the linked list, and move on to the next line.
Regarding your first question: you can change the position in the file you are reading from with the fseek() function.
There are several ways you could do this. For example, you could have a fixed-size buffer, fill it with bytes from the file, copy lines from the buffer to the list, fill the buffer again and so on.