Two identical files, different sizes

I have been experimenting with file sizes, since I use CheckSum to avoid creating duplicates of the same file. CheckSum works absolutely fine, exactly as I would expect. The problem I face is that files with the same content have different sizes. Let me explain: say I have a docx file in which one of the words is my first name, "Szymon", and the size of this file is 436,854 bytes. I then delete "Szymon" from the document and type it again in exactly the same way. In the end I see a slight difference of 10-20 bytes between the size of the initial document (436,854 bytes) and the second one (436,875 bytes). My question is: what is the reason for this, given that both docx files contain exactly the same content?
Thanks in advance.

Related

Why can't we strip the beginning or end of a file quickly?

I was reading the answers to this question and this one. Comparing files to C arrays: if we want to remove some elements from the middle of a C array, we have to shift all the other elements, which takes a lot of time. But if we want to remove the first few items or the last few, we can just change the pointers and deallocate the removed elements, which takes almost no time and is independent of the length of the array. I was wondering why there is no such way to truncate files. I thought maybe there is some metadata at the beginning or end of a file that causes the issue, but in that case it would sit at only one of the two ends, so we should be able to quickly remove a few lines from at least the other end. But it seems that's not possible. Why is that? What am I missing here?
I need this because I have a 10 GB file from which I have to remove lines at the beginning or end, one by one, until it's empty. I'm on Ubuntu 16.04, but I would love to know if there are solutions on other OSes.

This is a very simplistic answer, but files are normally stored on disk in pages/blocks, each holding a fixed number of bytes. So in theory it would be relatively easy to remove exactly one page/block worth of data, because AFAIK all blocks are completely filled (except the last one). In practice, however, it is very unlikely that the amount you want to remove from the beginning of a file is exactly a multiple of the block size.
For the end of the file it is easier, because only the recorded file length has to change; for the beginning, the case comes up too rarely for a 'generic' way to have been implemented.
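To illustrate the end-of-file case on a POSIX system such as Ubuntu, here is a minimal sketch that shortens a file with ftruncate(). The helper name drop_tail_bytes is made up for this example; finding where the last line starts (so that whole lines are removed) is left out.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes ftruncate() under strict -std=c99/c11 */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: remove the last n bytes of the file at `path`.
 * If n exceeds the file size, the file becomes empty.
 * Returns 0 on success, -1 on error. */
int drop_tail_bytes(const char *path, off_t n)
{
    assert(path != NULL && n >= 0);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    int rc = -1;
    if (fstat(fd, &st) == 0) {
        off_t new_size = st.st_size > n ? st.st_size - n : 0;
        /* Only the recorded length changes; no data is copied. */
        rc = ftruncate(fd, new_size);
    }
    close(fd);
    return rc;
}
```

The cost does not depend on the file size, which is why repeatedly trimming a 10 GB file from the end is cheap while trimming it from the beginning normally means rewriting the rest of the file.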

How to properly handle a file when encrypting and decrypting it?

It doesn't matter exactly how I encrypt and decrypt files. I handle the file as a char array, and everything is almost fine until I get a file whose size is not a multiple of 8 bytes. I can only encrypt and decrypt a file 8 bytes per round, because of a property of the algorithm (the block size must be 64 bits).
So, for example, when I came across a .jpg I tried simply adding spaces to the end of the file, but the resulting file can't be opened (of course, with .txt files nothing bad happens).
Is there any way out here?
If you want information about the algorithm: http://en.wikipedia.org/wiki/GOST_(block_cipher).
UPD: I can't store how many bytes were added, because the initial file can be deleted or moved. And what are we supposed to do when we know only the key and have the encrypted file?

You need padding.
The best way to do this would be to use PKCS#7, which records the number of added bytes in the padding itself.
However, GOST is not a great choice; you would be better off using AES-CBC.
There is a similar ongoing discussion in "python-channel".
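As a rough illustration of PKCS#7 padding for an 8-byte block, here is a minimal sketch. The function names pkcs7_pad and pkcs7_unpad and the BLOCK_SIZE constant are chosen for this example only; the cipher itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8  /* 64-bit blocks, as in the question */

/* Pad `buf` in place. `buf` holds `len` bytes of data and must have room
 * for at least len + BLOCK_SIZE bytes. Returns the padded length. */
size_t pkcs7_pad(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    /* 1..BLOCK_SIZE bytes are always added, even if len is already aligned. */
    size_t pad = BLOCK_SIZE - (len % BLOCK_SIZE);
    memset(buf + len, (int)pad, pad);  /* every padding byte stores the pad count */
    return len + pad;
}

/* Return the original data length of a padded (already decrypted) buffer,
 * or (size_t)-1 if the padding is malformed. */
size_t pkcs7_unpad(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0 || len % BLOCK_SIZE != 0)
        return (size_t)-1;

    size_t pad = buf[len - 1];  /* last byte says how many bytes to drop */
    if (pad == 0 || pad > BLOCK_SIZE)
        return (size_t)-1;
    for (size_t i = len - pad; i < len; i++)
        if (buf[i] != pad)
            return (size_t)-1;
    return len - pad;
}
```

Because the count travels inside the encrypted data, nothing has to be stored next to the file: after decrypting with the key, the last byte tells you how much to strip, which also keeps a .jpg byte-for-byte identical to the original.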

Temporary File in C

I am writing a program which outputs a file. The file's content has two parts, but the second part is computed before the first. I was thinking of creating a temporary file and writing the data to it, then creating the permanent file, dumping the temp file's content into it, and deleting the temp file. I saw some posts saying that this does not work and that it might cause problems across different compilers or something.
The data is a bunch of chars. Every 32 chars have to appear on a separate line. I could store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?

A temporary file can be created. Although some people say they have problems with this, I personally have used them with no issues. Using the platform's functions to obtain a temporary file is the best option. Don't assume you can write to c:\ etc. on Windows, as this isn't always possible. Don't assume a filename, in case a file with that name already exists. Not using temporary files correctly is what causes people problems, rather than temporary files being bad.
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank, to come back and fill in later? This would eliminate the need for the temporary file.
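One possible shape for the temporary-file approach, using only the standard C function tmpfile(), which picks the name and location itself and removes the file when it is closed. The helper names append_stream and write_two_parts are invented for this sketch, and the two parts are passed as strings only to keep it short.

```c
#include <assert.h>
#include <stdio.h>

/* Copy everything in `src`, from its start, to the current position of `dst`.
 * Returns 0 on success, -1 on I/O error. */
int append_stream(FILE *src, FILE *dst)
{
    assert(src != NULL && dst != NULL);
    char buf[4096];
    size_t n;

    rewind(src);  /* also switches the update stream from writing to reading */
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        if (fwrite(buf, 1, n, dst) != n)
            return -1;
    return ferror(src) ? -1 : 0;
}

/* Hypothetical driver: part two is produced first and parked in a temporary
 * file, then part one is written and part two is appended after it. */
int write_two_parts(const char *out_path, const char *part1, const char *part2)
{
    assert(out_path != NULL && part1 != NULL && part2 != NULL);

    FILE *tmp = tmpfile();  /* no filename or directory is assumed */
    if (tmp == NULL)
        return -1;

    FILE *out = NULL;
    int rc = -1;
    if (fputs(part2, tmp) != EOF &&
        (out = fopen(out_path, "wb")) != NULL &&
        fputs(part1, out) != EOF)
        rc = append_stream(tmp, out);

    if (out != NULL && fclose(out) != 0)
        rc = -1;
    fclose(tmp);  /* the temporary file is removed automatically */
    return rc;
}
```

For example, write_two_parts("result.txt", "header\n", "body\n") should leave a file containing the header followed by the body, even though the body was written first.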

Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.

ARM binary and hexedit

I have an ARM binary file and want to change some text in it.
I removed a couple of text symbols from a comment.
But the binary won't start, and the log says:
link_image[1710]: 3013 missing essential tables CANNOT LINK EXECUTABLE
Does anybody have an idea how to edit ARM binary files?

I removed a couple of text symbols
Stop right there. If I am reading what you wrote correctly, you removed some characters, instead of replacing them with other characters.
This would shift the whole rest of the file. But binary files often have tables or offsets which point to other parts of the file. Shifting the contents of the file, even by a single byte, means these tables or offsets no longer point where they should. The code trying to read the file was rightly confused after that.
When editing binary files, you must never move the contents, unless you know what you are doing. If you are editing the text, your changes must not change the size of the text. If the new text is smaller, you must pad it so it keeps the same size; if the new text is larger, it will not fit and you must find a shorter text.
Of course, this assumes that the file format does not have checksums which would notice the change, or that you know how to recompute them.
Also, make sure you are using a proper editor. Normal text editors can silently add, remove, or replace characters, which could break the file, possibly in a hard-to-detect way.
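The "same size, pad if shorter" rule can be sketched as a small C helper working on the file contents loaded into memory. The name patch_same_size and its parameters are made up for illustration; which padding byte is safe (space, NUL, ...) depends on how the program uses the text, and checksums are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrite the old_len bytes at `offset` in `image` with `new_text`,
 * padding the remainder with `pad_byte` so nothing after it moves.
 * Returns 0 on success, -1 if the region is out of bounds or the
 * new text is longer than the old one. */
int patch_same_size(unsigned char *image, size_t image_len,
                    size_t offset, size_t old_len,
                    const char *new_text, unsigned char pad_byte)
{
    assert(image != NULL && new_text != NULL);
    size_t new_len = strlen(new_text);

    if (offset > image_len || old_len > image_len - offset)
        return -1;  /* region does not lie inside the image */
    if (new_len > old_len)
        return -1;  /* would not fit: pick a shorter text */

    memcpy(image + offset, new_text, new_len);
    memset(image + offset + new_len, pad_byte, old_len - new_len);
    return 0;
}
```

The image length never changes, so any tables or offsets elsewhere in the file still point at the right places.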

How to represent a random-access text file in memory (C)

I'm working on a project in which I need to read a text (source) file into memory and be able to perform random access into it (for instance, retrieve the address corresponding to line 3, column 15).
I would like to know if there is an established way to do this, or data structures that are particularly good for the job. I need to be able to perform (probably amortized) constant-time access. I'm working in C, but am willing to implement higher-level data structures if it is worth it.
My first idea was to go with a linked list of large buffers that hold the character data of the file. I would also make an array whose indices are line numbers and whose contents are the addresses of the beginning of each line. This array would be reallocated as needed.
Subsidiary question: does anyone have an idea of the average size of a source file? I was surprised not to find this on Google.
To clarify:
The files I'm concerned with are source files, so their size should be manageable, they should not be modified, and the lines have variable length (though hopefully capped at some maximum).
The problem I'm working on mostly needs a read-only file representation, but I'm very interested in digging into the problem.
Conclusion:
There is a very interesting discussion of the data structures used to maintain a file (with read/insert/delete support) in the paper Data Structures for Text Sequences.
If you just need read-only access, get the file size, read the file into memory with fread(), and then maintain a dynamic array which maps line numbers (indices) to pointers to the first character of each line. Someone below suggested building this array lazily, which seems a good idea in many cases.

I'm not quite sure what the question is here, but there seems to be a bit of both "how do I keep the file in memory" and "how do I index it". Since you need random access to the file's contents, you're probably well advised to memory-map the file, unless you're tight on address space.
I don't think you'll be able to avoid a linear pass through the file once to find the line endings. As you said, you can create an index of the pointers to the beginning of each line. If you're not sure how much of the index you'll need, create it lazily (on demand). You can also store this index to disk (as offsets, not pointers) if you will need it on subsequent runs. You can estimate the size of the index based on the file size and the expected line length.

1) Read (or mmap) the entire file into one chunk of memory.
2) In a second pass create an array of pointers or offsets pointing to the beginnings of the lines (hint: one after the '\n' ) into that memory.
Now you can index the array to access a specific line.
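The two steps above could look roughly like this, assuming the file is already in memory as one buffer. The LineIndex type and function names are invented for the sketch, and lines and columns are 0-based here (so "line 3, column 15" in the question would be line 2, col 14).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t *starts;  /* starts[i] = offset of the first char of line i */
    size_t count;    /* number of lines */
} LineIndex;

/* One linear pass over the buffer; returns 0 on success, -1 if out of memory.
 * The caller releases idx->starts with free(). */
int line_index_build(const char *text, size_t len, LineIndex *idx)
{
    assert(text != NULL && idx != NULL);
    size_t cap = 16, n = 0;
    size_t *starts = malloc(cap * sizeof *starts);
    if (starts == NULL)
        return -1;

    starts[n++] = 0;  /* line 0 starts at offset 0 */
    for (size_t i = 0; i < len; i++) {
        /* A new line begins one past each '\n', unless that is the end. */
        if (text[i] == '\n' && i + 1 < len) {
            if (n == cap) {
                size_t *grown = realloc(starts, 2 * cap * sizeof *starts);
                if (grown == NULL) { free(starts); return -1; }
                starts = grown;
                cap *= 2;
            }
            starts[n++] = i + 1;
        }
    }
    idx->starts = starts;
    idx->count = n;
    return 0;
}

/* O(1) lookup: address of (line, col), or NULL if out of range. */
const char *line_index_at(const char *text, size_t len, const LineIndex *idx,
                          size_t line, size_t col)
{
    assert(text != NULL && idx != NULL);
    if (line >= idx->count)
        return NULL;
    size_t start = idx->starts[line];
    size_t end = (line + 1 < idx->count) ? idx->starts[line + 1] : len;
    return col < end - start ? text + start + col : NULL;
}
```

Storing offsets rather than pointers keeps the index valid if the buffer is remapped, and lets it be saved to disk for later runs.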

It's impossible to make insertion, deletion, and reading at a particular line/column/character address all simultaneously O(1). The best you can get is simultaneous O(log n) for all of these operations, and it can be achieved using various sorts of balanced binary trees for storing the file in memory.
Of course, unless your files will be larger than 100 kB or so, you're probably best off not bothering with anything fancy and just using a flat linear buffer...

Solution: If the lines are about the same size, make all lines equally long by appending the needed number of padding characters to each line. Then you can simply calculate the fseek() position from the line number, making your search O(1).
If the lines are sorted, then you can perform a binary search, making your search O(log(nLines)).
If neither, you can store the indexes of the line beginnings. But then you have a problem if you modify the file a lot, because if you insert, let's say, X characters somewhere, you have to work out which line it is in and then add X to the indexes of all following lines. Similarly with deletion. You essentially get O(nLines), and the code gets ugly.
If you want to store the whole file in memory, just create an array of lines, char *lines[]. You then get a line with the first dereference and a character with the second.
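For the first suggestion in this answer (equal-length lines plus fseek()), a minimal sketch might look like this. The function name read_fixed_line is made up, lines are numbered from 0, and the file is assumed to be opened in binary mode so that byte offsets are exact.

```c
#include <assert.h>
#include <stdio.h>

/* Read line number `line` from a file in which every line was padded to
 * exactly rec_len bytes, newline included. `out` must hold rec_len + 1
 * bytes. Returns 0 on success, -1 if the line does not exist or on error. */
int read_fixed_line(FILE *f, size_t rec_len, long line, char *out)
{
    assert(f != NULL && out != NULL && rec_len > 0 && line >= 0);

    /* The position follows directly from the line number: no scanning. */
    if (fseek(f, line * (long)rec_len, SEEK_SET) != 0)
        return -1;
    if (fread(out, 1, rec_len, f) != rec_len)
        return -1;
    out[rec_len] = '\0';
    return 0;
}
```

The price is wasted space for the padding, and the caller has to strip the padding characters again after reading.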

As an alternate suggestion (although I do not fully understand the question), you might want to consider a struct based, dynamically linked list of dynamic strings. If you want to be astutely clever, you could build a dynamically linked list of chars which you then export as strings.
You'd have to use OO type design for this to be manageable.
So structs you'd likely want to build are:
DynamicArray;
DynamicListOfArrays;
CharList;
So it goes:
CharList(Gets Chars/Size) -> (SetSize)DynamicArray -> (AddArray)DynamicListOfArrays
If you build suitable helper functions for malloc and delete, and make it so the structs can delete themselves either automatically or manually, the above combination won't get you O(1) read-in (which isn't possible unless the files have a static format), but it will get you good time.
If you know the file's static length (at least per line), i.e. no more than 256 chars per line, then all you need is the DynamicListOfArrays: write directly to the array (preset to 256), create a new one, repeat. The downside is that it wastes memory.
Note: You'd have to convert the DynamicListOfArrays into a 'static' ArrayOfArrays before you could get direct point-to-point access.
If you need source code to give you an idea (although mine is built towards C++ it wouldn't take long to rewrite), leave a comment about it. As with any other code I offer on stackoverflow, it can be used for any purpose, even commercially.

Average size of a source file? Does such a thing exist? A source file could go from 0 bytes to thousands of bytes, like any text file; it depends on the number of characters it contains.
