Why is the first byte shown like that in the Beyond Compare tool?

I have two binary files that are supposed to be identical, but they are not, so I use binary diff tools to look at them. However, two different tools, Beyond Compare and UltraCompare, give me different results for one file at the first byte.
I used the HxD hex editor to verify the content, and HxD seems to agree with UltraCompare.
Can anybody tell me what that means in Beyond Compare? Does this mean Beyond Compare is not reliable in some cases?

In Beyond Compare, spaces with the cross-hatched ▨ background indicate missing (added or deleted) bytes. In your image, the file on the left starts with a 0x00 byte that the one on the right doesn't have. BC shows a gap in the file content to make the rest of the bytes line up visually. That's also indicated by the hex addresses shown as "line numbers" being different on the two sides, and it's the reason the rest of the file shows as black (exact matches). Gaps don't have any effect on the content of the files; they're just a way of presenting the alignment more clearly.
UltraCompare apparently isn't adjusting the alignment in this case, so every 0xC8 byte is lined up with a 0x00 one and vice versa, which is why the entire comparison is shown as a difference (red).
HxD is just showing a single file, not a comparison, so it doesn't need to use gaps to show the alignment. Whether UltraCompare is better or not depends on what you want the comparison to do. It is just comparing byte 1 to byte 1, byte 2 to byte 2, etc., while BC is aligning the files taking adds and deletes into account. In this case, it's showing that byte 1 on the left was added, so it doesn't match anything on the right, that byte 2 on the left is the same as byte 1 on the right, that byte 3 on the left matches byte 2 on the right, etc.
If the binary data can have inserts and deletes (e.g., if it contains textual strings or variable length headers), then BC's approach is better because it avoids showing the entire file as different if one side just has an added byte (as in this case).
If the binary data is fixed size, for example a bitmap, then what UltraCompare is doing is better, since it doesn't adjust offsets to line things up. Since your filenames are labeled "pixelData", I assume that's the behavior you would prefer. In that case, in Beyond Compare you can change it by using the Session menu's Session Settings... command and switching the "Comparison" alignment setting from "Complete" to "None".

Related

Suggestions on how to make my compressor faster

I have some data that I'm compressing with a custom compressor. The compressed output is fine, but the compressor takes ages, and I'm seeking advice on how to make it faster. Let me give you all the details.
The input data is an array of bytes, at most 2^16 of them. Since the bytes in the array NEVER take values between 0x08 and 0x37 (inclusive), I decided to exploit that for a simple LZ-like compression scheme. It works by replacing any sequence of 4 to 51 bytes that has already appeared at a "lower address" (i.e. closer to the array's beginning) with a single byte in the 0x08 to 0x37 range (which encodes the length), followed by two bytes giving the low and high byte of the index where the earlier sequence begins. That gives the decompressor the length and address of the original data, so it can rebuild the original array.
The compressor works this way: for every sequence length from 51 down to 4 bytes (I test longer sequences first), starting at each index from left to right, I check whether there is a match 'to the left', meaning at an index lower than the position I'm checking. If there is more than one match, I choose the one that saves the most, which means the longest match starting at the leftmost position.
The results are just perfect... but of course this is overkill: it's 4 nested 'for' loops with a memcmp() inside, and it takes minutes on a modern workstation to compress some 20 KB worth of data. That's why I'm seeking help.
Code is accessible here, if you need to sneak a peek. The 'job' starts at line 44.
Of course I can give you any detail you need; there's nothing secret here. (BTW, just in case: I'm not going to change the compression scheme, as this one works exactly as I need it!)
Thank you in advance.
A really obvious one is that you don't have to loop over the lengths; just find out what the longest match at that position is. That's not a "search": just keep extending the match by 1 for every matching pair of characters. When it stops, you have the longest match at that position (naturally you can force it to stop at 51 too, so it doesn't overrun).
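To make that concrete, here is a minimal sketch of such a match extender in C (longest_match_at is an illustrative name, not something from the posted code):
#include <stddef.h>

/* Length of the longest common run between data[pos..] and data[cand..],
   capped at max_len (51 in this scheme) and never reading past data_len.
   cand is assumed to be lower than pos. */
static size_t longest_match_at(const unsigned char *data, size_t data_len,
                               size_t cand, size_t pos, size_t max_len)
{
    size_t len = 0;
    while (len < max_len && pos + len < data_len &&
           data[cand + len] == data[pos + len])
        len++;
    return len;
}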
Another typical trick is keeping a hash map that maps keys of 3 or 4 characters to a list of offsets where they can be found. That way you only need to try positions that have some hope of resulting in a match. This is also described in the DEFLATE RFC, all the way at the bottom.
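A rough sketch of such a table in C, using a DEFLATE-style head/prev chain (the hash function, table size, and names such as insert_position are arbitrary choices for illustration, not taken from the question):
#include <stdint.h>
#include <string.h>

#define HASH_BITS 13
#define HASH_SIZE (1u << HASH_BITS)

/* head[h] is the most recent position whose 3-byte key hashes to h (or -1);
   prev_pos[p] is the previous position with the same hash as position p (or -1). */
static int32_t head[HASH_SIZE];
static int32_t prev_pos[1u << 16];

static unsigned hash3(const unsigned char *p)
{
    return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & (HASH_SIZE - 1);
}

static void init_chains(void)
{
    memset(head, 0xFF, sizeof head);   /* set every entry to -1 */
}

/* Call once per input position, left to right, before searching from it. */
static void insert_position(const unsigned char *data, uint32_t pos)
{
    unsigned h = hash3(data + pos);
    prev_pos[pos] = head[h];
    head[h] = (int32_t)pos;
}

/* Candidate match positions for 'pos' are then head[hash3(data + pos)],
   prev_pos[that], prev_pos[prev_pos[that]], ... until -1 is reached;
   only those positions need to be checked with longest_match_at(). */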

Fseek to a line number (with lines of variable length)

I have a large file (~10GB) with variable length lines, and I would like to programmatically go to different line numbers. Is there an efficient way to do so?
Yes: build an index. For example, just once, you can create a text file on the side that contains the byte offsets of various line numbers, like this:
line,offset
0,0
10000,48272
20000,93726
Etc. Then when you want to go to line 13043, just jump to offset 48272 and skip another 3043 newlines. Simple and efficient.
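A rough C sketch of that lookup, assuming the index has already been loaded into two parallel arrays (seek_to_line, index_lines, and index_offsets are illustrative names):
#include <stdio.h>

/* Seek 'fp' to the start of 'target' (0-based line number) using a sparse
   index of (line number, byte offset) pairs sorted by line number. */
static int seek_to_line(FILE *fp, long target,
                        const long *index_lines, const long *index_offsets,
                        size_t index_len)
{
    /* Find the last indexed line <= target. */
    size_t i = 0;
    while (i + 1 < index_len && index_lines[i + 1] <= target)
        i++;

    if (fseek(fp, index_offsets[i], SEEK_SET) != 0)
        return -1;

    /* Skip the remaining newlines one by one. */
    long remaining = target - index_lines[i];
    int c;
    while (remaining > 0 && (c = fgetc(fp)) != EOF)
        if (c == '\n')
            remaining--;

    return remaining == 0 ? 0 : -1;
}
The linear scan over the index could of course become a binary search if the index itself grows large.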
Another approach would be to make your line lengths constant. This would work well if they already have similar lengths so you don't waste too much space. You can pad them out with \0 characters or spaces or whatever, then index the file like a big matrix (line N is at N*LEN bytes).
Finally, you could simply write the line numbers at the beginning of the lines themselves. Then just binary-search within the file, skip to a newline, and inspect the next line number to know whether to look backward or forward (and even guess by how much).
There is no efficient way to do so directly. You need to scan the entire file once to record where the end-of-line markers are.
Pragmatically, you need a loop reading the file with e.g. getline(3).
You could memoize e.g. the offset of every 100th line, perhaps in a big array, or in some indexed file using GDBM, or in some SQLite database.
My feeling is that you should not have such a huge text file in the first place (a huge text file that is accessed randomly is a symptom of something wrong); it is not an efficient way to store data you need random access to. You could, for example, predigest it to fill some database. Probably you should not put such a large piece of data in a text file at all, but directly into a database or similar.
Not directly with fseek, since it can only move the position by a byte offset.
If the efficiency requirement comes from the fact that you must do this many times, back and forth, a simple solution could be to scan the whole file once, compute all the line lengths, store them in a map or array, and then use those values to seek exactly where you want.
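A sketch of that one-time scan in C, building a table of line start offsets with getline(3) (so it assumes a POSIX system; build_line_offsets is an illustrative name, and error handling is kept minimal):
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Scan the whole file once; return an array where offsets[i] is the byte
   offset of the start of line i, and store the line count in *count. */
static long *build_line_offsets(FILE *fp, size_t *count)
{
    char *buf = NULL;
    size_t cap = 0, n = 0, alloc = 1024;
    long *offsets = malloc(alloc * sizeof *offsets);
    long pos = 0;
    ssize_t len;

    while ((len = getline(&buf, &cap, fp)) != -1) {
        if (n == alloc)
            offsets = realloc(offsets, (alloc *= 2) * sizeof *offsets);
        offsets[n++] = pos;
        pos += len;
    }
    free(buf);
    *count = n;
    return offsets;   /* later: fseek(fp, offsets[lineno], SEEK_SET); */
}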

Convert COMP and COMP-3 Packed Decimal into readable value with C

I have an EBCDIC flat file from a mainframe to be processed by a C module. What would be a good process for converting the COMP and COMP-3 values into readable values? Do I have to convert the EBCDIC characters to ASCII and then to hex for COMP-3? What about COMP? Thanks.
Bill Woodger has given you some very good advice through his comments to your question; actually, he answered the question and should have posted his comments as an answer.
I would like to reiterate a few of his points and expand on a few others.
If you need to convert a file created from what is probably a COBOL application so it may be read by some other non-COBOL program, possibly on a machine with an architecture unlike the one where it was created, then you should demand that the file be created using only display-formatted data (i.e. all character data). Mashing non-display (binary, packed, encoded) data outside of the operating environment where it was created is just a formula for long-term pain. You will be subjected to the joys of sorting out various endianness issues between architectures and code page conversions. These are the things that file transfer protocols are designed to manage; they do it well, so don't try to reinvent them. Short answer: use FTP or a similar file transport mechanism to move data between machines, and only transport display (character) based data.
Packed Decimal (COMP-3) data types occupy a varying number of bytes depending on their specific PICTURE layout. The position of the decimal point is implied, so it cannot be determined without reference to the PICTURE used to define the field. Packed Decimal fields may be either signed or unsigned. If signed, the sign is embedded in the low 4 bits of the least significant digit. Each byte of a Packed Decimal data type contains two digits, except possibly the first and last bytes. The first byte contains only 1 digit if the field is signed and contains an even number of digits. The last byte contains 2 digits if unsigned but only 1 if signed. There are several other subtleties that you need to be aware of if you want to do your own Packed Decimal to character conversions. At this point I hope you can see that this is not going to be a trivial exercise.
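Just to make that layout concrete, here is a rough sketch of such a conversion in C; it assumes a well-formed field, ignores the implied decimal point from the PICTURE clause, and does no overflow checking, which is exactly the kind of corner that makes rolling your own conversion painful:
#include <stddef.h>

/* Convert a COMP-3 (packed decimal) field of 'len' bytes to a long long.
   Each nibble holds one decimal digit, except the low nibble of the last
   byte, which holds the sign: 0xD = negative, 0xC or 0xF = positive. */
static long long comp3_to_ll(const unsigned char *p, size_t len)
{
    long long value = 0;
    for (size_t i = 0; i < len; i++) {
        value = value * 10 + (p[i] >> 4);         /* high nibble: digit */
        if (i + 1 < len)
            value = value * 10 + (p[i] & 0x0F);   /* low nibble: digit  */
    }
    return ((p[len - 1] & 0x0F) == 0x0D) ? -value : value;
}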
Binary (COMP) data types have a different but no less complex set of issues to resolve. Again, not a trivial exercise.
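One of those issues is byte order: COMP fields on the mainframe are big-endian two's-complement binary, so a portable reader has to assemble the value byte by byte rather than casting the buffer. A small sketch, again ignoring any scaling implied by the PICTURE clause:
#include <stdint.h>

/* Interpret a 4-byte big-endian COMP (binary) field as a signed 32-bit
   value, independent of the host machine's endianness. */
static int32_t comp4_to_int32(const unsigned char *p)
{
    uint32_t u = ((uint32_t)p[0] << 24) |
                 ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] <<  8) |
                  (uint32_t)p[3];
    return (int32_t)u;
}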
So what should you be doing? Basically, do as Bill suggested. Have the program that generates this file use display formats for output (meaning you have to do nothing). Or, failing that, use a utility program such as DFSORT/SYNCSORT to do the conversions for you. Going the utility route still requires that you have the original COBOL file layout (and that you understand it) in order to do the conversion. The last resort is simply writing a simple read-a-record-write-a-record COBOL program that takes in the unformatted data, MOVEs each COMP-whatever field to a corresponding DISPLAY field, and writes it out again.
As Bill said, if the group that produced this file tells you that it is too difficult/expensive to produce a DISPLAY-formatted output file, they are lying to you, or they are incompetent, or just too lazy to do the job they were hired to do. I can think of no other excuses.
Use XML to transport the data.
That is, write a program that converts your file into characters (if on the mainframe, stay with EBCDIC, but with numeric fields unpacked, etc.) and then enclose each record and each field in XML tags.
This avoids formatting issues (which field is in column 1, which field is in column 2, whether the delimiters are spaces or commas or either, etc. ad nauseam).
Then transmit the XML file with your favorite utility that converts from EBCDIC to ASCII.

GIF LZW decompression hints?

I've read through numerous articles on GIF LZW decompression, but I'm still confused as to how it works, or how to handle, in code, the more fiddly parts.
As I understand it, when I get to the byte stream in the GIF for the LZW compressed data, the stream tells me:
Minimum code size, AKA number of bits the first byte starts off with.
Now, as I understand it, I have to either add one to this for the clear code, or add two for the clear code and the EOI code. But I'm confused as to which of these it is.
So say I have 3 colour codes (01, 10, 11), with the EOI code assumed (as 00): will the code that follows the minimum code size (of 2) be 2 bits, or will it be 3 bits to factor in the clear code? Or are the clear code and EOI code both already factored into the minimum size?
The second question is: what is the easiest way to read dynamically sized groups of bits from a file? Reading odd numbers of bits (3 bits, 12 bits, etc.) out of 8-bit bytes sounds like it could be messy and buggy.
To start with your second question: yes, you have to read the dynamically sized codes from an 8-bit byte stream. You have to keep track of the code size you are currently reading and of the number of unused bits left over from previous read operations (so you know where the next bits come from in the file).
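As a minimal sketch of that bookkeeping in C: GIF packs the LZW codes least-significant-bit first (after the data sub-block lengths have been stripped out), so a simple, if not blazing-fast, reader can pick the bits one at a time (read_code and bitpos are illustrative names):
#include <stdint.h>
#include <stddef.h>

/* Read 'code_size' bits, LSB-first, starting at bit position *bitpos in
   'data' (the concatenated image sub-blocks).  Advances *bitpos. */
static unsigned read_code(const uint8_t *data, size_t *bitpos, unsigned code_size)
{
    unsigned code = 0;
    for (unsigned i = 0; i < code_size; i++) {
        size_t bit = *bitpos + i;
        if (data[bit / 8] & (1u << (bit % 8)))
            code |= 1u << i;
    }
    *bitpos += code_size;
    return code;
}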
IIRC, for 8-bit images the minimum code size is 8 bits, which gives you a clear code of 256 (base 10) and an End Of Information code of 257. The first stored code is then 258.
I am not sure why you did not look up the source of one of the public-domain graphics libraries. I know I did not, because back in 1989 (!) there were no libraries to use and no internet with complete descriptions. I had to implement a decoder from an example executable (for MS-DOS, from CompuServe) that could display images, plus a few GIF files, so I know it can be done (but it is not the most efficient way of spending your time).

Problems with portability: aligning data, endianness issues, etc

I'm writing a toy database management system, and running up against some alignment and endianness issues.
First, allow me to explain the data that is being stored, and where it's being stored. So first some definitions. The layout of a record is broken up into a Record Directory and Record Data.
[Field count=N] [Field offset[0]] [...] [Field offset[N-1]] [Data for fields 0 to N]
The field count and offsets combined are called the Record Directory.
The data is called the Record Data.
The field count is of type uint16_t.
The field offset is of type uint16_t.
The data fields can be treated as a variable length byte buffer pointed to by (uint8_t *) with a length of at least N bytes.
The field count cannot exceed 4095, i.e. 0x0FFF (in big endian).
The records are stored in a Page:
Pages are of size: 4096 bytes.
Pages need to store 2 bytes of data for each record.
The last 6 bytes of the page store the running free space offset and data for a slot directory. The metadata is irrelevant to the question, so I will not bore anyone with the details.
We store records on the page by appending them at the running free space offset and then advancing that offset. Records can later be altered and deleted, which leaves unused space fragments on the page. This space is not reused until compaction time.
At the moment, we store a fragment byte of 0x80 in unused space (since the free space cannot exceed 0x0FFF, the first byte will never be 0x80).
However, this becomes a problem at compaction time. We end up scanning until we hit the first byte that is not 0x80, and we consider that the start of the free space. Unfortunately, this is not portable and will only work on big-endian machines.
To restate the issue: the problem is distinguishing between 0x808000 and 0x800080, where the first two bytes (read right to left) may or may not form a valid field count depending on the endianness of the platform.
I want to try aligning records on even bytes. I just don't have the foresight to see if this would be a correct workaround for this issue.
At any given time, the free space offset should always sit on an even byte boundary. This means after inserting a record, you advance the free space pointer to the next even boundary.
The problem then becomes an issue of marking the fragments. Fragments are created upon deletion or altering a record (growing/shrinking by some number of bytes). I wanted to store what I would call 2-byte fragment markers: 0xFFFF. But that doesn't seem possible when altering.
This is where I'm stuck. Sorry for the long-winded problem explanation. We (my partner and I; this is an academic assignment) have battled this data-ambiguity problem several times, and it keeps resurfacing under different solutions.
Any insight would help. I hope the problem statement can be followed.
I would try this:
Align records to at least 2-byte boundaries.
Scan the list for free space as a list of uint16_t rather than char, then look for length & 0x8000.
If you let the machine interpret integers as such instead of trying to scan them as characters, endianness shouldn't be an issue here (at least until you want to read your database on a different machine than the one that wrote it).
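A minimal sketch of that check in C (FRAG_MARK and is_fragment are illustrative names; it assumes the words were written natively by the same machine that reads them, per the caveat above):
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define FRAG_MARK 0x8000u   /* high bit set: can never be a field count (<= 0x0FFF) */

/* Returns nonzero if the 16-bit word at even offset 'off' in 'page' is a
   fragment marker rather than the field count of a live record. */
static int is_fragment(const unsigned char *page, size_t off)
{
    uint16_t word;
    memcpy(&word, page + off, sizeof word);   /* read as a native uint16_t */
    return (word & FRAG_MARK) != 0;
}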
