Is .xz file format description telling it all? - archive

I've been reading the description of the xz file format. But when I look into an xz file with a binary editor, it doesn't seem to follow the structure defined in the description. What am I missing?
I compressed the description file (xz-file-format.txt) with the xz CLI utility on Linux (xz version 4.999.9beta), and these are the first 32 bytes I get:
FD 37 7A 58 5A 00 00 04 E6 D6 B4 46 02 00 21 01 16 00 00 00 74 2F E5 A3 E0 A9 28 2A 99 5D 00 05
Overall structure of the file should be: stream - stream padding - stream - and so on. And in this case I think there should be only one stream since there is only one file compressed in the file. Structure of the stream is: stream header - block - block - ... - block - index - stream footer. And structure of the stream header is: header magic bytes - stream flags - crc code.
I can find the stream header from my file, but after the first sixteen bytes it doesn't seem to follow the description anymore.
First six bytes above are clearly the magic bytes. Next two bytes are the stream flags. Stream flags indicate that CRC64 is being used, so the CRC code takes next eight bytes. Seventeenth byte (I count from one) should then be the first byte of the first block.
Structure of a block is: block header - compressed data - block padding - check. Structure of block header should be: block header size - block flags - compressed size - uncompressed size - list of filter flags - header padding - CRC. So the seventeenth byte should then be the block header size (0x16 in my file). That's possible, but the eighteenth byte seems a bit weird. It should be the block flags bit field. In my file it's null, so no flags are set. Not even the number of filters, which according to the description should be 1-4.
Since bits 6 and 7 of the block flags are also zeros, compressed and uncompressed sizes should not be present in the file and the next bytes should be the list of filter flags. Structure of the list is: filter ID - size of properties - filter properties. The nineteenth byte should then be the filter ID. This is null in my file, which is not any of the officially defined filter IDs. If it were a custom ID it would take nine bytes, but as I understand the encoding of sizes described in section 1.2 of the description, it can't be, since according to the description: "All but the last byte of the multibyte representation have the highest (eighth) bit set.", but in my file the twentieth byte is also null.
So is there something I don't understand or is the file not following the description?

I asked the question a bit hastily and came up with a solution myself. Just in case someone would be interested, I answer my own question.
I had misunderstood the meaning of the stream flags in stream header. They don't affect the CRC code in the header (which is always CRC32), just CRCs in the stream itself (as the name stream flags implies). This means that the CRC in the header is only four bytes long and thus bytes 13-24 form a valid block header.
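The corrected reading of the stream header can be checked mechanically. Below is a minimal Python sketch, not taken from the xz sources; the function name `parse_stream_header` and the dictionary keys are made up for illustration. It splits the first 12 bytes into magic bytes, stream flags and a CRC32 stored little-endian, and recomputes the CRC32 over the two flag bytes with `zlib.crc32`.

```python
import struct
import zlib

# First 12 bytes of the file quoted in the question.
HEADER = bytes.fromhex("FD 37 7A 58 5A 00 00 04 E6 D6 B4 46")

XZ_MAGIC = bytes.fromhex("FD 37 7A 58 5A 00")


def parse_stream_header(data: bytes) -> dict:
    """Split a 12-byte xz stream header into its three fields."""
    if len(data) < 12:
        raise ValueError("an xz stream header is 12 bytes long")
    magic = data[0:6]
    flags = data[6:8]
    # The header CRC is always a 4-byte CRC32, stored little-endian.
    (stored_crc,) = struct.unpack("<I", data[8:12])
    return {
        "magic_ok": magic == XZ_MAGIC,
        # The low four bits of the second flag byte select the check used
        # inside the stream (0x04 is CRC64); they do not affect this header.
        "check_id": flags[1] & 0x0F,
        "stored_crc32": stored_crc,
        "computed_crc32": zlib.crc32(flags),
    }


if __name__ == "__main__":
    for key, value in parse_stream_header(HEADER).items():
        print(key, hex(value) if isinstance(value, int) else value)
```

If the file is intact, the stored and computed CRC32 values should agree, which is a quick way to confirm that the header CRC covers only the two flag bytes.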
In the block header, the block flags field is again a null byte, which I saw as a problem before. According to the description, number of filters should be between 1 and 4. So I expected a decimal value of at least one. Since number of filters is expressed with two bits the maximum decimal value is 3, but number of possible values (zero included) is of course four and thus zero means one filter.
Since the last two bits of the block flags are also zero, neither the compressed size nor the uncompressed size field is present in the block header. This means that bytes 15-17 are the filter flags for the first (and only) filter. Filter ID 0x21 is the ID of the LZMA2 filter. A size of properties of 0x01 means one byte of properties. And the dictionary size byte 0x16 means a dictionary of 8192 KiB (8 MiB).
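The block header walk-through can be sketched in Python as well. This is an illustration under simplifying assumptions, not a full parser: it only handles the case from the question (no size fields, one-byte filter ID and properties size), and the names `parse_block_header` and `lzma2_dict_size` are invented here. The dictionary size formula follows the decoding rule given in the LZMA2 section of the format description.

```python
import struct
import zlib

# Bytes 13-24 of the file quoted in the question (the first block header).
BLOCK_HEADER = bytes.fromhex("02 00 21 01 16 00 00 00 74 2F E5 A3")


def lzma2_dict_size(props: int) -> int:
    """Decode the one-byte LZMA2 dictionary size property into bytes."""
    bits = props & 0x3F
    if bits > 40:
        raise ValueError("dictionary size too large")
    if bits == 40:
        return 0xFFFFFFFF
    return (2 | (bits & 1)) << (bits // 2 + 11)


def parse_block_header(data: bytes) -> dict:
    """Parse a block header with one filter and no size fields."""
    header_size = (data[0] + 1) * 4      # stored value is (real size / 4) - 1
    flags = data[1]
    if flags & 0xC0:
        raise NotImplementedError("size fields are not handled in this sketch")
    filter_id, props_size = data[2], data[3]
    if filter_id >= 0x80 or props_size >= 0x80:
        raise NotImplementedError("multi-byte integers are not handled")
    props = data[4:4 + props_size]
    (stored_crc,) = struct.unpack("<I", data[header_size - 4:header_size])
    is_lzma2 = filter_id == 0x21 and props_size == 1
    return {
        "header_size": header_size,
        "num_filters": (flags & 0x03) + 1,   # two bits, value 0 means 1 filter
        "filter_id": filter_id,
        "dict_size": lzma2_dict_size(props[0]) if is_lzma2 else None,
        # The CRC32 covers everything in the header except the CRC itself.
        "crc_ok": zlib.crc32(data[:header_size - 4]) == stored_crc,
    }


if __name__ == "__main__":
    print(parse_block_header(BLOCK_HEADER))
```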


How to combine hex values in real-time?

To give some context, I have an incoming stream of hex values that is written to a CSV file in the format shown below.
20 5a 20 5e 20 7b 20 b1 20 64 20 f8 ...
I can not change the way the data is flowing in, but before it gets written to a CSV file I want it in this format below.
205a 205e 207b 20b1 2064 20f8 ...
As the data is coming, I need to process it and store it in the format shown above. One of the ways I tried was just bitshifting and doing logical OR which would store the result in a variable. But all I have here is a pointer pointing to a buffer where the data will be flowing into. I have something like this.
uint8_t *curr_ptr;
uint8_t* dec_buffer=(uint8_t*)calloc(4000,sizeof(uint8_t)*max_len);
for(int j=17;j<=145;j+=1){
fprintf(f_write[file_count],"%02x ", *(curr_ptr+j));
if(j>0 && j%145==0){
Effectively you want to remove every other space. Why not something like this?
for(int j=17;j<=145;j+=1){
fprintf(f_write[file_count], j%2 ? "%02x " : "%02x", *(curr_ptr+j));
}
Not sure if you should be printing spaces after the odd values of j or the even ones, but you can sort that out.
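The same idea, grouping the bytes in pairs before formatting instead of deciding per byte whether to print a space, can be sketched in Python. This is only an illustration of the grouping logic, not a drop-in replacement for the C loop; the function name `pair_hex` is made up, and it assumes the pairs start at the first byte of the buffer.

```python
def pair_hex(data: bytes) -> str:
    """Format bytes as space-separated groups of two bytes (four hex digits)."""
    groups = [data[i:i + 2].hex() for i in range(0, len(data), 2)]
    return " ".join(groups)


if __name__ == "__main__":
    incoming = bytes.fromhex("20 5a 20 5e 20 7b 20 b1 20 64 20 f8")
    print(pair_hex(incoming))
```

With an odd number of bytes, the last group simply contains a single byte.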

Contents of a f77 unformatted binary file

I have an f77 unformatted binary file.
I know that the file contains 2 floats and a long integer as well as data.
The size of the file is 536870940 bytes which should include 512^3 float data values together with the 2 floats and the long integer.
The 512^3 float data values make up 536870912 bytes leaving a further 28 bytes.
My problem is that I need to work out where the 28 bytes begins and how to skip this amount of storage so that I can directly access the data.
I prefer to use C to access the file.
Unfortunately, there is no standard for what "unformatted" means. But some methods are more common than others.
In many Fortran versions I have used, every write command writes a header (often unsigned int 32) of how many bytes the data is, then the data, then repeats the header value in case you're reading from the rear.
From the values you have provided, it might be that you have something like this:
uint32(record1 header), probably 12.
float32, float32, int32 (the three 'other values' you talked about)
uint32(record1 header, same as first value)
uint32(record2 header, probably 512^3*4)
float32 x 512^3 (your data)
uint32(record2 header, same as before)
You might have to check endianness.
So I suggest you open the file in a hexdump program, and check whether bytes 0-3 are identical to bytes 16-19, and whether bytes 20-23 are repeated at the end of the data again.
If that is the case, I'd then check whether the values are little or big endian, and with a little luck you'll have your data.
Note: I assume that these three other values are metadata about the data, and therefore would be at the beginning of the file. If that's not the case, you might have them at the end.
In your comment, you write that your data begins with something like this:
0C 00 00 00 XX XX XX XX XX XX XX XX XX XX XX XX 0C 00 00 00
^- header-^                                     ^-header -^
E8 09 FF 1F (many, many values) E8 09 FF 1F
^- header-^ ^--- your data ---^ ^-header -^
Now I don't know how to read data in C. I leave this up to you. What you need to do is skip the first 24 bytes, then read the data as (probably little endian) 4-byte floating values. You will have 4 bytes left that you don't need any more.
Important note:
Fortran stores arrays column-major, C afaik stores them row-major. So keep in mind that the order of the indices will be reversed.
I know how to read this in Python:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '<u4')
# read the three values you are not interested in
threevals = ff.read_record('<u4')
# read the data
data = ff.read_record('<f4')
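If SciPy is not available, the record framing described above can also be walked by hand with the standard library. The following is a rough sketch under the same assumptions as the answer: every record is wrapped in 4-byte little-endian length markers, and the values are 4-byte floats and a 4-byte integer. The helper names `read_records` and `floats_le` are invented for this example, and for a 512 MB record you would probably stream the data rather than load it in one go.

```python
import struct


def read_records(path, marker="<I"):
    """Return the payload of every length-framed record in the file."""
    records = []
    marker_size = struct.calcsize(marker)
    with open(path, "rb") as f:
        while True:
            head = f.read(marker_size)
            if not head:
                break  # clean end of file
            if len(head) != marker_size:
                raise ValueError("truncated record marker")
            (length,) = struct.unpack(marker, head)
            payload = f.read(length)
            if len(payload) != length:
                raise ValueError("truncated record")
            tail = f.read(marker_size)
            if len(tail) != marker_size or struct.unpack(marker, tail)[0] != length:
                raise ValueError("leading and trailing markers disagree")
            records.append(payload)
    return records


def floats_le(payload: bytes) -> tuple:
    """Interpret a payload as little-endian 4-byte floats."""
    return struct.unpack("<%df" % (len(payload) // 4), payload)


if __name__ == "__main__":
    recs = read_records("data.dat")  # file name taken from the snippet above
    print(len(recs), [len(r) for r in recs])
```

If the markers turn out to be big endian, passing `marker=">I"` (and using `">"` in the float format) is the only change needed.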

Determine size of decrypted data from gcry_cipher_decrypt?

I am using AES/GCM, but the following is a general question for other modes, like AES/CBC. I have the following call into libgcrypt:
#define COUNTOF(x) ( sizeof(x) / sizeof(x[0]) )
#define ROUNDUP(x, b) ( (x) ? (((x) + ((b) - 1)) / (b)) * (b) : (b) )
const byte cipher[] = { 0xD0,0x6D,0x69,0x0F ... };
byte recovered[ ROUNDUP(COUNTOF(cipher), 16) ];
err = gcry_cipher_decrypt(
handle, // gcry_cipher_hd_t
recovered, // void *
COUNTOF(recovered), // size_t
cipher, // const void *
COUNTOF(cipher)); // size_t
I cannot figure out how to determine the size of the resulting recovered text. I've checked the Working with cipher handles reference, and it's not discussed (and there are 0 hits for 'pad'). I also checked the libgcrypt self tests in tests/basic.c and tests/fipsdrv.c, but they use the same oversized buffer and never prune the buffer to the actual size.
How do I determine the size of the data returned to me in the recovered buffer?
For block-oriented modes such as CBC, you need to apply a padding scheme to your input and remove the padding after the decrypt; gcrypt doesn't handle it for you. (GCM is built on counter mode, so it needs no padding and the recovered text is the same length as the ciphertext.)
The most common choice is PKCS#7. A high-level overview is that you fill the unused bytes in your final block with the number of padding bytes (block_size - used_bytes). If your input length is already a multiple of the block size, you append a whole extra block in which every byte has the value block_size.
For example, with 8-byte blocks and 4 bytes of input, your raw input would look like:
AB CD EF FF 04 04 04 04
When you do the decrypt, you take the value of the last byte of the last block, and remove that many bytes from the end.
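A minimal sketch of the padding and unpadding steps, written in Python for brevity rather than in C against libgcrypt. The function names `pkcs7_pad` and `pkcs7_unpad` are made up for this example; it only shows the byte manipulation, and a real application should treat invalid padding as a decryption failure.

```python
def pkcs7_pad(data: bytes, block_size: int) -> bytes:
    """Append PKCS#7 padding so the length is a multiple of block_size."""
    if not 1 <= block_size <= 255:
        raise ValueError("block size must be between 1 and 255")
    # n is between 1 and block_size; a full extra block is added when
    # the input is already aligned.
    n = block_size - len(data) % block_size
    return data + bytes([n]) * n


def pkcs7_unpad(padded: bytes, block_size: int) -> bytes:
    """Strip PKCS#7 padding, checking that it is well formed."""
    if not padded or len(padded) % block_size:
        raise ValueError("length is not a positive multiple of the block size")
    n = padded[-1]
    if n < 1 or n > block_size or padded[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return padded[:-n]


if __name__ == "__main__":
    padded = pkcs7_pad(bytes.fromhex("abcdefff"), 8)
    print(padded.hex(" "))
    print(len(pkcs7_unpad(padded, 8)))
```

The length of the value returned by `pkcs7_unpad` is the size of the recovered text the question asks about.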

FAT BPB and little endian reversal

My CPU is little endian, which documentation has told me conforms to the byte order of the FAT specification. Why, then, am I getting a valid address for BS_jmpBoot, bytes 0-2 of the first sector, but not a valid number for BPB_BytesPerSec, bytes 11-12 of the first sector?
int fd = open (diskpath, O_RDONLY, S_IROTH);
read (fd, BS_jmpBoot, 3);
printf("BS_jmpBoot = 0x%02x%02x%02x\n", BS_jmpBoot[0], BS_jmpBoot[1], BS_jmpBoot[2]);
read (fd, OEMName, 8);
OEMName[8] = '\0';
printf("OEMName = %s\n", OEMName);
read (fd, BPB_BytesPerSec, 2);
printf("BPB_BytesPerSec = 0x%02x%02x\n",BPB_BytesPerSec[0], BPB_BytesPerSec[1]);
BS_jmpBoot = 0xeb5890 //valid address, while 0x9058eb would not be
OEMName = MSDOS5.0
BPB_BytesPerSec = 0x0002 //Should be 0x0200
I would like to figure out why BS_jmpBoot and OEMName print valid values but BPB_BytesPerSec does not. If anyone could enlighten me I would be greatly appreciative.
EDIT: Thanks for the help everyone, it was my types that were making everything go awry. I got it to work by writing the bytes to an unsigned short, as uesp suggested(kinda), but I would still like to know why this didn't work:
unsigned char BPB_BytesPerSec[2];
read (fd, BPB_BytesPerSec, 2);
printf("BPB_BytesPerSec = 0x%04x\n", *BPB_BytesPerSec);
BPB_BytesPerSec = 0x0000
I would like to use char arrays to allocate the space because I want to be sure of the space I'm writing to on any machine; or should I not?
Thanks again!
You are reading BPB_BytesPerSec incorrectly. The structure of the Bpb is (from here):
BYTE BS_jmpBoot[3];
BYTE BS_OEMName[8];
WORD BPB_BytesPerSec;
The first two fields are bytes so their endianness is irrelevant (I think). BPB_BytesPerSec is a WORD (assuming 2 bytes) so you should define/read it like:
WORD BPB_BytesPerSec; //Assuming WORD is defined on your system
read (fd, &BPB_BytesPerSec, 2);
printf("BPB_BytesPerSec = 0x%04x\n", BPB_BytesPerSec);
When you read the bytes directly you get 00 02, which is 0x0200 when interpreted as little endian, so reading the field into a two-byte integer like this gives the correct value on a little-endian machine.
First of all, this line:
printf("BPB_BytesPerSec = 0x%02x%02x\n",BPB_BytesPerSec[0], BPB_BytesPerSec[1]);
is printing the value out in big endian format. If it prints 0x0002 here, the actual value would be 0x0200 in little endian.
As for the BS_jmpBoot value, according to this site:
The first three bytes EB 3C and 90 disassemble to JMP SHORT 3C NOP. (The 3C value may be different.) The reason for this is to jump over the disk format information (the BPB and EBPB). Since the first sector of the disk is loaded into ram at location 0x0000:0x7c00 and executed, without this jump, the processor would attempt to execute data that isn't code.
In other words, the first 3 bytes are opcodes which are three separate bytes, not one little endian value.
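To make the distinction between byte arrays and a little-endian WORD concrete, here is a small Python sketch using `struct`. The function name `parse_boot_sector_start` is invented for this example, and the sample bytes are the ones printed in the question. The format string `"<3s8sH"` reads three raw bytes, eight raw bytes, and one unsigned 16-bit little-endian integer, with no padding between them.

```python
import struct


def parse_boot_sector_start(sector: bytes) -> dict:
    """Decode BS_jmpBoot, BS_OEMName and BPB_BytesPerSec (bytes 0-12)."""
    jmp, oem, bytes_per_sec = struct.unpack_from("<3s8sH", sector, 0)
    return {
        "jmp_boot": jmp,                            # three separate opcode bytes
        "oem_name": oem.decode("ascii", "replace"),
        "bytes_per_sec": bytes_per_sec,             # one little-endian value
    }


if __name__ == "__main__":
    # First 13 bytes as they appear on disk, per the output in the question.
    sample = bytes.fromhex("eb5890") + b"MSDOS5.0" + bytes.fromhex("0002")
    fields = parse_boot_sector_start(sample)
    print(fields["jmp_boot"].hex(), fields["oem_name"], hex(fields["bytes_per_sec"]))
```

Because the byte order is spelled out in the format string, this gives the same result on any host, which also addresses the portability concern raised in the edit to the question.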

Reading SQLite header

I was trying to parse the header from an SQLite database file, using this (fragment of the actual) code:
struct Header_info {
    char *filename;
    char *sql_string;
    uint16_t page_size;
};
int read_header(FILE *db, struct Header_info *header)
{
    uint8_t sql_buf[100] = {0};
    /* load the header */
    if(fread(sql_buf, 100, 1, db) != 1) {
        return ERR_SIZE;
    }
    /* copy the string */
    header->sql_string = strdup((char *)sql_buf);
    /* verify that we have a proper header */
    if(strcmp(header->sql_string, "SQLite format 3") != 0) {
        /* ... */
    }
    memcpy(&header->page_size, (sql_buf + 16), 2);
    return 0;
}
Here are the relevant bytes of the file I'm testing it on:
0000000: 5351 4c69 7465 2066 6f72 6d61 7420 3300 SQLite format 3.
0000010: 1000 0101 0040 2020 0000 c698 0000 1a8e .....@  ........
Following this spec, the code looks correct to me.
Later I print header->page_size with this line:
printf("\tPage size: %"PRIu16"\n", header->page_size);
But that line prints out 16, instead of the expected 4096. Why? I'm almost certain it's some basic thing that I've just overlooked.
It's an endianness problem. x86 is little-endian, that is, in memory, the least significant byte is stored first. When you load 10 00 into memory on a little-endian architecture, you therefore get 00 10 in human-readable form, which is 16 instead of 4096.
Your problem is therefore that memcpy is not an appropriate tool to read the value.
See the following section of the SQLite file format spec:
1.2.2 Page Size
The two-byte value beginning at offset 16 determines the page size of
the database. For SQLite versions 3.7.0.1 and earlier, this value is
interpreted as a big-endian integer and must be a power of two between
512 and 32768, inclusive. Beginning with SQLite version 3.7.1, a page
size of 65536 bytes is supported. The value 65536 will not fit in a
two-byte integer, so to specify a 65536-byte page size, the value
at offset 16 is 0x00 0x01. This value can be interpreted as a
big-endian 1 and thought of as a magic number to represent the
65536 page size. Or one can view the two-byte field as a little endian
number and say that it represents the page size divided by 256. These
two interpretations of the page-size field are equivalent.
It seems to be an endianness issue. If you are on a little-endian machine, this line:
memcpy(&header->page_size, (sql_buf + 16), 2);
copies the two bytes 10 00 into a uint16_t, which will have the low-order byte at the lower address.
You can do this instead:
header->page_size = sql_buf[17] | (sql_buf[16] << 8);
For the record, note that the solution I propose will work regardless of the endianness of the machine (see Rob Pike's article on byte order).
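The same byte-order-independent decoding, including the special case for 65536-byte pages described in the spec excerpt, might look like this in Python. This is a sketch for illustration only; `sqlite_page_size` is a made-up name, and the sample bytes are the first 18 bytes of the hexdump in the question.

```python
SQLITE_MAGIC = b"SQLite format 3\x00"


def sqlite_page_size(header: bytes) -> int:
    """Return the page size stored at offset 16 of an SQLite 3 header."""
    if header[:16] != SQLITE_MAGIC:
        raise ValueError("not an SQLite 3 header")
    # The field is big-endian regardless of the machine we run on.
    raw = int.from_bytes(header[16:18], "big")
    # A stored value of 1 stands for 65536, which does not fit in two bytes.
    return 65536 if raw == 1 else raw


if __name__ == "__main__":
    sample = SQLITE_MAGIC + bytes.fromhex("1000")
    print(sqlite_page_size(sample))
    # What a little-endian memcpy into a uint16_t effectively computes:
    print(int.from_bytes(sample[16:18], "little"))
```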
