I am trying to understand how the zlib algorithm, as implemented in Python, works. I understand that it uses a variant of DEFLATE. I am wondering whether it is possible to get from the given data to the compressed data by hand.
Data : 00 Compress Data : 63 00 00 bin: 01100011 00000000 00000000
Data : 01 Compress Data : 63 04 00 bin: 01100011 00000100 00000000
Data : 02 Compress Data : 63 02 00 bin: 01100011 00000010 00000000
Data : 03 Compress Data : 63 06 00 bin: 01100011 00000110 00000000
Data : 04 Compress Data : 63 01 00 bin: 01100011 00000001 00000000
The above data was compressed with zlib level 1 compression; the 78 01 header and the Adler-32 checksum have been stripped. So what is left is the compressed data from DEFLATE (if I am not wrong).
Numbering the bits as such,
+--------+
|76543210|
+--------+
and the bytes as follow,
0 1
+--------+--------+
|00001000|00000010|
+--------+--------+
From DEFLATE standard here.
I understand that bit 0 of the first byte marks the last block of the file, and that bits 1 and 2 indicate that DEFLATE is using mode 10 compression. But I am unable to work out what the rest of the bits mean or how to compute them by hand.
Below is an extended version of the bytes,
Data : 01 01 01 01 02 02 Compress Data : 63 64 64 64 64 62 02 00 bin: 01100011 01100100 01100100 01100100 01100100 01100010 00000010 00000000
zlib does not use a "variant" of DEFLATE. It uses DEFLATE.
The DEFLATE compressed data format is fully described in RFC 1951.
You can use infgen to disassemble DEFLATE streams. Example output for your first one:
! infgen 2.4 output
!
last
fixed
literal 0
end
It says that the first block is also the last block, that it is a fixed-code block, and that there is a literal zero byte as the data contained in the block.
You can also look at the source code for infgen to assist in your understanding of RFC 1951.
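As a rough cross-check of the infgen output, here is a minimal Python sketch that builds a fixed-code block by hand. It assumes the input is emitted as literals only (no length/distance matches), which is what the samples in the question appear to contain. The function names are made up for illustration; the code table is the fixed one from section 3.2.6 of RFC 1951.

```python
import zlib

def fixed_code(symbol):
    """Return (code, bit_length) for a literal/length symbol in the fixed table."""
    if symbol <= 143:
        return 0b00110000 + symbol, 8
    if symbol <= 255:
        return 0b110010000 + (symbol - 144), 9
    if symbol <= 279:
        return symbol - 256, 7
    return 0b11000000 + (symbol - 280), 8

def deflate_literals_only(data):
    """Encode data as one final fixed-code block that contains only literals."""
    bits = [1]        # BFINAL = 1: this is the last block
    bits += [1, 0]    # BTYPE = 01 (fixed codes), least significant bit first
    for symbol in list(data) + [256]:   # 256 is the end-of-block symbol
        code, length = fixed_code(symbol)
        # Huffman codes are packed starting from their most significant bit.
        bits += [(code >> i) & 1 for i in reversed(range(length))]
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        # Bits fill each output byte starting from bit 0.
        for position, bit in enumerate(bits[start:start + 8]):
            byte |= bit << position
        out.append(byte)
    return bytes(out)

for sample in (b"\x00", b"\x01", b"\x02", b"\x03", b"\x04"):
    encoded = deflate_literals_only(sample)
    # wbits=-15 tells zlib to expect a raw DEFLATE stream without header/trailer.
    assert zlib.decompress(encoded, wbits=-15) == sample
    print(sample.hex(), "->", encoded.hex(" "))
```

For the one-byte samples this prints 63 00 00, 63 04 00, 63 02 00, 63 06 00 and 63 01 00, matching the table in the question. A real compressor will pick matches and other block types for longer input, so this only covers the simplest case.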
Why does a value of datetime type get displayed with a format using 12 hour clock and AM/PM, while datetime2 type is returned in ISO format? This using FreeTDS and the client programs sqsh and Perl's DBD::Sybase. How can I change the default date format used for datetime so that it matches datetime2?
After investigating and finding the answer, a better phrasing of the question would be "How can I change the format string used by FreeTDS for values of datetime type returned from the server, and how can I make sure newer types like datetime2 are encoded properly in the TDS protocol?" But of course, once you have expressed the question precisely you have already got most of the way to the answer.
Recent versions of Microsoft SQL Server introduce new types date, time, and datetime2. But to fully support them in the TDS protocol you need to speak TDS version 7.3 or later. Otherwise, they are returned to the client as strings over the wire, as seen by setting the TDSDUMP environment variable:
packet.c:410:Received packet
0000 04 01 00 5a 00 63 01 00-81 01 00 00 00 21 00 e7 |...Z.c.. .....!..|
0010 36 00 09 04 d0 00 34 00-d1 36 00 32 00 30 00 32 |6.....4. .6.2.0.2|
0020 00 32 00 2d 00 30 00 31-00 2d 00 31 00 37 00 20 |.2.-.0.1 .-.1.7. |
0030 00 31 00 30 00 3a 00 33-00 35 00 3a 00 32 00 34 |.1.0.:.3 .5.:.2.4|
0040 00 2e 00 38 00 33 00 36-00 36 00 36 00 36 00 37 |...8.3.6 .6.6.6.7|
0050 00 fd 10 00 c1 00 01 00-00 00 |........ ..|
This shows the datetime2 value '2022-01-17 10:35:24.8366667' being sent as text. If you use datetime you get a binary format:
packet.c:410:Received packet
0000 04 01 00 25 00 d6 01 00-81 01 00 00 00 21 00 6f |...%.... .....!.o|
0010 08 00 d1 08 20 ae 00 00-df c4 ae 00 fd 10 00 c1 |.... ... ........|
0020 00 01 00 00 00 - |.....|
and something similar is seen for datetime2 provided TDSVER is new enough:
packet.c:410:Received packet
0000 04 01 00 2b 00 59 01 00-81 01 00 00 00 00 00 21 |...+.Y.. .......!|
0010 00 2a 07 00 d1 08 2b 35-0f ab 69 7b 43 0b fd 10 |.*....+5 ..i{C...|
0020 00 c1 00 01 00 00 00 00-00 00 00 |........ ...|
This explains why datetime2 is returned in a standard date format: because it's the fallback SQL Server uses when the client is using an older protocol. In a way it's a good result (I prefer the ISO format) but obtained using the wrong means (it's better to use the more efficient binary protocol).
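To make the difference between the first two dumps concrete, here is a small Python sketch that pulls the row value out of each packet. The layout it relies on is my reading of the dumps and of the TDS specification, not something FreeTDS prints: an 8-byte packet header, a 0xD1 row token, then a length prefix (2 bytes for the text column, 1 byte for the nullable datetime column). The binary datetime is taken to be a 4-byte day count since 1900-01-01 followed by a 4-byte count of 1/300-second ticks since midnight. The helper name row_payload is made up.

```python
import struct
from datetime import datetime, timedelta

# First dump: datetime2 sent as text over an old TDS version.
text_packet = bytes.fromhex(
    "0401005a0063010081010000002100e7"
    "36000904d0003400d136003200300032"
    "0032002d00300031002d003100370020"
    "00310030003a00330035003a00320034"
    "002e0038003300360036003600360037"
    "00fd1000c10001000000"
)
# Second dump: datetime sent in binary form.
binary_packet = bytes.fromhex(
    "0401002500d6010081010000002100"
    "6f0800d10820ae0000dfc4ae00fd1000c1"
    "0001000000"
)

def row_payload(packet, length_size):
    """Return the bytes of the single column value that follows the row token."""
    row = packet.index(b"\xd1", 8)          # skip the 8-byte packet header
    start = row + 1 + length_size
    length = int.from_bytes(packet[row + 1:start], "little")
    return packet[start:start + length]

# The text value is UTF-16LE, hence the 00 byte after every character.
text_value = row_payload(text_packet, 2).decode("utf-16-le")

# The binary value is two little-endian 32-bit integers: days and ticks.
days, ticks = struct.unpack("<iI", row_payload(binary_packet, 1))
binary_value = datetime(1900, 1, 1) + timedelta(days=days, seconds=ticks / 300)

print(text_value)
print(binary_value)
```

The text form takes 54 bytes on the wire for this value while the binary datetime takes 8, which is the efficiency argument for moving to a newer TDS version.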
So the first step is to move to a newer TDS version. The easiest way is to set the environment variable TDSVER to 7.4 (currently the latest version supported by FreeTDS).
That will make datetime2 use binary encoding in the same way datetime always has done. And then the application code will get it back as a string using FreeTDS's date formatting. This looks like a regression compared to the ISO format string, but there is a second step.
FreeTDS's date formatting is configured in /etc/locales.conf. As shipped, it has a default format and then overrides for some non-English locales. In my personal view, the client library is too low-level a place to do locale-aware date formatting, since application code might want to parse the string into its own datetime type, and a change in locale (like switching the order of day and month) should not change application logic. Locale-aware formatting should come later, at the user interface layer. But anyway, I replaced the contents of the file with
[default]
date format = %Y-%m-%d %H:%M:%S.%z
(The %z is a FreeTDS extension to date format strings to give milliseconds.)
This brings FreeTDS's client-side date formatting into line with the server-side formatting that happens when the protocol version is old. So you could in principle just change this date format and continue with the older TDSVER. It's better to update TDSVER, though, because the binary encoding is more compact and presumably more efficient for the server to generate.
There is one unhappy side effect, though. When getting back a value of type date it will now be formatted as a datetime:
2022-01-17 00:00:00.000
Previously, with the older TDS protocol, the client would just see 2022-01-17. I believe that fixing this will require a patch to FreeTDS.
I have also seen out-of-memory errors with Perl's DBD::Sybase after moving from TDS protocol 7.1 to 7.4. I will update this answer with more information when I have it.
In python2:
$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000 08 04 87 18 0a |.....|
00000005
In python3:
$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000 08 04 c2 87 18 0a |......|
00000006
Why does it have the byte "\xc2" here?
Edit:
I think that when the string has a non-ASCII character, python3 will prepend the byte "\xc2" to it. (as @Ashraful Islam said)
So how can I avoid this in python3?
Consider the following snippet of code:
import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))
Run this with Python 2 and look at the result with hexdump -C:
00000000 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f |................|
Et cetera. No surprises; 128 bytes from 0x80 to 0xff.
Do the same with Python 3:
00000000 c2 80 c2 81 c2 82 c2 83 c2 84 c2 85 c2 86 c2 87 |................|
...
00000070 c2 b8 c2 b9 c2 ba c2 bb c2 bc c2 bd c2 be c2 bf |................|
00000080 c3 80 c3 81 c3 82 c3 83 c3 84 c3 85 c3 86 c3 87 |................|
...
000000f0 c3 b8 c3 b9 c3 ba c3 bb c3 bc c3 bd c3 be c3 bf |................|
To summarize:
Everything from 0x80 to 0xbf has 0xc2 prepended.
Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.
So, what’s going on here?
In Python 2, strings are byte strings and no conversion is done. Tell it to
write something outside the 0-127 ASCII range, it says “okey-doke!” and
just writes those bytes. Simple.
In Python 3, strings are Unicode. When non-ASCII characters are
written, they must be encoded in some way. The default encoding for
standard output is usually UTF-8 (it depends on the locale).
So, how are these values encoded in UTF-8?
Code points from 0x80 to 0x7ff are encoded as follows:
110vvvvv 10vvvvvv
Where the 11 v characters are the bits of the code point.
Thus:
0x80 hex
1000 0000 8-bit binary
000 1000 0000 11-bit binary
00010 000000 divide into vvvvv vvvvvv
11000010 10000000 resulting UTF-8 octets in binary
0xc2 0x80 resulting UTF-8 octets in hex
0xc0 hex
1100 0000 8-bit binary
000 1100 0000 11-bit binary
00011 000000 divide into vvvvv vvvvvv
11000011 10000000 resulting UTF-8 octets in binary
0xc3 0x80 resulting UTF-8 octets in hex
So that’s why you’re getting a c2 before 87.
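The same bit shuffling can be written out in a few lines of Python. This is only an illustration of the two-byte case described above; the function name is made up, and in real code you would simply call str.encode.

```python
def utf8_two_byte(code_point):
    """Encode a code point in the range 0x80..0x7ff as two UTF-8 octets."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled here")
    first = 0b11000000 | (code_point >> 6)          # 110vvvvv: top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10vvvvvv: low 6 bits
    return bytes([first, second])

for cp in (0x80, 0x87, 0xC0, 0xFF):
    encoded = utf8_two_byte(cp)
    # Cross-check against Python's own encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(hex(cp), "->", encoded.hex(" "))
```

For 0x87 this gives c2 87, which is exactly the pair of bytes in the hexdump from the question.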
How to avoid all this in Python 3? Use the bytes type.
Python 2's default string type is byte strings. Byte strings are written "abc" while Unicode strings are written u"abc".
Python 3's default string type is Unicode strings. Byte strings are written as b"abc" while Unicode strings are written "abc" (u"abc" still works, too). Since there are over a million Unicode code points, printing them as bytes requires an encoding (UTF-8 in your case), which needs multiple bytes per code point outside the ASCII range.
First use a byte string in Python 3 to get the same Python 2 type. Then, because Python 3's print expects Unicode strings, use sys.stdout.buffer.write to write to the raw stdout interface, which expects byte strings.
python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18")'
Note that if writing to a file, there are similar issues. For no encoding translation, open files in binary mode 'wb' and write byte strings.
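A short sketch of the file case, using a temporary directory and a made-up file name: the bytes written in binary mode come back unchanged, while pushing the same values through a text encoding reproduces the extra c2 byte.

```python
import os
import tempfile

payload = b"\x08\x04\x87\x18"
path = os.path.join(tempfile.mkdtemp(), "raw.bin")   # hypothetical file name

# Binary mode: bytes go to disk exactly as given.
with open(path, "wb") as handle:
    handle.write(payload)
with open(path, "rb") as handle:
    on_disk = handle.read()

# Text route: the same four code points encoded as UTF-8.
as_text = "\x08\x04\x87\x18".encode("utf-8")

print(on_disk.hex(" "))
print(as_text.hex(" "))
```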
I want to read the data of an 8-bit wav file using textPad. I know that the data is located at the 44th/46th byte, but I am having problems reading it.
I have this dump:
52 49 46 46 F8 37 01 00 57 41 56 45 66 6D 74 20
12 00 00 00 06 00 01 00 40 1F 00 00 40 1F 00 00
01 00 08 00 00 00 66 61 63 74 04 00 00 00 C6 37
01 00 64 61 74 61 C6 37 01 00 D6 D4 56 54 D5 56
56 51 D4 D3 D0 D6 54 57 D4 54 57 51 57 D0 D3 D1
etc.
The part in bold is the data of it.
The problem is that when I read it with sndfile using sf_read_int, I get the following values in the buffer:
3670016 1572864 -3670016 -1572864 524288 -3670016 -3670016
etc
How am I supposed to read the data in the wav file? What is the equation or 'relationship' between the numbers I got from sndfile and the data in textPad?
Oh, and one more thing: if I switch the reading to sf_read_float instead of int, I get values between -0.0001 and +0.0001...
Any idea what's happening? The writing and data processing work fine, but I don't understand what's up with these values.
Thank you.
There are some recognizable markers to look for in a .wav file:
"RIFF" : 0x52 0x49 0x46 0x46
"WAVE" : 0x57 0x41 0x56 0x45
"fmt " : 0x66 0x6d 0x74 0x20
"data" : 0x64 0x61 0x74 0x61
We see 64 61 74 61 at offset 50, followed by the 4-byte chunk size at offset 54. So the sample data begins only at offset 58, not at 44 or 46.
You can find a WAVE specification to understand how your file is encoded.
Thanks to this spec, I can tell you that your file is encoded in "8-bit ITU-T G.711 A-law".
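Here is a minimal Python sketch of how those conclusions can be read off the dump. It only walks the RIFF chunk headers from the question; the variable names are mine, and a real reader would take the bytes from the file instead of a hex string.

```python
import struct

# The first 58 bytes of the dump in the question.
header = bytes.fromhex(
    "52494646 F8370100 57415645"                         # "RIFF", size, "WAVE"
    "666D7420 12000000 0600 0100 401F0000 401F0000 0100 0800 0000"  # fmt chunk
    "66616374 04000000 C6370100"                         # fact chunk
    "64617461 C6370100"                                  # data chunk header
)

chunks = {}
position = 12                       # first chunk follows "RIFF" + size + "WAVE"
while position + 8 <= len(header):
    chunk_id, size = struct.unpack_from("<4sI", header, position)
    chunks[chunk_id] = (position + 8, size)      # (body offset, body size)
    if chunk_id == b"data":
        break
    position += 8 + size + (size & 1)            # chunk bodies are word aligned

fmt_offset, _ = chunks[b"fmt "]
format_tag, channels, sample_rate, byte_rate, block_align, bits = \
    struct.unpack_from("<HHIIHH", header, fmt_offset)
data_offset, data_size = chunks[b"data"]

print("format tag:", format_tag)        # 6 is the A-law format code
print("sample rate:", sample_rate, "bits per sample:", bits)
print("samples start at offset", data_offset, "and span", data_size, "bytes")
```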
Okay, so I found out that the wav file is encoded and that libsndfile decodes it without any intervention. That is what caused the "unequal" values.
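For the record, the relationship between the bytes in the file and the numbers from sf_read_int can be sketched in Python. The decoder below is the usual G.711 A-law expansion to 16-bit samples; the final scaling by 65536 is an assumption on my part (that sf_read_int stretches 16-bit samples to the 32-bit int range), chosen because it reproduces the values quoted in the question.

```python
def alaw_to_linear(value):
    """Expand one G.711 A-law byte to a signed 16-bit linear sample."""
    value ^= 0x55                        # A-law bytes are stored with even bits inverted
    magnitude = (value & 0x0F) << 4      # 4-bit mantissa
    segment = (value & 0x70) >> 4        # 3-bit exponent
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law a set sign bit means a positive sample.
    return magnitude if value & 0x80 else -magnitude

first_samples = bytes.fromhex("D6 D4 56 54 D5 56 56")   # start of the data chunk
as_16_bit = [alaw_to_linear(b) for b in first_samples]
as_sf_read_int = [sample * 65536 for sample in as_16_bit]  # assumed 16 -> 32 bit scaling

print(as_16_bit)
print(as_sf_read_int)
```

The second list comes out as 3670016, 1572864, -3670016, -1572864, 524288, -3670016, -3670016, the same sequence as in the question.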
I am writing something to get at the information contained in an MPO multi-picture file produced by cameras such as the Fuji 3D cameras.
I have the proposed spec (the CIPA multi picture spec PDF), which is obviously written to confuse. I have navigated the normal Exif parts and extracted the MPF information.
Section 5.2.3, which describes the header, lists:
4 byte endian flag
4 byte offset
-- start of MP Index --
2 byte count
12 byte version
12 byte number of images
12 byte MP entry
12 byte Individual unique ID list
12 byte total number of capture frames
4 byte offset to next IFD.
The diagram shows the MP Entry and Unique ID list pointing to an offset (which would make 12 bytes).
However, the description that follows shows that the MP entry should be 16 bytes for every image (which it is in my sample file with two images), and the individual unique ID 33 bytes for every image (my sample doesn't have this).
Up to the number of images, everything is as it should be. I have 2 images, the version is correct, and there appear to be 3 blocks (count).
However, the third block (which is the MP Entry) has the right code, the correct number of bytes and the correct type, but contains the following information
32 00 00 00 52 00 00 00 02 00 02 20 40 63 1B 00
00 00 00 00 00 00 00 00 02 00 02 00 EE 6F 1B 00
The text says the contents of this should be
offset  length  name
0x00    4       Individual image attribute
0x04    4       Individual image size
0x08    4       Individual image offset
0x0C    2       Dependent image 1 entry number
0x0E    2       Dependent image 2 entry number
Clearly it makes no sense that, with two images (they are effectively 10MP JPEGs), the sizes would be 52 bytes and 0.
Could anyone take a look at this and check I am not going mad in my interpretation of it, or does anyone know what should be here?
Sorry, I know it is a bit complicated but I really cannot see where this is going wrong.
I believe your tag is ok.
Chapter 5.2 (MP Extensions) of the spec describes the MP Index IFD as follows:
Count (2 bytes)
MP Index Fields (Overall structure Info)
Offset of Next IFD (4 bytes)
Value (MP Index IFD)
Up to the number of images, everything is fine, as per the spec. Starting from the MP Entry tag, the bytes should be parsed as follows:
(I'll be using data parsed from my MPO file)
02 b0 07 00 20 00 00 00 32 00 00 00 52 00 00 00
02 00 02 20 00 13 18 00 00 00 00 00 00 00 00 00
02-b0 - Tag ID (MP entry), 2 bytes
07-00 - Type (7 = undefined), 2 bytes
20-00-00-00 - 32 (16 x NumberOfImages, two images in my file, this value was parsed before), 4 bytes
32-00-00-00 - Offset of First IFD (50 bytes, starting from endianness tag), 4 bytes
That's where the MP entry data ends. It's odd that there are no Individual Image Unique ID List (b003) or Total Number of Captured Frames (b004) tags; maybe they aren't required. Anyway, the next 4 bytes give the offset of the next IFD, i.e.:
52-00-00-00 - Offset of Next IFD (82 bytes, starting from endianness tag), 4 bytes
First IFD (in my case) starts immediately after the offset of Next IFD:
02-00-02-20 - Individual image attribute, 4 bytes
00-13-18-00 - Individual image size (which stands for 1 577 728 bytes of data in my case), 4 bytes
00-00-00-00 - Individual image data offset (0 in the case of the first image), 4 bytes
00-00 - Dependent image 1 (no dependent image), 2 bytes
00-00 - Dependent image 2 (no dependent image), 2 bytes
In your case, the data should be parsed as follows:
32 00 00 00 52 00 00 00 02 00 02 20 40 63 1B 00
00 00 00 00 00 00 00 00 02 00 02 00 EE 6F 1B 00
32-00-00-00 - Offset of First IFD (50 bytes, starting from endianness tag), 4 bytes
52-00-00-00 - Offset of Next IFD (82 bytes, starting from endianness tag), 4 bytes
02-00-02-20 - Individual image attribute, 4 bytes
40-63-1B-00 - Individual image size (which stands for 1 794 880 bytes of data), 4 bytes
00-00-00-00 - Individual image data offset (0 in the case of the first image), 4 bytes
00-00 - Dependent image 1 (no dependent image), 2 bytes
00-00 - Dependent image 2 (no dependent image), 2 bytes
And starting the next image's data:
02-00-02-00 - Individual image attribute, 4 bytes
EE-6F-1B-00 - Individual image size (which stands for 1 798 126 bytes of data), 4 bytes
etc.
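The walk-through above can be checked mechanically. This is a minimal Python sketch that unpacks the 32 bytes from the question as little-endian fields, following the 16-byte MP Entry layout (attribute, size, offset, two dependent entry numbers). The dump stops halfway through the second entry, so only its first two fields are read. Variable names are mine.

```python
import struct

dump = bytes.fromhex(
    "32000000" "52000000"                                # value offset, next IFD offset
    "02000220" "40631B00" "00000000" "0000" "0000"       # MP Entry for image 1
    "02000200" "EE6F1B00"                                # start of MP Entry for image 2
)

value_offset, next_ifd_offset = struct.unpack_from("<II", dump, 0)

# Full 16-byte entry: attribute, size, data offset, dependent 1, dependent 2.
attr1, size1, offset1, dep1a, dep1b = struct.unpack_from("<IIIHH", dump, 8)

# Only the attribute and size of the second entry are present in the dump.
attr2, size2 = struct.unpack_from("<II", dump, 24)

print("value offset:", value_offset, "next IFD offset:", next_ifd_offset)
print("image 1: attribute 0x%08X, size %d, offset %d" % (attr1, size1, offset1))
print("image 2: attribute 0x%08X, size %d" % (attr2, size2))
```

Read this way, the 52 that looked like an image size is the next-IFD offset, and the two image sizes come out at roughly 1.8 MB each.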
ABOUT IMAGE SIZE AND OFFSETS
Image size and offsets can be a bit confusing. The spec says that:
"Image size is the data between SOI and EOI markers" (ch. 5.2.3.3.2)
and
"Data offset (for the second image) is specified relative to the address of the MP Endian field in the MP Header" (ch. 5.2.3.3.3)
which I understand as follows:
SOI---MPF_FIELDS-------------EOI
^
MP endian field
XXXXXXXXXXXXXXXXXXXXXXXXXXSOI--------------------------EOI
(The first image is, say, 2164288 bytes between the SOI and EOI markers, including the markers; the second image is 2221368 bytes, also including the markers. XXXXX stands for the offset.)
The offset of the second image means that its SOI marker starts lengthOf(XXXXX) bytes after the MP Endian field, which is in the first image's MP field.
I suspect that if you subtract the offset value from the first image size, you should get the MP Endian field position.
I believe the base address is 8 bytes into the APP2 header, just after the "MPF\0" and right before "II*\0", assuming little endian. In the hex dump you've provided there's no MPO tag included but I guess "32 00 00 00" points to the offset of the MPEntries and "52 00 00 00" is the offset to the next IFD. What you want to be looking at is "02 00 02 20" which probably is the attribute of the first image.
Remember if the offset is "00 00 00 00", it's not relative but absolute.
When I send a DNS query to the DNS server, it returns a header with the format error bit set.
This indicates there is a problem with the format, but I am failing to see what it is. It's possible I have misinterpreted or misread the RFC, but right now I can't seem to work it out.
The DNS structure I am sending looks like this in hex.
Header
00 01 - ID = 1
01 00 - RD = 1
00 01 - QD = 1
00 00 - AN
00 00 - NS
00 00 - AR
Question for www.google.com
03 77 - 3 w
77 77 - w w
06 67 - 6 g
6f 6f - o o
67 6c - g l
65 03 - e 3
63 6f - c o
6d 00 - m 0
00 01 - QTYPE
00 01 - QCLASS
I then flip the bytes for any field that is two bytes, to convert to big endian for the network format. So each row of the header, and then QTYPE and QCLASS ...
Here's what a byte-by-byte hexdump of that query packet should look like (tested and working!):
00000000 00 01 01 00 00 01 00 00 00 00 00 00 03 77 77 77 |.............www|
00000010 06 67 6f 6f 67 6c 65 03 63 6f 6d 00 00 01 00 01 |.google.com.....|
I think your problem is that the third and fourth bytes of the packet (flags and rcode) are two single-byte fields, not one 2-byte field - it looks like you might be treating it as a 16 bit integer and swapping the bytes?
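For comparison, here is a minimal Python sketch that builds the same query. It packs every 16-bit field with struct's ">" (network order) prefix, so no manual byte swapping is involved. The function name and defaults are made up for illustration.

```python
import struct

def build_query(name, query_id=1, qtype=1, qclass=1):
    """Build a DNS query packet with RD set and a single question."""
    flags = 0x0100                       # only the RD (recursion desired) bit
    # ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT, all big-endian.
    header = struct.pack(">HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, qclass)

packet = build_query("www.google.com")
print(packet.hex(" "))
```

The output matches the 32-byte hexdump above byte for byte. Note that the name bytes are never swapped; only the multi-byte integers have a byte order at all.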
To get these you can use netcat and dig.
# nc -ulp 53 > dnsreqdump
# dig www.example.com @localhost
# nc -u 8.8.8.8 53 < dnsreqdump > dnsrespdump
Now you can inspect them in hexedit or your favorite hex editor.
I tend to think that your problem lies in how you are actually "flipping the bytes to convert to network format".
Typical C library implementations provide the htons()/htonl() family of functions to convert from host to network byte order and vice versa.
Of course, without seeing the code, I cannot be sure that this is the problem.