Open a wav file to read data in text editor vs in sndfile - c

I want to read the data of an 8bit wav file using textPad, I know that the data is located at the 44/46th byte, but I am having problems reading it.
I have that code:
52 49 46 46 F8 37 01 00 57 41 56 45 66 6D 74 20
12 00 00 00 06 00 01 00 40 1F 00 00 40 1F 00 00
01 00 08 00 00 00 66 61 63 74 04 00 00 00 C6 37
01 00 64 61 74 61 C6 37 01 00 D6 D4 56 54 D5 56
56 51 D4 D3 D0 D6 54 57 D4 54 57 51 57 D0 D3 D1
etc.
The part in bold is the data of it.
The problem is when i read it in sndfile using sf_read_int I get in the buffer the following values:
3670016 1572864 -3670016 -1572864 524288 -3670016 -3670016
etc
How am I supposed to read the data in the wav file? What is the equation or 'relationship' between the numbers i got in sndfile and the data in textPad?
Oh and one more thing, if i switch the reading to sf_read_float instead of int i get values between -0.0001 and +0.0001...
Any idea what's happening, the writing and data processing is very good but I don't understand what's up with these values.
Thank you.

You have some pattern to see in a .wav file:
"RIFF" : 0x52 0x49 0x46 0x46
"WAVE" : 0x57 0x41 0x56 0x45
"fmt " : 0x66 0x6d 0x74 0x20
"data" : 0x64 0x61 0x74 0x61
We see 64 61 74 61 at offset 50. So data begins only at offset 54 and not 46.
You can find a wave specification to understand how is encoded your file.
Thanks to this spec, I can tell you that your file is encoded in "8-bit ITU-T G.711 A-law".

Okay so i found out that the wav file is encoded and libsndfile takes care of it without any intervention. That caused the "unequal" values.

Related

Interpret DBus Messages

I was trying to interpret the bytes in a DBus Message as specified in https://dbus.freedesktop.org/doc/dbus-specification.html. This is taken from a pcap while using the Frida tool.
The bytes are
0000 6c 01 00 01 08 00 00 00 01 00 00 00 70 00 00 00
0010 01 01 6f 00 15 00 00 00 2f 72 65 2f 66 72 69 64
0020 61 2f 48 6f 73 74 53 65 73 73 69 6f 6e 00 00 00
0030 02 01 73 00 16 00 00 00 72 65 2e 66 72 69 64 61
0040 2e 48 6f 73 74 53 65 73 73 69 6f 6e 31 35 00 00
0050 08 01 67 00 05 61 7b 73 76 7d 00 00 00 00 00 00
0060 03 01 73 00 17 00 00 00 47 65 74 46 72 6f 6e 74
0070 6d 6f 73 74 41 70 70 6c 69 63 61 74 69 6f 6e 00
0080 00 00 00 00 00 00 00 00
There are some fields which I am uncertain what they mean. Appreciate if anyone can provide some guidance on this.
0x6C: Refers to little endian
0x01: Message Type (Method Call)
0x00: Bitwise of OR flags
0x01: Major Protocol Version
0x08000000: Length of Message Body (Little Endian), starting from end of Header. This should be referring to the eight null bytes at the end?
0x01000000: Serial of this Message (Little Endian)
0x70000000: (Little Endian) Not sure what this represents? This value does correspond to the length of the array of struct, excluding trailing null bytes, that starts from 0x0010 and ends at 0x007F.
0x01: Decimal Code for Object Path
0x01: Not sure what this represents?
0x6F: DBus Type 'o' for Object Path
0x15: Length of the Object Path string
You want to look at the part of the specification that tells you what the message format is.
But to answer your questions:
0x08000000: Length of Message Body (Little Endian), starting from end of Header. This should be referring to the eight null bytes at the end?
Correct.
0x70000000: (Little Endian) Not sure what this represents? This value does correspond to the length of the array of struct, excluding trailing null bytes, that starts from 0x0010 and ends at 0x007F.
That's the length of the array in the header. The DBus header is of a variable size - after the first few bytes, it is an array of struct(byte,variant). As per the documentation, that looks like a(yv) if you were to express this as a DBus type signature.
0x01: Decimal Code for Object Path
0x01: Not sure what this represents?
This is where the parsing gets interesting: in our struct, the signature is yv, so the first 0x01 is telling us that this struct entry is the header field for Object Path, as you have seen. However, we now need to parse what the variant contains inside of it. To marshal a variant, you first marshal a signature, which in this case is 1 byte long: 01 6f 00. Note that signatures can be a max of 255 bytes long, so unlike other strings they only have a 1-byte length at the front. As a string, that is o, which tells us that this variant contains an object path inside of it. Since object paths are strings, we then decode the next bytes as a string(keeping note that the leading 4 bytes are the string length): 15 00 00 00 2f 72 65 2f 66 72 69 64 61 2f 48 6f 73 74 53 65 73 73 69 6f 6e 00
If I've done the conversion correctly, that says /re/frida/HostSession
This is taken from a pcap
If it's a standard pcap (or pcapng) D-Bus capture file, using the LINKTYPE_DBUS link-layer type, then Wireshark should be able to read it and, at least to some degree, interpret the messages (i.e., it has code that understands the message format, as defined by the specification to which #rm5248 referred (and to which the LINKTYPE_DBUS entry in the list of link-layer header types refers), so you might not have to interpret all the bytes by yourself.

How to parse this Apple-specific MPEG file header?

I have an MPEG file which starts like this:
0: 00 0f 6d 79 5f 66 69 6c 65 6e 61 6d 65 2e 6d 70 ..my_filename.mp
10: 67 00 04 fc 00 00 f0 00 b2 10 39 a8 b2 10 39 ad g.........9...9.
20: 0f 6d 79 5f 66 69 6c 65 6e 61 6d 65 2e 6d 70 67 .my_filename.mpg
30: 03 92 3b 40 00 00 00 00 03 7a b5 7c 03 7a d7 d0 ..;#.....z.|.z..
40: 00 4d 6f 6f 56 54 56 4f 44 01 00 01 2a 00 80 00 .MooVTVOD...*...
50: 00 00 00 00 36 b2 83 00 00 04 fc b2 10 39 a8 b2 ....6........9..
60: 10 39 ad 00 00 00 00 00 00 00 00 00 00 00 00 00 .9..............
70: 00 00 00 00 00 00 00 00 00 00 81 81 35 d3 00 00 ............5...
80: 00 36 b2 83 6d 64 61 74 00 00 01 ba 21 00 01 00 .6..mdat....!...
90: 05 80 2b 81 00 00 01 bb 00 0c 80 2f d9 04 e1 ff ..+......../....
a0: c0 c0 20 e0 e0 2e 00 00 01 c0 07 ea ff ff ff ff .. .............
What is the file format of the beginning of the file (first 0x80 bytes), and how do I parse it?
I've run a Google search on MooVTVOD, it looks like something related to QuickTime and iTunes.
What I understand already:
There is 4 bytes of big endian file size in front mdat, according to the QuickTime .mov file format when the .mov contains an MPEG.
Right after mdat there is the MPEG-PS header 00 00 01 ba. Shortly after there is the MPEG-PES header 00 00 01 c0 indicating an audio stream.
However, the first 0x80 bytes in this file seem to be in a different file file format (not QuickTime .mov, not MPEG-PS, not MPEG-PES), and in this question I'm only interested in the file format of the first 0x80 bytes.
Media players such as VLC routinely ignore junk at the beginning of the file, and start playing the MPEG-PS stream at offset 0x80. However, I'm interested in the 0x80 bytes they ignore.
The file format is "Quicktime movie atom," that contains meta information about the media file, or the media itself. mdat is a Media Data atom.
Media atoms describe and define a track’s media type and sample data.
the spec is here: https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFChap2/qtff2.html
You can parse it with this python script:
https://github.com/kzahel/quicktime-parse
or this software, mentioned in the linked question:
https://archive.codeplex.com/?p=mp4explorer
The file format of the first 0x80 bytes in the question is MacBinary II.
Based on the file format description https://github.com/mietek/theunarchiver/wiki/MacBinarySpecs , it's not MacBinary I, because MacBinary II has data[0x7a] == '\x81' and data[0x7b] == '\x81' (and MacBinary I has data[0x7a] == '\x00', and MacBinary III has data[0x7a] == '\x82'); and it's also not MacBinary III, because that one has data[0x66 : 0x6a] == 'mBIN'.
The CRC value data[0x7c : 0x7e] is incorrect because the filename was modified before posting to StackOverflow. FYI The CRC algorithm used is CRC16/XMODEM (on https://crccalc.com/), also implemented as CalcCRC in
MacBinary.c.
The data[0x41 : 0x45] == 'MooV' is the file type code. According to the Excel spreadsheet downloadable from TCDB, it means QuickTime movie (video and maybe audio).
The data[0x45 : 0x49] == 'TVOD' is the file creator code. TCDB and this database indicate that it means QuickTime Player.
More info and links about MacBinary: http://fileformats.archiveteam.org/wiki/MacBinary
Please note that all these headers are unnecessary to play the video: by removing the first 0x88 bytes we get an MPEG-PS video file, which many video players can play (not only QuickTime Player, and not only players running on macOS).

h264 inside AVI, MP4 and "Raw" h264 streams. Different format of NAL units (or ffmpeg bug)

TL;DR: I want to read raw h264 streams from AVI/MP4 files, even broken/incomplete.
Almost every document about h264 tells me that it consists of NAL packets. Okay. Almost everywhere told to me that the packet should start with a signature like 00 00 01 or 00 00 00 01. For example, https://stackoverflow.com/a/18638298/8167678, https://stackoverflow.com/a/17625537/8167678
The format of H.264 is that it’s made up of NAL Units, each starting
with a start prefix of three bytes with the values 0x00, 0x00, 0x01
and each unit has a different type depending on the value of the 4th
byte right after these 3 starting bytes. One NAL Unit IS NOT one frame
in the video, each frame is made up of a number of NAL Units.
Okay.
I downloaded random_youtube_video.mp4 and strip out one frame from it:
ffmpeg -ss 10 -i random_youtube_video.mp4 -frames 1 -c copy pic.avi
And got:
Red part - this is part of AVI container, other - actual data.
As you can see, here I have 00 00 24 A9 instead of 00 00 00 01
This AVI file plays perfectly
I do same for mp4 container:
As you can see, here exact same bytes.
This MP4 file plays perfectly
I try to strip out raw data:
ffmpeg -i pic.avi -c copy pic.h264
This file can't play in VLC or even ffmpeg, which produced this file, can't parse it:
I downloaded mp4 stream analyzer and got:
MP4Box tells me:
Cannot find H264 start code
Error importing pic.h264: BitStream Not Compliant
It very hard to learn internals of h264, when nothing works.
So, I have questions:
What actual data inside mp4?
What I must read to decode that data (I mean different annex-es)
How to read stream and get decoded image (even with ffmpeg) from this "broken" raw stream?
UPDATE:
It seems bug in ffmpeg:
When I do double conversion:
ffmpeg -ss 10 -i random_youtube_video.mp4 -frames 1 -c copy pic.mp4
ffmpeg pic.mp4 -c copy pic.h264
But when I convert file directly:
ffmpeg -ss 10 -i random_youtube_video.mp4 -frames 1 -c copy pic.h264
I have NALs signatures and one extra NAL unit. Other bytes are same (selected).
This is bug?
UPDATE
Not, this is not bug, U must use option -bsf h264_mp4toannexb to save stream as "Annex B" format (with prefixes)
"I want to read raw h264 streams from AVI files, even broken/incomplete."
"Almost everywhere told to me that the packet should start with a signature like : 00 00 01 or 00 00 00 01"
"...As you can see, here I have 00 00 24 A9 instead of 00 00 00 01"
Your H264 is in AVCC format which means it uses data sizes (instead of data start codes). It is only Annex-B that will have your mentioned signature as start code.
You seek frames, not by looking for start codes, but instead you just do skipping by frame sizes to reach the final correct offset of a (requested) frame...
AVI processing :
Read size (four) bytes (32-bit integer, Little Endian).
Extract the next following bytes up to size amount.
This is your H.264 frame (in AVCC format), decode the bytes to view image.
To convert into Annex-B, try replacing first 4 bytes of H.264 frame bytes with 00 00 00 01.
Consider your shown AVI bytes (see first picture) :
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 4C 49 53 54 BA 24 00 00 6D 6F 76 69 ....LISTº$..movi
30 30 64 63 AD 24 00 00 00 00 24 A9 65 88 84 27 00dc.$....$©eˆ„'
C7 11 FE B3 C7 83 08 00 08 2A 7B 6E 59 B5 71 E1 Ç.þ³Çƒ...*{nYµqá
E3 9C 0E 73 E7 10 50 00 18 E9 25 F7 AA 7D 9C 30 ãœ.sç.P..é%÷ª}œ0
E6 2F 0F 20 00 3A 64 AA CA 5E 4F CA FF AE 20 04 æ/. .:dªÊ^OÊÿ® .
07 81 40 00 48 00 0A 28 71 21 84 48 06 18 90 0C ..#.H..(q!„H....
31 14 57 9E 7A CD 63 A0 E0 9B 96 69 C5 18 AE F2 1.WžzÍc à›–iÅ.®ò
E6 07 02 29 01 20 10 70 A1 0F 8C BC 73 F0 78 FA æ..). .p¡.Œ¼sðxú
9E 1D E1 C2 BF 8C 62 CE CE AC 14 5A A4 E1 45 44 ž.á¿ŒbÎά.Z¤áED
38 38 85 DB 12 57 3E F6 E0 FB AE 03 04 21 62 8D 88…Û.W>öàû®..!b.
F6 F1 1E 37 1C A2 FF 75 1C F1 02 66 0C 92 07 06 öñ.7.¢ÿu.ñ.f.’..
15 7C 90 15 6F 7D FC BD 13 1E 2B 0C 14 3C 0C 00 .|..o}ü½..+..<..
B0 EA 6F 53 B4 98 D7 80 7A 68 3E 34 69 20 D2 FA °êoS´˜×€zh>4i Òú
F0 91 FC 75 C6 00 01 18 C0 00 3B 9A C5 E2 7D BF ð‘üuÆ...À.;šÅâ}¿
Some explanation :
Ignore leading multiple 00 bytes.
4C 49 53 54 D6 3C 00 00 6D 6F 76 69 including 30 30 64 63 = AVI "List" header.
AD 24 00 00 == decimal 9389 is AVI's own size of H264 item (must read in Little Endian).
Notice that the AVI bytes include...
- a note of item's total size (AD 24 00 00... or reverse for Little Endian : 00 00 24 AD)
- followed by item data (00 00 24 A9 65 88 84 27 ... etc ... C5 E2 7D BF).
This size includes both the 4 bytes of the AVI's"size" entry + expected bytes length of the item's own bytes. Can be written simply as:
AVI_Item_Size = ( 4 + item_H264_Frame.length );
H.264 video frame bytes in AVI :
Next follows the item data, which is the H.264 video frame. By sheer coincidence of formats/bytes layout, it too holds a 4-byte entry for data's size (since your H264 is in AVCC format, if it was Annex-B then you would be seeing start code bytes here instead of size bytes).
Unlike AVI bytes, these H264 size bytes are written in Big Endian format.
00 00 24 A9 = size of bytes for this video frame (instead of start code : 00 00 00 01).
65 88 84 27 C7 11 FE B3 C7 = H.264 keyframe (always begins X5, where the X value is based on other settings).
Remember after four size bytes (or even start codes) if followed by...
byte X5 = keyframe (IDR), example byte 65.
byte X1 = P or B frame, example byte 41.
byte X6 = SEI (Supplemental Enhancement Information).
byte X7 = SPS (Sequence Parameter Set).
byte X8 = PPS (Picture Parameter Set).
bytes 00 00 00 X9 = Access unit delimiter.
You can find the H.264 if you search for exact same bytes within AVI file. See third picture, these are your H.264 bytes (they are cut & pasted into the AVI container).
Sometimes a frame is sliced into different NAL units. So if you extract a key frame and it only shows 1/2 or 1/3 instead of full image, just grab next one or two NAL and re-try the decode.

searching Tags (TLV) with C function

I' m working on Mastercard Paypass transactions, I Have sent a READ RECORD command and got the result:
70 81 AB 57 11 54 13 33 00 89 60 10 83 D2 51 22
20 01 23 40 91 72 5A 08 54 13 33 00 89 60 10 83
5F 24 03 25 12 31 5F 25 03 04 01 01 5F 28 02 00
56 5F 34 01 01 8C 21 9F 02 06 9F 03 06 9F 1A 02
95 05 5F 2A 02 9A 03 9C 01 9F 37 04 9F 35 01 9F
45 02 9F 4C 08 9F 34 03 8D 0C 91 0A 8A 02 95 05
9F 37 04 9F 4C 08 8E 0E 00 00 00 00 00 00 00 00
42 03 5E 03 1F 03 9F 07 02 FF 00 9F 08 02 00 02
9F 0D 05 00 00 00 00 00 9F 0E 05 00 08 00 60 00
9F 0F 05 00 00 00 00 00 9F 42 02 09 78 9F 4A 01
82 9F 14 01 00 9F 23 01 00 9F 13 02 00 00
This response contains TLV data objects (without spaces). I have converted the response as described in the following:
// Read Record 1 with SFI2
//---------------------------------SEND READ RECORD-------------------
inCtlsSendVAPDU(0x2C,0x03,(unsigned char *)"\x00\xB2\x01\x14\x00",5);
clrscr();
inRet = inCTLSRecv2(Response, 269);
LOG_HEX_PRINTF("Essai EMV4 Read record 1 EMV Paypass:",Response,inRet);
if(Response[14]==0x70)
{
sprintf(Response_PPSE,"%02X%02X",Response[12],Response[13]);//To retrieve length of received data
t1=hexToInt(Response_PPSE);// Convert length to integer
t11=t1-2;
i=14;
k=0;
//--------------------------- Converting data to be used later---------
while(i<t11+14)// 14 to escape the header+ command+ status+ length
{
sprintf(READ1+(2*k),"%02X",Response[i]);
i++;
k++;
}
Now I should check if this Response contains the Mandatory Tags:
5A - Application Primary Account Number (PAN)
5F24 - Application Expiration Date
8C - Card Risk Management Data Object List 1 (CDOL1)
8D - Card Risk Management Data Object List 2 (CDOL2)
So I tried the following to check for the 5A tag (Application Primary Account Number (PAN)):
i=0;
t11=2*t11;
while(i<=t11)
{
strncpy(Response_PPSE,READ1+i,2);
if(strncmp(Response_PPSE,"\x05\x0A")==0)
{
write_at("true",4,1,1);// Just to test on the terminal display
goto end;
}
else
i=i+2;
}
goto end;
I don't know why nothing is displayed on the terminal, The if block is not executed!
I tried to print the 5A tag manually by:
strncpy(Response_PPSE,READ1+44,2);
write_at(Response_PPSE,strlen(Response_PPSE),1,1);
And it display the right value!!
Can someone helps to resolve this issue?
You don't find that tag because you are not searching for the string "5A" but for the string "\x05\x0A" (ENQ character + line feed character). Moreover, I wonder if the above code actually compiles as you did not specify the mandatory length argument to strncmp(). You could try something like
if(strncmp(Response_PPSE,"5A", 2)==0)
instead.
However, you should understand that you are scanning the whole response data for the value 5A. Therefore, finding this value could also mean that it was part of some other TLV tag's data field, length field or even part of a multi-byte tag field. It would therefore make sense to implement (or use an existing) TLV parser for the BER (Basic Encoding Rules) format.
It's not a good approach to search a specific byte in a raw byte-stream data using strings functions at first place.
The generic TLV parser is a very easy algorithm and you will do it in 30 minutes or so.
In general a pseudo-code for TLV parser that look for a specific Tag would be something like this:
index = 0
while (byte[i] != 0x5A or EOF)
{
length = DecodeLength(byte[i+1])
i += length + 2 // + 1 for L (length) byte itself, it might be encoded with
// 2 bytes so the function DecodeLength can return the number
// of bytes lenght has been encoded
// +1 for T (tag) byte
}
if(EOF) return tag_not_found
return byte[i + 2], length // pointer to data for Tag '5A'and length of data

Visual C++, Extract text information from a DLL database (sentence mining)

The 2006 version of the free offline Chinese sentence dictionary Jukuu contains a collection of 100,000 publicly sourced example sentences in Chinese and English in a .dll file.
The application size is about 80mb, but once installed a 500mb DLL dictionary file is created with the source text. For whatever reason the application doesn't run on my computer, and I'd like to extract all the example sentences so I can do some POS analysis on them.
Opening the 500mb .DLL file is mostly gibberish, except for some fragments of text here and there and references to other resources.
I'm wondering if there is any way I can extract the information in plain text?
The application can be downloaded here: http://www.jukuu.com/down/download.html
Thanks!
Edit: Never-mind, It looks like when viewed in HEX, the file is ordered in a way that is not conducive to sentence mining at all:
00 06 00 02 00 07 03 00 01 00 00 00 FF FE 6E 65 76 65 72 7E 73 74 61
6E 64 20 75 70 0B 00 00 80 00 00 00 00 00 00 00 00 FF FE 33 30 32
32 31 38 32 36 38 2D 00 16 00 06 00 02 00 07 03 00 01 00 00 00 FF
FE 65 78 74 72 65 6D 65 6C 79 7E 63 6C 6F 73 65 0B 00 00 80 00 00
00 00 00 00 00
Something like ÿþnever~stand upÿþ302218268-ÿþextremely~close
Any other ideas on how to mine sentences from the application? Maybe a batch script?

Resources