Visual C++: extract text information from a DLL database (sentence mining)

The 2006 version of the free offline Chinese sentence dictionary Jukuu contains a collection of 100,000 publicly sourced example sentences in Chinese and English in a .dll file.
The application is about 80 MB, but once installed it creates a 500 MB DLL dictionary file containing the source text. For whatever reason the application doesn't run on my computer, and I'd like to extract all the example sentences so I can do some POS analysis on them.
Opened in a text editor, the 500 MB .dll file is mostly gibberish, except for some fragments of text here and there and references to other resources.
Is there any way to extract the information as plain text?
The application can be downloaded here: http://www.jukuu.com/down/download.html
Thanks!
Edit: Never mind. Viewed in hex, the file appears to be laid out in a way that is not conducive to sentence mining at all:
00 06 00 02 00 07 03 00 01 00 00 00 FF FE 6E 65 76 65 72 7E 73 74 61
6E 64 20 75 70 0B 00 00 80 00 00 00 00 00 00 00 00 FF FE 33 30 32
32 31 38 32 36 38 2D 00 16 00 06 00 02 00 07 03 00 01 00 00 00 FF
FE 65 78 74 72 65 6D 65 6C 79 7E 63 6C 6F 73 65 0B 00 00 80 00 00
00 00 00 00 00
Something like ÿþnever~stand upÿþ302218268-ÿþextremely~close
Any other ideas on how to mine sentences from the application? Maybe a batch script?
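One thing that might be worth trying before giving up: in the bytes above, each readable fragment sits directly after an FF FE marker and ends at the first non-printable byte. The following is only a minimal sketch built on that observation. The function name is made up, and it assumes the fragments are plain ASCII, which will not hold for the Chinese half of the sentences.

```python
import re

# Assumption taken from the dump only: a readable fragment is a run of
# printable ASCII bytes that directly follows an FF FE marker.
MARKED_RUN = re.compile(rb"\xff\xfe([\x20-\x7e]+)")


def extract_marked_strings(data):
    """Return every printable-ASCII run that follows an FF FE marker."""
    return [m.group(1).decode("ascii") for m in MARKED_RUN.finditer(data)]


# The bytes quoted in the question.
sample = bytes.fromhex(
    "00 06 00 02 00 07 03 00 01 00 00 00 FF FE 6E 65 76 65 72 7E 73 74 61"
    "6E 64 20 75 70 0B 00 00 80 00 00 00 00 00 00 00 00 FF FE 33 30 32"
    "32 31 38 32 36 38 2D 00 16 00 06 00 02 00 07 03 00 01 00 00 00 FF"
    "FE 65 78 74 72 65 6D 65 6C 79 7E 63 6C 6F 73 65 0B 00 00 80 00 00"
    "00 00 00 00 00"
)

for text in extract_marked_strings(sample):
    print(text)
```

For the real 500 MB file you would read it in chunks (or memory-map it) rather than load it whole, and the non-ASCII text would need its own encoding guess.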

Related

How to parse this Apple-specific MPEG file header?

I have an MPEG file which starts like this:
0: 00 0f 6d 79 5f 66 69 6c 65 6e 61 6d 65 2e 6d 70 ..my_filename.mp
10: 67 00 04 fc 00 00 f0 00 b2 10 39 a8 b2 10 39 ad g.........9...9.
20: 0f 6d 79 5f 66 69 6c 65 6e 61 6d 65 2e 6d 70 67 .my_filename.mpg
30: 03 92 3b 40 00 00 00 00 03 7a b5 7c 03 7a d7 d0 ..;@.....z.|.z..
40: 00 4d 6f 6f 56 54 56 4f 44 01 00 01 2a 00 80 00 .MooVTVOD...*...
50: 00 00 00 00 36 b2 83 00 00 04 fc b2 10 39 a8 b2 ....6........9..
60: 10 39 ad 00 00 00 00 00 00 00 00 00 00 00 00 00 .9..............
70: 00 00 00 00 00 00 00 00 00 00 81 81 35 d3 00 00 ............5...
80: 00 36 b2 83 6d 64 61 74 00 00 01 ba 21 00 01 00 .6..mdat....!...
90: 05 80 2b 81 00 00 01 bb 00 0c 80 2f d9 04 e1 ff ..+......../....
a0: c0 c0 20 e0 e0 2e 00 00 01 c0 07 ea ff ff ff ff .. .............
What is the file format of the beginning of the file (first 0x80 bytes), and how do I parse it?
I've run a Google search on MooVTVOD, it looks like something related to QuickTime and iTunes.
What I understand already:
There are 4 bytes of big-endian size in front of mdat, as in the QuickTime .mov file format when the .mov contains an MPEG stream.
Right after mdat there is the MPEG-PS header 00 00 01 ba. Shortly after there is the MPEG-PES header 00 00 01 c0 indicating an audio stream.
However, the first 0x80 bytes in this file seem to be in a different file format (not QuickTime .mov, not MPEG-PS, not MPEG-PES), and in this question I'm only interested in the file format of the first 0x80 bytes.
Media players such as VLC routinely ignore junk at the beginning of the file, and start playing the MPEG-PS stream at offset 0x80. However, I'm interested in the 0x80 bytes they ignore.
The file format is a "QuickTime movie atom", which contains meta information about the media file or the media itself. mdat is a Media Data atom.
Media atoms describe and define a track's media type and sample data.
The spec is here: https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFChap2/qtff2.html
You can parse it with this Python script:
https://github.com/kzahel/quicktime-parse
or this software, mentioned in the linked question:
https://archive.codeplex.com/?p=mp4explorer
The file format of the first 0x80 bytes in the question is MacBinary II.
Based on the file format description https://github.com/mietek/theunarchiver/wiki/MacBinarySpecs , it's not MacBinary I, because MacBinary II has data[0x7a] == '\x81' and data[0x7b] == '\x81' (and MacBinary I has data[0x7a] == '\x00', and MacBinary III has data[0x7a] == '\x82'); and it's also not MacBinary III, because that one has data[0x66 : 0x6a] == 'mBIN'.
The CRC value data[0x7c : 0x7e] is incorrect because the filename was modified before posting to StackOverflow. FYI, the CRC algorithm used is CRC16/XMODEM (on https://crccalc.com/), also implemented as CalcCRC in MacBinary.c.
The data[0x41 : 0x45] == 'MooV' is the file type code. According to the Excel spreadsheet downloadable from TCDB, it means QuickTime movie (video and maybe audio).
The data[0x45 : 0x49] == 'TVOD' is the file creator code. TCDB and this database indicate that it means QuickTime Player.
More info and links about MacBinary: http://fileformats.archiveteam.org/wiki/MacBinary
Please note that all these headers are unnecessary to play the video: by removing the first 0x88 bytes we get an MPEG-PS video file, which many video players can play (not only QuickTime Player, and not only players running on macOS).
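To make the offsets above concrete, here is a rough Python sketch that reads the fields mentioned in this answer out of a 128-byte header. The function names are my own. The offsets for the filename and the two fork lengths (0x01, 0x53, 0x57) come from the MacBinary spec linked above rather than from this answer, and the CRC routine is a plain bitwise CRC16/XMODEM over the first 0x7c bytes.

```python
import struct


def crc16_xmodem(data):
    """Bitwise CRC16/XMODEM: polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def parse_macbinary2_header(header):
    """Pick the fields discussed above out of a 128-byte MacBinary header."""
    if len(header) < 0x80:
        raise ValueError("need at least 128 bytes")
    name_len = header[1]
    data_len, rsrc_len = struct.unpack_from(">II", header, 0x53)
    (stored_crc,) = struct.unpack_from(">H", header, 0x7C)
    return {
        "filename": header[2:2 + name_len].decode("mac_roman"),
        "file_type": header[0x41:0x45].decode("mac_roman"),
        "creator": header[0x45:0x49].decode("mac_roman"),
        "data_fork_length": data_len,
        "resource_fork_length": rsrc_len,
        "version": header[0x7A],
        "min_version": header[0x7B],
        "crc_ok": stored_crc == crc16_xmodem(header[:0x7C]),
    }


# First 0x80 bytes of the file, copied from the hex dump in the question.
sample = bytes.fromhex(
    "00 0f 6d 79 5f 66 69 6c 65 6e 61 6d 65 2e 6d 70"
    "67 00 04 fc 00 00 f0 00 b2 10 39 a8 b2 10 39 ad"
    "0f 6d 79 5f 66 69 6c 65 6e 61 6d 65 2e 6d 70 67"
    "03 92 3b 40 00 00 00 00 03 7a b5 7c 03 7a d7 d0"
    "00 4d 6f 6f 56 54 56 4f 44 01 00 01 2a 00 80 00"
    "00 00 00 00 36 b2 83 00 00 04 fc b2 10 39 a8 b2"
    "10 39 ad 00 00 00 00 00 00 00 00 00 00 00 00 00"
    "00 00 00 00 00 00 00 00 00 00 81 81 35 d3 00 00"
)

for key, value in parse_macbinary2_header(sample).items():
    print(key, value)
```

Note that the data fork length it reports (0x0036b283) is the same value as the 4-byte size stored at offset 0x80 in front of mdat, which fits the data fork starting right after the header.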

How to Decode XML Blob field in D7

I'm having a problem trying to decode XML data returned by an instance of MS SQL Server 2014 to an app written in D7. (the version of Indy is the one which came with it, 9.00.10).
Update: When I originally wrote this question, I was under the impression that the contents of the blob field needed to be Base64-decoded, but it seems that was wrong. Having followed Remy Lebeau's suggestion, the blob stream contains recognisable text in the field names and field values before decoding but not afterwards.
In the code below, the SQL in the AdoQuery is simply
Select * from Authors where au_lname = 'White' For XML Auto
the Authors table being the one in the demo 'pubs' database. I've added the "Where" clause to restrict the size of the result set so I can show a hex dump of the returned blob.
According to the Sql Server OLH, the default type of the returned data when 'For XML Auto' is specified is 'binary base64-encoded format'. The data type of the single field of the AdoQuery is ftBlob, if I let the IDE create this field.
Executing the code below generates an exception "Uneven size in DecodeToStream". At the call to IdDecoderMIME.DecodeToString(S), the length of the string S is 3514, and 3514 mod 4 is 2, not 0 as it apparently should be, hence the exception. I've confirmed that the number of bytes in the field's value is 3514, so there's no difference between the size of the variant and the length of the string, i.e. nothing has gone awol in between.
procedure TForm1.FormCreate(Sender: TObject);
var
  SS : TStringStream;
  Output : String;
  S : String;
  IdDecoderMIME : TIdDecoderMIME;
begin
  SS := TStringStream.Create('');
  IdDecoderMIME := TIdDecoderMIME.Create(Nil);
  try
    AdoQuery1.Open;
    TBlobField(AdoQuery1.Fields[0]).SaveToStream(SS);
    S := SS.DataString;
    IdDecoderMIME.FillChar := #0;
    Output := IdDecoderMIME.DecodeToString(S);
    Memo1.Lines.Text := S;
  finally
    SS.Free;
    IdDecoderMIME.Free;
  end;
end;
To dump the raw blob to a file, I'm using this code:
procedure TForm1.FormCreate(Sender: TObject);
var
  SS : TStringStream;
  MS : TMemoryStream;
  Output : String;
begin
  SS := TStringStream.Create('');
  MS := TMemoryStream.Create;
  try
    AdoQuery1.Open;
    TBlobField(AdoQuery1.Fields[0]).SaveToStream(SS);
    SS.WriteString(#13#10);
    Output := SS.DataString;
    SS.Position := 0;
    MS.CopyFrom(SS, SS.Size);
    MS.SaveToFile(ExtractFilePath(Application.ExeName) + 'Blob.txt');
  finally
    SS.Free;
    MS.Free;
  end;
end;
A hex dump of the Blob.Txt file looks like this
00000000 44 05 61 00 75 00 5F 00 69 00 64 00 44 08 61 00 D.a.u._.i.d.D.a.
00000010 75 00 5F 00 6C 00 6E 00 61 00 6D 00 65 00 44 08 u._.l.n.a.m.e.D.
00000020 61 00 75 00 5F 00 66 00 6E 00 61 00 6D 00 65 00 a.u._.f.n.a.m.e.
00000030 44 05 70 00 68 00 6F 00 6E 00 65 00 44 07 61 00 D.p.h.o.n.e.D.a.
00000040 64 00 64 00 72 00 65 00 73 00 73 00 44 04 63 00 d.d.r.e.s.s.D.c.
00000050 69 00 74 00 79 00 44 05 73 00 74 00 61 00 74 00 i.t.y.D.s.t.a.t.
00000060 65 00 44 03 7A 00 69 00 70 00 44 08 63 00 6F 00 e.D.z.i.p.D.c.o.
00000070 6E 00 74 00 72 00 61 00 63 00 74 00 44 07 61 00 n.t.r.a.c.t.D.a.
00000080 75 00 74 00 68 00 6F 00 72 00 73 00 01 0A 02 01 u.t.h.o.r.s.....
00000090 10 E4 04 00 00 0B 00 31 37 32 2D 33 32 2D 31 31 .......172-32-11
000000A0 37 36 02 02 10 E4 04 00 00 05 00 57 68 69 74 65 76.........White
000000B0 02 03 10 E4 04 00 00 07 00 4A 6F 68 6E 73 6F 6E .........Johnson
000000C0 02 04 0D E4 04 00 00 0C 00 34 30 38 20 34 39 36 .........408 496
000000D0 2D 37 32 32 33 02 05 10 E4 04 00 00 0F 00 31 30 -7223.........10
000000E0 39 33 32 20 42 69 67 67 65 20 52 64 2E 02 06 10 932 Bigge Rd....
000000F0 E4 04 00 00 0A 00 4D 65 6E 6C 6F 20 50 61 72 6B ......Menlo Park
00000100 02 07 0D E4 04 00 00 02 00 43 41 02 08 0D E4 04 .........CA.....
As you can see, some of it is legible (field names and contents), some of it not. Does anyone recognise this format and know how to clean it up into the plain text I get from executing the same query in SS Management Studio, i.e. how do I successfully extract the XML from the result set?
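For what it is worth, the legible parts of the dump follow a visible pattern, and the sketch below walks just that pattern. Everything in it is inferred from these 272 bytes and not from any format documentation, so treat the token meanings as guesses: 0x44 appears to define a name (character count, then UTF-16LE text), 0x01 appears to open an element by 1-based name index, and 0x02 appears to carry an attribute (name index, a type byte, four bytes that look like code page 1252, a 2-byte length, then the value). The function name is made up and the loop stops at the first thing it does not recognise.

```python
import struct


def decode_blob(blob):
    """Walk the byte patterns visible in the dump; stop at anything unknown."""
    names, events, pos = [], [], 0
    while pos + 2 <= len(blob):
        token, arg = blob[pos], blob[pos + 1]
        if token == 0x44:  # guessed: name definition, arg = character count
            end = pos + 2 + 2 * arg
            if end > len(blob):
                break
            names.append(blob[pos + 2:end].decode("utf-16-le"))
            pos = end
        elif token == 0x01 and 1 <= arg <= len(names):  # guessed: element start
            events.append(("element", names[arg - 1]))
            pos += 2
        elif token == 0x02 and 1 <= arg <= len(names) and pos + 9 <= len(blob):
            # guessed: attribute = name index, type byte, code page, length, value
            codepage, length = struct.unpack_from("<IH", blob, pos + 3)
            start = pos + 9
            if start + length > len(blob):
                break
            value = blob[start:start + length].decode("cp%d" % codepage)
            events.append(("attribute", names[arg - 1], value))
            pos = start + length
        else:
            break
    return names, events


# The bytes of Blob.txt shown in the hex dump above (truncated as posted).
sample = bytes.fromhex("""
44 05 61 00 75 00 5F 00 69 00 64 00 44 08 61 00
75 00 5F 00 6C 00 6E 00 61 00 6D 00 65 00 44 08
61 00 75 00 5F 00 66 00 6E 00 61 00 6D 00 65 00
44 05 70 00 68 00 6F 00 6E 00 65 00 44 07 61 00
64 00 64 00 72 00 65 00 73 00 73 00 44 04 63 00
69 00 74 00 79 00 44 05 73 00 74 00 61 00 74 00
65 00 44 03 7A 00 69 00 70 00 44 08 63 00 6F 00
6E 00 74 00 72 00 61 00 63 00 74 00 44 07 61 00
75 00 74 00 68 00 6F 00 72 00 73 00 01 0A 02 01
10 E4 04 00 00 0B 00 31 37 32 2D 33 32 2D 31 31
37 36 02 02 10 E4 04 00 00 05 00 57 68 69 74 65
02 03 10 E4 04 00 00 07 00 4A 6F 68 6E 73 6F 6E
02 04 0D E4 04 00 00 0C 00 34 30 38 20 34 39 36
2D 37 32 32 33 02 05 10 E4 04 00 00 0F 00 31 30
39 33 32 20 42 69 67 67 65 20 52 64 2E 02 06 10
E4 04 00 00 0A 00 4D 65 6E 6C 6F 20 50 61 72 6B
02 07 0D E4 04 00 00 02 00 43 41 02 08 0D E4 04
""")

names, events = decode_blob(sample)
for event in events:
    print(event)
```

This only shows that the field names and values can be picked out of this particular dump. It does not handle whatever tokens close an element or start the next row, so it is no substitute for getting the server or provider to hand back text.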
Btw, I get the same result (including the contents of the Blob.Txt file) using both the MS OLE DB Provider for Sql Server and the Sql Server Native Client 11 provider, and using Delphi Seattle in place of D7.
Given that the code accesses an external database, this code is the closest I can get to an MCVE.
Update #2 The decoding problem vanishes if I change the Sql query to
select Convert(Text,
(select * from authors where au_lname = 'White' for xml AUTO
))
which gives the result (in SS) of
<authors au_id="172-32-1176" au_lname="White" au_fname="Johnson" phone="408 496-7223" address="10932 Bigge Rd." city="Menlo Park" state="CA" zip="94025" contract="1"/>
but I'm still interested to know how to get this to work without needing the Convert(). I've noticed that if I remove the Where clause from the Sql, what is returned is not well-formed XML - it contains a series of nodes, one per data row, but there is no enclosing root node.
Also btw, I realise that I can avoid this problem by not using "For XML Auto", I'm just interested in how to do it correctly. Also, I don't need any help parsing the XML once I've managed to extract it.
Add the TYPE Directive to specify that you want XML returned.
select *
from Authors
where au_lname = 'White'
for xml auto, type
You can't simply decode the binary blob into XML.
You can use TADOCommand and direct its output stream to an XML document object e.g.:
// needs ComObj (CreateOleObject) and ADODB (TADOCommand) in the uses clause
const
  adExecuteStream = 1024;
var
  xmlDoc, RecordsAffected: OleVariant;
  cmd: TADOCommand;
begin
  xmlDoc := CreateOleObject('MSXML2.DOMDocument.3.0'); // or CoDomDocument30.Create;
  xmlDoc.async := False;
  cmd := TADOCommand.Create(nil);
  try
    // specify your connection string
    cmd.ConnectionString := 'Provider=SQLOLEDB;Data Source=(local);...';
    cmd.CommandType := cmdText;
    cmd.CommandText := 'select top 1 * from items for xml auto';
    // send the result stream straight into the DOM document
    cmd.Properties['Output Stream'].Value := xmlDoc;
    cmd.Properties['XML Root'].Value := 'RootNode';
    cmd.CommandObject.Execute(RecordsAffected, EmptyParam, adExecuteStream);
    xmlDoc.save('d:\test.xml');
  finally
    cmd.Free;
  end;
end;
This results in well-formed XML with an enclosing root node, RootNode.

What is the Informix database file extension

I've got an image file of a hard drive from a client that wants a database extracted out of it. The client does not know any details except that the database was once installed on the server from which the image was created.
I found out that it is a UNIX system with an Informix DBS installed, but I am unable to find any database files. I'm not sure about the version of Informix, but it seems that it was installed around 15 years ago.
I'm not able to boot from the image. I'm just viewing the files.
Do informix database files have an extension and what could it be? Any other tips how to identify the database files?
Do you know whether the database was Informix Standard Engine (SE) or Informix (Informix Dynamic Server — IDS — or one of its multitude of namesakes over the years)?
Standard Engine
If it was SE, then the database files are in a directory database.dbs and the files holding the indexes and data for the tables have extensions .idx and .dat. That's pretty much fixed and very easy. The files and the directory should belong to group informix; the owner will be whoever created the database or the table within the database.
Informix Dynamic Server
If it was IDS, then there is no guaranteed naming convention, and Informix didn't even recommend one. Depending on the state of the disk, I'd look for large files that are owned by user informix and belong to group informix and have 660 (-rw-rw----) permissions. The files will have structure, but it isn't easy to discern it.
For example, I have 'chunk 0 of the root dbspace' in a file toru_31.rootdbs.c0 (server name toru_31 — a naming standard I imposed on my systems). It starts with:
0x0000: 00 00 00 00 01 00 AB 89 03 00 00 18 30 01 C0 06 ............0...
0x0010: 00 00 00 00 00 00 00 00 49 42 4D 20 49 6E 66 6F ........IBM Info
0x0020: 72 6D 69 78 20 44 79 6E 61 6D 69 63 20 53 65 72 rmix Dynamic Ser
0x0030: 76 65 72 20 43 6F 70 79 72 69 67 68 74 20 32 30 ver Copyright 20
0x0040: 30 31 2C 20 32 30 31 31 20 20 49 42 4D 20 43 6F 01, 2011 IBM Co
0x0050: 72 70 6F 72 61 74 69 6F 6E 2E 00 00 00 00 00 00 rporation.......
0x0060: 00 00 00 00 00 00 00 00 00 00 03 00 00 08 00 00 ................
0x0070: BA 2D CC 50 1A 00 00 00 00 00 00 00 C8 00 00 00 .-.P............
0x0080: 31 31 37 33 05 00 00 00 00 00 00 00 00 00 00 00 1173............
0x0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
However, 'chunk 1 of the root dbspace' (file toru_31.rootdbs.c1) starts:
0x0000: 00 00 00 00 02 00 F1 C6 00 00 00 18 18 00 E4 07 ................
0x0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
* (125)
0x07F0: 00 00 00 00 00 00 00 00 00 00 00 00 61 C6 92 00 ............a...
0x0800: 69 6E 66 6F 72 6D 69 78 69 6E 66 6F 72 6D 69 78 informixinformix
* (127)
0x1000: 02 00 00 00 02 00 BE B6 06 00 08 08 20 00 D8 07 ............ ...
0x1010: 00 00 00 00 00 00 00 00 10 06 00 00 01 00 00 00 ................
0x1020: EA 06 00 00 03 00 00 00 75 08 00 00 08 00 00 00 ........u.......
0x1030: A1 08 00 00 08 00 00 00 B5 0A 00 00 30 03 00 00 ............0...
There is very little information there to give the game away. Although the appearance of informix (128 lines containing informix twice) looks like a tell-tale sign, it isn't — I created the chunk with a program that writes informix over the disk space. It simply shows where the server has not yet written data to this chunk.
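If you want to automate the first check, something like the following Python sketch could walk the extracted file tree and flag files whose first few KiB contain the product banner seen in chunk 0 above. The function names and the 4096-byte probe size are arbitrary choices of mine, and as the chunk 1 dump shows, only some chunks carry the banner, so an empty result proves nothing. A 15-year-old server may also word the banner differently.

```python
import os

# Banner text as it appears in the chunk 0 dump above; older versions may differ.
BANNER = b"Informix Dynamic Server"


def has_ids_banner(path, probe_size=4096):
    """True if the start of the file contains the banner string."""
    try:
        with open(path, "rb") as fh:
            return BANNER in fh.read(probe_size)
    except OSError:
        return False


def find_banner_files(root):
    """Walk a directory tree and list regular files that start with the banner."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and has_ids_banner(full):
                hits.append(full)
    return sorted(hits)
```

Usage would be along the lines of find_banner_files("/mnt/image"), where /mnt/image is a placeholder for wherever the image is mounted read-only.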
You can look for $INFORMIXDIR. In fact, if there's a directory /INFORMIXTMP, you can look in the text file /INFORMIXTMP/.infxdirs to see where Informix has been installed. From those directories, you can look in $INFORMIXDIR/etc for onconfig files - they're text files that contain an entry ROOTPATH, which is the pathname of chunk 0 of the root dbspace — basically, the starting point for the whole system. There's usually an onconfig.std which is a template; the naming convention is not firmly fixed, though I always use onconfig.servername (so the config file for toru_31 is onconfig.toru_31). You may also find other files around in $INFORMIXDIR/etc. For example, there's a file oncfg_toru_31.31 (prefix oncfg_, followed by server name toru_31 followed by dot and the server number 31) which contains information about the chunks and other disk space etc used by the server. You might also see binary files analogous to .conf.toru_31 and .infos.toru_31 — the .infos file is normally present only while the server is up but the .conf file persists. These files have some limited information in them, most notably the name of the onconfig file.
If you can find these files on the disk, then you can proceed to identify where the data was stored on the disk.
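Once an onconfig file turns up, pulling ROOTPATH out of it is simple. This is a small Python sketch that assumes the usual layout of one `NAME value` pair per line with `#` starting a comment. The function name and the sample path are invented for illustration.

```python
def read_onconfig(text):
    """Parse 'NAME value' lines from onconfig text, ignoring # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


# Made-up sample content; the path is hypothetical.
example = """
# Root dbspace configuration
ROOTNAME  rootdbs
ROOTPATH  /data/informix/toru_31.rootdbs.c0   # chunk 0 of the root dbspace
"""

print(read_onconfig(example).get("ROOTPATH"))
```

The path it prints is where the server expected chunk 0 to be, so on a mounted image you would look for that path relative to the mount point.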

What does the raw result format returned from Microsoft SQL Server look like?

I have been using SQL for about 10 years, but I have just realised I never knew how the client actually receives and processes the data it gets from the server.
My question is: what does the result from Microsoft SQL Server actually look like in raw format? For comparison, a response from an HTTP server contains HTTP headers, including a Content-Type header that says what the body format is (mostly HTML for web pages).
The protocol name is TDS (Tabular Data Stream).
Some documentation is available at MSDN.
Here is a very basic example of the data transferred for the simple query select 'foo' as 'bar'
Request
Packet header (type, length, etc)
01 01 00 5C 00 00 01 00
Packet data
16 00 00 00 - headers total length
12 00 00 00 - first header length
02 00 - type
00 00 00 00 00 00 00 01 00 00 00 00 - data
0A 00 73 00 65 00 6C 00 65 00 63 00 74 00 20 00
27 00 66 00 6F 00 6F 00 27 00 20 00 61 00 73 00
20 00 27 00 62 00 61 00 72 00 27 00 0A 00 20 00
20 00 20 00 20 00 20 00 20 00 20 00 20 00 - sql
Response
Packet header (type, length, etc)
04 01 00 33 00 00 01 00
Packet data
columns metadata
81 - record id
01 00 - count
first column
00 00 00 00 - user type
20 00 - flags
A7 - type
03 00 - length
09 04 D0 00 34 - collation
03 - column name length
62 00 61 00 72 00 - column name bytes
rows
D1 - record id
03 00 - length
66 6F 6F - value
ending data
FD - record id
10 00 - status
C1 00 - current command
01 00 00 00 00 00 00 00 - rows total
We can also look at the parser implementation, thanks to the reference sources.
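As a quick illustration of the layout above, this Python sketch unpacks the 8-byte packet header and pulls the SQL text out of the request. The field names are my own labels for the header bytes (type, status, big-endian length, SPID, packet id, window), and the helper assumes the whole batch fits in a single packet, as it does here.

```python
import struct
from collections import namedtuple

TdsHeader = namedtuple("TdsHeader", "type status length spid packet_id window")


def parse_tds_header(raw):
    """Unpack the 8-byte TDS packet header (length is big-endian)."""
    if len(raw) < 8:
        raise ValueError("a TDS packet header is 8 bytes")
    return TdsHeader(*struct.unpack(">BBHHBB", raw[:8]))


def split_sql_batch(packet):
    """Return (header, sql_text) for a single-packet SQL batch request."""
    header = parse_tds_header(packet)
    payload = packet[8:header.length]
    # The payload starts with the headers block; its first 4 bytes hold the
    # block's total length (little-endian), and the UTF-16LE SQL text follows.
    (all_headers_len,) = struct.unpack_from("<I", payload, 0)
    return header, payload[all_headers_len:].decode("utf-16-le")


# Request and response header bytes copied from the example above.
REQUEST = bytes.fromhex("""
01 01 00 5C 00 00 01 00
16 00 00 00 12 00 00 00 02 00
00 00 00 00 00 00 00 01 00 00 00 00
0A 00 73 00 65 00 6C 00 65 00 63 00 74 00 20 00
27 00 66 00 6F 00 6F 00 27 00 20 00 61 00 73 00
20 00 27 00 62 00 61 00 72 00 27 00 0A 00 20 00
20 00 20 00 20 00 20 00 20 00 20 00 20 00
""")
RESPONSE_HEADER = bytes.fromhex("04 01 00 33 00 00 01 00")

header, sql = split_sql_batch(REQUEST)
print(header)
print(repr(sql))
print(parse_tds_header(RESPONSE_HEADER))
```

The length field counts the header itself, so the request's 0x5C covers the 8 header bytes plus the 84 bytes of packet data shown.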

wrapping h264 olimex a13 encoded data to RTMP server

I am trying to stream Olimex A13 encoded data to RTMP server with librtmp and view it on VLC. The problem is I cannot find how to correctly wrap Cedar data to flash container... I have found this example which looks exactly what I would need, but VLC has still problems understanding it -
No suitable decoder module:
VLC does not support the audio or video format "undf". Unfortunately there is no way for you to fix this.
I tried grabbing data through rtmpdump and it appears I cannot disable audio track which VLC is trying to get from the header
Format : Flash Video
File size : 195 KiB
Duration : 1mn 27s
Overall bit rate : 1 464 Kbps
_Server : NGINX RTMP (github.com/arut/nginx-rtmp-module)
_displayWidth : 640.000
_displayHeight : 480.000
_fps : 25.000
Video
Format : AVC
Format/Info : Advanced Video Codec
Codec ID : 7
Duration : 1mn 27s
Width : 640 pixels
Height : 480 pixels
Display aspect ratio : 4:3
Frame rate mode : Constant
Frame rate : 25.000 fps
Bit depth : 8 bits
Bits/(Pixel*Frame) : 0.191
Audio
The last line, 'Audio', is what I suspect is causing VLC to err. Moreover, I am sure the video data I am trying to send is not correctly wrapped.
From what I understood looking at the example, FLV expects a general stream description that can be flagged with video and audio stream information. Other examples I found (including ffmpeg) use some magic number following the onMetaData tag, but I don't think I do that when following the example above:
STR2AVAL(av, "onMetaData");
enc = AMF_EncodeString(enc, pend, &av);
*enc++ = AMF_ECMA_ARRAY;
enc = AMF_EncodeInt32(enc, pend, 5+5+2); //5 - video 5 - audio
The dump of my rtmp header is:
sending 307 as header
00000000 02 00 0d 40 73 65 74 44 61 74 61 46 72 61 6d 65 ...@setDataFrame
00000010 02 00 0a 6f 6e 4d 65 74 61 44 61 74 61 03 00 06 ...onMetaData...
00000020 61 75 74 68 6f 72 02 00 00 00 09 63 6f 70 79 72 author.....copyr
00000030 69 67 68 74 02 00 00 00 0b 64 65 73 63 72 69 70 ight.....descrip
00000040 74 69 6f 6e 02 00 00 00 08 6b 65 79 77 6f 72 64 tion.....keyword
00000050 73 02 00 00 00 06 72 61 74 69 6e 67 02 00 00 00 s.....rating....
00000060 0a 70 72 65 73 65 74 6e 61 6d 65 02 00 06 43 75 .presetname...Cu
00000070 73 74 6f 6d 00 05 77 69 64 74 68 00 40 84 00 00 stom..width.@...
00000080 00 00 00 00 00 05 77 69 64 74 68 00 40 84 00 00 ......width.@...
00000090 00 00 00 00 00 06 68 65 69 67 68 74 00 40 7e 00 ......height.@~.
000000a0 00 00 00 00 00 00 09 66 72 61 6d 65 72 61 74 65 .......framerate
000000b0 00 40 39 00 00 00 00 00 00 00 0c 76 69 64 65 6f .@9........video
000000c0 63 6f 64 65 63 69 64 02 00 04 61 76 63 31 00 0d codecid...avc1..
000000d0 76 69 64 65 6f 64 61 74 61 72 61 74 65 00 40 96 videodatarate.@.
000000e0 e3 60 00 00 00 00 00 08 61 76 63 6c 65 76 65 6c .`......avclevel
000000f0 00 3f f0 00 00 00 00 00 00 00 0a 61 76 63 70 72 .?.........avcpr
00000100 6f 66 69 6c 65 00 40 50 80 00 00 00 00 00 00 17 ofile.@P........
00000110 76 69 64 65 6f 6b 65 79 66 72 61 6d 65 5f 66 72 videokeyframe_fr
00000120 65 71 75 65 6e 63 79 00 40 20 00 00 00 00 00 00 equency.@ ......
00000130 00 00 09 ...
After that, the application sends the SPS and PPS data, which I am not sure I encode correctly either:
This is what I have in my encoder application
psp struct start
00000000 34 32 30 31 66 00 00 00 00 00 5a 30 49 41 48 2b 4201f.....Z0IAH+
00000010 56 41 55 42 37 49 00 00 00 00 00 00 00 00 00 00 VAUB7I..........
00000020 00 00 00 00 00 00 00 00 61 4f 34 78 45 67 3d 3d ........aO4xEg==
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040 00 00 00 00 00 00 ......
psp struct end
And this is what I send to RTMP server:
sending 38 as sps/pps info
00000000 17 00 00 00 00 01 42 c0 15 03 01 00 0d 67 5a 30 ......B......gZ0
00000010 49 41 48 2b 56 41 55 42 37 49 01 00 09 68 61 4f IAH+VAUB7I...haO
00000020 34 78 45 67 3d 3d 4xEg==
info frame sent
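One thing stands out in the two dumps above: the strings Z0IAH+VAUB7I and aO4xEg== look like base64 text, and they go out on the wire as text. Below is a Python sketch of what the packet might look like if they were decoded first. It assumes they really are the base64-encoded SPS and PPS NAL units, and that the packet should be an FLV video tag body (0x17, AVC packet type 0, zero composition time) followed by an AVCDecoderConfigurationRecord with 4-byte NAL length prefixes. The function name is made up.

```python
import base64
import struct


def build_avc_sequence_header(sps_b64, pps_b64):
    """Build an FLV AVC sequence header body from base64 SPS/PPS strings."""
    sps = base64.b64decode(sps_b64)  # assumed to be one raw SPS NAL unit
    pps = base64.b64decode(pps_b64)  # assumed to be one raw PPS NAL unit
    record = bytes([
        0x01,    # configurationVersion
        sps[1],  # AVCProfileIndication, taken from the SPS
        sps[2],  # profile_compatibility
        sps[3],  # AVCLevelIndication
        0xFF,    # 6 reserved bits set + lengthSizeMinusOne = 3
        0xE1,    # 3 reserved bits set + number of SPS = 1
    ])
    record += struct.pack(">H", len(sps)) + sps
    record += bytes([0x01]) + struct.pack(">H", len(pps)) + pps
    # 0x17 = key frame + AVC codec, 0x00 = sequence header, 3 bytes composition time
    return bytes([0x17, 0x00, 0x00, 0x00, 0x00]) + record


packet = build_avc_sequence_header("Z0IAH+VAUB7I", "aO4xEg==")
print(len(packet), packet.hex())
```

Compared with the 38 bytes sent above, the decoded SPS starts with 67 42 00 1f, which would also line up with the 4201f at the start of the struct if that is the profile and level printed without zero padding.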
So the questions are:
How do you modify the header to explicitly state that it's video only?
How does one wrap video data in NAL format, and how often do you have to transmit SPS and PPS data?
Are there any online services to inspect the RTMP data to understand what's wrong with it? VLC doesn't help much...
