Opening a Delphi file with unknown encoding - database

I am a web developer by trade, so please excuse my lack of knowledge on how databases are structured in Delphi.
I wish to open database files from a program that I know was written in Delphi; however, I do not have access to any source code. The database files have extensions such as '.IX$', '.TG$', '.VO$', etc.
No matter what encoding I try, it opens with unintelligible characters plus some interspersed plain text (notably the plain text appears to be all the important strings, but they're nested in varying clusters of non-English characters). There appears to be no byte order mark.
Other than sitting here attempting to play cryptographer and matching repeating gibberish, are there any tools that can parse this? I am hoping I'm just missing some obvious answer such as "Oh all Delphi files that end in $ are XYZ and can be evaluated with program ABC"...

Related

How are the binary data inside a certain format parsed?

Considering that binary data (video, images, audio, executables) can be regarded as a long sequence of arbitrary bytes,
when such data sits inside a particular format (SQL, a BLOB in a database, MP3, JSON, XML, etc.), how does the parser know that a special character (or sequence of characters, like {, }, \t, space, EOF) is being used for formatting and is not part of the binary data, and vice versa?
Also, I am not quite sure which category this question fits in, so I put lexical analysis and linguistics. What subject/fields of computer science studies this?
This is indeed an odd place for this question. I'm a little unclear on exactly what you're asking here, but in sum, not all binary data (assuming you mean machine readable data) are equal. For instance: audio, images, and video are not executable data, they are parsed data; as such they are handled differently.
Also, "binary data" are not as arbitrary as you might think upon opening a hex editor for the first time :). Executables are structured into DATA and CODE segments, so with those flags the computer knows how to treat things appropriately. As for the other three types you mentioned, they are all structured differently depending upon their file format, which is why so many different file formats are out there! The executable program which parses these files knows how to handle them based upon information contained in the code about the file format, which of course means that the program has to know how to handle the file format and have info on how it is segmented to load it properly, which is why you can't open an MP3 in Microsoft Paint.
As for the study of file formats and data storage, it's not really a field unto itself so much as a topic that comes up in a lot of areas. Information Theory, Reverse Engineering, Natural Language Processing, and many others have uses for understanding different file types and how they store data. Anyhow, this was only a brief, cursory explanation, and there are plenty of things you can google (try .exe file formats or .jpg/.png file formats to start).

read thunderbird address mab files content

I have several address lists in my Thunderbird address book.
Every time I need to edit an address that appears in several lists, it is a pain in the neck to find which lists contain the address to be modified.
As a helper tool, I want to read the various files and give the user, in a single search, a list of which
xxx.MAB files include the searched address.
With that list, the user can simply go and edit just the right address lists.
I would like to know the minimum about the format of these MAB files, so I can open them and search for strings in them.
Thanks in advance,
juan
PS: I have asked on the Mozilla forum, but Mozilla has no plans to consolidate the addresses into one master file and have the different lists just contain links to the master. One individual is thinking of doing that, but he has no idea when, due to lack of resources.
On this forum there is a similar question mentioning MORK files, but my current Thunderbird appears to keep all addresses in MAB files.
I am afraid there is no answer that will give you a proper solution for this question.
Mork is a textual database format; Thunderbird uses it for address book data (.mab files) and mail folder summaries (.msf files).
The format, written by David McCusker, is a mix of various numerical namespaces, is undocumented, and seems to no longer be developed, maintained, or supported. The only way to get to grips with it is to reverse engineer it while reading source code that uses the format.
However, experienced people have tried to write parsers for this file format without any success. According to Wikipedia, former Netscape engineer Jamie Zawinski had this to say about the format:
"...the single most brain-damaged file format that I have ever seen in my nineteen year career"
This page states the following:
"In brief, let's count its (Mork's) sins:
Two different numerical namespaces that overlap.
It can't decide what kind of character-quoting syntax to use: Backslash? Hex encoding with dollar-sign?
C++ line comments are allowed sometimes, but sometimes // is just a pair of characters in a URL.
It goes to all this serious compression effort (two different string-interning hash tables) and then writes out Unicode strings without using UTF-8: writes out the unpacked wchar_t characters!
Worse, it hex-encodes each wchar_t with a 3-byte encoding, meaning the file size will be 3x or 6x (depending on whether wchar_t is 2 bytes or 4 bytes.)
It masquerades as a "textual" file format when in fact it's just another binary-blob file, except that it represents all its magic numbers in ASCII. It's not human-readable, it's not hand-editable, so the only benefit there is to the fact that it uses short lines and doesn't use binary characters is that it makes the file bigger. Oh wait, my mistake, that isn't actually a benefit at all."
The frustration shines through here and it is obviously not a simple task.
Consequently, there apparently exists no parser outside Mozilla products that is actually able to parse this format.
I have reverse-engineered complex file formats in the past and know it can be done with patience and the right amount of energy.
Sadly, this seems to be your only option as well. A good place to start would be to take a look at Thunderbird's source code.
I know this doesn't give you a straight-up solution but I think it is the only answer to the question considering the circumstances for this format.
And of course, you can always look into the extension API to see if that allows you to access the data you need in a more structured way than handling the file format directly.
Sample code which reads mork
Node.js: https://www.npmjs.com/package/mork-parser
Perl: http://metacpan.org/pod/Mozilla::Mork
Python: https://github.com/KevinGoodsell/mork-converter
More links: https://wiki.mozilla.org/Mork

Why do application folders contain so many files?

I have a general question about finished applications. When I go into the files of a Windows application, some files make sense as to why they are there, such as the executable, various media files, .dll files, etc. However, what I don't understand is how there can be thousands of different files, located in hundreds of different directories (counting the hierarchy), with anywhere between dozens and hundreds of different filetypes. Some of the filetypes don't even seem like actual files; the extension could be something completely obscure. How does the application know how to work with that? Are all of those files hand-written and compiled, or are many of them supplied automatically upon generating a desktop application (which would vary based on the application, of course)? I've never actually compiled an application in any language, as I've been studying JavaScript as a starting point, and I recognize that JavaScript is not intended for creating standalone applications; it is meant to be embedded in HTML. This is why I have so many questions about the generation of the application itself.
To provide an example, a few of the file extensions I see contained in the Audacity application folder which I don't recognize are as follows: .lsp .raw .mo .ny .exp
Even that is a very short list compared to the amount of filetypes/extensions I usually encounter which I have no knowledge of. So, all in all, my main question is why there's such a crazy amount of files, folders, and filetypes/extensions being used by an application. Hopefully someone can help me understand.
Extra question, for those who might care to answer it:
What does it mean when you open a file in an application like Notepad++ (or a .plist editor) and it's just a bunch of unreadable characters? I'm assuming that means it's a compiled file, but I could use some clarification. This happens when I try to open an .exe, a .dll, etc. I understand why I can't edit things like that in a text editor of course, yet why all the strange symbols and characters? Why wouldn't it just throw an error upon trying to open it? Are all the strange characters just a way of attempting to interpret already compiled code?
Bear with me, I'm pretty new to programming and I'm trying to get a better understanding of the process behind actually generating a GUI-based desktop application. As I said before, my current knowledge doesn't extend to the point of actually compiling an application.
Thank you for any help, I really appreciate it.
Focusing on your extra question: you have to learn what a binary file and a text file are, but in short: imagine you have a simple calculator program that stores its result in a file. Let's say the result you want to store is the number 64. You have two options: saving it as text (the characters 6 and 4) or as binary data.
If you store it as text, you need two bytes: one for the code of the character 6 and another for the character 4. You can open that file with Notepad and you'll see the two characters '64'.
If you store it as a binary value, you only need one byte, but if you open it with Notepad, you'll see the character whose code is 64: '@'
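A minimal sketch of the two storage options, assuming plain ASCII for the text form. Note that byte value 64 displays as '@' in ASCII. The variable names are only for illustration.

```python
# Option 1: store the number as text, one byte per digit character.
text_form = "64".encode("ascii")
# Option 2: store the number as a single raw byte whose value is 64.
binary_form = bytes([64])

text_codes = list(text_form)                  # character codes of '6' and '4'
shown_by_editor = binary_form.decode("ascii") # what a text editor would display

print(len(text_form), text_codes)             # 2 [54, 52]
print(len(binary_form), shown_by_editor)      # 1 @
```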
Most of these "strange" files are resources needed by parts of the application. A complex application is built in a very modular way, and each component may need to load different additional resources, often depending on conditions decided at runtime.
For example, on startup if a Qt-based application reads it should use German translation, it may load trans/de_DE.qm from a directory also containing other language files. Or a game may load level by level from different files depending on how far you've come.
Your second question is quite simple. Most resource files are read by an application function as a stream of bytes. If, for example, such a stream contains the four bytes 00 00 00 5a (in hex), you'll see strange symbols in notepad.exe, since that editor interprets each byte as a character code and prints whatever symbol its character table holds for 0x00, 0x00, 0x00, and 0x5a. But the application actually reads them as one 4 x 8 bits = 32-bit value, which in my simple example might be a 32-bit integer variable stored most significant byte first. So the variable is set to 0x5a, which is decimal 90.
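The same four bytes can be viewed both ways with Python's `struct` module. This is a sketch that assumes the bytes are 00 00 00 5a; which byte order a real file uses depends on the application that wrote it.

```python
import struct

raw = bytes([0x00, 0x00, 0x00, 0x5A])

# What a text editor does: treat every byte as a character code.
as_text = raw.decode("latin-1")          # three NUL characters followed by 'Z'

# What the application does: treat the four bytes as one 32-bit integer.
(as_int_be,) = struct.unpack(">I", raw)  # most significant byte first
(as_int_le,) = struct.unpack("<I", raw)  # least significant byte first

print(repr(as_text))   # '\x00\x00\x00Z'
print(as_int_be)       # 90
print(as_int_le)       # 1509949440
```

The two integer results show why knowing the byte order matters when decoding an unknown file by hand.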

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?
Because users change the file extension, or other programs steal the file extension, it allows the application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
The concept of magic numbers goes back to Unix and pre-dates the use of file extensions.
The original idea of the shell was that all 'executables' would look the same: it didn't matter how the file had been created or what program should be used to evaluate it. The shell would look at the contents of the file and determine the appropriate program. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things 'nicer' for users, Microsoft chose to 'hide' these extensions, and the era of trojan files, which look like they are of one type but really have a different extension and are processed by a different program, was born.
If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
BTW, trying to guess about the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to pretty well guess the format (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy) but if all dates are within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risks of misinterpretation could be reduced.
To quickly identify the type of the file, or the positions within it.
Your question should not be “why do file formats have magic numbers”, but rather “what are the advantages of file formats having magic numbers”!
Suggestions:
Programs that undelete files by reading disk free space can recognize file types.
Your UNIX knows whether an executable file is to be interpreted (shebang) or is binary.
When you lose extensions, programs like file can detect what your files are.
Designers of file formats consider it always safer when applications can easily make sure they are reading a file in the right format.
As you have a header anyway, it does not cost much to put the magic number at its start.
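The checks described above can be sketched as a small lookup on the first bytes of a file. The signature table is a short illustrative selection, not an exhaustive list, and the function names `sniff` and `sniff_file` are hypothetical.

```python
# (magic bytes at offset 0, human-readable label); longer signatures first.
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"Rar!", "RAR archive"),
    (b"MZ", "DOS/Windows executable"),
    (b"#!", "script with shebang line"),
]

def sniff(data: bytes) -> str:
    """Guess a file type from its leading bytes, ignoring any extension."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"

def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))

print(sniff(b"MZ\x90\x00"))      # DOS/Windows executable
print(sniff(b"Rar!\x1a\x07"))    # RAR archive
print(sniff(b"hello world"))     # unknown
```

A match only says the file claims to be that format; as noted above, magic numbers reduce the risk of misinterpretation but do not eliminate it.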

Help required with ancient, unknown storage system

Morning all,
I've gone and told a customer I could migrate some of their old data out of a DOS-based system into the new system I've developed for them. However, I said that without actually looking at the files that stored the data in the old system. I just figured a quick google would solve all the problems for me... I was wrong!
Anyway, this program has a folder with hundreds... well 800 files with all sorts of file extensions, .ave, .bak, .brw, .dat, .001, .002...., .007, .dbf, .dbe and .his.
.Bak obviously isn't a SQL backup file.
Does anyone have any programming experience using any of those file types who may be able to point me in the direction of some way to read and extract the data?
I can't mention the program name, because I don't think the original developer would allow it...
Thanks.
I'm willing to bet that the .dbf file is in DBase format, which is really straightforward. The contents of that might provide clues to the rest of them.
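To check that guess, the fixed part of a dBASE III-style header can be decoded in a few lines. This is a minimal sketch assuming the classic layout (32-byte header, 32-byte field descriptors, a 0x0D terminator); later variants add details it ignores. The sample bytes are synthetic, and `parse_dbf_header` and `_field` are hypothetical helper names.

```python
import struct

def parse_dbf_header(data: bytes) -> dict:
    """Decode record counts and field descriptors from a .dbf header."""
    version, yy, mm, dd, n_records, header_len, record_len = struct.unpack_from(
        "<4BIHH", data, 0)
    fields = []
    pos = 32
    # Field descriptors are 32 bytes each; a 0x0D byte marks the end of the list.
    while pos < min(header_len, len(data)) and data[pos] != 0x0D:
        fields.append({
            "name": data[pos:pos + 11].split(b"\x00", 1)[0].decode("ascii"),
            "type": chr(data[pos + 11]),   # e.g. C = character, N = numeric
            "length": data[pos + 16],
            "decimals": data[pos + 17],
        })
        pos += 32
    return {
        "version": version,
        "last_update": (yy, mm, dd),       # year is usually stored as an offset from 1900
        "record_count": n_records,
        "header_length": header_len,
        "record_length": record_len,
        "fields": fields,
    }

def _field(name: bytes, ftype: bytes, length: int, decimals: int = 0) -> bytes:
    """Build one synthetic 32-byte field descriptor for the demo below."""
    return name.ljust(11, b"\x00") + ftype + bytes(4) + bytes([length, decimals]) + bytes(14)

# Synthetic header: 2 records, fields NAME (C, 10) and PRICE (N, 5.2).
sample = (struct.pack("<4BIHH", 0x03, 99, 12, 31, 2, 97, 16) + bytes(20)
          + _field(b"NAME", b"C", 10) + _field(b"PRICE", b"N", 5, 2) + b"\x0d")
header = parse_dbf_header(sample)
print(header["record_count"], [f["name"] for f in header["fields"]])
```

If the numbers that come out look plausible for a real file (a record length equal to one plus the sum of the field lengths, a sensible date), the dBASE guess is probably right.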
The Unix 'file' utility can be used to recognize many file types by their 'magic number'. It examines the file's contents and compares them with thousands of known formats. If the files are in any kind of common format, this can probably save you a good amount of work.
If they're NOT in a common format, it may send you chasing after red herrings. Take its suggestions as just that, suggestions.
In complement to the sites suggested by Greg and Dmitriy, there's also the repository of file formats at http://www.wotsit.org ("What's its format?").
If that doesn't help, a good hex editor (with dump display) is your friend... I've always found it amazing how easy it can be to read and recognize many file formats.
Could be anything. Best bet is to open them with a hex editor and see what you can see.
Most older systems used a basic ISAM, which had one file per table containing a set of fixed-length data records. The other files would probably be indexes.
As you only need the data, not the indexes, just look for the files with repeating data patterns (they often look like pretty patterns on the hex editor screen).
When you find the file with the data, try to locate a known record, e.g. "Mr Smith", and see if you can work out the other fields. Integers are often stored byte for byte, dates are often encoded as days from a known start date, and money could be in BCD.
If you see a strong pattern, then most likely each record is a fixed length. There will probably be a header block on the file, say 128 or 256 bytes, and then the fixed-length records.
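Once a layout has been guessed, decoding is mostly mechanical. The sketch below assumes an entirely hypothetical layout (a 128-byte header, then 16-byte records holding a 10-byte name, a 2-byte day count from 1 January 1980, and a 4-byte packed-BCD amount). The real offsets, epoch, and field order have to come from inspecting the actual file.

```python
import struct
from datetime import date, timedelta

HEADER_SIZE = 128          # assumed size of the header block
RECORD_SIZE = 16           # assumed fixed record length
EPOCH = date(1980, 1, 1)   # assumed start date for day counts

def bcd_to_int(raw: bytes) -> int:
    """Decode packed BCD: each byte holds two decimal digits."""
    value = 0
    for byte in raw:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("not packed BCD")
        value = value * 100 + high * 10 + low
    return value

def read_records(data: bytes):
    """Skip the header, then yield (name, date, amount) per fixed-length record."""
    body = data[HEADER_SIZE:]
    for offset in range(0, len(body) - RECORD_SIZE + 1, RECORD_SIZE):
        name, days, amount = struct.unpack_from("<10sH4s", body, offset)
        yield (name.decode("ascii", errors="replace").rstrip(),
               EPOCH + timedelta(days=days),
               bcd_to_int(amount))

# Synthetic file: empty header plus one record for "Mr Smith".
sample = (bytes(HEADER_SIZE) + b"Mr Smith  " + struct.pack("<H", 366)
          + bytes([0x00, 0x01, 0x23, 0x45]))
records = list(read_records(sample))
print(records)
```

Comparing decoded values against what the old application shows for the same record is the quickest way to confirm or reject each guess.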
Many old systems were written in COBOL. There is plenty of info on the net about COBOL formats, and some companies even sell COBOL ODBC drivers!
I think Greg is right about the .dbf file. You should try to find some information about the other file formats using sites like http://filext.com and http://dotwhat.net. A .bak file is usually a copy of another file with the same name but a different extension. For example, there may be a database.dbf file and a database.bak file holding a backup of it. You should ask your customer (if possible) for any details, documentation, or source code of the application that used those files.
Back in the DOS days, programmers used to make up their own file extensions pretty much as they saw fit. The DBF might well be a DBase file, which is easy enough to read, and the .BAK is probably a backup of one of the other important files, or just a backup left by a text editor.
For the remaining files, first thing I would do is check if they are in a readable ASCII format by opening them in a text editor.
If this doesn't give you a good result, try opening them in a binary editor that shows hex and ASCII side by side with control characters blanked out. Look for repeating patterns that might correspond to record fields. For example, say the .HIS was something like an order history file; it might contain embedded product codes or names. If this is the case, count the number of bytes between such fields. If it is a regular number, you probably have a flat binary file of records. This is best decoded by opening the file in the app, looking for values in a given record, and searching for the corresponding values in the binary file. Time consuming, and a pain in the ass, but workable enough once you get the hang of it.
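Counting the bytes between occurrences of a known value can be automated. In the sketch below the product code "WIDGET" and the 24-byte records are invented sample data. The greatest common divisor of the gaps is only a candidate: the true record length could be any divisor of it.

```python
from functools import reduce
from math import gcd

def find_all(data: bytes, needle: bytes) -> list:
    """Return every offset at which needle occurs in data."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets

def guess_record_length(offsets: list):
    """Candidate record length: the GCD of the gaps between occurrences."""
    gaps = [later - earlier for earlier, later in zip(offsets, offsets[1:])]
    return reduce(gcd, gaps) if gaps else None

def make_record(code: bytes) -> bytes:
    """Build one synthetic 24-byte record with the code at offset 4."""
    return b"\x01\x02\x03\x04" + code.ljust(8) + bytes(12)

# Synthetic file: 64-byte header, then four records, three of them for WIDGET.
data = bytes(64) + b"".join(
    make_record(code) for code in (b"WIDGET", b"GADGET", b"WIDGET", b"WIDGET"))

offsets = find_all(data, b"WIDGET")
candidate = guess_record_length(offsets)
print(offsets, candidate)   # [68, 116, 140] 24
```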
Happy hacking!
.DBF is a dBASE or early FoxPro database.
.DAT was used by Btrieve, and IIRC Paradox for DOS.
The .DBE and .00x files are probably either temporary or index files related to the .DAT files.
.DBF is easy. They'll open with MS Access or Excel (pre-2007 versions of Office, anyway), or with ADO or ODBC.
If the .DAT files are indeed Btrieve, you're in a world of hurt. They're a mess, even if you can get your hands on the right version of the data dictionary and a copy of the Btrieve structure. (Been there, done that, wore out the t-shirt before I got done.)
As others have suggested, I recommend a hex editor if you can't figure out what those files are and that dbf is probably Dbase.
BAK seems to be a backup file. I'm thinking that *.001, *.002, etc might be part of the backup. Are they all the same size? Maybe the backup was broken up into smaller pieces so that it could fit onto removable media?
Finally, take this as a life lesson. Before sending that Statement of Work over, if the customer asks you to import data from System A to System B, always ask for the sample schema, sample data, and sample files. Lots of times, things that seem straightforward end up being nightmares.
Good luck!
Be sure to use the Modified date on the files as clues, if the .001, .002, etc all have similar time stamps, maybe along with the .BAK, they might be part of the backup. Also there may be some old cruft in the directory you can (somewhat safely) ignore. Look for .BAT files and try to dissect them as well.
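With 800 files, tabulating sizes and modification times by extension is quicker than eyeballing a directory listing. The sketch below runs against a throwaway directory with made-up file names; the `survey` helper is hypothetical, and for real use you would pass it the folder of the old application.

```python
import tempfile
from collections import defaultdict
from datetime import datetime
from pathlib import Path

def survey(folder) -> dict:
    """Return {extension: [(name, size_in_bytes, modified_time), ...]}."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            info = path.stat()
            groups[path.suffix.lower()].append(
                (path.name, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return dict(groups)

# Demo on a temporary directory with invented files.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("DATA.001", 1024), ("DATA.002", 1024), ("DATA.BAK", 300)]:
        (Path(tmp) / name).write_bytes(bytes(size))
    report = survey(tmp)

for extension, files in sorted(report.items()):
    for name, size, modified in files:
        print(extension, name, size, modified.isoformat(timespec="seconds"))
```

Numbered files with identical sizes and near-identical timestamps would support the split-backup theory; sizes that differ widely would point toward them being separate data or memo files.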
One hint: if the .dbf files are DBase, FoxPro, or one of the other products that used that format, then you may be able to read them using ODBC. My system still has the ODBC driver for .dbf (Vista, with VS 2008; how it got there I'd have to hunt up, but I'd guess it was MDAC, Microsoft Data Access Components, that put it there). So you may not have a "world of unpicking to do" if the ODBC driver will read the .dbf files.
I seem to remember (with the limited confidence of DBase III tinkering from 20+ years ago) that DBase used .001, .002, ... files for memo (big text) fields.
Good luck trying to salvage the data.
The DBF format is fairly common.
The other files are puzzling.
I'm guessing that either you're dealing with old BTrieve files (bad), or (hopefully) with the results of some ill-conceived backup scheme where someone backed up his database into the same directory rather than into the hard drive in which case you could ignore these.
It's now part of Pervasive, but I used, years ago, Data Junction to migrate data from lots of file types to others. Have a look, unless you want to write a parser.
.dat can also be old Clarion 2.1 files... It works on an ISAM basis also, with key/index files

Resources