How to reverse engineer a database file format? - database

It is an accounting database whose file format is proprietary. The problem is that the database is highly unreliable: corruption goes unnoticed for years, after which it becomes unbearably difficult to recover data. We maintain the accounts of lots of companies, so reversing the format would be highly useful. The software is also ODBC compliant, which might be very helpful for reversing it. The software is partially open source, i.e. everything except the database engine is open source (and fully customizable), though they have their own language.
I have no experience in reversing. We would like to reverse it in such a way that we could write data in that format directly. I would like to know where I could get tools and how I should approach this; any articles or videos would be awesome.

It's usually not that hard. For example, I like a hybrid approach:
Basic analysis of the target file: entropy, signatures, etc. It is a good idea to look at possible chunks with a hex editor (be ready to live in the hex editor for a long time); I recommend 010 Editor with its templates and scripts.
Basic analysis of the target application: you need to understand how it was written (language, frameworks, libraries and so on); this will be really useful in the next steps.
Classic reverse engineering of an application that can open this file format. The best way is to use something like IDA Pro + Hex-Rays (but be ready to spend a lot of money) to generate C-like source. To start exploring, you can, for example, corrupt the target file (change or remove some bytes) and break on the resulting exception in the target application.
Try FileMon (now part of ProcMon) to understand how the target application reads the target file. Sometimes this works well (some applications don't use a read buffer at all, so you will see small reads plus a call stack to explore!).
Other tools can help in some cases.
Best of all, upload a few samples of the database and the application and give a link. We can try to do it together and write a little tutorial for others.
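As a rough illustration of the entropy step mentioned above, here is a minimal Python sketch that computes Shannon entropy per fixed-size chunk. The function names and the 4096-byte chunk size are assumptions made for this example, not anything known about the target format. Chunks near 8 bits per byte tend to suggest compressed or encrypted data, while low values tend to suggest padding or plain structured records.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def chunk_entropies(data: bytes, chunk_size: int = 4096):
    """Yield (offset, entropy) for each fixed-size chunk of data."""
    for offset in range(0, len(data), chunk_size):
        yield offset, shannon_entropy(data[offset:offset + chunk_size])
```

To use it on a real file, read the file in binary mode and print the offsets whose entropy differs sharply from their neighbours; those boundaries are reasonable places to start looking in the hex editor.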


Which DB to go for with tiny data requirements

I need some help choosing databases for my application.
My web application will basically consist of a main table; let's call it the "User" table.
It will have the user info like name, id, password, address, phone, etc.
There will be 5 other related tables where I will save each user's info,
e.g. a table for books read, a table for songs heard, food eaten, etc.
Overall I don't expect my data to go beyond 1,000 users.
So I have tiny data requirements.
Generally I would have gone with MySQL, but I am feeling a bit adventurous.
I want to try out some of the new solutions on the block.
My requirements are:
1. pure performance
2. good documentation, ease of use
Since my DB size shouldn't be more than a few hundred megabytes, I'd rather keep the entire tablespace in memory for faster performance. How about some of the new NoSQL DBs?
Any recommendations? I have worked mainly with Oracle and MySQL and don't have much idea of all the new exciting stuff out there.
I would suggest going with SQLite if your database requirements are small.
From the SQLite website:
SQLite is a compact library. With all features enabled, the library
size can be less than 350KiB, depending on the target platform and
compiler optimization settings. (64-bit code is larger. And some
compiler optimizations such as aggressive function inlining and loop
unrolling can cause the object code to be much larger.) If optional
features are omitted, the size of the SQLite library can be reduced
below 200KiB. SQLite can also be made to run in minimal stack space
(4KiB) and very little heap (100KiB), making SQLite a popular database
engine choice on memory constrained gadgets such as cellphones, PDAs,
and MP3 players. There is a tradeoff between memory usage and speed.
SQLite generally runs faster the more memory you give it.
Nevertheless, performance is usually quite good even in low-memory environments.
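Since the question asks for the whole data set to live in memory, here is a minimal sketch using Python's standard sqlite3 module with the special `:memory:` database name. The table and column names (`users`, `book_read`) and the helper functions are made up for illustration. Note that an in-memory database is lost when the connection closes, so a file path would be needed for persistence.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a main table and one related table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE users (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            phone TEXT
        );
        CREATE TABLE book_read (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            title   TEXT NOT NULL
        );
    """)
    return conn


def books_for(conn: sqlite3.Connection, name: str) -> list:
    """Return the titles read by the named user, sorted alphabetically."""
    rows = conn.execute(
        "SELECT b.title FROM book_read b "
        "JOIN users u ON u.id = b.user_id "
        "WHERE u.name = ? ORDER BY b.title",
        (name,),
    )
    return [row[0] for row in rows]
```

The other four related tables would follow the same pattern as `book_read`.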
Object-oriented databases such as db4o or Versant can also be used.
Neo4j (for Java) is a pretty awesome tool. It's technically a graph database, but by the sounds of your data model, I think it would be well-suited for you. From what I've seen it performs very well, its documentation was just incredibly good, and if you are using Java then it's like second nature. You basically point it at a directory and it sets up shop there.
If you are feeling adventurous and happen to be using Java, I suggest you give it a try.
I think redis is exactly what you want!
Yesterday I downloaded and installed it for the first time. It runs completely in memory and that meets your performance requirement. (It only writes the data to disk for cases like power failure, like a backup, but this does not slow down the writes to it.)
For Linux and similar systems there is a tar.gz on the download page.
For Windows you can download Dusan's native port; it is precompiled and also includes the client console to try out.
The documentation is very good; for example, there is a page on the data types, and you will also find all the other relevant information there as a fast-to-browse reference.
And there is a nice online tutorial to get started quickly, which is actually fun to work through.
I like the atomic operations like "increment by" and the list structures with push and pop.
There is also a hash type.
For Python there is redis-py.
Being a Python coder myself, I think the data structures that Redis offers match the Python data types very well.

Database or format for help system?

I'm implementing a help system for my app. It's written in C and GTK+.
Help is divided into books. Each book is just a bunch of HTML pages with resources.
The problem I've encountered is that each book is ~30M (I use the WebKit GTK port to display it). After zipping it becomes ~7M, but opening a document becomes extremely slow :( So I'm thinking about using some kind of library that can provide: full-text search index creation, document listing, a tree structure (à la file system), and compression of course.
Any ideas for such a thing?
P.S. Not all of the requirements are "must have". I'm still exploring this part and not sure that all of them are required, so if something is missing, that's OK.
SQLite supports compressed databases (read-only), and is ideal for a single-user database.
However, you should think about the need to compress. Hard disks are so big these days that a 700MB library on a computer isn't too much of a worry.
Alternatively, you could go for Firebird, which as far as I know doesn't support a compressed database, but you could compress your individual pages, in which case you would need to build your own index for full-text search - which I would consider unnecessary work.
Firebird supports a feature called "Embedded Server" which is especially designed for deployment with Windows applications.
I think you should make an ordered list of the features you want, then pick the top two or three and do those before moving on. E.g. getting it into a database is something you'd want to do before you think about compression.
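To make the "compress your individual pages" idea concrete, here is a minimal sketch that stores each HTML page as a zlib-compressed BLOB in SQLite, keyed by a file-system-like path. It is written in Python for brevity; the app itself is in C, where the SQLite C API and zlib would play the same roles. The `page` table and the helper names are hypothetical. Because only the requested page is decompressed, opening one document does not require unpacking the whole book. Full-text search is deliberately left out of this sketch.

```python
import sqlite3
import zlib


def open_book(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a book database holding compressed pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page ("
        "path TEXT PRIMARY KEY, body BLOB NOT NULL)"
    )
    return conn


def put_page(conn: sqlite3.Connection, path: str, html: str) -> None:
    """Compress one page and store it under its path."""
    blob = zlib.compress(html.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO page (path, body) VALUES (?, ?)", (path, blob)
    )


def get_page(conn: sqlite3.Connection, path: str):
    """Return the decompressed page, or None if the path is unknown."""
    row = conn.execute(
        "SELECT body FROM page WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")


def list_dir(conn: sqlite3.Connection, prefix: str) -> list:
    """List all page paths that start with prefix, sorted."""
    rows = conn.execute(
        "SELECT path FROM page WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix),
    )
    return [row[0] for row in rows]
```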

When to use an Embedded Database

I am writing an application which parses a large file, generates a large amount of data and does some complex visualization with it. Since all this data can't be kept in memory, I did some research and I'm starting to consider embedded databases as a temporary container for this data.
My question is: is this a traditional way of solving this problem? And is an embedded database (other than structuring data) supposed to manage data by keeping in memory only a subset (like a cache), while the rest is kept on disk? Thank you.
Edit: to clarify: I am writing a desktop application. The application will be given a file hundreds of MB in size. After reading the file, the application will generate a large number of graphs which will be visualized. Since the graphs may have a very large number of nodes, they may not fit into memory. Should I save them into an embedded database which will take care of keeping only the relevant data in memory? (Do embedded databases do that?) Or should I write my own sophisticated module which does that?
Tough question - but I'll share my experience and let you decide if it helps.
If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):
To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...)
To make it easier to export the data in a flexible manner
To enable access to your processed data to external clients (known format)
To allow more flexible transformation of the data when preparing for viewing
Factors which you should consider when making the decision:
What is the target platform(s) (windows, linux, android, iPhone, PDA)?
What technology base? (Java, .Net, C, C++, ...)
What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
What operational behaviours do you need to take into account (connected to network, disconnected)?
On the typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have built-in features to help with memory management; maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may impact how much you really need to keep in memory at any given point.
If you are dealing with really big files, then you might consider a streaming process approach so you only have in memory a small portion of the overall data at a time - but that doesn't really mean you should (or shouldn't) use an embedded database. Straight text or binary files could work just as well (record based, column based, line based... whatever).
Some databases will allow you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if you have a lot of aggregation required in your base files (by which I mean the files you generate initially from the original source) then an RDBMS engine can be very helpful to simplify your logic. Other options include building your base transform and then adding additional steps to process that into other temporary stores for each specific view, which are then in turn processed for rendering to the target (report?) format.
Just a stream-of-consciousness response - hope that helps a little.
Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some sort of simplifying assumptions for rendering your graphs or investigate methods like segmentation (render sections of the graph and then cache the output before rendering the next section).
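If an embedded database is tried anyway, the streaming idea from this answer could look roughly like the following sketch: graph edges are written to SQLite in batches as they are parsed, and only the neighbours of the node currently being drawn are read back. The `edge` table, the batch size and the function names are assumptions for illustration. With a file path instead of `:memory:`, SQLite keeps only its page cache in memory and leaves the rest on disk.

```python
import sqlite3
from itertools import islice


def store_edges(conn: sqlite3.Connection, edges, batch_size: int = 10000) -> None:
    """Consume an iterable of (src, dst) pairs without holding them all in memory."""
    conn.execute("CREATE TABLE IF NOT EXISTS edge (src INTEGER, dst INTEGER)")
    conn.execute("CREATE INDEX IF NOT EXISTS edge_src ON edge (src)")
    it = iter(edges)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO edge (src, dst) VALUES (?, ?)", batch)


def neighbours(conn: sqlite3.Connection, node: int) -> list:
    """Fetch only the outgoing neighbours of one node."""
    rows = conn.execute(
        "SELECT dst FROM edge WHERE src = ? ORDER BY dst", (node,)
    )
    return [row[0] for row in rows]
```

Whether this is fast enough for interactive rendering depends on the graph and the access pattern, so it is worth measuring before committing to it.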

How would you build a database filesystem (DBFS)?

A database file system is a file system that is a database instead of a hierarchy. Not too complex an idea initially, but I thought I'd ask if anyone has thought about how they might do something like this. What are the issues that a simple plan is likely to miss? My first guess at an implementation would be something like a filesystem for a Linux platform (probably atop an existing file system), but I really don't know much about how that would be started. It's a passing thought that I doubt I'd ever follow through on, but I'm hoping to at least satisfy my curiosity.
DBFS is a really nice PoC implementation for KDE. Instead of implementing it as a file system directly, it is based on indexing on a traditional file system, and building a new user interface to make the results accessible to users.
The easiest way would be to build it using fuse, with a database back-end.
A more difficult thing to do is to have it as a kernel module (VFS).
On Windows, you could use IFS.
I'm not really sure what you mean with "A database file system is a file system that is a database instead of a hierarchy".
Using "Filesystem in Userspace" (FUSE), as mentioned by Osama ALASSIRY, is probably a good idea. The FUSE wiki lists a lot of existing projects for database-backed filesystems, as well as filesystems which you can search with SQL-like queries.
Maybe this is a good starting point for getting an idea how it could work.
It's a basic overview of the Firebird architecture.
Firebird is an opensource RDBMS, so you can have a real deep insight look, too, if you're interested.
It's been a while since you asked this. I'm surprised no one suggested the obvious. Look at mainframes and minis, especially the iSeries OS (now called IBM i; it used to be called i5/OS or OS/400).
Using a relational database as a mass data store is relatively easy. Oracle and MySQL both have these. The catch is that it must be essentially ubiquitous for end-user applications.
So the steps for an app conversion are:
1) Everything in a normal hierarchical filesystem
2) Data in BLOBs with light metadata in the database. File with some catalogue information.
3) Large data in BLOBs with extensive metadata and complex structures in the database. File with substantial metadata associated with it that can be essential to understanding the structure.
4) Internal structures of the BLOB exposed in an object <--> Relational map with extensive meta-data. While there may be an exportable form, the application naturally works with the database, the notion of the file as the repository is lost.
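Step 2 of the list above (data in BLOBs with light metadata) can be sketched as follows, again with SQLite standing in for whatever database would actually be used. The schema, the `attr`/`val` metadata columns and the function names are invented for this example. Files are found by querying their metadata rather than by walking a directory hierarchy.

```python
import sqlite3

# Hypothetical schema: one row per file, plus free-form attribute/value tags.
SCHEMA = """
CREATE TABLE file (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    data BLOB NOT NULL
);
CREATE TABLE tag (
    file_id INTEGER NOT NULL REFERENCES file(id),
    attr    TEXT NOT NULL,
    val     TEXT NOT NULL
);
CREATE INDEX tag_attr_val ON tag (attr, val);
"""


def add_file(conn: sqlite3.Connection, name: str, data: bytes, **tags: str) -> int:
    """Store file contents as a BLOB and attach metadata tags to it."""
    cur = conn.execute("INSERT INTO file (name, data) VALUES (?, ?)", (name, data))
    file_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO tag (file_id, attr, val) VALUES (?, ?, ?)",
        [(file_id, attr, val) for attr, val in tags.items()],
    )
    return file_id


def find(conn: sqlite3.Connection, attr: str, val: str) -> list:
    """Return the names of all files carrying the given tag, sorted."""
    rows = conn.execute(
        "SELECT f.name FROM file f JOIN tag t ON t.file_id = f.id "
        "WHERE t.attr = ? AND t.val = ? ORDER BY f.name",
        (attr, val),
    )
    return [row[0] for row in rows]
```

A FUSE front end, as suggested in the earlier answers, could then present query results as virtual directories on top of such a store.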

SQLite / Firebird embedded for numeric data

I have an experiment streaming up 1Mb/s of numeric data which needs to be stored for later processing.
It seems as easy to write directly into a database as to a CSV file and I would then have the ability to easily retrieve subsets or ranges.
I have experience of sqlite2 (when it only had text fields) and it seemed pretty much as fast as raw disk access.
Any opinions on the best current in-process DBMS for this application?
Sorry - should have added that this is C++, initially on Windows, but cross-platform is nice. Ideally the DB binary file format should be cross-platform.
If you only need to read/write the data, without any checking or manipulation done in the database, then both should do fine. Firebird's database file can be copied, as long as the system has the same endianness (i.e. you cannot copy the file between systems with Intel and PPC processors, but Intel-Intel is fine).
However, if you ever need to do anything with the data beyond simple read/write, then go with Firebird, as it is a full SQL server with all the 'enterprise' features like triggers, views, stored procedures, temporary tables, etc.
BTW, if you decide to give Firebird a try, I highly recommend you use the IBPP library to access it. It is a very thin C++ wrapper around Firebird's C API. It has about 10 classes that encapsulate everything and it's dead easy to use.
If all you want to do is store the numbers and be able to easily do range queries, you can just take any standard tree data structure you have available in the STL and serialize it to disk. This may bite you in a cross-platform environment, especially if you are trying to go cross-architecture.
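A rough sketch of that idea, with two substitutions stated plainly: it is in Python rather than C++, and it uses a sorted array with binary search instead of an STL tree. The byte layout (a little-endian 64-bit count followed by little-endian doubles) is an assumption chosen for this example. Fixing the byte order explicitly is what avoids the cross-architecture problem mentioned above.

```python
import bisect
import struct


def dumps(values) -> bytes:
    """Serialize numbers as a sorted array with a fixed little-endian layout."""
    ordered = sorted(values)
    header = struct.pack("<Q", len(ordered))          # 8-byte element count
    body = struct.pack(f"<{len(ordered)}d", *ordered)  # 8 bytes per double
    return header + body


def loads(data: bytes) -> list:
    """Read back the sorted array written by dumps()."""
    (count,) = struct.unpack_from("<Q", data, 0)
    return list(struct.unpack_from(f"<{count}d", data, 8))


def in_range(sorted_values: list, lo: float, hi: float) -> list:
    """Return all values v with lo <= v <= hi using binary search."""
    start = bisect.bisect_left(sorted_values, lo)
    stop = bisect.bisect_right(sorted_values, hi)
    return sorted_values[start:stop]
```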
As for more flexible and people-friendly solutions, sqlite3 is widely used, solid, stable, and very nice all around.
BerkeleyDB has a number of good features for which one would use it, but none of them apply in this scenario, imho.
I'd say go with sqlite3 if you can accept the license agreement.
Depends what language you are using. If it's C/C++, Tcl, or PHP, SQLite is still among the best in the single-writer scenario. If you don't need SQL access, a Berkeley DB-style library might be slightly faster, like Sleepycat or gdbm. With multiple writers you could consider a separate client/server solution, but it doesn't sound like you need it. If you're using Java, HSQLDB or Derby (shipped with Sun's JVM under the "JavaDB" branding) seem to be the solutions of choice.
You may also want to consider a numeric data file format that is specifically geared towards storing these types of large data sets. For example:
HDF -- the most common and well supported in many languages with free libraries. I highly recommend this.
CDF -- a similar format used by NASA (but useable by anyone).
NetCDF -- another similar format (the latest version is actually a stripped-down HDF5).
This link has some info about the differences between the above data set types:
I suspect that neither database will allow you to write data at such a high speed. You can check this yourself to be sure. In my experience, SQLite failed to INSERT more than 1000 rows per second for a very simple table with a single integer primary key.
In case of a performance problem - I would use CSV format to write the files, and later I would load their data to the database (SQLite or Firebird) for further processing.
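The CSV-then-load approach could be sketched like this in Python. The `sample` table and its two columns are hypothetical. The whole file is inserted inside a single transaction, because committing every row separately is a commonly cited reason for low SQLite INSERT rates; whether that explains the figure quoted above is an assumption, not something measured here.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, csv_file) -> int:
    """Bulk-load (t, value) rows from an open CSV file; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS sample (t REAL, value REAL)")
    reader = csv.reader(csv_file)
    with conn:  # a single transaction for the whole file
        conn.executemany(
            "INSERT INTO sample (t, value) VALUES (?, ?)",
            ((float(t), float(value)) for t, value in reader),
        )
    return conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]


if __name__ == "__main__":
    # Small in-memory demonstration; a real run would open the logged CSV file.
    demo = io.StringIO("0.0,1.5\n0.1,2.5\n0.2,3.5\n")
    connection = sqlite3.connect(":memory:")
    print(load_csv(connection, demo))
```

Once loaded, subsets and ranges can be retrieved with ordinary SQL, for example `SELECT value FROM sample WHERE t BETWEEN ? AND ?`.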
