Implementing a database in a single file - database

This question is about creating a new single file database format. I am new to this!
I wonder how SQLite does this- for databases larger than the available memory, SQLite must be reading from certain parts of the file somehow, i.e. reading at position n?
Is this possible at sub-linear runtime complexity? I assume that when SQLite fetches a particular row, it uses an O(log n) index lookup first, so it doesn't fetch the entire index, and then it fetches the row from a particular location in the file. All of this involves not reading the whole file into memory- but FS methods appear not to provide this functionality.
Is fs.skip(n) [pseudocode] done in O(n) or does the OS skip straight to position n? Theoretically this should be possible because in the OS files are divided into blocks- and inodes reference 1-3 levels of array-like structures that locate the blocks, so fetching a particular block in a file should be possible in sub-linear time- without reading in the entire file.

I wonder how SQLite does this- for databases larger than the available memory, SQLite
must be reading from certain parts of the file somehow, i.e. reading at position n?
Yes. Almost every programming language has documentation that explains how to position the read on a file.
All of this involves not reading the whole file into memory- but FS methods appear not to
provide this functionality.
Every file system access API that I know of supports this, and it is explained in the documentation. Examples range from memory-mapped files in Windows (which are "quite" advanced, and whose Windows-specific API is not portable if you plan to go OS-agnostic) down to something as simple as the fseek() function in C, which positions a file stream.
I suggest brushing up on your knowledge of file-system access methods in your programming language of choice.
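As a minimal sketch of positioned reads (in Python; the helper name read_at and the demo file are made up for illustration), seek() moves the file cursor and read() then fetches only the requested bytes. How quickly the OS locates the underlying block depends on the file system, but the program never has to read the bytes in front of the offset.

```python
import os
import tempfile


def read_at(path, offset, length):
    """Return `length` bytes starting at byte `offset` of the file."""
    with open(path, "rb") as f:
        f.seek(offset)         # reposition the cursor; no data is read here
        return f.read(length)  # read only the requested bytes


# Demo on a throwaway file: about 1 MB of zeros with a marker in the middle.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(b"\x00" * 500_000 + b"MARKER" + b"\x00" * 500_000)
    print(read_at(demo_path, 500_000, 6))  # prints b'MARKER'
```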

Related

LoadRunner: store the contents of many XML files in an array and send them one by one

I have five hundred XML files and want to store their contents in an array or list, then have LoadRunner step through the array and send each XML file to the application server. However, I am not familiar with C.
Example:
Result01.xml,Result02.xml,Result03.xml,Result4.xml,Result05.xml,....Result500.xml,
Thanks!
CPU, Disk, Memory, Network. This is your finite resource pool. Attempting, for every single virtual user, to pull all XML files into memory is going to drive the memory requirements for each virtual user through the roof and you will very likely wind up in a swap of death model on your load generator.
Consider storing the names and locations of all your files in a parameter file, with the files themselves on a very fast SSD local to the load generator. Select a filename from that parameter file at random, read the file from disk, and then submit it as appropriate. This limits your in-memory need to the size of your largest XML file, which you can free() as soon as you are done using the file. This does introduce a disk dependency, but note that it is a read-only dependency, and the recommendation to use an SSD for the storage is because of the absurdly high read IOPS on that media, which reduces the window of conflict to an absolute minimum.
This is also a good time to brush up on your C programming. There are lots of great books out there. You need to be proficient with the language of your tool, no matter what the tool is, if you are going to be effective with the tool.

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons. Encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
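The binary search described above can be sketched as follows. This is only an illustration under assumptions: the record layout (a 4-byte unsigned key followed by a 4-byte unsigned value) and the function names are invented here. Because every record has the same size, the offset of record i is simply i * RECORD.size, so each probe is one seek plus one small read.

```python
import os
import struct
import tempfile

# Hypothetical fixed-length layout: little-endian 4-byte key, 4-byte value.
RECORD = struct.Struct("<II")


def write_records(path, pairs):
    """Write (key, value) pairs sorted by key, one fixed-size record each."""
    with open(path, "wb") as f:
        for key, value in sorted(pairs):
            f.write(RECORD.pack(key, value))


def lookup(path, key):
    """Binary-search the file for `key`; return its value or None."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)  # offset computed, nothing parsed
            k, v = RECORD.unpack(f.read(RECORD.size))
            if k == key:
                return v
            if k < key:
                lo = mid + 1
            else:
                hi = mid
    return None


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "records.bin")
    write_records(demo_path, [(30, 300), (10, 100), (20, 200)])
    print(lookup(demo_path, 20))  # prints 200
```

Only about log2(n) records are read per lookup, regardless of the file size.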
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible, but compact way.

Embedded database specialized for tiny size data & almost no writes?

I am looking for a lightweight embedded database to store (and rarely modify) a few kilobytes of data (5kb to 100kb) in Java applications (mostly Android but also other platforms).
Expected characteristics:
fast when reading, but not necessarily fast when writing
almost no size overhead (kilobytes used even when there is no data), but not necessarily very compact (kilobytes used per kilobyte of actual data)
very small database client library JAR file size
Open Source
QUESTION: Is there a database format specialized for those tiny cases?
Text-based solutions acceptable too.
If relevant: it will be this kind of data.
Stuff it in an object and serialize it out to a file. Write the new file on save, rename it on top of the old one to "commit" it so you don't have to worry about corrupting it if the write fails. No DB, no nothing. Simple.
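A minimal sketch of that write-then-rename pattern, shown in Python and using JSON as the serialization format (both are choices made for this illustration, and the function names are hypothetical). The new contents go to a temporary file in the same directory, and os.replace() then swaps it over the old file, so a failed write leaves the old file untouched.

```python
import json
import os
import tempfile


def save_atomically(path, data):
    """Serialize `data` to a temp file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # the "commit": old file is swapped out
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # clean up the partial temp file
        raise


def load(path):
    """Read the whole (small) data set back into memory."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "settings.json")
    save_atomically(demo_path, {"version": 1})
    save_atomically(demo_path, {"version": 2})
    print(load(demo_path))  # prints {'version': 2}
```

The temporary file is created next to the target because a rename is generally only atomic within one file system.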
If you can use flat (text) files, you could keep the file on disk and read/seek around. Never read it all in at once. If you need faster lookups, you could build an index that maps each key to a record number, use the index to find the right "rows", and then use the record number to fetch the rest of the data, either from a database of constant-size fields or as a line number in a text file.
I don't know Java or that static initializer message, but it sounds to me like a code size limit, not a data limit. Why would the runtime data affect bytecode?
Can't suggest specific libraries. Maybe there's some Berkeley DB, DSV or xBase style library around.

How to implement B+ Tree for file systems?

I have a text file which contains some info on extents about all the files in the file system, like below
C:\Program Files\abcd.txt
12345 100
23456 200
C:\Program Files\bcde.txt
56789 50
26746 300
...
Now I have another binary which tries to find the extents for all the files.
Currently I am using linear search to find the extent info for each file in the above-mentioned text file. This is a time-consuming process. Is there a better way of coding this, such as implementing a good data structure like a B-tree? If a B+ tree is used, what key and branching factor should I use?
Use a database.
The key points in implementing a tree in a file are to have fixed record lengths and to use file offsets instead of pointers.
Use a database. Hmmm, SQLite.
Another point to consider with files is that reading in chunks of data is faster than reading individual items (regardless of whether or not the hard disk has a cache or the OS has a cache). I implemented a B+Tree, which uses pages as its nodes.
Use a database. Databases have already been written and tested.
A more efficient design is to keep the initial node in memory. This reduces the number of fetches from the file. If your program has the space, keeping the first couple of levels in memory may also speed up execution.
Use a database.
I gave up writing a B-Tree implementation for my application because I wanted to concentrate on the other functionality of the program. I later learned that in the real world (the world where programs need to be finished on a schedule) that time should be spent on the 'core' of the application rather than accessories that have already been written and tested (a.k.a. off-the-shelf).
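To illustrate the point above about fixed record lengths and file offsets instead of pointers, here is a small sketch. It deliberately uses a plain binary search tree rather than a B+ tree to stay short, and the node layout and function names are assumptions made for this example. Each node is a fixed-size record, and its "child pointers" are byte offsets of other nodes in the same file.

```python
import os
import struct
import tempfile

# Hypothetical node layout: key, value, left-child offset, right-child offset.
NODE = struct.Struct("<qqqq")
NIL = -1  # marks "no child"


def insert(f, key, value):
    """Append a node and link it from its parent. `f` is open for read/write."""
    f.seek(0, os.SEEK_END)
    new_off = f.tell()
    f.write(NODE.pack(key, value, NIL, NIL))
    if new_off == 0:
        return  # the first node is the root and lives at offset 0
    off = 0
    while True:
        f.seek(off)
        k, v, left, right = NODE.unpack(f.read(NODE.size))
        if key < k:
            if left == NIL:
                left = new_off
                break
            off = left
        else:
            if right == NIL:
                right = new_off
                break
            off = right
    f.seek(off)
    f.write(NODE.pack(k, v, left, right))  # rewrite the parent in place


def search(f, key):
    """Follow child offsets from the root; return the value or None."""
    if f.seek(0, os.SEEK_END) == 0:
        return None  # empty file, no root yet
    off = 0
    while off != NIL:
        f.seek(off)
        k, v, left, right = NODE.unpack(f.read(NODE.size))
        if key == k:
            return v
        off = left if key < k else right
    return None


with tempfile.TemporaryFile() as f:
    for key, value in [(23456, 200), (12345, 100), (56789, 50)]:
        insert(f, key, value)
    print(search(f, 12345))  # prints 100
```

A real B+ tree would pack many keys into each page-sized node so that one disk read covers many comparisons, which is the chunked-read advantage mentioned above.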
It depends on how you want to search your file. I assume that you want to look up your info given a file name. Then a hash table or a trie would be a good data structure to use.
The B-tree is possible but not the most convenient choice given that your keys are strings.
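A rough sketch of the hash-table approach, assuming the text format shown in the question: a path line followed by lines of two integers each. What the two integers mean is not stated in the question, so they are kept as plain pairs, and the function name is made up. The file is parsed once into a dict, after which each lookup by file name is a single hash probe instead of a linear scan.

```python
def load_extents(lines):
    """Parse path lines and their following integer-pair lines into a dict."""
    table = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split()
        is_extent = len(parts) == 2 and all(p.isdigit() for p in parts)
        if is_extent and current is not None:
            table[current].append((int(parts[0]), int(parts[1])))
        else:
            current = line          # any other line starts a new file entry
            table.setdefault(current, [])
    return table


sample = [
    r"C:\Program Files\abcd.txt",
    "12345 100",
    "23456 200",
    r"C:\Program Files\bcde.txt",
    "56789 50",
    "26746 300",
]
extents = load_extents(sample)
print(extents[r"C:\Program Files\bcde.txt"])  # prints [(56789, 50), (26746, 300)]
```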

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500k-1Mb) and have my application open, loop through, find and close a file OR would it be faster to have the records separated and named using some appropriate convention so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan
Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
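The nested-directory naming scheme could be computed with a helper like the one below. The function name and the levels parameter are hypothetical; the data/F/FO/FOOBAR layout is the one from the example above.

```python
def record_path(key, root="data", levels=2):
    """Build a path whose directories are growing prefixes of the key."""
    prefixes = [key[:i] for i in range(1, levels + 1)]
    return "/".join([root, *prefixes, key])


print(record_path("FOOBAR"))  # prints data/F/FO/FOOBAR
```

Since the path is constructed directly from the key, the application can open the record without listing any directory.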
Alternatively, you can make the single large file perform as well by building an index file that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on a key different from the one you used to create the filenames; if you've used an index file, then you can just create a second index for this situation.
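A sketch of the key-offset index idea, under the assumption that each record is one line of the form key, tab, rest (the format and the function names are invented for this example). For brevity the sorted index is kept in memory rather than written to a separate index file. A lookup is a binary search over the index followed by one seek into the data file.

```python
import bisect
import os
import tempfile


def build_index(data_path):
    """Scan the data file once and return a sorted list of (key, offset)."""
    index = []
    with open(data_path, "rb") as f:
        while True:
            offset = f.tell()          # byte position where this line starts
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0]
            index.append((key, offset))
    index.sort()
    return index


def fetch(data_path, index, key):
    """Binary-search the index, then seek straight to the record."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    with open(data_path, "rb") as f:
        f.seek(index[i][1])
        return f.readline().rstrip(b"\n")


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "records.txt")
    with open(demo_path, "wb") as f:
        f.write(b"beta\tsecond record\n")
        f.write(b"alpha\tfirst record\n")
    demo_index = build_index(demo_path)
    print(fetch(demo_path, demo_index, b"alpha"))  # prints b'alpha\tfirst record'
```

A second index on another field would be built the same way, by extracting a different part of each line as the key.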
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.
Given that your data is 1 MB, I would even consider storing it entirely in memory.
To give you some clue about your question, I'd consider that having one single big file means that your application is doing the management of the lines. Having multiple small files means relying on the system and the filesystem to manage the data. The latter can be quite slow though, because it involves system calls for all your operations.
Opening and closing files in C takes time.
For example, if you have 500 files of 2 KB each and you process them all, 1000 additional operations are added to your application (500 file opens and 500 file closes), while having only 1 file of 1 MB would save you those 1000 additional operations. (That is purely my personal opinion.)
Generally it's better to have multiple small files. Keeps memory usage low and performance is much better when searching through it.
But it depends on the amount of operations you'll need, because filesystem calls are much more expensive when compared to memory storage for instance.
This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to linear file search; use a naming convention to pinpoint the file in O(1) time instead.)
The general trade-off is that one big file can be more difficult to update, but lots of little files are fiddly. If you use multiple files and end up with a lot of them, traversing a directory with a million files in it can get very slow. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put these in one directory it would be a nightmare, but having a directory per user ID makes it much more manageable.
Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
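As a small sketch of the embedded option, Python's built-in sqlite3 module can be used as below. The table name, column names, and helper functions are made up for illustration. The demo uses an in-memory database; passing a filename such as "records.db" to sqlite3.connect() keeps everything in one file on disk instead.

```python
import sqlite3


def open_db(path):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def put(conn, key, body):
    """Insert or overwrite one record."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO records (key, body) VALUES (?, ?)",
            (key, body),
        )


def get(conn, key):
    """Return the record body for `key`, or None if it is absent."""
    row = conn.execute(
        "SELECT body FROM records WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


conn = open_db(":memory:")
put(conn, "FOOBAR", "example record")
print(get(conn, "FOOBAR"))  # prints example record
conn.close()
```

The PRIMARY KEY gives SQLite an index on the key column, so lookups by key do not scan the whole table.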
