When to use an Embedded Database - database

I am writing an application which parses a large file, generates a large amount of data and does some complex visualization with it. Since all this data can't be kept in memory, I did some research and I'm starting to consider embedded databases as a temporary container for this data.
My question is: is this a traditional way of solving this problem? And is an embedded database (other than structuring data) supposed to manage data by keeping in memory only a subset (like a cache), while the rest is kept on disk? Thank you.
Edit: to clarify: I am writing a desktop application. The input is a file hundreds of MB in size. After reading the file, the application will generate a large number of graphs which will be visualized. Since the graphs may have a very large number of nodes, they may not fit into memory. Should I save them into an embedded database which will take care of keeping only the relevant data in memory (do embedded databases do that?), or should I write my own sophisticated module which does that?

Tough question - but I'll share my experience and let you decide if it helps.
If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):
To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...)
To make it easier to export the data in a flexible manner
To enable access to your processed data to external clients (known format)
To allow more flexible transformation of the data when preparing for viewing
Factors which you should consider when making the decision:
What is the target platform(s) (windows, linux, android, iPhone, PDA)?
What technology base? (Java, .Net, C, C++, ...)
What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
What operational behaviours do you need to take into account (connected to network, disconnected)?
On the typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have built-in features to help with memory management - maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may impact how much you really need to keep in memory at any given point.
If you are dealing with really big files, then you might consider a streaming process approach so you only have in memory a small portion of the overall data at a time - but that doesn't really mean you should (or shouldn't) use an embedded database. Straight text or binary files could work just as well (record based, column based, line based... whatever).
Some databases will allow you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if you have a lot of aggregation required in your base files (by which I mean the files you generate initially from the original source) then an RDBMS engine can be very helpful to simplify your logic. Other options include building your base transform and then adding additional steps to process that into other temporary stores for each specific view, which are then in turn processed for rendering to the target (report?) format.
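To make the streaming and aggregation points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The "source target weight" line format, the edges table and the function names are assumptions made up for the illustration, not something from the question. Only one batch of parsed records is held in Python at a time, and the aggregation is left to the engine.

```python
import sqlite3
from itertools import islice


def iter_edges(lines):
    """Parse one 'source target weight' line at a time (hypothetical format)."""
    for line in lines:
        src, dst, weight = line.split()
        yield src, dst, float(weight)


def load_edges(conn, lines, batch_size=10_000):
    """Insert records in batches so only one batch lives in memory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT, weight REAL)"
    )
    records = iter_edges(lines)
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", batch)


def out_degree(conn):
    """Let SQL do the aggregation instead of building Python objects."""
    return dict(conn.execute("SELECT src, COUNT(*) FROM edges GROUP BY src"))


# ":memory:" keeps the demo self-contained; a file path would put the data on disk.
conn = sqlite3.connect(":memory:")
load_edges(conn, ["a b 1.0", "a c 2.5", "b c 0.5"], batch_size=2)
degrees = out_degree(conn)
```

With a file path instead of ":memory:", SQLite keeps a bounded page cache in RAM and reads the rest from disk on demand, which is roughly the "subset in memory" behaviour the question asks about.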
Just a stream-of-consciousness response - hope that helps a little.
Edit:
Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some sort of simplifying assumptions for rendering your graphs or investigate methods like segmentation (render sections of the graph and then cache the output before rendering the next section).

Related

Write performance between Filesystem and Database

I have a very simple program for data acquisition. The data comes frequently (around 5200 Hz). One piece of data has around 24 kB, so it is around 122 MB/s.
What would be more efficient only for storing this data? Saving it in raw binary files, or use the database? If the database, then which? SQLite, or maybe some other?
The database, of course, is more tempting, because when saving it to file I would have to separate them by delimiters (data can have different sizes), also processing data would be much easier with the database. I'm not sure about database performance compared to files though, I couldn't find any specific pieces of information about it.
[EDIT]
I am using Linux based OS and SSD disk which supports writing up to 350 MB/s. Data will be acquired with that frequency all the time (with a small service break every day to transfer the data to another machine)
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
Another point is understanding the relational model meaning how you design your database, so that data doesn't need to be repeated over and over.
Moreover, understanding types is important as well. If you have a txt file, you'll need to parse numbers, dates, etc.
From the performance point of view I would say that DBs are slower to start (it is usually faster to open a file than to open a connection to a DB). However, once they are open I can guarantee that a DB is faster than XML or whatever file format you are thinking of using. BTW this is the main purpose of a database: managing huge amounts of data; filesystems are made for storing files.
The last points in favour of a DB are that they can usually handle multi-threading and concurrency problems, which a plain file cannot, and, last but not least, in a database you cannot delete a file by mistake and lose your data.
So my choice would be a DB. Anyway, I hope this info helps you decide what is best for you.
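For comparison, the raw-file route does not necessarily need delimiters for variable-sized data. A common alternative is to prefix every record with its length. The sketch below assumes a 4-byte little-endian length header; that framing and the function names are choices made for the illustration, not part of the question.

```python
import io
import struct

# Assumed framing: each record is a 4-byte little-endian length, then the payload.
HEADER = struct.Struct("<I")


def write_record(stream, payload: bytes) -> None:
    """Append one variable-sized record to a binary stream."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield payloads back in order, failing loudly on a truncated file."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < HEADER.size:
            raise ValueError("truncated header")
        (length,) = HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# BytesIO stands in for a file opened in binary mode.
buf = io.BytesIO()
for sample in (b"\x00" * 3, b"", b"abc\n,;"):
    write_record(buf, sample)
buf.seek(0)
records = list(read_records(buf))
```

Because the length is stored explicitly, the payload may contain any byte values, including whatever would otherwise have been the delimiter.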
-- UPDATE --
Since your needs are more specific now, I tried to dig deeper. I found some solutions that could be interesting for you; however, I don't have experience with any of them, so I can't give you a personal recommendation:
SharedHashFile: SharedHashFile is a lightweight NoSQL key value store / hash table, a zero-copy IPC queue, & a multiplexed IPC logging library written in C for Linux. There is no server process. Data is read and written directly from/to shared memory or SSD; no sockets are used between SharedHashFile and the application program. APIs for C, C++, & nodejs. However keep an eye out for issues because this project seems to be no longer maintained on Github
WhiteDB: another NoSQL database that claims to be really fast; see the speed section of their website
Symas: an extraordinarily fast, memory-efficient database
Just take a look at them, and if you ever use them, please post some feedback here for the community

Database vs File system storage

Database ultimately stores the data in files, whereas File system also stores the data in files. In this case what is the difference between DB and File System. Is it in the way it is retrieved or anything else?
A database is generally used for storing related, structured data, with well defined data formats, in an efficient manner for insert, update and/or retrieval (depending on application).
On the other hand, a file system is a more unstructured data store for storing arbitrary, probably unrelated data. The file system is more general, and databases are built on top of the general data storage services provided by file systems. [Quora]
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
For very complex operations, the filesystem is likely to be very slow.
Main RDBMS advantages:
Tables are related to each other
SQL query/data processing language
Transaction processing addition to SQL (Transact-SQL)
Server-client implementation with server-side objects like stored procedures, functions, triggers, views, etc.
Advantages of the file system over a database management system:
When handling small data sets with arbitrary, probably unrelated data, a file is more efficient than a database.
For simple read and write operations, file operations are faster and simpler.
You can find any number of other differences on the internet.
"They're the same"
Yes, storing data is just storing data. At the end of the day, you have files. You can store lots of stuff in lots of files & folders, there are situations where this will be the way. There is a well-known versioning solution (svn) that finally ended up using a filesystem-based model to store data, ditching their BerkeleyDB. Rare but happens. More info.
"They're quite different"
In a database, you have options you don't have with files. Imagine a textfile (something like tsv/csv) with 99999 rows. Now try to:
Insert a column. It's painful, you have to alter each row and read+write the whole file.
Find a row. You either scan the whole file or build an index yourself.
Delete a row. Find row, then read+write everything after it.
Reorder columns. Again, full read+write.
Sort rows. Full read, some kind of sort - then do it next time all over.
There are lots of other good points but these are the first mountains you're trying to climb when you think of a file based db alternative. Those guys programmed all this for you, it's yours to use; think of the likely (most frequent) scenarios, enumerate all possible actions you want to perform on your data, and decide which one works better for you. Think in benefits, not fashion.
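A small sketch of two of the points above (finding a row, inserting a column), using Python's csv and sqlite3 modules. The table and column names are invented for the example.

```python
import csv
import io
import sqlite3

rows = [(i, f"name{i}") for i in range(1000)]

# Flat file: finding one row means scanning line by line until it turns up.
text = io.StringIO()  # stands in for a CSV file on disk
csv.writer(text).writerows(rows)
text.seek(0)
found_in_csv = next(r for r in csv.reader(text) if r[0] == "742")

# SQLite: the primary key acts as an index, and adding a column is one statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
conn.execute("ALTER TABLE people ADD COLUMN email TEXT")
found_in_db = conn.execute(
    "SELECT id, name, email FROM people WHERE id = ?", (742,)
).fetchone()
```

Note also that the CSV reader hands everything back as strings, while the database returns the id as an integer.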
Again, if you're storing JPG pictures and only ever look for them by one key (their id maybe?), a well-thought filesystem storage is better. Filesystems, btw, are close to databases today, as many of them use a balanced tree approach, so on a BTRFS you can just put all your pictures in one folder - and the OS will silently implement something like an early SQL query each time you access your files.
So, database or files?...
Let's see a few typical examples when one is better than the other. (These are no complete lists, surely you can stuff in a lot more on both sides.)
DB tables are much better when:
You want to store many rows with the exact same structure (no block waste)
You need lightning-fast lookup / sorting by more than one value (indexed tables)
You need atomic transactions (data safety)
Your users will read/write the same data all the time (better locking)
Filesystem is way better if:
You like to use version control on your data (a nightmare with dbs)
You have big chunks of data that grow frequently (typically, logfiles)
You want other apps to access your data without API (like text editors)
You want to store lots of binary content (pictures or mp3s)
TL;DR
Programming rarely says "never" or "always". Those who say "database always wins" or "files always win" probably just don't know enough. Think of the possible actions (now + future), consider both ways, and choose the fastest / most efficient for the case. That's it.
Something one should be aware of is that Unix has what is called an inode limit. If you are storing millions of records then this can be a serious problem. You should run df -i to view the % used as effectively this is a filesystem file limit - EVEN IF you have plenty of disk space.
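The same numbers that df -i prints can be read from a program. A minimal sketch for Unix-like systems using os.statvfs; the helper name and the returned dictionary layout are made up for the example.

```python
import os


def inode_usage(path="/"):
    """Return inode counts for the filesystem that contains `path` (Unix only)."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    used = total - free
    # Some filesystems report 0 total inodes; no percentage makes sense there.
    percent = 100.0 * used / total if total else None
    return {"total": total, "free": free, "used": used, "percent_used": percent}


usage = inode_usage("/")
```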
The differences between a file processing system and a database management system are as follows:
A file processing system is a collection of programs that store and manage files on the computer's hard disk. A database management system, on the other hand, is a collection of programs that enables you to create and maintain a database.
A file processing system has more data redundancy; a DBMS has less.
A file processing system provides less flexibility in accessing data, whereas a DBMS provides more.
A file processing system does not provide data consistency, whereas a DBMS provides data consistency through normalization.
A file processing system is less complex, whereas a DBMS is more complex.
Context: I've written a filesystem that has been running in production for 7 years now. [1]
The key difference between a filesystem and a database is that the filesystem API is part of the OS, thus filesystem implementations have to implement that API and thus follow certain rules, whereas databases are built by 3rd parties having complete freedom.
Historically, databases were created when the filesystems provided by the OS were not good enough for the problem at hand. Just think about it: if you had special requirements, you couldn't just call Microsoft or Apple to redesign their filesystem API. You would either go ahead and write your own storage software or you would look around for existing alternatives. So the need created a market for 3rd party data storage software which ended up being called databases. That's about it.
While it may seem that filesystems have certain rules like having files and directories, this is not true. The biggest operating systems work like that but there are many small OSs that work differently. It's certainly not a hard requirement. (Just remember, to build a new filesystem, you also need to write a new OS, which will make adoption quite a bit harder. Why not focus on just the storage engine and call it a database instead?)
In the end, both databases and filesystems come in all shapes and sizes. Transactional, relational, hierarchical, graph, tabled; whatever you can think of.
[1] I've worked on the Boomla Filesystem which is the storage system behind the Boomla OS & Web Application Platform.
The main differences between database and file system storage are:
A database is a software application used to insert, update and delete data, while a file system is software used to add, update and delete files.
Saving and retrieving files is simpler in a file system, while SQL needs to be learned to perform any query on a database to get (SELECT), add (INSERT) and update the data.
A database provides a proper data recovery process, while a file system does not.
In terms of security, a database is (usually) more secure than a file system.
Migration is very easy with a file system (just copy and paste into the target), while for a database this task is not as simple.

Best way to bundle photos with app: files or in sqlite database?

Let's say that I have an app that lets you browse through a listing of cars found in a Sqlite database. When you click on a car in the listing, it'll open up a view with the description of the car and a photo of the car.
My question is: should I keep the photo in the database as a binary data column in the row for this specific car, or should I have the photo somewhere in the resources directory? Which is better to do? Are there any limitations of Sqlite in terms of how big a binary data column can be?
The database will pretty much be read only and bundled with the app (so the user wouldn't be inserting any cars and their photos).
This is a decision which is discussed quite a lot. And in my opinion, it's a matter of personal taste. Pretty much like vim/emacs, windows/linux kind of debates. Not that heated though.
Both sides have their advantages and disadvantages. When you store them in a database, you don't need to worry much about filename and location. Management is also easier (you delete the row(s) containing the BLOBs, and that's it). But the files are also harder to access and you may need to write wrapper code in one way or the other (f.ex. some of those "download.php" links).
On the other hand, if the binary data is stored as a separate file, management is more complicated (You need to open the proper file from the disk by constructing the filename first). On large data sets you may run into filesystem bottlenecks when the number of files in one directory grows very large (but this can be prevented by creating subdirs easily). But then, if the data is stored as files, replacing them becomes so much easier. Other people can also access it without the need to know the internals (imagine for example a user who takes fun in customizing his/her UI).
I'm certain that there are other points to be raised, but I don't want to write too much now...
I would say: Think about what operations you would like to do with the photos (and the limitations of both storage methods), and from there take an informed decision. There's not much that can go wrong.
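To compare the two approaches side by side, here is a rough sketch with Python's sqlite3 module. The cars table, its columns and the file name are invented for the illustration, and a temporary directory stands in for the app's resources directory.

```python
import sqlite3
import tempfile
from pathlib import Path

photo_bytes = b"\x89PNG fake image data"  # stand-in for a real photo

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cars (id INTEGER PRIMARY KEY, name TEXT, "
    "photo BLOB, photo_path TEXT)"
)

# Option A: the image bytes live in the row itself.
conn.execute(
    "INSERT INTO cars (id, name, photo) VALUES (1, 'Beetle', ?)", (photo_bytes,)
)
blob = conn.execute("SELECT photo FROM cars WHERE id = 1").fetchone()[0]

# Option B: the row only stores a file name; the bytes live on disk.
with tempfile.TemporaryDirectory() as resources:
    (Path(resources) / "mini.png").write_bytes(photo_bytes)
    conn.execute(
        "INSERT INTO cars (id, name, photo_path) VALUES (2, 'Mini', 'mini.png')"
    )
    name = conn.execute("SELECT photo_path FROM cars WHERE id = 2").fetchone()[0]
    from_file = (Path(resources) / name).read_bytes()
```

Option A needs no file name handling but requires the wrapper query to get at the image; option B needs the path to be constructed, but the file can be replaced or opened by other tools without touching the database.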
TX-Log and FS-Journal
On further investigation I found some more info:
SQLite uses a Transaction Log
Android uses YAFFS on the system mount points and VFAT on the SD-Card. Both are (to the best of my knowledge) unjournaled.
I don't know the exact implementation of SQLite's TX-Log, but it is to be expected that each INSERT/UPDATE operation will perform two writes on the disk. I may be mistaken (it largely depends on the implementation of transactions), but frankly, I couldn't be bothered to skim through the SQLite source code. It feels to me like we start splitting hairs here (premature optimization anyone?)...
As both file systems (YAFFS and VFAT) are not journalled, you have no additional "hidden" write operations.
These two points speak in favour of the file system.
Note that this info is to be taken with a grain of salt. I only skimmed over the Google results of YAFFS journaling and sqlite transaction log. I may have missed some details.
The maximum size SQLite supports for a BLOB or string is 2^31-1 bytes (the compile-time default limit, SQLITE_MAX_LENGTH, is 1,000,000,000 bytes). The same limit applies to the total number of bytes stored in a row.
As to which method is better, I do not know. What I would do in your case is test both methods and monitor the memory usage and its effect on battery life. Maybe you'll find the filesystem method has a clear advantage over the other in that area and the ease of storing it in SQLite is not worth it for your use case, or maybe you'll find the opposite.

Database recommendation

I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want any thing complicated. Till now, I'm leaning towards SQLite. It's small, runs within the client process and is claimed to be fast. Besides I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the above-said CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors etc.) and save back to the library. I also expect users to add their own newly-created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism to write data structures to a file and read them back in again called serialization. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program and use one of these serialization libraries to populate the instances of that class or data structure.
edit: The requirement that the structures be mutable doesn't really affect much with regard to my answer - I still think that serialization and deserialization is the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshall SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
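A minimal sketch of the serialization approach with Python's pickle, including one way to make sure the library file is "updated completely and correctly": write to a temporary file first, then swap it into place. Python is used only for illustration (the question targets C#), and the function names and the dictionary layout of a shape are assumptions.

```python
import os
import pickle
import tempfile


def save_library(shapes, path):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            pickle.dump(shapes, tmp)
        os.replace(tmp_path, path)  # replaces the old file in a single step
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_library(path):
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "library.pickle")
    # Hypothetical shape layout: plain dicts of colour and 3D coordinates.
    save_library({"cube": {"color": "red", "vertices": [(0, 0, 0), (1, 1, 1)]}}, path)

    library = load_library(path)
    library["cube"]["color"] = "blue"  # the user edits an object
    save_library(library, path)
    reloaded = load_library(path)
```

One caveat with pickle specifically: it should only be used to load files you trust, since unpickling can execute arbitrary code.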
I suggest you to consider using H2, it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these Objects and their coordinates in an xml file? So basically use a distinct xml file for each object? You can place these xml files in a directory. This can be a simple structure.
I would not use a SQL database. You can easily describe every 3D object with an XML file. Put these files in a directory and pack (zip) them all. If you need easy access to the metadata of the objects, you can generate an index file (containing only the name or description) so that not all objects must be parsed and loaded into memory (nice if you have something like a library manager).
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse and write, human readable, and doesn't need much space when zipped.
I have used Sqlite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like Sqlite for applications where you need good search tools for a huge number of data records.
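The XML-files-in-a-zip idea could look roughly like this in Python. The element and attribute names, the index.txt file and the helper functions are all invented for the sketch, and an in-memory buffer stands in for the archive on disk.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def object_to_xml(name, vertices):
    """Serialize one object as <object name=...><vertex x= y= z=/>...</object>."""
    root = ET.Element("object", name=name)
    for x, y, z in vertices:
        ET.SubElement(root, "vertex", x=str(x), y=str(y), z=str(z))
    return ET.tostring(root, encoding="unicode")


def xml_to_vertices(xml_data):
    """Parse the vertices of one object back into float tuples."""
    root = ET.fromstring(xml_data)
    return [
        (float(v.get("x")), float(v.get("y")), float(v.get("z")))
        for v in root.iter("vertex")
    ]


objects = {"cube": [(0, 0, 0), (1, 1, 1)], "line": [(0, 0, 0), (2, 0, 0)]}

archive = io.BytesIO()  # a real application would use a file path here
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, vertices in objects.items():
        zf.writestr(f"{name}.xml", object_to_xml(name, vertices))
    # Small index so a library browser can list names without parsing every object.
    zf.writestr("index.txt", "\n".join(sorted(objects)))

with zipfile.ZipFile(archive) as zf:
    names = zf.read("index.txt").decode().splitlines()
    cube = xml_to_vertices(zf.read("cube.xml"))
```

Only the index and the one requested object are read here; the other entries in the archive are never decompressed.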
For the specific requirement i.e. to provide a library of objects shipped with the application a database system is probably not the right answer.
First thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
Second thing is that the data you're shipping is immutable - for this purpose therefore you don't need the capabilities of a relational db, just to be able to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET) etc, etc.
Obviously if SQLite stores its data in a single file per database and if you have other reasons to need the capabilities of a db in your storage system then yes, but I'd want to think about the utility of the db to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible

Databases versus plain text

When dealing with small projects, what do you feel is the break even point for storing data in simple text files, hash tables, etc., versus using a real database? For small projects with simple data management requirements, a real database is unnecessary complexity and violates YAGNI. However, at some point the complexity of a database is obviously worth it. What are some signs that your problem is too complex for simple ad-hoc techniques and needs a real database?
Note: To people used to enterprise environments, this will probably sound like a weird question. However, my problem domain is bioinformatics. Most of my programming is prototypes, not production code. I'm primarily a domain expert and secondarily a programmer. Most of my code is algorithm-centric, not data management-centric. The purpose of this question is largely for me to figure out how much work I might save in the long run if I learn to use proper databases in my code instead of the more ad-hoc techniques I typically use.
1) Concurrency. Do you have multiple people accessing the same dataset? Then it's going to get pretty involved to broker all of the different readers and writers in a scalable fashion if you roll your own system.
2) Formatting and relationships: Is your data something that doesn't fit neatly into a table structure? Long nucleotide sequences and stuff like that? That's not really conveniently tabular data.
Another example: Nobody would consider implementing software like Photoshop to store PSDs in a relational format, because the data structures don't really lend themselves to that type of storage or query pattern.
3) ACID (sort of a corollary to #1): If Atomicity, Consistency, Isolation, and Durability are not challenges with a flat file, then go with a flat file.
For me, the line is crossed once I have to query my data in ways that involve more than a single relationship. Relating two flat data structures on disk is fairly simple, but once we get beyond that, a set-based language like SQL and formal database relationships actually reduce complexity.
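As an illustration of a query that spans more than a single relationship, here is a sketch with an invented three-table schema (genes, samples, and an expression table linking them). The names and values are made up; the point is only that the join is a few declarative lines rather than hand-written nested loops over flat files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes   (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE samples (id INTEGER PRIMARY KEY, tissue TEXT);
CREATE TABLE expression (
    gene_id   INTEGER REFERENCES genes(id),
    sample_id INTEGER REFERENCES samples(id),
    level     REAL
);
INSERT INTO genes VALUES (1, 'TP53'), (2, 'BRCA1');
INSERT INTO samples VALUES (1, 'liver'), (2, 'brain');
INSERT INTO expression VALUES (1, 1, 4.0), (1, 2, 6.0), (2, 1, 1.0);
""")

# Two relationships in one query: expression -> genes and expression -> samples.
rows = conn.execute("""
    SELECT g.symbol, s.tissue, e.level
    FROM expression AS e
    JOIN genes   AS g ON g.id = e.gene_id
    JOIN samples AS s ON s.id = e.sample_id
    WHERE e.level > 2
    ORDER BY g.symbol, s.tissue
""").fetchall()
```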
I think at some point you'll miss the querying capabilities of a database, but you can consider some minimalistic database alternatives:
SQLite (Great, almost SQL-92 standard compliant)
shsql
SQL Server Compact
I would only write my own on-disk format under very special circumstances. Reusing someone else's code is nearly always faster.
For relational data, I would use SQLite. For key/value pairs, I would use BerkeleyDB (perhaps via KiokuDB). For simple objects, I would use JSON or YAML, but only if I only had a few.
With SQLite and BDB, "a real database" is literally two lines of code away. It is hard to beat that.
The problem with small projects is that they become bigger before we know it. And once they do, we start missing the SQL capabilities.
Always design such that a db can be utilized later on if required without ripping apart half of the application.
It depends entirely on the domain-specific application needs. A lot of times direct text file/binary files access can be extremely fast, efficient, as well as providing you all the file access capabilities of your OS's file system.
Furthermore, your programming language most likely already has a built-in module (or is easy to make one) for specific parsing.
If what you need is many appends (INSERTs?), mostly sequential access and little or no concurrency, files are the way to go.
On the other hand, when you have requirements for concurrency, non-sequential reading/writing, atomicity or atomic permissions, or your data is relational by nature, you will be better off with a relational or OO database.
There is a lot that can be accomplished with SQLite3, which is extremely light (under 300kb), ACID compliant, written in C, and highly ubiquitous (if it isn't already included in your programming language, as it is in Python for example, there is surely a binding available). It can be useful even on db files as big as 1GB, possibly more.
If your requirements were bigger, there wouldn't even be a discussion: go for a full-blown RDBMS.
For the kind of applications you are developing in bioinformatics, you are often doing one-shot applications (often scripts that define a workflow of calculations) that answer a specific question, and you are not likely to reuse these applications after you have answered your question.
Often, you should therefore avoid creating databases to store the results, as after all you are not going to use their features very much.
You will probably be querying some webservices, files, or databases, run some local algorithms on the data gathered from different sources, and produce some tabular or structured output format (xml, json, etc).
For that, I would suggest you use workflow tools like Knime (or a commercial solution like Inforsense KDE, Accelrys's Pipeline Pilot, or Snaplogic), as they allow you to query data in a variety of formats and locations (RDBMS, flat files, webservices), run algorithms, and build powerful web apps that allow you to easily publish your workflows to your users and let them interact at specific points.
If your prototype "grows" and you have to build more functionality on top of the data your workflows output, and if the output of your prototype is not likely to change everyday, then it's a wise decision to store a subset of the results in a database. This allows you to plug in powerful reporting tools like BusinessObjects, Crystal reports, jasper reports or whatever reporting solution available out there and show data to your users in a better shape than a spreadsheet or a csv file.
Finally, some development frameworks will make your choices more obvious : if you build a web application using an MVC framework, it is likely that your data will reside in an RDBMS (but please, don't put genomic sequences in a table column :-)).
All in all, it's a case by case choice, depending on your needs for each particular application.
In software I can usually get away with storing values in a XML configuration file or in the registry, e.g. software options. Once I need to persist objects I move to a database because the upfront cost is not that bad compared to the long term effects that relations and reporting can offer.
For bioinformatics you may be interested in this: Blast on DB. The guy who is working on it is a friend of mine and works on fast sequence similarity search; he found that his own binary storage works better than databases at this point.
I don't know the specific details of his solution, but you can probably exchange an idea or two by mailing him, or even share code.
Do you need/want SQL queries?
Are multiple people going to want to access the data?
Is your data relational?
If you answered no to those questions, you (probably) don't need a full on database.
First, I'd consider:
How large will the database initially be: # of tables, # of rows
How quickly will it grow?
Is the data frequently queried?
If I were to create a personal recipe app, for example, I know I might add 50 favorite recipes to start and add no more than 5 recipes a year. With that being said, I could easily get by without a database since the size of the data store will have minimal impact on queries.
That said, I would probably use a database for any application where data entry and queries occur (even a small personal recipe app). I don't think it adds a lot of overhead especially when your framework (e.g. Rails) allows you to keep your database dumb (primarily tables, indexes, and constraints). It alleviates the chance that I'll have to eventually port to a database if I decide to scale up.
If you know the format of your data, flat files will be fine if they are faster/easier to develop with. If you expect your record formats to change frequently during development then I'd suggest that ALTER TABLE is your friend. Flat files will also tend to be faster (if you care about speed) unless you expect to implement the equivalent of joins across many combinations of files.
The real benefit of using a RDBMS during development is the flexibility with which you can modify your data schema and the ease with which you can access your data via queries.
Good design will ensure that you keep your data access layer relatively isolated (because of separation of concerns) so it should be a fairly straightforward (if tedious) matter to rework to a database later should it be worthwhile. Or, of course, if you use a database to develop your structures you may subsequently take the app back to flat/indexed files once those structures are crystallized in order to gain performance.
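One possible shape for such an isolated data access layer, sketched in Python. The RecordStore interface, the two backends and the remember_setting function are hypothetical names for the illustration; the application code only ever sees the interface, so swapping flat files for a database (or back) touches one class.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional, Protocol


class RecordStore(Protocol):
    """The only storage interface the rest of the application sees."""

    def save(self, key: str, value: str) -> None: ...
    def load(self, key: str) -> Optional[str]: ...


class FlatFileStore:
    """Keeps everything in one JSON file; fine while the data stays small."""

    def __init__(self, path):
        self.path = path

    def _read(self):
        try:
            with open(self.path, encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def save(self, key, value):
        data = self._read()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def load(self, key):
        return self._read().get(key)


class SqliteStore:
    """Same interface, backed by a SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value TEXT)"
        )

    def save(self, key, value):
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO records VALUES (?, ?)", (key, value)
            )

    def load(self, key):
        row = self.conn.execute(
            "SELECT value FROM records WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None


def remember_setting(store: RecordStore) -> Optional[str]:
    """Application code: unaware of which backend it is talking to."""
    store.save("units", "mm")
    return store.load("units")


with tempfile.TemporaryDirectory() as d:
    from_file = remember_setting(FlatFileStore(os.path.join(d, "data.json")))
from_db = remember_setting(SqliteStore(":memory:"))
```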
Use whatever persistence technology you're most comfortable with, and scales sufficiently.
YAGNI at least means "Don't add a new technology to your personal stack unless you can't be productive with whatever is already there."
For many (most?) of us, our comfort zone for data persistence is SQL. For some, it might be XML. Just don't write your own until (see paragraph 2).
As someone also doing research in Bioinformatics, I would suggest NOT using a database for these kinds of prototype projects unless you are sure it needs it. If you are on the fence, go with the databaseless solution and stick with flat files. It is also important to note that traditionally Bioinformatics researchers have gone the flat file route, which means there are well defined file formats for most types of data in the field. If you decide to go with a database solution, it may hurt your compatibility with existing research projects.

Resources