In our web-based projects we often have static data that could be stored either in a file as an array or in a database table. So which one should be preferred?
In my opinion, arrays have some advantages:
More flexible (an array can take any structure, which lets it express really complex relations)
Better performance (it is loaded into memory, giving better read/write performance than a database's I/O operations)
But my colleague argued that he preferred the DB approach, since it keeps a uniform data-persistence interface and is, in his view, more flexible.
So which should be preferred? How do we choose? Or should we prefer one in some scenarios and the other in others, and if so, what are those scenarios?
EDIT:
Let me clarify something. As Benjamin's change to the title reflects, the data we want to store in an array (file) won't change frequently, which means the code won't change the values of the array at runtime. If the data changed frequently I would use a DB without hesitation. That's why I made this post.
And sometimes it's hard to store really complex relations like:
Task = {
    "1": {
        "name": "xx",
        "requirement": {
            "level": 5,
            "money": 100,
        },
        # ...
    },
}
As in the code sample above (a Python dict, though you can think of it as an array), the requirement field is hard to store in a DB (storing a structure like a pickled object directly in the DB? Not so good, I think). So under such conditions I would prefer arrays.
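For illustration, here is a minimal Python sketch of both options (the schema and names are mine, not from the question). The array approach keeps nested access direct; the DB approach forces you to either serialize the requirement field into one column (JSON being a friendlier pickle) or flatten it into a separate table:

import json, sqlite3

# Option 1: the array/file approach; nested access is direct, no joins needed.
tasks = {
    "1": {"name": "xx", "requirement": {"level": 5, "money": 100}},
}
print(tasks["1"]["requirement"]["level"])            # -> 5

# Option 2: the DB approach; the nested "requirement" is serialized
# into a single TEXT column here (a key/value table would also work).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id TEXT PRIMARY KEY, name TEXT, requirement TEXT)")
conn.execute("INSERT INTO task VALUES (?, ?, ?)",
             ("1", "xx", json.dumps(tasks["1"]["requirement"])))
(raw,) = conn.execute("SELECT requirement FROM task WHERE id = '1'").fetchone()
print(json.loads(raw)["level"])                      # -> 5, but only after a parse step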
So what's your idea? In such a scenario we should prefer arrays to a DB, right?
Regards.
Let's be pragmatic/objective:
Do you write to your data at runtime? Yes: DB. No: File.
Do you update your data more than once per week? Yes: DB. No: File.
Is it a pain to release an updated data file? Yes: DB. No: File.
Do you read that data often? Yes: File/Cache. No: DB.
Is it a pain to update that data file, requiring extra tools? Yes: DB. No: File.
For sure I've forgotten other points, but I guess the basics are there.
The "flexiable" array in a file is fraught with a zillion issues already delt with by using a DB. Unless you can prove that the DB is really going to way slower than using the other approach use a DB. Move on and start solving business problems.
Edit
A comment from the OP asks what the issues with using a file might be; here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard, but it becomes a bottleneck.
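To make that concrete, here is a minimal sketch (Python, POSIX-only; the sidecar lock-file convention is my own choice) of the write-back plumbing a shared data file ends up needing:

import fcntl, json, os, tempfile

def save_data(path, data):
    # Serialize writers with an exclusive lock on a sidecar lock file,
    # then write to a temp file and atomically rename it over the original
    # so readers never observe a half-written file.
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)           # blocks until we own the lock
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)                      # atomic on POSIX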
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing the file back each time a small modification is made isn't going to perform well.
Scalability: Really a function of the first two. In order to achieve any scalability goals you need to be able to quickly modify small bits of the persisted data. In other words, if you don't use a DB you would end up writing one. And if you find you need more than one web server to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very quick one).
Structure: If you use an array, your code will be responsible for managing the structure of the data, querying it, and so on. How will you do that in a way which achieves greater "flexibility" than using a DB? All manner of choices and complexity arise here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure, your array/file code would need to ensure the data is at least not so corrupt that the application cannot continue.
Your colleague is correct, BUT this is where you need to put aside the comp-sci textbook and be pragmatic. How often will you be accessing this data from your application? If it's fairly frequently, then don't incur the access overhead every time. Instead of reading from a flat file you could still gain the advantages of a DB, but use a caching strategy in your application. Depending on your development language you could look at something like memcache or jtreecache.
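Client APIs for memcache and friends vary, so here is a minimal in-process read-through cache, just to illustrate the strategy (the function names and the TTL are arbitrary choices of mine):

import time

_cache = {}   # key -> (expiry timestamp, value)

def cached(key, loader, ttl=300):
    # Return the cached value while it is fresh; otherwise call the
    # loader (e.g. a DB query), remember the result, and return it.
    now = time.time()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]
    value = loader()
    _cache[key] = (now + ttl, value)
    return value

# Hypothetical usage, assuming an existing DB connection `conn`:
# cities = cached("cities", lambda: conn.execute("SELECT * FROM city").fetchall())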
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases scale much more easily than a flat file; if your system starts to grow fast, that flat file may become a burden to your system.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean and changed easily.
But that's only if you're developing in Java.
Yeah, I agree with your implied assessment that databases are overused and basic flat files may work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts), I would definitely go with the file. Even if the application writes to the file, but only in append mode (vs. random inserts/updates) and from one thread, I would also use a file. Anything else: you need a real database with random updates, queries, concurrency control, etc.
Related
I am doing an information retrieval project in C++. What are the advantages of using a database to store terms, as opposed to storing them in a data structure such as a vector? More generally, when is it a good idea to use a database rather than a data structure?
(Shawn): Whenever you want to keep the data beyond the length of the instance of the program. (persistence across time)
(Michael Kjörling): Whenever you want many instances of your program, either in the same computer or in many computers, like in a network or the Net, access and manipulate (share) the same data. (persistence across network space)
Whenever you have a very big amount of data that does not fit into memory.
Whenever you have very complex data structures and would rather not write the code to manipulate them (e.g. search and update them) yourself, since the DB programmers have already written such code, and probably much faster code than you (or I) would write.
Whenever you want to keep the data beyond the length of the instance of the program.
In addition to Shawn pointing out persistence: whenever you want multiple instances of the program to be able to easily share data.
In-memory data structures are great, but they are not a replacement for persistence.
It really depends on the scope. For example if you're going to have multiple applications accessing the data then a database is better because you won't have to worry about file locks, etc. Also, you'd use a database when you need to do things like joining other data, sorting, etc... unless you like to implement Quicksort.
Hi, I'm using CakePHP and I'm wondering if it's advisable to store things that don't change a lot in the database, like the list of cities?
If your application already needs a database, why would you keep data anywhere else?
If the list doesn't change (per installation) and it's reasonably small and frequently used, then it might be worth reading it once on initialization and caching the result to improve performance and reduce the load on the database.
You get all sorts of queries and retrievals out of the box, the same way you access any other of your data. Databases are as cheap as flat files today, but you get a full service.
I see this question has had an answer accepted - I still want to chime in with my $0.02
What I typically do for arrays of static data (country list, timezone list, immutable sets you would use an enum for...) is to use this array datasource.
It allows you to map relationships between db models and array based models and to use the usual find syntax / Containable on the relationships.
http://github.com/jrbasso/array_datasource
If it is pretty much a static list, then you can store it either in the db or a file, but keep it in memory for use. In other words, load it once whether from db or file. What you don't want to do is keep taking a hit loading it. Especially if you use it on most page views. Those little bits of time add up if you have a large number of visitors.
The flip side, of course, is if you find yourself doing this for large lists or lots and lots of little lists. Then you could run into problems of keeping too much in memory.
Bill the Lizard is right about it being important whether or not the list links to other tables. If it does, then you will need it in the db if you need queries that will include it.
There's been a discussion between me and some colleagues who are taking the same class as me (and thus have the same project) about saving data to files and reading from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using a hash table to save the users' profile data. Some of them argue that only some specific information should be saved in the data structures, like an ID that represents a user. Everything else should be put in files, and we should access those files to get the data we want when we want it.
I don't think this is practical... It could be if we were using some library for a database like SQLite or something, but we are not, and I don't think we are supposed to. We are only supposed to code everything ourselves and use C functions, like these. Nor do I think we are supposed to do perfect memory management. The requirements of the project are not for us to code a database, or even a pseudo-database. What this project demands of us are the best data structures (as long as we know how to justify why we picked those instead of others) to store the type of data, and all the data, specified for the project.
I should let you know that we had two classes before this one whose knowledge is to be applied in this project. One of them dealt with the basics of C: functions, structures, arrays, strings, file I/O, recursion, pointers and simple data structures like binary trees and linked lists, stuff like that. The other covered more complex data structures: hash tables, AVL trees, heaps, graphs, etc. It also covered time complexity, big-O notation and stuff like that.
For instance, let's say all I have in memory is the IDs of the users, and then I need to find all friends of a specific user. I'll have to process the whole file (or files) to find the friends of that user. It would be much easier if I could have all that data in memory already.
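That intuition can be shown in a few lines. This sketch is in Python for brevity (in the C project, a hash table of linked lists would play the same role as the dict of sets here):

friends = {}   # user id -> set of friend ids, built once while loading the data file

def add_friendship(a, b):
    friends.setdefault(a, set()).add(b)
    friends.setdefault(b, set()).add(a)

add_friendship(1, 2)
add_friendship(1, 3)
print(friends[1])   # {2, 3}: an O(1) lookup, no file scan required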
It makes no sense to me that we need to pick (and justify) the data structures we see as the best fit for the project and then only use them to look up an ID. We would then need to do a second lookup to get the real data we need, which will take its time, won't it? Why did we bother with the data structures in the first place if we still need to search a bunch of files on the hard drive?
How could it be possible, using standard C functions, coding everything manually and still simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities", and not as much about how you store them. I don't think storing data off in files would be a good solution: file I/O will be much slower than accessing things in memory. If you had the need to persist data on disk, you'd probably want to just use a database rather than files (I know it's an academic course though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or store them in a tree, or a graph, but each will have its implications on how fast you can find users, etc. Depending on what features you want in your social networking site, there will be different designs that will allow different types of behavior to perform better than it would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing that I would ask the teacher is if you're optimizing for time or space complexity (there will be a trade off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (general use of the term; don't think SQL!). (I think the interface was actually written in assembly, but I could do it in C.)
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at some more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
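For a flavor of what such a file involves, here is a toy fixed-size-record layout in Python (the 4-byte id and 20-byte name fields are arbitrary choices of mine; a real header/index is omitted):

import struct

RECORD = struct.Struct("<I20s")    # little-endian u32 id + fixed-width name

def write_record(f, index, user_id, name):
    f.seek(index * RECORD.size)    # records live at predictable offsets
    f.write(RECORD.pack(user_id, name.encode("utf-8")[:20]))

def read_record(f, index):
    f.seek(index * RECORD.size)
    user_id, raw = RECORD.unpack(f.read(RECORD.size))
    return user_id, raw.rstrip(b"\0").decode("utf-8")

with open("users.dat", "w+b") as f:    # note the binary mode
    write_record(f, 0, 42, "alice")
    print(read_record(f, 0))           # (42, 'alice')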
Hmm... what about persistent storage?
If your project requires you to be able to remember friend data between two restarts of the app, then don't you think file storage (or whatever else) becomes an issue?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
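A minimal version of that pattern (the file name and JSON format are my choices, just to keep the sketch simple):

import json, os

def load_state(path="state.json"):
    # Read everything into memory once at startup.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}                       # first run: start empty

def save_state(state, path="state.json"):
    # Write everything back out only on save/exit/backup.
    with open(path, "w") as f:
        json.dump(state, f)

state = load_state()
state["visits"] = state.get("visits", 0) + 1
save_state(state)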
I may have completely misunderstood the question you are trying to ask...
Imagine you're dealing with many strings of text, each about 10,000 characters long, entered by users. Would it be more efficient to write those automatically onto pages, or to insert them into a table in a database? I hope that question is clear enough...
It depends on what sort of "efficiency" you're aiming for.
Here's what I mean:
will you be changing the content of your text strings?
what sorts of searches will you be doing?
when you extract the text, what do you do with it?
My opinion is that provided you're not going to change the content much, nor perform much analysis, you're better off with the database.
10k isn't particularly large, so either is fine. I would personally use the database, as it will allow you to easily search though.
It depends how you're accessing them, but normally using the FS directly would give better performance, for the obvious reason that the DB is another layer built on top of the FS. Using the FS directly, assuming no extra heavy processing (for example, having hundreds of named files instead of one big bloated file ordered in a special way you need to parse), saves you the DBMS operations.
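A sketch of that "hundreds of named files" layout, assuming each string is keyed by an id (the directory and naming scheme are mine):

import os

os.makedirs("texts", exist_ok=True)

def save_text(text_id, body):
    with open(os.path.join("texts", "%s.txt" % text_id), "w", encoding="utf-8") as f:
        f.write(body)

def load_text(text_id):
    # The FS acts as the index: one open() instead of a DBMS round trip.
    with open(os.path.join("texts", "%s.txt" % text_id), encoding="utf-8") as f:
        return f.read()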
I'm wondering if SQLite would be the best of both worlds, or at least, the best database for that size of job.
The real answer here is what you're going to do with these strings.
Databases are meant to be able to quickly return specific records. If you're just going to SELECT * FROM Table and then concat it all together, there's no point in using a database.
However, if you have a relation between your data that you want to be able to search, then a database will likely be more efficient.
E.g., do you want to be able to pull up all the text records from a set of users on a set of dates? Find all records from users who match some criteria?
These kinds of workloads will likely be handled more efficiently by a database than by a naive file implementation, and probably faster than a decent one too, even though the file approach avoids some access layers.
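For instance, here is the kind of indexed lookup a database makes cheap (sqlite3, in-memory; the schema is illustrative only):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (user_id INTEGER, created TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_user_date ON texts (user_id, created)")
conn.execute("INSERT INTO texts VALUES (1, '2010-01-05', 'hello')")
rows = conn.execute(
    "SELECT body FROM texts WHERE user_id IN (1, 2) "
    "AND created BETWEEN '2010-01-01' AND '2010-01-31'"
).fetchall()
print(rows)   # the index serves this directly; a flat file needs a full scan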
There are a lot of considerations. As others said, either approach would work fine for a small number (thousands) of 10k strings.
But what's the rest of your app do? If it does everything in the database, then I'd be inclined to put this there as well; the opposite is true as well.
And how will you be selecting these? Do you need to do complex text searches? If so, a database might not be the best. Or, would you be adding new attributes, searching on those attributes - or matching them against data in other tables? In this common case a database would be better.
And if your data is really vast (many millions of 10k rows) and your performance requirements aren't terribly high - you may want to compress them and store them in the file system.
Lastly, how important is data quality? Given the features of a good database it's much easier to guarantee good data quality with a database.
We are evaluating technologies to store the data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20 MB per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach it. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical (a unique path describes each one's location, at least in filesystems without hard links), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute (or anywhere else as a string; you could have lookup tables galore if you wanted).
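A minimal sketch of the path-as-attribute approach (h5py assumed; the group names are invented for illustration):

import h5py

with h5py.File("scopes.h5", "w") as f:
    parent = f.create_group("/scopes/global")
    child = f.create_group("/scopes/global/foo")
    child.attrs["parent"] = parent.name     # store the absolute path as a string

with h5py.File("scopes.h5", "r") as f:
    child = f["/scopes/global/foo"]
    parent = f[child.attrs["parent"]]       # follow the "pointer" by path lookup
    print(parent.name)                      # /scopes/global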
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write-once, read-many-times model and the format seems to handle this well. I know of a project that used to write both to an Oracle database and to HDF5. Eventually they removed the Oracle output, since performance suffered and no one was using it. Obviously SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, an RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking things when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.