In regards to DBMS integrity, how is an operating system's buffered I/O a threat? I have read multiple articles on why DBMS make use of their own, local cache rather than using OS buffered O/I (a good number of the reads were right here in stackoverflow), however I haven't seen any indication that the buffered O/I might pose an integrity threat to DBMS.
I believe I have found the answer I needed. It relates to transfer errors within the database effecting the integrity, "...when a piece of data is present in the destination table, but not in the source table of a relational database.", as per Talend.com article What is "Data Integrity and Why Is It Important".
Related
Well every database book starts with the story that how earlier people used to store data as files and it was very inconvenient. After database came, things became really easy and seamless, because we can now query data etc. My question is how are the tables really stored in the disk and retrieved ? Aren't they stored as files only or they are just copied to the address space bit by bit, and access via address only ? Or there is a underneath file system and the database server handles accessing the file system and presents the abstraction of a table in front of us.
Might be a very trivial question but, I have not found answer in any book
The question is not trivial, but the distinction between the two is quite apparent.
File systems provide a way to logically view the streams in a hierarchical manner.
A virtual representation of what lies on the disk; which would otherwise just be a binary stream, unreadable.
When we talk about storing data, we can extend a method of writing data to files and later define our own protocols for CRUD'ing on it; thus mimicking a fractional part of what databases do.
There are numerous limitations to storing data in files. If you store them in file and define your own protocol, it will be very specific to you. Plus, there are various other concerns like security, disaster recovery etc etc.
Even though everything is stored in some or the other way on disk, the main advantage databases bring to the table versus files are the mechanisms that they offer.
To minimize the io, we have db caches and numerous other features.
As you imagine a File system to be something that helps visualize and access the data on the disk in streams, we can imagine a database to be such a tool for data - Data systems, which organizes your data. Files can only fractionally do that; again, unless you extend your program to mimic a database.
How the tables are really stored on the disk and retrieved, that's a vast topic. Advise reading your favourite database internals. A book by Korth might also be a good read.
I have a very simple program for data acquisition. The data comes frequently (around 5200 Hz). One piece of data has around 24 kB, so it is around 122 MB/s.
What would be more efficient only for storing this data? Saving it in raw binary files, or use the database? If the database, then which? SQLite, or maybe some other?
The database, of course, is more tempting, because when saving it to file I would have to separate them by delimiters (data can have different sizes), also processing data would be much easier with the database. I'm not sure about database performance compared to files though, I couldn't find any specific pieces of information about it.
[EDIT]
I am using Linux based OS and SSD disk which supports writing up to 350 MB/s. Data will be acquired with that frequency all the time (with a small service break every day to transfer the data to another machine)
The file system is useful if you are looking for a particular file, as operating systems maintain a sort of index. However, the contents of a txt file won't be indexed, which is one of the main advantages of a database.
Another point is understanding the relational model meaning how you design your database, so that data doesn't need to be repeated over and over.
Moreover understanding types is inportant as well. If you have a txt file, you'll need to parse numbers, dates, etc.
For the performance point of view I would say that DB are slower to start (is usually faster to open a file than open a connection to a db). However once they are open I can guarantee that DB is faster then XML or whatever file you are thinking to use. BTW this is the main purpose of a database: manage huge amount of data, filesystems are made for storing files.
Last points for DB is that they usually can handle multi-threading and concurrency problems, which a file cannot and last but not least important in a database you cannot delete a file by mistake and loose your data
So my choice would be a DB and anway I hope that providing you some info you can decide what is best for you
-- UPDATE --
Since you your needs are more specific now I tried to dig deeper: I found some solutions that could be interesting for you however I don't have experience in any of them to provide you a personal suggestion about them:
SharedHashFile: SharedHashFile is a lightweight NoSQL key value store / hash table, a zero-copy IPC queue, & a multiplexed IPC logging library written in C for Linux. There is no server process. Data is read and written directly from/to shared memory or SSD; no sockets are used between SharedHashFile and the application program. APIs for C, C++, & nodejs. However keep an eye out for issues because this project seems to be no longer maintained on Github
WhiteDB another NoSql database that claims to be really fast, go to the speed section of their website to consult it
Symas an extraordinarily fast, memory-efficient database
Just take a look at them and if you ever use them just provide here a feedback for the community
Does google file system provide read consistency. I am confused because I know that the primary maintains write consistency in GFS. If a system provided write consistency is it not that it provides read consistency also?
Thanks
Manjit
The read operation won't mutate the data, there is no concern about the consistency.
From the NimbusDB website:
Our distributed non-blocking atomic commit protocol allows database transaction processing at any available node.
They claim that they can guarantee ACID transactions in a distributed environment, and provide all of: consistency, high availability and partition tolerance. As far as I can tell from the text, their "secret" for overcoming the limitations of CAP theorem is some sort of "predictable and consistent" way to manage network partitions.
I'm wondering if anyone has some insights or more information on what's behind?
There are multiple possible meanings for the word "consistency". See, e.g., Why is C in CAP theorem not same as C in ACID? .
Plus, some level of debate is also possible as to the meaning of the C in 'ACID' : while it is typically defined in a sense that relates to database integrity ("no transaction shall get to see a database state that violates a declared constraint - modulo the inconsistencies that that transaction has created itself of course"), one commenter said he interpreted it as referring to "the database state as seen (or perhaps better, as effectively used) by any transaction does not change while that transaction is in progress. Paraphrased : transactions are ACID-compliant if they are executing in at least repeatable read mode.
If you take the CAP-C to mean "all nodes see the same data at the same time", then availability is necessarily hampered because while the system is busy distributing the data to the various nodes, it cannot allow any transaction access to (the elder versions of) that data. (Unless of course access to elder versions is precisely what is needed, such as when a transaction is running under MVCC.)
If you take the CAP-C to mean something along the lines of "no transaction can get to see an inconsistent database state", then essentially the same applies, except that it is now the user's update process that should be locking out access for all other transactions.
If you impose a rule to the effect that "whenever a transaction has accessed a particular node N to read from some resource R (assuming R could theoretically be accessed on more than one node), then whenever that transaction accesses R again, it should do so on the same node N.", then I can imagine this will increase your guarantee of "consistency", but you pay in availability, because if node N falls out, then precisely because of the rule imposed, your transaction cannot access R anymore even if it could be done on other nodes.
At any rate, I think that if an institution such as Berkeley comes up with a proof of some theorem, then you're on the safe side if you consider vociferous claims such as the one you mention, as marketing lies.
It's been a while since this post was written and since then NuoDB has added a lot to their product marketing and technical resources on their website.
They've achieve data durability and ACID compliance by using their Distributed Data Cache System. They now call it an "Emergent Architecture:" (p.6-7)
The architecture opens a variety of possible future directions including “time-travel”, the ability to create a copy of the database that recreates its state at an earlier time; “cloud bursting”, the ability to move a database across cloud systems managed by separate groups; and
“coteries” a mechanism that addresses the CAP Theorem by allowing the DBA to specify which systems survive a network partition to provide consistency and partition resistance with continuous availability.
From the How It Works page :
Today’s database vendors have applied three common design patterns around traditional systems to extend them into distributed scale-out database systems. These approaches – Shared-Disk, Shared-Nothing and Synchronous Commit - overcome some of the limitations of single-server deployments, but remain complex and prone to error.
By stepping back and rethinking database design from the ground up, Jim Starkey, NuoDB’s technical founder, has come up with an entirely new design approach called Durable Distributed Cache (DDC). The net effect is a system that scales-out/in dynamically on commodity machines and virtual machines, has no single point of failure, and delivers full ACID transactional semantics.
The primary architectural difference between NuodDB's NewSQL model and that of the more traditional RDMS systems is that the NuoDB inverts the traditional relationship between Memory and Storage, creating an ACID compliant RDBMS with an underlying design similar to that of a distributed DRAM cache. From the NuoDB Durable Distributed Cache page:
All general-purpose relational databases to date have been architected around a storage-centric assumption. Unfortunately this creates a fundamental problem relative to scaling out. In effect, these database systems are fancy file systems that arrange for concurrent read/write access to disk-based files such that users do not interfere with each other.
The NuoDB DDC architecture inverts this idea, imagining the database as a set of in-memory container objects that can overflow to disk if necessary and can be retained in backing stores for durability purposes.
All servers in the NuoDB DDC architecture can request and supply objects (referred to as Atoms) thereby acting as peers to each other. Some servers have a subset of the objects at any given time, and can therefore only supply a subset of the database to other servers. Other servers have all the objects and can supply any of them, but will be slower to supply objects that are not resident in memory.
NuoDB consists of two types of servers: Transaction Engines (TEs) hold a subset of the objects; Storage Managers (SMs) are servers that have a complete copy of all objects. TEs are pure in memory servers that do not need use disks. They are autonomous and can unilaterally load and eject objects from memory according to their needs. Unlike TEs, SMs can’t just drop objects on the floor when they are finished with them; instead they must ensure that they are safely placed in durable storage.
For those familiar with caching architectures, you might have already recognized that these TEs are in effect a distributed DRAM cache, and the SMs are specialized TEs that ensure durability. Hence the name Durable Distributed Cache.
They also publish a technical white paper that deep-dives into the sub-system components and the way they work together to provide an ACID-compliant RDMBS with most of the performance of a NoSQL system (NOTE: registration on their site to download the white paper). The general gist is that they provide an automated network cluster partitioning system that, when combined with their persistent storage system, addresses the concerns the CAP Theorem.
There are also a lot of informative technical white papers and independent analysis reports on their technology in their Online Documents Library
Which is more expensive to do in terms of resources and efficiency, File read/write operation or Database Read/Write operation
I was initially going to say database read/write, hands down, as it would include the requisite file io on top of the DB overhead, but then realized its not that simple. If you have your entire DB loaded into memory, reads would be nearly instantaneous as there's no file IO involved.
Writes would, in general, be faster too, as the DB engine doesn't have to wait for the file IO to complete before returning since they can take a "lazy write" approach.
A poorly tuned database, on the other hand, will be orders of magnitude slower than any file based IO. DB tuning matters. A lot.
This is kind of a loaded question. What size files are we talking about? Gigabytes? Also, what type and size of DB? I often use a combination. Do you want to control any data level integrity? If so, you might want to leave that to the DB otherwise you have to control all that at the application level.
There are so many factors to make a good decision on this. For example, when I am creating temporary data that I don't want persisted I use File, but if I am using data I want persisted or backed up, then I use a DB.
This coupled with the architecture is important. If hardware, licensing, or facility is an issue then maybe you don't need the infrastructure of DB servers etc. But if you have the resources then adding a DB layer might be the right choice.
There's no simple answer. With any database you have the overhead of having it running all the time. But then when you access it is generally much faster than accessing a file. If you are talking about just a handful of accesses you won't notice much of difference. But when it gets to hundreds, thousands, and millions of accesses per minute the database will be much faster. And as Tim noted above, a poorly tuned database can be much slower than accessing a flat file.