Is a database a reasonable data structure for memoization? When extremely large amounts of data need to be cached, it may be unreasonable for an ordinary piece of software to keep it all in memory. A database makes it easy to store the results of calculations for later use, meaning calculations can be stopped and restarted at any time without affecting a program's progress. If the database is shared, processing can also be distributed among multiple systems (a computer cluster).
My only reservation is that the delay caused by querying a database may hurt algorithm performance, especially if an algorithm processes many permutations very quickly. Of course, database memoization would only be necessary if the space complexity of an algorithm/application is extremely high (gigabytes). Any thoughts?
If you're asking about large data processed on a single machine, the answer is almost certainly no! On modern hardware, if the answer is not no, then either there is a pattern to the calculation that can be exploited, or the computation should be ruled infeasible. But there are several variations where it can make sense.
The win with memoization comes when the cost of recalculation is greater than the cost of fetching your previous answer. But if your answers fit in RAM, there is no win in using a database, since it is faster to just keep the store in memory. So the only interesting case for the database is where the answers do not fit in RAM.
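For the RAM case, memoization needs nothing more than a dictionary keyed on the arguments. A minimal sketch (plain Python; the function names are invented for the example):

```python
import functools

def memoize(fn):
    cache = {}                          # argument tuple -> previously computed result

    @functools.wraps(fn)
    def wrapper(*args):
        if args not in cache:           # recompute only on a cache miss
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

@memoize
def expensive(n):
    # stand-in for a costly computation
    return sum(i * i for i in range(n))

print(expensive(10_000))                # computed once
print(expensive(10_000))                # served straight from the in-memory dict
```

Anything like this (or functools.lru_cache) will beat a round trip to a database for every lookup.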
Let's suppose, for the sake of argument, that each key/value pair takes a whopping 640 bytes, and that you have 64 GB of RAM available to you. For the data not to fit in RAM, you need over 100 million facts, which are created/accessed randomly. Now consider actual hardware. These facts, when they don't fit in RAM, are stored on a hard drive. The hard drive spins at, let's say, 6k RPM, or 100 times per second. This makes the time to fetch/store a random piece of data an average of 1/200th of a second (on average you have to spin half-way to find your data). So after you fill your data structure, accessing it all again randomly takes 100 million * 0.005 s = 500,000 seconds, which is nearly six days. We're spending days just to read the data back once (let alone create it), with the disk seeking constantly the whole time. (BTW there is some parallelism we can take advantage of here: hard drives can queue up several requests and service them in an order that reduces seek time, but that is limited and will not save you.)
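The back-of-envelope numbers above, spelled out as a quick sanity check:

```python
facts   = 64e9 / 640            # 64 GB of RAM / 640 bytes per pair = 100 million pairs
rpm     = 6_000                 # disk rotation speed
access  = 0.5 / (rpm / 60)      # half a revolution on average = 0.005 s per random fetch
total_s = facts * access        # touch every fact once, in random order

print(total_s)                  # 500000.0 seconds
print(total_s / 86_400)         # roughly 5.8 days
```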
The moral is that randomly accessing large data sets on disk is not feasible. Even if you put a database in front of it. Hard drives are not RAM, and should not be thought of as such.
But all is not lost.
A scenario where the database makes sense is your suggestion of a distributed computation. If your computational steps are expensive, memoized calls are relatively few, and the data can fit in memory, then a database is very convenient. Calls to the database will be fast (things are in memory), you can't simply keep things on a local hard drive (your data is spread out across multiple machines to use CPUs so there is no shared hard drive), and the database may be convenient simply because it is there. (I've used databases this way before, and been very happy.)
However, in this scenario the database is just a key/value store. While a SQL database works, you may want to consider NoSQL solutions. And once you go to NoSQL solutions you have options for data stores where the data has been sharded such that it all fits in RAM, no matter how much data you have. (Yes, you can shard relational databases as well; eBay is a good example of a company that I know does this. But once you do, you tend to lose the "relational" part of it. Yes, I know that several companies claim otherwise; their claims come with significant caveats.)
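To make the key/value idea concrete, here is a minimal sketch that memoizes into SQLite; it is only an illustration, and in the distributed scenario you would point the same logic at whatever shared store the cluster uses. The table and function names are invented for the example:

```python
import json
import sqlite3

conn = sqlite3.connect("memo.db")       # a local stand-in for a shared key/value store
conn.execute("CREATE TABLE IF NOT EXISTS memo (key TEXT PRIMARY KEY, value TEXT)")

def memoized(fn):
    def wrapper(*args):
        key = json.dumps([fn.__name__, list(args)])
        row = conn.execute("SELECT value FROM memo WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])    # previously stored answer
        result = fn(*args)
        conn.execute("INSERT OR REPLACE INTO memo (key, value) VALUES (?, ?)",
                     (key, json.dumps(result)))
        conn.commit()
        return result
    return wrapper

@memoized
def simulate(step):
    # stand-in for an expensive computational step
    return sum(i * step for i in range(1_000_000))

print(simulate(3))   # computed and stored
print(simulate(3))   # fetched from the memo table
```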
In fact when you do a Google search you are running against just this kind of sharded data store, which contains what is essentially memoized answers to a lot of questions about which pages match which key words, and which of those pages are most relevant. Without memoization they could never do it. But they also could never actually do it if they had to go to a hard drive for the answer. (They're also not using SQL...)
I used an open-source Java implementation of the TPC-C benchmark (called TCJ - TPC-C via JDBC, created by MMatejka last year) to compare the performance of Oracle and two OSS DBMSs.
TPC-C is a standard in the proprietary sphere, and my question is:
What are the main reasons that there is no systematically implemented performance test for OSS database systems?
Firstly, I'm not sure your question is a perfect fit for SO, as it is getting close to asking for opinion, so all of my answer is more opinion than fact. Most of this I have read over the years, but I would struggle to find references/proof anymore. I am not a TPC member, but I did heavily investigate trying to get a distributed column-store database tested under the TPC-H suite.
Benchmarks
These are great at testing a single feature and comparing implementations; unfortunately, that is nowhere near as easy as it sounds. Companies will spend large amounts of effort to get better results, sometimes (so I have heard) implementing specific functions in the source just for a benchmark. There is a lot of discussion about how reliable benchmark results are overall. Also, a benchmark may be a perfect fit for one product but not another.
Your example uses JDBC, but not every database has JDBC, or worse, it may be a 'minor bolt-on' just to enable that class of application. So performing benchmarks via JDBC when all main usage will be via embedded SQL may portray some solutions unfairly/poorly.
There are also arguments that benchmarks distract vendors from real priorities: they spend effort and implement features solely for benchmarks.
Benchmarks can also be very easily misunderstood. Even TPC is a suite of different benchmarks, and you need to select the correct one for your needs (TPC-C for OLTP, TPC-H for DSS, etc.).
TPC
If this reads as negative towards TPC, please forgive me; I am pro-TPC.
TPC defines a very tight set of test requirements, and you must follow them to the letter. For TPC-H, this is an example of what you must do:
do multiple runs, some in parallel, some single user
use exactly the SQL provided; you must not change it at all. If you need to because your system uses a slightly different syntax, you must get a waiver.
you must use an external auditor.
you may not index columns etc. beyond what is specified.
for TPC-H you must perform writes in a specified way (which eliminates 'single writer' style databases)
The above ensures that people reading the results can have trust in the integrity of the results, which is great for a corporate buyer.
TPC is a non-profit organization and anybody can join. There is a fee, but it isn't a major barrier, except for OSS projects. You are only realistically going to pay this fee if you think you can achieve really great results, or if you need published results to bid for government contracts, etc.
The biggest problem I see with TPC for OSS is that it is heavily skewed towards relational vendors, and very few OSS solutions can meet the entry criteria with their offerings; or if they do, they may not perform well enough on every test. Doing a benchmark may also be a distraction for some teams.
Alternatives to TPC
Of course alternatives to TPC exist, but none has really gained traction yet, as far as I am aware. Major vendors often stipulate that you cannot benchmark their products and publish the results, so any new benchmark will need to be politically astute to get them on board. I agree with the vendors' stance here; I would hate for someone to mis-implement a benchmark and report my product poorly.
The database landscape has fractured a lot since TPC started, but many 'bet your business' applications still run on 'classic' databases, so those benchmarks still have a place. However, with the rise of NoSQL etc., there is room for new benchmarks, but the real question becomes what to measure: even choosing xyz LIKE '%kitten%' versus xyz LIKE 'kitten%' will have dramatic effects on different solutions (see the sketch below). If you solve that, what common interface will you allow (ODBC, JDBC, HTTP/AJAX, embedded SQL, etc.)? Each of these interfaces affects performance greatly. What about the actual models, such as ACID relational models versus eventual-consistency models? What about hardware/software solutions that use specially designed hardware?
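To make the kitten point concrete, here is a quick sketch against SQLite; the exact plan text varies by engine and version, but the general behaviour (a prefix pattern can use an index, a leading wildcard cannot) is why the choice of benchmark query matters so much:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")   # allows SQLite's LIKE-prefix optimization
conn.execute("CREATE TABLE docs (xyz TEXT)")
conn.execute("CREATE INDEX idx_xyz ON docs (xyz)")

for pattern in ("kitten%", "%kitten%"):
    plan = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM docs WHERE xyz LIKE '{pattern}'"
    ).fetchall()
    print(pattern, "->", plan)

# Typically 'kitten%' shows a SEARCH over an index range,
# while '%kitten%' shows a full SCAN of the table or index.
```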
Each database has made design trade-offs for different needs, and a benchmark is attempting to level the playing field, which is only really possible if you have something in common, or if you report lots of different metrics.
One of the problems with trying to create an alternative is: who will pay? You need consensus over the type of tests to perform, and then you need to audit the results for them to be meaningful. This all costs money.
I'm really interested in non-relational databases, but for many reasons I'm familiar with only a small part of the field. So I want to list the NoSQL technologies you use, with basic use cases, pros, and cons.
If you have had specific issues while working with particular technologies, interesting experiences, etc., you are welcome to share them with the community.
Personally, I have worked with:
MongoDB:
Use cases: In my opinion it is one of the best if you need good aggregation features and automatic replication. It scales well and has many features that let you use it as an everyday database; if for some reason you don't want an SQL solution, Mongo could be a great choice. Mongo is also great if you need dynamic queries, and it supports indexing, which is another important feature.
Pros: Fast, scales well, easy to use, built-in geospatial indexes.
Cons: Comparatively slow writes; blocking atomic operations can cause a lot of problems; the memory-hungry process can "eat" all available memory.
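For readers who have not tried it, a small sketch of the dynamic-query, indexing and aggregation features mentioned above, using the pymongo driver (it assumes a MongoDB server on localhost; the collection and field names are invented):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")    # assumed local mongod
db = client.demo

db.scores.create_index([("player", ASCENDING)])      # secondary index
db.scores.insert_many([
    {"player": "alice", "points": 12},
    {"player": "alice", "points": 30},
    {"player": "bob",   "points": 7},
])

# dynamic (ad-hoc) query: no schema or pre-built view required
print(list(db.scores.find({"points": {"$gt": 10}})))

# aggregation pipeline: total points per player
print(list(db.scores.aggregate([
    {"$group": {"_id": "$player", "total": {"$sum": "$points"}}}
])))
```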
CouchDB:
Use cases: I used it in a wiki-like project and I think for such cases it is the perfect database. The fact that each document is automatically saved as a new revision on update helps you see all the changes. It is good for accumulating, occasionally changing data on which pre-defined queries are to be run.
Pros: Easy to use, REST-oriented interface, document revisions.
Cons: Performance problems when the number of documents gets quite large (more than half a million); rather poor query features (can be addressed by adding Lucene).
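A small sketch of the revision behaviour over CouchDB's REST interface, using plain HTTP (the URL and credentials are assumptions; recent CouchDB versions require an admin account to create databases):

```python
import requests

base = "http://admin:secret@localhost:5984"     # assumed local CouchDB with admin/secret

requests.put(f"{base}/wiki")                                   # create the database
r = requests.put(f"{base}/wiki/page1", json={"text": "v1"})    # first revision
rev1 = r.json()["rev"]

# an update must carry the current revision; CouchDB records a new _rev each time
requests.put(f"{base}/wiki/page1", json={"text": "v2", "_rev": rev1})

print(requests.get(f"{base}/wiki/page1").json())               # latest revision ("v2")
```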
SimpleDB:
Use cases: This is a data service from Amazon, the cheapest of everything they provide. It is very limited in features, so the main use case is when you want to stay on Amazon's services while paying as little as possible.
Pros: Cheap; all data is stored as text, which keeps it simple to operate; easy to use.
Cons: Very many limitations (document size, collection size, attribute count, attribute size). Storing all data as text also creates problems when sorting by date or by number, because SimpleDB sorts lexicographically, which requires workarounds when saving dates or numbers (see the sketch below).
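The usual workaround, sketched in plain Python: zero-pad numbers and store dates in ISO 8601 form so that string ordering matches the natural ordering (the helper names are invented for the example):

```python
from datetime import datetime, timezone

def encode_number(n, width=10):
    # zero-pad so that '9' does not sort after '10'
    return str(n).zfill(width)

def encode_date(dt):
    # ISO 8601 strings sort chronologically
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

values = [7, 1200, 42]
print(sorted(str(v) for v in values))            # ['1200', '42', '7']  -- wrong order
print(sorted(encode_number(v) for v in values))  # padded strings sort numerically

print(encode_date(datetime(2012, 3, 1, 9, 30, tzinfo=timezone.utc)))
```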
Cassandra
Use cases: Cassandra is a perfect solution if writing is your main workload; it is designed for heavy writes (in some cases writing can be faster than reading), so it is perfect for logging. It is also very useful for data analysis, and it has built-in geographical distribution features.
Strengths: Backed by Apache (good community and high quality), fast writes, no single point of failure, easy to manage at scale (easy to deploy and to enlarge the cluster).
Weaknesses: The index implementation has problems, querying by index has some limitations, and if you use indexes, insert performance decreases. There are also problems with streaming data transfer.
I'm gathering information for an upcoming massive online game. I have experience with mega-massive farm-like games (millions of DAU), where SQL databases were a great solution. I have also worked on a massive online game where a NoSQL DB was used, and that particular DB (Mongo) was not the best fit: it handled a large number of connections and a lot of concurrent writes badly.
I'm looking for facts, benchmarks, and presentations about modern massive online games and technical details about their backend infrastructure, databases in particular.
For example I'm interested in:
Can it manage thousands of connections? Maybe some external tool can help (like pgbouncer for Postgres).
Can it manage tens of thousands of concurrent read-writes?
What about disk space fragmentation? Can it be optimized without stopping database?
What about smart replication? Can it tell that some data is missing from a replica when the master fails? Can I safely promote a slave to master and know exactly what data is missing and act appropriately?
Can it fail gracefully? (like postgres for ex.)
Good reviews from production use.
Start with the premise that hard crashes are exceedingly rare, and that when they occur it won't be a tragedy if some information is lost.
Use of the database shouldn't be strongly coupled to the routine management of the game. Routine events ought to be managed through more ephemeral storage. Some secondary process should organize ephemeral events for eventual storage in a database. At the extreme, you could imagine there being just one database read and one database write per character per session.
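A minimal sketch of that idea: routine gameplay events accumulate in memory, and a secondary step flushes them to the database in occasional batches (SQLite stands in for whatever store you actually use; the class and table names are invented):

```python
import sqlite3

class EventBuffer:
    """Collect routine game events in memory; persist them in occasional batches."""

    def __init__(self, db_path="game.db", flush_at=1000):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events (character TEXT, kind TEXT, payload TEXT)"
        )
        self.flush_at = flush_at
        self.pending = []

    def record(self, character, kind, payload):
        self.pending.append((character, kind, payload))
        if len(self.pending) >= self.flush_at:       # the database is touched per batch,
            self.flush()                             # not per event

    def flush(self):
        if self.pending:
            self.conn.executemany("INSERT INTO events VALUES (?, ?, ?)", self.pending)
            self.conn.commit()
            self.pending.clear()

buf = EventBuffer()
for i in range(2500):
    buf.record("char_42", "move", f"step {i}")
buf.flush()     # e.g. at session end: one final write for the character's session
```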
Have you considered NoSQL?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key–value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.
In short, NoSQL database management systems are useful when working with a huge quantity of data when the data's nature does not require a relational model. The data can be structured, but NoSQL is used when what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. Usage examples might be to store millions of key–value pairs in one or a few associative arrays or to store millions of data records. This organization is particularly useful for statistical or real-time analyses of growing lists of elements (such as Twitter posts or the Internet server logs from a large group of users).
There are higher-level NoSQL solutions, for example CouchDB, which has built-in replication support.
I've worked for clients that had a large number of distinct, small to mid-sized projects, each interacting with each other via properly defined interfaces to share data, but not reading and writing to the same database. Each had their own separate database, their own cache, their own file servers/system that they had dedicated access to, and so they never caused any problems. One of these clients is a mobile content vendor, so they're lucky in a way that they do not have to face the same problems that everyday business applications do. They can create all those separate compartments where their components happily live in isolation of the others.
However, for many business applications this is not possible. I've worked with a few clients, for one of whose applications I do production support, where there are "bad data issues" on an hourly basis. Yeah, it's that crazy. Some data script run on one of the instances (lower than production, of course) a couple of weeks ago will have caused some other user's data to get corrupted, and then another data script has to be written to fix the issue. I've seen this happen so much with this client that I have to ask.
I've seen this happening at a moderate rate with other clients, but this one just seems to be out of order.
If you're working with business applications that share a large amount of data by reading and writing to/from the same database, are "bad data issues" that common in your environment?
Bad data issues occur all the time. The only reasonably effective defense is a properly designed, normalized database, preferably one that interacts with the outside world only through stored procedures.
This is why it is important to put the required data rules at the database level and not the application. (Of course, it seems that many systems don't bother at the application level either.)
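A small illustration of pushing the rules into the database itself (SQLite syntax; the tables and values are invented for the example). The bad row is rejected by the engine no matter which application tries to write it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")      # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE payments (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    currency    TEXT NOT NULL CHECK (length(currency) = 3),
    amount      NUMERIC NOT NULL CHECK (amount > 0)
);
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO payments (customer_id, currency, amount) VALUES (1, 'USD', 9.99)")

try:
    # violates both the foreign key and the CHECK on the currency column
    conn.execute("INSERT INTO payments (customer_id, currency, amount) VALUES (2, 'US', 10)")
except sqlite3.IntegrityError as e:
    print("rejected by the database:", e)
```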
It also seems that a lot of people who design data imports don't bother to clean the data before putting it into their system. Of course it's hard to find all the possible ways to mess up the data; I've done imports for years and I still get surprised sometimes. My favorite was the company whose data entry people obviously didn't care about the field names, and whose application just moved to the next field when the first field was full. I got names like "McDonald, Ja" in the last-name field and "mes" in the first-name field.
I do data imports from many, many clients and vendors. Out of hundreds of different imports I've developed, I can think of only one or two where the data was clean. For some reason the email field seems to be particularly bad and is often used for notes instead of emails. It's really hard to send an email to "His secretary is the hot blonde."
Yes, very common. Getting the customer to understand the extent of the problem is another matter. At one customer I had to resort to writing an application which analyzed their database and beeped every time it encountered a record which didn't match their own published data format. I took the laptop with their DB installed to a meeting and ran the program, then watched all the heads at the table swivel around to stare at their DBA while my machine beeped crazily in the background. There's nothing quite like grinding the customer's nose in his own problems to gain attention.
I don't think you are talking about bad data (but it would only be polite of you to answer the various questions raised in comments) but invalid data. For example, '9A!' stored in a field that is supposed to contain a 3-character ISO currency code is probably invalid data, and should have been caught at data entry time. Bad data is usually taken to mean corruption caused by disk errors etc. The former is quite common, depending on the quality of the data input applications, while the latter is pretty rare.
I assume that by "bad data issues" you mean "issues of data that does not satisfy all applicable business constraints".
They can only be a consequence of two things: bad database design by the database designer (that is, either unintentional or, even worse, intentional omission of integrity constraints in the database definition), or else the inability of the DBMS to support the more complex types of database constraint, combined with a flawed program written by the programmer to enforce the DBMS-unsupported integrity constraint.
Given how poor SQL databases are at integrity constraints, and given the poor level of knowledge of data management among the average "modern programmer", yes such issues are everywhere.
If the data gets corrupted because users shut down their application in the middle of complex database updates, then transactions are your friend. That way you don't end up with an entry in the Invoice table but no entries in the InvoiceItems table: unless the transaction is committed at the end of the process, all changes made are rolled back.
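A minimal sketch of that pattern with the Invoice/InvoiceItems example (SQLite syntax; the failure is simulated). Either both tables get the data or neither does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Invoice      (id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE InvoiceItems (invoice_id INTEGER, item TEXT, qty INTEGER);
""")

try:
    with conn:   # opens a transaction; commits on success, rolls back on any exception
        cur = conn.execute("INSERT INTO Invoice (customer) VALUES ('Acme')")
        invoice_id = cur.lastrowid
        conn.execute("INSERT INTO InvoiceItems VALUES (?, 'Widget', 2)", (invoice_id,))
        raise RuntimeError("user killed the app mid-update")    # simulated crash
except RuntimeError:
    pass

# Nothing was committed: no Invoice row without its InvoiceItems (or vice versa).
print(conn.execute("SELECT COUNT(*) FROM Invoice").fetchone()[0])        # 0
print(conn.execute("SELECT COUNT(*) FROM InvoiceItems").fetchone()[0])   # 0
```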
What would be the best DB for inserting records at a very high rate?
The DB will have only one table and the application is very simple: insert a row into the DB and commit it, but the insertion rate will be very high.
Targeting about 5000 row inserts per second.
Any of the very expensive DBs like Oracle/SQL Server are out of the question.
Also, what are the technologies for taking a DB backup, and will it be possible to create one DB from the older backed-up DBs?
I can't use the in-memory capabilities of any DB, as I can't afford to lose rows if the application crashes. I need to commit each row as soon as I receive it.
If your main goal is to insert a lot of data in a little time, perhaps the filesystem is all you need.
Why not write the data to a file, optionally in a DB-friendly format (CSV, XML, ...)? That way you can probably achieve 10 times your performance goal without too much trouble, and most OSs are robust enough nowadays to prevent data loss on application failures.
Edit: As said below, journaling file systems are pretty much designed so that data is not lost in case of software (or even hardware, in case of RAID arrays) failures. ZFS has a good reputation.
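A minimal sketch of the append-to-a-file approach: each row goes out as a CSV line and is forced to disk before the call returns, which is the file-level equivalent of "commit the row as soon as I receive it" (in practice you would sync per batch rather than per row to reach thousands of rows per second):

```python
import csv
import os

def append_row(path, row):
    # open in append mode, write one CSV line, and push it to stable storage
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
        f.flush()
        os.fsync(f.fileno())     # durability costs a sync, just like a DB commit

for i in range(1000):
    append_row("events.csv", [i, "sensor-7", 42.5])
```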
Postgres provides a WAL (Write-Ahead Log): a commit only requires a sequential append to the log, while the actual table pages can sit in RAM until the buffers fill up or the system has time to breathe. Combine a large WAL cache with a UPS (for safety) and you get very efficient insert performance.
If you can't do SQLite, I'd take a look at Firebird SQL if I were you.
To get high throughput you will need to batch inserts into a big transaction; I really doubt you could find any DB that lets you round-trip 5000 times a second from your client.
SQLite can handle tons of inserts (25K per second in a transaction) provided access is not too heavily multithreaded and the inserts are batched.
Also, if structured correctly, I see no reason why MySQL or Postgres would not support 5000 rows per second (provided the rows are not too fat). Both MySQL and Postgres are a lot more forgiving of a larger number of concurrent transactions.
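A quick sketch of the batching point with SQLite: the same 5000 rows committed one at a time versus in a single transaction. The absolute numbers depend on the hardware, but the gap is usually dramatic because each commit implies a sync to disk:

```python
import sqlite3
import time

rows = [(i, "payload") for i in range(5000)]

def run(db_path, batched):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER, data TEXT)")
    start = time.perf_counter()
    if batched:
        with conn:                                    # one transaction for all rows
            conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    else:
        for row in rows:                              # one commit (one sync) per row
            conn.execute("INSERT INTO t VALUES (?, ?)", row)
            conn.commit()
    return time.perf_counter() - start

print("per-row commits:  %.2f s" % run("bench_single.db", batched=False))
print("one transaction:  %.2f s" % run("bench_batched.db", batched=True))
```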
The performance you want is really not that hard to achieve, even on a "traditional" relational DBMS. If you look at the results for unclustered TPC-C (TPC-C is the de-facto standard benchmark for transaction processing) many systems can provide 10 times your requirements in an unclustered system. If you are going for cheap and solid you might want to check out DB2 Express-C. It is limited to two cores and two gigabytes of memory but that should be more than enough to satisfy your needs.