I want to estimate how much disk space my JavaDB database will need, and for that I need to know how character data is encoded, i.e. how many bytes each char will use on disk.
Does anyone know which character encoding Derby uses? I have read that Derby uses Unicode, but couldn't find any information about the actual encoding (neither in the reference manual nor anywhere else).
I believe it's UTF8.
However, whenever I'm trying to figure out how much disk space my database will use, I always take a benchmarking approach: write a little test program that generates a sample database, using sample test data, taking care to make that data as representative of your application as you can.
Generate enough data to be realistic: some nice round number of rows, like 10,000, 100,000, or 1,000,000. Then look at the actual database files that Derby created on disk, see how large they are, do a little bit of math, and you can figure out your result.
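For illustration, a minimal sketch of that benchmark with embedded Derby might look like the following. The table, column sizes, and payload text here are made-up placeholders; mirror your real schema, indexes, and constraints, and this assumes derby.jar is on the classpath:

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class DerbySizeBenchmark {
    public static void main(String[] args) throws Exception {
        String dbDir = "sizeTestDb";   // Derby creates this directory on disk
        int rows = 10_000;             // a nice round, realistic row count

        try (Connection c = DriverManager.getConnection("jdbc:derby:" + dbDir + ";create=true")) {
            try (Statement s = c.createStatement()) {
                // Placeholder schema -- replace with your real tables, indexes, and constraints.
                s.executeUpdate("CREATE TABLE sample (id INT PRIMARY KEY, payload VARCHAR(200))");
            }
            c.setAutoCommit(false);
            try (PreparedStatement ps = c.prepareStatement("INSERT INTO sample VALUES (?, ?)")) {
                for (int i = 0; i < rows; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "representative sample text, roughly the size you expect, row " + i);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            c.commit();
        }

        // Shut down the embedded engine so everything is flushed before measuring.
        try {
            DriverManager.getConnection("jdbc:derby:;shutdown=true");
        } catch (SQLException expected) {
            // Derby signals a successful shutdown with an SQLException (XJ015).
        }

        long bytes = dirSize(new File(dbDir));
        System.out.printf("%d rows -> %d bytes on disk (~%.1f bytes/row)%n",
                rows, bytes, (double) bytes / rows);
    }

    // Sum the sizes of all files Derby wrote under the database directory.
    static long dirSize(File f) {
        if (f.isFile()) return f.length();
        long total = 0;
        File[] children = f.listFiles();
        if (children != null) for (File child : children) total += dirSize(child);
        return total;
    }
}
```

Scale the measured bytes/row by your expected row count, and rerun with your real indexes and constraints in place to see how much overhead they add.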
The nice thing about an approach like this is that it will help you catch mistakes like: forgetting an entire secondary index, or not understanding that your UNIQUE constraint added a bunch of overhead behind the scenes, or forgetting to count some of the columns that you thought wouldn't matter, but turned out to take a substantial amount of space.
I'm building a component for recording historical data. Initially I expect it to do about 30 writes/second, and less than 1 read/second.
The data will never be modified; only new data will be added. Reads are most likely to be of fresh (recently written) records.
The demand is likely to increase rapidly; I expect around 80 writes/second in one year's time.
I could choose to distribute my component and use a common database such as MySQL, or I could go with a distributed database such as MongoDB. Either way, I'd like the database to handle writes very well.
The database must be free. Open source would be a plus :-)
Note: A record is plain text of variable size, typically 50 to 500 words.
Your question can be solved a few different ways, so let's break it down and look at the individual requirements you've laid out:
Writes - It sounds like the bulk of what you're doing is append-only writes at a relatively low volume (80 writes/second). Just about any product on the market with a reasonable storage backend is going to be able to handle this. You're looking at 50-500 "words" of data being saved. I'm not sure what constitutes a word, but for the sake of argument let's assume that a word averages 8 characters, so your data is going to be some kind of metadata (a key/timestamp/whatever) plus 400-4000 bytes of words. Barring implementation-specific details of different RDBMSes, this is still pretty normal; we're probably writing at most (including record overhead) 4100 bytes per record. That maxes out at 328,000 bytes per second or, as I like to put it, not a lot of writing.
Deletes - You also need the ability to delete your data. There's not a lot I can say about that. Deletes are deletes.
Reading - Here's where things get tricky. You mention that it's mostly primary keys and reads are being done on fresh data. I'm not sure what either of these mean, but I don't think that it matters. If you're doing key only lookups (e.g. I want record 8675309), then life is good and you can use just about anything.
Joins - If you need the ability to write actual joins where the database handles them, you've written yourself out of the major non-relational database products.
Data size/Data life - This is where things get fun. You've estimated your writes at 80/second and I guess at 4100 bytes per record, or 328,000 bytes per second. There are 86,400 seconds in a day, which gives us 28,339,200,000 bytes. Terrifying! That's 27,675,000 KB, 27,026 MB, or roughly 26 GB/day. Even if you're keeping your data for 1 year, that's about 9,633 GB, or roughly 10 TB of data. You can lease 1 TB of storage from a cloud hosting provider for around $250 per month or buy it from a SAN vendor like EqualLogic for about $15,000.
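If you want to sanity-check that arithmetic, a throwaway calculation like this one (using the same assumed 4100 bytes per record) reproduces the numbers above:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        long writesPerSecond = 80;
        long bytesPerRecord = 4_100;                               // assumed size incl. record overhead
        long bytesPerSecond = writesPerSecond * bytesPerRecord;    // 328,000
        long bytesPerDay = bytesPerSecond * 86_400;                // 28,339,200,000
        double gbPerDay = bytesPerDay / 1024.0 / 1024.0 / 1024.0;  // ~26.4 GB
        double gbPerYear = gbPerDay * 365;                         // ~9,633 GB, i.e. roughly 10 TB

        System.out.printf("%d bytes/s, %d bytes/day, %.1f GB/day, %.0f GB/year%n",
                bytesPerSecond, bytesPerDay, gbPerDay, gbPerYear);
    }
}
```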
Conclusion: I can only think of a few databases that couldn't handle this load. 10TB is getting a bit tricky and requires a bit of administration skill, and you might need to look at certain data lifecycle management techniques, but almost any RDBMS should be up to this task. Likewise, almost any non-relational/NoSQL database should be up to this task. In fact, almost any database of any sort should be up to the task.
If you (or your team members) already have skills in a particular product, just stick with that. If there's a specific product that excels in your problem domain, use that.
This isn't the type of problem that requires any kind of distributed magical unicorn powder.
OK, for MySQL I would advise you to use InnoDB without any indexes except on the primary key; even then, if you can skip it, that would be good, in order to keep the input flow uninterrupted.
Indexes optimize reading, but they decrease write performance.
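As a rough illustration (the connection URL, credentials, and column layout below are invented placeholders), the write-heavy table can be as bare as this, created over JDBC with Connector/J on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateAppendOnlyTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection settings; adjust host, schema, and credentials.
        try (Connection c = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/history", "user", "password");
             Statement s = c.createStatement()) {
            // Only a primary key, no secondary indexes, so inserts stay cheap.
            s.executeUpdate(
                "CREATE TABLE IF NOT EXISTS records (" +
                "  id BIGINT NOT NULL AUTO_INCREMENT, " +
                "  created_at TIMESTAMP NOT NULL, " +
                "  body TEXT NOT NULL, " +          // 50-500 words of plain text
                "  PRIMARY KEY (id)" +
                ") ENGINE=InnoDB");
        }
    }
}
```

Batching the inserts on the writer side keeps 80 writes/second trivial for a table shaped like this.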
You may also use PostgreSQL, where you likewise need to skip indexes; you won't have a storage-engine choice there, but its write capabilities are also very strong.
The approach you want is actually used in some solutions, but with two database servers, or at least two databases. The first receives a lot of new data (your case), while the second communicates with the first and stores it in a well-structured database (with indexes, rules, etc.). Then, when you need to read or make a snapshot of the data, you query the second server (or second database), where you can use transactions and so on.
You could also take a look at Oracle Express Edition (I think that was its name) and SQL Server Express Edition. Those two have better performance, but also some limitations, so look into them to get a more detailed picture.
I'm generating data at a rate of 4096 bytes every 16.66ms. This data needs to be stored constantly and will be read randomly. It would be nice to have it in a relational database, but I think doing so many inserts would create too much overhead for the processor I'm working with (ARM11). And I don't need all the features that something like SQLite offers.
In fact, just writing this stuff to a file seems tempting because while most of the time I'll just be writing lots of data, when I actually do need to read data, I can just seek to the block I need. However, I just know I'm going to run into some problem along the way. Especially when I leave this thing running for a day and end up with gigabytes of data.
This just seems like a very naive solution to my problem and I need someone else to tell me so I can start thinking about a better solution. Thanks.
You should add some more details to get better answers: what are your use cases, do you need ACID, what storage are you writing to, etc.
What is your OS? Do you only write fixed-size records? Just saying "I will do random access and this is my write rate" is much too unspecific.
You are writing at about 240 KB/s, which is about 20 GB/day.
If you have just fixed-size records, only append data, and use Linux, then a plain file is great. Perhaps think about adding some fsync calls, if your storage is fast enough.
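A bare-bones sketch of that plain-file idea, written in Java for illustration and assuming fixed 4096-byte records addressed by record number (a starting point, not a finished design):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Append-only log of fixed-size records; random reads just seek to index * RECORD_SIZE.
public class FixedRecordLog implements AutoCloseable {
    static final int RECORD_SIZE = 4096;
    private final RandomAccessFile file;

    public FixedRecordLog(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // Appends one record and returns its index.
    public long append(byte[] record) throws IOException {
        if (record.length != RECORD_SIZE) throw new IllegalArgumentException("expected 4096 bytes");
        long index = file.length() / RECORD_SIZE;
        file.seek(file.length());
        file.write(record);
        // Optionally force to disk now and then, if the storage can keep up:
        // file.getFD().sync();
        return index;
    }

    // Random read: jump straight to the block you need.
    public byte[] read(long index) throws IOException {
        byte[] buf = new byte[RECORD_SIZE];
        file.seek(index * RECORD_SIZE);
        file.readFully(buf);
        return buf;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

A real implementation would also need to handle partial writes after a crash and rotate files once the log grows to gigabytes.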
I am dealing with large amounts of scientific data that are stored in tab separated .tsv files. The typical operations to be performed are reading several large files, filtering out only certain columns/rows, joining with other sources of data, adding calculated values and writing the result as another .tsv.
The plain text is used for its robustness, longevity and self-documenting character. Storing the data in another format is not an option, it has to stay open and easy to process. There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
Since I am mostly doing selects and joins, I realized I basically need a database engine with .tsv based backing store. I do not care about transactions, since my data is all write-once-read-many. I need to process the data in-place, without a major conversion step and data cloning.
As there is a lot of data to be queried this way, I need to process it efficiently, utilizing caching and a grid of computers.
Does anyone know of a system that would provide database-like capabilities, while using plain tab-separated files as backend? It seems to me like a very generic problem, that virtually all scientists get to deal with in one way or the other.
There is a lot of data (tens of TBs), and it is not affordable to load a copy into a relational database (we would have to buy twice as much storage space).
You know your requirements better than any of us, but I would suggest you think again about this. If you have 16-bit integers (0-65535) stored in a .tsv file, your storage efficiency is only about 33%: it takes 5 bytes to store most 16-bit integers, plus a delimiter, for 6 bytes total, whereas the native integers take 2 bytes. For floating-point data the efficiency is even worse.
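If you want to see that overhead concretely, a quick throwaway experiment like this one (a million random 16-bit values written both as text and as binary) makes the ratio obvious:

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.util.Random;

// Rough comparison of text vs. binary storage for 16-bit integers (illustrative only).
public class TsvVsBinary {
    public static void main(String[] args) throws Exception {
        int n = 1_000_000;
        Random rnd = new Random(42);
        int[] values = new int[n];
        for (int i = 0; i < n; i++) values[i] = rnd.nextInt(65536); // 0..65535

        try (PrintWriter text = new PrintWriter(new FileOutputStream("values.tsv"))) {
            for (int v : values) text.println(v);          // decimal digits + newline
        }
        try (DataOutputStream bin = new DataOutputStream(new FileOutputStream("values.bin"))) {
            for (int v : values) bin.writeShort(v);        // 2 bytes each
        }

        System.out.println("values.tsv: " + new File("values.tsv").length() + " bytes");
        System.out.println("values.bin: " + new File("values.bin").length() + " bytes");
    }
}
```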
I would consider taking the existing data and, instead of storing it raw, processing it in the following two ways:
1. Store it compressed in a well-known compression format (e.g. gzip or bzip2) on your permanent archiving media (backup servers, tape drives, whatever), so that you retain the advantages of the .tsv format.
2. Process it into a database which has good storage efficiency. If the files have a fixed and rigorous format (e.g. column X is always a string, column Y is always a 16-bit integer), then you're probably in good shape. Otherwise, a NoSQL database might be better (see Stefan's answer).
This would create an auditable (but perhaps slowly accessible) archive with low risk of data loss, and a quickly-accessible database that doesn't need to be concerned with losing the source data, since you can always re-read it into the database from the archive.
This way you should be able to reduce your total storage space, rather than needing twice as much storage space, as you state.
Indexing is going to be the hard part; you'd better have a good idea of what subset of the data you need to be able to query efficiently.
One of these NoSQL DBs might work, though I highly doubt any are configurable to sit on top of flat, delimited files. You might look at one of the open-source projects and write your own database layer.
Scalability begins at a point beyond tab-separated ASCII.
Just be practical - don't academicise it - convention frees your fingers as well as your mind.
I would upvote Jason's recommendation if I had the reputation. My only addition is that if you do not store it in a different format (like the database Jason was suggesting), you pay the parsing cost on every operation instead of just once when you initially process it.
You can do this with LINQ to Objects if you are in a .NET environment: streaming/deferred execution, a functional programming model, and all of the SQL operators. The joins will work in a streaming model, but one side of the join gets pulled entirely into memory, so you need a large-table-joined-to-smaller-table situation.
The ease of shaping the data and the ability to write your own expressions would really shine in a scientific application.
LINQ against a delimited text file is a common demonstration of LINQ; you just need to provide a way to feed LINQ a tabular model. Google "LINQ for text files" for some examples (e.g., see http://www.codeproject.com/KB/linq/Linq2CSV.aspx, http://www.thereforesystems.com/tutorial-reading-a-text-file-using-linq/, etc.).
Expect a learning curve, but it's a good solution for your problem. One of the best treatments on the subject is Jon Skeet's C# in Depth. Pick up the "MEAP" version from Manning for early access to his latest edition.
I've done work like this before with large mailing lists that needed to be cleansed, deduped, and appended. You are invariably I/O bound. Try solid-state drives, particularly Intel's "E" series, which have very fast write performance, and RAID them in as parallel a configuration as possible. We also used grids, but had to adjust the algorithms to multi-pass approaches that would reduce the data.
Note that I would agree with the other answers that stress loading into a database and indexing if the data is very regular. In that case you're basically doing ETL, which is a well-understood problem in the warehousing community. If the data is ad hoc, however, with scientists just dropping their results in a directory, and you need "agile/just-in-time" transformations, and most transformations are single-pass select ... where ... join, then you're approaching it the right way.
You can do this with VelocityDB. It is very fast at reading tab-separated data into C# objects and databases. The entire Wikipedia text is a 33 GB XML file; it takes 18 minutes to read in and persist as objects (one per Wikipedia topic) and store in compact databases. Many samples showing how to read in tab-separated text files are included as part of the download.
The question's already been answered, and I agree with the bulk of the statements.
At our centre, we have a standard talk we give, "so you have 40TB of data", as scientists are newly finding themselves in this situation all the time now. The talk is nominally about visualization, but primarily about managing large amounts of data for those that are new to it. The basic points we try to get across:
Plan your I/O:
  - Binary files
  - As much as possible, large files
  - File formats that can be read in parallel, with subregions extractable
  - Avoid zillions of files
  - Especially avoid zillions of files in a single directory
Data management must scale:
  - Include metadata for provenance
  - Reduce the need to re-do work
  - Sensible data management
  - A hierarchy of data directories only if that will always work
  - Databases and formats that allow metadata
Use scalable, automatable tools:
  - For large data sets, parallel tools: ParaView, VisIt, etc.
  - Scriptable tools: gnuplot, Python, R, ParaView/VisIt...
  - Scripts provide reproducibility!
We have a fair amount of stuff on large-scale I/O generally, as this is an increasingly common stumbling block for scientists.
Why is it that every RDBMS insists that you tell it what the max length of a text field is going to be... why can't it just infer this information from the data that's put into the database?
I've mostly worked with MS SQL Server, but every other database I know also demands that you set these arbitrary limits on your data schema. The reality is that this is not particularly helpful or friendly to work with, because the business requirements change all the time and almost every day some end-user is trying to put a lot of text into that column.
Does anyone with knowledge of the inner workings of an RDBMS know why we don't just infer the limits from the data that's put into the storage? I'm not talking about guessing the type information, but guessing the limits of a particular text column.
I mean, there's a reason why I don't use nvarchar(max) on every text column in the database.
Because computers (and databases) are stupid. Computers don't guess very well and, unless you tell them, they can't tell that a column is going to be used for a phone number or a copy of War and Peace. Obviously, the DB could be designed so that every column could contain an infinite amount of data -- or at least as much as disk space allows -- but that would be a very inefficient design. In order to get efficiency, then, we make a trade-off and make the designer tell the database how much we expect to put in the column. Presumably, there could be a default so that if you don't specify a length, it simply uses the default. Unfortunately, any default would probably be inappropriate for the vast majority of people from an efficiency perspective.
This post not only answers your question about whether to use nvarchar(max) everywhere, but it also gives some insight into why databases historically didn't allow this.
It has to do with speed. If the max size of a string is specified you can optimize the way information is stored for faster i/o on it. When speed is key the last thing you want is a sudden shuffling of all your data just because you changed a state abbreviation to the full name.
With the max size set the database can allocate the max space to every entity in that column and regardless of the changes to the value no address space needs to change.
This is like saying, why can't we just tell the database we want a table and let it infer what type and how many columns we need from the data we give it.
Simply, we know better than the database will. Suppose you have a one-in-a-million chance of putting a 2,000-character string into the database; most of the time it's 100 characters. The database would probably blow up on, or refuse, the 2k-character string. It simply cannot know that you're going to need 2k length if for the first three years you've only entered 100-character strings.
Also, the declared character lengths are used to optimize row placement so that rows can be read/skipped faster.
I think it is because an RDBMS uses random data access. To do random data access, it must know which address on the disk it has to jump to in order to read the data quickly. If every row of a single column had a different data length, it could not infer the starting address it needs to jump to directly; the only way would be to load all the data and scan through it.
And if the RDBMS changed the fixed data length of a column (for example, to the max length over all rows) every time you add, update, or delete, that would be extremely time-consuming.
What would the DB base its guess on? If the business requirements change regularly, it's going to be just as surprised as you. If there's a reason you don't use nvarchar(max), there's probably a reason it doesn't default to that as well...
Check this thread: http://www.sqlservercentral.com/Forums/Topic295948-146-1.aspx
For the sake of an example, I'm going to step into some quicksand and suggest you compare it with applications allocating memory (RAM). Why don't programmers ask for/allocate all the memory they need when the program starts up? Because often they don't know how much they'll need. This can lead to apps grabbing more and more memory as they run, and perhaps also releasing memory. And you have multiple apps running at the same time, and new apps starting, and old apps closing. And apps always want contiguous blocks of memory, they work poorly (if at all) if their memory is scattered all over the address space. Over time, this leads to fragmented memory, and all those garbage collection issues that people have been tearing their hair out over for decades.
Jump back to databases. Do you want that to happen to your hard drives? (Remember, hard drive performance is very, very slow when compared with memory operations...)
Sounds like your business rule is: Enter as much information as you want in any text box so you don't get mad at the DBA.
You don't allow users to enter 5000 character addresses since they won't fit on the envelope.
That's why Twitter has a text limit and saves everyone the trouble of reading through a bunch of mindless drivel that just goes on and on and never gets to the point, but only manages to infuriate the reader, making them wonder why you have such disregard for their time by choosing a self-centered and inhumane lifestyle focused on promoting the act of copying and pasting as much data as the memory buffer gods will allow...
What I want to create is a huge index over an even bigger collection of data. The data is a huge collection of images (and I mean millions of photos!) and I want to build an index on all unique images.
So I calculate a hash value of every image and combine it with the width, height and file size of the image. This should give a practically unique key for every image. This would be combined with the location of the image, or locations in the case of duplicates.
Technically speaking, this would fit perfectly in a single database table. A unique index on file name, plus an additional non-unique index on hash-width-height-size, would be enough. However, I could use an existing database system to solve this, or just write my own, optimized version. It will be a single-user application anyway and the main purpose is to detect when I add a duplicate image to the collection, so it will warn me that I already have it in my collection and display the locations where the other copies are. I can then decide to still add the duplicate or to discard it.
I've written hash-table implementations before and it's not that difficult once you know what you have to be aware of. So I could just implement my own file format for this data. It's unlikely that I'll ever need to add more information to these images and I'm not interested in similar images, just exact images. I'm not storing the original images in this file either, just the hash, size and location.
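Roughly, I'm picturing something like this for building the key (just a sketch; the exact hash algorithm and key format are still open):

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.file.Files;
import java.security.MessageDigest;
import javax.imageio.ImageIO;

public class ImageKey {
    // Builds a "hash-widthxheight-size" string for one image file.
    public static String keyFor(File imageFile) throws Exception {
        byte[] data = Files.readAllBytes(imageFile.toPath());

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest(data)) hex.append(String.format("%02x", b));

        BufferedImage img = ImageIO.read(imageFile);   // null if the format isn't supported
        int w = (img != null) ? img.getWidth() : -1;
        int h = (img != null) ? img.getHeight() : -1;

        return hex + "-" + w + "x" + h + "-" + data.length;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(keyFor(new File(args[0])));
    }
}
```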
From experience, I know this could run extremely fast. I've done it before and have been doing similar things for nearly three decades, so it's likely that I will choose this solution.
But I do wonder... Doing the same with an existing database system like SQL Server, Oracle, Interbase or MySQL, would performance still be high enough? There would be about 750 TB of images indexed in this database, which roughly translates to around 30 million records in a single, small table. Is it even worth considering the use of a regular database?
I have doubts about the usability of a database for this project. The amount of data is huge, yet the structure is really simple. I don't need multi-user support or most other features that most databases provide. So I don't see a need for a database. But I'm interested in the opinions of other programmers about this. (Although I expect most will agree with me here.)
The project itself, which is still just an idea in my head, is supposed to be some tool or add-on for Explorer or whatever. Basically, it builds an index for any external hard disk that I attach to the system, and when I copy an image to this disk somewhere, it's supposed to tell me if the image already exists on this disk. It will allow me to avoid filling up my backup disks with duplicates, although I sometimes would like to add duplicates. (E.g. because they're part of a series.) Since I like to create my own rendered artwork I have plenty of images. Plus, I've been taking pictures with digital cameras since 1996, so I also have a huge collection of photos. Add some other large collections to this and you'll soon realise that the amount of data will be huge. (And yes, there are already plenty of duplicates in my collection...)
Since it's a single-user application that you are considering, I'd probably have a look at SQLite. It ought to fit your other requirements rather nicely, I'd say.
I just tested the performance of PostgreSQL on my laptop (Core 2 Duo T5800 2.0 GHz 3.0 GiB RAM). I have a table with slightly more than 100M records, 5 columns and some indexes. I performed a range query on one indexed column (not the primary key) and returned all columns. A mean query returned 75 rows and executed in 750ms. You have to decide if this is fast enough.
I would avoid DIY-ing it unless you know all the repercussions of what you're doing.
Transactional Consistency for example, is not trivial.
I would suggest designing your code in such a way that the backend can be easily replaced later, then running with something sane (SQLite is a good starting choice), developing it in the most sane and rational way possible, and then trying to slot in the alternative backing store.
Then profile the differences, and run regression tests against it to make sure your database is not worse than SQLite.
Existing database solutions tend to win because they've had years of improvement and fine-tuning to get their benefits, and a naïve attempt will likely be slower, buggier, and do less, all the while increasing your development load to purely MONUMENTAL proportions.
http://fetter.org/optimization.html
The first rule of Optimization is, you do not talk about Optimization.
The second rule of Optimization is, you DO NOT talk about Optimization.
If your app is running faster than the underlying transport protocol, the optimization is over.
One factor at a time.
No marketroids, no marketroid schedules.
Testing will go on as long as it has to.
If this is your first night at Optimization Club, you have to write a test case.
Also, with databases, there is one thing you utterly MUST get ingrained in your head.
Speed is unimportant
Your data being there when you need it, that is important.
When you have the assuredness that your data will always be there, then you may worry about trivial concerns like speed.
Hashes
You also mention that you'll be using image SHAs/MD5s etc. to deduplicate images. This is a fallacious notion on its own: hashes of files can only tell you when files are different, not that they're the same.
The logic is akin to asking 30 people to flip a coin, and you see the first one get heads, and thus decide to delete every other person who gets a head, because they're obviously the same person.
https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer
Although you may think it unlikely you'd have 2 different files with the same hash, your odds are about as good as winning the lotto. The chances of you winning the lotto are low, but somebody wins the lotto every day. Don't let it be you.