Database that consumes less disk space - database

I'm looking at solutions to store a massive quantity of information consuming the less possible disk space.
The information structure is very simple and the queries will also be very simple.
I've looked at solutions like Apache Cassandra and relations databases but couldn't find a comparison where disk usage is mentioned.
Any ideas on this would be great.

Speaking about Apache Cassandra - it's just a disk space hog. 200 MB of logs resulted in 1.2 GB files produced by Cassandra - and the keyspace was just 4 columns with 200 length strings.

Take a look at Oracle Berkeley DB - very simple robust database (key/value):
"Berkeley DB enables the development of custom data management solutions, without the overhead traditionally associated with such custom projects. Berkeley DB provides a collection of well-proven building-block technologies that can be configured to address any application need from the handheld device to the datacenter, from a local storage solution to a world-wide distributed one, from kilobytes to petabytes."

Redis might worth a check if you can store your data in key-value

Newest version of Microsoft's SQL Server (2008) supports several levels of compression (row compression and page compression, in addition to backup compression). Might be worth investigating.
Some relevant resources:
Linchi Shea shows that compression can sometimes improve performance
Official MS Best Pracices doc for SQL 2008 compression

Related

Choosing the right database engine for relational data with billions of records

My Python application data structure is pure relational.
My estimation for the biggest table is around 10 billion rows each year (all the other tables are very small).
Each row size is about 20-30 bytes
What is the right database engine for me?
You might consider the following that I have used, but of course this will depend on what your data looks like and how your APP/Users need to interact with it. This is not an exhaustive list, it's only the stuff I have used.
Greenplum database is a open source distributed Postgres database. http://greenplum.org/
It scales nicely and supports pretty much all Postgres stuff except for full text indexing last I knew
Apache Phoenix: An open source sql layer on top of Hadoop/HBase. It scales nicely, but the ecosystem is a bit complex (as Per Hadoop). Cloudera's Impala is similar. https://phoenix.apache.org/
Oracle Partitioning (preferably on RAC). If you can afford the license, Oracle partitioning allows for sharding of your data in various ways. If you have it with RAC, that will also provide parallel query execution
Just partition your data (on any RDBMS) and put the partitions on good disk
Those are the 4 ideas I have actually used, and remember, on good hardware, with some table partitioning, 10B rows isn't really all that much, so you might just need to get a better box[s] and hook it to a SAN with SSD of some kind over 10G network or better. ALso think about putting indexes on a separate disk from where the db files are, and always use SSD if you can afford it.
Anyway, HTH
MG
At 30 bytes per row that's less than 300GB, which is a small database, well within the capabilities of Oracle or SQL Server Enterprise editions. You won't need Oracle RAC.
You'll need to pay attention to application design and indexing/partitioning. Query and storage optimization will have a greater impact on performance than the choice of DBMS will.

SQL server scalability question

We are trying to build an application which will have to store billions of records. 1 trillion+
a single record will contain text data and meta data about the text document.
pl help me understand about the storage limitations. can a databse SQL or oracle support this much data or i have to look for some other filesystem based solution ? What are my options ?
Since the central server has to handle incoming load from many clients, how will parallel insertions and search scale ? how to distribute data over multiple databases or tables ? I am little green to database specifics for such scaled environment.
initally to fill the database the insert load will be high, later as the database grows, search load will increase and inserts will reduce.
the total size of data will cross 1000 TB.
thanks.
1 trillion+
a single record will contain text data
and meta data about the text document.
pl help me understand about the
storage limitations
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents, at 1024 bytes total storage per document (VERY unlikely to be realistic when you say text) is a size of about 950 terabyte of data. Storage limitations means you talk high end SAN here. Using a non-redundant setup of 2tb discs that is 450 discs. Make the maths. Adding redundancy / raid to that and you talk major hardware invesment. An this assumes only 1kb per document. If you have on average 16kg data usage, this is... 7200 2tb discs.
THat is a hardware problem to start with. SQL Server does not scale so high, and you can not do that in a single system anyway. The normal approach for a docuemnt store like this would be a clustered storage system (clustered or somehow distributed file system) plus a central database for the keywords / tagging. Depending on load / inserts possibly with replciations of hte database for distributed search.
Whatever it is going to be, the storage / backup requiments are terrific. Lagre project here, large budget.
IO load is gong to be another issue - hardware wise. You will need a large machine and get a TON of IO bandwidth into it. I have seen 8gb links overloaded on a SQL Server (fed by a HP eva with 190 discs) and I can imagine you will run something similar. You will want hardware with as much ram as technically possible, regardless of the price - unless you store the blobs outside.
SQL row compression may come in VERY handy. Full text search will be a problem.
the total size of data will cross 1000
TB.
No. Seriously. It will be a bigger, I think. 1000tb would assume the documents are small - like the XML form of a travel ticket.
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to database its quite tricky and there can be multiple components involved to get performance like Redis Cache, Sharding, Read replicas etc.
Bellow post describes simplified DB scalability.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html

What is the best database technology for storing OHLC historical prices?

Just for End-Of-Day data there will be billions of rows. What is the best way to store all that data. Is SQL Server 2008 good enough for that or should I look towards NoSQL solution, like MongoDB. Any suggestions?
That would be cool to have one master db with read/write permissions and one ore more replications of it for read only operations. Only master database will be used for adding new prices into the storage. Also that would be cool to be able replicate OHLC prices for most popular securities individually in order to optimize read access.
This data then will be streamed to a trading platform on clients' machines.
You should consider Oracle Berkeley DB which is in production doing this within the infrastructure of a few well known stock exchanges. Berkeley DB will allow you to record information at a master as simple key/value pairs, in your case I'd imagine a timestamp for the key and an encoded OHLC set for the value. Berkeley DB supports single master multi-replica replication (called "HA" for High Availability) to support exactly what you've outlined - read scalability. Berkeley DB HA will automatically fail-over to a new master if/when necessary. Using some simple compression and other basic features of Berkeley DB you'll be able to meet your scalability and data volume targets (billions of rows, tens of thousands of transactions per second - depending on your hardware, OS, and configuration of BDB - see the 3n+1 benchmark with BDB for help) without issue.
When you start working on accessing that OHLC data consider Berkeley DB's support for bulk-get and make sure you use the B-Tree access method (because your data has order and locality will provide much faster access). Also consider the Berkeley DB partitioning API to split your data (perhaps based on symbol or even based on time). Finally, because you'll be replicating the data you can relax the durability constraints to DB_TXN_WRITE_NOSYNC as long as your replication acknowledgement policy is requires a quorum of replicas ACK a write before considering it durable. You'll find that a fast network beats a fast disk in this case. Also, to offload some work from your master, enable peer-to-peer log replica distribution.
But, first read the replication manager getting started guide and review the rep quote example - which already implements some of what you're trying to do (handy, eh?).
Just for the record, full disclosure I work as a product manager at Oracle on Berkeley DB products. I have for the past nine years, so I'm a tad biased. I'd guess that the other solutions - SQL based or not - might eventually give you a working system, but I'm confident that Berkeley DB can without too much effort.
If you're really talking billions of new rows a day (Federal Express' data warehouse isn't that large), then you need an SQL database that can partition across multiple computers, like Oracle or IBM's DB2.
Another alternative would be a heavy-duty system managed storage like IBM's DFSMS.

What database to use for big data storage and manipulation?

I have to make a decision of which database server to use for my next project, but the simple decision to use MySQL like almost all the projects I did is harder now, because I expect very much records.
The database will store a user list, some other irrelevant tables, and the last one, some user-collected data. Let's say, if I have 6000 users responding to a quiz about each other. Simple math shows that from those users, if each one completes the quiz about everyone (and in my project that is 99% sure that will happen) I'll end up with 35.99million records(they will exclude themselves and in this particular situation the operation is 6000*5999). Unfortunately 6000 maybe is a small number, the real one growing day by day.
What to choose? MySQL and maybe if things go well and the project grows to expand it in a cluster? PostgreSQL, MSSQL? Oracle?
I've read about all of them, each one has it's pros and cons, but still don't know what to choose. The advantage of MySQL and PostgreSQL is of course, the starting price of $0 which is pretty nice in a usual self-funded startup.
Any opinions, pieces of advice? If you encountered this situation in your experience as developers, I'd love to hear from you.
These days, free isn't something that differenciates between databases any more. Both Oracle and SQL Server have free versions, but the limitations is resources - 4 GB database, RAM & single CPU utilization. Millions of records is not a concern - it's what datatypes you're using.
I saw the OPs comment about not liking MS software - that's your prerogative, but using the free versions of either Oracle or SQL Server do benefit from seamless transition to upscale versions of the respective database.
Personally, my choice would be either Oracle or SQL Server because of IMHO, real feature considerations like hierarchical query support, subquery factoring/CTE, packages (long before I get concerned with functions/procedures), full text searching, xml support, etc.
MySQL will handle 35 million records no problem. Worry about scalability when you get there. You can easily add raid hard disks backing your database tables, and if you really start getting big you can get a compellant SAN that will scream... Don't worry about the DB engine as much as the underlying hardware.. MySQL rocks for us with millions of records.
I've had no problems handling tables as large as 36,000,000 rows on MySQL and Oracle.
Just be sure that you index the proper columns, run EXPLAINs for your queries, and maintain proper design principles.
Most of the truly large scale web properties use a distributed key-value store. That said, 35 million is large, but not that large. With most modern databases, your main two scaling worries should be throughput and what happens when no single box can contain your entire database anymore. And both of these problems can be solved to some degree for any database you choose to use. (Caching, replication, sharding, etc.)
Use MySQL until you can't anymore. At that point, you ought to be rolling in dough anyways and you now have a very desirable problem.
Use MySQL as it's free and you have experience with it.
Besides in my opinion it matters more on how you design the tables than which database you use.
35 million records can be easily handled by MS SQL Server (assuming proper database design, indices, etc.). You can start with the free SQL Server Express edition and later, if you need, you can upgrade to the full version which supports clustering, etc.
SQL Server Express does have some limitations - single CPU, 1 GB memory, max 4 GB database size and a few other things. I'm not sure how quickly these limitations will become a problem but you can always move to the full version when you run into them.
MySQL(i) & Postgre
0$ of costs
large community
many tutorials
well documentated
MSSQL
You can get "money" from MS if you promote that you are using MSSQL (secret information from some companies I worked for)
MS tools work very well
Complete tool set from C# IDE over .NET lib to Windows Server 2003
Oracle
Professional and commercial provider
Used by many large companies (I also heard about Blizzard (World of Warcraft) using Oracle)
- expensive
The final decision depends on the very special requirements of your project.
Make yourself a quick list of things , that ARE IMPORTANT for your project (e.g. quick performed queries) and look up which Database pros are matching the most to your requirements.
Everything is about design. SQL Database are some kind of cars, you just have to know which component has to be placed here and which there.
Make a clear design and you won't struggle with any of them.
May be you can test Firebird
Blog post about big Firebird database here
MySQL licence is here (not allways free).
Postgresql and Firebird are free.
First of all, don't think about performance. Premature optimization being the root of all evil and all that. You can always throw more hardware and/or tuning at it later.
All of the mentioned should perform nicely if tuned/maintained correctly. I'd focus on manageability and familiarity. IMHO open source databases excels on manageability (perhaps not the best GUIs, but the CLI has been my home for a long long time).
And if the database becomes the bottleneck, why limit yourself to those choices? How about a key-value distributed database? Or perhaps serialize data directly to disk? Storing data outside of a RDBMS, while often frowned upon, might be the correct path. Or simply use the common route of denormalization.
Always remember not to optimize prematurely.
As far as opinions go (since you specifically asked for it) I favor open source databases, specifically PostgreSQL. It's rock solid, fast and very well-featured. And even with (relatively) large datasets it has performed superbly on mediocre hardware (some tuning involved, of course, but you can't skip that step no matter which db you end up choosing).

Speed of online backups with BLOBs

In Oracle 8 doing an online backup with BLOBs in the database is extremely slow. By slow, I mean over an hour to backup a database with 100MB of BLOB data. Oracle acknowledged it was slow, but wouldn't fix the problem (so much for paying for support.) Does anyone know if Oracle has fixed this problem with subsequent releases? Also, how fast do online backups work with BLOBs work in SQL Server and MySQL?
I've had this issue in the past, and the only decent workarounds we found were to make sure that the LOBs were in their own tablespace, and use a different backup strategy with them, or to switch to using the BFILE type. Whether or not you can get by with BFILE will depend on how you're using the LOBs.
Some usage info on BFILE:
http://download-uk.oracle.com/docs/cd/B10501_01/java.920/a96654/oralob.htm#1059942
Note that BFILEs live on the filesystem outside of Oracle, so you'd need to back them up in a process outside of your normal Oracle backup. On one project we just had a scheduled rsync to offsite backup. Also important to note is that you cannot create/update BFILEs via JDBC, but you can read them.
To answer your question about the speed of online backups of BLOBs in SQL Server, it's the same speed as backing up regular data for SQL 2000/2005/2008 - it's typically limited by the speed of your storage. I usually get over 100mb/sec on my database backups with BLOBs.
Be wary of using backup compression tools with those, though - if the BLOB is binary-style data that's heavily random, then you'll waste CPU cycles trying to compress the data, and compression can make the backup slower instead of faster.
I use SQL Backup from Redgate for SQL Server -- it is ridiculously fast, even with my BLOB data.
I keep a copy of every file that I do EDI with, so while they aren't huge, they are numerous and BLOBs. I'm well over 100Megs of just these text files.
It's important to note that Redgate's SQL Backup is just a front-end to the standard SQL Backup...it gives you additional management features, basically, but still utilizes the SQL Server backup engine.
Depending on the size of the BLOBs, make sure you're storing them in-line / out of line appropriately.
See http://www.dba-oracle.com/t_table_blob_lob_storage.htm
Can you put the export file you're creating and the Oracle tablespaces on different disks? You I/O throughput may be the constraining factor...?
exp on 8i was slow, but not as much as you describe. I have backed-up gigabytes of blobs in minutes in 10g..(to disk - using expdp)

Resources