We have ~1 TB of user profiles and need to perform two types of operations on them:
random reads and writes (~20k profile updates per second)
queries on predefined dimensions (e.g. for reporting)
For example, if we encounter a user in a transaction, we want to update their profile with the URL they came from. At the end of the day we want to see all users who visited a particular URL. We don't need joins, aggregations, etc., only filtering by one or several fields.
We don't really care about latency, but need high throughput.
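To make the two access patterns concrete, here is a minimal in-memory sketch, not a real database. The class name `ProfileStore` and its methods are hypothetical; the point is only that each keyed write also maintains a secondary index on the queried dimension, so the daily report never scans all profiles.

```python
from collections import defaultdict


class ProfileStore:
    """Toy in-memory model of the two access patterns (not a real database)."""

    def __init__(self):
        self.profiles = defaultdict(set)      # user_id -> set of visited URLs
        self.users_by_url = defaultdict(set)  # URL -> set of user_ids (secondary index)

    def record_visit(self, user_id, url):
        # Random write: update one profile and keep the index in sync.
        self.profiles[user_id].add(url)
        self.users_by_url[url].add(user_id)

    def users_who_visited(self, url):
        # Query on a predefined dimension: answered from the index, no full scan.
        return sorted(self.users_by_url.get(url, set()))


store = ProfileStore()
store.record_visit("u1", "http://example.com/a")
store.record_visit("u2", "http://example.com/a")
store.record_visit("u1", "http://example.com/b")
print(store.users_who_visited("http://example.com/a"))  # ['u1', 'u2']
```

The trade-off any real store has to make is visible here: every write touches two structures, which is why secondary indexes cost write throughput.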
Most databases we looked at belong to one of two categories - key-value DBs with fast random access or batch DBs optimized for querying and analytics.
Key-value storages
Aerospike can store terabyte-scale data and is very well optimized for fast key-based lookup. However, queries on a secondary index are extremely slow, which makes it unsuitable for our purposes.
MongoDB is pretty flexible, but requires too much hardware to handle our load. In addition, we encountered particular issues with massive exports from it.
HBase looks attractive since we already have a Hadoop cluster. Yet, it's not really clear how to create a secondary index for it and what its performance will be.
Cassandra - may be an option, but we don't have experience with it (if you do, please share it)
Couchbase - may be an option, but we don't have experience with it (if you do, please share it)
Analytic storages
Relational DBMS (e.g. Oracle, PostgreSQL) provide both random access and efficient queries, but we have doubts that they can handle terabyte-scale data.
HDFS / Hive / SparkSQL - excellent for batch processing, but they don't support indexing. The closest thing is partitioning, but it's not applicable given many-to-many relations (e.g. many users visited many URLs). Also, to our knowledge none of the HDFS-backed tools except for HBase support updates, so you can only append new data and read the latest version, which is not very convenient.
Vertica has very efficient queries, but updates boil down to rewriting the whole file, so are terribly slow.
(Because of our limited experience some of the information above may be subjective or wrong, please feel free to comment about it)
Do any of the mentioned databases have useful options that we missed?
Are there any other databases optimized for our use case? If not, how would you address this task?
This is a follow up to a previous question of mine after definitely deciding on partition switching as the best way to quickly get data into a heavily indexed fact type table that needs to remain available to readers.
While it seems to be the best way, it is not quite good enough to really satisfy the requirement to allow several (< 5) users to bulk insert at the same time, have the new data indexed and to appear in the indexed views (not necessarily real indexed views, just selects that rely on indices).
The idea of partitioning was that each partition and the index subtree rooted at the partition could, in parallel, be locked as read-only, copied into a working table, new data inserted/updated and the indexes rebuilt then switched back into the main table so readers aren't affected.
The problem is the single working table. Each parallel bulk insert needs its own copy, with the same constraints as the main table to allow switching.
So far I've hit several walls trying to get around this bottleneck:
I tried partitioning the working table using the same partition function. This doesn't work because you can't disable the indexes on a partition basis to insert into one while rebuilding the index on another.
Creating a temporary table as the working table. This doesn't work because, while you can use the same index names, you can't easily dynamically create the constraints and can't switch that in anyway.
Have a fixed set of named working tables? How can I select one and work with it under an alias so I have just one stored proc?
Dynamic SQL? I've tried very hard to avoid going that route. It's complicated as it is.
Big challenge, but has anyone got any ideas before I accept the bottleneck? Would SQL Server 2012 help? How do proper data warehouses cope with this?
How do proper data warehouses cope with this? Compromise and set realistic goals for the EDW. The data warehouse can't be everything to everyone. Make sure that what you're implementing is the best solution for the business (not just the techies/analysts). Are your goals realistic if you cannot find solutions from experienced peers and experts?
Associate a cost with all of the hoops you jump through. Does the data really need to be up to the minute? What if I told you that we needed to spend another $200,000 on storage because we're constantly duplicating partitions and rebuilding indexes and the current solution can't keep up with the IOPS demand? At some point, they're going to figure out that it's not free. While you don't need to just say no, you do need to be realistic and up-front about the cost associated. Additionally, your storage admin will thank you.
As for 2012, there is a new columnstore index which can reduce or replace all of the current nonclustereds you're using to cover all your analysts' search requests. It's highly compressed, covers a very wide variety of search arguments, and utilizes the new Batch execution mode. It performs best on low selectivity queries like the ones frequently performed on fact tables. The one catch is that you can't directly do updates. You'll have to switch the partition out to a staging table, drop the columnstore on the staging table, update the staging table, add the columnstore back, then switch the partition back into the fact table. It sounds like a lot, but could be significantly faster and require less IO than maintaining all of those nonclustereds.
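The switch-out / update / switch-in sequence described above can be written down as an ordered list of T-SQL statements. The sketch below only builds the statement text in Python and does not connect to a server; the table, index and column names are hypothetical, and the actual update statements are left as a placeholder comment.

```python
def columnstore_update_steps(fact, staging, partition, index, columns):
    """Return the T-SQL statements, in order, for updating one partition."""
    cols = ", ".join(columns)
    return [
        # 1. Move the partition out of the fact table (metadata-only operation).
        f"ALTER TABLE {fact} SWITCH PARTITION {partition} TO {staging};",
        # 2. Drop the columnstore index so the staging table becomes writable.
        f"DROP INDEX {index} ON {staging};",
        # 3. Placeholder: the caller's own UPDATE/INSERT statements go here.
        f"-- apply updates to {staging}",
        # 4. Rebuild the columnstore index on the updated data.
        f"CREATE NONCLUSTERED COLUMNSTORE INDEX {index} ON {staging} ({cols});",
        # 5. Switch the partition back into the fact table.
        f"ALTER TABLE {staging} SWITCH TO {fact} PARTITION {partition};",
    ]


# Hypothetical names, for illustration only.
steps = columnstore_update_steps(
    fact="dbo.FactSales",
    staging="dbo.FactSales_Staging",
    partition=3,
    index="ncci_FactSales",
    columns=["DateKey", "Amount"],
)
for statement in steps:
    print(statement)
```

For the final switch to succeed, the staging table has to match the fact table's structure, and it typically also needs a CHECK constraint that matches the partition's range.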
My question has always been "Is it really a fact table if it is constantly changing?". This is not OLTP is it? Try offsetting transactions or at least push all updates to a scheduled off-peak time. Updating fact tables is becoming a thing of the past. All of the big boys are moving toward the "Update frowned upon" column oriented architecture for data warehousing. PowerPivot and the Analysis Services Tabular Model are built on the columnstore technology.
Finally, review Kimball's DW Toolkit books. He has several that lay out best practices and cover edge-case scenarios. What I learned from them was that Data Warehouse Development is not just Database Development on steroids. It also involves politics and focusing resources on what's best for the business.
I'm currently working on a home-automation project which provides the user with the possibility to view their energy usage over a period of time. Currently we request data every 15 minutes and we are expecting around 2000 users for our first big pilot.
My boss is requesting that we store at least half a year of data. A quick sum leads to estimates of around 35 million records. Though these records are small (around 500 bytes each), I'm still wondering whether storing these in our database (Postgres) is a correct decision.
Does anyone have some good reference material and/or advice about how to deal with this amount of information?
For now, 35M records of 0.5K each means about 17.5G of data. This fits in a database for your pilot, but you should also think of the next step after the pilot. Your boss will not be happy if the pilot is a big success and you then tell him that you cannot add 100.000 users to the system in the next months without redesigning everything. Moreover, what about a new feature for VIP users to request data every minute...
This is a complex issue and the choice you make will restrict the evolution of your software.
For the pilot, keep it as simple as possible to get the product out as cheaply as possible --> ok for a database. But tell your boss that you cannot open the service like that and that you will have to change things before getting 10.000 new users per week.
One thing for the next release: have many data repositories: one for your user data that is updated frequently, one for your queries/statistics system, ...
You could look at RRD for your next release.
Also keep in mind the update frequency: 2000 users updating data every 15 minutes means 2.2 updates per second --> ok; 100.000 users updating data every 5 minutes means 333.3 updates per second. I am not sure a simple database can keep up with that, and a single web service server definitely cannot.
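The sizing figures in this thread are easy to recompute for other user counts and intervals. A small sketch, assuming every user reports exactly once per interval, a half year of 182 days, and decimal gigabytes with no index or row overhead (the function names are made up for the example):

```python
SECONDS_PER_MINUTE = 60
MINUTES_PER_DAY = 24 * 60


def updates_per_second(users, interval_minutes):
    """Average write rate if every user reports once per interval."""
    return users / (interval_minutes * SECONDS_PER_MINUTE)


def record_count(users, interval_minutes, days):
    """Rows accumulated over the retention period."""
    return users * (MINUTES_PER_DAY // interval_minutes) * days


def storage_gb(records, bytes_per_record):
    """Raw payload size in decimal gigabytes, ignoring index and row overhead."""
    return records * bytes_per_record / 1e9


pilot_rows = record_count(users=2_000, interval_minutes=15, days=182)
print(pilot_rows)                                  # 34944000
print(round(storage_gb(pilot_rows, 500), 1))       # 17.5
print(round(updates_per_second(2_000, 15), 1))     # 2.2
print(round(updates_per_second(100_000, 5), 1))    # 333.3
```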
We frequently hit tables that look like this. Obviously structure your indexes based on usage (do you read or write a lot, etc), and from the start think about table partitioning based on some high level grouping of the data.
Also, you can implement an archiving idea to keep the live table thin. Historical records are either never touched, or reported on, both of which are no good to live tables in my opinion.
It's worth noting that we have tables around 100m records and we don't perceive there to be a performance problem. A lot of these performance improvements can be made with little pain afterwards, so you could always start with a common-sense solution and tune only when performance is proven to be poor.
With appropriate indexes to avoid slow queries, I wouldn't expect any decent RDBMS to struggle with that kind of dataset. Lots of people are using PostgreSQL to handle far more data than that.
It's what databases are made for :)
First of all, I would suggest that you make a performance test - write a program that generates test entries that correspond to the number of entries you'll see over half a year, insert them and check the results to see if query times are satisfactory. If not, try indexing as suggested by other answers. It is, btw, also worth testing write performance to ensure that you can actually insert the amount of data you're generating in 15 minutes in... 15 minutes or less.
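A minimal sketch of such a test harness follows. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in so that the example is self-contained; the table layout and the name `run_benchmark` are assumptions, and a real test should run against the actual Postgres server with the full expected row count.

```python
import random
import sqlite3
import time


def run_benchmark(users=200, samples_per_user=500):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE usage (user_id INTEGER, ts INTEGER, kwh REAL)")
    rows = [
        (user, sample * 900, random.random())  # one sample every 900 s (15 minutes)
        for user in range(users)
        for sample in range(samples_per_user)
    ]

    start = time.perf_counter()
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO usage VALUES (?, ?, ?)", rows)
    insert_seconds = time.perf_counter() - start

    conn.execute("CREATE INDEX idx_usage_user_ts ON usage (user_id, ts)")

    start = time.perf_counter()
    fetched = conn.execute(
        "SELECT ts, kwh FROM usage WHERE user_id = ? ORDER BY ts", (users // 2,)
    ).fetchall()
    query_seconds = time.perf_counter() - start

    total = conn.execute("SELECT COUNT(*) FROM usage").fetchone()[0]
    conn.close()
    return total, len(fetched), insert_seconds, query_seconds


total, fetched, insert_s, query_s = run_benchmark()
print(f"{total} rows inserted in {insert_s:.3f}s, {fetched} rows read in {query_s:.4f}s")
```

Scaling `users` and `samples_per_user` up to the half-year volume, and timing the query with and without the index, gives measured numbers to compare against the 15-minute ingest window.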
Making a test will avoid the mother of all problems - assumptions :-)
Also think about production performance - your pilot will have 2000 users - will your production environment have 4000 users or 200000 users in a year or two?
If we're talking about a really big environment, you need to think about a solution that allows you to scale out by adding more nodes instead of relying on always being able to add more CPU, disk and memory to a single machine. You can either do this in your application by keeping track of which of multiple database machines is hosting details for a specific user, or you can use one of the PostgreSQL clustering methods, or you could go a completely different path - the NoSQL approach, where you walk away completely from RDBMS and use systems which are built to scale horizontally.
There are a number of such systems. I only have personal experience of Cassandra. You have to think completely differently compared to what you're used to from the RDBMS world, which is something of a challenge - think more about how you want to access the data rather than how to store it. For your example, I think storing the data with the user-id as key and then adding a column with the column name being the timestamp and the column value being your data for that timestamp would make sense. You can then ask for slices of those columns, for example for graphing results in a Web UI - Cassandra has good enough response times for UI applications.
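The data model described here (user id as row key, timestamp as column name, slice queries over a time range) can be illustrated without a cluster. The following is an in-memory sketch of the idea only; `WideRowStore` is a hypothetical name and nothing here uses a Cassandra client API.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Row key = user id; columns = (timestamp, value) pairs kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # user_id -> sorted timestamps
        self._values = defaultdict(list)      # user_id -> values, in the same order

    def insert(self, user_id, timestamp, value):
        pos = bisect.bisect_left(self._timestamps[user_id], timestamp)
        self._timestamps[user_id].insert(pos, timestamp)
        self._values[user_id].insert(pos, value)

    def slice(self, user_id, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        stamps = self._timestamps.get(user_id, [])
        values = self._values.get(user_id, [])
        lo = bisect.bisect_left(stamps, start)
        hi = bisect.bisect_left(stamps, end)
        return list(zip(stamps[lo:hi], values[lo:hi]))


store = WideRowStore()
store.insert("user-1", 1800, 0.7)   # inserted out of order on purpose
store.insert("user-1", 0, 0.5)
store.insert("user-1", 900, 0.6)
print(store.slice("user-1", 900, 2700))  # [(900, 0.6), (1800, 0.7)]
```

Because the columns are kept sorted, a time-range slice for one user is a contiguous read, which is what makes this layout a good fit for graphing.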
The upside of investing time in learning and using a nosql system is that when you need more space - you just add a new node. Same thing if you need more write performance, or more read performance.
Are you not better off not keeping individual samples for the full period? You could possibly implement some sort of consolidation mechanism, which concatenates weekly/monthly samples into one record. And run said consolidation on a schedule.
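A sketch of what such a consolidation job might compute. Summing the readings per ISO week is an assumption made for the example; averages or min/max would work the same way, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def consolidate_weekly(samples):
    """Collapse (timestamp, kwh) samples into one total per ISO week."""
    totals = defaultdict(float)
    for timestamp, kwh in samples:
        iso = timestamp.isocalendar()
        totals[(iso[0], iso[1])] += kwh  # key: (ISO year, ISO week number)
    return dict(totals)


samples = [
    (datetime(2024, 1, 1, 0, 0), 0.5),    # Monday, ISO week 1
    (datetime(2024, 1, 1, 0, 15), 0.25),
    (datetime(2024, 1, 8, 0, 0), 1.0),    # Monday, ISO week 2
]
print(consolidate_weekly(samples))  # {(2024, 1): 0.75, (2024, 2): 1.0}
```

At one sample every 15 minutes, a week holds 672 samples per user, so weekly consolidation shrinks the historical part of the table by roughly that factor.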
Your decision has to depend on the type of queries you need to be able to run on the database.
There are lots of techniques to handle this problem. You will only get good performance if you touch the minimum number of records. In your case you can use the following techniques.
Try to keep old data in a separate table. Here you can use table partitioning, or a different approach where you store your old data in the file system and serve it directly from your application without connecting to the database; this way your database stays free. I am doing this for one of my projects: it already has more than 50GB of data but it is running very smoothly.
Try to index table columns, but be careful as it will affect your insertion speed.
Try batch processing for your insert or select queries. You can handle this issue very smartly here.
Example: suppose you get a request to insert a record into a table every second. You can build a mechanism that processes these requests in batches of 5 records, so that you hit your database only once every 5 seconds, which is much better. Yes, you can make users wait 5 seconds for their record to be inserted, as in Gmail, where you send an email and it asks you to wait while processing. For selects you can periodically write your result set to the file system and serve it directly to the user without touching the database, as most stock market data companies do.
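The batching idea from the example can be sketched as a small buffer in front of the database. `sqlite3` is used here only to keep the example runnable on its own; the `events` table and the `BatchWriter` class are hypothetical.

```python
import sqlite3


class BatchWriter:
    """Buffer rows and write them in one transaction per batch."""

    def __init__(self, conn, batch_size=5):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []
        self.flushes = 0

    def add(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with self.conn:  # commits once for the whole batch
            self.conn.executemany("INSERT INTO events VALUES (?, ?)", self.pending)
        self.pending = []
        self.flushes += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
writer = BatchWriter(conn, batch_size=5)
for i in range(12):
    writer.add((i, f"record {i}"))
writer.flush()  # write the final partial batch
print(writer.flushes)  # 3
```

A production version would also flush on a timer, so that a slow trickle of records never waits indefinitely for the batch to fill.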
You can also use an ORM like Hibernate, which uses caching techniques to speed up data access.
For any further query you can mail me on ranjeet1985#gmail.com
I have an interesting database problem. I have a DB that is 150GB in size. My memory buffer is 8GB.
Most of my data is rarely being retrieved, or mainly being retrieved by backend processes. I would very much prefer to keep them around because some features require them.
Some of it (namely some tables, and some identifiable parts of certain tables) are used very often in a user facing manner
How can I make sure that the latter is always being kept in memory? (there is more than enough space for these)
More info:
We are on Ruby on Rails. The database is MySQL, and our tables are stored using InnoDB. We are sharding the data across 2 partitions. Because we are sharding it, we store most of our data as JSON blobs, while indexing only the primary keys.
Update 2
The tricky thing is that the data is actually being used for both backend processes as well as user facing features. But they are accessed far less often for the latter
Update 3
Some people are commenting that 8 GB is a toy these days. I agree, but just increasing the size of the db is pure LAZINESS if there is a smarter, more efficient solution.
This is why we have Data Warehouses. Separate the two things into either (a) separate databases or (b) separate schema within one database.
Data that is current, for immediate access, being updated.
Data that is historical fact, for analysis, not being updated.
150Gb is not very big and a single database can handle your little bit of live data and your big bit of history.
Use a "periodic" ETL process to get things out of active database, denormalize into a star schema and load into the historical data warehouse.
If the number of columns used in the customer facing tables is small, you can make indexes with all the columns being used in the queries. This doesn't mean that all the data stays in memory, but it can make the queries much faster. It's trading space for response time.
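The effect of such a covering index can be observed in a query plan. The sketch below uses SQLite through Python's `sqlite3` module as a stand-in, because it runs without a server; the table and index names are made up, and MySQL reports the same situation differently (as "Using index" in its EXPLAIN output).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT, details TEXT)"
)
conn.executemany(
    "INSERT INTO customers (email, name, details) VALUES (?, ?, ?)",
    [(f"user{i}@example.com", f"User {i}", "x" * 200) for i in range(1000)],
)
# The index holds every column the query needs, so the wide table rows are never read.
conn.execute("CREATE INDEX idx_customers_email_name ON customers (email, name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM customers WHERE email = ?",
    ("user42@example.com",),
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # the last column is the readable detail
print(plan_text)
```

The printed plan should mention a covering index, meaning the lookup is answered from the narrow index alone; the wide `details` column stays on disk untouched.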
This calls for memcached! I'd recommend using cache-money, a great ActiveRecord write-through caching library. The ngmoco branch has support for enabling caching per-model, so you could only cache those things you knew you wanted to keep in memory.
You could also do the caching by hand using $cache.set/get/expire calls in controller actions or model hooks.
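The hand-rolled set/get/expire approach follows a pattern often called cache-aside. Here is a language-neutral sketch in Python; `TTLCache` is a tiny in-process stand-in for a memcached client, and `fetch_user` and `load_from_db` are hypothetical names, so none of this is the cache-money or memcached API.

```python
import time


class TTLCache:
    """Tiny stand-in for a memcached client with set/get/expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # entry is stale, drop it
            return None
        return value

    def expire(self, key):
        self._items.pop(key, None)


def fetch_user(cache, user_id, load_from_db):
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    key = f"user:{user_id}"
    user = cache.get(key)
    if user is None:
        user = load_from_db(user_id)
        cache.set(key, user, ttl_seconds=300)
    return user


db_calls = []


def load_from_db(user_id):
    db_calls.append(user_id)  # record every trip to the "database"
    return {"id": user_id, "name": f"user {user_id}"}


cache = TTLCache()
fetch_user(cache, 7, load_from_db)
fetch_user(cache, 7, load_from_db)   # served from the cache
print(len(db_calls))  # 1
```

Calling `cache.expire("user:7")` from a model hook after an update forces the next read back to the database.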
With MySQL, proper use of the Query Cache will keep frequently queried data in memory. You can provide a hint to MySQL not to cache certain queries (e.g. from the backend processes) with the SQL_NO_CACHE keyword.
If the backend processes are accessing historical data, or accessing data for reporting purposes, certainly follow S. Lott's suggestion to create a separate data warehouse and query that instead. If a data warehouse is too much to accomplish in the short term, you can replicate your transactional database to a different server and perform queries there (a Data Warehouse gives you MUCH more flexibility and capability, so go down that path if possible)
UPDATE:
See documentation of SELECT and scroll down to SQL_NO_CACHE.
Read about the Query Cache
Ensure query_cache_type is set appropriately for your needs.
UPDATE 2:
I confirmed with MySQL support that there is no mechanism to selectively cache certain tables etc. in the innodb buffer pool.
So, what is the problem?
First, 150gb is not very large today. It was 10 years ago.
Second any non-total-crap database system will utilize your memory as cache. If the cache is big enough (compared to the amount of data that is in use) it will be efficient. If not, the only thing you CAN do is get more memory (because, sorry, 8gb of memory is VERY low for a modern server - it was low 2 years ago).
You should not have to do anything for the memory to be efficiently used. At least not on a commercial level database - maybe mysql sucks, but I would not assume this.
I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net.
I'm a SQL Server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other NoSQL solutions, it's sounding more and more like an alternative to a traditional RDBMS may be best due to optimizations for distributed queries and scaling. I tried Cassandra and the .net libraries don't currently compile or are all subject to change (along with Cassandra itself).
I've looked into many nosql data stores available, but can't find one that meets my needs as a robust production-ready platform.
If you had to store 36 billion small, flat records so that they're accessible from .net, what would you choose and why?
Storing ~3.5TB of data and inserting about 1K/sec 24x7, and also querying at a rate not specified, is possible with SQL Server, but there are more questions:
What availability requirement do you have for this? 99.999% uptime, or is 95% enough?
What reliability requirement do you have? Does missing an insert cost you $1M?
What recoverability requirement do you have? If you lose one day of data, does it matter?
What consistency requirement do you have? Does a write need to be guaranteed to be visible on the next read?
If you need all these requirements I highlighted, the load you propose is going to cost millions in hardware and licensing on a relational system, any system, no matter what gimmicks you try (sharding, partitioning etc). A NoSQL system would, by its very definition, not meet all these requirements.
So obviously you have already relaxed some of these requirements. There is a nice visual guide comparing the NoSQL offerings based on the 'pick 2 out of 3' paradigm at Visual Guide to NoSQL Systems.
After OP comment update
With SQL Server this would be a straightforward implementation:
one single table with a clustered (GUID, time) key. Yes, it is going to get fragmented, but fragmentation affects read-aheads, and read-aheads are needed only for significant range scans. Since you only query for a specific GUID and date range, fragmentation won't matter much. Yes, it is a wide key, so non-leaf pages will have poor key density. Yes, it will lead to poor fill factor. And yes, page splits may occur. Despite these problems, given the requirements, it is still the best clustered key choice.
partition the table by time so you can implement efficient deletion of the expired records, via an automatic sliding window. Augment this with an online index partition rebuild of the last month to eliminate the poor fill factor and fragmentation introduced by the GUID clustering.
enable page compression. Since the clustered key groups by GUID first, all records of a GUID will be next to each other, giving page compression a good chance to deploy dictionary compression.
you'll need a fast IO path for the log file. For the log to keep up with 1K inserts/sec you are interested in high throughput, not low latency, so striping is a must.
Partitioning and page compression each require an Enterprise Edition SQL Server, they will not work on Standard Edition and both are quite important to meet the requirements.
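The combination described above (rows ordered by (GUID, time), partitioned by time, with old partitions dropped as a whole) can be modelled in a few lines. This is an in-memory illustration of the layout, not SQL Server code; `PartitionedTable` is a hypothetical name and payloads are assumed to be strings.

```python
import bisect
from datetime import datetime


class PartitionedTable:
    """Rows partitioned by month; inside a partition, rows stay sorted by (guid, time)."""

    def __init__(self, max_partitions=12):
        self.max_partitions = max_partitions
        self.partitions = {}  # (year, month) -> sorted list of (guid, time, payload)

    def insert(self, guid, when, payload):
        partition = self.partitions.setdefault((when.year, when.month), [])
        bisect.insort(partition, (guid, when, payload))
        # Sliding window: drop the oldest month as a whole instead of deleting rows.
        while len(self.partitions) > self.max_partitions:
            del self.partitions[min(self.partitions)]

    def lookup(self, guid):
        """All rows for one GUID, oldest partition first, in time order."""
        rows = []
        for key in sorted(self.partitions):
            partition = self.partitions[key]
            start = bisect.bisect_left(partition, (guid,))
            for row in partition[start:]:
                if row[0] != guid:
                    break  # rows of one GUID are contiguous within a partition
                rows.append(row)
        return rows


table = PartitionedTable(max_partitions=2)
table.insert("guid-a", datetime(2024, 1, 5), "jan")
table.insert("guid-b", datetime(2024, 2, 1), "other")
table.insert("guid-a", datetime(2024, 2, 9), "feb")
table.insert("guid-a", datetime(2024, 3, 2), "mar")   # January falls out of the window
print([row[2] for row in table.lookup("guid-a")])      # ['feb', 'mar']
```

The sketch shows both sides of the trade-off: a lookup does one short contiguous read per partition, and expiring a month costs a single delete of the partition rather than millions of row deletes.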
As a side note, if the records come from a front-end Web servers farm, I would put Express on each web server and instead of INSERT on the back end, I would SEND the info to the back end, using a local connection/transaction on the Express co-located with the web server. This gives a much much better availability story to the solution.
So this is how I would do it in SQL Server. The good news is that the problems you'll face are well understood and solutions are known. That doesn't necessarily mean this is better than what you could achieve with Cassandra, BigTable or Dynamo. I'll let someone more knowledgeable in things NoSQL-ish argue their case.
Note that I never mentioned the programming model, .Net support and such. I honestly think they're irrelevant in large deployments. They make huge difference in the development process, but once deployed it doesn't matter how fast the development was, if the ORM overhead kills performance :)
Contrary to popular belief, NoSQL is not about performance, or even scalability. It's mainly about minimizing the so-called Object-Relational impedance mismatch, but is also about horizontal scalability vs. the more typical vertical scalability of an RDBMS.
For the simple requirement of fast inserts and fast lookups, almost any database product will do. If you want to add relational data, or joins, or have any complex transactional logic or constraints you need to enforce, then you want a relational database. No NoSQL product can compare.
If you need schemaless data, you'd want to go with a document-oriented database such as MongoDB or CouchDB. The loose schema is the main draw of these; I personally like MongoDB and use it in a few custom reporting systems. I find it very useful when the data requirements are constantly changing.
The other main NoSQL option is distributed Key-Value Stores such as BigTable or Cassandra. These are especially useful if you want to scale your database across many machines running commodity hardware. They work fine on servers too, obviously, but don't take advantage of high-end hardware as well as SQL Server or Oracle or other databases designed for vertical scaling, and obviously, they aren't relational and are no good for enforcing normalization or constraints. Also, as you've noticed, .NET support tends to be spotty at best.
All relational database products support partitioning of a limited sort. They are not as flexible as BigTable or other DKVS systems, they don't partition easily across hundreds of servers, but it really doesn't sound like that's what you're looking for. They are quite good at handling record counts in the billions, as long as you index and normalize the data properly, run the database on powerful hardware (especially SSDs if you can afford them), and partition across 2 or 3 or 5 physical disks if necessary.
If you meet the above criteria, if you're working in a corporate environment and have money to spend on decent hardware and database optimization, I'd stick with SQL Server for now. If you're pinching pennies and need to run this on low-end Amazon EC2 cloud computing hardware, you'd probably want to opt for Cassandra or Voldemort instead (assuming you can get either to work with .NET).
Very few people work at the multi-billion row set size, and most times that I see a request like this on Stack Overflow, the data is nowhere near the size it is being reported as.
36 billion, 3 billion per month, that's roughly 100 million per day, 4.16 million an hour, ~70k rows per minute, 1.1k rows a second coming into the system, in a sustained manner for 12 months, assuming no down time.
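The same back-of-the-envelope arithmetic, written out so it can be rerun with different inputs. The 30-day month is an assumption, and the size figure uses the upper end of the 50-75 byte range from the question with no index or row overhead.

```python
ROWS_PER_MONTH = 3_000_000_000
MONTHS = 12
DAYS_PER_MONTH = 30        # assumption: average month length
MAX_BYTES_PER_ROW = 75     # upper end of the 50-75 byte range in the question

rows_per_day = ROWS_PER_MONTH / DAYS_PER_MONTH
rows_per_hour = rows_per_day / 24
rows_per_minute = rows_per_hour / 60
rows_per_second = rows_per_minute / 60
total_rows = ROWS_PER_MONTH * MONTHS
raw_terabytes = total_rows * MAX_BYTES_PER_ROW / 1e12

print(f"{rows_per_day:,.0f} rows/day")        # 100,000,000 rows/day
print(f"{rows_per_hour:,.0f} rows/hour")      # 4,166,667 rows/hour
print(f"{rows_per_minute:,.0f} rows/minute")  # 69,444 rows/minute
print(f"{rows_per_second:,.0f} rows/second")  # 1,157 rows/second
print(f"{raw_terabytes:.1f} TB raw payload")  # 2.7 TB raw payload
```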
Those figures are not impossible by a long margin, I've done larger systems, but you want to double check that those really are the quantities you mean - very few apps really have this quantity.
In terms of storing and retrieving, one quite critical aspect you have not mentioned is aging out the older data: deletion is not free.
The normal technology to look at is partitioning. However, the lookup / retrieval being GUID-based would result in poor performance, assuming you have to get every matching value across the whole 12-month period. You could place a clustered index on the GUID column, which would get your associated data clustered for read / write, but at those quantities and insertion speed the fragmentation will be far too high to support, and it will fall on the floor.
I would also suggest that you are going to need a very decent hardware budget if this is a serious application with OLTP type response speeds, that is by some approximate guesses, assuming very few overheads indexing wise, about 2.7TB of data.
In the SQL Server camp, the only thing that you might want to look at is the new parallel data warehouse edition (Madison), which is designed more for sharding out data and running parallel queries against it to provide high speed against large datamarts.
"I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net."
I can tell you from experience that this is possible in SQL Server, because I did it in early 2009 ... and it's still operational to this day and quite fast.
The table was partitioned into 256 partitions (keep in mind this was SQL Server 2005) ... and we did exactly what you're describing, which is to store bits of info by GUID and retrieve them by GUID quickly.
When I left we had around 2-3 billion records, and data retrieval was still quite good (1-2 seconds when going through the UI, or less directly on the RDBMS) even though the data retention policy was just about to be instantiated.
So, long story short, I took the 8th char (i.e. somewhere in the middle-ish) from the GUID string, SHA1 hashed it, cast the result as a tinyint (0-255), stored the row in the appropriate partition, and used the same function call when getting the data back.
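A sketch of that partition function in Python. How the hash was cast to a tinyint in the original T-SQL is not stated, so taking the first byte of the SHA-1 digest is an assumption; the second function is a variant added for comparison, not something the answer describes.

```python
import hashlib
import uuid


def partition_from_char(guid, position=7):
    """Hash one character of the GUID string (index 7 = 8th character) into 0-255."""
    digest = hashlib.sha1(str(guid)[position].encode("ascii")).digest()
    return digest[0]  # assumption: the first digest byte plays the role of the tinyint


def partition_from_guid(guid):
    """Variant that hashes the whole GUID string into 0-255."""
    return hashlib.sha1(str(guid).encode("ascii")).digest()[0]


guids = [uuid.uuid4() for _ in range(10_000)]
print(len({partition_from_char(g) for g in guids}))  # at most 16
print(len({partition_from_guid(g) for g in guids}))  # up to 256
```

Taken literally, hashing a single hexadecimal character can produce at most 16 distinct values, so only 16 of the 256 partitions would ever receive rows. Hashing the whole GUID (or more characters) spreads rows across all of them, while keeping the property that the same function locates the partition on read.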
ping me if you need more info...
The following article discusses the import and use of a 16 billion row table in Microsoft SQL.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they’re more compactly represented in binary fields than character data. If all your data is alphanumeric, you won’t gain much by exporting it in native format. Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field’s binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you’ll have to export it in chunks or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don’t include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter.
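The tip about exporting in chunks comes down to splitting the row count into ranges that each stay within a signed 4-byte counter. A small sketch of that arithmetic, using the article's 16 billion rows; the function name is made up, and how each range is then passed to the export tool is left open.

```python
BCP_MAX_ROWS = 2_147_483_647  # largest value of a signed 4-byte integer


def export_chunks(total_rows, chunk_size=BCP_MAX_ROWS):
    """Yield 1-based inclusive (first_row, last_row) ranges covering total_rows."""
    first = 1
    while first <= total_rows:
        last = min(first + chunk_size - 1, total_rows)
        yield first, last
        first = last + 1


chunks = list(export_chunks(16_000_000_000))
print(len(chunks))   # 8
print(chunks[0])     # (1, 2147483647)
print(chunks[-1])    # (15032385530, 16000000000)
```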
There is an unusual fact that seems to be overlooked.
"Basically after inserting 30Mil rows in a day, I need to fetch all the rows with the same GUID (maybe 20 rows) and be reasonably sure I'd get them all back"
Needing only 20 rows, a non-clustered index on the GUID will work just fine. You could cluster on another column for data dispersion across partitions.
I have a question regarding the data insertion: How is it being inserted?
Is this a bulk insert on a certain schedule (per min, per hour, etc)?
What source is this data being pulled from (flat files, OLTP, etc)?
I think these need to be answered to help understand one side of the equation.
Amazon Redshift is a great service. It was not available when the question was originally posted in 2010, but it is now a major player in 2017. It is a column based database, forked from Postgres, so standard SQL and Postgres connector libraries will work with it.
It is best used for reporting purposes, especially aggregation. The data from a single table is stored on different servers in Amazon's cloud, distributed according to the table's defined distkeys, so you rely on distributed CPU power.
So SELECTs, and especially aggregated SELECTs, are lightning fast. Loading large data should preferably be done with the COPY command from CSV files in Amazon S3. The drawbacks are that DELETEs and UPDATEs are slower than usual, but that is because Redshift is not primarily a transactional database, but more of a data warehouse platform.
You can try using Cassandra or HBase, though you would need to read up on how to design the column families as per your use case.
Cassandra provides its own query language but you need to use Java APIs of HBase to access the data directly.
If you need to use HBase then I recommend querying the data with Apache Drill, an open source project from MapR. Drill's query language is SQL-compliant (keywords in Drill have the same meaning they would have in SQL).
With that many records per year you're eventually going to run out of space.
Why not use filesystem storage like XFS, which supports 2^64 files, and use smaller boxes?
Regardless of how fancy people want to get, or how much money one would end up spending on a system with whatever database (SQL, NoSQL, whichever), this many records are usually produced by electric companies and by weather stations or providers, like a ministry of environment that controls smaller stations throughout the country.
If you're doing something like storing pressure.. temperature..wind speed.. humidity etc...and guid is the location..you can still divide the data by year/month/day/hour.
Assuming you store 4 years of data per hard-drive.
You can then have it run on a smaller NAS with a mirror, where it would also provide better read speeds, and have multiple mount points based on the year when the data was created.
You can simply make a web-interface for searches
So dumping location1/2001/06/01//temperature and location1/2002/06/01//temperature would only dump the contents of hourly temperature for the 1st day of summer in those 2 years (24h*2) 48 small files vs searching a database with billions of records and possibly millions spent.
Simple way of looking at things.. 1.5 billion websites in the world with God knows how many pages each
If a company like Google had to spend millions per 3 billion searches to pay for super-computers for this they'd be broke.
Instead they have the power-bill...couple million crap computers.
And caffeine indexing...future-proof..keep adding more.
And yeah where indexing running off SQL makes sense then great
Building super-computers for crappy tasks with fixed things like weather...statistics and so on so techs can brag their systems crunches xtb in x seconds...waste of money that can be spent somewhere else..maybe that power-bill that won't run into the millions anytime soon by running something like 10 Nas servers.
Store records in plain binary files, one file per GUID, wouldn't get any faster than that.
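A minimal sketch of the one-file-per-GUID idea, using fixed-size records appended to a binary file. The record layout (an int64 timestamp plus a float64 value), the `.bin` naming and the function names are all assumptions made for the example.

```python
import struct
import tempfile
from pathlib import Path

RECORD = struct.Struct("<qd")  # assumed layout: int64 timestamp + float64 value (16 bytes)


def append_record(root, guid, timestamp, value):
    """Append one fixed-size record to the file belonging to this GUID."""
    with open(Path(root) / f"{guid}.bin", "ab") as handle:
        handle.write(RECORD.pack(timestamp, value))


def read_records(root, guid):
    """Read back every record stored for this GUID, in insertion order."""
    path = Path(root) / f"{guid}.bin"
    if not path.exists():
        return []
    data = path.read_bytes()
    return [RECORD.unpack_from(data, offset) for offset in range(0, len(data), RECORD.size)]


with tempfile.TemporaryDirectory() as root:
    append_record(root, "guid-a", 1_700_000_000, 21.5)
    append_record(root, "guid-a", 1_700_000_900, 22.0)
    append_record(root, "guid-b", 1_700_000_000, 3.25)
    records_a = read_records(root, "guid-a")
    records_missing = read_records(root, "guid-z")

print(records_a)  # [(1700000000, 21.5), (1700000900, 22.0)]
```

Whether this beats a database in practice depends on how many distinct GUIDs there are and on how the filesystem copes with that many small files, which the sketch does not measure.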
You can use MongoDB and use the guid as the sharding key, this means that you can distribute your data over multiple machines but the data you want to select is only on one machine because you select by the sharding key.
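The routing idea can be illustrated with a plain hash function. This is a simplified model with hypothetical names (`shard_for`, `ShardedStore`); it does not reproduce MongoDB's actual chunk-based sharding and only shows why a lookup by shard key touches a single machine.

```python
import hashlib


def shard_for(guid, shard_count):
    """Map a GUID to a shard number using a stable hash."""
    digest = hashlib.md5(str(guid).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]  # one dict per "machine"

    def insert(self, guid, record):
        shard = self.shards[shard_for(guid, len(self.shards))]
        shard.setdefault(guid, []).append(record)

    def find(self, guid):
        # Only one shard is consulted because the lookup uses the shard key.
        return self.shards[shard_for(guid, len(self.shards))].get(guid, [])


store = ShardedStore(shard_count=4)
store.insert("guid-a", {"value": 1})
store.insert("guid-a", {"value": 2})
store.insert("guid-b", {"value": 3})
print(store.find("guid-a"))  # [{'value': 1}, {'value': 2}]
```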
Sharding in MongoDB is not yet production ready.