We're building a measurement system that will eventually consist of thousands of measurement stations. Each station will save around 500 million measurements consisting of 30 scalar values over its lifetime. These will be float values. We're now wondering how to save this data on each station, considering we'll be building a web app on each station such that
we want to visualize the data on multiple timescales (eg measurements of one week, month, year)
we need to build moving averages over the data (eg average over a month to show in a year graph)
the database needs to be crash resistant (power outages)
we are only doing writes and reads, no updates or deletes on the data
additionally we'd like one more server that can show the data of, say, 1000 measurement stations. That would be ~50TB of data in 500 billion measurements. To transmit the data from measurement station to server, I thought that some type of database-level replication would be a clean and efficient way.
Now I'm wondering if a NoSQL solution might be better than MySQL for these purposes. CouchDB, Cassandra, and maybe key-value stores like Redis look especially appealing to me. Which of those would suit the "measurement time series" data model best in your opinion? What about other advantages like crash-safety and replication from measurement station to main server?
I think CouchDB is a great database -- but its ability to deal with large data is questionable. CouchDB's primary focus is on simplicity of development and offline replication, not necessarily on performance or scalability. CouchDB itself does not support partitioning, so you'll be limited by the maximum node size unless you use BigCouch or invent your own partitioning scheme.
No foolin, Redis is an in-memory database. It's extremely fast and efficient at getting data in and out of RAM. It does have the ability to use disk for storage, but it's not terribly good at it. It's great for bounded quantities of data that change frequently. Redis does have replication, but does not have any built-in support for partitioning, so again, you'll be on your own here.
You also mentioned Cassandra, which I think is more on target for your use case. Cassandra is well suited for databases that grow indefinitely; that's essentially its original use case. The partitioning and availability are baked in, so you won't have to worry about them very much. The data model is also a bit more flexible than the average key/value store, adding a second dimension of columns, and can practically accommodate millions of columns per row. This allows time-series data to be "bucketed" into rows that cover time ranges, for example. The distribution of data across the cluster (partitioning) is done at the row level, so only one node is necessary to perform operations within a row.
Hadoop plugs right into Cassandra, with "native drivers" for MapReduce, Pig, and Hive, so it could potentially be used to aggregate the collected data and materialize the running averages. The best practice is to shape data around queries, so you'll probably want to store multiple copies of the data in "denormalized" form, one for each type of query.
Check out this post on doing time-series in Cassandra:
http://rubyscale.com/2011/basic-time-series-with-cassandra/
For highly structured data of this nature (time series of float vectors) I tend to shy away from databases altogether. Most of the features of a database aren't very interesting; you basically aren't interested in things like atomicity or transactional semantics. The only desirable feature is resilience to crashing. That feature, however, is trivially easy to implement when you never need to undo a write (no updates/deletes): just append to a file. Crash recovery is simple; open a new file with an incremented serial number in the filename.
A logical format for this is plain old CSV. After each measurement is taken, call flush() on the underlying file. Getting the data replicated back to the central server is a job efficiently solved by rsync(1). You can then import the data in the analysis tool of your choice.
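A minimal sketch of that approach (the directory, rotation threshold, and 30-value row layout here are assumptions for illustration, not anything prescribed above):

```python
import csv
import time

MAX_ROWS_PER_FILE = 1_000_000   # assumed rotation threshold


class MeasurementLog:
    """Append-only CSV writer: one row per measurement, flushed immediately."""

    def __init__(self, directory="."):
        self.directory = directory
        self.serial = 0
        self._open_next_file()

    def _open_next_file(self):
        # Crash recovery: on restart, simply start a new file with the next serial number.
        self.serial += 1
        path = f"{self.directory}/measurements_{self.serial:06d}.csv"
        self.fh = open(path, "a", newline="")
        self.writer = csv.writer(self.fh)
        self.rows_in_file = 0

    def append(self, timestamp, values):
        # One timestamp plus the ~30 scalar values per measurement.
        self.writer.writerow([timestamp, *values])
        self.fh.flush()                     # push the row out after every measurement
        self.rows_in_file += 1
        if self.rows_in_file >= MAX_ROWS_PER_FILE:
            self.fh.close()
            self._open_next_file()


log = MeasurementLog("/tmp")
log.append(time.time(), [1.0] * 30)
```

Note that flush() only hands the buffered rows to the operating system; for durability across a power outage you would additionally call os.fsync() on the file's descriptor.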
I would personally shy away from CSV and plaintext files. These are convenient when you have low volume and want to skip tooling in order to quickly look at the data or make small alterations to it.
When you're talking about 50 TB of data, that's quite a lot. If a simple trick reduces that by a factor of two, it will pay for itself in storage costs and bandwidth charges.
If the measurements are taken at a regular interval, then instead of saving a timestamp with every measurement, you can store the start time and the interval and just store the measurements themselves.
I'd go for a file format that has a small header and then just a bunch of floating-point measurements. To prevent files from getting really large, decide on a maximum file size. If you initialize the file by writing it out in full before starting to use it, it will be completely allocated on disk by the time you start using it. Now you can mmap the file and alter the data. If power goes down while you are changing the data, it simply either makes it to disk or it doesn't.
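A rough sketch of such a preallocated, mmap-ed file (the header layout, slot count, and field choices are illustrative assumptions):

```python
import mmap
import struct

N_VALUES = 30                      # scalar values per measurement (from the question)
SLOTS = 100_000                    # assumed maximum number of measurements per file
HEADER = struct.Struct("<ddI")     # start_time, interval_seconds, values_per_record
RECORD = struct.Struct("<" + "f" * N_VALUES)


def create_file(path, start_time, interval):
    """Write the file out in full so it is fully allocated on disk before use."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(start_time, interval, N_VALUES))
        f.write(b"\x00" * RECORD.size * SLOTS)


def write_measurement(path, index, values):
    """Overwrite slot `index` in place; after a power loss the slot is either written or not."""
    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)
        offset = HEADER.size + index * RECORD.size
        mm[offset:offset + RECORD.size] = RECORD.pack(*values)
        mm.flush()
        mm.close()


def read_measurement(path, index):
    """The timestamp is implicit: start_time + index * interval (no timestamp stored per row)."""
    with open(path, "rb") as f:
        f.seek(HEADER.size + index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))
```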
I'm currently receiving 2000 prices per second from a stock exchange and need to save those in an appropriate database. My current choice is PostgreSQL, which is way too slow. I need to save those prices (ticks) in an aggregated form like OHLC. So if I want to save D1 data, for instance, I first need to get the previous D1 record for the stock from the database, check whether the high or low price has changed, set a new close price, and then save it to the database again. This is taking forever and is not workable with Postgres. I don't want to save the OHLC data; I prefer querying (aggregating) it in real time.
So my requirements are:
persistence
fast writes (currently 2k per second, up to 10k)
queries, e.g. aggregating OHLC data in real-time (50-100 per second)
adaptable to any modern programming language without writing raw queries (an SDK for Python or JS for that database)
deployable on AWS or GCP without hassle
I was thinking about Apache Cassandra. I'm not familiar with Cassandra; are powerful queries like the OHLC aggregation possible? Are there any alternatives to Cassandra?
Thanks in advance!
Given what I've understood from your question, I believe Cassandra should easily fit your use-case.
Regarding your requirements:
persistence : Cassandra will not only persist your data but also cover redundancy with minimal configuration;
fast writes : this is what Cassandra is most optimized for, and while the exact throughput depends on a lot of factors, in general Cassandra will manage writes measured in the thousands/sec/core; also, the eventual number of writes is not really relevant, as Cassandra can scale linearly with no real penalty, so 5k, 10k, 100k or more are all doable;
adaptability : Cassandra has official drivers for the most common languages (Python, the C family, Node.js, Java, Ruby, PHP, Scala) as well as community-developed ones for more languages (list of drivers);
deployable : it's very easy to deploy in the cloud. You can choose to deploy it manually on independent instances or use a managed Cassandra cluster (AWS has one, called 'Amazon Keyspaces'; DataStax, the company driving most of the development behind Cassandra, has one called 'Astra'; and there are even more possible solutions). Given that Cassandra is one of the major players when it comes to big-data storage, finding a place for your DB in the cloud should be easy.
I have only mentioned 4 of the 5 requirements. That is because when talking about reading, things get more complex and a larger discussion is needed.
50-100 reads/s given the 2k+ writes per second seems to be in line with the general idea of Cassandra being optimized for write-intensive tasks. In Cassandra, the way you model your tables dictates how well things can work. For a task like the one you have described, my first thoughts are:
You bucket each stock per day => you get a partition with around 30k rows (1 update/s for 8 trading hours) and a size of under 0.2MB (30k * 4B). This would be well within the recommended values and clearly under the worst-case-scenario ones (see the sketch after this list);
when you need the aggregated data you have 2 options:
2a. You read the partition as is and aggregate it application side (what I would recommend);
2b. You implement a "User-Defined Aggregate" function on your database that will do the work (docs). This should be doable, although I won't guarantee it. Apart from being harder to implement, the problem is that putting this kind of extra workload on the DB might not be what you want given your apparent use case. Let me explain: I'd expect your reading load to be most active during certain times (before, during and after trading hours), with times when the load is lighter. Depending on your architecture, you could have multiple application instances up during peak times and then scale them back during off-peak hours in order to lower costs. While applications can easily be scaled up and down on cloud providers like AWS and GCP, Cassandra cannot be scaled up and down like this (5 nodes in the morning, 3 at night, and so on) -- well, it could, but it's not designed for that and it would be a terrible decision. So moving as much of the non-constant workload as possible to the application seems the best idea;
(Optional) Have a worker that at the end of the day/trading day aggregates the values for each stock and saves them to another table, so that looking at historic data is easier. This data could even be bucketed by week, month or even year, depending on how much space the aggregated data takes.
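To make points 1 and 2a concrete, here is a minimal sketch using the Python driver; the keyspace, table and column names, and the exact bucketing are illustrative assumptions rather than a prescribed schema:

```python
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("market")   # assumed keyspace

# Point 1: bucket each stock per day -> partition key (symbol, day), ticks clustered by time.
session.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        symbol text,
        day    date,
        ts     timestamp,
        price  float,
        PRIMARY KEY ((symbol, day), ts)
    )
""")

insert = session.prepare("INSERT INTO ticks (symbol, day, ts, price) VALUES (?, ?, ?, ?)")

def record_tick(symbol, ts, price):
    session.execute(insert, (symbol, ts.date(), ts, price))

# Point 2a: read the whole (symbol, day) partition and aggregate OHLC application-side.
def ohlc(symbol, day):
    rows = list(session.execute(
        "SELECT price FROM ticks WHERE symbol = %s AND day = %s ORDER BY ts",
        (symbol, day)))
    if not rows:
        return None
    prices = [r.price for r in rows]
    return {"open": prices[0], "high": max(prices), "low": min(prices), "close": prices[-1]}

record_tick("AAPL", datetime(2021, 6, 1, 14, 30, 0), 127.35)   # hypothetical values
print(ohlc("AAPL", datetime(2021, 6, 1).date()))
```

Because every query includes the full partition key, each read or write stays on a single replica set, which is the property point 1 is after.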
You could also add Spark and Kafka in front of Cassandra for a more powerful approach to the real-time aggregation, but we shouldn't deviate that much from the question at hand.
Cassandra is very powerful with the right modeling and the right architecture. At first glance what you need seems to be a good fit for Cassandra; however, as powerful as it can be when used well, it can be just as bad when used in ways it wasn't designed for. I hope this answer puts you on a path to making the right decision.
Cheers.
I need to develop a plan to move data from a SQL Server DB to one of the big-data databases. Some of the questions that I have thought of are:
How big is the data?
What is the expected growth rate for this data?
What kind of queries will be run frequently? eg: look-up, range-scan, full-scan etc
How frequently will the data be moved from source to destination?
Can anyone help add to this questionnaire?
Firstly, "How big is the data?" doesn't matter! This point can barely be used to decide which NoSQL DB to use, as most NoSQL DBs are made for easy scalability and storage. So all that matters is the query you fire rather than how much data there is. (Unless, of course, you intend to use it for storage and access of very small amounts of data, because that would be a little expensive in many of the NoSQL DBs.) Your first question must be: Why consider NoSQL? Can't an RDBMS handle it?
Expected growth rate is a considerable parameter, but then again not decisive, since most of the NoSQL DBs support storage of large amounts of data (without any scalability issues).
The most important one in your list is What kind of queries will be run?
This matters most since an RDBMS stores data as tuples and it's easier to select tuples and output them with smaller amounts of data. It's faster at executing * queries (because of its row-wise storage). But coming to NoSQL, most DBs are columnar, or column-oriented, DBMSs.
Row-oriented systems: As data is inserted into the table, it is assigned an internal ID, the rowid, that is used internally in the system to refer to the data. In this case the records have sequential rowids independent of any user-assigned key (such as an employee ID).
Column-oriented systems : A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.
Comparisons between row-oriented and column-oriented databases are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other bottlenecks in computers.
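A toy illustration of the two layouts in plain Python (not any particular database's on-disk format):

```python
# Row-oriented: each record is stored (and read) as one unit.
rows = [
    {"rowid": 1, "name": "alice", "dept": "sales", "salary": 50000.0},
    {"rowid": 2, "name": "bob",   "dept": "eng",   "salary": 60000.0},
]

# Column-oriented: all values of one column are serialized together.
columns = {
    "rowid":  [1, 2],
    "name":   ["alice", "bob"],
    "dept":   ["sales", "eng"],
    "salary": [50000.0, 60000.0],
}

# "SELECT *" for one record is natural in the row layout ...
print(rows[0])

# ... while an aggregate over a single column only has to touch that
# column's contiguous values in the columnar layout.
print(sum(columns["salary"]) / len(columns["salary"]))
```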
How frequently will the data be moved/accessed? is again a good question, as accesses are costly and a few of the NoSQL DBs are very slow the first time a query is run (e.g. Hive).
Other parameters you may consider are :
Are updates of rows (data in the table) required? (Hive has problems with updates; you usually have to delete and insert again.)
Why are you using the database? (Search, derive relationships or analytics, etc) What type of operations would you want to perform on the data?
Will it require relationship searches? Like in the case of Facebook's DB (Presto)?
Will it require aggregations?
Will it be used to relate various columns to derive insights (like analytics)?
Last but a very important one: do you want to store that data on HDFS (Hadoop Distributed File System) as files, in your DB's specific storage format, or something else? This is important since your processing depends on how your data is stored -- whether it can be accessed directly or needs a query call, which may be time-consuming, etc.
A couple more pointers:
The type of NoSQL DB that suits your requirement, i.e. key-value, document, column-family, or graph databases.
The CAP theorem, to decide which is most critical among consistency, availability, and partition tolerance.
According to the Prometheus webpage, one main difference between Prometheus and InfluxDB is the use case: while Prometheus stores only time series, InfluxDB is better geared towards storing individual events. Since there was some major work done on the storage engine of InfluxDB, I wonder if this is still true.
I want to set up a time-series database, and apart from the push/pull model (and probably a difference in performance) I can see nothing big that separates the two projects. Can someone explain the difference in use cases?
InfluxDB CEO and developer here. The next version of InfluxDB (0.9.5) will have our new storage engine. With that engine we'll be able to efficiently store either single event data or regularly sampled series. i.e. Irregular and regular time series.
InfluxDB supports int64, float64, bool, and string data types using different compression schemes for each one. Prometheus only supports float64.
For compression, the 0.9.5 version will have compression competitive with Prometheus. In some cases we'll see better results, since we vary the compression on timestamps based on what we see. The best-case scenario is a regular series sampled at exact intervals. In those cases, by default, we can compress 1k points' timestamps down to an 8-byte starting time, a delta (zig-zag encoded), and a count (also zig-zag encoded).
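For illustration, here is a small sketch of delta plus zig-zag encoding of timestamps; it shows the general technique described above, not InfluxDB's actual implementation:

```python
def zigzag_encode(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small: 0,-1,1,-2,2 -> 0,1,2,3,4.
    Assumes values fit in 64 bits."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def encode_timestamps(timestamps):
    """A perfectly regular series collapses to: start time, one delta, and a count."""
    start = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(set(deltas)) == 1:                        # regular sampling at an exact interval
        return start, zigzag_encode(deltas[0]), len(timestamps)
    return start, [zigzag_encode(d) for d in deltas], len(timestamps)

# Example: 1k points sampled every 10 ms (nanosecond-scale timestamps).
ts = [1_600_000_000_000_000_000 + i * 10_000_000 for i in range(1000)]
print(encode_timestamps(ts))   # -> (1600000000000000000, 20000000, 1000)
```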
Depending on the shape of the data we've seen < 2.5 bytes per point on average after compactions.
YMMV based on your timestamps, the data type, and the shape of the data. Random floats with nanosecond scale timestamps with large variable deltas would be the worst, for instance.
The variable precision in timestamps is another feature that InfluxDB has. It can represent second, millisecond, microsecond, or nanosecond scale times. Prometheus is fixed at milliseconds.
Another difference is that writes to InfluxDB are durable after a success response is sent to the client. Prometheus buffers writes in memory and by default flushes them every 5 minutes, which opens a window of potential data loss.
Our hope is that once 0.9.5 of InfluxDB is released, it will be a good choice for Prometheus users to use as long term metrics storage (in conjunction with Prometheus). I'm pretty sure that support is already in Prometheus, but until the 0.9.5 release drops it might be a bit rocky. Obviously we'll have to work together and do a bunch of testing, but that's what I'm hoping for.
For single server metrics ingest, I would expect Prometheus to have better performance (although we've done no testing here and have no numbers) because of their more constrained data model and because they don't append writes to disk before writing out the index.
The query languages of the two are very different. I'm not sure what they support that we don't yet, or vice versa, so you'd need to dig into the docs on both to see if there's something one can do that you need. Longer term, our goal is to have InfluxDB's query functionality be a superset of Graphite, RRD, Prometheus and other time-series solutions. I say superset because we want to cover those in addition to more analytic functions later on. It'll obviously take us time to get there.
Finally, a longer term goal for InfluxDB is to support high availability and horizontal scalability through clustering. The current clustering implementation isn't feature complete yet and is only in alpha. However, we're working on it and it's a core design goal for the project. Our clustering design is that data is eventually consistent.
To my knowledge, Prometheus' approach is to use double writes for HA (so there's no eventual consistency guarantee) and to use federation for horizontal scalability. I'm not sure how querying across federated servers would work.
Within an InfluxDB cluster, you can query across the server boundaries without copying all the data over the network. That's because each query is decomposed into a sort of MapReduce job that gets run on the fly.
There's probably more, but that's what I can think of at the moment.
We've got the marketing message from the two companies in the other answers. Now let's ignore it and get back to the sad real world of time-series data.
Some History
InfluxDB and Prometheus were made to replace old tools from a past era (RRDtool, Graphite).
InfluxDB is a time-series database. Prometheus is a sort-of metrics collection and alerting tool, with a storage engine written just for that. (I'm actually not sure you could [or should] reuse the storage engine for something else.)
Limitations
Sadly, writing a database is a very complex undertaking. The only way both these tools manage to ship something is by dropping all the hard features relating to high-availability and clustering.
To put it bluntly, it's a single application running only a single node.
Prometheus has no goal to support clustering and replication whatsoever. The official way to support failover is to "run 2 nodes and send data to both of them". Ouch. (Note that it's seriously the ONLY existing way possible, it's written countless times in the official documentation).
InfluxDB has been talking about clustering for years... until it was officially abandoned in March. Clustering ain't on the table anymore for InfluxDB. Just forget it. When it is done (supposing it ever is) it will only be available in the Enterprise Edition.
https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/
Within the next few years, we will hopefully have a well-engineered time-series database that is handling all the hard problems relating to databases: replication, failover, data safety, scalability, backup...
At the moment, there is no silver bullet.
What to do
Evaluate the volume of data to be expected.
100 metrics * 100 sources * 1 second => 10000 datapoints per second => 864 Mega-datapoints per day.
The nice thing about time-series databases is that they use a compact format, they compress well, they aggregate datapoints, and they clean out old data. (Plus they come with features relevant to time-series data.)
Supposing that a datapoint is treated as 4 bytes, that's only a few Gigabytes per day. Lucky for us, there are systems with 10 cores and 10 TB drives readily available. That could probably run on a single node.
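Spelling out the back-of-the-envelope arithmetic (4 bytes per datapoint is the assumption made above):

```python
metrics, sources, hz = 100, 100, 1
points_per_second = metrics * sources * hz            # 10_000
points_per_day = points_per_second * 86_400            # 864_000_000 ("864 mega-datapoints")
bytes_per_point = 4                                     # assumption from the text
raw_gb_per_day = points_per_day * bytes_per_point / 1e9
print(points_per_day, raw_gb_per_day)                   # 864000000 datapoints, ~3.5 GB/day before compression
```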
The alternative is to use a classic NoSQL database (Cassandra, ElasticSearch or Riak) then engineer the missing bits in the application. These databases may not be optimized for that kind of storage (or are they? modern databases are so complex and optimized, can't know for sure unless benchmarked).
You should evaluate the capacity required by your application. Write a proof of concept with these various databases and measure things.
See if it falls within the limitations of InfluxDB. If so, it's probably the best bet. If not, you'll have to make your own solution on top of something else.
InfluxDB simply cannot hold production load (metrics) from 1000 servers. It has some real problems with data ingestion and ends up stalled/hung and unusable. We tried to use it for a while, but once the data amount reached a critical level it could not be used anymore. No memory or CPU upgrades helped.
Therefore our experience is: definitely avoid it; it's not a mature product and has serious architectural design problems. And I am not even talking about the sudden shift to a commercial model by Influx.
Next we researched Prometheus, and while it required rewriting queries, it now ingests 4 times more metrics without any problems whatsoever, compared to what we tried to feed to Influx. And all that load is handled by a single Prometheus server; it's fast, reliable, and dependable. This is our experience running a huge international internet shop under pretty heavy load.
IIRC current Prometheus implementation is designed around all the data fitting on a single server. If you have gigantic quantities of data, it may not all fit in Prometheus.
I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net.
I'm a SQL Server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other NoSQL solutions, it's sounding more and more like an alternative to a traditional RDBMS may be best, due to optimizations for distributed queries and scaling. I tried Cassandra, and the .NET libraries don't currently compile or are all subject to change (along with Cassandra itself).
I've looked into many of the NoSQL data stores available, but can't find one that meets my needs as a robust production-ready platform.
If you had to store 36 billion small, flat records so that they're accessible from .NET, what would you choose, and why?
Storing ~3.5 TB of data and inserting about 1K rows/sec 24x7, while also querying at an unspecified rate, is possible with SQL Server, but there are more questions:
What availability requirement do you have for this? 99.999% uptime, or is 95% enough?
What reliability requirement do you have? Does missing an insert cost you $1M?
What recoverability requirement do you have? If you lose one day of data, does it matter?
What consistency requirement do you have? Does a write need to be guaranteed to be visible on the next read?
If you need all of the requirements I highlighted, the load you propose is going to cost millions in hardware and licensing on a relational system, any system, no matter what gimmicks you try (sharding, partitioning, etc.). A NoSQL system would, by its very definition, not meet all of these requirements.
So obviously you have already relaxed some of these requirements. There is a nice visual guide comparing the NoSQL offerings based on the 'pick 2 out of 3' paradigm at the Visual Guide to NoSQL Systems.
After OP comment update
With SQL Server this would be a straightforward implementation:
one single table with a clustered (GUID, time) key. Yes, it is going to get fragmented, but fragmentation only affects read-aheads, and read-aheads are needed only for significant range scans. Since you only query for a specific GUID and date range, fragmentation won't matter much. Yes, it is a wide key, so non-leaf pages will have poor key density. Yes, it will lead to a poor fill factor. And yes, page splits may occur. Despite these problems, given the requirements, it is still the best clustered key choice.
partition the table by time so you can implement efficient deletion of the expired records, via an automatic sliding window. Augment this with an online index partition rebuild of the last month to eliminate the poor fill factor and fragmentation introduced by the GUID clustering.
enable page compression. Since the clustered key groups by GUID first, all records of a GUID will be next to each other, giving page compression a good chance to deploy dictionary compression.
you'll need a fast IO path for the log file. You're interested in high throughput, not low latency, for the log to keep up with 1K inserts/sec, so striping is a must.
Partitioning and page compression each require Enterprise Edition SQL Server; they will not work on Standard Edition, and both are quite important to meet the requirements.
As a side note, if the records come from a front-end Web servers farm, I would put Express on each web server and instead of INSERT on the back end, I would SEND the info to the back end, using a local connection/transaction on the Express co-located with the web server. This gives a much much better availability story to the solution.
So this is how I would do it in SQL Server. The good news is that the problems you'll face are well understood and solutions are known. That doesn't necessarily mean this is better than what you could achieve with Cassandra, BigTable or Dynamo. I'll let someone more knowledgeable in things NoSQL-ish argue their case.
Note that I never mentioned the programming model, .NET support and such. I honestly think they're irrelevant in large deployments. They make a huge difference in the development process, but once deployed it doesn't matter how fast the development was if the ORM overhead kills performance :)
Contrary to popular belief, NoSQL is not about performance, or even scalability. It's mainly about minimizing the so-called Object-Relational impedance mismatch, but is also about horizontal scalability vs. the more typical vertical scalability of an RDBMS.
For the simple requirement of fast inserts and fast lookups, almost any database product will do. If you want to add relational data, or joins, or have any complex transactional logic or constraints you need to enforce, then you want a relational database. No NoSQL product can compare.
If you need schemaless data, you'd want to go with a document-oriented database such as MongoDB or CouchDB. The loose schema is the main draw of these; I personally like MongoDB and use it in a few custom reporting systems. I find it very useful when the data requirements are constantly changing.
The other main NoSQL option is distributed key-value stores such as BigTable or Cassandra. These are especially useful if you want to scale your database across many machines running commodity hardware. They work fine on servers too, obviously, but they don't take advantage of high-end hardware as well as SQL Server or Oracle or other databases designed for vertical scaling, and, obviously, they aren't relational and are no good for enforcing normalization or constraints. Also, as you've noticed, .NET support tends to be spotty at best.
All relational database products support partitioning of a limited sort. They are not as flexible as BigTable or other DKVS systems, they don't partition easily across hundreds of servers, but it really doesn't sound like that's what you're looking for. They are quite good at handling record counts in the billions, as long as you index and normalize the data properly, run the database on powerful hardware (especially SSDs if you can afford them), and partition across 2 or 3 or 5 physical disks if necessary.
If you meet the above criteria, if you're working in a corporate environment and have money to spend on decent hardware and database optimization, I'd stick with SQL Server for now. If you're pinching pennies and need to run this on low-end Amazon EC2 cloud computing hardware, you'd probably want to opt for Cassandra or Voldemort instead (assuming you can get either to work with .NET).
Very few people work at the multi-billion-row set size, and most times that I see a request like this on Stack Overflow, the data is nowhere near the size it is being reported as.
36 billion, at 3 billion per month, is roughly 100 million per day, 4.16 million an hour, ~70k rows per minute, 1.1k rows a second coming into the system, in a sustained manner for 12 months, assuming no downtime.
Those figures are not impossible by a long margin -- I've done larger systems -- but you want to double-check that these are really the quantities you mean; very few apps really have this quantity.
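A quick sanity check of those figures:

```python
rows_per_year = 36_000_000_000
per_month = rows_per_year / 12        # 3.0 billion
per_day   = per_month / 30            # ~100 million
per_hour  = per_day / 24              # ~4.17 million
per_min   = per_hour / 60             # ~69.4 thousand
per_sec   = per_min / 60              # ~1.16 thousand, sustained with no downtime
print(per_day, per_hour, per_min, per_sec)
```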
In terms of storing/retrieving, one quite critical aspect you have not mentioned is aging out the older data -- deletion is not free.
The normal technology to look at is partitioning; however, the lookup/retrieval being GUID-based would result in poor performance, assuming you have to get every matching value across the whole 12-month period. You could place a clustered index on the GUID column, which will get your associated data clustered for read/write, but at those quantities and that insertion speed, the fragmentation will be far too high to support, and it will fall on the floor.
I would also suggest that you are going to need a very decent hardware budget if this is a serious application with OLTP-type response speeds; by some approximate guesses, assuming very little indexing overhead, that is about 2.7 TB of data.
In the SQL Server camp, the only thing that you might want to look at is the new Parallel Data Warehouse edition (Madison), which is designed more for sharding out data and running parallel queries against it to provide high speed against large data marts.
"I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net."
I can tell you from experience that this is possible in SQL Server, because I did it in early 2009... and it's still in operation to this day and quite fast.
The table was partitioned into 256 partitions; keep in mind this was SQL Server 2005... and we did exactly what you're saying, which is to store bits of info by GUID and retrieve them by GUID quickly.
When I left we had around 2-3 billion records, and data retrieval was still quite good (1-2 seconds if going through the UI, or less if directly on the RDBMS), even though the data retention policy was just about to be put in place.
So, long story short, I took the 8th char (i.e. somewhere in the middle-ish) of the GUID string, SHA-1 hashed it, cast it to a tinyint (0-255), stored the row in the appropriate partition, and used the same function call when getting the data back.
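As an illustration of that kind of partition function (the code below is a hypothetical reconstruction of what the author describes, not their original implementation):

```python
import hashlib
import uuid

def partition_for(guid: str) -> int:
    """Hash the 8th character of the GUID string down to a tinyint (0-255)."""
    eighth_char = str(guid)[7]
    digest = hashlib.sha1(eighth_char.encode("ascii")).digest()
    return digest[0]                   # first byte of the SHA-1 digest -> 0..255

g = str(uuid.uuid4())
p = partition_for(g)                   # use the same function for both insert and lookup
print(g, p)
```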
ping me if you need more info...
The following article discusses the import and use of a 16 billion row table in Microsoft SQL.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they’re more compactly represented in binary fields than character data. If all your data is alphanumeric, you won’t gain much by exporting it in native format. Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field’s binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you’ll have to export it in chunks or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don’t include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter.
There is an unusual fact that seems to be overlooked.
"Basically after inserting 30Mil rows in a day, I need to fetch all the rows with the same GUID (maybe 20 rows) and be reasonably sure I'd get them all back"
Needing only about 20 rows per GUID, a non-clustered index on the GUID will work just fine. You could cluster on another column for data dispersion across partitions.
I have a question regarding the data insertion: How is it being inserted?
Is this a bulk insert on a certain schedule (per min, per hour, etc)?
What source is this data being pulled from (flat files, OLTP, etc)?
I think these need to be answered to help understand one side of the equation.
Amazon Redshift is a great service. It was not available when the question was originally posted in 2010, but it is now a major player in 2017. It is a column based database, forked from Postgres, so standard SQL and Postgres connector libraries will work with it.
It is best used for reporting purposes, especially aggregation. The data from a single table is stored on different servers in Amazon's cloud, distributed by the defined table distkeys, so you rely on distributed CPU power.
So SELECTs, and especially aggregated SELECTs, are lightning fast. Loading large data should preferably be done with the COPY command from Amazon S3 CSV files. The drawbacks are that DELETEs and UPDATEs are slower than usual, but that is why Redshift is not primarily a transactional database, but more of a data warehouse platform.
You can try using Cassandra or HBase, though you would need to read up on how to design the column families as per your use case.
Cassandra provides its own query language, but you need to use HBase's Java APIs to access the data directly.
If you need to use HBase, then I recommend querying the data with Apache Drill from MapR, which is an open-source project. Drill's query language is SQL-compliant (keywords in Drill have the same meaning they would have in SQL).
With that many records per year you're eventually going to run out of space.
Why not use filesystem storage like XFS, which supports 2^64 files, and run on smaller boxes?
Regardless of how fancy people want to get, or the amount of money one would end up spending on a system with whatever database -- SQL, NoSQL, whichever -- these many records are usually produced by electric companies and weather stations/providers, like a ministry of environment that controls smaller stations throughout the country.
If you're storing something like pressure, temperature, wind speed, humidity, etc., and the GUID is the location, you can still divide the data by year/month/day/hour.
Assuming you store 4 years of data per hard-drive.
You can then have it run on a smaller NAS with mirroring, which would also provide better read speeds, and have multiple mount points based on the year when the data was created.
You can simply make a web interface for searches.
So dumping location1/2001/06/01//temperature and location1/2002/06/01//temperature would only dump the hourly temperature contents for the 1st day of summer in those 2 years (24h * 2 = 48 small files), versus searching a database with billions of records and possibly millions spent.
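A sketch of that directory layout (the root path, file naming, and line format are illustrative assumptions):

```python
import os
from datetime import datetime

ROOT = "/data"   # e.g. one mount point per year on the NAS

def path_for(location: str, metric: str, ts: datetime) -> str:
    """location1/2001/06/01/temperature-style layout: one small file per location, metric, and day."""
    return os.path.join(ROOT, location, f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}", metric)

def append_reading(location, metric, ts, value):
    path = path_for(location, metric, ts)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a") as f:
        f.write(f"{ts:%H}\t{value}\n")       # one line per hourly reading

def dump_day(location, metric, ts):
    """Reading one day for one metric is just reading one small file."""
    with open(path_for(location, metric, ts)) as f:
        return f.read()

append_reading("location1", "temperature", datetime(2001, 6, 1, 13), 21.5)
print(dump_day("location1", "temperature", datetime(2001, 6, 1)))
```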
A simple way of looking at things: 1.5 billion websites in the world, with God knows how many pages each.
If a company like Google had to spend millions per 3 billion searches to pay for supercomputers for this, they'd be broke.
Instead they have the power bill... and a couple million crap computers.
And Caffeine indexing... future-proof, just keep adding more.
And yeah, where indexing running off SQL makes sense, then great.
Building supercomputers for crappy tasks with fixed things like weather, statistics and so on, just so techs can brag that their systems crunch x TB in x seconds, is a waste of money that could be spent somewhere else -- maybe on that power bill, which won't run into the millions anytime soon if you run something like 10 NAS servers.
Store records in plain binary files, one file per GUID; it wouldn't get any faster than that.
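For what that could look like, here is a minimal sketch (the fixed record layout of a timestamp plus a 64-byte payload is an assumption):

```python
import struct

RECORD = struct.Struct("<d64s")     # timestamp + 64-byte payload (assumed layout)

def append_record(guid: str, timestamp: float, payload: bytes):
    with open(f"{guid}.bin", "ab") as f:            # one append-only file per GUID
        f.write(RECORD.pack(timestamp, payload))

def read_records(guid: str):
    """Fetching all records for a GUID is a single sequential read of its file."""
    with open(f"{guid}.bin", "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, off) for off in range(0, len(data), RECORD.size)]
```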
You can use MongoDB and use the GUID as the sharding key. This means that you can distribute your data over multiple machines, but the data you want to select is on only one machine, because you select by the sharding key.
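A hedged sketch with pymongo against a sharded cluster; the database and collection names are assumptions, and the admin commands must be issued through a mongos router:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # assumed mongos address

# Enable sharding for the database and shard the collection on the GUID key.
client.admin.command("enableSharding", "metrics")
client.admin.command("shardCollection", "metrics.records", key={"guid": 1})

records = client.metrics.records
records.insert_one({"guid": "3f2504e0-4f89-11d3-9a0c-0305e82c3301",
                    "payload": b"\x00" * 64})

# Because the query includes the shard key, mongos routes it to a single shard.
docs = list(records.find({"guid": "3f2504e0-4f89-11d3-9a0c-0305e82c3301"}))
```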
Sharding in MongoDB is not yet production ready.
OK, dumb question, I know, but I see the nebulous comment 'a large database', as well as small and medium, and I wonder just what that means. Can someone define what small, medium, and large databases are for us SQL neophytes?
There isn't a threshold where a small database becomes medium or a medium database becomes large. Generally, when I hear these terms, I think of particular orders of magnitude in terms of total records being stored.
Small: Fits in a spreadsheet.
Medium: Fits in memory on a commodity server.
Large: Fits in a commodity cloud offering.
Very large: Fits in a specialized environment; unusual storage, latency, or throughput characteristics.
As poster dkretz suggested, you could also think about it in terms of the properties each kind of database has. Categorizing it this way, I'd say:
Small: Performance is not a concern. Your queries run fine without making any special optimizations. You see only a marginal performance difference when using front-line enhancements like indexes.
Medium: Your database probably has one or more staff that are assigned part-time to its maintenance and care. These people pay attention to the database's health; their primary administrative responsibility is to prevent unacceptable performance problems and minimize downtime.
Large: Probably has dedicated staff member(s) whose job is to work on the database and improve performance, as well as make sure that application changes don't cause schema breakage over the lifetime of the database. Metrics about the health and status of the database are monitored closely. Significant expertise is required to understand and perform optimizations.
Very large: The database stores vast amounts of information that must be readily accessible. Performance optimizations are absolutely required to wring every last ounce of speed out of each query, and without them, the database would be much less usable or even impossible to use. The database may be using sophisticated or innovative replication or clustering techniques, pushing the boundaries of current technology.
Note that these are entirely subjective, and that someone may very well have a perfectly legitimate alternate definition of "large".
One way to figure it is by observing your test queries.
A small database is one where indexes don't matter.
A medium database is one where queries take longer than one second if you don't have an appropriate index in place.
A big database is one where queries often take hours to optimize, using a combination of query design, index modification, and many test cycles.
Large databases are ones that force you to stop using relational databases.
In other words, a normalized, relational database where all the indexes in the world can't help you meet your response time requirements because of the massive JOINs.
If you've ever had to abandon relational databases for something else, you're either a poor database developer, have no expert DBA, or have a very large database.
“Large database” is indeed a nebulous concept. There are already very different answers and opinions posted in the answers to this question. Some approaches to defining “small”, “medium” and “large” databases may make more sense than others, BUT THEN, at some point, I consider each definition to be right, true and valid.
Some definitions make more sense than others because they focus on different aspects of importance for the design, programming, use, maintenance and administration of a Database and these different aspects are what really matter for a usable Database. It just happens that all these aspects are impacted by the nebulous concept of “Database size”.
So, does this mean that it does not matter whether you are able to define if a particular database is big or not?
Certainly not. What it means is that you will apply the concept differently while evaluating different design/operational/administrative aspects of your database. It also means that this concept will be nebulous every time.
As an example: Database Index strategy (an aspect of Database design) is impacted by record count for each table (a measure of “size”), by record size times record count (another measure of “size”), and by Query Vs. Creation/Update/Delete operations ratio (an aspect of Database usage).
Query response times are better if indexes are used for tables with large amounts of records. Depending on the nature of your WHERE, ORDER BY and record-aggregation clauses, you may need several indexes for certain tables.
Creation, update and delete operations are impacted negatively by an increase in the number of indexes on the affected table(s). More indexes for an affected table mean more changes that the RDBMS must perform, spending more time and more resources to apply those changes.
Also, if your RDBMS spends more time applying those changes, then the locks are held for longer as well, impacting the response times of other queries being sent to the system at the same time.
So, How do you balance the quantity and design of your indexes? How do you know if you need an additional index and if by adding that index you will not be introducing a big negative impact on query response times? Answer: You test and profile your database against a target load as per your load/performance requirements and analyze the profiling data in order to discover if further optimizations/redesigns/indexes are needed.
Different Index strategies are required for different Query Vs. Creation/Update/Delete operations ratios. If your Database is under a heavy load of queries but is rarely updated, the performance for the overall application will be better if you add every index that improves query response times. On the other hand, if your Database is constantly being updated but there are not large query operations, then the performance will be better if you use less indexes.
There are other aspects of course: Database Schema design, Storage Strategy, Network design, Backup strategy, Stored Procedures/Triggers/Etc. programming, Application Programming (against the Database), Etc. All these aspects are impacted differently by distinct concepts of “size” (record size, record count, index size, index count, schema design, storage size, etc.).
I'd like to have more time, as this topic is fascinating. I hope this small contribution serves as a starting point for you in this fascinating world of SQL.
You have to account for hardware advancement for this definition:
Small database: working set fits into the physical RAM of a single commodity server (about 16GB now)
Medium database: fits into a single or several (through RAID) commodity hard drives on a single machine (up to several TBs now)
Large database: Data needs to distributed across multiple commodity servers in order to fit (up to several PBs now.)
According to the Wikipedia article on Very Large Database:
A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time.
If you have a database that is large enough that you can't just "back it up" to put on a development or test box, you likely have a "large database".
I think something like Wikipedia, or the US census data, is a 'big' database. My personal address list or todo list is a small database. A middle-sized database is something in between.
You could try to define the sizes by how many servers you need. A small database is a component of an application you run on your desktop, a mid-sized database would be a single MySQL (or whatever) server somewhere, and a large database is going to require multiple servers with some kind of replication/failover support.
Alternatively, consider the "size" of the database as the amount of time it takes to change the schema used to represent a domain of information. (In actual implementations, databases may contain multiple schemas and disparate domains at once.)
Days = "Small database."
Weeks = "Medium database."
Months = "Large database."
Years = "HUGE database."
With this heuristic, "size" is ultimately an aspect of the information stored and the rate at which the information can be fully transformed. Such an approach based on time also maintains some semblance of how-does-this-affect-design-decisions as the sheer amount of data / number of rows increases and the performance of technology and implementations increases.
A variation of the above is to consider the "size" based on the amount of time required for management and routine maintenance. As the amount of data increases, so does the time for tasks such as backups, rebuilds, and upgrades. Without significant investment, this may outpace the time available for such tasks.
Regardless, the key factor of “size” is time.