According to the Prometheus website, one main difference between Prometheus and InfluxDB is the use case: while Prometheus stores only time series, InfluxDB is better geared towards storing individual events. Since some major work has been done on the storage engine of InfluxDB, I wonder if this is still true.
I want to set up a time series database, and apart from the push/pull model (and probably a difference in performance) I can see nothing major that separates the two projects. Can someone explain the difference in use cases?
InfluxDB CEO and developer here. The next version of InfluxDB (0.9.5) will have our new storage engine. With that engine we'll be able to efficiently store either single event data or regularly sampled series, i.e. both irregular and regular time series.
InfluxDB supports int64, float64, bool, and string data types using different compression schemes for each one. Prometheus only supports float64.
For compression, the 0.9.5 version will have compression competitive with Prometheus. In some cases we'll see better results, since we vary the compression on timestamps based on what we see. The best case scenario is a regular series sampled at exact intervals. In that case, by default, we can compress the timestamps of 1,000 points into an 8-byte starting time, a delta (zig-zag encoded), and a count (also zig-zag encoded).
Depending on the shape of the data we've seen < 2.5 bytes per point on average after compactions.
YMMV based on your timestamps, the data type, and the shape of the data. Random floats with nanosecond scale timestamps with large variable deltas would be the worst, for instance.
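To make the "delta + zig-zag" idea above concrete, here is a minimal sketch in Python. This is my own illustration, not InfluxDB's actual encoder: zig-zag encoding maps small signed deltas to small unsigned integers, and a perfectly regular series collapses to a start time, one delta, and a count.

    # Illustrative sketch only -- not InfluxDB's real encoder. It shows storing a
    # block of regularly spaced timestamps as (start, delta, count), with zig-zag
    # encoding so small signed deltas become small unsigned integers.

    def zigzag_encode(n: int) -> int:
        """Map signed ints to unsigned ints: 0->0, -1->1, 1->2, -2->3, ..."""
        return (n << 1) ^ (n >> 63)

    def zigzag_decode(z: int) -> int:
        return (z >> 1) ^ -(z & 1)

    def compress_regular_timestamps(timestamps):
        """If the series is perfectly regular, keep only (start, delta, count)."""
        deltas = {b - a for a, b in zip(timestamps, timestamps[1:])}
        if len(deltas) != 1:
            raise ValueError("series is not regularly sampled")
        delta = deltas.pop()
        return timestamps[0], zigzag_encode(delta), len(timestamps)

    def decompress_regular_timestamps(start, zz_delta, count):
        delta = zigzag_decode(zz_delta)
        return [start + i * delta for i in range(count)]

    ts = [1_000_000_000 + i * 10 for i in range(1000)]   # 1k points, 10-unit spacing
    block = compress_regular_timestamps(ts)
    assert decompress_regular_timestamps(*block) == ts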
The variable precision in timestamps is another feature that InfluxDB has. It can represent second, millisecond, microsecond, or nanosecond scale times. Prometheus is fixed at milliseconds.
Another difference is that writes to InfluxDB are durable after a success response is sent to the client. Prometheus buffers writes in memory and by default flushes them every 5 minutes, which opens a window of potential data loss.
Our hope is that once 0.9.5 of InfluxDB is released, it will be a good choice for Prometheus users to use as long term metrics storage (in conjunction with Prometheus). I'm pretty sure that support is already in Prometheus, but until the 0.9.5 release drops it might be a bit rocky. Obviously we'll have to work together and do a bunch of testing, but that's what I'm hoping for.
For single server metrics ingest, I would expect Prometheus to have better performance (although we've done no testing here and have no numbers) because of their more constrained data model and because they don't append writes to disk before writing out the index.
The query languages of the two are very different. I'm not sure what they support that we don't yet, or vice versa, so you'd need to dig into the docs on both to see if there's something one can do that you need. Longer term, our goal is to have InfluxDB's query functionality be a superset of Graphite, RRD, Prometheus, and other time series solutions. I say superset because we want to cover those in addition to more analytic functions later on. It'll obviously take us time to get there.
Finally, a longer term goal for InfluxDB is to support high availability and horizontal scalability through clustering. The current clustering implementation isn't feature complete yet and is only in alpha. However, we're working on it and it's a core design goal for the project. Our clustering design is that data is eventually consistent.
To my knowledge, Prometheus' approach is to use double writes for HA (so there's no eventual consistency guarantee) and to use federation for horizontal scalability. I'm not sure how querying across federated servers would work.
Within an InfluxDB cluster, you can query across the server boundaries without copying all the data over the network. That's because each query is decomposed into a sort of MapReduce job that gets run on the fly.
There's probably more, but that's what I can think of at the moment.
We've got the marketing message from the two companies in the other answers. Now let's ignore it and get back to the sad real world of time-series data.
Some History
InfluxDB and Prometheus were made to replace old tools from a past era (RRDtool, Graphite).
InfluxDB is a time series database. Prometheus is a sort-of metrics collection and alerting tool, with a storage engine written just for that. (I'm actually not sure you could [or should] reuse the storage engine for something else)
Limitations
Sadly, writing a database is a very complex undertaking. The only way both these tools manage to ship something is by dropping all the hard features relating to high-availability and clustering.
To put it bluntly, each is a single application running on only a single node.
Prometheus has no goal to support clustering and replication whatsoever. The official way to support failover is to "run 2 nodes and send data to both of them". Ouch. (Note that it's seriously the ONLY existing way possible; this is written countless times in the official documentation.)
InfluxDB has been talking about clustering for years... until it was officially abandoned in March. Clustering ain't on the table anymore for InfluxDB. Just forget it. When it's done (supposing it ever is), it will only be available in the Enterprise Edition.
https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/
Within the next few years, we will hopefully have a well-engineered time-series database that is handling all the hard problems relating to databases: replication, failover, data safety, scalability, backup...
At the moment, there is no silver bullet.
What to do
Evaluate the volume of data to be expected.
100 metrics * 100 sources * 1 second => 10000 datapoints per second => 864 Mega-datapoints per day.
The nice thing about times series databases is that they use a compact format, they compress well, they aggregate datapoints, and they clean old data. (Plus they come with features relevant to time data series.)
Supposing that a datapoint is treated as 4 bytes, that's only a few Gigabytes per day. Lucky for us, there are systems with 10 cores and 10 TB drives readily available. That could probably run on a single node.
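As a sanity check on those numbers, here is a quick back-of-envelope calculation in Python; the 4 bytes per datapoint is just the working assumption from above, ignoring compression and per-point overhead.

    metrics = 100
    sources = 100
    points_per_second = metrics * sources              # 10,000
    points_per_day = points_per_second * 86_400        # 864,000,000 ("864 Mega-datapoints")
    bytes_per_point = 4                                 # assumption from the text above
    gib_per_day = points_per_day * bytes_per_point / 2**30

    print(f"{points_per_second:,} points/s, {points_per_day:,} points/day")
    print(f"~{gib_per_day:.1f} GiB/day at {bytes_per_point} bytes/point")   # ~3.2 GiB/day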
The alternative is to use a classic NoSQL database (Cassandra, ElasticSearch or Riak) then engineer the missing bits in the application. These databases may not be optimized for that kind of storage (or are they? modern databases are so complex and optimized, can't know for sure unless benchmarked).
You should evaluate the capacity required by your application. Write a proof of concept with these various databases and measure things.
See if it falls within the limitations of InfluxDB. If so, it's probably the best bet. If not, you'll have to make your own solution on top of something else.
InfluxDB simply cannot hold production load (metrics) from 1000 servers. It has some real problems with data ingestion and ends up stalled/hung and unusable. We tried to use it for a while, but once the data volume reached a critical level it could not be used anymore. No memory or CPU upgrades helped.
Therefore, our experience is: definitely avoid it; it's not a mature product and has serious architectural design problems. And I am not even talking about the sudden shift to a commercial model by Influx.
Next we researched Prometheus, and while it required rewriting queries, it now ingests four times more metrics without any problems whatsoever, compared to what we tried to feed to Influx. All that load is handled by a single Prometheus server; it's fast, reliable, and dependable. This is our experience running a huge international internet shop under pretty heavy load.
IIRC current Prometheus implementation is designed around all the data fitting on a single server. If you have gigantic quantities of data, it may not all fit in Prometheus.
I'm currently receiving 2000 prices per second from a stock exchange and need to save those in an appropriate database. My current choice is PostgreSQL, which is way too slow. I need to save those prices (ticks) in an aggregated form like OHLC. So if I want to save D1 data, for instance, I need to first get the previous D1 record for the stock from the database, check if the high or low price has changed, set a new close price, and then save it to the database again. This is taking forever and is not possible with Postgres. I would rather not save the OHLC data; I prefer to query (aggregate) it in real time.
So my requirements are:
persistence
fast writes (currently 2k per second, up to 10k)
queries, e.g. aggregating OHLC data in real-time (50-100 per second)
usable from any modern programming language without writing raw queries (an SDK for Python or JS for that database)
deployable on AWS or GCP without hassle
I was thinking about Apache Cassandra. I'm not familiar with Cassandra; are powerful queries like the OHLC one possible? Are there any alternatives to Cassandra?
Thanks in advance!
Given what I've understood from your question, I believe Cassandra should easily fit your use-case.
Regarding your requirements:
persistence : Cassandra will not only persist your data but also cover redundancy with minimal configuration;
fast writes : this is what Cassandra is most optimized for, and while the exact throughput depends on a lot of factors, in general Cassandra will manage writes measured in the thousands/sec/core; also, the eventual number of writes is not really relevant, as Cassandra can scale linearly with no real penalty, so 5k, 10k, 100k or more are all doable;
adaptability : Cassandra has official drivers for the most common languages (Python, C family, NodeJS, Java, Ruby, PHP, Scala) as well as community-developed ones for more languages (list of drivers);
deployable : it's very easy to deploy in the cloud. You can choose to deploy it manually on independent instances, or use a managed Cassandra cluster (AWS has one called 'AWS Keyspaces', DataStax (the company driving most of the development behind Cassandra) has one called 'Astra', and there are even more possible solutions). Given that Cassandra is one of the major players when it comes to big-data storage, finding a place for your DB in the cloud should be easy.
I have only mentioned 4 of the 5 requirements. That is because when talking about reading, things get more complex and a larger discussion is needed.
50-100 reads/s given the 2k+ writes per second seems to be in line with the general idea of Cassandra being optimized for write-intensive tasks. In Cassandra, the way you model your tables will dictate how well things work. For a task like the one you have described, my first thoughts are:
You bucket each stock per day => you get a partition with around 30k rows (1 update/s for 8 trading hours) and a size of under 0.2MB (30k * 4B). This is well within the recommended values and clearly under the worst-case-scenario ones (a minimal schema sketch follows this list);
when you need the aggregated data you have 2 options:
2a. You read the partition as is and aggregate it application side (what I would recommend);
2b. You implement a "User-Defined Aggregate" function on your database that will do the work (docs). This should be doable, although I won't guarantee it. Apart from being harder to implement, the problem is that putting this kind of extra workload on the DB might not be what you want given your apparent use case. Let me explain: I'd expect your reading load to be most active during certain times (before, during and after trading hours), with times when the load is lighter. Depending on your architecture, you could have multiple application instances up during peak times and then scale them back during off-peak hours in order to lower costs. While applications can easily be scaled up and down on cloud providers like AWS and GCP, Cassandra cannot be scaled up and down like this (5 nodes in the morning, 3 at night, and so on). Well, it could, but it's not designed for that and it would be a terrible decision. So moving as much of the non-constant workload to the application as possible seems the best idea;
(Optional) Have a worker that, at the end of the day/trading day, aggregates the values for each stock and saves them to another table, so that looking at historic data is easier. This data could even be bucketed by week, month or even year depending on how much space the aggregated data takes.
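To make options 1 and 2a concrete, here is a hedged sketch using the DataStax Python driver; the keyspace, table and column names are invented for illustration, and the exact schema would depend on your real tick payload.

    # Hedged sketch of options 1 and 2a above, using the DataStax Python driver.
    # Keyspace/table/column names are made up for illustration.
    import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # contact points depend on your deployment
    session = cluster.connect("market_data")  # hypothetical keyspace

    # One partition per (symbol, trading day); rows are clustered by tick time.
    session.execute("""
        CREATE TABLE IF NOT EXISTS ticks (
            symbol text,
            day date,
            ts timestamp,
            price double,
            PRIMARY KEY ((symbol, day), ts)
        ) WITH CLUSTERING ORDER BY (ts ASC)
    """)

    def ohlc(symbol: str, day: datetime.date):
        """Read one partition and aggregate OHLC on the application side (option 2a)."""
        rows = session.execute(
            "SELECT ts, price FROM ticks WHERE symbol = %s AND day = %s",
            (symbol, day),
        )
        prices = [r.price for r in rows]   # rows come back already ordered by ts
        if not prices:
            return None
        return {
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
        }

Because the partition key is (symbol, day), a whole day's ticks live on one partition and come back already ordered by the clustering column, so the application-side aggregation is a single cheap pass.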
You could also add Spark and Kafka in front of Cassandra for a more powerful approach to the real-time aggregation, but we shouldn't deviate that much from the question at hand.
Cassandra is very powerful with the right modeling and the right architecture. At first glance, what you need seems to be a good fit for Cassandra; however, as powerful as it can be, it can be just as bad if you use it in ways it wasn't designed for. I hope this answer puts you on a path to making the right decision.
Cheers.
Reaching out to the community to pressure test our internal thinking.
We are building a simplified business intelligence platform that will aggregate metrics (e.g. traffic, backlinks) and text lists (e.g. search keywords, technologies used) from several data providers.
The data will be somewhat loosely structured and may change over time with vendors potentially changing their response formats.
Long term, data volume may be around 100,000 rows x 25 input vectors.
Data would be updated and read continuously but not at massive concurrent volume.
We'd expect to need to do some ETL transformations on the gathered data from partners along the way to the UI (e.g show trending information over the past five captured data points).
We'd want to archive every single data snapshot (i.e. version it) vs just storing the most current data point.
The persistence technology should be readily available through AWS.
Our assumption is our requirements lend themselves best towards DynamoDB (vs Amazon Neptune or Redshift or Aurora).
Is that fair to assume? Are there any other questions / information I can provide to elicit input from this community?
Because of your requirement to have a schema-less structure, and to version each item, DynamoDB is a great choice. You will likely want to build the table as a composite Partition/Sort key structure, with the Sort key being the Version, and there are several techniques you can use to help you locate the 'latest' version etc. This is a very common pattern, and with DDB Autoscaling you can ensure that you only provision the amount of capacity that you actually need.
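As a rough illustration of that composite-key pattern (not an official AWS recipe; the table and attribute names are invented), a boto3 sketch might look like this:

    # Hedged sketch of the partition-key + version-as-sort-key pattern described
    # above, using boto3. Table and attribute names are made up for illustration.
    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("metrics_snapshots")   # PK: provider_metric (S), SK: version (N)

    def put_snapshot(provider_metric: str, payload: dict):
        """Append a new immutable version of a data point (never overwrite)."""
        table.put_item(Item={
            "provider_metric": provider_metric,
            "version": int(time.time() * 1000),   # monotonic-enough version key
            **payload,   # loosely structured vendor fields (numbers must be Decimal, not float)
        })

    def latest_snapshot(provider_metric: str):
        """Fetch only the newest version by reading the sort key in reverse."""
        resp = table.query(
            KeyConditionExpression=Key("provider_metric").eq(provider_metric),
            ScanIndexForward=False,   # descending by version
            Limit=1,
        )
        items = resp["Items"]
        return items[0] if items else None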
I need to choose a database for storing data remotely from a large number (thousands to tens of thousands) of sensors, each generating around one entry per minute.
The data needs to be queried in a variety of ways, from counting data with certain characteristics for statistics to simply outputting it for plotting.
I am looking around for the right tool. I started with MySQL, but I feel like it lacks the scalability needed for this project, and this led me to NoSQL databases, which I don't know much about.
Which Database, either relational or not would be a good choice?
Thanks.
There is usually no "best" database since they all involve trade-offs of one kind or another. Your question is also very vague because you don't say anything about your performance needs other than the number of inserts per minute (how much data per insert?) and that you need "scalability".
It also looks like a case of premature optimization because you say you "feel like [MySQL] lacks the scalability needed for this project", but it doesn't sound like you've run any tests to confirm whether this is a real problem. It's always better to get real data rather than base an important architectural decision on "feelings".
Here's a suggestion:
Write a simple test program that inserts 10,000 rows of sample data per minute (see the sketch after this list)
Run the program for a decent length of time (a few days or more) to generate a sizable chunk of test data
Run your queries to see if they meet your performance needs (which you haven't specified -- how fast do they need to be? how often will they run? how complex are they?)
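For step 1, a rough sketch of such a test program (using PyMySQL; the schema, credentials and value distribution are placeholders you'd replace with something resembling your real sensor data) could be:

    # Rough sketch of step 1: insert ~10,000 rows of fake sensor data per minute
    # and see how the database holds up. Uses PyMySQL; table, credentials and
    # schema here are placeholders.
    import random
    import time
    import pymysql

    conn = pymysql.connect(host="localhost", user="test", password="test", database="sensors")

    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS readings (
                sensor_id INT,
                recorded_at DATETIME,
                value DOUBLE,
                INDEX (sensor_id, recorded_at)
            )
        """)
    conn.commit()

    ROWS_PER_MINUTE = 10_000

    while True:
        start = time.time()
        rows = [(random.randint(1, 10_000), random.random()) for _ in range(ROWS_PER_MINUTE)]
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO readings (sensor_id, recorded_at, value) VALUES (%s, NOW(), %s)",
                rows,
            )
        conn.commit()
        elapsed = time.time() - start
        print(f"inserted {ROWS_PER_MINUTE} rows in {elapsed:.1f}s")
        time.sleep(max(0, 60 - elapsed))   # aim for one batch per minute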
You're testing at least two things here: whether your database can handle 10,000 inserts per minute and whether your queries will run quickly enough once you have a huge amount of data. With large datasets these will become competing priorities since you need indexes for fast queries, but indexes will start to slow down your inserts over time. At some point you'll need to think about data archival as well (or purging, if historical data isn't needed) both for performance and for practical reasons (finite storage space).
These will be concerns no matter what database you select. From what little you've told us about your retrieval needs ("counting data with certain characteristics" and "simple outputting for plotting") it sounds like any type of database will do. It may be that other concerns are more important, such as ease of development (what languages and tools are you using?), deployment, management, code maintainability, etc.
Since this is sensor data we're talking about, you may also want to look at a round robin database (RRD) such as RRDTool to see if that approach better serves your needs.
Found this question while googling for "database for sensor data"
One of the very helpful search results (along with this SO question) was this blog:
Actually, I had started a similar project (http://reatha.de) but realized too late that I was not using the best technologies available. My approach was similar: MySQL + PHP. Eventually I realized that this was not scalable and stopped the project.
Additionally, a good starting point is looking at the list of databases on Heroku:
If they offer one, it should not be the worst one.
I hope this helps.
You can try the Redis NoSQL database.
We're building a measurement system that will eventually consist of thousands of measurement stations. Each station will save around 500 million measurements consisting of 30 scalar values over its lifetime. These will be float values. We're now wondering how to save this data on each station, considering we'll be building a web app on each station such that
we want to visualize the data on multiple timescales (eg measurements of one week, month, year)
we need to build moving averages over the data (eg average over a month to show in a year graph)
the database needs to be crash resistant (power outages)
we are only doing writes and reads, no updates or deletes on the data
additionally we'd like one more server that can show the data of, say, 1000 measurement stations. That would be ~50TB of data in 500 billion measurements. To transmit the data from measurement station to server, I thought that some type of database-level replication would be a clean and efficient way.
Now I'm wondering if a NoSQL solution might be better than MySQL for these purposes. Especially CouchDB, Cassandra and maybe key-value stores like Redis look appealing to me. Which of those would suit the "measurement time series" data model best in your opinion? What about other advantages like crash-safety and replication from measurement station to main server?
I think CouchDB is a great database -- but its ability to deal with large data is questionable. CouchDB's primary focus is on simplicity of development and offline replication, not necessarily on performance or scalability. CouchDB itself does not support partitioning, so you'll be limited by the maximum node size unless you use BigCouch or invent your own partitioning scheme.
No foolin, Redis is an in-memory database. It's extremely fast and efficient at getting data in and out of RAM. It does have the ability to use disk for storage, but it's not terribly good at it. It's great for bounded quantities of data that change frequently. Redis does have replication, but does not have any built-in support for partitioning, so again, you'll be on your own here.
You also mentioned Cassandra, which I think is more on target for your use case. Cassandra is well suited for databases that grow indefinitely; essentially that was its original use case. The partitioning and availability are baked in, so you won't have to worry about them very much. The data model is also a bit more flexible than the average key/value store, adding a second dimension of columns, and can practically accommodate millions of columns per row. This allows time-series data to be "bucketed" into rows that cover time ranges, for example. The distribution of data across the cluster (partitioning) is done at the row level, so only one node is necessary to perform operations within a row.
Hadoop plugs right into Cassandra, with "native drivers" for MapReduce, Pig, and Hive, so it could potentially be used to aggregate the collected data and materialize the running averages. The best practice is to shape data around queries, so you'll probably want to store multiple copies of the data in "denormalized" form, one for each type of query.
Check out this post on doing time-series in Cassandra:
http://rubyscale.com/2011/basic-time-series-with-cassandra/
For highly structured data of this nature (time series of float vectors), I tend to shy away from databases altogether. Most of the features of a database aren't very interesting; you basically aren't interested in things like atomicity or transactional semantics. The only feature that is desirable is resilience to crashing. That feature, however, is trivially easy to implement when you never need to undo a write (no updates/deletes): just append to a file. Crash recovery is simple: open a new file with an incremented serial number in the filename.
A logical format for this is plain old CSV. After each measurement is taken, call flush() on the underlying file. Getting the data replicated back to the central server is a job efficiently solved by rsync(1). You can then import the data into the analysis tool of your choice.
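A minimal sketch of that approach, using only the Python standard library; read_sensors is a hypothetical callback returning the 30 float values:

    # Minimal sketch of the append-only CSV approach described above: flush (and
    # optionally fsync) after every measurement, and start a new file with an
    # incremented serial number after each restart/crash.
    import csv
    import glob
    import os
    import time

    def next_data_file(directory="data"):
        os.makedirs(directory, exist_ok=True)
        existing = glob.glob(os.path.join(directory, "measurements-*.csv"))
        serial = len(existing) + 1          # crash recovery: just start a new file
        return os.path.join(directory, f"measurements-{serial:06d}.csv")

    def record_measurements(read_sensors, path):
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                row = [time.time()] + read_sensors()   # timestamp + 30 float values
                writer.writerow(row)
                f.flush()                  # push to the OS...
                os.fsync(f.fileno())       # ...and onto disk, so a power cut loses at most one row
                time.sleep(1)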
I would personally shy away from CSV and plain-text files. These are convenient when you have low volume and want to skip the tools to quickly look at the data or make small alterations to it.
When you're talking about 50 TB of data, that's quite a lot. If a simple trick reduces that by a factor of two, it will pay for itself in storage costs and bandwidth charges.
If the measurements are taken on a regular basis, then instead of saving a timestamp with every measurement, you can store the start time and interval and just store the measurements.
I'd go for a file format that has a small header and then just a bunch of floating point measurements. To prevent files from getting really, really large, decide on a maximum file size. If you initialize the file by fully writing it before starting to use it, it will be completely allocated on disk by the time you start to use it. Now you can mmap the file and alter the data. If power goes down while you are changing the data, it simply either makes it to disk or it doesn't.
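A rough sketch of that layout in Python, assuming a little-endian header holding the start time, sample interval and capacity, followed by preallocated float32 slots; the sizes and field choices here are just an example.

    # Sketch of the preallocated, mmap-backed file described above: a small header
    # (start time, sample interval, capacity) followed by fixed float32 slots.
    import mmap
    import struct
    import time

    HEADER = struct.Struct("<ddI")      # start_time, interval_seconds, max_samples
    SAMPLE = struct.Struct("<f")        # one float32 measurement
    MAX_SAMPLES = 1_000_000

    def create_file(path, interval=60.0):
        size = HEADER.size + MAX_SAMPLES * SAMPLE.size
        with open(path, "wb") as f:
            f.write(b"\x00" * size)      # fully allocate up front
        with open(path, "r+b") as f:
            mm = mmap.mmap(f.fileno(), size)
            HEADER.pack_into(mm, 0, time.time(), interval, MAX_SAMPLES)
            mm.flush()
            mm.close()

    def write_sample(path, index, value):
        """Timestamp is implied by index: start_time + index * interval."""
        with open(path, "r+b") as f:
            mm = mmap.mmap(f.fileno(), 0)
            SAMPLE.pack_into(mm, HEADER.size + index * SAMPLE.size, value)
            mm.flush()                   # either the write makes it to disk or it doesn't
            mm.close()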
I am writing a web application with nodeJS that can be used by other applications to store logs and accessed later in a web interface or by applications themselves providing an API. Similar to Graylog2 but schema free.
I've already tried CouchDB, in which each document would be a log doc, but since I'm not really using revisions it seems to me I'm not using all its features. Besides that, I think if the logs exceed a certain size, they would be pretty hard to manage in CouchDB.
What I'm really looking for is a big array of logs that can be sorted, filtered, searched and capped, and then have its latest events accessed. It should be schema-free, and writing to it should be non-blocking.
I'm considering using Cassandra (I'm not really familiar with it) due to the points made here. MongoDB seems good too, since Graylog2 uses MongoDB, and there are some good points about it here.
I've already seen this question, but I'm not satisfied with the answers.
Edit:
For some reasons I can't use Cassandra in production, now I'm trying MongoDB.
One more reason to use MongoDB:
http://www.slideshare.net/WombatNation/logging-app-behavior-to-mongo-db
More edits:
It is similar to Graylog2, but the difference I want is that instead of having a message field, the fields are defined by the client, which is why I want it to be schema-free; because of that, I may need to query on the user-defined fields. We could build it on SQL, but querying on user-defined fields would be reinventing the wheel. The same goes for files.
Technically, what I'm looking for is rich statistical data in the end, or easy debugging and a lot of other stuff that we can't currently get out of the logs.
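As a sketch of how the schema-free, user-defined-fields part could look with MongoDB (PyMongo; the database, collection and field names are made up):

    # Hedged sketch of the schema-free approach described above, using PyMongo:
    # clients send whatever fields they like, and you can still query and index
    # on those user-defined fields. Names here are invented for illustration.
    from datetime import datetime, timezone
    from pymongo import MongoClient, DESCENDING

    client = MongoClient("mongodb://localhost:27017")
    logs = client.logdb.events

    def store_log(app_name: str, fields: dict):
        """Each client defines its own fields; only a couple are enforced by us."""
        doc = {"app": app_name, "ts": datetime.now(timezone.utc), **fields}
        logs.insert_one(doc)

    # Query directly on a client-defined field...
    store_log("shop", {"level": "error", "order_id": 1234, "latency_ms": 87})
    recent_errors = logs.find({"app": "shop", "level": "error"}).sort("ts", DESCENDING).limit(50)

    # ...and index the fields you query most, including user-defined ones.
    logs.create_index([("app", 1), ("ts", DESCENDING)])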
Where shall it be stored and how shall it be retrieved?
I guess it depends on how much data you are dealing with. If you have a huge amount (terabytes and petabytes per day) of logs, then Apache Kafka, which is designed to allow data to be PULLED by HDFS in parallel, is an interesting solution - still in the incubation stage. I believe that if you want to consume Kafka messages with MongoDB, you'd need to develop your own adapter to ingest them as a consumer of a particular Kafka topic. Although MongoDB data (e.g. shards and replicas) is distributed, ingesting each message may be a sequential process. So, there may be a bottleneck or even race conditions depending on the rate and size of message traffic. Kafka is optimized to pump and append that data to HDFS nodes using message brokers FAST. Then once it is in HDFS you can map/reduce to analyze your information in a variety of ways.
If MongoDB can handle the ingestion load, then it is an excellent, scalable, real-time solution to find information, particularly documents. Otherwise, if you have more time to process data (i.e. batch processes that take hours and sometimes days), then Hadoop or some other MapReduce database is warranted. Finally, Kafka can distribute that load of messages and hook up that fire-hose to a variety of consumers. Overall, these new technologies spread the load and huge amounts of data across cheap hardware, using software to manage failure and recover with a very low probability of losing data.
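For the adapter mentioned above that you would have to write yourself, a hedged sketch with kafka-python and PyMongo might look like this; the topic, broker and collection names are invented, and error handling and offset management are omitted.

    # Hedged sketch of a hand-rolled adapter: consume a Kafka topic and push the
    # messages into MongoDB in batches. Uses kafka-python and PyMongo.
    import json
    from kafka import KafkaConsumer
    from pymongo import MongoClient

    consumer = KafkaConsumer(
        "app-logs",                                   # hypothetical topic
        bootstrap_servers=["localhost:9092"],
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    collection = MongoClient("mongodb://localhost:27017").logdb.events

    batch = []
    for message in consumer:                          # blocks, yielding messages as they arrive
        batch.append(message.value)
        if len(batch) >= 500:                         # insert in batches to keep up with the firehose
            collection.insert_many(batch)
            batch.clear()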
Even with a small amount of data, MongoDB is a nice alternative to traditional relational database solutions, which require more developer resources to design, build and maintain.
General Approach
You have a lot of work ahead of you. Whichever database you use, you have many features which you must build on top of the DB foundation. You have done good research about all of your options. It sounds like you suspect that all have pros and cons but all are imperfect. Your suspicion is correct. At this point it is probably time to start writing code.
You could just choose one arbitrarily and start building your application. If your guess was correct that the pros and cons balance out and it's all about the same, then why not simply start building immediately? When you hit difficulty X on your database, remember that it gave you convenience Y and Z and that's just life.
You could also establish the fundamental core of your application and implement various prototypes on each of the databases. That might give you true insight to help discriminate between the databases for your specific application. For example, besides the interface, indexing, and querying questions, what about deployment? What about backups? What about maintenance and security? Maybe "wasting" time to build the same prototype on each platform will make the answer very clear for you.
Notes about CouchDB
I suppose CouchDB is "NoSQL" if you say so. Other things which are "no SQL" include bananas, poems, and cricket. It is not a very meaningful word. We have general-purpose languages and domain-specific languages; similarly CouchDB is a domain-specific database. It can save you time if you need the following features:
Built-in web API: clients may query directly
Incremental map-reduce: CouchDB runs the job once, but you can query repeatedly at no cost. Updates to the data set are immediately reflected in the map/reduce result without full re-processing
Easy to start small but expand to large clusters without changing application code.
Have you considered Apache Kafka?
Kafka is a distributed messaging system developed at LinkedIn for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption.