Is there a powerful database system for time series data? [closed] - database

In multiple projects we have to store, aggregate, and evaluate simple measurement values. One row typically consists of a timestamp, a value, and some attributes of the value. In some applications we would like to store 1,000 values per second or more. These values must not only be inserted but also deleted at the same rate, since the lifetime of a value is restricted to about a year (the data passes through different aggregation steps; we do not store 1,000 values/s for the whole year).
Until now, we have developed different solutions: one based on Firebird, one on Oracle, and one on a self-made storage mechanism. But none of them is a very satisfying solution.
Neither RDBMS solution can handle the desired data rate. Besides that, the applications that deliver the values (e.g. device drivers) cannot be easily attached to databases, and the insert statements are cumbersome. Finally, while having an SQL interface to the data is strongly desired, typical evaluations are hard to formulate in SQL and slow to execute, e.g. find the maximum value, with its timestamp, per 15-minute interval for all measurements during the last month.
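For concreteness, that kind of evaluation is trivial to express with a dataframe library; a minimal Python/pandas sketch, assuming the data for one channel is already in memory as a Series with a datetime index (the names are placeholders of mine):

```python
import pandas as pd

# "values" is assumed to be a pandas Series of measurements for one channel,
# indexed by timestamp.
last_month = values[values.index >= values.index.max() - pd.Timedelta(days=30)]

# Maximum value per 15-minute bucket ...
bucket_max = last_month.resample("15min").max()

# ... and the timestamp at which each bucket's maximum occurred.
bucket_max_ts = last_month.resample("15min").apply(
    lambda s: s.idxmax() if not s.empty else pd.NaT
)
```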
The self-made solution can handle the insertion rate and has a client-friendly API for it, but it has nothing like a query language and cannot be used by other applications via a standard interface, e.g. for reporting.
The best solution in my dreams would be a database system that:
has an API for very fast insertion
is able to remove/truncate the values at the same speed
provides a standard SQL interface with specific support for typical time series data
Do you know of a database that comes close to those requirements, or would you approach the problem in a different way?

Most other answers seem to mention SQL-based databases. NoSQL databases are far better suited to this kind of workload.
Some Open source time-series databases:
https://prometheus.io - Monitoring system and time series database
http://influxdb.com/ - time series database with no external dependencies (only basic server is open-source)
http://square.github.io/cube/ - Written on top of MongoDB
http://opentsdb.net/ - Written on top of Apache HBase
https://github.com/kairosdb/kairosdb - A rewrite of OpenTSDB that also enables using Cassandra instead of Hadoop
http://www.gocircuit.org/vena.html - A tutorial on writing a substitute for OpenTSDB using Go Circuit
https://github.com/rackerlabs/blueflood - Based on Cassandra
https://github.com/druid-io/druid - Column-oriented, Hadoop-based distributed data store
Cloud-based:
https://www.tempoiq.com

InfluxDB :: An open-source distributed time series database with no external dependencies.
http://influxdb.org/
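For illustration, a minimal write with the `influxdb` Python client; the host, database and measurement names below are assumptions of mine, not part of the answer:

```python
from influxdb import InfluxDBClient

# Assumed local InfluxDB instance and database name.
client = InfluxDBClient(host="localhost", port=8086, database="measurements")
client.create_database("measurements")

points = [{
    "measurement": "sensor_value",            # illustrative series name
    "tags": {"device": "dev-001"},            # attributes of the value
    "time": "2015-10-01T12:00:00Z",
    "fields": {"value": 23.5},
}]
client.write_points(points)
```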

Consider IBM Informix Dynamic Server with the TimeSeries DataBlade.
That is, however, an extreme data rate that you are working with. (Not quite up to sub-atomic physics at CERN, but headed in that general direction.)
Fair disclosure: I work for IBM on the Informix DBMS, though not on the TimeSeries DataBlade per se.

SQL Server StreamInsight
Microsoft StreamInsight BOL

You can try HDF5 for time series data. It is extremely fast for such applications.
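As a rough sketch of what that can look like with `h5py` (the file name, dataset layout and dtypes are my own assumptions):

```python
import numpy as np
import h5py

with h5py.File("measurements.h5", "a") as f:
    # Two growable, chunked 1-D datasets: one for timestamps, one for values.
    if "timestamps" not in f:
        f.create_dataset("timestamps", shape=(0,), maxshape=(None,),
                         dtype="int64", chunks=True)
        f.create_dataset("values", shape=(0,), maxshape=(None,),
                         dtype="float64", chunks=True)

    # Append a batch of new points (e.g. one second's worth of data).
    ts = np.arange(1000, dtype="int64")       # placeholder epoch timestamps
    vals = np.random.random(1000)

    n = f["timestamps"].shape[0]
    f["timestamps"].resize((n + len(ts),))
    f["values"].resize((n + len(vals),))
    f["timestamps"][n:] = ts
    f["values"][n:] = vals
```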

As Jonathan Leffler said, you should try the Informix TimeSeries feature. It is included in all editions of Informix at no additional charge. You can take a look at the TimeSeries functions it supports:
IBM Informix Time series SQL routines
You can access the data through SQL functions or virtual view interfaces; you can even insert into the view.

Related

Database choice for saving and querying stock prices [closed]

I'm currently receiving 2000 prices per second from a stock exchange and need to save them in an appropriate database. My current choice is PostgreSQL, which is way too slow. I need to save those prices (ticks) in an aggregated form such as OHLC. So if I want to save D1 data, for instance, I first need to get the previous D1 record for the stock from the database, check whether the high or low price has changed, set a new close price, and then save it to the database again. This takes forever and is not feasible with Postgres. I would rather not store the OHLC data at all; I prefer to query (aggregate) it in real time.
So my requirements are:
persistence
fast writes (currently 2k per second, up to 10k)
queries, e.g. aggregating OHLC data in real-time (50-100 per second)
adaptable to any modern programming language without writing raw queries (an SDK for Python or JS for that database)
deployable on AWS or GCP without hassle
I was thinking about Apache Cassandra. I'm not familiar with Cassandra; are queries like the OHLC one possible with it? Are there any alternatives to Cassandra?
Thanks in advance!
Given what I've understood from your question, I believe Cassandra should easily fit your use-case.
Regarding your requirements:
persistence : Cassandra will not only persist your data but also cover redundancy with minimal configuration;
fast writes : this is what Cassandra is most optimized for, and while the exact throughput depends on a lot of factors, in general Cassandra will manage writes measured in the thousands/sec/core. Also, the eventual number of writes is not really relevant, as Cassandra can scale linearly with no real penalty, so 5k, 10k, 100k or more are all doable;
adaptability : Cassandra has official drivers for the most common languages (Python, C family, NodeJS, Java, Ruby, PHP, Scala) as well as community-developed ones for more languages (list of drivers);
deployable : It's very easy to deploy in the cloud. You can choose to deploy it manually on independent instances or use a managed Cassandra cluster (AWS has one, called 'AWS Keyspaces'; DataStax, the company driving most of the development behind Cassandra, has one called 'Astra'; and there are even more possible solutions). Given that Cassandra is one of the major players when it comes to big-data storage, finding a place for your DB in the cloud should be easy.
I have only mentioned 4 of the 5 requirements. That is because when talking about reading, things get more complex and a larger discussion is needed.
50-100 reads/s against 2k+ writes per second is in line with the general idea of Cassandra being optimized for write-intensive workloads. In Cassandra, the way you model your tables will dictate how well things work. For a task like the one you have described, my first thoughts are:
You bucket each stock per day => you get a partition with around 30k rows (1 update/s for 8 trading hours) and a size under 0.2 MB (30k * 4 B). This is well within the recommended values and clearly under the worst-case-scenario ones;
when you need the aggregated data you have 2 options:
2a. You read the partition as is and aggregate it application-side (what I would recommend; see the sketch after this list);
2b. You implement a "User-Defined Aggregate" function on your database that does the work (docs). This should be doable, although I won't guarantee it. Apart from being harder to implement, the problem is that putting this kind of extra workload on the DB might not be what you want given your apparent use case. Let me explain: I'd expect your read load to be most active during certain times (before, during and after trading hours), with times when the load is lighter. Depending on your architecture, you could have multiple application instances up during peak times and then scale them back during off-peak hours in order to lower costs. While applications can easily be scaled up and down on cloud providers like AWS and GCP, Cassandra cannot be scaled up and down like this (5 nodes in the morning, 3 at night, and so on) (well, it could, but it's not designed for it and it would be a terrible decision). So moving as much of the non-constant workload as possible to the application seems the best idea;
(Optional) Have a worker that at the end of the day/trading day aggregates the values for each stock and saves them to another table, so that looking at historic data is easier. This data could even be bucketed by week, month or even year, depending on how much space the aggregated data takes.
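A minimal sketch of that modeling idea with the DataStax Python driver (keyspace, table and column names are my own placeholders, not a prescribed schema):

```python
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("market")   # assumed existing keyspace

# One partition per (symbol, day); ticks are clustered by time within it.
session.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        symbol text,
        day date,
        ts timestamp,
        price double,
        PRIMARY KEY ((symbol, day), ts)
    )
""")

# Option 2a: read one day's partition and aggregate OHLC application-side.
rows = list(session.execute(
    "SELECT ts, price FROM ticks WHERE symbol = %s AND day = %s",
    ("AAPL", date(2021, 6, 1)),
))
if rows:                                      # rows come back ordered by ts
    prices = [r.price for r in rows]
    ohlc = {"open": prices[0], "high": max(prices),
            "low": min(prices), "close": prices[-1]}
```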
You could also add Spark and Kafka in front of Cassandra for a more powerful approach to real-time aggregation, but we shouldn't deviate that much from the question at hand.
Cassandra is very powerful with the right modeling and the right architecture. At first glance, what you need seems to be a good fit for Cassandra; however, as powerful as it can be, it can be just as bad if you use it in ways it wasn't designed for. I hope this answer puts you on a path toward making the right decision.
Cheers.

ARX Anonymization Tool - supported databases [closed]

We are looking into options for open source data masking tools. ARX seems to provide some great functionality, but only lists SQL Server and DB2 (along with flat files and Excel) in its list. Does anyone know what other data sources are supported? Oracle, for example? How about old-school things like VSAM?
https://arx.deidentifier.org/anonymization-tool/
Anyone have any other great options for data masking? Hopefully something UI-configured, as it's typically not programmers managing the data.
Many great tools exist to help you anonymize data, and it’s a growing field, given the increasing need for data privacy and the demands of recent regulations. Here are just a few of the leading products for data anonymization; quotations are from product websites.
Open Source
ARX Data Anonymization Tool - https://arx.deidentifier.org/
“ARX features a cross-platform graphical tool, which supports data import and cleansing, wizards for creating transformation rules, intuitive ways for tailoring the anonymized dataset to your requirements and visualizations of data utility and re-identification risks.”
Masquerade - https://github.com/TonicAI/masquerade
“Masquerade can anonymize data in real-time enabling anonymous analytics, application development, and QA testing with next to no overhead. It does this by operating a TCP proxy between your Postgres client and Postgres database and modifying the result-sets generated by SELECT statements according to a set of user-defined rules.”
Amnesia - https://amnesia.openaire.eu/
“Amnesia is a data anonymization tool, that allows to remove identifying information from data. Amnesia not only removes direct identifiers like names, SSNs etc but also transforms secondary identifiers like birth date and zip code so that individuals cannot be identified in the data. Amnesia supports k-anonymity and km-anonymity.”
SaaS / Enterprise
Tonic (Synthetic Data Generator) - https://www.tonic.ai/
“Tonic uses pre-trained models and feature extraction to generate synthetic data that is based on your data. It preserves all the characteristics that make your data unique—constraints, statistical correlations, distributions, interdependencies, etc. Mask, anonymize, obscure, or generate brand new data, all at the click of a mouse.”
Informatica (Dynamic or Persistent Data Masking Products) - https://www.informatica.com/in/products/data-security/data-masking.html#fbid=3YKt13oZ5As
“De-identify, de-sensitize, and anonymize sensitive data from unauthorized access for application users, business intelligence, application testing, and outsourcing.”
Oracle (Data Masking and Subsetting Pack) - https://www.oracle.com/database/technologies/security/data-masking-subsetting.html
“Oracle Data Masking and Subsetting helps database customers improve security, accelerate compliance, and reduce IT costs by sanitizing copies of production data for testing, development, and other activities and by easily discarding unnecessary data.”
This list could be much longer; the above is just a sampling. Other vendors that offer data masking products include Delphix, IBM, Microsoft (SQL Server), and Aircloak.
Full disclosure: I'm a founder of Tonic.
ARX developer here. When using the Java library, you can connect to any database with a JDBC driver. We also support connections to Oracle via the GUI. AFAICT some users have reported problems when connecting to Oracle databases, though, so you will have to check.

Usecases: InfluxDB vs. Prometheus [closed]

According to the Prometheus webpage, one main difference between Prometheus and InfluxDB is the use case: while Prometheus stores only time series, InfluxDB is better geared towards storing individual events. Since there has been some major work done on the storage engine of InfluxDB, I wonder if this is still true.
I want to set up a time series database, and apart from the push/pull model (and probably a difference in performance) I can see no big thing that separates the two projects. Can someone explain the difference in use cases?
InfluxDB CEO and developer here. The next version of InfluxDB (0.9.5) will have our new storage engine. With that engine we'll be able to efficiently store either single event data or regularly sampled series. i.e. Irregular and regular time series.
InfluxDB supports int64, float64, bool, and string data types using different compression schemes for each one. Prometheus only supports float64.
For compression, the 0.9.5 version will be competitive with Prometheus. In some cases we'll see better results, since we vary the compression on timestamps based on what we see. The best-case scenario is a regular series sampled at exact intervals. In those cases, by default, we can compress the timestamps of 1k points into an 8-byte starting time, a delta (zig-zag encoded) and a count (also zig-zag encoded).
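As a toy illustration of the delta + zig-zag idea (the general technique only, not InfluxDB's actual code); a perfectly regular series collapses to a single repeated delta:

```python
def zigzag_encode(n: int) -> int:
    # Interleave negative and positive deltas so small magnitudes stay small.
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

timestamps = [1000, 1010, 1020, 1031, 1040]
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]   # [10, 10, 11, 9]
encoded = [zigzag_encode(d) for d in deltas]                   # small unsigned ints
assert [zigzag_decode(z) for z in encoded] == deltas
```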
Depending on the shape of the data we've seen < 2.5 bytes per point on average after compactions.
YMMV based on your timestamps, the data type, and the shape of the data. Random floats with nanosecond scale timestamps with large variable deltas would be the worst, for instance.
The variable precision in timestamps is another feature that InfluxDB has. It can represent second, millisecond, microsecond, or nanosecond scale times. Prometheus is fixed at milliseconds.
Another difference is that writes to InfluxDB are durable after a success response is sent to the client. Prometheus buffers writes in memory and by default flushes them every 5 minutes, which opens a window of potential data loss.
Our hope is that once 0.9.5 of InfluxDB is released, it will be a good choice for Prometheus users to use as long term metrics storage (in conjunction with Prometheus). I'm pretty sure that support is already in Prometheus, but until the 0.9.5 release drops it might be a bit rocky. Obviously we'll have to work together and do a bunch of testing, but that's what I'm hoping for.
For single server metrics ingest, I would expect Prometheus to have better performance (although we've done no testing here and have no numbers) because of their more constrained data model and because they don't append writes to disk before writing out the index.
The query languages of the two are very different. I'm not sure what they support that we don't yet, or vice versa, so you'd need to dig into the docs on both to see if there's something one can do that you need. Longer term, our goal is for InfluxDB's query functionality to be a superset of Graphite, RRD, Prometheus and other time series solutions. I say superset because we want to cover those in addition to more analytic functions later on. It'll obviously take us time to get there.
Finally, a longer term goal for InfluxDB is to support high availability and horizontal scalability through clustering. The current clustering implementation isn't feature complete yet and is only in alpha. However, we're working on it and it's a core design goal for the project. Our clustering design is that data is eventually consistent.
To my knowledge, Prometheus' approach is to use double writes for HA (so there's no eventual consistency guarantee) and to use federation for horizontal scalability. I'm not sure how querying across federated servers would work.
Within an InfluxDB cluster, you can query across the server boundaries without copying all the data over the network. That's because each query is decomposed into a sort of MapReduce job that gets run on the fly.
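As a toy illustration of that decomposition (not InfluxDB's actual engine): a max query can run as per-shard partial aggregates plus a final combine, so raw points never cross the network:

```python
# Each node holds its own shard of points; only partial results travel.
shards = {
    "node-a": [3.2, 7.9, 5.1],
    "node-b": [6.4, 8.8],
    "node-c": [2.0, 4.5, 9.1],
}

partials = {node: max(points) for node, points in shards.items()}  # "map" on each node
overall_max = max(partials.values())                               # "reduce" on the coordinator
```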
There's probably more, but that's what I can think of at the moment.
We've got the marketing message from the two companies in the other answers. Now let's ignore it and get back to the sad real world of time-series data.
Some History
InfluxDB and Prometheus were made to replace old tools from a past era (RRDtool, Graphite).
InfluxDB is a time series database. Prometheus is a sort of metrics collection and alerting tool, with a storage engine written just for that. (I'm actually not sure you could [or should] reuse the storage engine for anything else.)
Limitations
Sadly, writing a database is a very complex undertaking. The only way both these tools manage to ship something is by dropping all the hard features relating to high-availability and clustering.
To put it bluntly, each is a single application running on a single node.
Prometheus has no goal of supporting clustering and replication whatsoever. The official way to support failover is to "run 2 nodes and send data to both of them". Ouch. (Note that this is seriously the ONLY existing way; it's written countless times in the official documentation.)
InfluxDB has been talking about clustering for years... until it was officially abandoned in March. Clustering ain't on the table anymore for InfluxDB. Just forget it. When it is done (supposing it ever is), it will only be available in the Enterprise Edition.
https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/
Within the next few years, we will hopefully have a well-engineered time-series database that is handling all the hard problems relating to databases: replication, failover, data safety, scalability, backup...
At the moment, there is no silver bullet.
What to do
Evaluate the volume of data to be expected.
100 metrics * 100 sources * 1 datapoint per second => 10,000 datapoints per second => 864 mega-datapoints per day.
The nice thing about time series databases is that they use a compact format, they compress well, they aggregate datapoints, and they clean up old data. (Plus they come with features relevant to time series data.)
Supposing that a datapoint is treated as 4 bytes, that's only a few Gigabytes per day. Lucky for us, there are systems with 10 cores and 10 TB drives readily available. That could probably run on a single node.
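Spelling that estimate out (the 4 bytes per point figure is the assumption stated above, meant as an average cost after compression):

```python
metrics = 100
sources = 100
points_per_second = metrics * sources              # 10,000
points_per_day = points_per_second * 86_400        # 864,000,000 (~864 mega-datapoints)

bytes_per_point = 4                                # assumed average cost after compression
gb_per_day = points_per_day * bytes_per_point / 1e9
print(gb_per_day)                                  # ~3.5 GB/day -> fits a single node
```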
The alternative is to use a classic NoSQL database (Cassandra, Elasticsearch or Riak) and then engineer the missing bits in the application. These databases may not be optimized for that kind of storage (or are they? modern databases are so complex and optimized that you can't know for sure unless you benchmark).
You should evaluate the capacity required by your application. Write a proof of concept with these various databases and measure things.
See if it falls within the limitations of InfluxDB. If so, it's probably the best bet. If not, you'll have to make your own solution on top of something else.
InfluxDB simply cannot hold the production load (metrics) from 1000 servers. It has some real problems with data ingestion and ends up stalled/hung and unusable. We tried to use it for a while, but once the data volume reached a critical level it could not be used anymore. No memory or CPU upgrades helped.
Therefore our experience is: definitely avoid it; it's not a mature product and it has serious architectural design problems. And I am not even talking about the sudden shift to a commercial model by Influx.
Next we researched Prometheus, and while it required rewriting our queries, it now ingests 4 times more metrics without any problems whatsoever compared to what we tried to feed Influx. And all that load is handled by a single Prometheus server; it's fast, reliable, and dependable. This is our experience running a huge international online shop under pretty heavy load.
IIRC current Prometheus implementation is designed around all the data fitting on a single server. If you have gigantic quantities of data, it may not all fit in Prometheus.

What are the approaches to the Big-Data problems? [closed]

Let us consider the following problem. We have a system containing a huge amount of data (Big Data), so in fact we have a database. As the first requirement, we want to be able to write to and read from the database quickly. We also want to have a web interface to the database (so that different clients can write to and read from the database remotely).
But the system that we want should be more than a database. First, we want to be able to run different data-analysis algorithms on the data to find regularities, correlations, abnormalities and so on (as before, we care a lot about performance). Second, we want to attach machine learning machinery to the database. This means we want to run machine learning algorithms on the data to learn "relations" present in the data and, based on that, predict the values of entries that are not yet in the database.
Finally, we want to have a nice click-based interface that visualizes the data, so that users can see the data in the form of nice graphics, graphs and other interactive visualisation objects.
What are the standard and widely recognised approaches to the problem described above? What programming languages should be used to deal with it?
I will approach your question like this: I assume you are firmly interested in big data databases already and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements, mainly Cassandra and Hadoop.
1) The first requirement: we want to be able to write to and read from the database quickly.
You'll want to explore NoSQL databases, which are often used for storing "unstructured" Big Data. Some open-source databases include Hadoop and Cassandra. Regarding Cassandra,
Facebook needed something fast and cheap to handle billions of status updates, so it started this project and eventually moved it to Apache, where it has found plenty of support in many communities (ref).
References:
Big Data and NoSQL: Five Key Insights
NoSQL standouts: New databases for new applications
Big data woes: Which database should I use?
Cassandra and Spark: A match made in big data heaven
List of NoSQL databases (currently 150)
2) We also want to have a web interface to the database
See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.
References:
150 NoSQL databases
Cassandra Web
Cassandra Cluster Admin
3) We want to be able to run different data-analysis algorithms on the data.
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
References:
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
Cassandra at eBay - Cassandra Summit
An Introduction to Real-Time Analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations"
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February, 2014. It has since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
References:
Getting Started with Apache Spark and Cassandra
What is Apache Mahout?
Data Science with Apache Hadoop: Predicting Airline Delays
5) Finally, we want to have a nice click-based interface that visualizes the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
References:
Data Science Central - Resources
Big Data Visualization
Start looking at:
what kind of data do you want to store in the database?
what kind of relationships exist between your data?
how will this data be accessed? (for instance, do you need to access a certain set of data quite often?)
is it documents? text? something else?
Once you have answers to all those questions, you can start looking at which NoSQL database would give you the best results for your needs.
You can choose between 4 different types: key-value stores, document stores, column-family stores, and graph databases.
Which one is the best fit can be determined by answering the questions above.
There is a ready-to-use stack that may really help you start your project:
Elasticsearch would be your database (it has a REST API that you can use to write data to it and to run queries and analyses; see the sketch after this list)
Kibana is a visualization tool; it will allow you to explore and visualize your data. It is quite powerful and will be more than enough for most of your needs.
Logstash can centralize the data processing and help you process and save it in Elasticsearch; it already supports quite a few sources of logs and events, and you can also write your own plugins.
Some people refer to this combination as the ELK stack.
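A minimal sketch of talking to Elasticsearch over its REST API with plain `requests` (a recent Elasticsearch version is assumed; index name and fields are placeholders of mine):

```python
import requests

ES = "http://localhost:9200"

# Index one document (Elasticsearch assigns the id).
doc = {"timestamp": "2015-06-01T12:00:00Z", "sensor": "s1", "value": 42.0}
requests.post(f"{ES}/measurements/_doc", json=doc).raise_for_status()

# Aggregate: average value per sensor.
query = {
    "size": 0,
    "aggs": {
        "by_sensor": {
            "terms": {"field": "sensor.keyword"},
            "aggs": {"avg_value": {"avg": {"field": "value"}}},
        }
    },
}
print(requests.post(f"{ES}/measurements/_search", json=query).json())
```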
I don't believe you should worry about the programming language at this point. Try to select the tools first; sometimes the choices are limited by the tools you want to use. You can still use a mixture of languages, and make that effort only if/when it makes sense.
A common way to meet such requirements is to use Amazon Redshift and the ecosystem around it.
Redshift is a peta-scale data warehouse (it can also start at giga-scale) that exposes an ANSI SQL interface. Since you can put as much data as you like into the DWH and run any type of SQL you wish against it, this is good infrastructure on which to build almost any agile big data analytics system.
Redshift has many analytics functions, mainly via window functions. You can calculate averages and medians, but also percentiles, dense rank, etc.
You can connect almost every SQL client you want using JDBC/ODBC drivers: from R, RStudio, and psql, but also from MS Excel.
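For instance, a hedged sketch from Python: Redshift speaks the PostgreSQL wire protocol, so `psycopg2` works; the endpoint, credentials and table below are placeholders of mine:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="analyst", password="...",
)

with conn.cursor() as cur:
    # Window functions: a moving average and a dense rank per device.
    cur.execute("""
        SELECT device_id, ts, value,
               AVG(value) OVER (PARTITION BY device_id ORDER BY ts
                                ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS moving_avg,
               DENSE_RANK() OVER (PARTITION BY device_id ORDER BY value DESC) AS value_rank
        FROM measurements
        LIMIT 100
    """)
    for row in cur.fetchall():
        print(row)
```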
AWS recently added a new service for machine learning. Amazon ML integrates nicely with Redshift. You can build predictive models based on data from Redshift by simply providing a SQL query that pulls the data needed to train the model, and Amazon ML will build a model that you can use both for batch prediction and for real-time predictions. You can check this post from the AWS big data blog that shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R
Regarding visualization, there are plenty of great visualization tools that you can connect to Redshift. The most common ones are Tableau, QlikView, Looker, and YellowFin, especially if you don't have an existing DWH where you might want to keep using tools like JasperSoft or Oracle BI. Here is a link to a list of partners that provide free trials of their visualization tools on top of Redshift: http://aws.amazon.com/redshift/partners/
BTW, Redshift also provides a 2-month free trial, so you can quickly test it and see if it fits your needs: http://aws.amazon.com/redshift/free-trial/
Big Data is a tough problem primarily because it isn't one single problem. First, if your original database is a normal OLTP database handling business transactions throughout the day, you will not want to also do your big data analysis on that system, since the analysis will interfere with normal business traffic.
Problem #1 is what type of database you want to use for data analysis. You have many choices, ranging from an RDBMS to Hadoop, MongoDB, and Spark. If you go with an RDBMS, you will want to change the schema to be more suitable for data analysis: create a data warehouse with a star schema. Doing so will make many tools available to you, because this method of data analysis has been around for a very long time. The other "big data" and data-analysis databases do not have the same level of tooling available, but they are quickly catching up. Each one will require research into which you want to use based on your problem set. If you have big batches of data, an RDBMS and Hadoop will be good. If you have streaming data, look at MongoDB and Spark. If you are a Java shop, consider an RDBMS, Hadoop or Spark. If you are a JavaScript shop, MongoDB. If you are good with Scala, Spark.
Problem #2 is getting your data from your transactional database into your big data storage. You will need to find a programming language that has libraries to talk to both databases, and you will have to decide when and where to move this data. You can use Python, Java or Ruby to do this work.
Problem #3 is your UI. If you decide to go with an RDBMS, you can use many of the available tools or build your own. The other data storage solutions have tool support, but it isn't as mature as what is available for an RDBMS. You are most likely going to build your own here anyway, because your analysts will want tools built to their specifications. Java works with all of these storage mechanisms, but you can probably get Python to work too. You may want to provide a service layer built in Java that exposes a RESTful interface and then put a web layer in front of that service layer. If you do this, your web layer can be built in any language you prefer.
These three languages are the most commonly used for machine learning and data mining on the server side: R, Python, and SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is very popular.

Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single file as SQL, but that could well get unwieldy once it is of any size. Another option is to dump the database and use the filesystem, but again it gets unwieldy once of any size.
There's Irmin: https://github.com/mirage/irmin
Currently it's only offered as an OCaml API, but there are future plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still-scarce documentation, it allows you to plug in any backend (in-memory, Unix filesystem, Git in-memory and Git on-disk). It therefore runs even on unikernels and in browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. With the complex API, you can operate at any Git level:
Append-only Blob storage.
Transactional/compound Tree layer.
Commit layer featuring chain of changes and metadata.
Branch/Ref/Tag layer (local only, but remotes are also offered) for mutability.
In the documentation, the immutable store usually refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin allows deduplication and thus reduced memory consumption. Some functional persistent data structures fit perfectly into this database, and the 3-way merge is a novel approach to handling merge conflicts in a CRDT style.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tools to be suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based with Git-properties).
Though this doesn't answer the Postgres part of the question, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data from an arbitrary point in time. I'm guessing it should be able to work with a "branching" model.
This database has recently been developed for blockchain purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how Git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
Some more information can be found in the chapter on replication in the O'Reilly CouchDB book.
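A minimal sketch of triggering that kind of pull replication through CouchDB's HTTP API (server URLs, credentials and database names are placeholders of mine):

```python
import requests

COUCH = "http://admin:secret@localhost:5984"   # placeholder local server

# One-shot pull replication: copy a colleague's database into a local one.
resp = requests.post(
    f"{COUCH}/_replicate",
    json={
        "source": "http://colleague.example.com:5984/shared_data",  # placeholder remote
        "target": "shared_data",
        "create_target": True,
    },
)
print(resp.json())
```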
