Is there a database with git-like qualities? [closed]

I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single SQL file, but that could well get unwieldy once it reaches any real size. Another option is to dump the database as many files on the filesystem, but again that gets unwieldy as it grows.
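For illustration, the dump-and-commit approach might look like this (a rough sketch assuming Postgres, with pg_dump and git on the PATH; all names are made up):

```python
import subprocess

def commit_snapshot(db_name, repo_dir, message):
    # Dump the whole database to a single SQL file inside the git repo.
    dump_path = f"{repo_dir}/{db_name}.sql"
    with open(dump_path, "w") as f:
        subprocess.run(["pg_dump", "--no-owner", db_name],
                       stdout=f, check=True)
    # Commit the snapshot; git's delta compression keeps history small,
    # but a text dump of a big database still gets unwieldy.
    subprocess.run(["git", "-C", repo_dir, "add", f"{db_name}.sql"],
                   check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message],
                   check=True)
```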

There's Irmin: https://github.com/mirage/irmin
Currently it's only offered as an OCaml API, but there are plans for a GraphQL API and a Cap'n Proto one.
Despite the complex API and the still-scarce documentation, it lets you plug in any backend (in-memory, Unix filesystem, Git in-memory, and Git on-disk), so it runs even on unikernels and in browsers.
It also offers a bidirectional model where changes in the local Git repository are reflected in the application state and vice versa. Through that API, you can operate at any Git level:
Append-only Blob storage.
Transactional/compound Tree layer.
Commit layer featuring chain of changes and metadata.
Branch/Ref/Tag layer (local-only, but remotes are offered as well) for mutability.
In the documentation, the immutable store usually refers to the blobs + trees + commits.
Thanks to the content-addressable storage inherited from Git, Irmin deduplicates data and thus reduces memory consumption. Some functional persistent data structures fit perfectly in this database, and its 3-way merge is a novel, CRDT-style approach to handling merge conflicts.
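To make the layering concrete, here is a toy content-addressed model of those layers in Python (illustrative only - this is not Irmin's actual OCaml API):

```python
import hashlib
import json

store = {}  # the append-only, immutable object store (blobs/trees/commits)
refs = {}   # the branch layer: the only mutable part

def _put(payload: bytes) -> str:
    # Content addressing: identical payloads hash to the same key,
    # which is where the deduplication comes from.
    h = hashlib.sha1(payload).hexdigest()
    store[h] = payload
    return h

def put_blob(data: bytes) -> str:
    return _put(data)                      # append-only blob storage

def put_tree(entries: dict) -> str:
    # A tree maps names to blob/tree hashes (the compound layer).
    return _put(json.dumps(entries, sort_keys=True).encode())

def put_commit(tree: str, parent, message: str) -> str:
    # A commit chains a tree to its parent, forming the history.
    return _put(json.dumps({"tree": tree, "parent": parent,
                            "msg": message}).encode())

root = put_tree({"readme": put_blob(b"hello")})
refs["main"] = put_commit(root, None, "initial commit")
```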

Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tool that was suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based, with Git properties).
Though this doesn't answer the question in that it doesn't work with Postgres, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data from an arbitrary point in time. I'm guessing it should be able to work with a "branching" model.
The database has recently been developed for blockchain purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.

It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
There's more information in the chapter on replication in the O'Reilly CouchDB book.
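For example, kicking off replication is a single HTTP call to the /_replicate endpoint (a sketch using Python's requests; URLs and credentials are placeholders):

```python
import requests

# Continuously replicate a local database to another user's server.
resp = requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": "mydb",
        "target": "http://remote.example.com:5984/mydb",
        "continuous": True,
    },
    auth=("admin", "password"),
)
resp.raise_for_status()
print(resp.json())
```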

Related

Is there any high performance POSIX-like filesystem without a single point of failure? [closed]

We have a web service that needs a somewhat POSIX-compatible shared filesystem for the application servers (multiple redundant systems running in parallel behind redundant load balancers). We're currently running GlusterFS as the shared filesystem for the application servers, but I'm not happy with its performance. Compared to the raw performance of the storage servers running GlusterFS, it starts to look more sensible to run DRBD and a single NFS server, with the other GlusterFS servers (currently 3) waiting in a hot-standby role.
Our workload is highly read-oriented and usually deals with small files. I'd be happy to use an "eventually consistent" system, as long as a client can request a sync for a single file when needed (that is, the client is prepared to wait until the file has been successfully stored in the backend storage). I'd even accept a system where such a "sync" requires querying the state of the file via some way other than POSIX fdatasync(). File metadata such as modification times is not important; only the filename and the contents matter.
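For clarity, this is the kind of per-file durability I mean (a minimal sketch using Python's os module on a Unix system):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    # The client is prepared to block here until the file contents
    # have reached stable storage in the backend.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)  # flush file data (metadata doesn't matter to us)
    finally:
        os.close(fd)
```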
I'm currently aware of possible candidates and the problems each one currently has:
GlusterFS: overall performance is pretty poor in practice, and performance goes down when adding new servers/bricks.
Ceph: highly complex to configure/administrate, POSIX compatibility sacrifices performance a lot as far as I know.
MooseFS: partially obfuscated open source (huge dumps of internally written code published infrequently, with the patch history intentionally discarded); the documentation leaves a lot to be desired.
SeaweedFS: pretty simple design and supposedly high performance, but the future of the project is unclear, because pretty much all the code is written and maintained by Chris Lu - what happens if he stops writing code? It's also unclear whether the "Filer" component can avoid a single point of failure.
I know that the CAP theorem prevents ever having a truly consistent and always-available system. Is there any good distributed filesystem where writes are durable, read performance is really good, and the system has no single point of failure?
I am Chris Lu, working on SeaweedFS. There are plans to commercialize it (by adding more advanced features).
The filer does not have a single point of failure; you can have multiple filer instances. The filer store can be any key-value store. If you need no SPOF, you can use Cassandra, Redis Cluster, CockroachDB, TiDB, or etcd. Or you can add your own key-value store option, which is pretty easy.

Data masking for data in AWS RDS [closed]

I have an AWS RDS (AuroraDB) instance and I want to mask the data in the DB. Does Amazon provide any service for data masking?
I have seen RDS encryption, but I am looking for data masking, because the database contains sensitive data. So I want to know: is there any service they provide for data masking, or is there any other tool which can be used to mask the data and load it manually into the DB?
A list of tools which can be used for data masking would be most appreciated, if there are any that fit my case. I need to mask the data for testing, as the original DB contains sensitive information like PII (Personally Identifiable Information). I also have to pass the data to my co-workers, so I consider data masking an important factor.
Thanks.
This is a fantastic question, and I think your proactive approach to securing the most valuable asset of your business is something a lot of people should heed, especially if you're sharing the data with your co-workers. Letting people see only what they need to see is an undeniably good way to reduce your attack surface. Standard cyber security methods are no longer enough, imo, as demonstrated by numerous attacks and by people losing laptops/USBs with sensitive data on them. We are just humans, after all. With the GDPR coming into force in May next year, any company with customers in the EU will have to demonstrate privacy by design, and anonymisation techniques such as masking have been cited as a way to show this.
NOTE: I have a vested interest in this answer because I am working on such a service you're talking about.
We've found that the right masking method depends on your exact use case, the size of the data set, and its contents. If your data set has minimal fields and you know where the PII is, you can run standard queries to replace sensitive values, e.g. John -> XXXX. If you want to maintain some human readability, there are libraries such as Python's Faker that generate random, locale-based PII you can replace your sensitive values with (PHP Faker, Perl Faker, and Ruby Faker also exist).
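As a minimal sketch of that approach (the column names here are made up; Faker generates the fake values):

```python
from faker import Faker

fake = Faker("en_GB")  # pick a locale that matches your data

def mask_row(row):
    # Replace the sensitive fields with plausible fake values while
    # leaving the non-sensitive columns intact for testing.
    masked = dict(row)
    masked["name"] = fake.name()
    masked["email"] = fake.email()
    masked["address"] = fake.address()
    return masked

print(mask_row({"name": "John Smith",
                "email": "john@example.com",
                "address": "1 Real Street",
                "purchase_total": 42.50}))
```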
DISCLAIMER: Straightforward masking doesn't guarantee total privacy. Think of someone identifying individuals in a masked Netflix data set by cross-referencing it with time-stamped IMDB data, or Guardian reporters identifying a judge's porn preferences from masked ISP data.
Masking does get tedious as your data set grows in fields/tables, and you may want to set up different levels of access for different co-workers, i.e. data science gets lightly anonymised data, marketing gets heavily anonymised data. PII in free-text fields is annoying, and understanding what data is available in the world that attackers could use for cross-referencing is generally a big task.
The service I'm working on aims to alleviate all of these issues by automating the process with NLP techniques and a good understanding of the maths of anonymisation. We're bundling this up into a web service, and we're keen to launch on the AWS Marketplace. So I would love to hear more about your use case, and if you want early access - we're in private beta at the moment, so let me know.
If you are exporting or importing data using CSV or JSON files (i.e. to share with your co-workers), then you could use FileMasker. It can be run as an AWS Lambda function reading/writing CSV/JSON files on S3.
It's still in development but if you would like to try a beta now then contact me.
Disclaimer: I work for DataVeil, the developer of FileMasker.

Framework for partial bidirectional database synchronization [closed]

I'm trying to optimize the backend of an information system for high availability, which involves splitting off the part needed for time-critical client requests (front office) from the rest (back office).
Front office will have redundant application servers with load balancing for maximum performance and will use a database with pre-computed data. Back office will periodically prepare data for the front office based on client statistics and some external data.
A part of the data schema will be shared between the back and front office, but not the whole databases - only parts of some tables. The data will not need to correspond at all times; it will be synchronized between the two databases periodically. Continuous synchronization is also viable, but there is no real-time consistency requirement, and batch-style synchronization seems better in terms of control, debugging, and backup. I expect no need to resolve conflicts, because the data will mostly grow, and change only on one side.
The solution should allow defining the corresponding tables and columns, and then insert/update new/changed rows. It should ideally use a data model defined in Groovy classes (probably through annotations?), as both applications run on Grails. The synchronization may use the existing Grails web applications or run externally, maybe even on the database server alone (PostgreSQL).
There are systems for replicating whole mirrored databases, but I wasn't able to find any solution suiting my needs. Do you know of any existing framework to help with that, or is making my own the only possibility?
I ended up using Londiste from SkyTools. The project page on the pgFoundry site lists quite old binaries (and is currently down), so you'd better build it from source.
It's one-direction (master-slave) only, so you have to set up two synchronization instances for bidirectional sync. Note that each instance consists of two Londiste binaries (a master and a slave worker) and a ticker daemon that pushes the changes.
To reduce synchronization traffic, you can extend the polling period (1 second by default) in the configuration file, or even turn it off completely by stopping the ticker and then triggering the sync manually by running the SQL function pgq.ticker on the master.
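Triggering a manual tick might look like this (a sketch using psycopg2; the connection string is an assumption):

```python
import psycopg2

# Produce one tick on the master so the queued changes are batched up
# for the Londiste consumer to replay on the slave.
conn = psycopg2.connect("dbname=master_db user=replicator")
with conn, conn.cursor() as cur:
    cur.execute("SELECT pgq.ticker();")
conn.close()
```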
I solved the issue of partial column replication by writing a simple custom handler (a londiste.handler.TableHandler subclass) with the column mapping configured in the database. The mapping configuration is not model-driven (yet), as I originally planned, but I only need to replicate common columns, so this solution is sufficient for now.

Which key-value database to use for BLOB storage? [closed]

We have a service that currently runs on top of a MySQL database and uses JBoss to run the application. The database's growth is accelerating, and I am looking to change the setup to improve scaling. The issue is not a large number of rows, nor (yet) a particularly high volume of queries, but rather the large number of BLOBs stored in the db. In particular, the time it takes to create or restore a backup (we use mysqldump and Percona XtraBackup) is a concern, as is the fact that we will need to scale horizontally to keep expanding disk space in the future. At the moment the db size is around 500GB.
The kind of arrangement that I figure would work well for our future needs is a hybrid database that uses both MySQL and some key-value database. The latter would only store the BLOBs. The meta data as well as data for user management and business logic of the application would remain in the MySQL db and benefit from structured tables and full consistency. The application itself would handle the issue of consistency between the databases.
The question is which database to use? There are lots of NoSQL databases to choose from. Here are some points on what qualities I am looking for:
Distributed over multiple nodes, which are flexible to add or remove.
Redundancy of storage, with the database automatically making sure each value object is stored on at least two different nodes.
Value objects' size could range from a few dozen bytes to around 100MB.
The database is accessed from a Java EJB application on top of JBoss, as well as from a program written in C++ that processes the data in the db. Some sort of connector for each would be needed.
No need for structure for the data. A single string or even just a large integer would suffice for the key, pure byte array for the value.
No updates for the value objects are needed, only inserts and deletes. If a particular object is made obsolete by a new object that fulfills the same role, the old object is deleted and a new object with a new key is inserted.
Having looked around a bit, Riak sounds good except for its problems with storing large value objects. Which database would you recommend?
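To make the requirements concrete, the whole API surface I need is roughly this (a toy in-memory sketch; a real backend would be a distributed key-value store):

```python
import uuid

class BlobStore:
    """The interface described above: opaque string keys, pure
    byte-array values, and inserts/deletes only (no updates)."""

    def __init__(self):
        self._kv = {}  # stand-in for the distributed key-value backend

    def put(self, blob: bytes) -> str:
        key = uuid.uuid4().hex  # a replacement object gets a fresh key
        self._kv[key] = blob
        return key

    def get(self, key: str) -> bytes:
        return self._kv[key]

    def delete(self, key: str) -> None:
        del self._kv[key]

store = BlobStore()
k = store.put(b"\x00" * 1024)  # anywhere from dozens of bytes to ~100MB
assert store.get(k) == b"\x00" * 1024
store.delete(k)
```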

Looking for a disk-based redis-like database [closed]

I am currently using Redis for my app, and its features are really excellent for my application (lists, sets, sorted sets etc.).
My application relies heavily on sorted sets, lists, and sets, and their related functions (push to a list, get a list, union of sets, etc.). The only problem I am facing right now is that my data is large, and most of it does not need to be in memory; I want to store it on disk.
**I need an on-disk database with Redis data structures.**
I read about Cassandra, but I am not sure if it supports sorted sets, sets, and lists. Or at least, if it does, I could not find methods to manipulate them the way Redis does.
Thanks.
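To make the requirement concrete, this is the kind of usage any replacement would need to support (a sketch using the redis-py client against a local server; drop-in protocol replacements should accept the same commands):

```python
import redis

r = redis.Redis()  # a local Redis (or protocol-compatible server)

r.rpush("recent:items", "item:1", "item:2")   # push to a list
print(r.lrange("recent:items", 0, -1))        # get the whole list

r.sadd("tags:a", "x", "y")
r.sadd("tags:b", "y", "z")
print(r.sunion("tags:a", "tags:b"))           # union of sets

r.zadd("scores", {"alice": 10, "bob": 7})     # sorted set
print(r.zrange("scores", 0, -1, withscores=True))
```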
https://github.com/yinqiwen/ardb - another Redis protocol replacement, with LMDB, RocksDB, and LevelDB disk-based backends, and nice benchmarks.
There are numerous on-disk databases with Redis-like datastructures or even trying to be drop-in protocol-compatible replacements for Redis.
There are excellent recommendations in "Is there something like Redis DB, but not limited with RAM size?" - pity the community considers such questions to be off-topic.
In particular, SSDB is an actively-maintained Redis-like on-disk database (but not directly compatible), and Ardb is an actively-maintained drop-in replacement for Redis that stores the data on disk. Disclaimer: I have not used either of them (yet).
Try Edis, an Erlang implementation of Redis based on LevelDB: http://inaka.github.io/edis/
I encourage you to learn Cassandra; while it has some things similar to key/value stores and sets, it is very different from Redis.
We are currently moving one project from Redis (we use sadd / spop) to TokyoCabinet / KyotoCabinet via the Memcached protocol. For the moment things look good, and very soon I will publish the lib on GitHub - it will be available here:
https://github.com/nmmmnu
The project will be called Simple Message Queue. It will support sadd / spop / sismember only. Also, in Python you will be able to use the new object instead of the Redis object, but only for these three commands.
Hope this helps.
Update 2014.07:
Here is the project I am speaking about.
https://github.com/nmmmnu/MessageQueue
It implements both the Redis and Memcached protocols. For the back end, it uses memory, ndb/mdb, or Berkeley DB.
