Framework for partial bidirectional database synchronization [closed] - database

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I'm trying to optimize the backend for an information system for high-availability, which involves splitting off a part needed for time-critical client requests (front office) from the rest (back office).
Front office will have redundant application servers with load balancing for maximum performance and will use a database with pre-computed data. Back office will periodically prepare data for the front office based on client statistics and some external data.
A part of the data schema will be shared between both back and front office, but not the whole databases, only parts of some tables. The data will not need to correspond all the time, it will be synchronized between the two databases periodically. Continuous synchronization is also viable, but there is no real-time consistency requierement and it seems that batch-style synchronization would be better in terms of control, debug and backup possibilities. I expect no need for solving conflicts because data will mostly grow and change only on one side.
The solution should allow defining corresponding tables and columns and then it will insert/update new/changed rows. The solution should ideally use data model defined in Groovy classes (probably through annotations?), as both applications run on Grails. The synchronization may use the existing Grails web applications or run externally, maybe even on the database server alone (Postgresql).
There are systems for replicating whole mirrored databases, but I wasn't able to find any solution suiting my needs. Do you know of any existing framework to do help with that or to make my own is the only possibility?

I ended up using Londiste from SkyTools. The project page on pgFoundry site lists quite old binaries (and is currently down), so you better build it from source.
It's one direction (master-slave) only, so one has to set up two synchronization instances for bidirectional sync. Note that each instance consists of two Londiste binaries (master and slave worker) and a ticker daemon that pushes the changes.
To reduce synchronization traffic, you can extend the polling period (by default 1 second) in the configuration file or even turn it off completely by stopping the ticker and then trigger the sync manually by running SQL function pgq.ticker on master.
I solved the issue of partial column replication by writing a simple custom handler (londiste.handler.TableHandler subclass) with column-mapping configured in database. The mapping configuration is not model-driven (yet) as I originally planned, but I only need to replicate common columns, so this solution is sufficient for now.

Related

Strategies to building a database of 30m images [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users, the database will be almost read-only, (if things get written then by a controlled automatic process), downtime for maintenance should be no big issue. We will probably perform more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document based databases. I am sure they are reliable, but probably the images would only be accessible through a database query, is that true? In that case I am worried that future users of the data might be faced with the problem of learning how to query the database before actually getting things done.
Question
What database could/should I use?
Storing big fields that are not used in queries outside the "lookup table" is recommended for certain database systems, so it does not seem unusual to store the 30m images in the file system.
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: Whatever you can do in the DB, do it in the DB. You won't even nearly get the same performance if you pull all the data from the database and then do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!
It doesn't sound like the data is very relational so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information from it. However, if your worried about future users, you could put a software layer between the user and DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large Blobs in the DB (IMHO). I would also note that the filesystem performance will be better if you have many folders and subfolders rather than 30M images in one big folder (citation needed)

creating database for each new company [duplicate]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I am building a SAAS application and we are discussing about one database per client vs shared databases. I have read a lot, incluisve some topics here at SO but I have still many doubts.
Our plataform should be highly customizable by each client. (they should be able to have custom tables and add custom fields to existing tables).
The multiple database aproach seems great in this case.
The problem is. should my "users" table be in the master database or in each client database?.
A user might have one or more organizations, so it would be present in multiple databases.
Also, what about generic tables like countries table, etc?
It makes sense to be in the master database. But I have many tables with a created_by field which have a foreign key to the user. Also have some permission related tables by client.
I would loose the power of foreign keys if multiple databases, which means more queries to the database. I know I can use cross-join between databases if they are in the same server but then i loose scalability. (I might need to have multiple database servers in future).
I have tought about federated tables. Not sure about performance.
The technologies I am using are php and symfony 2 framework and mysql for the database.
Also, I am afraid about the maintenance of such a system. We could create some scripts to automate the schema changes in all databases, but if we have 10k clients that would mean 10k databases.
What is your opiniion about this?
The main caracteristic of my app should be flexibility so if a client needs something more specific than the base plataform doesnt have, it should be possible to do it for him.
Some classic problems here. Have you ever been to http://highscalability.com/? Some good case studies there.
From personal experience if you try to share clients on one server, you will find that a very successful/active user will take up all the resources of the machine over time. We had one client in a SAAS that destroyed a shared server and we had to move him somewhere else.
I would rip out global enumerations into a service. You can make one central database for things like list of countries, list of states, etc. and put it behind a web service layer. Also in that database you can have user management/managing what server belongs to what user etc. You can make a management portal that reads/writes to this database for managing your user base.
If I was doing a SAAS again, I would start small and wait for the pain to hit. What you really want are good tools to address the scaling issues when they happen. Have some scripts ready to do rolling schema changes across servers (no way to avoid this once you have more than one server). Have scripts to take down machines while you are modifying the schema. Have scripts to migrate a user from a shared server to a dedicated one.
Consider setting up replication from a central database. This would pump down global information that each user partition/database would need without you having to write a lot of code.
But the biggest piece of advice I've seen - and experienced first hand - don't try too hard to build the next Facebook for scale. Start simple and see what actually happens before worrying about major scalability issues. You might be surprised as the user base grows what scales well and what does not.

Which key-value database to use for BLOB storage? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
We have a service that currently runs on top of a MySQL database and uses JBoss to run the application. The rate of the database growth is accelerating and I am looking to change the setup to improve scaling. The issue is not in a large number of rows nor (yet) a particularly high volume of queries but rather in the large number of BLOBs stored in the db. Particularly the time it takes to create or restore a backup (we use mysqldump and Percona Xtrabackup ) is a concern as well as the fact that we will need to scale horizontally to keep expanding the disk space in the future. At the moment the db size is around 500GB.
The kind of arrangement that I figure would work well for our future needs is a hybrid database that uses both MySQL and some key-value database. The latter would only store the BLOBs. The meta data as well as data for user management and business logic of the application would remain in the MySQL db and benefit from structured tables and full consistency. The application itself would handle the issue of consistency between the databases.
The question is which database to use? There are lots of NoSQL databases to choose from. Here are some points on what qualities I am looking for:
Distributed over multiple nodes, which are flexible to add or remove.
Redundancy of storage, with the database automatically making sure each value object is stored on at least two different nodes.
Value objects' size could range from a few dozen bytes to around 100MB.
The database is accessed from a java EJB application on top of JBoss as well as a program written in C++ that processes the data in the db. Some sort of connector for each would be needed.
No need for structure for the data. A single string or even just a large integer would suffice for the key, pure byte array for the value.
No updates for the value objects are needed, only inserts and deletes. If a particular object is made obsolete by a new object that fulfills the same role, the old object is deleted and a new object with a new key is inserted.
Having looked around a bit, Riak sounds good except for its problems with storing large value objects. Which database would you recommend?

multi-tenant database architecture [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I am building a SAAS application and we are discussing about one database per client vs shared databases. I have read a lot, incluisve some topics here at SO but I have still many doubts.
Our plataform should be highly customizable by each client. (they should be able to have custom tables and add custom fields to existing tables).
The multiple database aproach seems great in this case.
The problem is. should my "users" table be in the master database or in each client database?.
A user might have one or more organizations, so it would be present in multiple databases.
Also, what about generic tables like countries table, etc?
It makes sense to be in the master database. But I have many tables with a created_by field which have a foreign key to the user. Also have some permission related tables by client.
I would loose the power of foreign keys if multiple databases, which means more queries to the database. I know I can use cross-join between databases if they are in the same server but then i loose scalability. (I might need to have multiple database servers in future).
I have tought about federated tables. Not sure about performance.
The technologies I am using are php and symfony 2 framework and mysql for the database.
Also, I am afraid about the maintenance of such a system. We could create some scripts to automate the schema changes in all databases, but if we have 10k clients that would mean 10k databases.
What is your opiniion about this?
The main caracteristic of my app should be flexibility so if a client needs something more specific than the base plataform doesnt have, it should be possible to do it for him.
Some classic problems here. Have you ever been to http://highscalability.com/? Some good case studies there.
From personal experience if you try to share clients on one server, you will find that a very successful/active user will take up all the resources of the machine over time. We had one client in a SAAS that destroyed a shared server and we had to move him somewhere else.
I would rip out global enumerations into a service. You can make one central database for things like list of countries, list of states, etc. and put it behind a web service layer. Also in that database you can have user management/managing what server belongs to what user etc. You can make a management portal that reads/writes to this database for managing your user base.
If I was doing a SAAS again, I would start small and wait for the pain to hit. What you really want are good tools to address the scaling issues when they happen. Have some scripts ready to do rolling schema changes across servers (no way to avoid this once you have more than one server). Have scripts to take down machines while you are modifying the schema. Have scripts to migrate a user from a shared server to a dedicated one.
Consider setting up replication from a central database. This would pump down global information that each user partition/database would need without you having to write a lot of code.
But the biggest piece of advice I've seen - and experienced first hand - don't try too hard to build the next Facebook for scale. Start simple and see what actually happens before worrying about major scalability issues. You might be surprised as the user base grows what scales well and what does not.

Is there a database with git-like qualities? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
I'm looking for a database where multiple users can contribute and commit new data; other users can then pull that data into their own database repository, all in a git-like manner. A transcriptional database, if you like; does such a thing exist?
My current thinking is to dump the database to a single file as SQL, but that could well get unwieldy once it is of any size. Another option is to dump the database and use the filesystem, but again it gets unwieldy once of any size.
There's Irmin: https://github.com/mirage/irmin
Currently it's only offered as an OCaml API, but there's future plans for a GraphQL API and a Cap'n'Proto one.
Despite the complex API and the still scarce documentation, it allows you to plug any backend (In-Memory, Unix Filesystem, Git In-Memory and Git On-Disk). Therefore, it runs even on Unikernels and Browsers.
It also offers a bidirectional model where changes on the Git local repository are reflected upon Application State and vice-versa. With the complex API, you can operate on any Git-level:
Append-only Blob storage.
Transactional/compound Tree layer.
Commit layer featuring chain of changes and metadata.
Branch/Ref/Tag layer (only-local, but offers also remotes) for mutability.
The immutable store is often associated/regarded for the blobs + trees + commits on documentation.
Due the Content-addressable inherited Git-feature, Irmin allows deduplication and thus, reduced memory-consumption. Some functionally persistent data structures fit perfectly on this database, and the 3-way merge is a novel approach to handle merge conflicts on a CRDT-style.
Answer from: How can I put a database under version control?
I have been looking for the same feature for Postgres (or SQL databases in general) for a while, but I found no tools to be suitable (simple and intuitive) enough. This is probably due to the binary nature of how data is stored. Klonio sounds ideal but looks dead. Noms DB looks interesting (and alive). Also take a look at Irmin (OCaml-based with Git-properties).
Though this doesn't answer the question in that it would work with Postgres, check out the Flur.ee database. It has a "time-travel" feature that allows you to query the data from an arbitrary point in time. I'm guessing it should be able to work with a "branching" model.
This database was recently being developed for blockchain-purposes. Due to the nature of blockchains, the data needs to be recorded in increments, which is exactly how git works. They are targeting an open-source release in Q2 2019.
Because each Fluree database is a blockchain, it stores the entire history of every transaction performed. This is part of how a blockchain ensures that information is immutable and secure.
It's not SQL, but CouchDB supports replicating the database and pushing/pulling changes between users in a way similar to what you describe.
Some more information in the chapter on replication in the O'Reilly CouchDB book.

Resources