Can anyone explain the precise difference between a distributed database and a decentralised database?
Decentralized
It means that there is no central storage. Some servers provide information to the clients. The servers are connected with each other.
Distributed
There is no central data storage. All the nodes contain information. The participants are peers with equal rights.
Main Difference
A distributed database is a single logical database that is installed on a set of computers in geographically different locations and linked through a data communication network. A decentralized database, by this definition, is a database that is installed on systems in geographically different locations that are not linked through a data communication network.
As for blockchain, it builds on ideas from centralized relational databases and especially distributed databases, and it leverages cryptography to provide a multi-version concurrency control mechanism and to maintain consensus about the existence and status of shared facts in a trustless environment.
Source
Database in blockchain video
The hierarchy is centralized to de-centralized to distributed.
Decentralized is simply centralized on a smaller (larger?) scale. The risk of losing data in some catastrophic event is reduced because there are lots of little 'centers'.
Distributed databases/ledgers have no 'center'. The risk of a catastrophic event is further reduced, although other risks (forks, etc.) arise.
Related
What is database clustering? If you allow the same database to be on 2 different servers, how do they keep the data synchronized between them? And how does this differ from load balancing from a database server perspective?
Database clustering is a bit of an ambiguous term. Some vendors consider a cluster to be two or more servers sharing the same storage; others call a set of replicated servers a cluster.
Replication is the method by which a set of servers remain synchronized without having to share storage, which lets them be geographically dispersed. There are two main ways of going about it:
master-master (or multi-master) replication: Any server can update the database. It is usually taken care of by a separate module within the database (or, in some cases, by entirely separate software running on top of it).
Downside is that it is very hard to do well, and some systems lose ACID properties in this mode of replication.
Upside is that it is flexible and you can tolerate the failure of any server while still having the database updated.
master-slave replication: There is only a single copy of authoritative data, which is then pushed to the slave servers.
Downside is that it is less fault tolerant: if the master dies, there are no further changes on the slaves.
Upside is that it is easier to do than multi-master and it usually preserves ACID properties.
Load balancing is a different concept. It consists of distributing the queries sent to those servers so that the load is spread as evenly as possible. It is usually done at the application layer (or with a connection pool). The only direct relation between replication and load balancing is that you need some replication to be able to load balance, else you'd have a single server.
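To make the relation between replication and load balancing concrete, here is a minimal sketch of application-layer routing under master-slave replication: writes go to the single authoritative server, reads are rotated over the replicas. The `ReplicatedPool` class and the server names are made up for illustration, and the "starts with SELECT" test is a deliberately naive assumption, not how any particular connection pool classifies queries.

```python
import itertools


class ReplicatedPool:
    """Routes writes to the primary and spreads reads over the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._readers = itertools.cycle(replicas or [primary])

    def route(self, sql):
        # Naive classification: anything that is not a SELECT counts as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._readers) if is_read else self.primary


# Hypothetical server names, used only to show the routing decisions.
pool = ReplicatedPool("db-primary", ["db-replica-1", "db-replica-2"])
targets = [pool.route(q) for q in (
    "SELECT * FROM orders",
    "SELECT * FROM customers",
    "UPDATE orders SET paid = 1 WHERE id = 7",
    "SELECT * FROM orders",
)]
print(targets)
```

With no replicas configured, every query lands on the primary, which is the "single server" case mentioned above.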
From SQL Server point of view:
Clustering will give you an active-passive configuration. In a 2-node cluster, one node is active (serving) and the other is passive (waiting to take over when the active node fails). It is high availability from a hardware point of view.
You can have an active-active cluster, but it requires multiple instances of SQL Server running on each node (i.e. Instance 1 on Node A failing over to Instance 2 on Node B, and Instance 1 on Node B failing over to Instance 2 on Node A).
Load balancing (at least from a SQL Server point of view) does not exist, at least not in the same sense as web server load balancing. You can't balance load that way. However, you can split your application so that it uses some databases on server 1 and other databases on server 2, etc. This is the primary means of "load balancing" in the SQL world.
Clustering uses shared storage of some kind (a drive cage or a SAN, for example), and puts two database front-ends on it. The front end servers share an IP address and cluster network name that clients use to connect, and they decide between themselves who is currently in charge of serving client requests.
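As a rough illustration of that arrangement, the sketch below models two front-end nodes behind one cluster name, with only one node serving at a time. The `FailoverCluster` class, the node names, and the "first healthy node wins" rule are all assumptions made for the example; real cluster software uses heartbeats and quorum rules that are not modelled here.

```python
class FailoverCluster:
    """Two front-end nodes behind one cluster name; only one serves at a time."""

    def __init__(self, cluster_name, nodes):
        self.cluster_name = cluster_name
        # Dict insertion order doubles as the failover preference order.
        self.healthy = {node: True for node in nodes}
        self.active = nodes[0]

    def mark_failed(self, node):
        self.healthy[node] = False
        if node == self.active:
            self._fail_over()

    def _fail_over(self):
        # Promote the first healthy node; clients keep using cluster_name.
        for node, ok in self.healthy.items():
            if ok:
                self.active = node
                return
        raise RuntimeError("no healthy node left in cluster")

    def connect(self):
        # Clients never address a node directly, only the cluster name.
        return f"{self.cluster_name} -> {self.active}"


cluster = FailoverCluster("sql-cluster", ["node-a", "node-b"])
before = cluster.connect()
cluster.mark_failed("node-a")
after = cluster.connect()
print(before)
print(after)
```

The client-facing name stays the same across the failover, which is the point of sharing an IP address and cluster network name.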
If you're asking about a particular database server, add that to your question and we can add details on their implementation, but at its core, that's what clustering is.
Database clustering is actually a mode of synchronous replication between two or possibly more nodes that adds fault tolerance to your system, and it does so in a shared-nothing architecture. Shared nothing means that the individual nodes don't share any physical resources such as disk or memory.
As far as keeping the data synchronized is concerned, there is a management server to which all the data nodes are connected, along with the SQL node, to achieve this (talking specifically about MySQL).
Now about the differences: load balancing is just one result that could be achieved through clustering, the others include high availability, scalability and fault tolerance.
The system in question is for a company with multiple locations. Unreliable internet speeds/availability at some locations have led us toward a setup with a local server at each location, off of which that location runs, plus a central server.
The role of the local server is to let each location run whether or not it is connected to the outside world, and to eliminate high latency if the connection speed is less than optimal.
The role of the central server is two-fold:
Configuration, policy, user, etc, management. For example, new products, price changes, promotions, user changes, etc, are done on the central server and then distributed to the local servers so they have the most up to date info.
Centralize all data created at each location to run reports, analytics and warehouse data.
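One way to picture the second role is a store-and-forward outbox on each local server: writes are always accepted locally and are pushed to the central server only while the link is up. This is a minimal in-memory sketch, not a recommendation of a specific product; `LocalServer`, `CentralServer`, the `online` flag, and the record fields are all hypothetical names chosen for the example.

```python
from collections import deque


class CentralServer:
    def __init__(self):
        self.received = []

    def upload(self, record):
        self.received.append(record)


class LocalServer:
    """Keeps working offline and forwards queued records when it can."""

    def __init__(self, location, central):
        self.location = location
        self.central = central
        self.online = False
        self.outbox = deque()

    def record_sale(self, sale):
        # Always accept the write locally, regardless of connectivity.
        self.outbox.append({"location": self.location, **sale})
        self.flush()

    def flush(self):
        # Drain the outbox in order, but only while the link is up.
        # A record is removed only after the upload call has returned.
        while self.online and self.outbox:
            self.central.upload(self.outbox[0])
            self.outbox.popleft()


central = CentralServer()
store = LocalServer("store-12", central)
store.record_sale({"sku": "A1", "qty": 2})  # link is down: queued locally
store.record_sale({"sku": "B7", "qty": 1})
pending = len(store.outbox)
store.online = True                         # link comes back
store.flush()
print(pending, len(central.received))
```

The reverse direction (configuration and price changes flowing from central to local) could follow the same pattern with the roles swapped.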
The question of how much data to keep on the local server is debatable. For example, some processes, like customer loyalty, depend on more than just one location, so a query must be run against the central server to check user activity and determine incentives. On the other hand, the active customer base should be within the scope of the local server's data.
I lack experience in these types of distributed systems. My question is what database should we use that will facilitate this type of setup, hopefully incorporating the functionality to work automatically without much coding needed to achieve the data syncs to/from central server.
Master-Slave Replication:
In this type of replication, one server (the master) accepts writes and replicates the changes to read replicas (slaves).
Characteristics
Asynchronous
Read Scalability
Master is a point of failure for all the nodes (SPOF)
Master-Master:
In this setup, all the database servers accept reads and writes and synchronize with each other.
Characteristics
Synchronous(hopefully)
Read and Write Scalability
Performance is worse than Master-Slave
No SPOF
Master-Master is harder to set up and maintain, and ID collisions are possible when two servers generate keys independently.
Most popular database servers these days support the features above.
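A common way to avoid ID collisions is to give each server its own interleaved key sequence: every node starts at a different offset and steps by the number of nodes. MySQL exposes this idea through the `auto_increment_offset` and `auto_increment_increment` settings. The sketch below only illustrates the arithmetic; the `id_sequence` helper and the two-node setup are assumptions for the example.

```python
import itertools


def id_sequence(offset, increment):
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    return itertools.count(start=offset, step=increment)


# Two masters: both step by 2, but start at different offsets.
node_a = id_sequence(offset=1, increment=2)
node_b = id_sequence(offset=2, increment=2)

ids_a = [next(node_a) for _ in range(4)]
ids_b = [next(node_b) for _ in range(4)]
print(ids_a)
print(ids_b)
```

Because the two sequences never overlap, rows inserted on either master can be replicated to the other without a primary-key clash.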
I would like to know how replication works in a distributed database. It would be nice if this could be explained in a thorough, yet easy to understand way.
It would also be nice if you could make a comparison between distributed transactions and distributed replication.
Single point of failure
The database server is a central part of an enterprise system, and, if it goes down, service availability might get compromised.
If the database server is running on a single server, then we have a single point of failure. Any hardware issue (e.g., disk drive failure) or software malfunction (e.g., driver problems, malfunctioning updates) will render the system unavailable.
Limited resources
If there is a single database server node, then vertical scaling is the only option when it comes to accommodating a higher traffic load. Vertical scaling, or scaling up, means buying more powerful hardware, which provides more resources (e.g., CPU, Memory, I/O) to serve the incoming client transactions.
Up to a certain hardware configuration, vertical scaling can be a viable and simple solution to scale a database system. The problem is that the price-performance ratio is not linear, so after a certain threshold, you get diminishing returns from vertical scaling.
Another problem with vertical scaling is that, in order to upgrade the server, the database service needs to be stopped. So, during the hardware upgrade, the application will not be available, which can impact underlying business operations.
Database Replication
To overcome the aforementioned issues associated with having a single database server node, we can set up multiple database server nodes. The more nodes, the more resources we will have to process incoming traffic.
Also, if a database server node is down, the system can still process requests as long as there are spare database nodes to connect to. For this reason, upgrading the hardware or software of a given database server node can be done without affecting the overall system availability.
The challenge of having multiple nodes is data consistency. If all nodes are in-sync at any given time, the system is Linearizable, which is the strongest guarantee when it comes to data consistency across multiple registers.
The process of synchronizing data across all database nodes is called replication, and there are multiple strategies that we can use.
Single-Primary Database Replication
The Single-Primary Replication scheme looks as follows:
The primary node, also known as the Master node, is the one accepting writes while the replica nodes can only process read-only transactions. Having a single source of truth allows us to avoid data conflicts.
To keep the replicas in-sync, the primary node must provide the list of changes that were done by all committed transactions.
Relational database systems have a Redo Log, which contains all data changes that were successfully committed.
PostgreSQL uses the WAL (Write-Ahead Log) records to ensure transaction Durability and for Streaming Replication.
Because the storage engine is separated from the MySQL server, MySQL uses a separate Binary Log for replication. The Redo Log is generated by the InnoDB storage engine, and its goal is to provide transaction Durability while the Binary Log is created by the MySQL Server, and it stores the logical logging records, as opposed to physical logging created by the Redo Log.
By applying the same changes recorded in the WAL or Binary Log entries, the replica node can stay in-sync with the primary node.
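The replay idea can be sketched with a toy key-value store: the primary appends every committed change to an ordered log, and the replica remembers how far into that log it has applied. This is a simplified model, not the actual WAL or Binary Log format; the `Primary` and `Replica` classes and the `catch_up` method are hypothetical names.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of committed changes

    def commit(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # position in the primary's log already replayed

    def catch_up(self, primary):
        # Replay only the entries this replica has not seen yet, in order.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)


primary = Primary()
replica = Replica()

primary.commit("balance:42", 100)
replica.catch_up(primary)

primary.commit("balance:42", 80)
stale = replica.data["balance:42"]   # replica has not replayed the new entry yet
replica.catch_up(primary)
fresh = replica.data["balance:42"]
print(stale, fresh)
```

The gap between `stale` and `fresh` is replication lag: with asynchronous replication, a read on the replica can briefly return an older value than the primary holds.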
Horizontal scaling
The Single-Primary Replication provides horizontal scalability for read-only transactions. If the number of read-only transactions increases, we can create more replica nodes to accommodate the incoming traffic.
This is what horizontal scaling, or scaling out, is all about. Unlike vertical scaling, which requires buying more powerful hardware, horizontal scaling can be achieved using commodity hardware.
On the other hand, read-write transactions can only be scaled up (vertical scaling) as there is a single primary node.
I would recommend initially spending time reviewing the MySQL Docs on Replication. It's a good example of database replication. They are here:
http://dev.mysql.com/doc/refman/5.5/en/replication.html
Covering the entire scope of your question seems like too much for one question.
If you have some specific questions, please feel free to post them. Thanks!
Clustrix is a distributed database with a shared nothing architecture that supports both distributed transactions and replication. There is some technical documentation available that describes data distribution, distributed evaluation model, and built in fault tolerance, as well as an overview of the architecture.
As a MySQL replacement, Clustrix implements MySQL's replication policy and produces binlogs in the MySQL format, which are serialized so that Clustrix can act as either a Master or Slave to MySQL.
Situation: Some Bank has an old legacy ABS (Automatic bank system).
Bank wants to:
notify the old legacy CRM system about changes to a client's account (Publish operation).
check the PIN codes of client cards (Request/Response operation), in synchronous mode.
ABS is implemented in very old proprietary technologies with stored procedure calls. So I can connect to this system via the database only.
What ways of integrating a Java/.Net (ESB) application with an old/legacy database system do you know?
Write/Publish operation
Any vendor's database server:
Scan tables for new entries - too slow.
A trigger (if supported) that handles SQL updates and inserts and writes event information to some table. An application listener then checks this table for events.
Oracle server: PL/SQL triggers + Oracle AQ, with a JMS listener.
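The trigger-plus-event-table option can be sketched as follows. SQLite (via Python's standard `sqlite3` module) stands in for the legacy database purely so the example is runnable; the table names, the trigger name, and the `poll_events` helper are invented for illustration, and only updates are captured here to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE account_events (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    account_id INTEGER,
    new_balance INTEGER
);
-- The trigger records every balance update as a row in the event table.
CREATE TRIGGER accounts_after_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO account_events (account_id, new_balance)
    VALUES (NEW.id, NEW.balance);
END;
""")


def poll_events(connection, last_seen_id):
    """Return events newer than last_seen_id, oldest first."""
    return connection.execute(
        "SELECT event_id, account_id, new_balance FROM account_events "
        "WHERE event_id > ? ORDER BY event_id",
        (last_seen_id,),
    ).fetchall()


conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 500)")
conn.execute("UPDATE accounts SET balance = 450 WHERE id = 1")

events = poll_events(conn, last_seen_id=0)
print(events)
```

The listener keeps the highest `event_id` it has processed and passes it back on the next poll, so it reads only a small event table instead of scanning the business tables.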
Reading operation
Just write result into tables of ABS - dangerous.
...
How can the legacy database system be notified about responses in synchronous mode? How can Write/Read be implemented in synchronous mode?
Again, which ways of Java/.Net (ESB) application integration with old/legacy database system do you know?
A lot of vendors hype DataServices. I think these products are most valuable when integrating different data sources.
I would consider making a simple "application" that exposes this data as a service.
It depends on many factors; particularly read/write throughput and performance sensitivity of the database.
Databases tend to be kinda sensitive things and are often very fragile to general purpose access from arbitrary other systems when they are finely tuned for production use in a specific system; so often folks replicate the database to another read-only slave database that can be then used for doing integration work & querying and so forth.
You can then use triggers/polling/JMS based on whatever you need without impacting the original database.
Depending on the database replication technology used; you can then often install triggers in the replica database (which can afford to get a little behind from the master from time to time) - to minimise impact in the production database
I suggest using Mule as the ESB in your bank (see also http://www.mulesource.org/display/MULE/Home).
It allows you to communicate with the database directly (at the JDBC level, which should work with stored procedures as well as with tables/views). I have had a positive experience using it to integrate a core banking system (database level, Oracle) with a standalone application (web services level).
Frankly, I didn't understand all of your questions (you can ask me in Russian directly if you prefer), but IMO Mule is the way to go: it can consume JMS, JDBC, files and many other transports, and it can process both synchronous and asynchronous events (see also http://www.mulesource.org/display/MULE2USER/Available+Transports).
Regards.
P.S. To be clearer for an English-speaking audience, I suggest using the more standard term "core banking system" instead of ABS (which means the same thing in ex-USSR countries).