How does a decentralized database work from a technical point of view?
In general I understand that every user has their own copy of the data and that it somehow gets in sync.
But the sync logic is totally unclear to me: do clients send HTTP requests to each other, or what exactly is happening?
Any suggestions on what to read or how to dive deeper into the topic are highly appreciated.
UPDATE ===================
Thanks all for the good replies. I want to make the question a little narrower.
It is said that a blockchain is decentralized and owned by nobody, so no public servers are in place.
So here is what confuses me:
e.g. I have a copy of the DB and another guy has the same database, but slightly different; how can we get in sync if neither of us has a public IP?
In my mind, if at least someone has to have a public IP, the system is already not completely decentralized.
Logic: State calculation from transactions
Decentralized systems/datastores work by calculating state, where the state is the actual data that you want to read from the database.
Say you modify the database with 2 insert/update queries. The system actually takes these 2 queries and stores them as transactions, and these transactions are synced across nodes (the other machines in the network that make it decentralized). Each other node then recalculates the state from the synced transactions and writes it to its own database.
Let's try to understand this by an example:
Consider a movie ticketing system. The same theater will be listed in multiple vendor websites.
Consider the movie theater to be a 100-seater (100 seats in total).
And say there are 2 vendors. BookMyShow and TicketNew.
Below is a list of transactions that happened on the network from time t1 to t6.
Transaction (T1) - At time(t1) someone bought 10 tickets on BookMyShow.
Transaction (T2) - At time t2 someone else bought 3 more tickets on BookMyShow.
In BookMyShow database, the state will initially be 100 unreserved tickets.
After these 2 transactions(T1 and T2) state will be => remaining 87 seats, booked 13 seats.
(So, the database of BookMyShow will have this information.)
Now BookMyShow will add these 2 transactions T1 and T2 into a so-called block (which is basically a list of transactions plus some other parameters), B1, and publish it to the network. The other vendor, TicketNew, will get this block B1 synced to its machine and will now know that these transactions happened. Now TicketNew's machine will start calculating the state of the ticketing system by replaying the transactions.
100 tickets initially.
10 Booked in T1 => 90 Seats remaining
3 Booked in T2 => 87 remaining
Update into TicketNew's database => remaining 87 seats, booked 13 seats
Similarly, transactions T3 ... T6 happen at times t3 ... t6 on TicketNew's website. These transactions are added to another block, B2, by TicketNew and pushed to the network. BookMyShow syncs block B2 and updates the state in its database: 26 tickets are now booked and 74 are free.
87 tickets before syncing block B2.
4 Booked in T3 => 83 Seats remaining
2 Booked in T4 => 81 remaining
6 Booked in T5 => 75 remaining
1 Booked in T6 => 74 remaining
Update into BookMyShow database => remaining 74 seats, booked 26 seats
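To make the "replay transactions to compute state" idea concrete, here is a minimal sketch in Python. It only illustrates the replay logic from the example above, not any real blockchain's data structures; the block and transaction shapes are assumptions.

# Each node stores blocks of transactions and derives its state
# (booked/remaining seats) by replaying every transaction in order.
TOTAL_SEATS = 100

# Block B1 as published by BookMyShow, block B2 as published by TicketNew.
block_b1 = [{"id": "T1", "seats": 10}, {"id": "T2", "seats": 3}]
block_b2 = [{"id": "T3", "seats": 4}, {"id": "T4", "seats": 2},
            {"id": "T5", "seats": 6}, {"id": "T6", "seats": 1}]

def calculate_state(blocks):
    booked = sum(tx["seats"] for block in blocks for tx in block)
    return {"booked": booked, "remaining": TOTAL_SEATS - booked}

print(calculate_state([block_b1]))            # {'booked': 13, 'remaining': 87}
print(calculate_state([block_b1, block_b2]))  # {'booked': 26, 'remaining': 74}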
How exactly decentralized databases work is specific to each implementation.
CouchDB is an example of one kind of distributed database. It uses the CouchDB Replication Protocol for synchronizing JSON documents between 2 peers over HTTP/1.1.
CouchDB basically uses incremental replication for syncing; you can read more about it in the CouchDB documentation.
Hyperledger Fabric is a blockchain framework started by IBM which can use CouchDB as its ledger state database.
A decentralized database like a blockchain, whether Hyperledger or Bitcoin, is a kind of Event Sourcing/CQRS implementation: the ledger is an append-only event source that is synced to other nodes by a consensus protocol, and every node replays the event log to build a state database for queries. For example, Hyperledger uses CouchDB or LevelDB as the state database.
How does a decentralized database work from a technical point of view?
This is very broad and could take an entire course to explain. A brief summary: generally, a change is first made to the state of one node. After this change in state, the node broadcasts the change to the rest of the nodes in the distributed system.
For blockchain, after a node adds a block to the chain, it will broadcast the new block to the rest of the nodes. These nodes will then come to a consensus on whether the new block is valid. If the other nodes do not come to a consensus, they will keep working off the chain without the new block. That is a broad overview of how the system comes to a consensus.
But the sync logic is totally unclear to me: do clients send HTTP requests to each other, or what exactly is happening?
For blockchain, the node that creates a new block broadcasts the details of the new block to the other nodes, typically over HTTP(S) or another peer-to-peer protocol depending on the implementation. Those nodes have to validate the block against the chain they have on their own system. When nodes are working off different chains, this is called a fork. When the nodes come to a consensus, the system determines which fork is treated as the fork of truth.
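As a very rough sketch of the broadcast-and-validate step (this is not the actual Bitcoin or Hyperledger wire protocol; the peer URLs, the /blocks endpoint and the block fields are hypothetical, and only Python's standard library is used):

import hashlib, json, urllib.request

def block_hash(block):
    # Hash of the block contents, excluding the stored hash itself.
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def is_valid_successor(new_block, chain_tip):
    # A receiving node accepts the block only if it links to the tip of the
    # chain it already has and its hash matches its contents.
    return (new_block["prev_hash"] == chain_tip["hash"]
            and new_block["hash"] == block_hash(new_block))

def broadcast(new_block, peer_urls):
    # The creating node pushes the block to every peer it knows about;
    # each peer runs its own validation before appending the block.
    for url in peer_urls:
        req = urllib.request.Request(
            url + "/blocks",                      # hypothetical peer endpoint
            data=json.dumps(new_block).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)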
It's all about synchronization. Just think about git: someone changes code, commits the changes, pulls changes that others made in the meantime, and after that pushes their final changes. Sure, sometimes you will have to merge things, but a good framework normally does that for you.
Update: Commit your changes
Synchronize passively: Pull others' changes
Synchronize actively: Push your changes
If you are experienced in JavaScript, then have a look at https://github.com/automerge/automerge. The documentation is very precise and explains the processes in a broader context, so it is not only interesting for web developers.
Related
Flow diagram (image omitted):
Our microservices call a third-party service via a REST API to access (read/update) the same record in a shared database.
Use case 1: Same microservice making multiple calls to access same record
A customer buys 5 units of product A.
Another customer buys 2 units of the same product A.
2 calls are made by the UI API to the endpoint /decrementProduct within a few milliseconds.
Both calls may end up reading the same inventory count for product A at that moment, and both will decrement the purchased units for product A based on the inventory count they read.
Example:
Inventory Count before calls: 10
Call 1 decrements 5 units from 10 and writes back 5 as the current inventory count.
Call 2 decrements 2 units from 10 and writes back 8 as the current inventory count.
Inventory Count after calls: 8
Correct Inventory Count after calls should be: 3
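For illustration, the lost update described above can be reproduced with two plain threads standing in for the two REST calls (a minimal sketch; the sleep is only there to make the overlap deterministic):

import threading, time

inventory = {"A": 10}   # shared record, as in the example above

def decrement_product(product, units):
    # Read-modify-write without any concurrency control.
    current = inventory[product]            # both calls read 10 here
    time.sleep(0.1)                         # simulate the gap between read and write
    inventory[product] = current - units    # the later write overwrites the earlier one

t1 = threading.Thread(target=decrement_product, args=("A", 5))
t2 = threading.Thread(target=decrement_product, args=("A", 2))
t1.start(); t2.start()
t1.join(); t2.join()
print(inventory["A"])   # prints 8 or 5, not the correct 3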
Use case 2: Multiple microservices making multiple calls to access same record
The problem explained in use case 1 will be exacerbated in this case, due to the number of calls updating the same record in the database.
Edit: 13-April-2021
The shared database is exposed to our microservices through a REST API, and we don't have any control over the physical database or the exposed REST API, so we cannot implement any transactions or locking mechanism at the database level.
I don't know which database you are using, but traditional relational databases like Oracle, PostgreSQL, MySQL, DB2, etc. already address these kinds of issues using locks on the records being updated, to ensure that there are no concurrency problems.
I mean, if you open a transaction in which you read a value and then update it, there won't be any problem; and if you try to update a row in the database with a version number different from the one that is currently set (an outdated version), the database won't let you update it.
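A minimal sketch of the version-number check this answer describes (optimistic locking), using SQLite purely for illustration; the product table and its columns are assumptions, not your actual schema:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, count INTEGER, version INTEGER)")
conn.execute("INSERT INTO product VALUES ('A', 10, 1)")

def decrement(conn, product_id, units):
    count, version = conn.execute(
        "SELECT count, version FROM product WHERE id = ?", (product_id,)).fetchone()
    # The update only succeeds if nobody changed the row since we read it.
    cur = conn.execute(
        "UPDATE product SET count = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (count - units, product_id, version))
    if cur.rowcount == 0:
        raise RuntimeError("conflict: row changed since it was read, retry the call")

decrement(conn, "A", 5)
decrement(conn, "A", 2)
print(conn.execute("SELECT count FROM product WHERE id = 'A'").fetchone()[0])  # 3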
Imagine simple transactions on a bank account stored as a graph database, with events like this:
+$1000 cash input.
-$10 for buying a book by VISA.
+$100 cash for selling an old bike.
-$50 cash for buying groceries.
In a graph structure we could define the nodes as the transactions with the properties:
id - Transaction ID
time - Timestamp of when the transaction happened.
delta - Amount of money used (+/-) on the transaction
description - Reason for transaction.
The edges would then be pointing to the previous transaction. We could have other edges pointing to other accounts (for transfers between accounts), owners etc. but for simplicity we have this structure.
g.addV('transactions').property('id','1').property('time',0).property('delta',1000).property('description','cash input')
g.addV('transactions').property('id','2').property('time',1).property('delta',-10).property('description','for buying a book by VISA')
g.V('2').addE('previous').to(g.V('1'))
g.addV('transactions').property('id','3').property('time',2).property('delta',100).property('description','cash for selling an old bike.')
g.V('3').addE('previous').to(g.V('2'))
g.addV('transactions').property('id','4').property('time',3).property('delta',-50).property('description','cash for buying groceries')
g.V('4').addE('previous').to(g.V('3'))
Now, to get the current inventory of this account, we would just traverse the previous edges from the latest transaction, either until a specific date or back to the beginning, like this:
g.V('4').emit().repeat(out('previous')).until(has('time',0)).properties('delta').value().sum()
==>1040
Which is all good and fast for 4 transactions. BUT, when doing this for 100 000 transactions it takes around 8 minutes, and with more complex operations or more data it takes even longer.
In my test case I have set up an Azure Cosmos DB with the Graph API and a throughput of 2000 RU/s.
Since I am fairly new to graph databases and queries, I realize that there might be faster and better ways of doing this, and ways to optimize it, that I don't know about. Maybe a graph database is not even the right tool for this job?
What I want to achieve here is a reasonably fast query into the transactions, which may fork into multiple accounts and many more events.
How can I make this work better?
Why don't you just add a current property to each transaction vertex? This would still keep the history as you have it now, but also provide much faster access to the current inventory value (at any given point in time). Also, should the value of any transaction change afterwards, it would be easy to update all newer transactions accordingly (this would only be a long-running write query; reads would still be blazing fast).
Keep in mind that it's generally a bad idea to have so many hops in an OLTP query.
I'm working on an application that involves a very high rate of update/select queries against the database.
I have a base table (A) which will have about 500 records for an entity for a day. For every user in the system, a variation of this entity is created based on some of the preferences of the user, and these are stored in another table (B). This is done by a cron job that runs at midnight every day.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep data for one day in these tables and at midnight I archive historical data to HBase. This setup is working fine and I'm having no performance issues so far.
There has been some change in the business requirements lately, and now some attributes in base table A (for 15-20 records) will change every 20 seconds, and based on that I have to recalculate some values for all of the corresponding variation records in table B for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds; by then the next update occurs, eventually resulting in all SELECT queries getting queued up. I'm getting about 3 GET requests per 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Does NoSQL + a relational database help me here? Are there any platforms/datastores which will let me update data frequently without locking and at the same time give me the flexibility of running select queries on various fields in an entity?
Cheers
Jugs
I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application is currently using SQL, then there's no reason to move away from that to NoSQL. The performance requirements you describe can certainly be met by an in-memory SQL-capable DBMS.
From what you're saying, you are updating 200K records every 20 seconds; at that rate you will touch almost all of your data within about 10 minutes. If the state is updated that frequently, why are you writing it to the database at all? I don't know the details of your requirements, but why don't you just calculate it on demand using the data from table A?
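To illustrate the on-demand idea (a minimal sketch that assumes the per-user value really is a pure function of the base record and the user's preferences; derive and user_value are hypothetical names, not anything from your system):

import functools

# Hypothetical pure derivation of a user-specific value (what table B
# stores today) from a base record of table A and the user's preferences.
def derive(base_value: float, user_discount: float) -> float:
    return round(base_value * user_discount, 2)

# Memoize on the exact inputs: when one of the ~20 frequently changing
# base records changes, only requests that actually need the new value
# trigger a recomputation; there is no bulk update of 200,000 rows.
@functools.lru_cache(maxsize=1_000_000)
def user_value(base_value: float, user_discount: float) -> float:
    return derive(base_value, user_discount)

print(user_value(100.0, 0.9))   # computed: 90.0
print(user_value(100.0, 0.9))   # served from the cache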
I have an app (vb.net) which collects data from users and stores the data locally on their laptop until they sync it up with a central SQLServer 2008 database. The sync needs to be in both directions. So right now, I have a timestamp on each record that gets set when that record gets updated. Then I compare times on the records to see which is more recent. If a record on the laptop is more recent than the one on the central DB, the laptop record gets sent up. And if the record on the central DB is more recent than the laptop, that record gets sent down to the laptop.
I have several hundred thousand records spread over about 15 tables. It is taking 3 to 4 minutes to run through all of them if you are local on the network. The problem really gets worse for remote users: it takes them 20 to 30 minutes to sync via VPN.
I have about 5 users doing this and they all need to maintain the same information with each other by way of the central database. They all sync to the central DB, not with each other.
Is there a better way to check every record other than comparing timestamps?
Note that only a handful of records (5%) change each time they sync, but I don't know which ones it may be. It could be any of them. So I have to check all of them.
Thanks.
In my opinion timestamps are not the way to go for determining which records to send to the other party.
Although they might be "ok" for conflict resolution, clock differences between the synchronizing parties (computers) might cause records to be skipped when sending, causing real problems.
I myself use an identity column (on the server side) on one dedicated table to generate sequence numbers; in every transaction I get a new sequence number and assign it to all updated/inserted rows of the other tables that need synchronization.
Now when a client requests synchronization, it provides the server with the latest 'sequence' it received during the last synchronization, or 0 if it is the first time.
The server then sends only those records that have a greater sequence number, determines the highest sequence number among the records it actually sent to the client, and gives that number to the client for the next synchronization request.
In my scenario, conflict resolution is done on the client, because all business logic is there anyway; this means that the client always receives updates first, before it starts to send its own.
Because you use one newly generated sequence number for every transaction, you maintain referential integrity during each synchronization. But to make sure that is actually true, you need to determine the currently highest sequence number before you start to send synchronization data, and never retrieve any records higher than this number, because otherwise you could break referential integrity.
This is because some other thread might commit inserts of Orders and OrderItems after you have already looked up the Orders but not the OrderItems, leaving you with OrderItems in your outgoing synchronization package without their Order.
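A minimal sketch of that sequence-based exchange (SQLite stands in for SQL Server 2008 here, and the table/column names are assumptions, not your actual schema):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sync_sequence (seq INTEGER PRIMARY KEY AUTOINCREMENT);
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, sync_seq INTEGER);
""")

def next_sequence(conn):
    # One new sequence number per transaction; stamped on every changed row.
    cur = conn.execute("INSERT INTO sync_sequence DEFAULT VALUES")
    return cur.lastrowid

def save_customer(conn, cid, name):
    seq = next_sequence(conn)
    conn.execute("INSERT OR REPLACE INTO customer (id, name, sync_seq) VALUES (?, ?, ?)",
                 (cid, name, seq))

def changes_since(conn, client_last_seq):
    # Cap at the current highest sequence so rows committed while we are
    # reading cannot slip in and break referential integrity.
    max_seq = conn.execute("SELECT IFNULL(MAX(seq), 0) FROM sync_sequence").fetchone()[0]
    rows = conn.execute(
        "SELECT id, name, sync_seq FROM customer WHERE sync_seq > ? AND sync_seq <= ?",
        (client_last_seq, max_seq)).fetchall()
    return rows, max_seq   # the client stores max_seq for its next request

save_customer(conn, 1, "Alice")
save_customer(conn, 2, "Bob")
rows, new_last = changes_since(conn, 0)
print(rows, new_last)   # both rows, plus the sequence number to send next time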
For deletions, I use an IsDeleted column, and the server holds records for some period before they really get deleted.
When clients insert data, I give them feedback about which (primary) keys the records were given, etc.
Well, there is so much more to this than I can mention here, but here are some key thoughts that you should consider carefully:
How to prevent:
Missing records
Missing deletes
Double inserts
Unnecessary sending of records (I use a nullable field LastModifierId)
Input validation
Referential integrity
Conflict resolution
Performance costs (choose the right indexes; filtered unique indexes are great for keeping track of temporary client-side insert identities of records, which may also be null; you need these to prevent double inserts)
Well, good luck; I hope this gives you food for thought.
I am working with Google App Engine and using the low-level Java API to access Bigtable. I'm building a SaaS application with 4 layers:
Client web browser
RESTful resources layer
Business layer
Data access layer
I'm building an application to help manage my mobile auto detailing company (and others like it). I have to represent these four separate concepts, but am unsure if my current plan is a good one:
Appointments
Line Items
Invoices
Payments
Appointment: An "Appointment" is a place and time where employees are expected to be in order to deliver a service.
Line Item: A "Line Item" is a service, fee or discount and its associated information. An example of line items that might go into an appointment:
Name                                Price   Commission   Time estimate
Full Detail, Regular Size             160           75       3.5 hours
$10 Off Full Detail Coupon            -10            0         0 hours
Premium Detail                        220          110       4.5 hours
Derived totals (not a line item)     $370         $185       8.0 hours
Invoice: An "Invoice" is a record of one or more line items that a customer has committed to pay for.
Payment: A "Payment" is a record of what payments have come in.
In a previous implementation of this application, life was simpler and I treated all four of these concepts as one table in a SQL database: "Appointment." One "Appointment" could have multiple line items, multiple payments, and one invoice. The invoice was just an e-mail or print out that was produced from the line items and customer record.
9 out of 10 times, this worked fine. When one customer made one appointment for one or a few vehicles and paid for it themselves, all was grand. But this system didn't work under a lot of conditions. For example:
When one customer made one appointment, but the appointment got rained out halfway through, resulting in the detailer having to come back the next day, I needed two appointments, but only one line item, one invoice and one payment.
When a group of customers at an office all decided to have their cars done the same day in order to get a discount, I needed one appointment, but multiple invoices and multiple payments.
When one customer paid for two appointments with one check, I needed two appointments, but only one invoice and one payment.
I was able to handle all of these outliers by fudging things a little. For example, if a detailer had to come back the next day, I'd just make another appointment on the second day with a line item that said "Finish Up" and the cost would be $0. Or if I had one customer pay for two appointments with one check, I'd put split payment records in each appointment. The problem with this is that it creates a huge opportunity for data incongruity. Data incongruity can be a serious problem, especially for cases involving financial information such as the third example where the customer paid for two appointments with one check. Payments must be matched up directly with goods and services rendered in order to properly keep track of accounts receivable.
Proposed structure:
Below is a normalized structure for organizing and storing this data. Perhaps because of my inexperience, I place a lot of emphasis on data normalization because it seems like a great way to avoid data incongruity errors. With this structure, changes to the data can be done with one operation without having to worry about updating other tables. Reads, however, can require multiple reads coupled with in-memory organization of data. I figure later on, if there are performance issues, I can add some denormalized fields to "Appointment" for faster querying while keeping the "safe" normalized structure intact. Denormalization could potentially slow down writes, but I was thinking that I might be able to make asynchronous calls to other resources or add to the task queue so that the client does not have to wait for the extra writes that update the denormalized portions of the data.
Tables:
Appointment
start_time
etc...
Invoice
due_date
etc...
Payment
invoice_Key_List
amount_paid
etc...
Line_Item
appointment_Key_List
invoice_Key
name
price
etc...
The following is the series of queries and operations required to tie all four entities (tables) together for a given list of appointments. This would include information on what services were scheduled for each appointment, the total cost of each appointment, and whether or not payment has been received for each appointment. This would be a common query when loading the calendar for appointment scheduling or for a manager to get an overall view of operations.
QUERY for the list of "Appointments" whose "start_time" field lies within the given range.
Add each key from the returned appointments into a List.
QUERY for all "Line_Items" whose appointment_Key_List field includes any of the returned appointments.
Add each invoice_key from all of the line items into a Set collection.
QUERY for all "Invoices" in the invoice key set (this can be done in one asynchronous operation using App Engine).
Add each key from the returned invoices into a List.
QUERY for all "Payments" whose invoice_Key_List field contains a key matching any of the returned invoices.
Reorganize in memory so that each appointment reflects the line_items that are scheduled for it, the total price, the total estimated time, and whether or not it has been paid for.
...As you can see, this operation requires 4 datastore queries as well as some in-memory organization (hopefully the in-memory part will be pretty fast).
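For what it's worth, here is a rough sketch of that in-memory organization step in plain Python (not the low-level Java datastore API; the four query results are faked as lists of dicts, and the exact field shapes are assumptions based on the tables above):

# Fake results of the four datastore queries (shapes are illustrative only).
appointments = [{"key": "appt1", "start_time": "2011-07-01T09:00"}]
line_items   = [{"key": "li1", "appointment_keys": ["appt1"], "invoice_key": "inv1",
                 "name": "Full Detail, Regular Size", "price": 160, "hours": 3.5}]
invoices     = [{"key": "inv1", "due_date": "2011-07-15"}]
payments     = [{"key": "pay1", "invoice_keys": ["inv1"], "amount_paid": 160}]

def assemble(appointments, line_items, invoices, payments):
    paid_invoices = {k for p in payments for k in p["invoice_keys"]}
    result = []
    for appt in appointments:
        items = [li for li in line_items if appt["key"] in li["appointment_keys"]]
        result.append({
            "appointment": appt,
            "line_items": items,
            "total_price": sum(li["price"] for li in items),
            "total_hours": sum(li["hours"] for li in items),
            # Paid if every invoice referenced by this appointment's items has a payment.
            "paid": all(li["invoice_key"] in paid_invoices for li in items),
        })
    return result

print(assemble(appointments, line_items, invoices, payments))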
Can anyone comment on this design? This is the best I could come up with, but I suspect there might be better options or completely different designs that I'm not thinking of that might work better in general or specifically under GAE's (google app engine) strengths, weaknesses, and capabilities.
Thanks!
Usage clarification
Most applications are more read-intensive, some are more write intensive. Below, I describe a typical use-case and break down operations that the user would want to perform:
Manager gets a call from a customer:
Read - Manager loads the calendar and looks for a time that is available
Write - Manager queries the customer for their information. I pictured this as a succession of asynchronous writes as the manager enters each piece of information such as phone number, name, e-mail, address, etc. Or, if necessary, perhaps one write at the end after the client application has gathered all of the information and it is then submitted.
Write - Manager takes down customer's credit card info and adds it to their record as a separate operation
Write - Manager charges credit card and verifies that the payment went through
Manager makes an outgoing phone call:
Read - Manager loads the calendar
Read - Manager loads the appointment for the customer he wants to call
Write - Manager clicks the "Call" button, a call is initiated and a new CallRecord entity is written
Read - Call server responds to the call request and reads the CallRecord to find out how to handle the call
Write - Call server writes updated information to the CallRecord
Write - When the call is closed, the call server makes another request to the server to update the CallRecord resource (note: this request is not time-critical)
Accepted answer:
Both of the top two answers were very thoughtful and appreciated. I accepted the one with fewer votes in order to imperfectly equalize their exposure as much as possible.
You specified two specific "views" your website needs to provide:
Scheduling an appointment. Your current scheme should work just fine for this - you'll just need to do the first query you mentioned.
Overall view of operations. I'm not really sure what this entails, but if you need to do the string of four queries you mentioned above to get this, then your design could use some improvement. Details below.
Four datastore queries in and of themselves aren't necessarily overboard. The problem in your case is that two of the queries are expensive and probably even impossible. I'll go through each query:
Getting a list of appointments - no problem. This query will be able to scan an index to efficiently retrieve the appointments in the date range you specify.
Get all line items for each appointment from #1 - this is a problem. This query requires an IN query. IN queries are transformed into N sub-queries behind the scenes, so you'll end up with one query per appointment key from #1! These are executed in parallel, so that isn't so bad. The main problem is that IN queries are limited to only a small list of values (up to just 30 values). If you have more than 30 appointment keys returned by #1, then this query will fail to execute!
Get all invoices referenced by line items - no problem. You are correct that this query is cheap because you can simply fetch all of the relevant invoices directly by key. (Note: this query is still synchronous - I don't think asynchronous was the word you were looking for).
Get all payments for all invoices returned by #3 - this is a problem. Like #2, this query will be an IN query and will fail if #3 returns even a moderate number of invoices which you need to fetch payments for.
If the number of items returned by #1 and #3 are small enough, then GAE will almost certainly be able to do this within the allowed limits. And that should be good enough for your personal needs - it sounds like you mostly need it to work, and don't need it to scale to huge numbers of users (it won't).
Suggestions for improvement:
Denormalization! Try storing the keys for the Line_Item, Invoice, and Payment entities relevant to a given appointment in lists on the appointment itself. Then you can eliminate your IN queries. Make sure these new ListPropertys are not indexed, to avoid problems with exploding indexes.
Other less specific ideas for improvement:
Depending on what your "overall view of operations" is going to show, you might be able to split up the retrieval of all this information. For example, perhaps you start by showing a list of appointments, and then when the manager wants more information about a particular appointment you go ahead and fetch the information relevant to that appointment. You could even do this via AJAX if you want this interaction to take place on a single page.
Memcache is your friend - use it to cache the results of datastore queries (or even higher level results) so that you don't have to recompute it from scratch on every access.
As you've noticed, this design doesn't scale. It requires 4 (!!!) DB queries to render the page. That's 3 too many :)
The prevailing notion of working with the App Engine Datastore is that you want to do as much work as you possibly can when something is written, so that almost nothing needs to be done when something is retrieved and rendered. You presumably write the data very few times, compared to how many times it's rendered.
Normalization is similarly something that you seem to be striving for. The Datastore doesn't place any value in normalization -- it may mean less data incongruity, but it also means reading data is muuuuuch slower (4 reads?!!). Since your data is read much more often than it's written, optimize for reads, even if that means your data will occasionally be duplicated or out of sync for a short amount of time.
Instead of thinking about how the data looks when it's stored, think about how you want the data to look when it's displayed to the user. Store as close to that format as you can, even if that means literally storing pre-rendered HTML in the datastore. Reads will be lightning-fast, and that's a good thing.
So since you should optimize for reads, oftentimes your writes will grow to gigantic proportions. So gigantic that you can't fit them in the 30-second time limit for requests. Well, that's what the task queue is for. Store what you consider the "bare necessities" of your model in the datastore, then fire off a task queue task to pull it back out, generate the HTML to be rendered, and put it in there in the background. This might mean your model isn't immediately ready to display until the task has finished with it, so you'll need graceful degradation in that case, even if that means rendering it "the slow way" until the data is fully populated. Any further reads will be lightning-quick.
In summary, I don't have any specific advice directly related to your database -- that's dependent on what you want the data to look like when the user sees it.
What I can give you are some links to some super helpful videos about the datastore:
Brett Slatkin's 2008 and 2009 talks on building scalable, complex apps on App Engine, and a great one from this year about data pipelines (which isn't directly applicable I think, but really useful in general)
App Engine Under the Covers: How App Engine does what it does, behind the scenes
AppStats: a great way to see how many datastore reads you're performing, and some tips on reducing that number
Here are a few app-engine specific factors that I think you'll have to contend with:
When querying using an inequality, you can only use an inequality on one property. For example, if you are filtering on an appointment date being between July 1st and July 4th, you couldn't also filter by price > 200.
Transactions on App Engine are a bit tricky compared to the SQL databases you are probably used to. You can only do transactions on entities that are in the same "entity group".