Designing a scalable, frequently updated resource on a key/value DHT

I would like to design a resource that users continuously update and read. The resource need not always be up to date, but it must scale well: the nodes responsible for the resource and its replicas should not be overloaded.
The main problem is that I cannot see how this is possible! I could offload these nodes by adding read caches and updating them at a slow pace, but I have no idea how to scale the writes: values must live at known keys so that users can find them, so I can't spread the load across the DHT...
Thanks a lot for your ideas!

After some thought, one possible idea is an approximate scheme: each time someone posts a feed, they also store it under a key equal to a timestamp with one-second resolution (for example 25/03/2011 at 5 min and 1 sec), the second in which the feed was posted. The load is then automatically shared between the nodes, and the responsible node changes quickly.
When someone wants to know the latest feeds, they just look up the key for the current second and for the previous 2-3 seconds. They can thus recover the latest feeds while the load is spread across the nodes responsible for the different seconds...
That should do the job, I think :).
Hope my own answer will help someone later ;)
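To make the idea concrete, here is a minimal sketch of the per-second bucketing. It assumes a DHT that can append several values under one key; ToyDHT is only an in-memory stand-in, and the names bucket_key, post_feed and latest_feeds are made up for illustration.

```python
from collections import defaultdict


class ToyDHT:
    """In-memory stand-in for a DHT (assumption: a key can hold many values)."""

    def __init__(self):
        self._store = defaultdict(list)

    def append(self, key, value):
        self._store[key].append(value)

    def get(self, key):
        # Return a copy so callers cannot mutate the stored list.
        return list(self._store.get(key, []))


def bucket_key(epoch_seconds):
    """Key of the one-second bucket that contains the given time."""
    return f"feed:{int(epoch_seconds)}"


def post_feed(dht, feed, now):
    """Store the feed under the key of the second in which it was posted."""
    dht.append(bucket_key(now), feed)


def latest_feeds(dht, now, lookback=3):
    """Read the current second plus `lookback` earlier seconds, newest bucket first."""
    feeds = []
    for offset in range(lookback + 1):
        feeds.extend(dht.get(bucket_key(int(now) - offset)))
    return feeds
```

Because the key changes every second, a real DHT would hash consecutive buckets to different nodes, so no single node stays responsible for the writes. Readers only need roughly synchronized clocks to compute the same keys.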


How many records / rows / nodes is a lot in Firebase?

I'm creating an app where I will store users under all postalcodes/zipcodes they want to deliver to. The structure looks like this:
postalcodes/{{postalcode}}/{{userId}}=true
The reason for the structure is to easily fetch all users who deliver to a certain postal code.
ex. postalcodes/21121/
If every user signs up for around 500 postal codes and the app has about 1000 users, that can become a lot of records:
500 x 1000 = 500,000
Will Firebase easily handle that many records in data storage, or should I consider a different approach/solution? What are your thoughts?
Kind regards,
Elias
I'm quite sure Firebase can return 500k nodes without a problem.
The bigger concerns are how long that retrieval will take (especially in this mobile-first era) and what your application will show your user based on that many nodes.
A list with 500k rows is hardly useful, so most likely you'll show a subset of the data.
Say you just show the first screenful of nodes. How many nodes will that be? 20? So why would you retrieve the other nodes up front in that case? I'd simply retrieve the nodes needed to build the first screen and load the rest on demand - when/if needed.
Alternatively I could imagine you show a digest of the nodes (like a total number of nodes and maybe some averages per zip code area). You'd need all nodes to determine that digest. But I'd hardly consider it the task of a client application to determine the digest values. That's more of a server-side task. That server could use the same technology as client apps (i.e. the JavaScript API), but it wouldn't be bothered (as much) by bandwidth and time constraints.
Just some ideas of how I would approach this, so ymmv.

One or two fields to represent "current step" and "finished or not" state?

Sorry if this question is too silly or neurotic... But I can't figure it out by myself. So I want to see how others deal with it.
My question is:
I want to write a program that shows the progress of some task, so I need to record which state it is currently in so that someone can check it at any time. There are two methods:
Use two fields to represent the progress state: step and is_finished.
Use just one field: step. For example, if the task needs 5 steps, then 6 means finished (and 0 means not started?).
Comparing the two methods:
Two fields:
Seems clearer. Most importantly, logically speaking, aren't the current step and whether the task is finished two separate concepts? I'm not sure about this.
When the task is finished, we set the is_finished field to true (or 1, if you like). But what do we do with the step field then? Increment it, or leave it untouched because it no longer has any meaning?
One field:
Simple and space-saving, but not very intuitive. For example, we don't know what 6 really means just by looking at this field, because it may represent either the finished state or an intermediate step. Other information, e.g. the total number of steps, is needed to decide. And this meaning is potentially unstable if the total number of steps changes (the is_finished field in the two-field method would not be affected by this).
So how would you deal with it? Thanks!
UPDATE:
I forgot some points in the previous post that may be useful:
The story is: we provide a web-based service for customers. (The service is time-limited, e.g. a 1-year term.) After a customer purchases it, our deployment program prepares the hardware (a virtual machine) and deploys some software, which takes some time to finish. We want to show progress info to the customer, and when deployment is finished, the customer should be informed.
Database design:
It needs a usage state field to represent: running normally, running but overdue (expired), and stopped. What confuses me is whether it should also include the "not deployed yet" and "deploying" states.
The progress info should include some other information, e.g. the start time, so we can tell how much time has elapsed since the start. But this information need not be persistent, because we won't care about it once deployment is finished. So I decided to store the progress info in a separate (temporary) table. I then think another field is needed in a more persistent table to tell whether deployment is done. Can we combine that into the usage state field mentioned above?
I like the one-field approach better, for the following reasons:
(Assuming you want to search on steps) you can "cover" all steps using only one simple index.
Should you ever want to attach some additional information to each of the steps, the one-field approach can easily accommodate a FOREIGN KEY towards a new table containing that information.
Requires slightly less storage space. Storage is cheap these days, but that's not the point - caching and network performance are.
Two-field approach:
(Assuming you want to search on steps) might require a "fatter" composite index or even two indexes (which takes space, lowers the cache effectiveness and incurs maintenance cost for INSERT/UPDATE/DELETE operations).
Requires a CHECK to defend the database from "impossible" combinations. Funnily enough, some DBMSes don't enforce CHECKs (I'm looking at you, MySQL before 8.0.16).
Requires slightly more storage space (and therefore slightly less of it fits into cache, takes up slightly more network bandwidth etc.).
NOTE: Should you choose to use NULLs, that could have "interesting" consequences under certain DBMSes (for example, Oracle doesn't index NULLs).
"For example, we don't know what 6 really means"
That doesn't really matter, as long as the client application knows what it means.
Design the database for applications, not humans.
"And potentially this meaning is not very stable if the total steps will change"
True, but you have the same problem with the two-field approach as well, if a new step is added in the "middle" of the existing steps.
Either UPDATE the table accordingly,
or never change the step values. For example, if step 5 is the last one, then a newly added step 6 is considered earlier despite having a greater value - your application (or the additional table I mentioned) will know the order of the steps, even if their values are not ordered. If you really want "order by value" without resorting to UPDATE, number the steps 10, 20, 30 etc., so you can insert new steps in the gaps (the old BASIC line number trick).
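As a rough illustration of the gap-numbering trick, here is a small sketch. The step codes and names are hypothetical (loosely based on the deployment scenario in the question), and the dictionary plays the role of the additional lookup table mentioned above.

```python
# Hypothetical step codes, spaced by 10 so new steps can be inserted in the gaps.
STEPS = {
    0: "not started",
    10: "prepare virtual machine",
    20: "install software",
    30: "configure service",
    40: "finished",
}


def add_step(steps, code, name):
    """Insert a new step without renumbering the existing ones."""
    if code in steps:
        raise ValueError(f"step code {code} is already in use")
    steps[code] = name


def progress(steps, current):
    """Return (position of the current step, position of the last step)."""
    ordered = sorted(steps)
    return ordered.index(current), len(ordered) - 1


def is_finished(steps, current):
    """With a single field, 'finished' is simply the highest step code."""
    return current == max(steps)
```

Calling add_step(STEPS, 25, "run migrations") slots a new step between 20 and 30, and rows that already store 30 or 40 keep their meaning without any UPDATE.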
It remains a matter of taste, but I would suggest the second option of a single int field, step. On inserting a new record, initialize the value of step to 0, which would indicate "not started yet". Any positive integer value would obviously denote the current step. As soon as the process is completed I would set step to NULL. As you correctly stated, this method does require solid documentation, but I think that it is not too confusing.

Is this a functional syncing algorithm?

I'm working on a basic syncing algorithm for a user's notes. I've got most of it figured out, but before I start programming it, I want to run it by here to see if it makes sense. Usually I end up not realizing one huge important thing that someone else easily saw that I couldn't. Here's how it works:
I have a table in my database where I insert objects called SyncOperation. A SyncOperation is a sort of metadata on the nature of what every device needs to perform to be up to date. Say a user has 2 registered devices, firstDevice and secondDevice. firstDevice creates a new note and pushes it to the server. Now, a SyncOperation is created with the note's Id, operation type, and processedDeviceList. I create a SyncOperation with type "NewNote", and I add the originating device ID to that SyncOperation's processedDeviceList. So now secondDevice checks in to the server to see if it needs to make any updates. It makes a query to get all SyncOperations where secondDeviceId is not in the processedDeviceList. It finds out its type is NewNote, so it gets the new note and adds itself to the processedDeviceList. Now this device is in sync.
When I delete a note, I find the already created SyncOperation in the table with type "NewNote". I change the type to Delete and remove all devices from processedDeviceList except for the device that deleted the note. So now when other devices call in to see what they need to update, since their deviceId is not in processedDeviceList, they'll have to process that SyncOperation, which tells their device to delete that respective note.
And that's generally how it'd work. Is my solution too complicated? Can it be simplified? Can anyone think of a situation where this wouldn't work? Will this be inefficient on a large scale?
Sounds very complicated - the central database shouldn't be responsible for determining which devices have received which updates. Here's how I'd do it:
The database keeps a table of SyncOperations, one for each change. Each SyncOperation has a change_id numbered in ascending order (that is, change_id INTEGER PRIMARY KEY AUTOINCREMENT).
Each device keeps a current_change_id number representing what change it last saw.
When a device wants to update, it does SELECT * FROM SyncOperations WHERE change_id > current_change_id. This gets it the list of all changes it needs to be up-to-date. Apply each of them in chronological order.
This has the charming feature that, if you wanted to, you could initialise a new device simply by creating a new client with current_change_id = 0. Then it would pull in all updates.
Note that this won't really work if two users can be doing concurrent edits (which edit "wins"?). You can try and merge edits automatically, or you can raise a notification to the user. If you want some inspiration, look at the operation of the git version control system (or Mercurial, or CVS...) for conflicting edits.
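Here is a minimal sketch of the change_id scheme above, using SQLite through Python's sqlite3 module. The op_type and note_id columns and the helper names (open_server, record_change, pull_changes) are assumptions for illustration, and conflict handling is left out entirely.

```python
import sqlite3


def open_server():
    """Create an in-memory database holding the SyncOperations table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE SyncOperations ("
        " change_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " op_type TEXT NOT NULL,"   # assumed column, e.g. 'NewNote' or 'Delete'
        " note_id TEXT NOT NULL)"   # assumed column identifying the note
    )
    return db


def record_change(db, op_type, note_id):
    """Append one change; the database assigns the next change_id."""
    cur = db.execute(
        "INSERT INTO SyncOperations (op_type, note_id) VALUES (?, ?)",
        (op_type, note_id),
    )
    db.commit()
    return cur.lastrowid


def pull_changes(db, current_change_id):
    """Return the changes a device has not seen yet, plus its new current_change_id."""
    rows = db.execute(
        "SELECT change_id, op_type, note_id FROM SyncOperations"
        " WHERE change_id > ? ORDER BY change_id",
        (current_change_id,),
    ).fetchall()
    new_id = rows[-1][0] if rows else current_change_id
    return rows, new_id
```

A device stores only the returned new_id between syncs; a brand-new device starts from 0 and receives the full history.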
You may want to take a look at SyncML for ideas on how to handle sync operations (http://www.openmobilealliance.org/tech/affiliates/syncml/syncml_sync_protocol_v11_20020215.pdf). SyncML has been around for a while, and as a public standard, has had a fair amount of scrutiny and review. There are also open source implementations (Funambol comes to mind) that can also provide some coding clues. You don't have to use the whole spec, but reading it may give you a few "ahah" moments about syncing data - I know it helped to think through what needs to be done.
Mark
P.S. A later version of the protocol - http://www.openmobilealliance.org/technical/release_program/docs/DS/V1_2_1-20070810-A/OMA-TS-DS_Protocol-V1_2_1-20070810-A.pdf
I have seen the basic idea of keeping track of operations in a database elsewhere, so I dare say it can be made to work. You may wish to think about what should happen if different devices are in use at much the same time, and end up submitting conflicting changes - e.g. two different attempts to edit the same note. This may surface as a change to the user interface, to allow them to intervene to resolve such conflicts manually.

"maxReceivedMessageSize" exceeded - Better way of handling this?

The data I'm trying to send out through a WCF service is a Northwind Database table (suppliers) which holds about 29 records, and it is already exceeding the maximum length of a message I can send. I've looked around for answers and everyone says the same thing: increase the "maxReceivedMessageSize" in the .config file.
However, this seems very wrong to me - it feels too much like a workaround rather than solving the issue (e.g. what if the data exceeds the maximum amount I can set it to?). Instead, is there a way to break up the message into chunks? The service itself is modeled by WSSF, so I'm having a hard time finding "where" the message is being serialized in the first place (I do not provide code since WSSF provides a very strict template to work on, as far as I'm aware).
Side-Note/Question: I have a "backup" plan where I can execute a stored command onto the database that only brings back 10 rows of data (at a specified starting point when calling the function). However, I would have to call the function that does this several times. Would this still be better than breaking the message into chunks?
I apologize for not displaying any code but I feel as though it will only cause more confusion. If it is necessary, then I will try and clear this question up to the best of my ability asap. Thank you for your contribution!
Provide Skip and Take properties on your request object to allow the client to control paging.
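The paging idea can be sketched independently of WCF. The snippet below is Python rather than C#, and PageRequest, get_page and fetch_all are hypothetical names, not part of any WCF or WSSF API; it only shows the shape of a Skip/Take request and the client loop that collects the pages.

```python
from dataclasses import dataclass


@dataclass
class PageRequest:
    """Request object carrying the paging window chosen by the client."""
    skip: int = 0
    take: int = 10


def get_page(rows, request):
    """Server side: return at most `take` rows, starting after `skip` rows."""
    if request.skip < 0 or request.take < 1:
        raise ValueError("skip must be >= 0 and take must be >= 1")
    return rows[request.skip : request.skip + request.take]


def fetch_all(rows, page_size=10):
    """Client side: keep requesting pages until an empty page comes back."""
    result = []
    skip = 0
    while True:
        page = get_page(rows, PageRequest(skip=skip, take=page_size))
        if not page:
            break
        result.extend(page)
        skip += len(page)
    return result
```

With 29 suppliers and a page size of 10, the client makes four calls (three pages with data and one empty page), and each message stays small regardless of how large the table grows.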

What is the life span of data?

Recently I’ve found myself in a database tangle where management wants the ability to remove data from the database, but still wants that data to appear in other places. Example: they want to remove all instances of the product whizbang, but they still want whizbang to appear in sales reports (if they run one for a previous date).
Now I can add a field, say is_deleted, that will track whether that product has been deleted and thus still keep all my references, but over a period of time, I have the potential of housing a lot of dead data. (data that is never accessed again). How to handle this is not my question.
I’m curious to find out, in your experience what is the average life span of data? That is, on average how long is data alive or good for before it gets either replaced or deleted? I understand that this is relative to the type of data you are housing, but certainly all data has some sort of life span?
Data lives forever...or often it should. One common practice is to have end and/or start dates for a record. So for your whizbang, you have a start date (so that it won't appear on sales reports before its official launch), and an end date (so that it drops off of reports after it's been end-of-lifed). Using the proper dates as criteria for your reporting as well as your applications, you won't see the whizbang except for when you should, and the data still exists (which it should, theoretically infinitely).
As Koistya Navin mentions, moving data to a data warehouse at a certain point is also an option, but this depends in large part on how large your 'old' data is, and how long you need to keep it readily available for access.
Many of our customers keep data online for 2 years. After that it's moved to backup disks, but it can be put online if needed.
Consider adding a column "expiration" or "effective date". This will allow you mark a product as obsolete, but reports will return that product if the time range is satisfied.
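A small sketch of how such date columns could drive reporting. The Product class, its field names and the example dates are made up for illustration; the point is only that a product shows up in a report when its active period overlaps the report's date range.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Product:
    name: str
    start_date: date
    end_date: Optional[date] = None  # None means the product is still active


def active_on(product, day):
    """True if the product was on sale on the given day."""
    if day < product.start_date:
        return False
    return product.end_date is None or day <= product.end_date


def products_for_report(products, report_start, report_end):
    """Names of products whose active period overlaps the report range."""
    return [
        p.name
        for p in products
        if p.start_date <= report_end
        and (p.end_date is None or p.end_date >= report_start)
    ]
```

A retired product is never physically deleted here: it drops out of current listings through active_on, yet still appears in any report covering a period before its end date.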
Usually it's better to move such data into a separate database (a data warehouse) and keep the working database clean. In a data warehouse your data can be kept for many years without impacting your application.
Reference: Data Warehouse at Wikipedia
I've always gone by what the governing body is looking for. For example, the IRS wants you to keep 7 years of history, or for security reasons we keep 3 years of log information, etc. So I guess you could do two things: determine what the life span of your data is (I would say 3 years would be enough), and then add the is_deleted flag along with a date, so that you can flag some data for deletion sooner rather than later.
Yes, all data has a lifespan. And yes, it is relative to the type of data you have.
Some data has a lifespan measured in seconds (authentication tokens, for instance); other data lasts virtually forever, outliving the media and formats it is stored in (ownership records, for instance).
You will have to either be more specific as to the type of data you are envisioning, or do a census in your own organization as to the usual lifespan of stuff.
Our particular flavor varies. We have some data (a vast majority) which goes stale after 3 months (hard product limit) but can be revived at any later date.
We have other data that is effectively immortal.
In practice, most of the data we serve up is fresh and frequently requested for a few weeks, at most a month, before falling to sporadic use.
How much is "a lot of dead data"?
With processing power and data storage so cheap, I wouldn't purge old data unless there's a really good reason to. You also need to consider the legal implications. Large (and even small) companies may have incredibly long retention policies for old data, to save themselves millions down the road when they are subpoenaed for it by a judge.
I would check with whatever legal department you have and find out how long the data needs to be stored. That's the safest bet.
Also, ask yourself what the benefit of removing the old data is. Is the only benefit a tidier database? If so, I wouldn't do it. Are you going to see a 10X performance increase? If so, I'd do it. This really is a complex question though, and it's tough for us to have all the information required to give you good advice.
I have a few projects where the customer wants all the historical data (going back over 19 years). Quite a bit of the really old data is malformed and is going to be a nightmare to import into the new system. We convinced them that they won't need records going back any further than 10 years, but like you said it's all relative to the type of data you're housing.
On a side note, data storage is extremely cheap right now, and if it isn't affecting the performance of your application, I would just leave it where it is.
"[...] but certainly all data has some sort of life span?"
Not any kind of life span we can talk about meaningfully. A lot of data is useless as soon as it's created or recorded. Such data could be discarded immediately with no effect. On the other hand, some data has enough value that it will outlive the current system that hosts it. If Amazon were to completely replace their current infrastructure, the customer histories they have stored would still be immensely valuable.
As you said, it's relative. Each type of data has its own life span that has no relation to another type of data's life span. There's no meaningful "average life span of data".
"I have the potential of housing a lot of dead data. (data that is never accessed again)."
But it will be accessed: when they run those reports, they are accessing that data.
Until then you'll need to keep the data in some form. Move to another table or have a switch like you mentioned.
uh...at the risk of oversimplifying...it sounds like using DateDeleted instead of a bit would solve your how-long-to-keep issue.
