Deletion / invalidation approaches for reference data

Based on the discussion I found here: Database: To delete or not to delete records, I want to focus on reference data in particular, add a few thoughts on that, and ask which approach you generally prefer, or which criteria you use to choose among the available approaches.
Let's assume the following data structure for a 'request database' for customers, where requests may arrive via various channels (phone, mail, fax, ...; the channel table is the 'reference data table' I mainly want to focus on):
Request (ID, Text, Channel_ID)
Channel(ID, Description)
To begin with, let's assume the following data in those two tables:
Request:
ID | Text                                        | Channel_ID
=============================================================
1  | How much is product A currently?            | 1
2  | What about my inquiry from 2011/02/13?      | 1
3  | Did you receive my payment from 2011/03/04? | 2
Channel:
ID | Description
================
1  | Phone
2  | Mail
3  | Fax
So, how do you approach this given the following requirements?
Channels may change over time. That is: their descriptions may change; new ones may be added that are valid only from a particular date; and channels may be invalidated (as of a particular date).
For reporting and monitoring purposes, it needs to be possible to identify the channel through which a request was originally filed.
For new requests, only the currently 'valid' channels should be allowed, whereas for pre-existing ones, the channels that were valid on the request's creation date should also be allowed.
In my understanding, that clearly calls for a richer invalidation approach that goes beyond a deletion flag, probably something incorporating 'ValidFrom / ValidTo' columns on the reference data table.
On the other hand, this complicates data capture of requests: for new requests, you display only the currently available channels, whereas for maintenance of pre-existing ones, all channels available as of the creation of that record need to be displayed. This might not only be complicated from a development point of view, but may also be non-intuitive to users.
How do you commonly set up your data model for reference data that might change over time? How do you then build your user interface? Which further parameters do you take into account for proper database design?

In such cases I usually create another table, for example channel_versions, that duplicates all fields from channel and has an extra create_date column (and its own PK, of course). On channel I define after insert/update triggers that copy the new values into channel_versions. All requests in the Request table then refer to records in channel_versions. For new requests you take the most recent version of each channel from channel_versions. For old requests you always know what the channel looked like when the request was fulfilled.
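A minimal sketch of this versioning idea, using Python's built-in sqlite3 module only so that it is self-contained. The column names version_id and channel_version_id, the trigger names, and the helper latest_version are assumptions made for illustration; the trigger syntax in your own database system may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE channel (
    id          INTEGER PRIMARY KEY,
    description TEXT NOT NULL
);
-- One row per historical state of a channel (names are illustrative).
CREATE TABLE channel_versions (
    version_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    channel_id  INTEGER NOT NULL REFERENCES channel(id),
    description TEXT NOT NULL,
    create_date TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Requests point at a specific version, not at the channel itself.
CREATE TABLE request (
    id                 INTEGER PRIMARY KEY,
    text               TEXT NOT NULL,
    channel_version_id INTEGER NOT NULL REFERENCES channel_versions(version_id)
);
CREATE TRIGGER channel_after_insert AFTER INSERT ON channel
BEGIN
    INSERT INTO channel_versions (channel_id, description)
    VALUES (NEW.id, NEW.description);
END;
CREATE TRIGGER channel_after_update AFTER UPDATE ON channel
BEGIN
    INSERT INTO channel_versions (channel_id, description)
    VALUES (NEW.id, NEW.description);
END;
""")


def latest_version(conn, channel_id):
    """Return the newest version_id recorded for a channel."""
    row = conn.execute(
        "SELECT version_id FROM channel_versions "
        "WHERE channel_id = ? ORDER BY version_id DESC LIMIT 1",
        (channel_id,),
    ).fetchone()
    return row[0]


conn.execute("INSERT INTO channel (id, description) VALUES (1, 'Phone')")
conn.execute(
    "INSERT INTO request (text, channel_version_id) VALUES (?, ?)",
    ("How much is product A currently?", latest_version(conn, 1)),
)

# Renaming the channel creates a second version; the old request keeps the first.
conn.execute("UPDATE channel SET description = 'Telephone' WHERE id = 1")
conn.execute(
    "INSERT INTO request (text, channel_version_id) VALUES (?, ?)",
    ("What about my inquiry from 2011/02/13?", latest_version(conn, 1)),
)

rows = conn.execute(
    "SELECT r.id, cv.description FROM request r "
    "JOIN channel_versions cv ON cv.version_id = r.channel_version_id "
    "ORDER BY r.id"
).fetchall()
print(rows)
```

The first request still reports the description that was current when it was captured, while the second one sees the renamed channel.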

Firestore: Running Complex Update Queries With Multiple Retrievals (ReactJS)

I have a grid of data whose endpoints are displayed from data stored in my firestore database. So for instance an outline could be as follows:
| Spent total: $150 |
| Item 1: $80 |
| Item 2: $70 |
So the values for all of these costs (70, 80 and 150) are stored in my firestore database, with the sub items being a separate collection from my total spent. Now, I want to be able to update the price of item 2 to, say, $90, which will then update Item 2's value in firestore, but I want this to then run a check against the table so that the "spent total" is also updated to "$170". What would be the best way to accomplish something like this?
Especially if I were to add multiple rows and columns that are all dependent on one another, what is the best way to update one part of my grid so that afterwards all of the data endpoints on the grid are updated correctly? Should I be using cloud functions somehow?
Additionally, I am creating a ReactJS app and previously in the app I just had my grid endpoints stored in my Redux store state so that I could run complex methods that checked each row and column and did some math to update each endpoint correctly, but what is the best way to do this now that I have migrated my data to firestore?
Edit: here are some pictures of how I am trying to set up my firestore layout currently:

You might want to back up a little and get a better understanding of the type of database that Firestore is. It's NoSQL, so things like rows and columns and tables don't exist.
Try this video: https://youtu.be/v_hR4K4auoQ
and this one: https://youtu.be/haMOUb3KVSo
But yes, you could use a cloud function to update a value for you, or you could make the new Spent total calculation within your app logic and when you write the new value for Item 2, also write the new value for Spent total.
But mostly, you need to understand how firestore stores your data and how it charges you to retrieve it. You are mostly charged for each read/write request, with much less concern for the actual amount of data you have stored overall. So it will probably be better to NOT keep these values in separate collections if you are always going to be utilizing them at the same time.
For example:
Collection(transactions) => Document(transaction133453) {item1: $80, item2: $70, spentTotal: $150}
and then if you needed to update that transaction, you would just update the values for that document all at once and it would only count as 1 write operation. You could store the transactions collection as a subcollection of a customer document, or simply as its own collection. But the bottom line is most of the best practices you would rely on for a SQL database with tables, columns, and rows are 100% irrelevant for a Firestore (NoSQL) database, so you must have a full understanding of what that means before you start to plan the structure of your database.
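To illustrate the "calculate in app logic, then write once" idea, here is a small sketch in plain Python. It does not call any Firestore API; the helper name with_updated_item is hypothetical, and the dictionary simply stands in for the single document you would send in one write.

```python
def with_updated_item(doc, item, new_price):
    """Return a new document with one item price changed and spentTotal recomputed."""
    # Everything except the derived total is treated as an item price.
    items = {key: value for key, value in doc.items() if key != "spentTotal"}
    items[item] = new_price
    return {**items, "spentTotal": sum(items.values())}


transaction = {"item1": 80, "item2": 70, "spentTotal": 150}
updated = with_updated_item(transaction, "item2", 90)
print(updated)
```

Because the item and the total travel in the same document, the change can be persisted as a single write instead of one write per collection.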
I hope this helps!! Happy YouTubing...
Edit in response to comment:
The way I like to think about it is how am I going to use the data as opposed to what is the most logical way to organize the data. I'm not sure I understand the context of your example data, but if I were maybe tracking budgets for projects or something, I might use something like the screenshots I pasted below.
Since I am likely going to have a pretty limited number of team members for each budget, that can be stored in an array within the document, along with ALL of the fields specific to that budget - basically anything that I might like to show in a screen that displays budget details, for instance. Because when you make a query to populate the data for that screen, if everything you need is all in one document, then you only have to make one request! But if you kept your "headers" in one doc and then your "data" in another doc, now you have to make 2 requests just to populate 1 screen.
Then maybe on that screen, I have a link to "View Related Transactions", if the user clicks on that, you would then call a query to your collection of transactions. Something like transactions is best stored in a collection, because you probably don't know if you are going to have 5 transactions or 500. If you wanted to show how many total transactions you had on your budget details page, you might consider adding a field in your budget doc for "totalTransactions: (number)". Then each time a user added a transaction, you would write the transaction details to the appropriate transactions collection, and also increase the totalTransactions field by 1 - this would be 2 writes to your db. Firestore is built around the concept that users are likely reading data way more frequently than writing data. So make two writes when you update your transactions, but only have to read one doc every time you look at your budget and want to know how many transactions have taken place.
Same for something like chats. But you would only make chats a subcollection of the budget document if you wanted to only ever show chats for one budget at a time. If you wanted all your chats to be taking place in one screen to talk about all budgets, you would likely want to make your chats collection at the root level.
As for getting your data from the document, it's basically a JSON object so (may vary slightly depending on what kind of app you are working in),
a nested array is referred to by:
documentName.arrayName[index]
budget12345.teamMembers[1]
a nested object:
documentName.objectName.fieldName
budget12345.projectManager.firstName
And then a subcollection is
collection(budgets).document(budget12345).subcollection(transactions)
[Screenshot: FirebaseExample budget doc]
[Screenshot: FirebaseExample remainder of budget doc]
[Screenshot: FirebaseExample team chats collection]
[Screenshot: FirebaseExample transactions collection]

Making a table with fixed columns versus key-valued pairs of metadata?

I was asked to create a table to store paid-hours data from multiple attendance systems from multiple geographies from multiple sub-companies. This table would be used for high level reporting so basically it is skipping the steps of creating tables for each system (which might exist) and moving directly to what the final product would be.
The request was to have a dimension for each type of hours or pay like this:
date       | employee_id | type          | hours | amount
2016-04-22 | abc123      | regular       | 80    | 3500
2016-04-22 | abc123      | overtime      | 6     | 200
2016-04-22 | abc123      | adjustment    | 1     | 13
2016-04-22 | abc123      | paid time off | 24    | 100
2016-04-22 | abc123      | commission    |       | 600
2016-04-22 | abc123      | gross total   |       | 4413
There are multiple rows per employee, but the thought process is that this will allow us to capture new dimensions if they are added.
The data is coming from several sources and I was told not to worry about the ETL, but just design the ultimate table and make it work for any system. We would provide this format to other people for them to fill in.
I have only seen the raw data from one system and it looks like this:
date | employee_id | gross_total_amount | regular_hours | regular_amount | OT_hours | OT_amount | classification | amount | hours
It is pretty messy. Multiple rows for employees and values like gross_total repeat each row. There is a classification column which has items like PTO (paid time off), adjustments, empty values, commission, etc. Because of repeating values, it is impossible to just simply sum the data up to make it equal the gross_total_amount.
Anyway, I would kind of prefer a column-based approach where each row describes the employee's paid hours for a cut-off. One problem is that I won't know all of the possible types of hours, so I can't necessarily make a table like:
date | employee_id | gross_total_amount | commission_amount | regular_hours | regular_amount | overtime_hours | overtime_amount | paid_time_off_hours | paid_time_off_amount | holiday_hours | holiday_amount
I am more used to data formatted that way though. The concern is that you might not capture all of the necessary columns or if something new is added. (For example, I know there is maternity leave, paternity leave, bereavement leave, in other geographies there are labor laws about working at night, etc)
Any advice? Is the table which was suggested to me from my superior a viable solution?

TAM makes lots of good points, and I have only two additional suggestions.
First, I would generate some fake data in the table as described above, and see if it can generate the required reports. Show your manager each of the reports based on the fake data, to check that they're OK. (It appears that the reports are the ultimate objective, so work back from there.)
Second, I would suggest that you get sample data from as many of the input systems as you can. This is to double check that what you're being asked to do is possible for all systems. It's not so you can design the ETL, or gather new requirements, just testing it all out on paper (do the ETL in your head). Use this to update the fake data, and generate fresh fake reports, and check the reports again.

Let me recapitulate what I understand to be the basic task.
You get data from different sources, having different structures. Your task is to consolidate them in a single database to be able to answer questions about all these data. I understand the hint "not to worry about the ETL, but just design the ultimate table" to mean that your consolidated database doesn't need to contain all the detail information that might be present in the original data, but just enough information to fulfill the specific requirements for the consolidated database.
This sounds sensible as long as your superior is certain enough about these requirements. In that case, you will reduce the information coming from each source to the consolidated structure.
In any case, you'll have to capture the domain semantics of the data coming in from each source. Lacking access to your domain semantics, I can't clarify the mess of repeating values etc. for you. E.g., if there are detail records and gross total records, as in your example, it would be wrong to add the hours of all records, as this would always yield twice the hours actually worked. So someone will have to worry about ETL, namely interpreting each set of records (probably all entries for one employee and one working day), finding out what they mean, and transforming them to the consolidated structure.
I understand another part of the question to be about the usage of metadata. You can have different columns for notions like holiday leave and maternity leave, or you have a metadata table containing these notions as a key-value pair, and refer to the key from your main table. The metadata way is sometimes praised as being more flexible, as you can introduce a new type (like paternity leave) without redesigning your database. However, you will need to redesign the software filling and probably also querying your tables to make use of the new type. So you'll have to develop and deploy a new software release anyway, and adding a few columns to a table will just be part of that development effort.
There is one major difference between a broad table containing all notions as attributes and the metadata approach. If you want to make sure that, for a time period, either all or none of the values are present, that's easy with the broad table: just declare all attributes NOT NULL, and you're done. Ensuring this for the metadata solution would mean some rather complicated constraint that may or may not be available depending on the database system you use.
If that's not a main requirement, I would go a pragmatic way and use different columns if I expect only a handful of those types, and a separate key-value table otherwise.
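As a rough sketch of the key-value (one row per type) layout, the following uses Python's built-in sqlite3 module and the sample figures from the question. The table name paid_hours and the choice of primary key are assumptions. It shows that a column-style report can still be produced from the long table with a pivot query, and that the gross total can be derived instead of stored, which avoids the double-counting problem mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE paid_hours (
        date        TEXT NOT NULL,
        employee_id TEXT NOT NULL,
        type        TEXT NOT NULL,   -- the "key" of the key-value pair
        hours       REAL,            -- NULL where a type has no hours
        amount      REAL,
        PRIMARY KEY (date, employee_id, type)
    )
""")

sample_rows = [
    ("2016-04-22", "abc123", "regular", 80, 3500),
    ("2016-04-22", "abc123", "overtime", 6, 200),
    ("2016-04-22", "abc123", "adjustment", 1, 13),
    ("2016-04-22", "abc123", "paid time off", 24, 100),
    ("2016-04-22", "abc123", "commission", None, 600),
]
conn.executemany("INSERT INTO paid_hours VALUES (?, ?, ?, ?, ?)", sample_rows)

# Pivot the long table into one row per employee and cut-off date.
report = conn.execute("""
    SELECT date,
           employee_id,
           SUM(CASE WHEN type = 'regular'  THEN hours END) AS regular_hours,
           SUM(CASE WHEN type = 'overtime' THEN hours END) AS overtime_hours,
           SUM(amount)                                     AS gross_total_amount
    FROM paid_hours
    GROUP BY date, employee_id
""").fetchall()
print(report)
```

A new type such as paternity leave only needs new rows here, but every report that wants it as its own column still needs an extra CASE expression, which mirrors the point that the software has to change either way.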
All these considerations relied on your superior's assertion (as I understand it) that your consolidated table will only need to fulfill the requirements known today, so you are free to throw original detail information away if it's not needed due to these requirements. I'm wary of that kind of assertion. Let's assume some of your information sources deliver additional information. Then it's quite probable that someday someone asks for a report also containing this information, where present. This won't be possible if your data structure only contains what's needed today.
There are two ways to handle this, i.e. to provide for future needs. You can, after knowing the data coming from each additional source, extend your consolidated database to cover all data structures coming from there. This requires some effort, as different sources might express the same concept using different data, and you would have to consolidate those to make the data comparable. Also, there is some probability that not all of your effort will be worth the trouble, as not all of the detail information you get will actually be needed for your consolidated database. Another more elegant way would therefore be to keep the original data that you import for each source, and only in case of a concrete new requirement, extend your database and reimport the data from the sources to cover the additional details. Prices of storage being low as they are, this might yield an optimal cost-benefit ratio.

Google App Engine Datastore - Keys vs. Identifiers

One decision that I have run into a few times is how to handle passing around either the key or embedded IDs of the entities. Each seems equally feasible given the encoders and marshalling methods built in with the datastore keys, but I was wondering if there is any sort of best practice on this choice. An example might be for a URL accessing a user’s files, where users have the default auto-generated numerical IDs, of the form: website.com/users/{userIdentifier}/files
I am trying to determine whether the number embedded in the datastore keys is preferable to the actual key strings themselves. Is it safe to have datastore keys out in the wild? I would like to standardize the way we handle those identifiers across our system and was wondering if there are any best practices on this.

The only reason to use a full Key as opposed to an identifier is to get the ancestor information embedded in the key itself without passing additional data. While this may be convenient in some cases, I don't think it's a big enough advantage to justify using keys as the standard method of reference within an app.
The advantages of using an identifier are more substantial: (a) they are much smaller, and (b) they do not reveal any information about their ancestors (which may or may not be an issue).
The smaller size comes into play quite often: you may want to use an id in a URL, hold a list of ids in a memcache (which has a 1MB limit), etc.

Datastore keys contain (at least) the following information:
Kind
Reference to ancestor
String or Int ID
Do you really need or want to pass the AppID and Kind in a URL, or keep them in your DB?
Compare these two URLs (in the case of a key, it would probably be encoded with urlsafe()):
/list-of-orders?user=123
/list-of-orders?user=User/123
Or these two fields:
Table: Orders
---------------------
| UserKey | UserID |
---------------------
| User/123 | 123 |
---------------------
Why would you want to keep and pass around repetitive information about the app and kind? Usually your app references its own entities, and the kind is known from the column or parameter name.
Unless you are building some orchestration/integration between a few apps, it's more effective to use just IDs.

Is it possible to make changes to WCF RIA entities on the server, send them to the client but not affect the underlying entities?

From reading the title it might seem like an odd request, so let me clarify.
I'm storing dates and times on the server alongside their time zone information. I want the clients to be able to request these objects with a parameter matching their required time zone and receive the objects with the appropriate data.
So say I have a table of Bookings for particular times. A couple of rows might look like
BookingId | When | TimeZone | Notes
1 | 2011-05-06 12:00:00.000 | GMT +12 | null
2 | 2011-05-06 08:00:00.000 | GMT +2 | null
The client would call something like GetBookings("Pacific Standard Time") and the resulting entity would be the above 2 tuples (probably without the time zone field) with their DateTimes adjusted such that the times are given in the client's time zone, with no additional time zone/offset information.
I know I could just do the time zone conversion on the client, but if I have multiple different clients I'm looking at duplicating this (somewhat tricky) code on multiple platforms, which I don't want to do.
The problem here is that if the server makes changes to these entities (which are backed by EF) then the changes are tracked by the ObjectContext. I'm sure there's a simple way around this?
The best solution I have thought of so far is a DTO for my Booking object, which I'd rather avoid but will implement if necessary.
Thanks.

Well, one approach is to simply create a new object of that class type, copy the data from your "real" object into it, and modify the new object's timestamp. Of course, you should not add this object to the ObjectContext :p. If you return this object instead, you should get the result you want.
A better solution would be to create a partial class for your entity class (mind you, it must be in the same namespace) and add a computed property. If you are using Silverlight, put [DataMemberAttribute()] on the property and populate it according to the desired time zone. I think that should be good to go.
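The "copy into a detached object" approach can be sketched independently of WCF RIA and EF. The following Python example is only an illustration of the idea: the class names Booking and BookingDto, the helper to_dto, and the use of fixed UTC offsets (so daylight saving time is ignored) are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Booking:
    """Stands in for the tracked entity: local time plus its UTC offset in hours."""
    booking_id: int
    when: datetime
    utc_offset_hours: int


@dataclass(frozen=True)
class BookingDto:
    """Detached copy sent to the client: time already in the client's zone."""
    booking_id: int
    when: datetime


def to_dto(booking, client_offset_hours):
    """Build a DTO with the time converted to the client's offset; the entity is untouched."""
    source = timezone(timedelta(hours=booking.utc_offset_hours))
    target = timezone(timedelta(hours=client_offset_hours))
    aware = booking.when.replace(tzinfo=source)
    # Strip the offset again so the client receives a plain local time.
    return BookingDto(booking.booking_id, aware.astimezone(target).replace(tzinfo=None))


bookings = [
    Booking(1, datetime(2011, 5, 6, 12, 0), 12),
    Booking(2, datetime(2011, 5, 6, 8, 0), 2),
]
# Assumption: the client asks for a fixed UTC-8 offset.
dtos = [to_dto(b, -8) for b in bookings]
for dto in dtos:
    print(dto)
```

Because the conversion produces new objects, nothing on the original entities changes, so there is nothing for a change tracker to pick up.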

Can I do transactions and locks in CouchDB?

I need to do transactions (begin, commit or rollback), locks (select for update).
How can I do it in a document model db?
Edit:
The case is this:
I want to run an auctions site.
And I am thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?

No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
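The retry loop described above can be sketched without a running CouchDB. In this Python example, TinyStore is a hypothetical in-memory stand-in that only mimics the revision check; it uses integer revisions for simplicity, whereas real CouchDB _rev values are strings. The function purchase_one and the max_attempts limit are likewise assumptions.

```python
class Conflict(Exception):
    """Raised when the supplied _rev does not match the stored one."""


class TinyStore:
    """In-memory stand-in for a document store with revision checks."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Return a copy so callers cannot modify stored state directly.
        return dict(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise Conflict(doc["_id"])
        new_rev = (current["_rev"] if current else 0) + 1
        self._docs[doc["_id"]] = {**doc, "_rev": new_rev}
        return new_rev


def purchase_one(store, doc_id, max_attempts=5):
    """Decrement quantity if it is above zero, retrying on revision conflicts."""
    for _ in range(max_attempts):
        doc = store.get(doc_id)        # step 1: fetch the document and its _rev
        if doc["quantity"] <= 0:
            return False               # sold out: tell the user
        doc["quantity"] -= 1           # step 2: decrement
        try:
            store.put(doc)             # step 3: send back with the same _rev
            return True
        except Conflict:
            continue                   # someone else won: re-read and retry
    raise RuntimeError("too many conflicting updates")


store = TinyStore()
store.put({"_id": "hammer", "quantity": 2})
stale = store.get("hammer")            # a copy that will soon be out of date

results = [purchase_one(store, "hammer") for _ in range(3)]

try:
    store.put(stale)                   # carries an old _rev
    stale_rejected = False
except Conflict:
    stale_rejected = True
print(results, stale_rejected)
```

The stale write is refused because its revision no longer matches, which is the property the whole approach relies on.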
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function (doc) {
    if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
        // The document identifier is doc._id (doc.id is undefined).
        emit(doc.product_key, { 'inventory_ticket': doc._id, '_rev': doc._rev });
    }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
Reduce
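function (keys, values, rereduce) {
    // On rereduce, values holds partial counts, so add them up instead of counting them.
    return rereduce ? sum(values) : values.length;
}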
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in the primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F

Expanding on MrKurt's answer: for lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much less contention on those tickets than if everyone tries to get the first ticket.
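A small sketch of the random-selection idea in Python. The dictionary of tickets and the function claim_ticket are hypothetical; the in-memory check stands in for the revision-checked update that a real store would perform.

```python
import random


def claim_ticket(tickets, buyer, rng=random):
    """Try unclaimed tickets in random order; return the claimed ticket id or None."""
    candidates = [ticket_id for ticket_id, owner in tickets.items() if owner is None]
    rng.shuffle(candidates)            # spread buyers across tickets
    for ticket_id in candidates:
        # Stand-in for an update that fails if someone else claimed it first.
        if tickets[ticket_id] is None:
            tickets[ticket_id] = buyer
            return ticket_id
    return None


tickets = {"hammer-1": None, "hammer-2": None, "hammer-3": None}
rng = random.Random(0)                 # seeded only to make the run repeatable
claims = [claim_ticket(tickets, "user-%d" % n, rng) for n in range(4)]
print(claims)
```

With three tickets and four buyers, the first three each get a distinct ticket in some random order and the fourth gets nothing.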

A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transaction, you must make sure to update the total for both accounts involved:
Create a transaction document "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system.
To resolve any tension scan for all transaction documents and
If the source account is not updated yet update the source account (-10 USD)
If the source account was updated but the transaction document does not show this then update the transaction document (e.g. set flag "sourcedone" in the document)
If the target account is not updated yet update the target account (+10 USD)
If the target account was updated but the transaction document does not show this then update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
The scan for tension should run in a backend process over all "tension documents" to keep the periods of tension in the system short. In the above example there will be a short, anticipated period of inconsistency when the first account has been updated but the second has not yet been. This must be taken into account in the same way you deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and determine the state of your system by evaluating every tension document involved. In the example above this would mean that the total for an account is determined only as the sum of the values in the transaction documents in which this account is involved. In CouchDB you can model this very nicely as a map/reduce view.
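The second variant, where balances are never stored but always derived from the transfer documents, can be sketched in Python as follows. The field names source, target and amount and the second transfer are assumptions; in CouchDB the same aggregation would live in a map/reduce view.

```python
from collections import defaultdict


def balances(transfers):
    """Derive each account's net total purely from the transfer documents."""
    totals = defaultdict(int)
    for transfer in transfers:
        totals[transfer["source"]] -= transfer["amount"]   # money leaves the source
        totals[transfer["target"]] += transfer["amount"]   # and arrives at the target
    return dict(totals)


transfers = [
    {"source": "11223", "target": "88733", "amount": 10},
    {"source": "88733", "target": "11223", "amount": 4},   # hypothetical second transfer
]
result = balances(transfers)
print(result)
```

Since every transfer contributes to both accounts in the same pass, the derived totals always sum to zero and there is no intermediate state to repair.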

No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee you only have one CouchDB instance or that everyone modifying a particular document connects to the same CouchDB instance then you could use the conflict detection system to create a sort of atomicity using methods described above but if you later scale up to a cluster or use a hosted service like Cloudant it will break down and you'll have to redo that part of the system.
So, my suggestion would be to use something other than CouchDB for your account balances, it will be much easier that way.

As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem is the race condition between reading the result of a view, deciding you're OK to use a "hammer-1" item, and then writing a doc to use it. There's no atomic way to write the doc that uses the hammer only if the view says there are > 0 hammer-1's. If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer-1, resulting in -99 hammer-1's. In practice, the window for the race condition will be fairly small, and really small if your DB is running on localhost. But once you scale and have an off-site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical, money-related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view. You get both of those for free when you retrieve view results. Emitting them - especially in a verbose format like a dictionary - will just grow your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
function (doc) {
    if (doc.InventoryChange != undefined) {
        for (var product_key in doc.InventoryChange) {
            // Emit the change itself so that _sum yields the running total.
            emit(product_key, doc.InventoryChange[product_key]);
        }
    }
}
And the reduce function is even simpler
_sum
This uses a built in reduce function that just sums the values of all rows with matching keys.
In this view, any doc can have a member "InventoryChange" that maps product_key's to a change in the total inventory of them. ie.
{
    "_id": "abc123",
    "InventoryChange": {
        "hammer_1234": 10,
        "saw_4321": 25
    }
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
    "_id": "def456",
    "InventoryChange": {
        "hammer_1234": -5
    }
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them. You might have a "Shipment" document with a bunch of data about the date and time received, warehouse, receiving employee etc. and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, and a "DamagedItem" doc etc. Looking at each document, they read very clearly. And the view handles all the hard work.
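To check the arithmetic of this append-only model without a CouchDB instance, here is a Python sketch that performs the same aggregation as the map function plus _sum over the two example documents. The function inventory_totals and the third document (which has no InventoryChange) are assumptions added for illustration.

```python
from collections import Counter


def inventory_totals(docs):
    """Sum every InventoryChange entry per product key, like map + _sum would."""
    totals = Counter()
    for doc in docs:
        # Documents without an InventoryChange member simply contribute nothing.
        for product_key, change in doc.get("InventoryChange", {}).items():
            totals[product_key] += change
    return dict(totals)


docs = [
    {"_id": "abc123", "InventoryChange": {"hammer_1234": 10, "saw_4321": 25}},
    {"_id": "def456", "InventoryChange": {"hammer_1234": -5}},
    {"_id": "note789", "text": "unrelated document"},   # hypothetical extra doc
]
totals = inventory_totals(docs)
print(totals)
```

Each sale or shipment is a new document, so the totals change only by appending, never by editing an existing record.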

Actually, you can in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically you can create/update/delete a bunch of documents in a single post request to URI /{dbname}/_bulk_docs and they will either all succeed or all fail. The document does caution that this behaviour may change in the future, though.
EDIT: As predicted, from version 0.9 onward bulk docs no longer work this way.

Just use a lightweight solution such as SQLite for the transactions; when a transaction completes successfully, replicate it and mark it as replicated in SQLite.
SQLite table
txn_id      | txn_attribute1 | txn_attribute2 | ... | txn_status
dhwdhwu$sg1 | x              | y              | ... | added/replicated
You can also delete the transactions which are replicated successfully.
