I'm looking for some thoughts about how to design a digest email feature. I'm not concerned about the actual business code; instead I'd like to focus on the gist of it.
Let's tackle this with a known example: articles. Here's a general overview of some important features:
The user is able to choose the digest frequency (e.g. daily or weekly);
The digest only contains new articles;
"New articles" are to be considered relative to the previous digest that was sent to a specific user;
I've been thinking about the following:
Introduce per-user tracking of articles previously included in a digest and filter those out?
Requires a new database table;
Could become expensive when the table contains millions of rows;
What to do in case of including multiple types of models in the digest? Multiple tracking tables? Polymorphic table? ...?
Use article creation dates to include articles between current date and the chosen digest frequency?
Uses current date and information already present in the database, so no new tables required;
What happens when a user changes from daily to weekly emails? He could receive the same article again in the weekly digest. Should this edge case be considered? If so, how to mitigate?
For some reason the creation date of an article might get updated to today, making the date comparison match it again. Should this edge case be considered? If so, how to mitigate?
Or can you think of other ways to implement this feature?
I'm eager to learn your insights.
You can add a table that stores information about each user's digest subscription. This keeps the database design cleaner and more universal, because mailing is a separate logical module, and it also makes it easy to extend the stored subscription data in the future. For example:
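A minimal sketch of what such a table could look like (MySQL-style syntax; the table and column names, and the assumption of a users table with an id column, are illustrative, not a prescribed schema):

CREATE TABLE digest_subscription (
    user_id                INT         NOT NULL PRIMARY KEY,
    interval_type          VARCHAR(10) NOT NULL,  -- 'daily' or 'weekly'
    last_date_distribution DATETIME    NULL,      -- when the last digest was sent to this user
    created_at             DATETIME    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT fk_digest_subscription_user FOREIGN KEY (user_id) REFERENCES users (id)
);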
With this table you can manage the data easily. For example, you can select all recipients of the daily digest:
SELECT *
FROM digest_subscription
WHERE interval_type = 'daily'
AND last_date_distribution <= NOW() - INTERVAL 1 DAY
or select all recipients of the weekly digest:
SELECT *
FROM digest_subscription
WHERE interval_type = 'weekly'
AND last_date_distribution <= NOW() - INTERVAL 7 DAY
Filtering by interval type and comparing the last distribution date with "less than or equal" helps you avoid problems with untimely sending of emails (for example, after technical failures on a server, etc.).
You can also build the correct article list using the last distribution date. Relying on that date avoids the problems caused by an interval change. For example:
SELECT *
FROM articles
WHERE created_at >= <the last date distribution of the user>
Of course, this doesn't avoid the problem of an updated creation date, but you should minimize the reasons for that happening. For example, your code can update the modification date, but it should never modify the creation date.
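Putting the two pieces together, the send step for one user could look roughly like this (a sketch only; the user id and table names are placeholders):

-- New articles for user 123 since their last digest:
SELECT a.*
FROM   articles a
WHERE  a.created_at >= (SELECT last_date_distribution
                        FROM   digest_subscription
                        WHERE  user_id = 123);

-- After the email has been sent successfully, advance the marker:
UPDATE digest_subscription
SET    last_date_distribution = NOW()
WHERE  user_id = 123;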
[Note: There is a Teacher Object with the fields such as Teacher Name, DateofJoining, and also a formula field called Experience]
My task was to create a Public Group consisting of another user,
and this user should only see teachers who have experience greater than 2 years.
But when I create a criteria-based sharing rule, the field called Experience doesn't show up, as it is a formula field.
So I had the idea of creating a new field (maybe a text or number data type) which would hold the value of Experience. (But I have no idea how to implement this.)
Is there a way to implement this?
Any other solution is also well appreciated!
Hard to say.
The normal trick would be to create a helper field (text, number, whatever) and have a piece of functionality that populates it: ideally an "early flow" ("before insert, before update"); worst case a normal flow, Process Builder, or an "after insert, after update" trigger. Something like "if Experience__c != 'your formula here' then Experience__c = 'your formula here'". Consult the normal SF help and Trailhead if you have never used early flows.
You'd make a one-off data fix to populate existing records and the job is done; a normal field should be selectable as sharing rule criteria.
=====
But I smell trouble with your formula. What exactly do you have there, something like Experience__c = (TODAY() - DateofJoining__c) / 365? That's a bit evil. Formulas with TODAY(), NOW() or anything with $ (roughly speaking, anything about who's looking at the data: the user's name, profile, role... not what's actually on the record itself) are "nondeterministic". Unpredictable.
A "TODAY()" changes just like that, without updating the record. Sure, when you view the record a fresh value will be calculated, but other than that LastModifiedDate doesn't change, and there's no magical trigger running at midnight that rechecks sharing (especially since there's no single midnight; you could have users in multiple timezones). SF just doesn't allow nondeterministic fields in many places, see https://salesforce.stackexchange.com/q/32122/799
So if you do rely on TODAY() in your formula, you might have to make a "scheduled flow" or read about schedulable, batchable Apex: create a nightly job that recalculates your helper field with the right experience. You'd probably even need both solutions, a "before save" flow for new data created today and a nightly job to advance the clock on existing old data...
We have a booking system where tens of thousands of reservations are made every day. Because a customer can create a reservation without being logged in, a new customer id/row is created for every reservation, even if the very same customer has already booked in the system before. That results in a lot of duplicate customers.
The engineering team has decided that, in order to deduplicate the customers, they will run a nightly script, every day, which checks for these duplicates based on some business rules (email, address, etc.). The logic for the deduplication is:
If a new reservation is created, check whether the (newly created) customer for this reservation already has an old customer id (by comparing email and other aspects).
If it has one or more old reservations, detach that reservation from the old customer id and link it to the new customer id, literally by changing the customer ID of that old reservation to the newly created customer.
I don't have a very strong technical background, but to me this smells like terrible design. As we have several operational applications relying on that data, this creates a massive sync issue. Besides that, I was hoping to understand why exactly, in terms of application architecture, this is bad design, and what would be a better solution for this deduplication problem (if it even has to be solved in "this" application domain).
I would very much appreciate any help so I can steer the engineering team in the right direction.
In General
What's the problem you're trying to solve? Free up disk space, get accurate analytics of user behavior, or be more user friendly?
It feels a bit risky, and depends on how critical it is that you get the re-matching 100% correct. You need to ask "what's the worst that can happen?" and "does this open the system to abuse" - not because you should be paranoid, but because to not think that through feels a bit negligent. E.g. if you were a govt department matching private citizen records then that approach would be way too cavalier.
If the worst that can happen is not so bad, and the 80% you get right gets you the outcome you need, then maybe it's ok.
If there's not a process for validating the identity of the user then by definition your customer id/row is storing sessions, not Customers.
In terms of the nightly job - If your backend system is an old legacy system then I can appreciate why a nightly batch job might be the easiest option; that said, if done correctly and with the right architecture, you should be able to do that check on the fly as needed.
Specifics
...check whether the (newly created) customer for this reservation already has an old customer id (by comparing email...
Are you validating the email - e.g. by getting users to confirm it through a confirmation email mechanism? If yes, and if email is a mandatory field, then this feels ok, and you could probably use the email exclusively.
... and other aspects.
What are those? Sometimes getting more data just makes it harder unless there's good data hygiene in place. E.g. what happens if you're checking phone numbers (and other data) and someone makes a typo in the phone number which matches some other customer, so you simultaneously match more than one customer?
If it has one or more old reservations, detach that reservation from the old customer id and link it to the new customer id, literally by changing the customer ID of that old reservation to the newly created customer.
Feels dangerous. What happens if the detaching process screws up? I've seen situations where instead of updating the delta, the system did a total purge then a full re-import... when the second part fails, the entire system is blank. It's not your exact situation, but you are creating the possibility for similar types of issues.
As we have several operational applications relying on that data, this creates a massive sync issue.
...case in point.
In your case, doing the swap in a transaction would be wise. You may want to consider tracking all Cust ID swaps so that you can revert if something goes wrong.
Option - Phased Introduction Based on Testing
You could try this:
Keep the system as-is for now.
Add the logic which does the checks you are proposing, but have it create trial data on the side - i.e. don't change the real records, just make a copy that is what the new data would be. Do this in production - you'll get a way better sample of data.
Run extensive tests over the trial data, looking for instances where you got it wrong. What's more likely to work, and what you could consider building, is a "scoring" algorithm (see the sketch after this list). If you are checking more than one piece of data then you'll get different combinations with different likelihoods of accuracy. You can use this to gauge how good your matching is. You can then decide in which circumstances it's safe to do the ID switch and when it's not.
Once you're happy, implement as you see fit - either just the algorithm & result, or the scoring harness as well so you can observe its performance over time - especially if you introduce changes.
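As a rough illustration of the scoring idea, a candidate-match query could look like the sketch below. The table, column names, weights, threshold, and the example customer id are all assumptions for illustration, not your actual schema:

-- Score possible duplicates of a newly created customer (id 12345 is an example).
SELECT *
FROM (
    SELECT c_old.customer_id,
           CASE WHEN c_old.email = c_new.email THEN 50 ELSE 0 END
         + CASE WHEN c_old.phone = c_new.phone THEN 30 ELSE 0 END
         + CASE WHEN c_old.postal_code = c_new.postal_code
                 AND c_old.last_name  = c_new.last_name THEN 20 ELSE 0 END AS match_score
    FROM   customers AS c_old
    JOIN   customers AS c_new
      ON   c_new.customer_id = 12345
     AND   c_old.customer_id <> c_new.customer_id
) AS scored
WHERE match_score >= 70   -- tune this threshold against the trial data
ORDER BY match_score DESC;

Recording the score alongside each trial match makes it easy to review the borderline cases by hand before you ever touch a real record.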
Alternative Customer/Session Approach
Treat all bookings (excluding personal details) as bookings, with customers (little c, i.e. Sessions) but without Customers.
Allow users to optionally be validated as "Customers" (big C).
Bookings created by a validated Customer then link to each other. All bookings relate to a customer (session) which never changes, so you have traceability.
I can tweak the answer once I know more about what problem it is you are trying to solve - i.e. what your motivations are.
I wouldn't say it's a terrible design; it's just a simple approach to solving this particular problem, with some room for improvement. It's not optimal because the runtime of that job depends on the number of new bookings received during the day, which may vary from day to day, so other workflows that depend on it will be impacted.
This approach can be improved by processing new bookings in parallel and by using an index to get a fast lookup when checking whether a new e-mail already exists.
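For example, assuming a customers table with an email column (names are illustrative), a plain index makes that existence check cheap:

-- Index the column used for duplicate detection.
CREATE INDEX idx_customers_email ON customers (email);

-- Fast lookup when a new booking comes in:
SELECT customer_id
FROM   customers
WHERE  email = 'alice@example.com';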
You can also check out Bloom Filters - an efficient data structure that is able to tell you if an element is not in a given set.
The way I would do it is to store the bookings in a NoSQL DB table keyed off the user email. You get the user email in both situations, whether the customer has an account or makes a booking without one, so you just have to do a lookup to get the bookings by email, which makes that deduplication job redundant.
Let's say I want to model a graph with sales people. They belong to an organisation, have a manager, etc. They are assigned to specific territories and/or client accounts. Your company may work with external partners, which must be managed, and so on. A nice, non-trivial graph.
Elements in this graph keep changing all the time: sales people come and go, or move within the organisation and thus change responsibilities; customers sign contracts or cancel them, ...
In my specific use cases, the point in time is very important. What did the graph look like at the end of last month? At the end of the last fiscal year? Last Monday when we ran job ABC? E.g. what was the manager hierarchy at the end of last month? Which clients did the sales person manage at the end of last month? And so on.
In our use cases, DELETE doesn't delete anything; instead, some sort of end_date gets updated. UPDATE doesn't update anything; instead, a new version of the record is created.
I'm sure I can add CREATED and START-/END_DATE properties to nodes as well as relations, and of course I can also write the queries. But these queries are a pain to write and almost unreadable, with tons of repeating where clauses everywhere.
I wish graph databases (and their graphical query builders) would let me travel in time more easily, e.g. by setting a session variable to a point in time, with all the where clauses added automatically for all nodes and relations that have the start/end date properties. The algorithm should not fail for objects that don't have these properties, but consider the condition met.
What are your thoughts about this use case, and what help does memgraph provide for these use cases?
thanks a lot
Juergen
As far as I am aware, there is not any graph database that directly supports the type of functionality you are asking about, although, as #buda points out, you can model and query against time series data. I agree with #buda that the way you would like this to work seems a bit undefined and very application specific, so I would not expect it to be a feature of any database.
The closest thing I can think of to out-of-the-box support for something like this would be to use a TinkerPop-enabled database with a PartitionStrategy or SubgraphStrategy to create a subgraph of only the times you want and then query against that. Another option would be creating a domain-specific language to minimize the amount of repeated code in your queries.
PartitionStrategy
SubgraphStrategy
Domain Specific Languages
I am designing a database that must maintain a history of employee salaries and movements within the organization. Basically, my design has 3 tables (I mean, there are more tables, but for this question I'll mention 3, so bear with me): an Employee table (containing the most current salary, position data, etc.), a SalaryHistory table (salary, date, reason, etc.) and a MovementHistory table (title, dept., comments). I'll be using LINQ to SQL, so what I was thinking is that every time employee data is updated, the old values will be copied to their respective history tables. Is this a good approach? Should I just do it using LINQ to SQL or triggers? Thanks for any help, suggestion or idea.
Have a look at http://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture .
Basically, the article suggests that you have the following columns in the tables you need to track history for -
* DateCreated – the actual date on which the given row was inserted.
* DateEffective – the date on which the given row became effective.
* DateEnd – the date on which the given row ceased to be effective.
* DateReplaced – the date on which the given row was replaced by another row.
* OperatorCode – the unique identifier of the person (or system) that created the row.
DateEffective and DateEnd together tell you the time for which the row was valid (or the time for which an employee was in a department, or the time for which he earned a particular salary).
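As a sketch of how those columns might look on the SalaryHistory table from the question (SQL Server-style syntax; the names and the sample date are illustrative assumptions):

CREATE TABLE SalaryHistory (
    SalaryHistoryId INT IDENTITY(1,1) PRIMARY KEY,
    EmployeeId      INT            NOT NULL,
    Salary          DECIMAL(10, 2) NOT NULL,
    DateCreated     DATETIME       NOT NULL DEFAULT GETDATE(), -- when the row was inserted
    DateEffective   DATETIME       NOT NULL,                   -- when the row became effective
    DateEnd         DATETIME       NULL,                       -- when the row ceased to be effective
    DateReplaced    DATETIME       NULL,                       -- when a newer row superseded it
    OperatorCode    VARCHAR(50)    NOT NULL                    -- who or what created the row
);

-- "What was employee 42's salary on 2019-12-31?"
SELECT Salary
FROM   SalaryHistory
WHERE  EmployeeId = 42
  AND  DateEffective <= '2019-12-31'
  AND  (DateEnd IS NULL OR DateEnd > '2019-12-31');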
It is a good idea to keep that logic internal to the database: that's basically why triggers exist. I say this carefully, however, as there are plenty of reasons to keep it external. Often, especially with a technology as easy as LINQ to SQL, it is easier to write the code externally. In my experience, more people can write that logic correctly in C#/LINQ than can do it correctly with a trigger.
Triggers are fast - they're compiled! However, they're very easy to misuse and can make your logic so overcomplex that performance degrades rapidly. Considering how simple your use case is, I would opt for triggers, but that's me personally.
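If you do go the trigger route, a rough SQL Server-style sketch could look like this, using the simple (salary, date, reason) columns mentioned in the question; the table and column names are assumptions, not your real schema:

CREATE TRIGGER trg_Employee_SalaryHistory
ON Employee
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy the previous salary into the history table for every employee whose salary changed.
    INSERT INTO SalaryHistory (EmployeeId, Salary, ChangeDate, Reason)
    SELECT d.EmployeeId, d.Salary, GETDATE(), 'Superseded by update to Employee'
    FROM   deleted d
    JOIN   inserted i ON i.EmployeeId = d.EmployeeId
    WHERE  i.Salary <> d.Salary;
END;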
Triggers will likely be faster, and don't require a "middle man" to get the job done, eliminating at least one chance for errors.
Depending on your database of choice, you can just use one table, enable OIDs on it, and add two more columns: "flag" and "previous". Never update this table, only insert. Add a trigger so that when a row is added for employee #id, all existing records for employee #id get a flag of "old" and the new row's "previous" value points to the previous row.
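A rough sketch of that single-table idea (PostgreSQL-style, skipping the OID detail in favour of a plain surrogate key; all names are illustrative):

CREATE TABLE employee_salary (
    id          SERIAL PRIMARY KEY,
    employee_id INT            NOT NULL,
    salary      NUMERIC(10, 2) NOT NULL,
    flag        VARCHAR(10)    NOT NULL DEFAULT 'current',          -- 'current' or 'old'
    previous_id INT            NULL REFERENCES employee_salary(id), -- row this one replaces
    created_at  TIMESTAMP      NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE OR REPLACE FUNCTION mark_old_salary() RETURNS trigger AS $$
BEGIN
    -- Remember which row is being superseded, then flag all earlier rows as old.
    SELECT id INTO NEW.previous_id
      FROM employee_salary
     WHERE employee_id = NEW.employee_id AND flag = 'current'
     ORDER BY created_at DESC
     LIMIT 1;

    UPDATE employee_salary
       SET flag = 'old'
     WHERE employee_id = NEW.employee_id AND flag = 'current';

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER employee_salary_history
BEFORE INSERT ON employee_salary
FOR EACH ROW EXECUTE FUNCTION mark_old_salary();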
I think this belongs in the database for two reasons.
First, middle tiers come and go, but databases are forever. This year Java EJBs, next year .NET, the year after that something else. The data remains, in my experience.
Second, if the database is shared at all it should not have to rely on every application that uses it to know how to maintain its data integrity. I would consider this an example of encapsulation of the database. Why force knowledge and maintenance of the history on every client?
Triggers make your front-end easier to migrate to something else and they will keep the database consistent no matter how data is inserted/updated/removed.
Besides, in your case I would write the salaries straight to the salary history - from your description I don't see a reason why you should go the way of an update trigger on the employee table.
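In other words (a sketch assuming the SalaryHistory naming from the question; the column names and sample values are made up), every salary change is written to the history, and the current salary is simply the latest row:

-- Record a salary change directly in the history.
INSERT INTO SalaryHistory (EmployeeId, Salary, ChangeDate, Reason)
VALUES (42, 55000.00, GETDATE(), 'Annual review');

-- The current salary is the most recent history row per employee.
SELECT sh.EmployeeId, sh.Salary
FROM   SalaryHistory sh
WHERE  sh.ChangeDate = (SELECT MAX(s2.ChangeDate)
                        FROM   SalaryHistory s2
                        WHERE  s2.EmployeeId = sh.EmployeeId);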
I need to do transactions (begin, commit or rollback) and locks (select for update).
How can I do that in a document model DB?
Edit:
The case is this:
I want to run an auctions site.
And I'm thinking about how to handle direct purchases as well.
In a direct purchase I have to decrement the quantity field in the item record, but only if the quantity is greater than zero. That is why I need locks and transactions.
I don't know how to address that without locks and/or transactions.
Can I solve this with CouchDB?
No. CouchDB uses an "optimistic concurrency" model. In the simplest terms, this just means that you send a document version along with your update, and CouchDB rejects the change if the current document version doesn't match what you've sent.
It's deceptively simple, really. You can reframe many normal transaction based scenarios for CouchDB. You do need to sort of throw out your RDBMS domain knowledge when learning CouchDB, though. It's helpful to approach problems from a higher level, rather than attempting to mold Couch to a SQL based world.
Keeping track of inventory
The problem you outlined is primarily an inventory issue. If you have a document describing an item, and it includes a field for "quantity available", you can handle concurrency issues like this:
Retrieve the document, take note of the _rev property that CouchDB sends along
Decrement the quantity field, if it's greater than zero
Send the updated document back, using the _rev property
If the _rev matches the currently stored number, be done!
If there's a conflict (when _rev doesn't match), retrieve the newest document version
In this instance, there are two possible failure scenarios to think about. If the most recent document version has a quantity of 0, you handle it just like you would in a RDBMS and alert the user that they can't actually buy what they wanted to purchase. If the most recent document version has a quantity greater than 0, you simply repeat the operation with the updated data, and start back at the beginning. This forces you to do a bit more work than an RDBMS would, and could get a little annoying if there are frequent, conflicting updates.
Now, the answer I just gave presupposes that you're going to do things in CouchDB in much the same way that you would in an RDBMS. I might approach this problem a bit differently:
I'd start with a "master product" document that includes all the descriptor data (name, picture, description, price, etc). Then I'd add an "inventory ticket" document for each specific instance, with fields for product_key and claimed_by. If you're selling a model of hammer, and have 20 of them to sell, you might have documents with keys like hammer-1, hammer-2, etc, to represent each available hammer.
Then, I'd create a view that gives me a list of available hammers, with a reduce function that lets me see a "total". These are completely off the cuff, but should give you an idea of what a working view would look like.
Map
function(doc)
{
    // Emit one row per unclaimed inventory ticket, keyed by product.
    if (doc.type == 'inventory_ticket' && doc.claimed_by == null) {
        emit(doc.product_key, { 'inventory_ticket' : doc._id, '_rev' : doc._rev });
    }
}
This gives me a list of available "tickets", by product key. I could grab a group of these when someone wants to buy a hammer, then iterate through sending updates (using the id and _rev) until I successfully claim one (previously claimed tickets will result in an update error).
Reduce
function (keys, values, rereduce) {
    // Count unclaimed tickets; when rereducing, sum the partial counts.
    return rereduce ? sum(values) : values.length;
}
This reduce function simply returns the total number of unclaimed inventory_ticket items, so you can tell how many "hammers" are available for purchase.
Caveats
This solution represents roughly 3.5 minutes of total thinking for the particular problem you've presented. There may be better ways of doing this! That said, it does substantially reduce conflicting updates, and cuts down on the need to respond to a conflict with a new update. Under this model, you won't have multiple users attempting to change data in primary product entry. At the very worst, you'll have multiple users attempting to claim a single ticket, and if you've grabbed several of those from your view, you simply move on to the next ticket and try again.
Reference: https://wiki.apache.org/couchdb/Frequently_asked_questions#How_do_I_use_transactions_with_CouchDB.3F
Expanding on MrKurt's answer. For lots of scenarios you don't need to have stock tickets redeemed in order. Instead of selecting the first ticket, you can select randomly from the remaining tickets. Given a large number of tickets and a large number of concurrent requests, you will get much less contention on those tickets than when everyone tries to get the first ticket.
A design pattern for RESTful transactions is to create a "tension" in the system. For the popular example use case of a bank account transaction, you must ensure that you update the totals of both involved accounts:
Create a transaction document "transfer USD 10 from account 11223 to account 88733". This creates the tension in the system.
To resolve any tension, scan all transaction documents and:
If the source account is not updated yet update the source account (-10 USD)
If the source account was updated but the transaction document does not show this then update the transaction document (e.g. set flag "sourcedone" in the document)
If the target account is not updated yet update the target account (+10 USD)
If the target account was updated but the transaction document does not show this then update the transaction document
If both accounts have been updated you can delete the transaction document or keep it for auditing.
The scanning for tension should be done in a backend process across all "tension documents" to keep the periods of tension in the system short. In the example above, there will be a short period of anticipated inconsistency when the first account has been updated but the second has not yet. This must be taken into account the same way you deal with eventual consistency if your CouchDB is distributed.
Another possible implementation avoids the need for transactions completely: just store the tension documents and evaluate the state of your system by evaluating every involved tension document. In the example above, this would mean that the total for an account is determined only as the sum of the values in the transaction documents in which that account is involved. In CouchDB you can model this very nicely as a map/reduce view.
No, CouchDB is not generally suitable for transactional applications because it doesn't support atomic operations in a clustered/replicated environment.
CouchDB sacrificed transactional capability in favor of scalability. In order to have atomic operations you need a central coordination system, which limits your scalability.
If you can guarantee that you only have one CouchDB instance, or that everyone modifying a particular document connects to the same CouchDB instance, then you could use the conflict detection system to create a sort of atomicity using the methods described above; but if you later scale up to a cluster or use a hosted service like Cloudant, it will break down and you'll have to redo that part of the system.
So, my suggestion would be to use something other than CouchDB for your account balances, it will be much easier that way.
As a response to the OP's problem, Couch is probably not the best choice here. Using views is a great way to keep track of inventory, but clamping to 0 is more or less impossible. The problem is the race condition between reading the result of a view, deciding you're OK to use a "hammer-1" item, and then writing a doc to use it. There's no atomic way to write the doc that uses the hammer only if the view says there are > 0 hammer-1's. If 100 users all query the view at the same time and see 1 hammer-1, they can all write a doc to use a hammer-1, resulting in -99 hammer-1's. In practice, the race condition window will be fairly small - really small if your DB is running on localhost. But once you scale and have an off-site DB server or cluster, the problem will get much more noticeable. Regardless, it's unacceptable to have a race condition of that sort in a critical, money-related system.
An update to MrKurt's response (it may just be dated, or he may have been unaware of some CouchDB features)
A view is a good way to handle things like balances / inventories in CouchDB.
You don't need to emit the docid and rev in a view. You get both of those for free when you retrieve view results. Emitting them - especially in a verbose format like a dictionary - just makes your view unnecessarily large.
A simple view for tracking inventory balances should look more like this (also off the top of my head)
function( doc )
{
    // Every doc may carry an "InventoryChange" member mapping product_key -> quantity delta.
    if( doc.InventoryChange != undefined ) {
        for( var product_key in doc.InventoryChange ) {
            // Emit the delta itself so the _sum reduce adds up the actual inventory changes.
            emit( product_key, doc.InventoryChange[product_key] );
        }
    }
}
And the reduce function is even simpler:
_sum
This uses a built-in reduce function that just sums the values of all rows with matching keys.
In this view, any doc can have a member "InventoryChange" that maps product_keys to a change in their total inventory. E.g.:
{
"_id": "abc123",
"InventoryChange": {
"hammer_1234": 10,
"saw_4321": 25
}
}
Would add 10 hammer_1234's and 25 saw_4321's.
{
"_id": "def456",
"InventoryChange": {
"hammer_1234": -5
}
}
Would burn 5 hammers from the inventory.
With this model, you're never updating any data, only appending. This means there's no opportunity for update conflicts. All the transactional issues of updating data go away :)
Another nice thing about this model is that ANY document in the DB can both add and subtract items from the inventory. These documents can have all kinds of other data in them. You might have a "Shipment" document with a bunch of data about the date and time received, warehouse, receiving employee etc. and as long as that doc defines an InventoryChange, it'll update the inventory. As could a "Sale" doc, and a "DamagedItem" doc etc. Looking at each document, they read very clearly. And the view handles all the hard work.
Actually, you can in a way. Have a look at the HTTP Document API and scroll down to the heading "Modify Multiple Documents With a Single Request".
Basically you can create/update/delete a bunch of documents in a single POST request to the URI /{dbname}/_bulk_docs and they will either all succeed or all fail. The documentation does caution that this behaviour may change in the future, though.
EDIT: As predicted, from version 0.9 bulk docs no longer work this way.
Just use a lightweight solution like SQLite for the transactions, and when a transaction has completed successfully, replicate it and mark it as replicated in SQLite.
SQLite table:
txn_id        txn_attribute1   txn_attribute2   ...   txn_status
dhwdhwu$sg1   x                y                ...   added/replicated
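A minimal sketch of that table and the bookkeeping step (SQLite syntax; the names mirror the example above and are purely illustrative):

-- Locally recorded transactions awaiting replication.
CREATE TABLE txn (
    txn_id         TEXT PRIMARY KEY,
    txn_attribute1 TEXT,
    txn_attribute2 TEXT,
    txn_status     TEXT NOT NULL DEFAULT 'added'   -- 'added' until replicated
);

-- Mark a transaction once it has been replicated successfully.
UPDATE txn SET txn_status = 'replicated' WHERE txn_id = 'dhwdhwu$sg1';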
You can also delete the transactions which are replicated successfully.