Database Supporting Update by Function rather than by Value - database

Desired behavior:
In Clojure's implementation of Agents, to update an Agent, one does NOT send a new value. One sends a function, which is called on the old value, and the return value is set as the new value of the Agent.
This makes certain things easy: for example, if I have a queue, and I have two concurrent threads that both want to append to the queue (and I don't care which order they append), each thread can just fire off a (fn [x] (cons x new_value)) ... and it just works. Whereas, if it was updating by value, I'd have to do a compare and swap of some sort.
Question:
Is there any database that supports this type of updating? For example, I was recently looking at MongoDB. However, MongoDB supports only $inc/$dec, and not arbitrary functions for updating the documents.
Thanks!
PS -- I don't need transactions / ACID / BASE / ... all I really want is a simple document store that supports updating via functions rather than values.

MongoDB actually supports multiple different modifier operations (full list here) and while I can see $inc/$dec alone might be limiting, a simple variant of specific example you give of adding to a queue could be handled using $push or $addToSet (depending on whether you wanted duplicates to appear in your queue or not as $addToSet will only add a value if it doesn't already exist).
For more complex cases (supporting compare and swap type of semantics) you can look at findAndModify command in MongoDB if you don't find something simple that fits your needs exactly.

Related

SCD-2 in data modelling: how do I detect changes?

I know the concept of SCD-2 and I'm trying to improve my skills about it doing some practices.
I have the next scenario/experiment:
I'm calling daily to a rest API to extract information about companies.
In my initial load to the DB everything is new, so everything is very easy.
Next day I call to the same rest API, which might returns the same companies, but some of them might have (or not) some changes (i.e., they changed the size, the profits, the location, ...)
I know SCD-2 might be really simple if the rest API returns just records with changes, but in this case it might returns as well records without changes.
In this scenario, how people detect if the data of a company has changes or not in order to apply SCD-2?, do they compare all the fields?.
Is there any example out there that I can see?
There is no standard SCD-2 nor even a unique concept of it. It is a general term for large number of possible approaches. The only chance is to practice and see what is suitable for your use case.
In any case you must identify the natural key of the dimension and the set of the attributes you want to keep the history.
You may of course make it more complex by the decision to use your own surrogate key.
You mentioned that there are two main types of the interface for the process:
• You get periodically a full set of the dimension data
• You get the “changes only” (aka delta interface)
Paradoxically the former is much simple to handle than the latter.
First of all, in the full dimensional snapshot the natural key holds, contrary to the delta interface (where you may get more changes for one entity).
Additionally you have to handle the case of late change delivery or even the wrong order of changes delivery.
Next important decision is if you expect deletes to occur. This is again trivial in the full interface, you must define some convention, how this information would be passed in the delta interface.
Connected is the question whether a previously deleted entity can be reused (i.e. reappear in the data).
If you support delete/reuse you'll have to thing about how to show them in your dimension table.
In any case you will need some additional columns in the dimension to cover the historical information.
Some implementation use a change_timestamp, some other use validity interval valid_from and valid_to.
Even other implementation claim that additional sequence number is required – so you avoid the trap of more changes with the identical timestamp.
So you see that before you look for some particular implementation you need carefully decide the options above. For example the full and delta interface leads to a completely different implementations.

Perform operations directly on database (esp. Firestore)

Just a question regarding NoSQL DB. As far as I know, operations are done by the app/website outside the DB. For instance, if I need to add an value to a list, I need to
download the intial list
add the new value in the list on my device
upload the whole updated list.
At the end, a lot of data is travelling (twice the initial list) with no added value.
Is there any way to request directly the DB for simple operations like this?
db.collection("collection_key").document("document_key").add("mylist", value)
Or simply increment a field?
Same for knowing the number of documents in a collection: is it needed to download the whole set of document to get the number ?
Couple different answers:
In Firestore, many intrinsic operations can be done "FieldValues", such as increment/decrement (by supplied value, so really Add/subtract). Also array unions, field deletes, etc. Just search the documentation for FieldValue. Whether this is true for NoSQL in general, I can't say.
Knowing the number of documents, on the other hand. is not trivially done in Firestore - but frankly, I can't think of any situations other than artificially contrived examples where you would need to know. Easy enough to setup ways to "count" documents as you create/delete them, and keep that separately, if for some reason you find yourself needing it.
Or were you just trying to generically put down NoSQL as a concept?

Want to capture fields which get updated in Salesforce

I wish to create a generic component which can save the Object Name and field Names with old and new values in a BigObject.
The brute force algo says, on every update of each object, get field API names using describe and check old and new value of those fields. If it gets modified insert it into new BigObject.
But it will consume a lot of CPU time and I am looking for an optimum solution to handle this.
Any suggestions are appreciated.
Well, do you have any code written already? Maybe benchmark it and then see what you can optimise instead of overdesigning it from the start... Keep it simple, write test harness and then try to optimise (without breaking unit tests).
Couple random ideas:
You'd be doing that in a trigger? So your "describe" could happen only once. You don't need to describe every single field, you need only one operation outside of trigger's main loop.
Set<String> fieldNames = Account.sObjectType.getDescribe().fields.getMap().keyset();
System.debug(fieldNames);
This will get you "only" field names but that's enough. You don't care whether they're picklists or dates or what. Use that with generic sObject.get('fieldNameHere') and it's a good start.
or maybe without describe at all. sObject's getPopulatedFieldsAsMap() will give you cool Map which you can easily iterate & compare.
or JSON.serialize the old & new version of the object and if they aren't identical - you know what to do. No idea if they'll always serialise with same field order though so checking if the maps are identical might be better
do you really need to hand-craft this field history tracking like that? You have 1M records free storage but it could explode really easily in busier SF org. Especially if you have workflows, processes, other triggers that would translate to multiple updates (= multiple trigger runs) in same transaction. Perhaps normal field history tracking + chatter feed tracking + even salesforce shield (it comes with 60 more fields tracked I think) would be more sensible for your business needs.

GAE NDB Sorting a multiquery with cursors

In my GAE app I'm doing a query which has to be ordered by date. The query has to containt an IN filter, but this is resulting in the following error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
Now I've read through other SO question (like this one), which suggest to change to sorting by key (as the error also points out). The problem is however that the query then becomes useless for its purpose. It needs to be sorted by date. What would be suggested ways to achieve this?
The Cloud Datastore server doesn't support IN. The NDB client library effectively fakes this functionality by splitting a query with IN into multiple single queries with equality operators. It then merges the results on the client side.
Since the same entity could be returned in 1 or more of these single queries, merging these values becomes computationally silly*, unless you are ordering by the Key**.
Related, you should read into underlying caveats/limitations on cursors to get a better understanding:
Because the NOT_EQUAL and IN operators are implemented with multiple queries, queries that use them do not support cursors, nor do composite queries constructed with the CompositeFilterOperator.or method.
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values. The de-duplication logic for such multiple-valued properties does not persist between retrievals, possibly causing the same result to be returned more than once.
If the list of values used in IN is a static list rather than determined at runtime, a work around is to compute this as an indexed Boolean field when you write the Entity. This allows you to use a single equality filter. For example, if you have a bug tracker and you want to see a list of open issues, you might use a IN('new', 'open', 'assigned') restriction on your query. Alternatively, you could set a property called is_open to True instead, so you no longer need the IN condition.
* Computationally silly: Requires doing a linear scan over an unbounded number of preceding values to determine if the current retrieved Entity is a duplicate or not. Also known as conceptually not compatible with Cursors.
** Key works because we can alternate between different single queries retrieving the next set of values and not have to worry about doing a linear scan over the entire proceeding result set. This gives us a bounded data set to work with.

What's the difference between findAndModify and update in MongoDB?

I'm a little bit confused by the findAndModify method in MongoDB. What's the advantage of it over the update method? For me, it seems that it just returns the item first and then updates it. But why do I need to return the item first? I read the MongoDB: the definitive guide and it says that it is handy for manipulating queues and performing other operations that need get-and-set style atomicity. But I didn't understand how it achieves this. Can somebody explain this to me?
If you fetch an item and then update it, there may be an update by another thread between those two steps. If you update an item first and then fetch it, there may be another update in-between and you will get back a different item than what you updated.
Doing it "atomically" means you are guaranteed that you are getting back the exact same item you are updating - i.e. no other operation can happen in between.
findAndModify returns the document, update does not.
If I understood Dwight Merriman (one of the original authors of mongoDB) correctly, using update to modify a single document i.e.("multi":false} is also atomic. Currently, it should also be faster than doing the equivalent update using findAndModify.
From the MongoDB docs (emphasis added):
By default, both operations modify a single document. However, the update() method with its multi option can modify more than one document.
If multiple documents match the update criteria, for findAndModify(), you can specify a sort to provide some measure of control on which document to update.
With the default behavior of the update() method, you cannot specify which single document to update when multiple documents match.
By default, findAndModify() method returns the pre-modified version of the document. To obtain the updated document, use the new option.
The update() method returns a WriteResult object that contains the status of the operation. To return the updated document, use the find() method. However, other updates may have modified the document between your update and the document retrieval. Also, if the update modified only a single document but multiple documents matched, you will need to use additional logic to identify the updated document.
Before MongoDB 3.2 you cannot specify a write concern to findAndModify() to override the default write concern whereas you can specify a write concern to the update() method since MongoDB 2.6.
When modifying a single document, both findAndModify() and the update() method atomically update the document.
One useful class of use cases is counters and similar cases. For example, take a look at this code (one of the MongoDB tests):
find_and_modify4.js.
Thus, with findAndModify you increment the counter and get its incremented
value in one step. Compare: if you (A) perform this operation in two steps and
somebody else (B) does the same operation between your steps then A and B may
get the same last counter value instead of two different (just one example of possible issues).
This is an old question but an important one and the other answers just led me to more questions until I realized: The two methods are quite similar and in many cases you could use either.
Both findAndModify and update perform atomic changes within a single request, such as incrementing a counter; in fact the <query> and <update> parameters are largely identical
With both, the atomic change takes place directly on a document matching the query when the server finds it, ie an internal write lock on that document for the fraction of a millisecond that the server confirms the query is valid and applies the update
There is no system-level write lock or semaphore which a user can acquire. Full stop. MongoDB deliberately doesn't make it easy to check out a document then change it then write it back while somehow preventing others from changing that document in the meantime. (While a developer might think they want that, it's often an anti-pattern in terms of scalability and concurrency ... as a simple example imagine a client acquires the write lock then is killed while holding it. If you really want a write lock, you can make one in the documents and use atomic changes to compare-and-set it, and then determine your own recovery process to deal with abandoned locks, etc. But go with caution if you go that way.)
From what I can tell there are two main ways the methods differ:
If you want a copy of the document when your update was made: only findAndModify allows this, returning either the original (default) or new record after the update, as mentioned; with update you only get a WriteResult, not the document, and of course reading the document immediately before or after doesn't guard you against another process also changing the record in between your read and update
If there are potentially multiple matching documents: findAndModify only changes one, and allows you customize the sort to indicate which one should be changed; update can change all with multi although it defaults to just one, but does not let you say which one
Thus it makes sense what HungryCoder says, that update is more efficient where you can live with its restrictions (eg you don't need to read the document; or of course if you are changing multiple records). But for many atomic updates you do want the document, and findAndModify is necessary there.
We used findAndModify() for Counter operations (inc or dec) and other single fields mutate cases. Migrating our application from Couchbase to MongoDB, I found this API to replace the code which does GetAndlock(), modify the content locally, replace() to save and Get() again to fetch the updated document back. With mongoDB, I just used this single API which returns the updated document.

Resources