I am building a Redis cache to store product data, for example a key-value pair like:
key -> testKey
value (JSON) ->
{
"testA" : "A",
"testB" : "B",
"testC" : "C"
}
The problem I am struggling with is what happens if I get two requests to update the value for this key:
request1 to change -> "testB" = "Bx"
request2 to change -> "testC" = "Cx"
How do I handle the inconsistency?
Based on my understanding, one request will read the data above and update only the testB value while the other updates only the testC value, because they run in parallel and a new request does not wait for the previous cache update to propagate.
How do we maintain data consistency with Redis?
I can think of locking with a transactional DB in front, but that would hurt the latency of the real-time data.
It depends on which data structure you choose in Redis.
In your case, a Hash is a good way to store all the fields of your value. Use the HSET command to update the target fields; each update request then touches only a single field. Redis executes commands sequentially (command execution is single-threaded), so you will not have concurrency issues.
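For example, a minimal sketch with redis-py (key and field names taken from the question, local Redis assumed):
import redis

r = redis.Redis()

# store the product as a hash instead of a JSON string
r.hset("testKey", mapping={"testA": "A", "testB": "B", "testC": "C"})

# request1 and request2 each touch only their own field,
# so neither can overwrite the other's change
r.hset("testKey", "testB", "Bx")
r.hset("testKey", "testC", "Cx")

print(r.hgetall("testKey"))  # {b'testA': b'A', b'testB': b'Bx', b'testC': b'Cx'}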
You can also use a String to store the raw JSON and serialize/deserialize it on every query and update. In that case you do need to think about concurrency, because the read and the update are not one atomic operation (maybe a distributed lock can be the solution).
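If you stay with the JSON-string approach, Redis's built-in optimistic locking (WATCH/MULTI) is a lighter-weight alternative to an external lock. A rough sketch with redis-py, assuming the key already exists:
import json
import redis

r = redis.Redis()

def update_field(key, field, new_value):
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                    # fail the transaction if the key changes
                value = json.loads(pipe.get(key))  # read the current JSON
                value[field] = new_value
                pipe.multi()
                pipe.set(key, json.dumps(value))   # write it back atomically
                pipe.execute()
                return
            except redis.WatchError:
                continue                           # another writer got in first; retry

update_field("testKey", "testB", "Bx")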
I use these commands manually
db.employees.update({Name: "Reg Rubio"}, {$set : {ReportingTo: ["Vice-President", "President"]}})
db.employees.update({Name: "Ian Tayao"}, {$set : {ReportingTo: ["Secretary", "Vice-President", "President"]}})
to update my data. Is there another way to combine these into one call?
So, no, there is no way to update two separate documents with two different updates in a single operation on the database side.
To optimize network traffic you can use bulkWrite. This still executes each update separately in the db, but it sends the request over the network only once, reducing the round-trip overhead. This is the only optimization you can add, and for only two calls I would say it is fairly negligible.
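For example, a minimal sketch with PyMongo (the shell's db.collection.bulkWrite accepts the equivalent document form); the connection details are placeholders:
from pymongo import MongoClient, UpdateOne

db = MongoClient()["mydb"]  # placeholder connection and database name

# one network round trip, two independent updates executed server-side
db.employees.bulk_write([
    UpdateOne({"Name": "Reg Rubio"},
              {"$set": {"ReportingTo": ["Vice-President", "President"]}}),
    UpdateOne({"Name": "Ian Tayao"},
              {"$set": {"ReportingTo": ["Secretary", "Vice-President", "President"]}}),
])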
I would like to know whether Flink can support my requirement. I have gone through a lot of articles, but I am not sure whether my case can be solved.
Case:
I have two input sources: a) Event and b) ControlSet.
Event sample data is:
event 1 -
{
  "id": 100,
  "data": {
    "name": "abc"
  }
}
event 2 -
{
  "id": 500,
  "data": {
    "date": "2020-07-10",
    "name": "event2"
  }
}
As you can see, event-1 and event-2 carry different attributes inside "data", so treat "data" as a free-form field whose attribute names may be the same or different across events.
ControlSet gives us the instruction to execute a trigger. For example, a trigger condition could look like
(id = 100 && name = abc) OR (id = 500 && date = "2020-07-10")
Please help me figure out whether this kind of scenario can run in Flink and what the best approach would be. I don't think CEP patterns or SQL can help here, and I am not sure whether the event DataStream can carry a JSON object and be queried with something like a JSON path.
Yes, this can be done with Flink. And CEP and SQL don't help, since they require that the pattern is known at compile time.
For the event stream, I propose to key this stream by the id, and to store the attribute/value data in keyed MapState, which is a kind of keyed state that Flink knows how to manage, checkpoint, restore, and rescale as necessary. This gives us a distributed map, mapping ids to hash maps holding the data for each id.
For the control stream, let me first describe a solution for a simplified version where the control queries are of the form
(id == key) && (attr == value)
We can simply key this stream by the id in the query (i.e., key), and connect this stream to the event stream. We'll use a RichCoProcessFunction to hold the MapState described above, and as these queries arrive, we can look to see what data we have for key, and check if map[attr] == value.
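To make that concrete, here is a rough PyFlink sketch of this simplified version (the linked examples below are Java; PyFlink's KeyedCoProcessFunction plays the role of the RichCoProcessFunction, and the dict layouts for events and queries are assumptions for illustration only):
from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import KeyedCoProcessFunction, RuntimeContext
from pyflink.datastream.state import MapStateDescriptor

class AttributeQuery(KeyedCoProcessFunction):

    def open(self, runtime_context: RuntimeContext):
        # keyed MapState: attribute name -> latest value for the current id
        self.attrs = runtime_context.get_map_state(
            MapStateDescriptor("attrs", Types.STRING(), Types.STRING()))

    def process_element1(self, event, ctx):
        # event stream, e.g. {"id": 100, "data": {"name": "abc"}}:
        # remember/refresh the attributes seen for this id
        for name, value in event["data"].items():
            self.attrs.put(name, str(value))

    def process_element2(self, query, ctx):
        # control stream, e.g. {"id": 100, "attr": "name", "value": "abc"}:
        # evaluate (id == key) && (attr == value) against the stored state
        stored = self.attrs.get(query["attr"])
        if stored is not None and stored == query["value"]:
            yield "id=%s matched %s=%s" % (query["id"], query["attr"], stored)

# wiring, with both streams keyed by id:
# events.connect(queries) \
#       .key_by(lambda e: e["id"], lambda q: q["id"]) \
#       .process(AttributeQuery(), output_type=Types.STRING())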
To handle more complex queries, like the one in the question
(id1 == key1 && attr1 == value1) OR (id2 == key2 && attr2 == value2)
we can do something more complex.
Here we will need to assign a unique id to each control query.
One approach would be to broadcast these queries to a KeyedBroadcastProcessFunction that once again holds the MapState described above. In the processBroadcastElement method, each instance can use applyToKeyedState to check the validity of the components of the query for which that instance is storing keyed state (the attr/value pairs derived from the data field of the event stream). For each keyed component of the query where an instance can supply the requested info, it emits a result downstream.
Then after the KeyedBroadcastProcessFunction we key the stream by the control query id, and use a KeyedProcessFunction to assemble together all of the responses from the various instances of the KeyedBroadcastProcessFunction, and to determine the final result of the control/query message.
It's not strictly necessary to use broadcast here; I just found this scheme a little more straightforward to explain. You could instead route keyed copies of the query to only the instances of the RichCoProcessFunction holding MapState for the keys used in the control query, and then do the same sort of assembly of the final result afterwards.
That may have been hard to follow. What I've proposed involves composing two techniques I've coded up before in examples: https://github.com/alpinegizmo/flink-training-exercises/blob/master/src/main/java/com/ververica/flinktraining/solutions/datastream_java/broadcast/TaxiQuerySolution.java is an example that uses broadcast to trigger the evaluation of query predicates across keyed state, and https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 is an example that uses a unique id to re-assemble a single response after doing multiple enrichments in parallel.
I have a MongoDB with two collections, Items and Calculations.
Items
value: Number
date: Date
Calculations
calculation: Number
start_date: Date
end_date: Date
A Calculation is a stored calculation based on the Item values of all Items in the DB whose dates fall between the Calculation's start date and end date.
Mongo Change Streams
I figure a good way to create/update Calculations is to open a Mongo change stream on the Items collection, listen for changes, and then recalculate the relevant Calculations.
The issue is that, according to the Mongo change event docs, when a document is deleted the fullDocument field is omitted, which prevents me from accessing the deleted Item's date, which in turn would tell me which Calculations to update.
Question
Is there any way to access the fullDocument of a Mongo Change Event that was fired due to a document deletion?
No, I don't believe there is a way. From https://docs.mongodb.com/manual/changeStreams/#event-notification:
Change streams only notify on data changes that have persisted to a majority of data-bearing members in the replica set.
When the document was deleted and the deletion was persisted across the majority of the nodes in a replica set, the document has ceased to exist in the replica set. Thus the change stream cannot return something that no longer exists.
The solution to your problem would be transactions in MongoDB 4.0. That is, you can adjust the Calculations and delete the corresponding Items in a single atomic operation.
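A rough sketch of that with PyMongo (the needs_recalc flag and connection names are illustrative; transactions need MongoDB 4.0+ and a replica set):
from pymongo import MongoClient

client = MongoClient()  # placeholder; must point at a replica set

def delete_item(item_id):
    items = client.mydb.Items
    calculations = client.mydb.Calculations
    with client.start_session() as session:
        with session.start_transaction():
            # read the date before the document disappears
            item = items.find_one_and_delete({"_id": item_id}, session=session)
            if item is None:
                return
            # flag every Calculation whose window covers the deleted Item
            calculations.update_many(
                {"start_date": {"$lte": item["date"]},
                 "end_date": {"$gte": item["date"]}},
                {"$set": {"needs_recalc": True}},
                session=session)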
fullDocument is not returned when a document is deleted.
But there is a workaround.
Right before you delete the document, set a hint field. (Obviously use a name that does not collide with your current properties.)
await myCollection.updateOne({ _id: theId }, { $set: { _deleting: true } })
await myCollection.deleteOne({ _id: theId })
This will trigger one final change event in the stream with a hint that the document is about to be deleted. Then, in your stream watcher, you simply check for this value.
stream.on('change', event => {
    if (!event.fullDocument) {
        // The document was deleted
    }
    else if (event.fullDocument._deleting) {
        // The document is going to be deleted
    }
    else {
        // Document created or updated
    }
})
My oplog was not getting updated fast enough, and the update was looking up a document that was already removed, so I needed to add a small delay to get this working.
myCollection.updateOne({ _id: theId }, { $set: { _deleting: true } })
setTimeout(() => {
    myCollection.deleteOne({ _id: theId })
}, 100)
If the document did not exist in the first place, it won't be updated or deleted, so nothing gets triggered.
Using TTL Indexes
Another way to make this work is to add a TTL index and then set that field to the current time. This triggers an update first, and then the document is deleted automatically.
myCollection.createIndex({ _deleting: 1 }, { expireAfterSeconds: 0 })
myCollection.updateOne({ _id: theId }, { $set: { _deleting: new Date() } })
The problem with this approach is that Mongo prunes TTL documents only at certain intervals (every 60s or more, as stated in the docs), so I prefer to use the first approach.
I have a central Django server containing all of my information in a database. I want to have a second Django server that contains a subset of that information in a second database. I need a bulletproof way to selectively sync data between the two.
The secondary Django will need to pull its subset of data from the primary at certain times. The subset will have to be filtered by certain fields.
The secondary Django will have to occasionally push its data to the primary.
Ideally, the two-way sync would keep the most recently modified objects for each model.
I was thinking something along the lines of using TimeStampedModel (from django-extensions) or adding my own DateTimeField(auto_now=True) so that every object stores its last modified time. Then, maybe, a mechanism to dump the data from one DB and load it into the other such that only the more recently modified objects are kept.
Possibilities I am considering are django's dumpdata, django-extensions dumpscript, django-test-utils makefixture or maybe django-fixture magic. There's a lot to think about, so I'm not sure which road to proceed down.
Here is my solution, which fits all of my requirements:
Implement natural keys and unique constraints on all models
Allows for a unique way to refer to each object without using primary key IDs
Subclass each model from TimeStampedModel in django-extensions
Adds automatically updated created and modified fields (a minimal sketch of steps 1 and 2 follows below)
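For illustration only (the model, manager, and field names are hypothetical, not from the question), the first two steps might look like this:
from django.db import models
from django_extensions.db.models import TimeStampedModel

class ProductManager(models.Manager):
    def get_by_natural_key(self, slug):
        return self.get(slug=slug)

class Product(TimeStampedModel):              # adds created/modified timestamps
    slug = models.SlugField(unique=True)      # unique constraint backing the natural key
    name = models.CharField(max_length=100)

    objects = ProductManager()

    def natural_key(self):
        return (self.slug,)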
Create a Django management command for exporting, which filters a subset of data and serializes it with natural keys
baz = Baz.objects.filter(foo=bar)
yaz = Yaz.objects.filter(foo=bar)
objects = [baz, yaz]
flat_objects = list(itertools.chain.from_iterable(objects))
data = serializers.serialize("json", flat_objects, indent=3, use_natural_keys=True)
print(data)
Create a Django management command for importing, which reads in the serialized file and iterates through the objects as follows:
If the object does not exist in the database (by natural key), create it
If the object exists, check the modified timestamps
If the imported object is newer, update the fields
If the imported object is older, do not update (but print a warning)
Code sample:
# Open the file
with open(args[0]) as data_file:
    json_str = data_file.read()

# Deserialize and iterate
for obj in serializers.deserialize("json", json_str):
    # Get model info
    model_class = obj.object.__class__
    natural_key = obj.object.natural_key()
    manager = model_class._default_manager

    # Delete PK value
    obj.object.pk = None

    try:
        # Get the existing object
        existing_obj = model_class.objects.get_by_natural_key(*natural_key)

        # Check the timestamps
        date_existing = existing_obj.modified
        date_imported = obj.object.modified

        if date_imported > date_existing:
            # Update fields
            for field in obj.object._meta.fields:
                if field.editable and not field.primary_key:
                    imported_val = getattr(obj.object, field.name)
                    existing_val = getattr(existing_obj, field.name)
                    if existing_val != imported_val:
                        setattr(existing_obj, field.name, imported_val)
            # Persist the updated fields
            existing_obj.save()

    except ObjectDoesNotExist:
        obj.save()
The workflow for this is to first call python manage.py exportTool > data.json, and then on another Django instance (or the same one) call python manage.py importTool data.json.
I'm working with Solr, indexing data from two sources: a real-time "pump" that inserts (and updates) documents in Solr, and a database that holds backups of those documents.
The problem we encountered is this: if we run a data import from the database while the pump is performing inserts, we may index a document from the pump and later overwrite it with the document extracted from the database, which is a backup and therefore probably a little outdated.
If we stop the pump, import from the database, and start the pump again, it will probably cause instability in our application.
What I'd like to do is tell Solr not to overwrite the document automatically, but to do so conditionally (for example, based on the value of a 'last_modified_date' field).
My question is: how can I do that? Do I have to modify the Solr source, write a new class overriding some update processor, or just add some magic lines to solrconfig?
Sorry, but there is no option or configuration to tell Solr to skip the automatic overwrite and apply a conditional check instead. The current model in Solr is that if you insert a document with the same unique id as one already in the index, it "updates" that document via a delete/add operation. Solr also does not currently support updating only specific fields of an existing indexed document. Please see issue SOLR-139 for more details.
Based on the scenario you have described, I would suggest creating a process outside of Solr that handles retrieving items from your data sources, performs the conditional check against what is already in the index, and decides whether an update to the index is necessary.
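A rough sketch of such an external gatekeeper, assuming a last_modified_date field stored in the index, ISO-formatted date strings, and a Solr whose /update handler accepts JSON (the core name is a placeholder):
import requests

SOLR = "http://localhost:8983/solr/mycore"

def index_if_newer(doc):
    # ask Solr for the stored last_modified_date of this document, if any
    resp = requests.get(SOLR + "/select", params={
        "q": 'id:"%s"' % doc["id"],
        "fl": "last_modified_date",
        "wt": "json",
    }).json()
    hits = resp["response"]["docs"]
    if hits and hits[0].get("last_modified_date", "") >= doc["last_modified_date"]:
        return  # the index already holds a copy that is at least as recent
    # otherwise (re)index the incoming document
    requests.post(SOLR + "/update", params={"commit": "true"}, json=[doc])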
You can use a Solr script update processor to check whether the document already exists and proceed accordingly.
The code below only works when Solr runs on Java 8 (it relies on the JVM's built-in JavaScript engine).
function processAdd(cmd) {
    var doc = cmd.solrDoc;
    var previousDoc = null;
    try {
        // create a Term for the unique field to search on
        var Term = Java.type("org.apache.lucene.index.Term");
        var termObject = new Term("fieldForSearchTryUnique", "Value of field");

        // retrieve the internal document id from Solr; returns -1 if not present
        var previousDocId = req.getSearcher().getFirstMatch(termObject);

        if (previousDocId != -1) {
            // get the complete stored document from Solr for that searched field
            previousDoc = req.getSearcher().doc(previousDocId);
            // do the required processing here, e.g. compare last_modified_date
            // and return false to drop this update if the indexed copy is newer
        }
    }
    catch (err) {
        logger.error("error in update processor " + err);
    }
}