I am using org.apache.solr.client.solrj.impl.HttpSolrServer for calling Solr.
For sequential delete and add operations , I am hitting solr like
solr.addBeans(<solrDocs>);
solr.deleteByQuery(<Query>);
solr.commit();
Is there any way I can achieve the same in one Solr call, something like solr.execute(addbean, deleteByQuery1)?
I know that multiple commands may be contained in one message, as per the Solr wiki. I want to know how to achieve the same in SolrJ or any other Java library.
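For illustration, here is roughly what I am hoping a combined call could look like. I am assuming SolrJ's UpdateRequest can queue both commands in one request, but I have not verified it; the URL, query and field values below are placeholders:
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

UpdateRequest req = new UpdateRequest();

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");                              // placeholder document; beans could be
req.add(doc);                                             // converted via solr.getBinder().toSolrInputDocument(bean)
req.deleteByQuery("type:obsolete");                       // placeholder delete query
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);  // ask for a commit in the same request

req.process(solr);                                        // one HTTP round trip instead of three
Whether packing both commands into one request actually makes the pair atomic is exactly what I am unsure about.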
What do I want to achieve by this?
An atomic operation.
Let's take a case: there are two processes (or threads), P1 and P2. Each performs an Add (A1 and A2 respectively) and a Delete (D1 and D2) operation. Let the sequence be like this:
D1 (Deletion of docs by process P1)
D2 (Deletion of docs by process P2)
A2 (Addition of docs by process P2)
P2.commit -> (this will make D1 committed in Solr too)
A1 (Addition of docs by process P1): now even if it fails, D1 is not going to roll back (because of P2.commit)
What I want is to roll back P1.D1.
I have encountered a concurrency problem even after using the serializable isolation level in the Spring transaction. My use case is that the user provides a config to be updated in the database in the format below.
{A1: [B1, B2, B3]}
I have to save this in the entities below.
A {
    @OneToMany
    List<B> bList;
}

B {
    @ManyToOne
    A a;
    Boolean isDeleted;
}
When there are concurrent requests to save the config, more B's are getting inserted than expected. Please refer to the scenario below.
Initial entities in database: A1 -> []
Transaction 1 - given config {A1: [B2]}
Reads A1 -> []
Insert B2
Transaction 2 - given config {A1: [B3]}
Reads A1 -> []
Insert B3
Final state in database: A1 -> [B2, B3], whereas the expected result is either A1 -> [B2, B3-deleted] or A1 -> [B2-deleted, B3].
I have not been able to find a proper solution to this problem even after a lot of research.
According to this article (https://sqlperformance.com/2014/04/t-sql-queries/the-serializable-isolation-level), this situation is always possible when using SQL Server, because that order of operations is one of the valid serializations.
This is best handled by introducing a version column for optimistic locking. There is no need to use the SERIALIZABLE isolation level. Just use
A {
    @Version
    long version;
    @OneToMany
    List<B> bList;
}
and make sure you use LockModeType.OPTIMISTIC_FORCE_INCREMENT when loading the A. This way, the "serialization" is based on a lock of your so-called "aggregate root", which is A.
By doing so, one transaction will succeed and the other will fail, because at the end of each transaction the version column is incremented only if its value didn't change in the meantime. If it did change, one of the two transactions is rolled back and you will see an OptimisticLockException.
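For illustration, a minimal sketch of the locking step, assuming a Spring-managed EntityManager and the entities above (the id parameter, direct field access and transaction wiring are simplified assumptions):
import javax.persistence.EntityManager;
import javax.persistence.LockModeType;
import org.springframework.transaction.annotation.Transactional;

@Transactional
public void saveConfig(EntityManager em, Long aId, B newB) {
    // Locking the aggregate root forces a version increment even though only B rows change.
    A a = em.find(A.class, aId, LockModeType.OPTIMISTIC_FORCE_INCREMENT);

    newB.a = a;
    newB.isDeleted = false;
    a.bList.add(newB);
    em.persist(newB);

    // At commit, the provider issues roughly:
    //   UPDATE A SET version = version + 1 WHERE id = ? AND version = ?
    // If a concurrent transaction already bumped the version, no row matches and
    // an OptimisticLockException rolls this transaction back.
}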
I currently have a Neo4j database with a simple data structure of about 400 million (:Node {id: String, refs: List[String]}) nodes, each with two properties: an id, which is a string, and refs, which is a list of strings.
I need to search all of these nodes to identify relationships between them. These directed relationships exist if a node's id is in the refs list of another node. A simple query that accomplishes what I want (but is too slow):
MATCH (a:Node), (b:Node)
WHERE ID(a) < ID(b) AND a.id IN b.refs
CREATE (b)-[:CITES]->(a)
I can use apoc.periodic.iterate, but the query is still much too slow:
CALL apoc.periodic.iterate(
"MATCH (a:Node), (b:Node)
WHERE ID(a) < ID(b)
AND a.id IN b.refs RETURN a, b",
"CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:false,iterateList:true})
Any suggestions as to how I can build this database and its relationships efficiently? I have vague thoughts about creating a hash table as I first add the nodes to the database, but I am not sure how to implement this, especially in Neo4j.
Thank you.
If you first create an index on :Node(id), like this:
CREATE INDEX ON :Node(id);
then this query should be able to take advantage of the index to quickly find each a node:
MATCH (b:Node)
UNWIND b.refs AS ref
MATCH (a:Node)
WHERE a.id = ref
CREATE (b)-[:CITES]->(a);
Currently, the Cypher execution planner does not support using the index when directly comparing the values of two properties. In the above query, the WHERE clause compares a property with a variable, so the index can be used.
The ID(a) < ID(b) test was omitted, since your question did not state that ordering the native node IDs in that way was required.
[UPDATE 1]
If you want to run the creation step in parallel, try this usage of the APOC procedure apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"MATCH (b:Node) UNWIND b.refs AS ref RETURN b, ref",
"MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:true})
The first Cypher statement passed to the procedure just returns each b/ref pair. The second statement (which is run in parallel) uses the index to find the a node and creates the relationship. This division of effort puts the more expensive processing in the statement running in a parallel thread. The iterateList: true option is omitted, since we (probably) want the second statement to run in parallel for each b/ref pair.
[UPDATE 2]
You can encounter deadlock errors if parallel executions try to add relationships to the same nodes (since each parallel transaction will attempt to write-lock every new relationship's end nodes). To avoid deadlocks involving just the b nodes, you can do something like this to ensure that a b node is not processed in parallel:
CALL apoc.periodic.iterate(
"MATCH (b:Node) RETURN b",
"UNWIND b.refs AS ref MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize:10000, parallel:true})
However, this approach is still vulnerable to deadlocks if parallel executions can try to write-lock the same a nodes (or if any b nodes can also serve as a nodes). But hopefully this addendum at least helps you understand the problem.
[UPDATE 3]
Since these deadlocks are race conditions that depend on multiple parallel executions trying to lock the same nodes at the same time, you might be able to work around this issue by retrying the "inner statement" whenever it fails. And you could also try making the batch size smaller, to reduce the probability that multiple parallel retries will overlap in time. Something like this:
CALL apoc.periodic.iterate(
"MATCH (b:Node) RETURN b",
"UNWIND b.refs AS ref MATCH (a:Node {id: ref}) CREATE (b)-[:CITES]->(a)",
{batchSize: 1000, parallel: true, retries: 100})
Is there a way to update only a specific property on an NDB entity?
Consider this example.
Entity A has the following properties:
property B
property C
Let's assume both of these properties have values of 1 at the moment.
Two different requests are trying to update the same entity, and they are happening at the same time.
So when Request #1 and Request #2 retrieve this entity, the values of B and C are both 1.
Now Request #1 tries to update property B, so it sets B to 2 and puts the entity into the Datastore. Now B = 2 and C = 1 in the Datastore.
But Request #2 still has B = 1 and C = 1 in memory, and when it changes C to 2 and puts the entity into the Datastore, it writes B = 1 and C = 2, which overwrites the B value written by Request #1.
How do you get around this? Is there a way to write only a specific property into the Datastore?
I believe you may want to look into transactions.
As per the documentation:
If the transaction "collides" with another, it fails; NDB automatically retries such failed transactions a few times. Thus, the function may be called multiple times if the transaction is retried.
Link: https://developers.google.com/appengine/docs/python/ndb/transactions
I face a strange scenario with Solr 4.6.1.
I'm trying to update a document several times. The pseudocode for this:
id1 = obtain-the-ID-of-the-document()
lock()
// only one thread is updating document1
doc1 = read-document-from-Solr-with-realtime-GET(id1)
modify-document(doc1)
update-document-in-Solr(doc1)
unlock()
[...]
id1 = obtain-the-ID-of-the-document()
lock()
// only one thread is updating document1
doc1 = read-document-from-Solr-with-realtime-GET(id1)
modify-document(doc1)
update-document-in-Solr(doc1)
unlock()
Now, I'm also using Solr's optimistic locking mechanism, mostly to make sure my update logic is fine. And sometimes I still get a "Conflict" from Solr, with status code 409.
It looks like the update operation returns before the transaction log is written, because the RealTimeGetHandler does not find the updated version (I know this because the returned document has the same version number). Thus, it is possible that the second modification is actually performed on the same document, because both real-time GET queries return the same document; hence the conflict.
I worked around this by adding a small delay in the update method (50-100 ms) and re-querying Solr until the version numbers differ; at that point I assume that the transaction log has been updated, so I can safely unlock and move on to the next document.
It feels wrong to add an arbitrary delay. Is there a better way to solve this problem? Or is there some configuration to tell Solr to return from an update only after the tlog has been written?
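For reference, here is roughly how the read-modify-write step above looks in SolrJ as I have it. The core URL, ids and field names are placeholders, and the way I parse the /get response is my own assumption:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

// Real-time GET: fetch the latest (possibly uncommitted) state of the document.
SolrQuery rtg = new SolrQuery();
rtg.setRequestHandler("/get");
rtg.set("id", "doc-1");
SolrDocument current = (SolrDocument) solr.query(rtg).getResponse().get("doc");

// Write back, passing the _version_ we read; Solr answers 409 (Conflict)
// if the stored version no longer matches, i.e. someone updated the doc in between.
SolrInputDocument update = new SolrInputDocument();
update.addField("id", "doc-1");
update.addField("_version_", current.getFieldValue("_version_"));
update.addField("someField", "new value");   // placeholder modification
solr.add(update);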
Scenario 1:
If I have a great-grandpa ancestor (let's name him A), and my key is for "child1", is there a way to check that my great-grandpa is A? (I hope I can do that without needing to loop.)
Or can I check whether child1's key is on the path "A -> B -> C"?
A -> B -> C -> (child1, child2...)
Scenario 2:
Continuing from the above: great-grandpa A has other descendants under "G", and I would like to retrieve "H"'s children:
A-> B -> C -> (children of C)
...-> G -> H -> (children of H)
I would like to retrieve "H"'s children, assuming the path from A through G to H is known... can I do that? (I hope I can do this in a query, without looping.)
If you have a Go example, that would be awesome...
Scenario 1:
If you want to check that child1's great-grandpa is A, you will have to walk up the key chain by invoking key.getParent() three times (checking for null parents along the way). There is no API that does this check for you.
More generally, if you want to check that entity X has A as an ancestor, you will have to call key.getParent() up to N times, where N is the depth of X below A.
Note, however, that the overhead is minimal: calling key.getParent() does not result in any calls to the actual datastore.
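For example, a small helper along these lines (a sketch against the low-level Java Datastore API; the helper name is mine):
import com.google.appengine.api.datastore.Key;

// Walks up the key path in memory; no datastore round trips are made.
static boolean hasAncestor(Key childKey, Key ancestorKey) {
    for (Key k = childKey.getParent(); k != null; k = k.getParent()) {
        if (k.equals(ancestorKey)) {
            return true;
        }
    }
    return false;
}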
You can of course ensure with an ancestor query that C / entity X is a descendant of A (as your Scenario 2 implies), thus avoiding the need to check the query results yourself; the datastore checks this for you when executing the query.
https://developers.google.com/appengine/docs/java/datastore/queries
=> search for Ancestor Queries
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/datastore/Query#setAncestor(com.google.appengine.api.datastore.Key)
childCQuery.setAncestor(entityA.getKey());
Scenario 2:
Grandpa 'A' can't know the path to 'H', since children can be added and removed at any point in time. There is no limitation on which entities can be descendants of 'A', so only a datastore query can determine the descendants of 'A'.
But as stated in Scenario 1, you can specify 'A' as the ancestor in your query so that you filter out any results where 'A' is not an ancestor.
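For Scenario 2, a sketch with the Java API, assuming you can rebuild H's key from the known path (the kind names and ids below are made up):
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;

Key aKey = KeyFactory.createKey("A", "a1");
Key gKey = KeyFactory.createKey(aKey, "G", "g1");       // path A -> G
Key hKey = KeyFactory.createKey(gKey, "H", "h1");       // path A -> G -> H

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query childrenOfH = new Query().setAncestor(hKey);      // kindless ancestor query
for (Entity child : ds.prepare(childrenOfH).asIterable()) {
    // every entity whose key path starts with A/G/H (including H itself)
}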
Hope this answers your questions.
Note: My responses to your question refer to the Java API. I am not yet familiar with the Go API.
Thanks.
Amir