How to use shard key in update queries generated by SimpleMongoRepository - spring-data-mongodb

We are using Spring Data MongoDB to connect to a sharded Azure CosmosDB instance. The problem we face is that the default SimpleMongoRepository implementation does not seem to support specifying a shard key that is then used in the query section of the update command sent to MongoDB (or CosmosDB in our case). Unlike MongoDB, CosmosDB requires the shard key in every query that hits a sharded collection; MongoDB only recommends specifying it.
We have not yet found a way to change the save operation so that it also uses the shard key in the query section of the update command. Implementing a custom repository seems tricky, since most of the classes we would need for that are private or package-private.
Does anyone have experience with this or is in a similar situation?
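To make the goal concrete, here is a minimal sketch in Python of the query section such an update would need: the _id plus every shard key field. It is not Spring Data code; the helper name build_save_filter and the "tenantId" shard key are assumptions made up for illustration.

```python
from typing import Any, Dict, Iterable


def build_save_filter(document: Dict[str, Any],
                      shard_key_fields: Iterable[str]) -> Dict[str, Any]:
    """Build an update filter that contains _id plus every shard key field."""
    query = {"_id": document["_id"]}
    for field in shard_key_fields:
        if field not in document:
            raise KeyError(f"document is missing shard key field {field!r}")
        query[field] = document[field]
    return query


# Hypothetical document; "tenantId" is an assumed shard key, not from the question.
order = {"_id": "order-1", "tenantId": "acme", "total": 42}
save_filter = build_save_filter(order, ["tenantId"])

# With a driver such as PyMongo, the filter would be passed along the lines of:
#   collection.replace_one(save_filter, order, upsert=True)
```

The point of the sketch is only that the filter is built from the document itself, so a save can target the right shard without a separate lookup.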

Related

How can I download all documents from Retrieve and Rank (Solr)?

We have a Cloudant database on Bluemix that contains a large number of documents that are answer units built by the Document Conversion service. These answer units are used to populate a Solr Retrieve and Rank collection for our application. The Cloudant database serves as our system of record for the answer units.
For reasons that are unimportant, our Cloudant database is no longer valid. What we need is a way to download everything from the Solr collection and re-create the Cloudant database. Can anyone tell me a way to do that?
I'm not aware of any automated way to do this.
You'll need to fetch all your documents from Solr (assuming you have a lot of them, do this in a paginated way; there are some examples of how to do this in the Solr documentation) and add them to Cloudant.
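A rough sketch of the paginated fetch, using Solr's cursorMark parameter. It assumes the collection's unique key is "id" and that the select endpoint is reachable without extra authentication; both are assumptions, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, Iterator


def fetch_page_http(select_url: str, params: Dict[str, str]) -> Dict[str, Any]:
    """Fetch one page of results from a Solr select endpoint as JSON."""
    url = select_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_all_docs(select_url: str,
                  rows: int = 500,
                  unique_key: str = "id",
                  fetch_page: Callable[[str, Dict[str, str]], Dict[str, Any]] = fetch_page_http,
                  ) -> Iterator[Dict[str, Any]]:
    """Yield every stored document, following cursorMark until it stops changing."""
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {
            "q": "*:*",
            "wt": "json",
            "rows": str(rows),
            "sort": f"{unique_key} asc",  # cursors need a sort on the unique key
            "cursorMark": cursor,
        }
        page = fetch_page(select_url, params)
        for doc in page["response"]["docs"]:
            yield doc
        next_cursor = page.get("nextCursorMark", cursor)
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor
```

Each yielded document could then be written to Cloudant in batches. Only stored fields come back, as noted below.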
Note that you'll only be able to do this for the fields that you have set to be stored in your schema. If there are important fields that you need in Cloudant that you haven't got stored in Solr, then you might be stuck. :(
You can replicate one Cloudant database to another, which will give you an exact replica.
Another technique is to use a tool such as couchbackup which takes a copy of your database's documents (ignoring any deletions) and allows you to save the data in a text file. You can then use the couchrestore tool to upload the data file to a new database.
See this blog for more details.

How to generate _id on conflict during replication of cloudant databases?

While using the replication API (https://docs.cloudant.com/replication_guide.html), how do I instruct Cloudant to generate a new _id when there is a conflict?
Basically I have to merge documents from one database into another one ... I will specify "doc_ids" to replicate ... But if there is a conflict I want Cloudant to generate a new _id and proceed. Is this possible?
Cloudant, like CouchDB, doesn't have any means to handle conflicts on your behalf. This means you'll need to handle them in your application code, which you can do by listening to the changes feed. If you are new to conflicts and conflict resolution, you can use these guides.
https://cloudant.com/blog/introduction-to-document-conflicts-part-one/
https://cloudant.com/blog/introduction-to-document-conflicts-part-two/
https://cloudant.com/blog/introduction-to-document-conflicts-part-three/
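As an illustration of handling this in application code instead of through replication, here is a minimal in-memory sketch. The target dict stands in for the destination database (a real version would read and write documents through the Cloudant HTTP API), and merge_with_fresh_ids is a hypothetical helper name.

```python
import uuid
from typing import Any, Dict, Iterable, List, Tuple


def merge_with_fresh_ids(source_docs: Iterable[Dict[str, Any]],
                         target: Dict[str, Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Copy documents into target; on an _id clash, store the copy under a new _id.

    Returns (old_id, new_id) pairs for every document that had to be renamed.
    """
    renamed = []
    for doc in source_docs:
        # Drop the source revision: the copy is written as a new document.
        copy = {key: value for key, value in doc.items() if key != "_rev"}
        if copy["_id"] in target:
            new_id = uuid.uuid4().hex
            renamed.append((copy["_id"], new_id))
            copy["_id"] = new_id
        target[copy["_id"]] = copy
    return renamed
```

Keeping the list of renamed ids is a design choice: any documents that refer to the old ids can be fixed up afterwards.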

Using solr/pysolr with flask sqlalchemy

I am trying to set up Solr for use with a Postgres db which I access via the Flask-SQLAlchemy ORM. I found the library pysolr for the purpose, but it is not clear how to set up hooks within the SQLAlchemy models to update the Solr index. Are there any examples?
pysolr suggests inserting documents manually, via solr.add, but it's not clear how you would separate indices for different database tables.
After doing some research I came up with the following approach, and I am wondering if this is the right way to go:
In the ORM models, hook after_insert, after_update, after_remove and after_commit, and insert/update/remove the object data in Solr in these events.
To segregate data of different models, use the table name as a prefix in the "id" field of Solr documents: solr_id = db_table_name + db_id
When you do a search, get all the results, manually filter those matching the required db table, extract the ids, look up the db by those ids and use those db results.
Is there a better way to go about doing this? Thanks.
SQLAlchemy and Solr are structured differently. I think a better solution is to implement a script that synchronizes the data, and to run it periodically, maybe every 30 minutes or every hour, to pick up new data.
Binding the insert/update/remove/commit mechanisms into the model isn't a good approach, because if your Solr service has any problems, your website (anything that accesses the database) will be affected. Keep different services independent.
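A minimal sketch of what such a sync script could look like. It keeps the table-name prefix in the id from the question and adds a separate "doc_type" field so searches can filter by table. The field names "doc_type" and "db_id", the function names, and the send_batch callback are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List


def to_solr_doc(table: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a database row into a Solr document with a table-scoped id."""
    doc = dict(row)
    doc["id"] = f"{table}:{row['id']}"  # unique across tables
    doc["db_id"] = row["id"]            # original primary key for the db lookup
    doc["doc_type"] = table             # lets a search filter on the table
    return doc


def batches(docs: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group documents into lists of at most `size` items."""
    batch: List[Dict[str, Any]] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def sync_table(table: str,
               rows: Iterable[Dict[str, Any]],
               send_batch: Callable[[List[Dict[str, Any]]], Any],
               batch_size: int = 100) -> int:
    """Send all rows of one table to the index in batches; return the count sent."""
    sent = 0
    for batch in batches((to_solr_doc(table, row) for row in rows), batch_size):
        send_batch(batch)
        sent += len(batch)
    return sent
```

With pysolr, send_batch could be the solr.add method mentioned in the question, and a search could then pass a filter query such as fq="doc_type:users" instead of filtering results by hand. Which rows count as "new" since the last run is left to the caller.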

solr - can I use it for this?

Is Solr just for searching, i.e. it's not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at SOLR as an alt option, I see you make your queries through http requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding SOLR, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data in the file system and does not provide the features of a relational database (ACID, nested entities, etc.).
Usually, the model followed is to use a relational database for your data management.
Replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.

building in support for future Solr sharding

We are building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in future if we outgrow the indexing capacity.
What are the key things to keep in mind when developing an application that can support multiple shards in future?
We store the Solr URL /solr/ in a DB, and it is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will the process for using the shards be as simple as updating the URLs in the DB? Or are there other things that need to be updated? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And updating the SolrUpdateBaseURL in DB to
https://solr2/solr/
?
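For illustration, a small Python sketch of building such a search URL from a stored base URL and an optional shard list. The question uses SolrJ; this is only a language-neutral picture of the idea, and build_search_url is a hypothetical helper.

```python
from typing import Iterable
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, shards: Iterable[str] = ()) -> str:
    """Build a select URL, adding the shards parameter only when shards are given."""
    params = {"q": query, "indent": "true"}
    shard_list = ",".join(shards)
    if shard_list:
        params["shards"] = shard_list
    # urlencode percent-encodes "/" and "," inside values; the server decodes them.
    return base_url.rstrip("/") + "/select?" + urlencode(params)


single = build_search_url("https://solr2/solr/", "title:foo")
sharded = build_search_url("https://solr2/solr/", "title:foo",
                           ["solr1/solr", "solr2/solr"])
```

Keeping the shard list as data separate from the base URL means adding a shard is a change to the list, not to a hand-edited query string.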
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards belong to which collections, shard replicas, leader and slave nodes, and more). SolrCloud can distribute the load on both the indexing and querying sides by using hashing.
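To show what hash-based routing means in its simplest form, here is a toy sketch that picks a shard from a document id. It uses CRC32 and a modulo purely as a stand-in; SolrCloud's own router uses a different hash function and assigns hash ranges to shards, so this is not its actual algorithm.

```python
import zlib
from typing import List


def pick_shard(doc_id: str, shards: List[str]) -> str:
    """Map a document id to one shard, always the same one for the same id."""
    if not shards:
        raise ValueError("at least one shard is required")
    # CRC32 is a stand-in hash chosen for this sketch only.
    index = zlib.crc32(doc_id.encode("utf-8")) % len(shards)
    return shards[index]
```

Note that with plain modulo routing like this, changing the number of shards moves most ids to a different shard, which is one reason to leave this to SolrCloud.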
You could, in principle, get by (at least at the beginning of your cluster's growth) with the system you have developed. But think about replication, adding load balancers, and external cache servers (such as Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to hash-based indexing and hence searching. If you want to partition your data logically (say, by date), at this point there is no way to do this other than writing custom code. There is some work planned around this, though.

Resources