I am trying to set up Solr for use with a Postgres database that I access through the Flask-SQLAlchemy ORM. I found the pysolr library for this purpose, but it is not clear how to set up hooks within the SQLAlchemy models to update the Solr index. Are there any examples?
pysolr suggests inserting documents manually via solr.add, but it's not clear how you would separate indices for different database tables.
After doing some research I came up with the following approach, and I am wondering whether this is the right way to go:
In the ORM models, hook after_insert, after_update, after_delete and after_commit, and insert/update/remove the object data in Solr in these events.
To segregate data from different models, use the table name as a prefix in the "id" field of Solr documents: solr_id = db_table_name + db_id.
When you do a search, get all the results, manually filter those matching the required db table, extract the ids, look up the db rows by those ids, and use those db results.
Is there a better way to go about doing this? Thanks.
SQLAlchemy and Solr are structured differently. I think a better solution is to implement a script that synchronizes the data, and to run it periodically, maybe every 30 minutes or every hour, to pick up new data.
Binding the insert/update/remove/commit mechanisms into the model isn't a good approach, because if your Solr service has any problems, the parts of your website that access the database will be affected too. Keep different services independent.
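A minimal sketch of such a sync script, in Python. It also reuses the table-name prefix idea from the question. Everything named here is an assumption for illustration: the `doc_type` and `db_id` fields (which would have to exist in your Solr schema), the `:` separator, and the two callables you pass in. With pysolr, `index_docs` could simply be `solr.add`, and `fetch_changed` could be a SQLAlchemy query filtering on an `updated_at` column, if your tables have one.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, List


def solr_id(table: str, pk: int) -> str:
    """Build a Solr id like 'users:7' (the ':' separator is an assumption)."""
    return f"{table}:{pk}"


def to_solr_doc(table: str, row: dict) -> dict:
    """Turn one database row (as a dict with an 'id' key) into a Solr document."""
    doc = {key: value for key, value in row.items() if key != "id"}
    doc["id"] = solr_id(table, row["id"])
    # Hypothetical extra fields: let Solr filter by table and map back to the row.
    doc["doc_type"] = table
    doc["db_id"] = row["id"]
    return doc


def sync_table(
    table: str,
    fetch_changed: Callable[[datetime], Iterable[dict]],
    index_docs: Callable[[List[dict]], object],
    last_sync: datetime,
) -> datetime:
    """Index rows changed since last_sync; return the timestamp for the next run."""
    # Take the timestamp before querying so rows changed mid-run are not missed.
    started = datetime.now(timezone.utc)
    docs = [to_solr_doc(table, row) for row in fetch_changed(last_sync)]
    if docs:
        index_docs(docs)
    return started
```

With a separate `doc_type` field you would not need to fetch all results and filter by hand: a Solr filter query such as `fq=doc_type:users` restricts the search to one table, and `db_id` gives you the keys to look up in Postgres. Run `sync_table` from cron or a scheduler, and store the returned timestamp between runs. Deleted rows are not handled by this sketch.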
Related
We are using Spring Data MongoDB to connect to a sharded Azure CosmosDB instance. We currently face the issue that the default SimpleMongoRepository implementation does not seem to support specifying a shard key that is then used in the query section of the update command sent to MongoDB (or CosmosDB in our case). Unlike MongoDB, CosmosDB requires the shard key in every query hitting a sharded collection. MongoDB only recommends specifying it.
Anyway, we have not yet found a way to manipulate the save operation so that it also uses the shard key in the query section of the update command. Implementing a custom repository seems to be tricky, since most of the classes we would need for that are private or package-private.
Does anyone have experience with this or is in a similar situation?
We have a Cloudant database on Bluemix that contains a large number of documents that are answer units built by the Document Conversion service. These answer units are used to populate a Solr Retrieve and Rank collection for our application. The Cloudant database serves as our system of record for the answer units.
For reasons that are unimportant, our Cloudant database is no longer valid. What we need is a way to download everything from the Solr collection and re-create the Cloudant database. Can anyone tell me a way to do that?
I'm not aware of any automated way to do this.
You'll need to fetch all your documents from Solr and add them into Cloudant. Assuming you have a lot of them, do the fetch in a paginated way; there are some examples of how to do this in the Solr documentation.
Note that you'll only be able to do this for the fields that you have set to be stored in your schema. If there are important fields that you need in Cloudant that you haven't got stored in Solr, then you might be stuck. :(
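A rough sketch of the paginated fetch in Python, using Solr's cursor-based paging (`cursorMark`). Whether your Retrieve and Rank collection accepts cursor queries is an assumption you would need to check. `fetch_page` is a hypothetical callable you supply: it should run a query such as `q=*:*` sorted on the unique key with the given `cursorMark` and `rows`, and return the documents plus the `nextCursorMark` from the response. `chunks` is a small helper for grouping documents before a bulk upload.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# (cursor_mark, rows) -> (docs, next_cursor_mark); you provide this function.
FetchPage = Callable[[str, int], Tuple[List[dict], str]]


def iter_all_docs(fetch_page: FetchPage, rows: int = 500) -> Iterator[dict]:
    """Yield every document, following cursor marks until they stop changing."""
    cursor = "*"  # Solr's starting cursor mark
    while True:
        docs, next_cursor = fetch_page(cursor, rows)
        yield from docs
        # Solr signals the end by returning the same cursor mark it was given.
        if next_cursor == cursor:
            break
        cursor = next_cursor


def chunks(items: Iterable, size: int) -> Iterator[list]:
    """Group items into lists of at most `size`, e.g. for bulk uploads."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch from `chunks(iter_all_docs(fetch_page), 1000)` could then be sent to Cloudant's `_bulk_docs` endpoint. You would probably want to drop Solr-internal fields such as `_version_` first.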
You can replicate one Cloudant database to another, which will create an exact replica.
Another technique is to use a tool such as couchbackup which takes a copy of your database's documents (ignoring any deletions) and allows you to save the data in a text file. You can then use the couchrestore tool to upload the data file to a new database.
See this blog for more details.
Is Solr just for searching, i.e. it's not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at Solr as an alternative option, I see you make your queries through HTTP requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding SOLR, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data in files on disk and does not provide the features of a relational database (ACID, nested entities, etc.).
Usually, the model followed is to use a relational database for your data management.
Replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.
I'm building a web application that will essentially allow 'admins' to create forms with any number and combination of form elements (checkboxes, combo-boxes, text-fields, date-fields, radio-groups, etc). 'Users' will log into this application and complete the forms that admins create.
We're using MySQL for our database. Has anyone implemented an application with this type of functionality? My first thought is to serialize the form schema as JSON and store it as a field in one of my database tables, and then serialize the submissions from different users and also store them in a MySQL table. Another thought: is this something that a NoSQL database such as MongoDB would be suited for?
Yes, a document-oriented database such as MongoDB, CouchDB, or Solr could do this. Each instance of a form or a form response could have a potentially different set of fields. You can also index fields, and they'll help you query for documents if they contain that respective field.
Another solution I've seen for implementing this in an SQL database is the one described in How FriendFeed uses MySQL to store schema-less data.
It's basically like your idea of storing semi-structured data as serialized JSON, but you also create another table for each distinct form field you want to index. Then you can do quick indexed lookups and join back to the row where the serialized JSON data includes the field/value pair you're looking for.
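Here is a small sketch of that pattern in Python. It uses the standard library's sqlite3 as a stand-in for MySQL so that it runs anywhere. The table names, the column names, and the choice of `email` as the indexed field are all made up for illustration.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """One table for the JSON blobs, one index table per searchable field."""
    conn.executescript(
        """
        CREATE TABLE submissions (
            id   INTEGER PRIMARY KEY,
            body TEXT NOT NULL              -- serialized JSON submission
        );
        CREATE TABLE index_email (
            email         TEXT NOT NULL,
            submission_id INTEGER NOT NULL REFERENCES submissions(id)
        );
        CREATE INDEX idx_index_email ON index_email (email);
        """
    )


def save_submission(conn: sqlite3.Connection, data: dict) -> int:
    """Store the submission as JSON and index its email field, if present."""
    cur = conn.execute(
        "INSERT INTO submissions (body) VALUES (?)", (json.dumps(data),)
    )
    submission_id = cur.lastrowid
    if "email" in data:
        conn.execute(
            "INSERT INTO index_email (email, submission_id) VALUES (?, ?)",
            (data["email"], submission_id),
        )
    conn.commit()
    return submission_id


def find_by_email(conn: sqlite3.Connection, email: str) -> list:
    """Indexed lookup on the side table, then join back to the JSON rows."""
    rows = conn.execute(
        "SELECT s.body FROM index_email AS i "
        "JOIN submissions AS s ON s.id = i.submission_id "
        "WHERE i.email = ? ORDER BY s.id",
        (email,),
    )
    return [json.loads(body) for (body,) in rows]
```

Submissions without the indexed field are still stored; they just get no row in the index table. Adding a new searchable field means adding another index table and backfilling it from the stored JSON.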
Where do I find a how-to for setting up Elasticsearch with Postgres?
My field sizes will be about 350mb, yes, MB, each in size. I have a
text output of all of the US Code and all decisions from all the courts,
the Statutes at Large, pretty much everything you would find in a library,
and I need to be able to do full text searches and return the exact point
in the field to the app to return the exact page in PDF form. Postgres
can easily handle the datastore, but I've never used Elasticsearch and
have no idea of how it integrates into the indexing, etc.
As of 2015, there's ZomboDB (https://github.com/zombodb/zombodb). As the author, I'm a bit biased, but it's quite powerful. ;)
It's a Postgres extension and Elasticsearch plugin that allows you to "CREATE INDEX"s that use a remote Elasticsearch cluster, and it exposes a fairly powerful query language for performing full-text searches.
Because it's an actual index in Postgres, the ES cluster is automatically synchronized as you INSERT/UPDATE/DELETE records. As such, there's no need for asynchronous synchronization processes.
Additionally, because it's an actual index, it is transaction-safe, which means concurrent Postgres sessions will only see results that are consistent with their current transaction.
Here's a link to ZomboDB's tutorial. It should give you an idea of how easy ZomboDB is to use.
There is an application that you can use to import SQL Server, Oracle, PostgreSQL, MySQL, etc. into an Elasticsearch index.
http://code.google.com/p/ogr2elasticsearch/
Please let me know if you have any trouble building or using it. ~Adam
You can explore using pgsync.
PGSync is an open-source middleware (written in python) for syncing data from Postgres to Elasticsearch effortlessly. It allows you to keep Postgres as your source of truth and expose structured denormalized documents in Elasticsearch.
GitHub link: https://github.com/toluaina/pgsync
It's possible to insert/update/delete Postgres data in Elasticsearch without any middleware other than the pgsql_http extension. Using triggers, you can get pretty much real-time index updates.
You can also query Elasticsearch and use the results within Postgres to do joins, etc. with other tables/data in your database.
See the elasticsearch examples: https://github.com/sysadminmike/pgsql-http_examples