Is it possible to append data to existing SOLR document based on a field value?

Currently, I have two databases that share only one field. I need to append the data from one database into the documents generated by the other, but the mapping is one-to-many, such that multiple documents will have the new data appended to them. Is this possible in Solr? I've read about nested documents; however, in this case the "child" documents would be shared by many "parent" documents.
Thank you.

I see two main options:
1. You can write some client code using SolrJ that reads all the data needed for a given doc from all data sources (doing a SQL join, looking up a separate db, whatever), and then writes the doc to Solr. Of course, you can (should) do this in batches if you can.
2. You can index the first DB into Solr (using DIH if it's doable, so it's quick to develop). It is important that you store all fields (or use docValues) so you can get all your data back later. Then you write some client code (see the sketch below) that:
a) retrieves all data about a doc
b) gets all data that must be added from the other DB
c) builds a new representation of the doc (with child docs if needed)
d) updates the doc, overwriting it
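A minimal SolrJ sketch of the second approach (the collection name, field names and the lookupFromSecondDb helper are made up for illustration; only the SolrJ calls themselves are real):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AppendSharedData {
    public static void main(String[] args) throws Exception {
        // hypothetical collection name
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            // a) retrieve the existing doc (works because all fields are stored / have docValues)
            SolrDocument existing = solr.getById("doc-42");

            // b) get the data that must be added from the other DB (stubbed here)
            String extra = lookupFromSecondDb((String) existing.getFieldValue("shared_key"));

            // c) build a new representation of the doc from the stored fields
            SolrInputDocument updated = new SolrInputDocument();
            for (String field : existing.getFieldNames()) {
                if (!"_version_".equals(field)) {
                    updated.setField(field, existing.getFieldValue(field));
                }
            }
            updated.setField("extra_data", extra);

            // d) re-index: the same uniqueKey overwrites the old doc
            solr.add(updated);
            solr.commit();
        }
    }

    private static String lookupFromSecondDb(String key) {
        return "value for " + key; // stand-in for the real DB lookup
    }
}

For real data volumes you would do the fetch/update loop in batches rather than one document at a time.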

Related

Can we use Elastic Search as Backend Data Store by replacing Postgres?

We are using Postgres to store and work with app data. The app data mainly consists of:
We need to store the incoming request JSON after processing it.
We need to search a particular JSON document using its identifier field, for which we are creating a separate column for each row in the table.
Clients may require searching the JSON column, i.e. retrieving a JSON document based on a certain key/value inside the JSON.
All of this is fine at present with Postgres, but I was reading a blog article which mentioned that Elasticsearch can also be used as a backend data store, not just as a search server. If that is so, can we replace Postgres with Elasticsearch? What advantages would I get by doing this, what are the pros of Postgres compared with Elasticsearch for my case, and what are the cons?
Can anyone give some advice, please?
Responding to the questions one by one:
We need to store the incoming request JSON after processing it.
Yes and no. Elasticsearch allows you to store JSON objects. This works if the JSON structure is known beforehand and/or is stable (i.e. the same keys in the JSON always have the same type).
By default the mapping (i.e. the schema of the collection) is dynamic, meaning the schema is inferred from the values inserted. Say we insert this document:
{"amount": 1.5} <-- insert succeeds
And immediately after try to insert this one:
{"amount": {"value" 1.5, "currency": "EUR"]} <-- insert fails
ES will reply with an error message:
Current token (START_OBJECT) not numeric, can not use numeric value accessors\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@757a68a8; line: 1, column: 13]
If you have JSON objects of unknown structure you can still store them in ES by using the object type and setting enabled: false on the property; this will not allow you to run any kind of query on the content of such a field, though.
We need to search a particular JSON document using its identifier field, for which we are creating a separate column for each row in the table.
Yes. This can be done with a field of type keyword if the identifier is an arbitrary string, or integer if it is an integer.
Clients may require searching the JSON column, i.e. retrieving a JSON document based on a certain key/value inside the JSON.
As per 1), yes and no. If the JSON schema is known and strict, it can be done. If the JSON structure is arbitrary, it can be stored but will not be queryable.
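A minimal sketch of a mapping covering points 1) and 2), assuming the Elasticsearch 7.x Java high-level REST client (the index and field names are made up for illustration):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class CreateRequestsIndex {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            CreateIndexRequest request = new CreateIndexRequest("requests");
            // identifier: exact-match searches; payload: stored as-is but not queryable
            request.mapping(
                "{ \"properties\": {"
                + "    \"identifier\": { \"type\": \"keyword\" },"
                + "    \"payload\":    { \"type\": \"object\", \"enabled\": false }"
                + "  } }",
                XContentType.JSON);
            client.indices().create(request, RequestOptions.DEFAULT);
        }
    }
}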
Though I would say Elasticsearch is not suitable for your case, there are people building JDBC and ODBC drivers for Elasticsearch, so apparently in some cases it can be used as a relational database.
Elasticsearch is an HTTP wrapper around Apache Lucene. Lucene stores objects in a columnar fashion (Lucene segments) in order to speed up search.
I am completing Nikolay's very good answer:
The good:
Both Lucene and Elasticsearch are solid projects
Elasticsearch is (in my opinion) the best and easiest software for clustering (sharding and replication)
Handles version conflicts (https://www.elastic.co/guide/en/elasticsearch/guide/current/concurrency-solutions.html)
The bad:
Not real-time, only near real-time (https://www.elastic.co/guide/en/elasticsearch/guide/current/near-real-time.html)
No support for ACID transactions (changes to individual documents are ACIDic, but not changes involving multiple documents)
Slow to retrieve large amounts of data (you must use search scrolling, which is very slow compared to a SQL database fetch)
No authentication or access control
My opinion is to use Elasticsearch as a kind of view of your database, with read-only access.

How to read documents while adding other documents in Mongoose?

My app receives a new document and saves it to the database at any time. How can I make sure there will not be a conflict in the worst case, where I read existing documents while a new document is being saved?
I don't think you should have a problem, as insert and update operations are atomic in MongoDB.
from the docs:
In MongoDB, a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document.

Create index of nested documents in SOLR

How should I import nested entities from a DB into a Solr index? For some reasons I don't want to flatten the documents into a single one. What should I write in schema.xml and data-config.xml? I'm using Solr 4.10.
The currently distributed version of the DataImportHandler does not support nested documents (or BlockJoins as they're called in Solr/Lucene).
There is however a patch available that you can try out - be sure to follow the discussion on JIRA (SOLR-5147) about how to use it and where it goes in the future.
Since you can't use the DataImportHandler, you could write custom code to do this. I'd recommend using SolrJ to load childDocuments. To handle childDocuments, first you have to create all of your required fields (for all of your different record types) in your schema.xml (or use dynamic fields). From there, you can create a SolrInputDocument for the parent and a SolrInputDocument for the child, and then call addChildDocument(doc) on the parent SolrInputDocument to add the child to it.
I'd also recommend creating a field that indicates what level you're at - something like "content_type" that you fill in with "parent" or "root", or whatever works for you. Then, once you've loaded the records, you can use Block Join queries to search hierarchically. Be aware that doing this will create an entry for each record, though, and if you do a q=*:* query, you'll get all of your records intermixed with each other.
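A minimal SolrJ sketch of loading a parent with a child document (the collection and field names are made up for illustration):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexNestedDocs {
    public static void main(String[] args) throws Exception {
        // hypothetical collection name
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrInputDocument parent = new SolrInputDocument();
            parent.setField("id", "product-1");
            parent.setField("content_type", "parent");
            parent.setField("name", "Gaming laptop");

            SolrInputDocument child = new SolrInputDocument();
            child.setField("id", "product-1-sku-1");
            child.setField("content_type", "child");
            child.setField("color", "black");

            // attaches the child so Solr indexes it as a block together with the parent
            parent.addChildDocument(child);

            solr.add(parent);
            solr.commit();
        }
    }
}

A Block Join parent query such as q={!parent which="content_type:parent"}color:black would then return the parent of the matching child.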

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
Solr 4 does allow this, but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that Solr can achieve this? We've tried Solr joins in the past, but they weren't the right fit for all our use cases.
2. Alternatively, can Elasticsearch, LinkedIn's SenseiDB, or other text search engines achieve this?
For now, we manage by re-indexing the affected documents whenever they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: it never goes back to modify existing ones, so it only marks documents as deleted and deletes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that's able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
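A minimal SolrJ sketch of such a partial ("atomic") update in Solr 4+, assuming the affected fields are stored (the collection, id and field names are made up for illustration):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdate {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", "doc-42");
            // "set" replaces the value of an existing field
            doc.setField("status", Collections.singletonMap("set", "reviewed"));
            // "add" appends a value to a multi-valued field (i.e. adds new content)
            doc.setField("tags", Collections.singletonMap("add", "archive"));
            solr.add(doc);   // Solr rebuilds the full document from its stored fields
            solr.commit();
        }
    }
}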
Elasticsearch works around it by storing the source document by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place while indexing. This is, by the way, one of the features that make Elasticsearch similar to NoSQL databases. The Elasticsearch Update API allows you to update a document in two ways:
Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this feature.
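A minimal sketch of the first option (partial-document merge), assuming the Elasticsearch 7.x Java high-level REST client; the index, id and field names are made up for illustration:

import java.util.Collections;
import org.apache.http.HttpHost;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class PartialUpdate {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // merge this partial document into the stored _source, then reindex the result
            UpdateRequest request = new UpdateRequest("docs", "doc-42")
                    .doc(Collections.singletonMap("status", "reviewed"));
            client.update(request, RequestOptions.DEFAULT);
        }
    }
}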

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When a user updates the custom metadata, the document index has to be updated to reflect the metadata changes in search.
But during the index update, even though the content of the file has not changed, it is re-indexed as well, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid content indexing and update just the metadata.
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in one index, and documentId and custom metadata in another? In that case, how can I query these two different indexes and return a combined result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. Its like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two solr indexes are considered as two databases, not two tables, if you had DB-like joins in mind. I just answered on it on How to start and Stop SOLR from A user created windows service .
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
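A minimal SolrJ sketch of this two-documents-per-file approach, assuming a current SolrJ client rather than the Solr 3.3-era API (the core URL and field names follow the "watson" example above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SplitContentAndMetadata {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8080/app").build()) {
            SolrInputDocument contents = new SolrInputDocument();
            contents.setField("id", "docs_contents_watson");
            contents.setField("type", "contents");
            contents.setField("text", "text of the file");

            SolrInputDocument metadata = new SolrInputDocument();
            metadata.setField("id", "docs_metadata_watson");
            metadata.setField("type", "metadata");
            metadata.setField("author", "A J Crown");
            metadata.setField("year", 1984);

            solr.add(contents);
            solr.add(metadata);
            solr.commit();

            // A later metadata change only touches the small metadata document;
            // the same id overwrites the old one and the contents doc is untouched.
            metadata.setField("year", 1985);
            solr.add(metadata);
            solr.commit();

            // metadata search, equivalent to the select URL above
            SolrQuery query = new SolrQuery("type:metadata AND year:1985");
            System.out.println(solr.query(query).getResults());
        }
    }
}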
We did try this and it should work. Take a snapshot of what you have (basically the SolrInputDocument object) before you send it to Lucene. Serialize and compress the object, then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed the document to Lucene.
Don't forget to also store, as one of the fields inside the SolrInputDocument, the XML containing the text extracted by Tika, which is used for search/indexing.
The only negative: your index size will grow a little, but you get what you want without re-feeding the data.
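A minimal sketch of the snapshot idea, assuming a current SolrJ client; here the snapshot is a serialized, gzipped map of field values stored in a hypothetical binary field called "snapshot" (the core URL and field names are made up for illustration):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SnapshotField {
    // serialize and gzip the field values so they fit in a single binary field
    static byte[] toSnapshot(HashMap<String, Object> fields) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(fields);
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, Object> fromSnapshot(byte[] snapshot) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(snapshot)))) {
            return (HashMap<String, Object>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8080/app").build()) {
            HashMap<String, Object> fields = new HashMap<>();
            fields.put("id", "docs_metadata_watson");
            fields.put("author", "A J Crown");
            fields.put("year", 1984);

            SolrInputDocument doc = new SolrInputDocument();
            fields.forEach(doc::setField);
            doc.setField("snapshot", toSnapshot(fields));  // byte[] goes into the binary field
            solr.add(doc);
            solr.commit();
        }
    }
}

To update later, read the stored "snapshot" field back, run it through fromSnapshot, change the values you need, and re-feed the rebuilt document.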
