Create index of nested documents in SOLR - solr

How should I import nested entities from a DB into a Solr index? For several reasons I don't want to flatten the documents into a single one. What should I write in schema.xml and data-config.xml? I'm using Solr 4.10.

The currently distributed version of the DataImportHandler does not support nested documents (or BlockJoins as they're called in Solr/Lucene).
There is however a patch available that you can try out - be sure to follow the discussion on JIRA (SOLR-5147) about how to use it and where it goes in the future.

Since you can't use the DataImportHandler, you could write custom code to do this. I'd recommend using SolrJ to load childDocuments. To handle childDocuments, first you have to create all of your required fields (for all of your different record types) in your schema.xml (or use dynamic fields). From there, you can create a SolrInputDocument for the parent, and a SolrInputDocument for the child, and then call addChildDocument(doc) on the parent SolrInputDocument to add the child to it.
I'd also recommend creating a field that indicates what level you're at, something like "content_type" that you fill in with "parent" or "root," or whatever works for you. Then, once you've loaded the records, you can use the Block Join queries to search hierarchically. Be aware that doing this will create an entry for each record, though, and if you do a q=*:* query, you'll get all of your records intermixed with each other.
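To make the parent/child layout concrete, here is a minimal sketch that builds the same structure as a Solr XML update message (a nested <doc> inside the parent <doc>) rather than through SolrJ, which is a swapped-in technique for illustration. The field names id, content_type, sku and customer are hypothetical, and the helper names are mine, not part of any Solr client.

```python
import xml.etree.ElementTree as ET


def make_doc(fields, children=()):
    """Build a <doc> element; child documents are nested <doc> elements."""
    doc = ET.Element("doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    for child in children:
        doc.append(child)  # children go inside the parent <doc>
    return doc


def build_add_message(docs):
    """Wrap documents in an <add> element and return the XML as a string."""
    add = ET.Element("add")
    for doc in docs:
        add.append(doc)
    return ET.tostring(add, encoding="unicode")


def parent_query(parent_filter, child_query):
    """Block Join query string: return parents whose children match child_query."""
    return '{!parent which="%s"}%s' % (parent_filter, child_query)


if __name__ == "__main__":
    # Hypothetical records: one order (parent) with one line item (child).
    item = make_doc({"id": "order-1-item-1", "content_type": "child", "sku": "A100"})
    order = make_doc({"id": "order-1", "content_type": "parent", "customer": "acme"}, [item])
    print(build_add_message([order]))
    print(parent_query("content_type:parent", "sku:A100"))
```

The resulting string would be posted to the core's update handler. The parent filter in the query must match all parent documents and nothing else, which is why a level field such as content_type is useful.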

Related

Is it possible to append data to existing SOLR document based on a field value?

Currently, I have two databases that share only one field. I need to append the data from one database into the document generated by the other, but the mapping is one to many, such that multiple documents will have the new data appended to it. Is this possible in SOLR? I've read about nested documents, however, in this case the "child" documents would be shared by many "parent" documents.
Thank you.
I see two main options:
1. You can write some client code using SolrJ that reads all the data needed for a given doc from all data sources (doing a SQL join, looking up a separate DB, whatever), and then writes the doc to Solr. Of course, you can (and should) do this in batches if possible.
2. You can index the first DB into Solr (using DIH if it's doable, so it's quick to develop). It is important that you store all fields (or use docValues) so you can get all your data back later. Then you write some client code that:
a) retrieves all data about a doc
b) gets all data that must be added from the other DB
c) builds a new representation of the doc (with child docs if needed)
d) updates the doc, overwriting it
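Steps a) to d) boil down to a read-merge-overwrite loop. A minimal sketch of the merge and batching parts, assuming documents are handled as plain dicts; the function names are hypothetical, and dropping _version_ is my assumption so that the overwrite is not subject to Solr's optimistic concurrency check.

```python
def merge_for_reindex(stored_doc, extra_fields):
    """Combine a doc read back from Solr with fields from the other DB.

    The returned dict is a full replacement document, ready to be sent
    to Solr so that it overwrites the existing one (same unique key).
    """
    # Assumption: drop Solr's internal _version_ field before resending.
    merged = {k: v for k, v in stored_doc.items() if k != "_version_"}
    merged.update(extra_fields)
    return merged


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    stored = {"id": "42", "title": "widget", "_version_": 1234567890}
    print(merge_for_reindex(stored, {"supplier": "acme"}))
    print(list(batches(range(5), 2)))
```

This only works if every field you care about comes back from Solr, which is why the stored/docValues requirement in option 2 matters.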

Solr suggester filter at build time

Is it possible to create a suggestion dictionary using a pre-filtered query? In other words, is it possible to create a suggestion dictionary based on a subset of an existing index instead of the whole index?
This is a feature very similar to context filtering, but beforehand.
In my case, I'd like to create a suggestion dictionary using semi-public data only (not owned by a tenant) and language, from my main index.
No, that is not possible right now. I see 3 possible ways to do what you want:
1. You create a new collection from scratch, indexing from the original source only the subset you want into a field, and use that field with DocumentDictionaryFactory for the suggester.
2. Again, you create a new index, but index it off the existing collection, either using DIH and the SolrEntityProcessor or with Streaming expressions.
3. You create your custom FilteredDocumentDictionaryFactory that does what you need and plug it into Solr.

sorting fields inside a document in solr

I am working with a solr index that I have not made. I only have access to the solr admin.
Each document returned by the queries I run in the Solr admin has around 40 fields. These fields are not sorted alphabetically.
Now my question is can I sort them somehow in the solr admin?
If I can not, I have the opportunity to import that index locally in my dev machine. I also have access to the config (solr config, data import config etc) files.
Is it possible to do some magic in any of those config files and import locally which will sort them alphabetically?
No, neither Lucene nor Solr guarantees the order of the fields returned (the order of values inside a multi-valued field is, however, guaranteed).
You might expect to get the order you want by listing the fields explicitly in the fl parameter, but you won't: fl returns fields in the same order as in the document. It would also require maintaining a long list of fields to be returned.
It's usually better to ask why you'd need the order of the fields to be maintained. The data returned from Solr is usually not meant for the user directly, and should be processed in your controller / view layer to suit the use case.
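If alphabetical order is only needed for display, sorting on the client side is a one-liner. A minimal sketch, assuming each document from a JSON response has been parsed into a dict (the sample field names are made up):

```python
def sort_fields(doc):
    """Return a copy of the document with its fields in alphabetical order."""
    # dicts preserve insertion order in Python 3.7+, so this ordering sticks.
    return dict(sorted(doc.items()))


if __name__ == "__main__":
    doc = {"title": "Solr in Action", "author": "someone", "id": "7"}
    for name, value in sort_fields(doc).items():
        print(name, "=", value)
```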
You could return it using the XSLT response writer instead of the XML one. It is usually used to transform XML into a different form, but you could probably use it for an identity transformation with sorting.
I don't think that's the best way forward, but if you are desperate, it is a way.

Partial Update of documents

We have a requirement that documents that we currently index in SOLR may periodically need to be PARTIALLY UPDATED. The updates can either be
a. add new fields
b. update the content of existing fields.
Some of the fields in our schema are stored, others are not.
SOLR 4 does allow this but all the fields must be stored. See Update a new field to existing document and http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/
Questions:
1. Is there a way that SOLR can achieve this? We've tried SOLR JOINs in the past but they weren't the right fit for all our use cases.
2. On the other hand, can Elasticsearch, LinkedIn's SenseiDB or other text search engines achieve this?
For now, we manage by re-indexing the affected documents when they need to be updated.
Thanks
Solr has the limitation of stored fields, that's correct. The underlying Lucene always requires deleting the old document and indexing the new one. In fact, Lucene segments are write-once: Lucene never goes back to modify existing ones, so it only marks documents as deleted and removes them for real when a merge happens.
Search servers on top of Lucene try to work around this problem by exposing a single endpoint that's able to delete the old document and reindex the new one automatically, but there must be a way to retrieve the old document somehow. Solr can do that only if you store all the fields.
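For reference, this is roughly what a Solr 4 atomic update request body looks like in JSON: each changed field carries a modifier such as "set" or "add" instead of a plain value. A minimal sketch that only builds the payload; the unique key name "id" and the sample field names are assumptions, and the helper is hypothetical.

```python
import json


def atomic_update(doc_id, set_fields=None, add_values=None):
    """Build a JSON body for a Solr atomic update of one document.

    set_fields: field -> new value (replaces the existing value)
    add_values: field -> value to append to a multi-valued field
    """
    doc = {"id": doc_id}  # assumes the schema's unique key is "id"
    for name, value in (set_fields or {}).items():
        doc[name] = {"set": value}
    for name, value in (add_values or {}).items():
        doc[name] = {"add": value}
    return json.dumps([doc])


if __name__ == "__main__":
    print(atomic_update("doc-1", {"author": "A J Crown"}, {"tags": "reviewed"}))
```

Solr still rebuilds the whole document from its stored fields behind the scenes, so any field that is not stored is lost after such an update.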
Elasticsearch works around it storing the source documents by default, in a special field called _source. That's exactly the document that you sent to the search engine in the first place, while indexing. This is by the way one of the features that make elasticsearch similar to NoSQL databases. The elasticsearch Update API allows you to update a document in two ways:
1. Sending a new partial document that will be merged with the existing one (still deleting the old one and indexing the result of the merge)
2. Executing a script on the existing document and indexing the result after deleting the old one
Both options rely on the presence of the _source field. Storing the source can be disabled, but if you disable it you of course lose this great feature.
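To illustrate the first option, here is a minimal sketch of the request body for a partial-document update (the partial document goes under a "doc" key) together with a local simulation of the merge. The merge function is my own approximation of the behavior (nested objects merged recursively, everything else replaced), not Elasticsearch code, and the field names are invented.

```python
import json


def partial_update_body(partial_doc):
    """Body for an Update API call that merges partial_doc into the document."""
    return json.dumps({"doc": partial_doc})


def merge_source(source, partial_doc):
    """Approximate the merge: recurse into objects, replace other values."""
    merged = dict(source)
    for key, value in partial_doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_source(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    source = {"title": "report", "meta": {"author": "ann"}}
    partial = {"meta": {"year": 1984}, "status": "final"}
    print(partial_update_body(partial))
    print(merge_source(source, partial))
```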

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, the user can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When the user updates the custom metadata, the document index has to be updated to reflect the metadata changes in the search.
But during the index update, even though the content of the file has not changed, it is also re-indexed, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid content indexing and update just the metadata?
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in index1, and documentId and custom metadata in another index? In that case, how can I query across these two different indexes and return the result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered on this in How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents AND text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata AND year:1984
Note the type:xx clause, which restricts each search to one kind of document.
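When building such URLs in code, the query needs URL encoding (spaces, quotes, colons). A minimal sketch using only the standard library; it puts the type restriction in an fq filter query rather than in q, which is my choice for the example, and the base URL is simply the one from the examples above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/app/select"  # taken from the examples above


def select_url(doc_type, query):
    """Build a select URL that searches only documents of the given type."""
    params = {
        "q": query,                   # the user-facing query
        "fq": "type:%s" % doc_type,   # restrict to contents or metadata docs
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(select_url("contents", 'text:"on a dark lonely night"'))
    print(select_url("metadata", "year:1984"))
```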
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Serialize and compress the object, then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append or update the values of the fields you are interested in, and re-feed it to Lucene.
Remember to keep, as one of the fields inside the SolrInputDocument, the XML containing the text extracted by Tika, since that text is what is used for search and indexing.
The only negative: Your index size will grow a little bit but you will get what you want without re-feeding the data.
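The snapshot idea can be sketched without Java serialization. The version below swaps in JSON plus zlib plus Base64 (Solr binary fields accept Base64-encoded strings) and treats the document as a plain dict; the field name "snapshot" and all function names are hypothetical.

```python
import base64
import json
import zlib


def pack_snapshot(doc):
    """Serialize, compress and Base64-encode a document for a binary field."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def unpack_snapshot(blob):
    """Reverse of pack_snapshot: Base64-decode, decompress, parse."""
    return json.loads(zlib.decompress(base64.b64decode(blob)).decode("utf-8"))


def update_with_snapshot(blob, changes, snapshot_field="snapshot"):
    """Rebuild the full document from its snapshot and apply metadata changes."""
    doc = unpack_snapshot(blob)
    doc.update(changes)
    # Re-snapshot the updated fields (the snapshot never contains itself).
    doc[snapshot_field] = pack_snapshot(doc)
    return doc


if __name__ == "__main__":
    original = {"id": "watson", "author": "A J Crown", "text": "text of the file"}
    blob = pack_snapshot(original)
    updated = update_with_snapshot(blob, {"author": "B K Smith"})
    print(updated["author"], len(updated["snapshot"]))
```

The updated dict is what would be re-fed to Solr. The extracted text travels inside the snapshot, so the original file never has to be parsed again.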

Resources