Is there a better way to represent provenance on a field level in SOLR?

I have documents in SOLR which consist of fields where the values come from different source systems. The reason why I am doing this is because this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use join with multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to prefix each dynamic field name with an 8-character hash representing the source system. That way, different sources can share common field names after the unique source prefix, and I can easily clear out all fields that start with a given source prefix, if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
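To make the proposal concrete, here is a minimal Python sketch of the prefixing idea. The helper names and the choice of SHA-1 are assumptions for illustration only; the part that comes from Solr itself is that an atomic update which sets a field to null removes that field.

```python
import hashlib


def source_prefix(source_name: str) -> str:
    """Return an 8-character hex prefix derived from the source system name."""
    return hashlib.sha1(source_name.encode("utf-8")).hexdigest()[:8]


def prefixed_field(source_name: str, field: str) -> str:
    """Build a dynamic string field name such as '<prefix>_color_s'."""
    return f"{source_prefix(source_name)}_{field}_s"


def clear_source_update(doc: dict, source_name: str) -> dict:
    """Build an atomic update that removes every field owned by one source.

    {"set": None} serializes to {"set": null}, which asks Solr to drop the field.
    """
    prefix = source_prefix(source_name) + "_"
    update = {"id": doc["id"]}
    for name in doc:
        if name.startswith(prefix):
            update[name] = {"set": None}
    return update
```

One thing to watch: a hex hash can start with a digit, and Solr's documentation advises against field names that start with a digit, so putting a fixed letter in front of the hash may be safer.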

In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to pull in a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to keep up when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, and that code can be specific to the indexing process. Usually the subsystems will be performant enough (at least if you do batch retrieval for the documents that are being updated).
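As a rough sketch of that flow (not a definitive implementation), the function below rebuilds a complete document from every source each time and keeps an optional cache in front of the fetchers. The fetcher callables, the source names and the in-memory dict standing in for SQL or memcached are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# A fetcher takes an entity id and returns that source's fields (hypothetical).
Fetcher = Callable[[str], Dict[str, str]]


def build_document(
    entity_id: str,
    base_fields: Dict[str, str],
    fetchers: Dict[str, Fetcher],
    cache: Optional[Dict[Tuple[str, str], Dict[str, str]]] = None,
) -> Dict[str, str]:
    """Rebuild the full Solr document from all source systems."""
    if cache is None:
        cache = {}
    doc = {"id": entity_id, **base_fields}
    for source, fetch in fetchers.items():
        key = (source, entity_id)
        if key not in cache:
            # Only hit the source system when the cache has no entry.
            cache[key] = fetch(entity_id)
        doc.update(cache[key])
    return doc
```

Because the whole document is rebuilt, a field that disappeared from a source simply isn't in the new version, so no per-field removal logic is needed.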

Related

How to store nested relational data in Solr

I'm trying to store data within Solr so that I can best maintain the indexes. The problem I'm having is that my data structure is heavily nested. Example:
Company
(to many) Person
(to many) Property
(to many) Network
(to many) SubNetwork
I'm trying to create a full text search index for each SubNetwork that will display the current parent fields alongside it.
Currently my data is completely denormalised, e.g:
{
  "company": "Coca-Cola",
  "property": "1 plaza hotel",
  "network": "ABC",
  "subNetwork": "123"
}
Now if a user were to go into the application and change the name of the company, right now (in the denormalized state), that would require Solr to partially update (atomic update) many documents which doesn't feel very efficient. Re-indexing the index isn't a preferred solution as this is a multi tenanted application.
I have tried putting the relational data in separate indexes and then used join within Solr but this does not copy over the joined indexes fields in the final result which means a full text search on all the fields isn't possible.
{!join from=inner_id to=outer_id}field:value
I'm trying to configure Solr in a way that when a parent record is updated, it only requires one atomic update but still retains the ability to search on all fields. Is this possible?
Unless you are seeing performance issues, your initial implementation seems correct, especially if you are returning subnetworks and may be searching on subnetwork and parent values at the same time.
Doing atomic update, under the covers, actually re-indexes the document anyway (and creates a Lucene-level new document). It also requires all fields to be stored to allow recreating the document. And the join reduces the scoring flexibility you can have.
One optimization you could do is to NOT store the parent fields, but keep them index-only. This will be more space-efficient and less disk/record re-hydration work. But then you cannot return those fields to the user and would have to fetch them from the original source instead.
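For reference, this is roughly what the atomic updates for a company rename look like when the data stays denormalized. It is a sketch only: the document ids are made up, "company" is the field from the example document in the question, and each command still causes Solr to rewrite the whole document internally.

```python
import json
from typing import Dict, List


def company_rename_updates(doc_ids: List[str], new_name: str) -> List[Dict]:
    """Build one atomic 'set' command per affected SubNetwork document."""
    return [{"id": doc_id, "company": {"set": new_name}} for doc_id in doc_ids]


# Hypothetical ids of the documents that belong to the renamed company.
payload = json.dumps(company_rename_updates(["sub-123", "sub-124"], "Coca-Cola Co."))
print(payload)
```

The resulting JSON array can be posted to the collection's update handler in a single request, so a rename is one batch call even though it touches many documents.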

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and easiest solution would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed far more often than documents are updated. Even if a value that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant to most users. This assumes that you actually have to search the values and don't just need to display them; in that case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to divide this into a multi-search problem. One collection for blog posts, one collection for users and one collection for categories. You then search through each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to a Solr cluster for you.
The reason why I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure, so flattening lets you fully leverage the features available. Since Lucene by design is a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the correct queries and feature set you want (if possible). If the documents are flat, it just works.
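A minimal sketch of the flattening approach, assuming blog, user and category records are plain dicts; every field name here (user_id, category_id, published, user_name, category_name) is a hypothetical choice, not something from the question. The second function orders the affected posts newest first, as suggested above for reindexing after a user rename.

```python
from typing import Dict, List


def flatten_blog(blog: Dict, users: Dict[str, Dict], categories: Dict[str, Dict]) -> Dict:
    """Copy the searchable parent values onto the blog document."""
    user = users[blog["user_id"]]
    category = categories[blog["category_id"]]
    return {
        "id": blog["id"],
        "title": blog["title"],
        "content": blog["content"],
        "published": blog["published"],
        "user_id": blog["user_id"],
        "user_name": user["name"],
        "category_name": category["name"],
    }


def reindex_docs_for_user(
    user_id: str, blogs: List[Dict], users: Dict[str, Dict], categories: Dict[str, Dict]
) -> List[Dict]:
    """Return flattened documents for one user's posts, most recent first."""
    affected = [b for b in blogs if b["user_id"] == user_id]
    affected.sort(key=lambda b: b["published"], reverse=True)
    return [flatten_blog(b, users, categories) for b in affected]
```

Sending the returned list to Solr in order means the posts people are most likely to find are corrected first, while older posts catch up in the background.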

Solr equivalent to ElasticSearch Mapping Type

ElasticSearch has Mapping Types to, according to the docs:
Mapping types are a way to divide the documents in an index into
logical groups. Think of it as tables in a database.
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't. So, is there an equivalent?
Or, maybe a better question,
If I have a multiple document types and I want to limit searches to a certain document type, which one should offer a better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore solr:
http://wiki.apache.org/solr/CoreAdmin
Maybe something has changed since Solr 4.0 and it's easier now; I haven't looked at it since I switched to Elasticsearch. Personally I find the Elasticsearch index/type system much better than that.
In Solr 4+:
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, in your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object. Of course, that means that all those single-type-only fields cannot be required.
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't.
You can do exactly this in Solr. Add a field and use it to filter.
It is correct that Mapping Types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all Mapping Types. So technically it makes no difference. In fact the Mapping Type is mapped to an internal schema field.
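A small sketch of the "add a field and filter on it" approach, built as a plain request URL. The fq (filter query) and fl (field list) parameters are standard Solr query parameters; the field name "type", the base URL and the example fields are assumptions.

```python
from typing import List
from urllib.parse import urlencode


def typed_search_url(base_url: str, query: str, doc_type: str, fields: List[str]) -> str:
    """Build a /select URL restricted to one document type."""
    params = [
        ("q", query),
        ("fq", f"type:{doc_type}"),  # differentiator field used as a filter
        ("fl", ",".join(fields)),    # return only the fields relevant to this type
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core URL and field names.
print(typed_search_url("http://localhost:8983/solr/docs", "title:solr", "blog", ["id", "title"]))
```

Keeping the type restriction in fq rather than q leaves scoring untouched and lets Solr cache the filter separately.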

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr3.3.
I have requirement where documents along with content and their custom metadata would be indexed initially. After the documents are indexed and made available for searching, user can change the custom metadata of the documents. However once the document is added to index the content of the document cannot be updated. When the user updates the custom metadata, the document index has to be updated to reflect the metadata changes in the search.
But during index update, even though the content of the file is not changed, it is also indexed and which causes delays in the metadata update.
So I wanted to check if there is a way to avoid content indexing and update just the metadata?
Or do I have to store the content and metadata in separate index files. i.e. documentId, content in index1 and documentId, custom metadata in another index. In that case how I can query onto these two different indexes and return the result?
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered on this in "How to start and Stop SOLR from A user created windows service".
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=text:"on a dark lonely night"&fq=type:contents
To do metadata searches:
http://localhost:8080/app/select?q=year:1984&fq=type:metadata
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We tried this and it should work. Take a snapshot of what you have, basically the SolrInputDocument object, before you send it to Lucene. Compress and serialize the object, then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene.
Never forget to store the XML as one of the fields inside SolrInputDocument that contains the text extracted by TIKA which is used for search/indexing.
The only negative: Your index size will grow a little bit but you will get what you want without re-feeding the data.
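The answer above describes doing this with the Java SolrInputDocument. The sketch below shows the same idea in Python, with JSON plus zlib standing in for Java serialization. The field name snapshot_bin is hypothetical, and the base64 encoding reflects my understanding of how Solr expects binary field values to be sent.

```python
import base64
import json
import zlib
from typing import Dict


def pack_snapshot(fields: Dict) -> str:
    """Serialize and compress the document fields into a base64 string."""
    raw = json.dumps(fields, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def unpack_snapshot(blob: str) -> Dict:
    """Reverse pack_snapshot: decode, decompress and parse the fields."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))


def update_metadata(stored_doc: Dict, changes: Dict) -> Dict:
    """Rebuild a full document from the stored snapshot plus metadata changes."""
    fields = unpack_snapshot(stored_doc["snapshot_bin"])
    fields.update(changes)
    # The new document carries a fresh snapshot for the next update.
    return {**fields, "snapshot_bin": pack_snapshot(fields)}
```

The extracted text travels inside the snapshot, so a metadata change never needs to go back to the original file or run Tika again.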

How do I index different sources in Solr?

How do I index text files, web sites and database in the same Solr schema? All 3 sources are a requirement and I'm trying to figure out how to do it. I did some examples and they're working fine as they're separate from each other, now I need them all to be 1 schema since the user will be searching in all of those 3 data sources.
How should I proceed?
You should sketch up a few notes for each of your content sources:
What meta-data is available
How is the information accessed
How do I want to present the information
Once that is done, determine which meta-data you want to make searchable. Some of it might be very specific to just one of the content sources (such as author on web pages, or any given field in a DB row), while others will be present in all sources (such as unique ID, title, text content). Use copy-fields to consolidate fields as needed.
Meta-data will vary greatly from project to project, but yes -- things like update date, filename, and any structured data you can parse out of the text files will surely help you improve relevance. Beyond that, it varies a lot from case to case. Maybe the file paths hint at a (possibly informal) taxonomy you can use as metadata. Maybe filenames contain metadata themselves (such as year, keyword, product names, etc).
Be prepared to use different fields for different sources when displaying results. A source field goes a long way in terms of creating result tiles -- and it might turn out to be your most used facet.
An alternative (and probably preferred) approach to using copy-fields extensively, is using the DisMax/EDisMax request handlers, to facilitate searching in several fields.
Consider using a mix of copy-fields and (e)dismax. For instance, copy all fields into a catch-all text field that need not be stored, and include it in searches with a low boost value, alongside highly weighted fields (such as title, headings, keywords, or filename). There are a lot of parameters to tweak in dismax, but it's definitely worth the effort.
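A hedged sketch of what such a request might look like. defType, qf, fl, facet and facet.field are real Solr parameters; the field names (title, keywords, filename, text_all, source), the boost values and the base URL are placeholders to tune per project.

```python
from typing import Dict
from urllib.parse import urlencode


def edismax_params(user_query: str) -> Dict[str, str]:
    """Query several weighted fields plus a low-boost catch-all field."""
    return {
        "q": user_query,
        "defType": "edismax",
        # text_all is assumed to be the catch-all copy-field target.
        "qf": "title^5 keywords^3 filename^2 text_all^0.2",
        "fl": "id,title,source",
        "facet": "true",
        "facet.field": "source",  # lets users narrow results by content source
    }


# Hypothetical core URL.
url = "http://localhost:8983/solr/docs/select?" + urlencode(edismax_params("annual report"))
print(url)
```

Starting with a handful of boosted fields and adjusting the weights against real queries is usually easier than tuning everything at once.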

Resources