Excluding bad data models during query - spring-data-mongodb

We have mongo data models that are written by multiple systems; currently, a bug in a different system can corrupt a single document in a collection such that it can no longer be mapped to the correct Java object (for example, a missing _class attribute in a subdocument will cause an instantiation exception). When we then query for all documents in the collection using Java, the entire query fails due to the single bad document.
We would like to use an approach which is tolerant of instantiation exceptions; the intent is for any bad documents to be discarded, while still returning objects for all the documents that can be mapped.
Could you please advise the best approach to achieve this outcome?

I think you should be able to mark this field as @Transient in the entity so that Spring Data ignores it when communicating with MongoDB.
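If the field cannot simply be ignored, another option that matches the stated goal (discard bad documents, keep the rest) is to convert documents one at a time and catch the failure per document. The following is only a minimal sketch: TolerantMapper and mapSkippingFailures are hypothetical names, and the mapper is passed in as a plain function. With Spring Data MongoDB you could presumably read raw org.bson.Document instances from the collection and hand them to the configured converter as the mapper, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Spring Data.
final class TolerantMapper {

    // Maps each raw document individually. A document whose mapping throws
    // is added to 'rejected' and skipped, so one bad document cannot fail
    // the whole result set.
    static <S, T> List<T> mapSkippingFailures(Iterable<S> rawDocs,
                                              Function<S, T> mapper,
                                              List<S> rejected) {
        List<T> mapped = new ArrayList<>();
        for (S raw : rawDocs) {
            try {
                mapped.add(mapper.apply(raw));
            } catch (RuntimeException e) {
                // Assumption: mapping failures surface as unchecked exceptions.
                rejected.add(raw);
            }
        }
        return mapped;
    }
}
```

Collecting the rejected documents rather than silently dropping them makes it possible to log or repair them later.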

Related

Elastic Search: Parent child vs Nested Document

P.S.: We are using Elasticsearch 6.x.
Since Elasticsearch was upgraded, a few breaking changes have appeared. We have some relational data that needs to be managed in either nested or parent/child mode.
Before making a final decision, I have the following questions:
How many nested documents (what array size) can I save in one field?
We have to manipulate the fields often, so what's the recommendation if we use the nested field type?
What are the limitations of parent/child if we use 4 types of relations?
I believe the answers to the above questions can help me decide on the field type. Let me know if there is anything else I should consider.
Thanks in advance
How many nested documents (what array size) can I save in one field?
By default, you can have a maximum of 50 nested fields defined per index. In each of those nested field arrays, you may store any number of elements.
We have to manipulate the fields often, so what's the recommendation if we use the nested field type?
That's where nested fields fall short: whenever a nested document changes, you either have to reindex the whole parent document or figure out via scripting which nested document to update, and that can quickly get quite convoluted.
What are the limitations of parent/child if we use 4 types of relations?
From ES 6.x onwards, you're limited to a single join field per index.
As it stands, it doesn't look like either nested fields or parent/child would work well in your case. Maybe there's another possible design if you are willing to denormalize your data a little more, but it's hard to say without more detailed information about your precise use case.
Choosing Parent/Child vs Nested Document

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside to this is though, that when e.g. a user updates their name, I have to run through all blog posts of them and update the documents for that in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The most performant and easiest solution by far would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed more often than documents are updated. And even if one of the values that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant to most users. This assumes that you actually have to search the values and don't just need to display them; in that case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
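To make the flattening concrete, here is a minimal sketch that copies the searchable parent values into the blog document while also keeping the stable parent ids. The class name BlogFlattener and all field names (user_id, user_name, category_id, category_name) are hypothetical and would need to match your own schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that builds one flat document per blog post.
final class BlogFlattener {

    static Map<String, Object> flatten(String blogId, String title, String content,
                                       String userId, String userName,
                                       String categoryId, String categoryName) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", blogId);
        doc.put("title", title);
        doc.put("content", content);
        // Stable ids let you find every blog post to reindex when a parent changes.
        doc.put("user_id", userId);
        doc.put("category_id", categoryId);
        // Denormalized copies of the parent values that must be searchable.
        doc.put("user_name", userName);
        doc.put("category_name", categoryName);
        return doc;
    }
}
```

When a user is renamed, the affected posts can be found by user_id and rebuilt with the same method, newest first.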
Another option is to divide this into a multi-search problem. One collection for blog posts, one collection for users and one collection for categories. You then search through each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to a Solr cluster for you.
The reason why I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure, and a flat structure lets you fully leverage the features available. Since Lucene by design is a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the correct queries and feature set you want (if possible). If the documents are flat, it just works.

Is there a better way to represent provenance on a field level in SOLR

I have documents in SOLR which consist of fields where the values come from different source systems. The reason why I am doing this is because this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use join with multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name so that its first part is an 8-character hash representing the source system. This way the source systems can have common field names after the unique source hash, and I can easily clear out all fields that start with the source prefix, if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
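For reference, the prefix scheme described above could look roughly like the sketch below. Everything here is an assumption for illustration: the class and method names are hypothetical, SHA-256 is just one possible hash, and a leading "s" is prepended because Solr recommends that field names not start with a digit, which a hex hash might do.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

// Hypothetical helper illustrating source-prefixed dynamic field names.
final class SourcePrefixedFields {

    // "s" followed by the first 8 hex characters of SHA-256(sourceSystem).
    static String sourcePrefix(String sourceSystem) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(sourceSystem.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("s");
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Builds a string dynamic field name such as "s1a2b3c4d_color_s".
    static String fieldName(String sourceSystem, String name) {
        return sourcePrefix(sourceSystem) + "_" + name + "_s";
    }

    // Removes every field of the in-memory document that belongs to the source.
    static void clearSource(Map<String, Object> doc, String sourceSystem) {
        String prefix = sourcePrefix(sourceSystem) + "_";
        doc.keySet().removeIf(key -> key.startsWith(prefix));
    }
}
```

Note that this only manipulates the document in memory; the cleaned document still has to be sent to Solr again.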
In our experience, the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to get into a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be specific to the indexing process. Usually the subsystems will be performant enough (at least if doing batch retrieval depending on the documents that are being updated).
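A rough sketch of the "always rebuild the complete document" approach follows. The names are hypothetical: each source system is modeled as a function from entity id to the fields it contributes, and the caller is assumed to send the returned document to Solr as a full replacement.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: one code path that builds the full document.
final class DocumentRebuilder {

    // Fetches the current fields from every source system and merges them.
    // Fields that a source no longer returns simply do not appear in the
    // rebuilt document, so no per-field removal logic is needed.
    static Map<String, Object> rebuild(String id,
                                       List<Function<String, Map<String, Object>>> sources) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        for (Function<String, Map<String, Object>> source : sources) {
            doc.putAll(source.apply(id));
        }
        return doc;
    }
}
```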

How to query a particular object without its embedded objects or collections?

I have a class, let's say Blarkar. Blarkar has an embedded class kar. Sometimes when I query for an instance of Blarkar I want the complete object, but other times I don't need all its embedded objects and their embedded objects. How do I load an object without its embedded objects?
You can't. GAE loads an entity whole or not at all. Generally this is not a problem and you shouldn't try to optimize unless you know you have a real issue. But if so, you can split your entity into multiple parts, eg User and UserExtraStuff.
There is a special type of query called a projection query, but this is not likely going to be useful - it lets you select some data out of an index without doing a full entity lookup. It's only useful in limited types of inequality queries. The data has to be in the index.

Solr equivalent to ElasticSearch Mapping Type

ElasticSearch has Mapping Types. According to the docs:
Mapping types are a way to divide the documents in an index into logical groups. Think of it as tables in a database.
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't. So, is there an equivalent?
Or, maybe a better question,
If I have multiple document types and I want to limit searches to a certain document type, which one offers the better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore Solr:
http://wiki.apache.org/solr/CoreAdmin
Maybe something has changed since Solr 4.0 and it's easier now; I haven't looked at it since I switched to Elasticsearch. Personally, I find the Elasticsearch index/type system much better than that.
In Solr 4+.
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, in your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object. Of course, that means that none of those single-type-only fields can be compulsory.
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't.
You can do exactly this in Solr. Add a field and use it to filter.
It is correct that Mapping Types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all Mapping Types. So technically it makes no difference. In fact, the Mapping Type is mapped to an internal schema field.
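As an illustration of filtering on a type field, the sketch below builds the request parameters for Solr's select handler using the standard q, fq and fl parameters. The field name doc_type and the class TypeFilteredQuery are hypothetical, and the type value is assumed to contain no characters that need query-syntax escaping.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that restricts a search to one document type.
final class TypeFilteredQuery {

    // q  = the user's query
    // fq = filter query on the differentiator field (assumed name: doc_type)
    // fl = only the fields relevant to that document type
    static String selectParams(String userQuery, String docType, String... fields) {
        return "q=" + encode(userQuery)
                + "&fq=" + encode("doc_type:" + docType)
                + "&fl=" + encode(String.join(",", fields));
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

The resulting string can be appended to the select URL of a core or collection.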
