The GitHub page also describes TerminusDB as a "document store". Does that mean it can work like MongoDB, storing JSON documents?
If so, to what extent can it be used to store user documents containing web application data?
You can store documents in JSON-LD format in TerminusDB as long as the format follows the defined schema. The schema definitions constrain the allowed documents, and allow you to seamlessly move between a graph and document representation. They also allow you to perform quality control on the documents to make sure that required fields exist, and that these fields have the appropriate data-types.
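To make the "quality control" idea concrete, here is a minimal sketch in plain Python. It is not TerminusDB's API or schema syntax; the `PERSON_SCHEMA` structure and the `validate` function are hypothetical and only illustrate what checking required fields and data types against a schema means.

```python
# Hypothetical schema: field name -> (expected Python type, required?)
# This is NOT TerminusDB schema syntax, only an illustration of the idea.
PERSON_SCHEMA = {
    "name": (str, True),
    "age": (int, False),
}


def validate(document, schema):
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, (expected_type, required) in schema.items():
        if field not in document:
            if required:
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(document[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    # Keys starting with "@" (JSON-LD keywords such as "@type") are ignored here.
    for field in document:
        if field not in schema and not field.startswith("@"):
            problems.append(f"unexpected field: {field}")
    return problems
```

A document that passes returns an empty list, and a document that fails returns one message per violated constraint.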
As of October 2021 (TerminusDB 10.x), the document-oriented perspective is now first-class. See e.g. https://terminusdb.com/docs/index/terminusx-db/reference-guides/document-interface.
I am using Azure Search with default indexing on imported unstructured data (PDF, DOC, text, image files, etc.).
I haven't created a scoring profile on the default available fields.
Almost every setting in the portal is the default. If I search for any text through the search explorer, I get a JSON result with a very low search score.
I read about score boosting using a scoring profile. However, the terms I want to find can be in any document, at any place. So how can I decide which field to weight more?
How can I generate more custom fields from these input files? Do I need to write a document parser?
I am using SDK 4.0 and C# in my bot.
Please suggest.
To use scoring profile, the fields you are trying to boost need to be part of the index definition, otherwise the scoring mechanism won't know about them.
You mentioned using unstructured data as your source. I assume this means your data does not have any stable or predictable structure. If that's the case, then you probably won't be able to update your index definition to match exactly the structure of every document, since different documents will likely have a different and unpredictable structure. If you know what fields you want to boost, and you know how to retrieve those fields from your documents, then you could update your index definition with only the fields you care about, and then use the "merge" document API to populate those fields for each document.
https://learn.microsoft.com/en-us/rest/api/searchservice/addupdate-or-delete-documents
This would require you to retrieve all documents from the index, parse the data to extract the field you want to boost, and then use the merge API to update the index data with the data you extracted. Once you have this, you will be able to use that field as part of a scoring profile.
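As a rough illustration of the merge step, the sketch below builds the JSON body for a batch of "merge" actions. The key field name `id`, the boost field name `keywords`, and the function name are assumptions made for the example; use whatever your index definition actually declares. The sketch only builds the payload and sends nothing.

```python
def build_merge_batch(extracted, key_field="id", boost_field="keywords"):
    """Build a batch body of "merge" actions.

    extracted maps each document key to the value parsed out of that document.
    key_field and boost_field are assumed names; they must match the index.
    """
    actions = []
    for doc_key, value in extracted.items():
        actions.append({
            "@search.action": "merge",  # update only the listed fields
            key_field: doc_key,
            boost_field: value,
        })
    return {"value": actions}
```

The resulting dictionary would be serialized to JSON and posted to the documents endpoint described on the page linked above.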
I am new to apache solr and exploring some use cases that could potentially be applicable for my application.
In one of the use cases, I have multiple MongoDB instances pushing data to Solr via mongo-connector. I am able to do so by running two instances of mongo-connector, each against a different Mongo instance, both using the same Solr core.
My question is: how do I handle a situation where a field in a Mongo collection, say "startTime", is of Date type in one Mongo instance while another instance treats it as a long? I want this field to be treated as a long in Solr. Does Solr provide any sort of automatic conversion, or will I have to write my own analyzer?
If you want both values to normalize to the same form, you should do that in an UpdateRequestProcessor (defined in solrconfig.xml). There are quite a number of them for various purposes, including date parsing. In fact, the schemaless mode is implemented by a chain of URPs, so that's an example you can review.
To process different Mongo instances in different ways, you can just define separate Update Request Handler endpoints (in solrconfig.xml again) and set up different processing for each. Use shared definitions to avoid duplicating what's common (using a processor reference, as in the schemaless definition).
It may be more useful to normalize to dates rather than back from dates, as Solr allows more interesting searches that way, such as Date Math.
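To show what normalizing to dates amounts to, here is a small Python sketch of the conversion itself. It is not an UpdateRequestProcessor and does not run inside Solr; it only illustrates the target form. It assumes the long values are epoch milliseconds and that naive datetimes are UTC, and it drops sub-second precision.

```python
from datetime import datetime, timezone


def to_solr_date(value):
    """Accept a datetime or epoch milliseconds (int); return an ISO-8601 UTC string."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumption: naive datetimes are already in UTC.
            value = value.replace(tzinfo=timezone.utc)
        dt = value.astimezone(timezone.utc)
    elif isinstance(value, int):
        # Assumption: integers are milliseconds since the Unix epoch.
        dt = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    else:
        raise TypeError(f"unsupported startTime value: {value!r}")
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Both representations of "startTime" end up as the same string, which is the form Solr date fields accept.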
I have documents in SOLR which consist of fields where the values come from different source systems. The reason why I am doing this is because this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use join with multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the source system into the dynamic field name, with the first part of the name being an 8-character hash representing the source system. That way the systems can share common field names after the unique source hash, and I can easily clear out all fields that start with a given source prefix, if needed.
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience, the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to pull in a lot of business rules that end up living outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform well when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be specific to the indexing process. Usually the subsystems will be performant enough (at least if you do batch retrieval based on the documents that are being updated).
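The "rebuild the whole document every time" approach can be sketched as follows. The function name, the `fetchers` list, and the field names in the usage note are hypothetical; each fetcher stands in for whatever call retrieves the current fields for an entity from one source system.

```python
def build_document(entity_id, base_fields, fetchers):
    """Assemble the complete document for one entity from every source system.

    fetchers is a list of callables; each takes the entity id and returns a
    dict of the fields that source system currently holds for that entity.
    """
    doc = {"id": entity_id, **base_fields}
    for fetch in fetchers:
        # Fields a source no longer returns simply do not appear in the new
        # document, so no per-field removal logic is needed.
        doc.update(fetch(entity_id))
    return doc
```

For example, `build_document("42", {"type": "person", "name": "Ada"}, [fetch_a, fetch_b])` would produce one dictionary that replaces the existing document when it is sent to Solr.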
I googled and searched for the title. A lot of results came back on how to write a QUERY for hierarchical/nested fields, but there was no clear answer as to how such fields would be defined in schema.xml.
Let me be very specific. Say I have JSON records in the following format (a very simplified version):
Office string
    city string
    zipcode string
Home
    city string
    zipcode string
City string
If I just want to index/store home.city then how would I define that in the "field" in schema.xml?
The schema has to be the union of all the fields as one collection has only one real definition which includes everything.
So: city, zipcode, and probably type to differentiate. Plus whatever Solr requires for parent/child relationship management (id, _root_, _version_).
If the fields are different, then you need to make sure that the fields that only happen in one type and not another are optional.
That's assuming you are indexing child-records as separate documents. If you want to merge them all in one parent document, then you need to do some folding of the content on the client. ElasticSearch gives you a slightly better interface for that, though - under the covers - the issues of a single real definition are still the same (they come from Lucene, which both use).
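One way to do the client-side folding is to flatten the nested record into single-level field names before sending it to Solr. The sketch below is only one possible convention; the `fold` function and the lowercase `parent_child` naming are assumptions, not something Solr prescribes.

```python
def fold(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining key names with sep."""
    flat = {}
    for key, value in record.items():
        # Lowercase names and prefix child keys with the parent's name.
        name = f"{parent_key}{sep}{key}".lower() if parent_key else key.lower()
        if isinstance(value, dict):
            flat.update(fold(value, name, sep))
        else:
            flat[name] = value
    return flat
```

With this convention, home.city from the question becomes a flat field called `home_city`, which could then be declared as an ordinary field in schema.xml.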
Solr does not support nested fields. If you are looking for a search engine with that feature, you can try Elasticsearch. Elasticsearch also has Lucene at its core, and it offers more than Solr as far as scalability, full-text search features, auto-sharding, and easy import/export of data are concerned.
ElasticSearch has Mapping Types. According to the docs:
Mapping types are a way to divide the documents in an index into
logical groups. Think of it as tables in a database.
Is there an equivalent in Solr for this?
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't. So, is there an equivalent?
Or, maybe a better question,
If I have multiple document types and I want to limit searches to a certain document type, which one offers the better solution?
I hope this question makes sense, since I'm new to both of them.
Thanks!
You can configure multicore solr:
http://wiki.apache.org/solr/CoreAdmin
Maybe something has changed since Solr 4.0 and it's easier now; I haven't looked at it since I switched to Elasticsearch. Personally, I find Elasticsearch's index/type system much better than that.
In Solr 4+:
If you are planning to do faceting or any other calculations across multiple types, then create a single schema with a differentiator field. Then, in your business/mapping/client layer, define only the fields you actually want to look at. Use custom search handlers with the 'fl' parameter to return only the fields relevant to that object. Of course, that means that none of those single-type-only fields can be compulsory.
If your document types are completely disjoint, you can create a core/collection per type, each with its own definition file. You have full separation, but still have only one Solr server to maintain.
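For the single-schema approach, a request restricted to one document type might be built like this. The differentiator field name `type` and the function name are assumptions for the example. Note that the sketch passes `fq` and `fl` as request parameters rather than configuring them in a custom search handler, which is a simpler stand-in for the same effect.

```python
from urllib.parse import urlencode


def typed_query(text, doc_type, fields):
    """Build a Solr query string limited to one document type."""
    params = {
        "q": text,
        "fq": f"type:{doc_type}",  # filter on the (assumed) differentiator field
        "fl": ",".join(fields),    # return only the fields relevant to this type
    }
    return urlencode(params)
```

The returned string would be appended to the select endpoint of the core being queried.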
I have seen that some people include a new field in the documents and later on they use this new field to limit the search to a certain type of documents, but as I understand it, they have to share the schema and (I believe) ElasticSearch Mapping Type doesn't.
You can do exactly this in Solr. Add a field and use it to filter.
It is correct that Mapping Types in ElasticSearch do not have to share the same schema, but under the hood ElasticSearch uses only ONE schema for all Mapping Types. So technically it makes no difference. In fact, the Mapping Type is mapped to an internal schema field.