Null or empty values are not stored in Solr

I have a Solr database to which I added a string field like this:
{
"add-field": [
{
"name": "string__single_line_text_field__LC",
"type": "string",
"stored": true,
"indexed": true,
"required": true,
"default": ""
}
]
}
I set the field to be required and defined its default value.
The problem is that Solr doesn't store my default value as an empty string when the string is null or empty (the field simply doesn't exist on the document); it stores only non-null, non-empty values. Any idea how to solve this issue?

Related

How to get all solr field names except for multivalued fields?

I'm new to Solr and I'm trying to query for field names, excluding fields with multiValued=true.
So far I have
select?q=*:*&wt=csv&rows=0&facet
which returns all the fields.
Is there a way to modify the query to check if a field is multivalued?
You can retrieve information about all the defined fields through the Schema API. The response will contain a multiValued field set to true if the field is defined as multivalued:
v1 API:
http://localhost:8983/solr/techproducts/schema/fields
v2 API:
http://localhost:8983/api/collections/techproducts/schema/fields
{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"multiValued": true, <----
"name": "cat",
"stored": true,
"type": "string"
},
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
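To answer the original question from that response, the field list can be filtered on the client side. The following is a minimal sketch, assuming the JSON has already been fetched and parsed; the helper name single_valued_field_names is made up for illustration, and the sample is a trimmed copy of the response above. Note that a field may also inherit multiValued from its field type, in which case the flag might not appear on the field entry itself, so treat the result as a starting point.

```python
import json


def single_valued_field_names(schema_response: dict) -> list:
    """Return names of fields whose definition does not set multiValued to true."""
    return [
        field["name"]
        for field in schema_response.get("fields", [])
        if not field.get("multiValued", False)
    ]


# Trimmed copy of the Schema API response shown above (hypothetical sample data).
sample = json.loads("""
{
  "fields": [
    {"indexed": true, "name": "_version_", "stored": true, "type": "long"},
    {"indexed": true, "multiValued": true, "name": "cat", "stored": true, "type": "string"}
  ],
  "responseHeader": {"QTime": 1, "status": 0}
}
""")

names = single_valued_field_names(sample)
print(names)
```

With the sample above, only _version_ is kept, because cat is flagged as multivalued.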

Set Index Key to Output Field Mapping

In my index, I've a field called id. During my enrichment pipeline I compute a value called /document/documentId, which I'm attempting to map to the id field. However, this mapping does not seem to work as the id always seems to be some long value that looks like a hash. All my other output field mappings work as expected.
Portion of the Index:
{
'name': 'id',
'type': 'Edm.String',
'facetable': false,
'filterable': true,
'key': true,
'retrievable': true,
'searchable': true,
'sortable': true,
'analyzer': null,
'indexAnalyzer': null,
'searchAnalyzer': null,
'synonymMaps': [],
'fields': []
}
Portion of the Indexer:
'outputFieldMappings': [
{
'sourceFieldName': '/document/documentId',
'targetFieldName': 'id'
}
]
Expected Value: 4b160942-050f-42b3-bbbb-f4531eb4ad7c
Actual Value: aHR0cHM6Ly9zdGRvY3VtZW50c2Rldi5ibG9iLmNvcmUud2luZG93cy5uZXQvMDNiZTBmMzEtNGMyZC00NDRjLTkzOTQtODJkZDY2YTc4MjNmL29yaWdpbmFscy80YjE2MDk0Mi0wNTBmLTQyYjMtYmJiYi1mNDUzMWViNGFkN2MucGRm0
Any thoughts on how to fix this would be much appreciated!
TL;DR - Can't use output field mappings for Keys. Can only use source fields.
According to Microsoft, it's not possible to set the document key using the output field mapping. Apparently, there is an issue in cases of deleting documents so the key has to exist straight out of the document.
I ended up using a mapping function in the fieldMappings.
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_name",
"targetFieldName": "filename"
},
{
"sourceFieldName": "metadata_storage_name",
"targetFieldName": "id",
"mappingFunction": {
"name": "extractTokenAtPosition",
"parameters": {
"delimiter": ".",
"position": 0
}
}
}
]
Since my file name is something like 4b160942-050f-42b3-bbbb-f4531eb4ad7c.pdf, this ends up mapping correctly to my id.
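To make the effect of that mapping function concrete, here is a rough local imitation in Python. It is only a sketch of the described behaviour (split on the delimiter, take the token at a zero-based position), not the service's implementation; the function name extract_token_at_position is my own.

```python
def extract_token_at_position(value: str, delimiter: str, position: int) -> str:
    """Split value on delimiter and return the token at the zero-based position."""
    return value.split(delimiter)[position]


file_name = "4b160942-050f-42b3-bbbb-f4531eb4ad7c.pdf"
document_id = extract_token_at_position(file_name, ".", 0)
print(document_id)
```

One consequence of this approach: a file name containing additional dots would be cut at the first dot, so it only works when the part before the first "." is the whole identifier.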
You can use a regular field mapping rather than an output field mapping. If you created your indexer in the Azure portal, your key (which is "id", since key is true in your index definition of "id" above) was probably base64-encoded (that option is checked by default). You will need to base64-decode it to get your original value, OR you can store a second copy of the original value without encoding it (the key will need to be encoded). Here's how you do the latter - this can replace your output field mapping:
"fieldMappings": [
{
"sourceFieldName": "documentId",
"targetFieldName": "documentId"
},
{
"sourceFieldName": "documentId",
"targetFieldName": "id",
"mappingFunction": {
"name": "base64Encode"
}
}
]
Note that you will also need to add a documentId field in your index since you are storing this in its original format as well.
{
'name': 'documentId',
'type': 'Edm.String',
'facetable': false,
'filterable': true,
'key': false,
'retrievable': true,
'searchable': true,
'sortable': true,
'analyzer': null,
'indexAnalyzer': null,
'searchAnalyzer': null,
'synonymMaps': [],
'fields': []
}
Alternatively, you could just base64 encode (when storing) and decode (when retrieving) the id value. This key value is base64-encoded so it's safe to use as an Azure Cognitive Search document key. Check out https://learn.microsoft.com/azure/search/search-indexer-field-mappings for more info.
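For the decode-on-retrieval route, a client-side sketch could look like the following. It assumes the key was written in the URL-token flavour of base64 (URL-safe alphabet, padding stripped, and a trailing digit recording how many padding characters were removed), which is what the trailing 0 on the actual value in the question suggests. The function names are hypothetical; check the encoding options of your own indexer before relying on this.

```python
import base64


def url_token_decode(token: str) -> str:
    """Decode URL-safe base64 whose last character is the padding count."""
    padding_count = int(token[-1])
    body = token[:-1] + "=" * padding_count
    return base64.urlsafe_b64decode(body).decode("utf-8")


def url_token_encode(text: str) -> str:
    """Encode text the same way: strip '=' padding and append its count."""
    encoded = base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


# The "Actual Value" from the question.
actual_key = (
    "aHR0cHM6Ly9zdGRvY3VtZW50c2Rldi5ibG9iLmNvcmUud2luZG93cy5uZXQvMDNiZTBmMzEt"
    "NGMyZC00NDRjLTkzOTQtODJkZDY2YTc4MjNmL29yaWdpbmFscy80YjE2MDk0Mi0wNTBmLTQy"
    "YjMtYmJiYi1mNDUzMWViNGFkN2MucGRm0"
)
decoded = url_token_decode(actual_key)
print(decoded)
```

Decoding the value from the question this way yields a blob URL ending in 4b160942-050f-42b3-bbbb-f4531eb4ad7c.pdf, which suggests the key was populated from the blob's storage path rather than from /document/documentId.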

JSON Schema Array must contain a specific string

There are several questions on the subject, but none of them seems to address this particular issue, and neither does the JSON Schema documentation, so maybe it cannot be done.
The issue is that I have an array that can have any of 4 strings as values, easy enough to achieve with this schema:
...
"attributes": {
"type": "array",
"items": {
"type": "string",
"enum": [
"controls",
"autoplay",
"muted",
"loop"
]
},
"additionalItems": false
}
...
So the values in the array can only be one of those four. Nevertheless, "controls" must always be part of the array, while the other three are optional. If it was an array of objects we could make this required, but I'm not sure how to check for an array having a specific value.
Thanks for any help!
You can use the contains keyword:
"attributes": {
"type": "array",
"items": {
"type": "string",
"enum": [
"controls",
"autoplay",
"muted",
"loop"
]
},
"contains": {
"const": "controls"
},
"additionalItems": false
}
From the specification:
6.4.6. contains
The value of this keyword MUST be a valid JSON Schema.
An array instance is valid against "contains" if at least one of its
elements is valid against the given schema.
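To see what the combined constraints accept and reject, here is a hand-rolled Python check that mirrors them. It is not a JSON Schema validator, just a sketch of the same logic (every item must be one of the four allowed strings, and at least one item must equal "controls"); the names ALLOWED and attributes_are_valid are made up for the example.

```python
ALLOWED = {"controls", "autoplay", "muted", "loop"}


def attributes_are_valid(attributes) -> bool:
    """Mirror the schema: an array of allowed strings that contains "controls"."""
    if not isinstance(attributes, list):
        return False
    # "items" with "enum": every element must be one of the allowed strings.
    items_ok = all(isinstance(a, str) and a in ALLOWED for a in attributes)
    # "contains" with "const": at least one element must equal "controls".
    contains_ok = any(a == "controls" for a in attributes)
    return items_ok and contains_ok


print(attributes_are_valid(["controls", "muted"]))
print(attributes_are_valid(["autoplay", "loop"]))
```

The first call passes both checks; the second satisfies the enum but fails the contains requirement, which is exactly the case the plain enum schema could not catch.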

Set criteria in query for fields and fields in nested objects

I have a document like this:
{
"InDate": "11.09.2015",
"Kst2Kst": true,
"OutDate": "11.09.2015",
"__v": 0,
"_id": ObjectId('55f2df2d7e12a9f1f52837e6'),
"accepted": true,
"inventar": [
{
"accepted": "1",
"name": "AAAA",
"isstammkost": true,
"stammkost": "IWXI"
},
{
"accepted": "1",
"name": "BBBB",
"isstammkost": false,
"stammkost": "null"
}
]
}
I want to select the data with "isstammkost": true in the inventar-array.
My query is:
Move.findOne({accepted : true, 'inventar.isstammkost' : true},
'OutDate InDate inventar.name', function(err, res)
It doesn't work: it returns everything, including the entries with inventar.isstammkost : false.
The "normal" query (without criteria on the sub-array) works as I want. What's the right way to set criteria on a sub-array?
Of course it will return the "isstammkost": false part, because that is part of the same document as the "isstammkost": true. They are both objects in the array "inventar", a top-level field in a single document. Without some sort of projection, the entire document will always be returned to a mongodb query and thus nodejs will pass them on to you.
I'm not terribly up-to-speed on nodejs, but if this were the mongo shell it would look like this:
> db.MyDB.findOne({accepted : true, "inventar.isstammkost" : true}, {"inventar.$": 1});
You will need to find out how to add that extra parameter to the nodejs function.
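The distinction between matching a document and selecting an array element can be illustrated without a database. The sketch below is plain Python over a trimmed copy of the document from the question; it imitates the behaviour (a query matches whole documents, while a positional projection keeps only the first matching array element) and is not driver code. Both function names are hypothetical.

```python
document = {
    "accepted": True,
    "inventar": [
        {"accepted": "1", "name": "AAAA", "isstammkost": True, "stammkost": "IWXI"},
        {"accepted": "1", "name": "BBBB", "isstammkost": False, "stammkost": "null"},
    ],
}


def document_matches(doc: dict) -> bool:
    """The query part: true if ANY array element has isstammkost == True."""
    return doc.get("accepted") is True and any(
        item.get("isstammkost") is True for item in doc.get("inventar", [])
    )


def first_matching_item(doc: dict):
    """The projection part: keep only the first element that matched."""
    return next(
        (item for item in doc.get("inventar", []) if item.get("isstammkost") is True),
        None,
    )


if document_matches(document):
    print(first_matching_item(document)["name"])
```

The document matches because one element qualifies, yet without the second step both AAAA and BBBB come back, which is the behaviour described in the question.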

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever. This means the mappings, norms, and analyzed/not_analyzed settings are all the default config. I have a very small data set of two items for experimentation purposes. The items have several fields but I query only on one, which is a multi-valued (array of strings) field. The doc looks like this:
{
"_index": "index_profile",
"_type": "items",
"_id": "ega",
"_version": 1,
"found": true,
"_source": {
"clicked": [
"ega"
],
"profile_topics": [
"Twitter",
"Entertainment",
"ESPN",
"Comedy",
"University of Rhode Island",
"Humor",
"Basketball",
"Sports",
"Movies",
"SnapChat",
"Celebrities",
"Rite Aid",
"Education",
"Television",
"Country Music",
"Seattle",
"Beer",
"Hip Hop",
"Actors",
"David Cameron",
... // other topics
],
"id": "ega"
}
}
A sample query is:
GET /index_profile/items/_search
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"profile_topics": [
"Basketball"
]
}
}]
}
}
}
Again there are only two items and the one listed should match the query because the profile_topics field matches with the "Basketball" term. The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays and are to have no norms and no analyzer so profile_topics are not stemmed or tokenized since all values should be treated as tokens (even the spaces). Not sure this would solve the problem but it is how I treat similar data on Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue. If so, how do I solve this so that even with two items the query will return ega? If possible, I'd like to solve this index-wide or type-wide rather than per field.
A terms query does not analyze its input, so Basketball (with a capital B) is looked up in the Elasticsearch index exactly as written.
You say you are using the defaults. In that case, indexing Basketball under the profile_topics field produces the term basketball (with a lowercase b) in the index, because the standard analyzer lowercases tokens. So either map profile_topics as not_analyzed, or search for basketball instead of Basketball.
Read this about terms.
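The mismatch can be reproduced in a few lines of Python. This is only a toy model: simplified_standard_analyzer is a made-up stand-in that splits on non-word characters and lowercases, which is much cruder than the real standard analyzer, and the lookup is an exact set membership test standing in for an unanalyzed terms query.

```python
import re


def simplified_standard_analyzer(text: str) -> list:
    """Very rough stand-in: split on non-word characters and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


topics = ["Basketball", "University of Rhode Island", "Hip Hop"]

# What ends up in the index for an analyzed field: lowercased tokens.
indexed_terms = {
    term for topic in topics for term in simplified_standard_analyzer(topic)
}


def terms_query_matches(query_terms: list) -> bool:
    """Exact lookup, no analysis of the query terms."""
    return any(term in indexed_terms for term in query_terms)


print(terms_query_matches(["Basketball"]))
print(terms_query_matches(["basketball"]))
```

The capitalised lookup finds nothing while the lowercase one matches. The same model also shows why multi-word topics such as "Hip Hop" need a not_analyzed field if they are to be matched as a single value.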
As for setting all fields to not_analyzed, you could do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field and make only that subfield not_analyzed. The original (parent) field still holds the analyzed version of the same text, which you may want to use later.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
"template": "your_indices_name-*",
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"omit_norms": true
},
"dynamic_templates": [
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}
