Multivalued copy field support in DSE SOLR - solr

I have a Solr schema as follows:
<field name="category_id1" type="integer" indexed="false" stored="true" />
<field name="category_id2" type="integer" indexed="false" stored="true" />
<field name="category_id3" type="integer" indexed="false" stored="true" />
<field name="category_ids" type="integer" multiValued="true" indexed="true" stored="true"/>
and a copy section:
<copyField source="category_id1" dest="category_ids" />
but whenever I try to insert data into DSE/Cassandra, I get this error:
InvalidRequestException(why:(Expected 4 or 0 byte int (14)) [diem][business][category_ids] failed validation)
me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:(Expected 4 or 0 byte int (14)) [diem][business][category_ids] failed validation)
Exception in thread "main" me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:(Expected 4 or 0 byte int (14)) [diem][business][category_ids] failed validation)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:264)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at com.diem.db.crud.CassandraStorageManager.insertMultiColumns(CassandraStorageManager.java:197)
at com.diem.db.dao.impl.AbstractDaoImpl.saveUUIDEntity(AbstractDaoImpl.java:47)
at com.diem.db.dao.impl.BusinessDaoImpl.saveBusiness(BusinessDaoImpl.java:81)
at com.diem.data.LoadBusinesses.execute(LoadBusinesses.java:187)
at com.diem.data.LoadContent.run(LoadContent.java:121)
at com.diem.data.LoadBusinesses.main(LoadBusinesses.java:45)
Caused by: InvalidRequestException(why:(Expected 4 or 0 byte int (14)) [diem][business][category_ids] failed validation)
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20833)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
... 8 more
Copying into a multiValued solr.IntField (integer) is nothing unusual, and it worked for us before we started using DSE/Solr. But I can't get it to work with the DSE/Solr combination. I can't see a reason for the failure, because DSE should not interfere with operations on the category_ids field, which is used primarily for indexing. Does anyone see what is wrong here? What can I do to prevent the validation error? (Note: I can't use a text/string type for category_ids.)
Thank you!

I found the problem: my CF has default_validation_class=BytesType, so the multiValued field category_ids is validated using BytesType in DSE/Solr, which causes the error. So unless I declare the CF in CQL with a column of type LIST<int> and stop using Hector (at least for this CF), I won't be able to use multiValued fields of any type other than text/string in Solr.
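The error text itself ("Expected 4 or 0 byte int (14)") suggests a length check on the column value: an int value must be exactly 4 bytes, or empty. The following is a minimal sketch of that rule, not DSE's or Cassandra's actual code. The function name `validate_int32` and the 14-byte payload are made up for illustration; what was really written to category_ids is not known from the trace.

```python
import struct


def validate_int32(value: bytes) -> None:
    """Mimic the rule implied by the error message: 4 bytes or empty."""
    if len(value) not in (0, 4):
        raise ValueError(f"Expected 4 or 0 byte int ({len(value)})")


# A single 32-bit big-endian integer serializes to exactly 4 bytes.
single = struct.pack(">i", 42)
validate_int32(single)  # passes silently

# Hypothetical 14-byte payload standing in for a serialized multi-value.
payload = b"\x00" * 14
try:
    validate_int32(payload)
except ValueError as err:
    print(err)
```

Anything that packs several values into one column will fail a check like this, which is consistent with the suggestion to either change the column's validator or not store the field at all.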

If I understand correctly, you are using Thrift tables, so you can either declare the category_ids column as UTF8Type (the Solr field can be of any type), or declare the category_ids Solr field as stored=false (in which case the copy field will not be stored, only indexed).
Let us know if either of the two works for you.

Related

Can a Solr search use a wildcard in the key?

I have a JSON block saved as one document in Solr:
{
"internal":...
"internet":...
"interface":...
"noise":...
"noise":...
}
Can I search with something like "inter*:*"? I want to find all content whose key starts with "inter".
Unfortunately, I get a parser error. Is there any way to search with a wildcard in the key?
No, not really. You'll have to do that as a copyField if providing a wildcard is important to you, in effect copying everything into a single field and then querying that field.
You can supply multiple fields through qf without specifying each field in the q parameter as long as you're using the edismax query handler - that's usually more flexible, but it will still require each field to be specified.
There's also a little-known feature named "Field aliasing using per-field qf overrides" (I wasn't aware of it, at least). If I've correctly parsed what I found in a few web searches, you should be able to use f.i_fields.qf=internal internet interface&qf=i_fields, in effect creating an i_fields alias that refers to those three fields. You'll still have to list them explicitly.
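As a rough illustration of the aliasing idea above, the request parameters could be assembled like this. It is only a sketch: the helper name `aliased_query` and the search term `foo` are made up, and whether the alias behaves as hoped depends on the Solr version and handler configuration.

```python
from urllib.parse import urlencode


def aliased_query(alias, fields, term):
    """Build an edismax query string where `alias` expands to `fields`."""
    params = [
        ("defType", "edismax"),
        ("q", term),
        ("qf", alias),
        # Per-field qf override: the alias stands for the listed fields.
        (f"f.{alias}.qf", " ".join(fields)),
    ]
    return urlencode(params)


print(aliased_query("i_fields", ["internal", "internet", "interface"], "foo"))
```

The field names still have to be known when the request is built, so this shortens the query but does not give a true wildcard over keys.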
You can use dynamic fields. They allow Solr to index fields that you did not explicitly define in your schema.
This is useful if you discover you have forgotten to define one or more fields. Dynamic fields can make your application less brittle by providing some flexibility in the documents you can add to Solr.
A dynamic field can be defined like
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
Please refer to the Solr documentation on Dynamic Fields for more.
After this, create a copy field and copy the dynamic fields into it.
Once that is done, you can query the copy field.
<dynamicField name="inter_*" type="string" indexed="true" stored="true"/>
<field name="internal_static" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="inter_*" dest="internal_static"/>
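To see which keys a glob such as inter_* would pick up, here is a small sketch that imitates the copy step on a plain dictionary. It only models the name matching, not Solr's indexing; the function name `copy_matching` and the destination name `inter_all` are hypothetical.

```python
from fnmatch import fnmatchcase


def copy_matching(doc, pattern, dest):
    """Return a copy of doc with values of matching fields gathered in dest."""
    out = dict(doc)
    out[dest] = [v for k, v in doc.items() if fnmatchcase(k, pattern)]
    return out


doc = {"internal": "a", "internet": "b", "noise": "c"}

# "inter_*" requires an underscore, so the keys from the question do not match.
print(copy_matching(doc, "inter_*", "inter_all")["inter_all"])

# "inter*" matches every key that starts with "inter".
print(copy_matching(doc, "inter*", "inter_all")["inter_all"])
```

With the key names from the question (internal, internet, interface), either the fields would need an inter_ prefix or the pattern would need to be inter*.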

SOLR Exact match issue

I have indexed my field in SOLR using field type "string".
My field contains two values "APA" and "APA LN".
I have queried SOLR with q=field:"APA".
With the above query I am getting results for both APA and APA LN.
I need the query to return just "APA".
Any help is appreciated.
I presume that your field "field" is TextField or text_general. Can you change it to string and try again?
i.e., something like this:
<field name="customfield" type="string" indexed="true" stored="true" multiValued="false" />
This should not happen for a field of type string. The most likely scenario is that you did not fully reindex or did not commit after reindexing.
You can check what your field actually contains in the Admin UI's Schema Browser screen (press "Load Term Info").
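The difference between the two field types can be pictured with a simplified model. This is not how Lucene matches terms internally; it only assumes that a string field keeps the whole value as one term while a text field is split on whitespace and lowercased. Both function names are made up.

```python
def matches_string_field(indexed_value, query):
    """A string field is one untokenized term: only the full value matches."""
    return indexed_value == query


def matches_text_field(indexed_value, query):
    """A tokenized field matches when the query tokens appear consecutively."""
    tokens = indexed_value.lower().split()
    wanted = query.lower().split()
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


values = ["APA", "APA LN"]
print([v for v in values if matches_string_field(v, "APA")])
print([v for v in values if matches_text_field(v, "APA")])
```

If the live index behaves like the second function, the field is probably still indexed with the old tokenized type, which fits the advice to reindex and commit.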

Is it possible to get SOLR DIH to ignore spatial fields for documents with invalid lat/long values?

I'm trying to import data from an Oracle database into a Solr index. The database entities have lat/long values, and the documents in the index should have a position field. The corresponding configuration in data-config.xml is therefore
<field column="LONGITUDE" name="long_d" />
<field column="LAT" name="lat_d" />
<field column="bl" name="position" template="${data.LAT},${data.LONGITUDE}"/>
where position field is defined as
<field name="position" type="location_rpt" indexed="true" stored="true" multiValued="false"/>
in the schema.xml file.
The problem is caused by badly chosen default values of 999.9 for both lat and long in some database entries, which the DIH does not accept as import values for the position field.
So my intention is to simply omit the field position whenever the DB entry has erroneous default values.
Is there something I can define in the configuration file for the DataImportHandler that will give me my desired results?
There are two stages where you can apply changes:
You can use a transformer inside DIH itself
You can use a custom update request processor (URP) chain to replace or get rid of the fields
So, for example, you could use RegexTransformer to replace known bad values with blanks. If that (blank but present fields) causes problems, you could use RemoveBlankFieldUpdateProcessorFactory in a custom chain to drop them.
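The rule either stage would need to implement is small. The sketch below shows it on a plain dictionary as a pre-processing step; it is not a DIH transformer or an update processor, and the function name `clean_row` is made up. Column and field names are taken from the question.

```python
def clean_row(row):
    """Map a DB row to Solr fields, adding position only for valid coordinates."""
    lat, lon = row.get("LAT"), row.get("LONGITUDE")
    doc = {"lat_d": lat, "long_d": lon}
    valid = (
        lat is not None
        and lon is not None
        and -90 <= lat <= 90
        and -180 <= lon <= 180
    )
    if valid:
        # Same "lat,lon" layout as the template in data-config.xml.
        doc["position"] = f"{lat},{lon}"
    return doc


print(clean_row({"LAT": 48.1, "LONGITUDE": 11.5}))
print(clean_row({"LAT": 999.9, "LONGITUDE": 999.9}))
```

A range check like this also catches bad values other than the 999.9 default, which a regex on one known value would miss.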

solr fq; integer comparison on a substring

That is probably a bad title...
But let's say I have a bunch of strings in a multivalue field
<field name="stringweights" type="text_nostem" indexed="true" stored="true" multiValued="true"/>
Sample data might be:
history:10
geography:33
math:29
Now I want to write a fq where I select all records in solr where:
stringweights starts with "geography:"
and where the integer value after "geography:" is >= 10.
Is it possible to write a solr query like that?
(It's not possible to create an integer field in the solr schema named "geography", another called "math" etc because these string portions of the field are unknown at design time and can be many hundreds / thousands of different values.)
You may want to look into dynamic fields. Declare a dynamic field in your schema like:
<dynamicField name="stringweight_*" type="integer" indexed="true" stored="true"/>
Then you can have your docs like:
stringweight_history: 10
stringweight_geography: 33
stringweight_math: 29
Your filter query is then simply:
fq=stringweight_geography:[10 TO *]
You may need to build a custom indexer for doing this. Or use a script transformer with data import handler as mentioned here: Dynamic column names using DIH (DataImportHandler).
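If the documents are built by a custom indexer, the conversion from the "name:weight" strings to dynamic field names might look like the sketch below. The function name `to_dynamic_fields` is made up, and it assumes every entry ends in a colon followed by an integer.

```python
def to_dynamic_fields(stringweights, prefix="stringweight_"):
    """Turn ["geography:33", ...] into {"stringweight_geography": 33, ...}."""
    fields = {}
    for entry in stringweights:
        # Split on the last colon so names containing ":" still work.
        name, _, weight = entry.rpartition(":")
        fields[prefix + name] = int(weight)
    return fields


print(to_dynamic_fields(["history:10", "geography:33", "math:29"]))
```

The resulting keys match the stringweight_* dynamic field, so a range filter on stringweight_geography can be applied directly.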

Solr data-config: Fields Questions regarding TF-IDF

We are using solr 1.4 (I know I know, pathetic :) )
in data-config
<!-- Snippet -->
<field column="description" stripHTML="true" stored="false" indexed="false"/>
Will the "description" data still be used to calculate the "score/tf-idf" value ?
Nope. A field must be marked indexed=true to be used in relevancy scoring.
indexed=true|false
True - if this field should be "indexed". If (and only if) a field is indexed, then it is searchable, sortable, and facetable.

Resources