How to fetch records from Solr whose values contain spaces - solr

I have a Solr collection with a million documents stored in it. Name is one field in those documents, and some of its values contain trailing spaces (I don't know the exact space count). I need to get the names and ids of those records.
For example:
1. Name: "Arun "
2. Name: "David "
3. Name: "Rahul" (correct record, no ID required for this one)
Please let me know the fq to use.
Added field details:
<field name="CallSign" type="text_general" indexed="true" stored="true"/>
Adding token details for the CallSign field (from the Solr Admin schema browser):
Field: CallSign
Field-Type: org.apache.solr.schema.TextField
Position Increment Gap: 100
Docs: 20,719
Flags (set in both Properties and Schema): Indexed, Tokenized, Stored, UnInvertible, Omit Norms
Index Analyzer:
org.apache.solr.analysis.TokenizerChain
Query Analyzer:
org.apache.solr.analysis.TokenizerChain

I found a solution to my question. Let's say the field is Name:
q=Name:*\s*
The above Solr query fetches all documents whose Name contains a space (or spaces).
Thank you all for your inputs.
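For anyone doing this from Java, here is a minimal SolrJ sketch of the same idea; the Solr URL, collection name, and row limit are assumptions for illustration, and the query string simply mirrors the one above (escaped for a Java string literal):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SpaceQuery {
    public static void main(String[] args) throws Exception {
        // Assumed Solr URL and collection name
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("Name:*\\s*"); // same query as above
            query.setFields("id", "Name");                 // only return id and Name
            query.setRows(100);                            // assumed page size
            QueryResponse response = client.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " -> \"" + doc.getFieldValue("Name") + "\"");
            }
        }
    }
}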

Related

How to execute date range query on date index - SOLR

What is the correct way to run range queries on a SOLR date index field? Below is the snapshot from the schema.xml file for the field definition:
<dynamicField name="ds_*" type="pdate" indexed="true" stored="false" multiValued="false" docValues="true"/>
Context: the SOLR document contains a date element with a value such as "ds_clip_date":"2020-10-14T12:00:00Z".
The requirement is to allow the user to fetch documents from this month / the last 30 days / the last 90 days / this year.
I tried building the query below from the SOLR query dashboard, but it gives me all the documents without respecting the range:
http://localhost:8983/solr/testcore/select?facet.field=ds_clip_date&facet.query=facet.range%3Dds_clip_date%26f.ds_clip_date.facet.range.start%3D2020-11-17&facet=on&fl=ds_clip_date&q=*%3A*&rows=200&start=0
and I get the below summary:
"facet_counts":{
"facet_queries":{
"facet.range=ds_clip_date&f.ds_clip_date.facet.range.start=2020-11-17":0},
"facet_fields":{
"ds_clip_date":[
"2020-12-15T12:00:00Z",20,
"2020-12-16T12:00:00Z",14,
"2020-09-25T12:00:00Z",1,
"2020-10-14T12:00:00Z",1,
"2020-10-20T12:00:00Z",1,
"2020-10-22T12:00:00Z",1,
"2020-11-24T12:00:00Z",1,
"2020-12-17T12:00:00Z",1]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}
Also, referring to some other Q&A, I tried to build it as below:
http://localhost:8983/solr/testcore/select?facet=true&facet.field=ds_clip_date&facet.range=ds_clip_date&f.ds_clip_date.facet.range.start=2020-12-15T00:00:00Z&f.ds_clip_date.facet.range.end=NOW&f.ds_clip_date.facet.range.gap=%2B1DAY
The result is as below:
{
"response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"ds_clip_date":[]},
"facet_ranges":{
"ds_clip_date":{
"counts":[
"2020-12-15T00:00:00Z",0,
"2020-12-16T00:00:00Z",0,
"2020-12-17T00:00:00Z",0],
"gap":"+1DAY",
"start":"2020-12-15T00:00:00Z",
"end":"2020-12-18T00:00:00Z"}},
"facet_intervals":{},
"facet_heatmaps":{}}}
However, in the above summary you can see that 2020-12-15 has 20 documents indexed.
I am running SOLR 8.7.
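For completeness, here is a rough SolrJ sketch of the date range facet that the second URL attempts (the core name, start date, and gap are taken from the question; the client code itself is an assumption for illustration):
import java.time.Instant;
import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.RangeFacet;

public class DateRangeFacetSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/testcore").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);
            // Range facet on ds_clip_date from 2020-12-15 up to now, in one-day buckets
            query.addDateRangeFacet("ds_clip_date",
                    Date.from(Instant.parse("2020-12-15T00:00:00Z")),
                    new Date(),
                    "+1DAY");
            QueryResponse response = client.query(query);
            for (RangeFacet<?, ?> facet : response.getFacetRanges()) {
                for (RangeFacet.Count count : facet.getCounts()) {
                    System.out.println(count.getValue() + " -> " + count.getCount());
                }
            }
        }
    }
}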

Incorrect field reading during ranking

Solr version 5.1.0
Documents contain a DocValues field "ts" with a timestamp that is used during ranking.
<field name="ts" type="long" docValues="true" indexed="true" stored="true" multiValued="false"/>
If I request the document directly in the Solr Admin UI, I see that it contains the correct value:
"ts": 1575624481951
But when I added logs to the ranking method, I saw that the "ts" value for the same document is 0.
LeafReader reader = context.reader();
// Read the per-segment numeric docValues for "ts"
NumericDocValues timeDV = DocValues.getNumeric(reader, "ts");
// In Lucene 5.x this returns 0 when the document has no value for the field
long timestamp = timeDV.get(doc);
LOG.info("ts: " + timestamp);
Log:
ts: 0
The problem was that the document was deleted from Solr incorrectly.
It was reproducible with the following sequence of actions:
First, the document was added to Solr without the "ts" field.
After some actions in the app, the document was added again, but with the "ts" field.
When Solr tried to rank this document, it did not have this field.
I added additional logs and saw that the first version of the document was on one shard and the second version (with the "ts" field) was on another shard.
I'm not sure why that happened, because as far as I know Solr should put the same document on the same shard.
But in any case it was fixed by deleting the document from the index before adding the second version.
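A rough SolrJ sketch of that workaround (the id and field values are placeholders, and the builder-style client shown here is from a newer SolrJ than 5.1, so treat it as illustrative only):
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ReAddWithTs {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            String id = "doc-42"; // placeholder id
            // Delete the old version that was indexed without "ts" ...
            client.deleteById(id);
            client.commit();
            // ... then add the new version that carries the "ts" field.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("ts", 1575624481951L);
            client.add(doc);
            client.commit();
        }
    }
}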

Stored fields in Solr are getting displayed in queries, why?

I am new to using Solr. I have made a new core and copied the default schema.xml to the conf/ folder. The change I have made is very trivial:
<field name="id" type="string" indexed="true" stored="false" required="true" multiValued="false" />
As you can see, I set the id field to stored="false". As per my understanding, the id field should not be displayed now when I do a query search. But that is not happening. I have tried restarting the Solr instance, and I ran the command to index the file again.
curl 'http://localhost:8983/solr/TwitterCore/update/json?commit=true' \
--data-binary @TwitterData_Core_Conf/TwitterText_en_demo.json \
-H 'Content-type:application/json'
As per the Solr Wiki, this should have re-indexed my file. However, when I run my query again, I still see the id.
An example of the document returned (this is not the complete JSON node, I just copied some parts):
"text": [
"RT #FollowTrainTV: Moonseternity just joined #FollowTrainTV - Watch them stream on http://t.co/oMcOGA51kT"
],
"lang": [
"en"
],
"id": "0a8edfea-68f7-4b05-b370-27b5aba640b7", // I dont want to see this
"_version_": 1512067627994841000
Maybe someone can give me detailed steps on re-indexing.
When you change the schema.xml file and restart the Solr server, the changes only apply to new documents. This means you have to clear the index and re-index all documents (except for query-analyzer changes, which are active immediately after a server restart, but that is not the case here). After re-indexing, the id field should not be visible any more.
Another remark: you don't have to test your queries with curl. When you connect to http://localhost:8983/solr with your web browser you should find an admin interface there, where you can select a core and test your queries.
Refer to this document: https://lucene.apache.org/solr/guide/6_6/docvalues.html
Non-stored docValues fields will also be returned along with other stored fields when all fields are specified to be returned (e.g. "fl=*") for search queries, depending on the effective value of the useDocValuesAsStored parameter for each field. For schema versions >= 1.6, the implicit default is useDocValuesAsStored="true".
The string field type has docValues="true". That is the reason why it is appearing in the search response.
You can either add the useDocValuesAsStored="false" parameter to the field, or use a different fieldType, say text_general.
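For example, a sketch of the adjusted field definition from the question, following the first suggestion (the other attributes are kept as they were):
<field name="id" type="string" indexed="true" stored="false" required="true" multiValued="false" useDocValuesAsStored="false" />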

Using logic AND in a text field

I'm using a schema that has a text field containing ids separated by spaces. The field definition in the schema is below:
<field name="aux_identifiers" type="text" indexed="true" stored="true"/>
A query that fetches a single document returns the field as below, for example:
<str name="aux_identifiers">1 2 3 4</str>
Is there any possibility of applying a logical AND operator to this field? I need to find the documents that have, for example, the ids 2 and 3 in the field.
FYI, we can't change those fields to multivalued/array and reindex right now; that's why I'm trying an alternate solution.
It would depend on what kind of processing you have on that field, but this should work:
q=aux_identifiers:2 AND aux_identifiers:3
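From SolrJ the same AND can also go into a filter query; a minimal sketch (the URL and collection name are placeholders):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AuxIdentifierAnd {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("*:*");
            // Both terms must match within the tokenized aux_identifiers field
            query.addFilterQuery("aux_identifiers:2 AND aux_identifiers:3");
            QueryResponse response = client.query(query);
            System.out.println("matches: " + response.getResults().getNumFound());
        }
    }
}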

solr fq; integer comparison on a substring

That is probably a bad title...
But let's say I have a bunch of strings in a multivalued field:
<field name="stringweights" type="text_nostem" indexed="true" stored="true" multiValued="true"/>
Sample data might be:
history:10
geography:33
math:29
Now I want to write an fq that selects all records in Solr where:
stringweights starts with "geography:"
and the integer value after "geography:" is >= 10.
Is it possible to write a Solr query like that?
(It's not possible to create an integer field in the Solr schema named "geography", another called "math", etc., because these string portions of the field are unknown at design time and can take many hundreds or thousands of different values.)
You may want to look into dynamic fields. Declare a dynamic field in your schema like:
<dynamicField name="stringweight_*" type="integer" indexed="true" stored="true"/>
Then you can have your docs like:
stringweight_history: 10
stringweight_geography: 33
stringweight_math: 29
Your filter query is then simply:
fq=stringweight_geography:[10 TO *]
You may need to build a custom indexer to do this, or use a script transformer with the DataImportHandler as mentioned here: Dynamic column names using DIH (DataImportHandler).
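A rough SolrJ sketch of indexing into those dynamic fields and then applying the filter query (the URL, collection name, and document id are placeholders):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DynamicWeights {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            // Index a document whose subject weights go into stringweight_* dynamic fields
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("stringweight_history", 10);
            doc.addField("stringweight_geography", 33);
            doc.addField("stringweight_math", 29);
            client.add(doc);
            client.commit();

            // Select all records where the geography weight is >= 10
            SolrQuery query = new SolrQuery("*:*");
            query.addFilterQuery("stringweight_geography:[10 TO *]");
            System.out.println("matches: " + client.query(query).getResults().getNumFound());
        }
    }
}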
