Apache solr fuzzy search on list values - solr

Enviornment - solr-8.9.0
To implement fuzzy search on column "name" (fuzzy search for 'alaistiar~') of csv file in apache solr i am issueing following query
http://localhost:8983/solr/bigboxstore/select?indent=on&q=name:'alaistiar~'&wt=json
To implement fuzzy search on column "name" (fuzzy search for 'shanka~') of csv file in apache solr
http://localhost:8983/solr/bigboxstore/select?indent=on&q=name:'shanka~'&wt=json
May i combine both the above query in a single and find out the documents?
My first http request is doing fuzzy search for value alaistiar~ on name colums and giving some score value and second http request is for shanka~. When i combine both with 'OR' operator Will it behave same as they are individual request.Acutally My purpose is that i dont want to invoke multiple http request for multiple names, Also i want fuzzy search name in output indicating that this document is for name alaistiar~ and this document is for name shanka~
I have loaded a csv file having 4 columns(Size-5GB.) with 100 milion records. .csv file has following column names -
'name', 'father_name', 'date_of_passing','admission_number'
I have created index on column 'name. To do this i have executed following curl request on managed-schema(solr-8.9.0, jdk-11.0.12)
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"name","type":"text_general","stored":true,"indexed":true }}' http://localhost:8983/solr/bigboxstore/schema
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"father_name","type":"text_general","stored":true,"indexed":false }}' http://localhost:8983/solr/bigboxstore/schema
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"date_of_passing","type":"pdate","stored":true,"indexed":false }}' http://localhost:8983/solr/bigboxstore/schema
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field":{"name":"admission_number","type":"text_general","stored":true,"indexed":false }}' http://localhost:8983/solr/bigboxstore/schema
Is this a right way to create index on 1 column(only on name) as described above?
Now i have list of 1 milion names. On each name i have to do fuzzy-search(column:name) on already loaded data. In the output, for each name I have to return list of java objects including all 4 columns of .csv file.
Note- In output I also have to include name which was supplied as input(in where clause).
For each name, i am doing fuzzy search as follows :
http://localhost:8983/solr/bigboxstore/select?indent=on&q=name:'alaistiar~'&wt=json.
To do this i have to execute 1 milion http request, which i dont want. Instead of executing 1milion http request, May i do in a single http request?
I understand that 'OR operator will not solve my problem because i will not able to group output documents on the basis of name which was passed as a input.

Yes, you could unify the queries by using "OR":
name:alaistiar~ OR name:shanka~
http://localhost:8983/solr/bigboxstore/select?indent=on&q=name:alaistiar~ OR name:shanka~&wt=json
You could omit "OR" if your default operator is "OR". The query would look like:
name:alaistiar~ name:shanka~
http://localhost:8983/solr/bigboxstore/select?indent=on&q=name:alaistiar~ name:shanka~&wt=json
Of course the "space" should be escaped in URL.
Hello, again. After you edited your question, it is more clear what you are looking for:
only 1 query for 1 mil names
see in the result which response to which name corresponds
There is a solutions, but you have to do some post processing. You can use a POST request with a json with params (for 1) and you can use hit highlighting (for 2) like I did here:
curl 'http://localhost:8983/solr/bigboxstore/query?hl.fl=name&hl.simple.post=</b>&hl.simple.pre=<b>&hl=on' -H "Content-Type: application/x-www-form-urlencoded" -X POST -d 'json={"query":"name:alaistiar~ name:shanka~"}'
The answer contains two parts: the first one with the result and the second one with the id and the highlights -> you will have to pair them up on id after receiving the response.

Related

Document is not returned when searched using query parameter in solr

I updated a document in solr using the below query and it was successful.The document has other fields like organization_name,place etc apart from what is shown in the api below.
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8390/solr/collection/update?commit=true' -d '[{"id" : “12345”,”code” : {"set" : “500”}}]’
{
"responseHeader":{
"rf":1,
"status":0,
"QTime":11}}
Post update, when I tried to query solr with query parameter(q) as name, it does not return any document. At the same time, if I query using name as fq parameter, I see the document coming up fine.
This query doesnot work:-
http://localhost:8390/solr/collection/select?q=test&qf=content^0.1%20name_display_name^1.0&defType=edismax
But,this query works(with fq param),
http://localhost:8390/solr/collection/select?q=*%3A*&rows=1000&fq=ngram_info_organization_name:test
The field type of organization_name is string and its both indexed and stored.
This issue is seen only for the document that i updated. If I query for other documents which are not updated, i am able to see the results.
Please help to figure out why the document is not listed when I use the name with query parameter.

CouchDB - Mango Query to select records based on complex composite key

I have records with keys looking like this:
"001_test_66"
"001_testab_54"
"002_testbc_88"
"0020_tesgdtbc_38"
How can I query a couchdb database using Mango queries based on the first part of the key (001 or 002). The fourth one should fail if I search on '002'
You can use $regex operator described in chapter Condition Operators of CouchDB API Reference. In below example, I assumed _id to be the key you want to search by.
"selector": {
"_id": {
"$regex": "^001.*"
}
}
Here's an example using CURL (replace <db> with the name of your database).
curl -H 'Content-Type: application/json' -X POST http://localhost:5984/<db>/_find -d '{"selector":{"_id":{"$regex": "^001.*"}}}'

Using Apache Solr to do faceted search for unstructured text

I have a set of text documents that I have indexed in my respective collection defined in Solr. I am able to do a keyword based search that returns the required documents having the typed in keyword. But my next objective is to do faceted search on unstructured text wherein I am able to retrieve results based on facet fields.
I have tried the following steps:
1) I have defined a new field (distributioncompany)in managed-schema that will act as facet field with a copyfield also defined (distributioncompany_str). But when I do indexing curl command ( with id and distribution company name passed as arguments), I get the facet counts with q =*, but doesnot work when I type in a keyword in q field.
2) I also tried text tagger feature with a tag field defined for te facet field so that required entity can be extracted from document and matched against a list of tag values. But tag field not getting returned
For 1st approach:
1)
curl -X POST -H ‘Content-type:text/plain’ –data-binary ‘{“add-field”:{“name”:”distributioncompany_str”,”type”:”string”,”multiValued”:true,
“indexed”:true,”stored”:false}}’
http://localhost:8983/solr/collectionname/schema
(same code used for adding distributioncompany field)
2) Copy field added:
curl -X POST -H ‘Content-type:text/plain’ –data-binary ‘{“add-copy-field”:{“source”:”distributioncompany”,”dest”:”distributioncompany_str”}}’
http://localhost:8983/solr/collectionname/schema
3) Added a new document to index:
curl ‘http://localhost:8983/solr/collectionname/update/json/docs’ -H ‘Content-type:text/plain’ -d ‘{“id”:”Appeal No. 220 of 2013.pdf.txt”,”distributioncompany”:”Himachal Pradesh State Electricity Board”}’
But if query is done using q =*, it shows facet field count, but if query done using a keyword present in the document, it doesnot show up
For 2nd approach ( text tagger)
1) Added a new fieldtype "tag" in schema of collection
2) Added new fields ( a) trancompany (type:text_general),b) trancompany_tag(type:tag), c) copyfield for these 2 fields
3) Added a new custom SolrRequest Handler in Solrconfig file:
curl -X POST -H ‘Content-type:application/json’ http://192.168.0.95:8983/solr/rajdhanitest2/config -d ‘{
“add-requesthandler”:{
“name”:”/tag”,
“class”:”solr.TaggerRequestHandler”,
“defaults”:{“field”:” trancompany_tag”}
}
}’
4) Updates values for tag field "trancompany_tag" with curl command
5) But on passing text with one of tag values updated, only id gets returned not the tag value
For both approaches, the required faceted/ tagged field to be extracted from text document is not displayed when search query is done. Would appreciate help in guiding me how to do a faceted search for unstructured text documents

Delete all documents from Solr that have a certain empty field

Querying for those documents works with: "fq=-myfield:[* TO *]".
But how can I delete all those? It seems that the delete syntax update?stream.body=<delete><query>... accepts only a query, no filters...
Only pass -myfield[* TO *] in query tag. Do not pass fq parameter. Then it will work I feel. Once I had to delete all documents with id that contained word "data" in the id field string, I just passed id:*data* between query tags, and it worked. Let me know if that helps you.
The correct answer should be: -myfield:* or even -myfield:[* TO *], but the : is mandatory.
This is is an example with curl:
curl http://localhost:8983/solr/collection/update \
-H "Content-Type: text/xml" \
--data-binary '<delete><query>-myfield:*</query></delete>'

How to delete a doc at a specific shard in Solr

I want to delete a specific doc at a specific shard in Solr, below is my query:
http://localhost:8080/solr/collections_1_replica1/update?stream.body=<delete><query>id:1</query></delete>&commit=true&distrib=false
But this still effect to collections_2_replica1, so what is the correct query in this case.
If you use the default Solr Cloud collection configurations, Solr choose where to put the document according to the document id (docId.hash() % number of shards).
In other words, you're not supposed to delete from a specific shard because you can't be sure whether the document is there or on the other shards.
If I'm not wrong, the distrib=false parameter is not effective in updated.
Tyr out this one :
curl http://xx.xx.xx.xx:8983/solr/collection_name/update/?commit=true -H "Content-Type: text/xml" --data-binary '<delete><query>id:1</query></delete>'

Resources