eDisMax parser issue running over multiple fields - solr

Environment: Solr 8.9.0, Java version "11.0.12" 2021-07-20 LTS
The following .csv file is indexed in Solr:
books_id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
0553573403,book,Game Thrones Clash,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553573404,book,Game Thrones,7.99,true,George Martin,"A Song of Ice and Fire",1,fantasy
0553573405,book,Game Thrones,7.99,true,George,"A Song of Ice and Fire",1,fantasy
I want to search for a book whose name is 'Game Thrones Clash' (with mm=75%) and whose author is 'George R.R. Martin' (with mm=70%).
The book name should be searched only in the 'name' field with its own minimum-match value, and the author should be searched only in the 'author' field with a different mm value.
The field type text_general is configured for the fields 'name' and 'author', with multiValued set to false.
The query shall run over the field 'name' (mm=75%) with the value 'Game Thrones Clash' and the field 'author' (mm=70%) with the value 'George R.R. Martin'.
Only those results shall be displayed which satisfy all of the following three criteria:
a minimum of 75% of the tokens are fuzzy matches in the 'name' field;
a minimum of 70% of the tokens are fuzzy matches in the 'author' field;
the field 'inStock' has the value 'true'.
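(For the three-token inputs here, and assuming the percentage is rounded down as with the DisMax mm parameter, mm=75% of 3 clauses means floor(0.75 × 3) = 2 clauses must match, and mm=70% means floor(0.70 × 3) = 2, so each field needs at least 2 of its 3 fuzzy terms to match.)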
The output shall contain the following results:
0553573403 (name - 75% matched as well author 70% matched)
0553573404 (name - 75% matched as well author 70% matched)
The following books_id will not appear in the output:
0553573405 (name - 75% matched but author not 70% matched)
I understand that Extended DisMax supports the 'mm' (minimum should match) parameter together with fuzzy search, but the following query returns all 3 results.
curl -G http://$solrIp:8983/solr/testCore2/select --data-urlencode "q=(name:'Game~' OR name:'Thrones~' OR name:'Clash~')" --data-urlencode "defType=edismax" --data-urlencode "mm=75%" --data-urlencode "q=(author:'George~' OR author:'R.R.~' OR author:'Martin~')" --data-urlencode "defType=edismax" --data-urlencode "mm=70%" --data-urlencode "sort=books_id asc"
{
"responseHeader":{
"status":0,
"QTime":3,
"params":{
"mm":["75%",
"70%"],
"q":["(name:'Game~' OR name:'Thrones~' OR name:'Clash~')",
"(author:'George~' AND author:'R.R.~' AND author:'Martin~')"],
"defType":["edismax",
"edismax"],
"sort":"books_id asc"}},
"response":{"numFound":3,"start":0,"numFoundExact":true,"docs":[
{
"books_id":[553573403],
"cat":["book"],
"name":"Game Thrones Clash",
"price":[7.99],
"inStock":[true],
"author":"George R.R. Martin",
"series_t":"A Song of Ice and Fire",
"sequence_i":1,
"genre_s":"fantasy",
"id":"3de00ecb-fbaf-479b-bfde-6af7dd63c60f",
"_version_":1738326424041816064},
{
"books_id":[553573404],
"cat":["book"],
"name":"Game Thrones",
"price":[7.99],
"inStock":[true],
"author":"George Martin",
"series_t":"A Song of Ice and Fire",
"sequence_i":1,
"genre_s":"fantasy",
"id":"a036a400-4f54-4c90-a52e-888349ecb1da",
"_version_":1738326424107876352},
{
"books_id":[553573405],
"cat":["book"],
"name":"Game Thrones",
"price":[7.99],
"inStock":[true],
"author":"George",
"series_t":"A Song of Ice and Fire",
"sequence_i":1,
"genre_s":"fantasy",
"id":"36360825-1164-4cb6-bf48-ebeaaff0ef10",
"_version_":1738326424111022080}]
}}
Can someone help me write the eDisMax query, or suggest another way to do this?

Nested Queries can specify different query parameters for different parts of the query.
+_query_:"{!edismax mm=75% df=name} Game~ Thrones~ Clash~"
+_query_:"{!edismax mm=70% df=author} George~ R.R.~ Martin~"
+inStock:true
Nested Queries are specified by using _query_: as the prefix (see references [1], [2]).
Local Params [3] {! ... }... specify the query parser and its parameters by prefixing the query string.
query type [4] eDisMax [5]: {!type=edismax}... or just {!edismax}...
minimum [should-term] match (mm) [6]: {!mm=75%}...
default field (df) [7]: {!df=author}...
Requiring both of two subqueries can be specified via either of the following:
+query1 +query2
query1 AND query2
Line breaks are for clarity; all lines are within one query parameter (q=).
(If multiple q= parameters exist, adding a debugQuery=true parameter shows that only one is parsed.)
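As a quick check (a sketch, assuming the core name testCore2 from the question and a local Solr on port 8983), sending a single q parameter together with debugQuery=true shows how the combined query is parsed:
curl --silent --get "http://localhost:8983/solr/testCore2/select" \
  --data-urlencode 'q= +_query_:"{!edismax mm=75% df=name} Game~ Thrones~ Clash~"
+_query_:"{!edismax mm=70% df=author} George~ R.R.~ Martin~"
+inStock:true' \
  --data-urlencode "debugQuery=true" \
  --data-urlencode "sort=books_id asc"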
Testing
Steps to reproduce the test with a local Solr 9, in Cloud mode (for the Schema Designer page):
Command line: solr-9.0.0/bin/solr start -e cloud
answer the questions until you return to the command prompt
In web browser, open Solr Admin: http://localhost:8983/solr/
On the Schema Designer tab,
click New Schema
Name New Schema: eDisMaxTest1
Copy From: _default
click Add Schema
Under Sample Documents, paste the csv data and click Analyze Documents.
Under Schema Editor, select the fields author and name and
change the type to text_general,
disable the [ ] Multi-Valued checkbox,
click Update Schema.
To find the URL: from the Schema Designer page, open the browser's F12 debugger,
enable the Network tab,
perform a query: from the Query Tester, click RunQuery
copy the path http://localhost:8983/api/schema-designer/query
copy the parameters _=1658489222229 and configSet=eDisMaxTest1 (values may differ).
Request: one query string
bash: (\ before newline continues line, omit inside '...')
curl --silent --get localhost:8983/api/schema-designer/query \
--data-urlencode "_=1658489222229" \
--data-urlencode "configSet=eDisMaxTest1" \
--data-urlencode 'q= +_query_:"{!edismax mm=75% df=name} Game~ Thrones~ Clash~"
+_query_:"{!edismax mm=70% df=author} George~ R.R.~ Martin~"
+inStock:true' \
--data-urlencode "sort=books_id asc"
cmd: (^ before newline continues line, even inside '...')
curl --silent --get localhost:8983/api/schema-designer/query ^
--data-urlencode "_=1658489222229" ^
--data-urlencode "configSet=eDisMaxTest1" ^
--data-urlencode 'q= +_query_:"{!edismax mm=75% df=name} Game~ Thrones~ Clash~" ^
+_query_:"{!edismax mm=70% df=author} George~ R.R.~ Martin~" ^
+inStock:true' ^
--data-urlencode "sort=books_id asc"
Alternative request: use the nested query parameter v=$param to refer to other request parameters containing the subquery terms.
bash: (\ before newline continues line, omit inside '...')
curl --silent --get localhost:8983/api/schema-designer/query \
--data-urlencode "_=1658489222229" \
--data-urlencode "configSet=eDisMaxTest1" \
--data-urlencode 'q= +_query_:"{!edismax mm=75% df=name v=$qname}"
+_query_:"{!edismax mm=70% df=author v=$qauthor}"
+inStock:true' \
--data-urlencode "qname= Game~ Thrones~ Clash~" \
--data-urlencode "qauthor= George~ R.R.~ Martin~" \
--data-urlencode "sort=books_id asc"
cmd: (^ before newline continues line, even inside '...')
curl --silent --get localhost:8983/api/schema-designer/query ^
--data-urlencode "_=1658489222229" ^
--data-urlencode "configSet=eDisMaxTest1" ^
--data-urlencode 'q= +_query_:"{!edismax mm=75% df=name v=$qname}" ^
+_query_:"{!edismax mm=70% df=author v=$qauthor}" ^
+inStock:true' ^
--data-urlencode "qname= Game~ Thrones~ Clash~" ^
--data-urlencode "qauthor= George~ R.R.~ Martin~" ^
--data-urlencode "sort=books_id asc"
Response: two books as desired
{
...
"responseHeader":{
...
"params":{
"q":" +_query_:\"{!edismax mm=75% df=name v=$qname}\"\n
+_query_:\"{!edismax mm=70% df=author v=$qauthor}\"\n
+inStock:true",
"qauthor":" George~ R.R.~ Martin~",
"qname":" Game~ Thrones~ Clash~",
"sort":"books_id asc",
"configSet":"eDisMaxTest1",
"wt":"javabin",
"version":"2",
"_":"1658489222229"}},
"response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[
{
"name":"Game Thrones Clash",
"author":"George R.R. Martin",
...
"books_id":553573403,
"inStock":true},
{
"name":"Game Thrones",
"author":"George Martin",
...
"books_id":553573404,
"inStock":true}]
}}
References
[1] Nested Query Parser (Solr Reference Guide / Query Guide / Query Syntax and Parsers / Other Query Parsers)
https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#nested-query-parser
[2] Nested Queries in Solr
https://lucidworks.com/post/nested-queries-in-solr/
[3] Local Params (Solr Reference Guide / Query Guide / Query Syntax and Parsers)
https://solr.apache.org/guide/solr/latest/query-guide/local-params.html
[4] Query type short form (Solr Reference Guide / Query Guide / Query Syntax and Parsers / Local Params)
https://solr.apache.org/guide/solr/latest/query-guide/local-params.html#query-type-short-form
[5] Extended DisMax (eDisMax) Query Parser (Solr Reference Guide / Query Guide / Query Syntax and Parsers)
https://solr.apache.org/guide/solr/latest/query-guide/edismax-query-parser.html
[6] mm (Minimum [Should-term] Match) Parameter (Solr Reference Guide / Query Guide / Query Syntax and Parsers / DisMax Query Parser)
https://solr.apache.org/guide/solr/latest/query-guide/dismax-query-parser.html#mm-minimum-should-match-parameter
[7] df [default field] (Solr Reference Guide / Query Guide / Query Syntax and Parsers / Standard Query Parser / Standard Query Parser Parameters)
https://solr.apache.org/guide/solr/latest/query-guide/standard-query-parser.html#standard-query-parser-parameters

Related

How to check for a non-existent field in a document in Vespa

curl --location --request POST 'http://xxxxxx/search/' \
--header 'Content-Type: application/json' \
--data-raw '{
"offset": 0,
"hits": 60,
"ranking.softtimeout.factor": 0.7,
"ranking.profile": "default",
"yql": "select id, origin_id from mt_challenge where best_score = \"null\";",
"timeout": "1500ms",
"request_timeout": 5
}'
The best_score field is of type float. How can I find all documents where the best_score field does not exist?
Sorry, but NaN/null values are not searchable, so this is currently not supported. You can work around it by using a default value at ingestion time that does not conflict with your real data.
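A minimal sketch of that workaround, assuming the endpoint placeholder from the question, that the namespace matches the document type mt_challenge, that best_score is declared with indexing: attribute, and that -1.0 never occurs as a real score: feed the sentinel when the score is unknown, then query for it.
# feed a document with the sentinel instead of leaving best_score unset (document id is illustrative)
curl -X POST 'http://xxxxxx/document/v1/mt_challenge/mt_challenge/docid/some-doc-id' \
--header 'Content-Type: application/json' \
--data-raw '{ "fields": { "best_score": -1.0 } }'
# then "no real score yet" becomes a normal range query
curl -X POST 'http://xxxxxx/search/' \
--header 'Content-Type: application/json' \
--data-raw '{ "yql": "select id, origin_id from mt_challenge where best_score < 0;" }'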

How do I insert timeseries data into Solr?

I have an existing Solr collection named chronix.
And I have also configured chronix, but when I issue a query from the chronix JavaFX client, it throws an error like this:
2018-06-29 09:30:16.611 INFO (qtp761960786-20) [ x:chronix] o.a.s.c.S.Request [chronix] webapp=/solr path=/select params={q=*:*&fl=%2Bdata&start=0&rows=200&wt=javabin&version=2} status=500 QTime=1
2018-06-29 09:30:16.611 ERROR (qtp761960786-20) [ x:chronix] o.a.s.s.HttpSolrCall null:java.lang.NumberFormatException: For input string: "+"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
What am I missing?
You have a + before the value in your fl parameter. The fl parameter should contain the list of fields you want to retrieve, and having a + here is not a valid field name.
To get the field data back, &fl=data is what you want, not &fl=+data (%2B is the url encoded version of +)
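A sketch of the corrected request, assuming a local Solr on the default port and the chronix core from the log above:
curl "http://localhost:8983/solr/chronix/select?q=*:*&fl=data&start=0&rows=200"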

Does Sqoop use Reducer?

Does Sqoop run a reducer if there is a join/aggregation performed in the select query given with the --query parameter? Or is there any case in Sqoop where both mappers and reducers run?
Documentation specifies that each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop.
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
In the example above, how does the JOIN take place where the table is first partitioned using $CONDITIONS?
The join/computation will be executed on the RDBMS, and its result will be used by the mappers to transfer the data to HDFS.
No reducer is involved.
With the --query parameter, you need to specify the --split-by parameter with the column that should be used for slicing your data into multiple parallel map tasks (this parameter usually defaults automatically to the primary key of the main table).
Sqoop will automatically substitute the $CONDITIONS placeholder with generated conditions specifying which slice of data each map task should transfer.
In your particular command, Sqoop does not use a reducer.
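As a sketch of how the split works (the connection string, credentials, and mapper count below are illustrative, not from the question): each mapper runs its own copy of the query, with $CONDITIONS replaced by a range predicate on the --split-by column.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username sqoopuser -P \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults \
  -m 2
# mapper 1 might execute: ... WHERE ( a.id >= 1   AND a.id < 500 )
# mapper 2 might execute: ... WHERE ( a.id >= 500 AND a.id <= 1000 )
# (the actual boundaries come from MIN(a.id) and MAX(a.id) in the source table)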
However, there are cases when Sqoop does use reducers. Check the example below, taken from the documentation here.
$ sqoop export \
-Dmapred.reduce.tasks=2 \
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
-Dpgbulkload.input.field.delim=$'\t' \
-Dpgbulkload.check.constraints="YES" \
-Dpgbulkload.parse.errors="INFINITE" \
-Dpgbulkload.duplicate.errors="INFINITE" \
--connect jdbc:postgresql://pgsql.example.net:5432/sqooptest \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table test --username sqooptest --export-dir=/test -m 2

sqoop import all to hive from db2 specific schema

I was trying to import all tables from a specific schema in DB2 using the below command line.
sqoop import-all-tables --username user --password pass \
--connect jdbc:db2://myip:50000/databs:CurrentSchema=testdb \
--driver com.ibm.db2.jcc.DB2Driver --fields-terminated-by ',' \
--lines-terminated-by '\n' --hive-database default --hive-import --hive-overwrite \
--create-hive-table -m 1;
Stuck with the following error:
2017-05-02 09:21:18,474 ERROR - [main:] ~ Error reading database metadata:
com.ibm.db2.jcc.am.SqlSyntaxErrorException: [jcc][10165][10051][4.11.77]
Invalid database URL syntax:
jdbc:db2://myip:50000/msrc:CurrentSchema=testdb. ERRORCODE=-4461,
SQLSTATE=42815 (SqlManager:43)
com.ibm.db2.jcc.am.SqlSyntaxErrorException: [jcc][10165][10051][4.11.77]
Invalid database URL syntax:
jdbc:db2://myip:50000/msrc:CurrentSchema=testdb. ERRORCODE=-4461,
SQLSTATE=42815
at com.ibm.db2.jcc.am.gd.a(gd.java:676)
at com.ibm.db2.jcc.am.gd.a(gd.java:60)
at com.ibm.db2.jcc.am.gd.a(gd.java:85)
at com.ibm.db2.jcc.DB2Driver.tokenizeURLProperties(DB2Driver.java:911)
at com.ibm.db2.jcc.DB2Driver.connect(DB2Driver.java:408)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at
org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:885)
at
org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.listTables(SqlManager.java:520)
at
org.apache.sqoop.tool.ImportAllTablesTool.run(ImportAllTablesTool.java:95)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
at java.util.StringTokenizer.nextToken(StringTokenizer.java:377)
at com.ibm.db2.jcc.DB2Driver.tokenizeURLProperties(DB2Driver.java:899)
... 13 more
Could not retrieve tables list from server
2017-05-02 09:21:18,696 ERROR - [main:] ~ manager.listTables() returned null
(ImportAllTablesTool:98)
Command:
sqoop import-all-tables \
--driver com.ibm.db2.jcc.DB2Driver \
--connect jdbc:db2://myip:50000/databs \
--username username --password password \
--hive-database default --hive-import --m 1 \
--create-hive-table --hive-overwrite
The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key.
You must intend to import all columns of each table.
You must not intend to use non-default splitting column, nor impose any conditions via a WHERE clause.
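If you actually need the schema qualifier rather than dropping it, my understanding is that the DB2 JDBC driver requires each URL property to be terminated with a semicolon (a missing terminator is a common cause of the -4461 "Invalid database URL syntax" error), so a connect string along these lines might work (hypothetical; adjust database, schema, and credentials):
sqoop import-all-tables \
--driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://myip:50000/databs:currentSchema=testdb;" \
--username user --password pass \
--hive-database default --hive-import --hive-overwrite \
--create-hive-table -m 1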

How to delete all data from solr and hbase

How do I delete all data from solr by command? We are using solr with lily and hbase.
How can I delete data from both hbase and solr?
http://lucene.apache.org/solr/4_10_0/tutorial.html#Deleting+Data
If you want to clean up the Solr index, you can fire this HTTP URL:
http://host:port/solr/[core name]/update?stream.body=<delete><query>*:*</query></delete>&commit=true
(replace [core name] with the name of the core you want to delete from). Or use this if posting XML data:
<delete><query>*:*</query></delete>
Be sure you use commit=true to commit the changes.
I don't have much idea about clearing HBase data, though.
I've used this request to delete all my records, but sometimes it's necessary to commit the change.
For that, add &commit=true to your request:
http://host:port/solr/core/update?stream.body=<delete><query>*:*</query></delete>&commit=true
Post json data (e.g. with curl)
curl -X POST -H 'Content-Type: application/json' \
'http://<host>:<port>/solr/<core>/update?commit=true' \
-d '{ "delete": {"query":"*:*"} }'
You can use the following commands to delete.
Use the "match all docs" query in a delete by query command:
<delete><query>*:*</query></delete>
You must also commit after running the delete, so to empty the index, run the following two commands:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
Another strategy would be to add two bookmarks in your browser:
http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/update?stream.body=<commit/>
Source docs from SOLR:
https://wiki.apache.org/solr/FAQ#How_can_I_delete_all_documents_from_my_index.3F
If you want to delete all of the data in Solr via SolrJ do something like this.
public static void deleteAllSolrData() {
HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/core/");
try {
solr.deleteByQuery("*:*");
} catch (SolrServerException e) {
throw new RuntimeException("Failed to delete data in Solr. "
+ e.getMessage(), e);
} catch (IOException e) {
throw new RuntimeException("Failed to delete data in Solr. "
+ e.getMessage(), e);
}
}
If you want to delete all of the data in HBase do something like this.
public static void deleteHBaseTable(String tableName, Configuration conf) {
HBaseAdmin admin = null;
try {
admin = new HBaseAdmin(conf);
admin.disableTable(tableName);
admin.deleteTable(tableName);
} catch (MasterNotRunningException e) {
throw new RuntimeException("Unable to delete the table " + tableName
+ ". The actual exception is: " + e.getMessage(), e);
} catch (ZooKeeperConnectionException e) {
throw new RuntimeException("Unable to delete the table " + tableName
+ ". The actual exception is: " + e.getMessage(), e);
} catch (IOException e) {
throw new RuntimeException("Unable to delete the table " + tableName
+ ". The actual exception is: " + e.getMessage(), e);
} finally {
close(admin); // helper (not shown) that presumably null-checks and closes the HBaseAdmin
}
}
Use the "match all docs" query in a delete by query command: :
You must also commit after running the delete so, to empty the index, run the following two commands:
curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
I came here looking to delete all documents from solr instance through .Net framework using SolrNet. Here is how I was able to do it:
Startup.Init<MyEntity>("http://localhost:8081/solr");
ISolrOperations<MyEntity> solr =
ServiceLocator.Current.GetInstance<ISolrOperations<MyEntity>>();
SolrQuery sq = new SolrQuery("*:*");
solr.Delete(sq);
solr.Commit();
This has cleared all the documents. (I am not sure if this can be recovered; I am in the learning and testing phase with Solr, so please consider a backup before using this code.)
From the command line use:
bin/post -c core_name -type text/xml -out yes -d $'<delete><query>*:*</query></delete>'
Fire this in the browser:
http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>&commit=true
This command will delete all the documents in the index in Solr.
I've used this query to delete all my records.
http://host/solr/core-name/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true
To delete all documents of a Solr collection, you can use this request:
curl -X POST -H 'Content-Type: application/json' --data-binary '{"delete":{"query":"*:*" }}' http://localhost:8983/solr/my_collection/update?commit=true
It uses the JSON body.
The curl examples above all failed for me when I ran them from a Cygwin terminal. There were errors like this when I ran the script example:
curl http://192.168.2.20:7773/solr/CORE1/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
</response>
<!--
It looks like it deleted stuff, but it did not go away
maybe because the committing call failed like so
-->
curl http://192.168.1.2:7773/solr/CORE1/update --data-binary '' -H 'Content-type:text/xml; charset=utf-8'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst><lst name="error"><str name="msg">Unexpected EOF in prolog
at [row,col {unknown-source}]: [1,0]</str><int name="code">400</int></lst>
</response>
I needed to use the delete in a loop on core names to wipe them all out in a project.
This query below worked for me in the Cygwin terminal script.
curl http://192.168.1.2:7773/hpi/CORE1/update?stream.body=<delete><query>*:*</query></delete>&commit=true
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int></lst>
</response>
This one line made the data go away and the change persisted.
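In case the shell ever mangles the unquoted URL (& and the angle brackets are special characters in most shells), a quoted form of the same request should behave identically; a sketch assuming the same host and core:
curl "http://192.168.1.2:7773/hpi/CORE1/update?stream.body=<delete><query>*:*</query></delete>&commit=true"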
I tried the below steps. It works well.
Please make sure the SOLR server is running.
Just click the link Delete all SOLR data, which will hit the server and delete all your SOLR indexed data, and you will get the following details on the screen as output:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">494</int>
</lst>
</response>
If you are not getting the above output, then please make sure of the following:
I used the default host (localhost) and port (8080) in the above link. Please alter the host and port if they are different on your end.
The default core name should be collection / collection1. I used collection1 in the above link. Please change it too if your core name is different.
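(The "Delete all SOLR data" link itself is not reproduced here; given the defaults described above (localhost, port 8080, core collection1), it was presumably a stream.body delete request of the same shape as in the other answers, something like:)
http://localhost:8080/solr/collection1/update?stream.body=<delete><query>*:*</query></delete>&commit=true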
If you need to clean out all data, it might be faster to recreate the collection, e.g.:
solrctl --zk localhost:2181/solr collection --delete <collectionName>
solrctl --zk localhost:2181/solr collection --create <collectionName> -s 1
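solrctl here is the Cloudera Search CLI; on a stock SolrCloud installation, an equivalent drop-and-recreate could go through the Collections API (a sketch, assuming a single-shard collection and Solr on localhost:8983):
curl "http://localhost:8983/solr/admin/collections?action=DELETE&name=<collectionName>"
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=<collectionName>&numShards=1"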
I made a JavaScript bookmarklet which adds the delete link to the Solr Admin UI:
javascript: (function() {
var str, $a, new_href, href, upd_str = 'update?stream.body=<delete><query>*:*</query></delete>&commit=true';
$a = $('#result a#url');
href = $a.attr('href');
str = href.match('.+solr\/.+\/(.*)')[1];
new_href = href.replace(str, upd_str);
$('#result').prepend('<a id="url_upd" class="address-bar" href="' + new_href + '"><strong>DELETE ALL</strong> ' + new_href + '</a>');
})();
If you're using Cloudera 5.x, it is mentioned here in this documentation that Lily also maintains real-time updates and deletions:
Configuring the Lily HBase NRT Indexer Service for Use with Cloudera Search
As HBase applies inserts, updates, and deletes to HBase table cells,
the indexer keeps Solr consistent with the HBase table contents, using
standard HBase replication.
Not sure if truncate 'hTable' is also supported in the same way.
Otherwise you can create a trigger or service to clear your data from both Solr and HBase on a particular event.
When clearing out a Solr index, you should also do a commit and optimize after running the delete-all query. Full steps required (curl is all you need): http://www.alphadevx.com/a/365-Clearing-a-Solr-search-index
I am not sure about Solr, but you can delete all the data from HBase using the truncate command like below:
truncate 'table_name'
It will delete all row keys from the HBase table.
