Just got an issue where our Solr 6.6 doesn't seem to be using the logical operators (AND, OR, NOT) as operators but is actually searching for the words. So results that should only have a few hundred hits now have thousands. We are using the edismax parser.
Solr Query: Apple AND Google
"debug":{
"rawquerystring":"Apple AND Google",
"querystring":"Apple AND Google",
"parsedquery":"(+(DisjunctionMaxQuery((cm_city_t:apple | cm_credit_t:apple | cm_notes_t:apple | cm_state_t:apple | cm_country_t:apple | cm_description_t:apple | cm_caption_writer_s:Apple | cm_photographer_t:apple)) DisjunctionMaxQuery((cm_city_t:and | cm_credit_t:and | cm_notes_t:and | cm_state_t:and | cm_country_t:and | cm_description_t:and | cm_caption_writer_s:AND | cm_photographer_t:and)) DisjunctionMaxQuery((cm_city_t:google | cm_credit_t:google | cm_notes_t:google | cm_state_t:google | cm_country_t:google | cm_description_t:google | cm_caption_writer_s:Google | cm_photographer_t:google))))/no_coord",
"parsedquery_toString":"+((cm_city_t:apple | cm_credit_t:apple | cm_notes_t:apple | cm_state_t:apple | cm_country_t:apple | cm_description_t:apple | cm_caption_writer_s:Apple | cm_photographer_t:apple) (cm_city_t:and | cm_credit_t:and | cm_notes_t:and | cm_state_t:and | cm_country_t:and | cm_description_t:and | cm_caption_writer_s:AND | cm_photographer_t:and) (cm_city_t:google | cm_credit_t:google | cm_notes_t:google | cm_state_t:google | cm_country_t:google | cm_description_t:google | cm_caption_writer_s:Google | cm_photographer_t:google))",
"QParser":"ExtendedDismaxQParser",
"altquerystring":null,
"boost_queries":null,
"parsed_boost_queries":[],
"boostfuncs":null,
"filter_queries":["an_sas_s:\"Photo System \\- P1\" AND an_security_group_s:\"P1.Leaders\""],
"parsed_filter_queries":["+an_sas_s:Photo System - P1 +an_security_group_s:P1.Leaders"],
You can see the "and" is being included as a search term in our fields, and I'm not sure why it's doing this. I have found that if I drop down to the dismax parser it works fine. I'm new to developing and working with Solr, but my understanding is that edismax is the more advanced parser and the one generally recommended. Could this be a configuration issue with the request handler or something else?
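For reference, here is a minimal sketch of how I can reproduce this with the edismax parameters passed explicitly on the request instead of relying on the request handler defaults (the host, core name and qf list below are simplified placeholders, not our real config):

# Send the same query with explicit edismax parameters and debug output so the
# parsed query can be compared against what the request handler defaults produce.
# Host, core name and qf list are placeholders.
import requests

params = {
    "q": "Apple AND Google",
    "defType": "edismax",
    "q.op": "OR",                   # default operator between bare terms
    "lowercaseOperators": "false",  # only treat upper-case AND/OR as operators
    "qf": "cm_description_t cm_caption_writer_s",
    "debugQuery": "true",
    "wt": "json",
}

resp = requests.get("http://localhost:8983/solr/photos/select", params=params)
debug = resp.json()["debug"]
print(debug["QParser"])
print(debug["parsedquery"])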
Do we have any mechanism in Snowflake to alert users who run a query against large tables, so that they know in advance that Snowflake would consume many warehouse credits if they run the query against a large dataset?
There is no alert mechanism for this, but users may run the EXPLAIN command before running the actual query to estimate the bytes/partitions read:
explain select c_name from "SAMPLE_DATA"."TPCH_SF10000"."CUSTOMER";
+-------------+----+--------+-----------+-----------------------------------+-------+-----------------+-----------------+--------------------+---------------+
| step        | id | parent | operation | objects                           | alias | expressions     | partitionsTotal | partitionsAssigned | bytesAssigned |
+-------------+----+--------+-----------+-----------------------------------+-------+-----------------+-----------------+--------------------+---------------+
| GlobalStats |    |        |           |                                   |       |                 | 6585            | 6585               | 109081790976  |
| 1           | 0  |        | Result    |                                   |       | CUSTOMER.C_NAME |                 |                    |               |
| 1           | 1  | 0      | TableScan | SAMPLE_DATA.TPCH_SF10000.CUSTOMER |       | C_NAME          | 6585            | 6585               | 109081790976  |
+-------------+----+--------+-----------+-----------------------------------+-------+-----------------+-----------------+--------------------+---------------+
https://docs.snowflake.com/en/sql-reference/sql/explain.html
You can also assign users to specific warehouses, and use resource monitors to limit credits on those warehouses.
https://docs.snowflake.com/en/user-guide/resource-monitors.html#assignment-of-resource-monitors
As a third alternative, you may set STATEMENT_TIMEOUT_IN_SECONDS to prevent long-running queries.
https://docs.snowflake.com/en/sql-reference/parameters.html#statement-timeout-in-seconds
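If you want to automate the EXPLAIN suggestion, a rough sketch of a client-side "pre-flight" check with the Python connector could look like the following (the connection parameters, the 100 GB threshold and the column handling are illustrative only; this is not a built-in feature):

# Estimate how many bytes/partitions a query would scan and warn before running it.
# Connection parameters and the 100 GB threshold are placeholders.
import snowflake.connector
from snowflake.connector import DictCursor

THRESHOLD_BYTES = 100 * 1024 ** 3  # warn above ~100 GB scanned

query = 'select c_name from "SAMPLE_DATA"."TPCH_SF10000"."CUSTOMER"'

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***", warehouse="my_wh"
)
cur = conn.cursor(DictCursor)

# Optionally cap runtime for this session as well (the third alternative above).
cur.execute("ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 600")

# EXPLAIN returns one row per plan step; the GlobalStats row carries the totals
# (column names as shown in the tabular output above).
cur.execute("EXPLAIN " + query)
stats = next(row for row in cur if row["step"] == "GlobalStats")

if stats["bytesAssigned"] and stats["bytesAssigned"] > THRESHOLD_BYTES:
    print("Warning: query would scan ~%.0f GB across %s partitions."
          % (stats["bytesAssigned"] / 1024 ** 3, stats["partitionsAssigned"]))
else:
    cur.execute(query)
    print(cur.fetchmany(10))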
I'm currently struggling to get decent performance for a date range query on a core of ~18M documents with NRT indexing, in Solr 4.10 (CDH 5.14).
I tried multiple strategies but everything seems to fail.
Each document has multiple versions (10 to 100) valid over different, non-overlapping periods of time (startTime/endTime).
The query pattern is the following: query on the referenceNumber (or other criteria) but only return documents valid at a referenceDate (day precision). 75% of queries select a referenceDate within the last 30 days.
If we query without the referenceDate, we have very good performance, but a 100x slowdown with the additional referenceDate filter, even when forcing it as a post-filter.
Here are some perf tests from a Python script executing HTTP queries and computing the QTime for 100 distinct referenceNumber values (a simplified sketch of the script is included further below).
+----+-------------------------------------+----------------------+--------------------------+
| ID | Query | Results | Comment |
+----+-------------------------------------+----------------------+--------------------------+
| 1 | q=referenceNumber:{referenceNumber} | 100 calls in <10ms | Performance OK |
+----+-------------------------------------+----------------------+--------------------------+
| 2 | q=referenceNumber:{referenceNumber} | 99 calls in <10ms | 1 call to warm up |
| | &fq=startDate:[* to NOW/DAY] | 1 call in >=1000ms | the cache then all |
| | AND endDate:[NOW/DAY to *] | | queries hit the filter |
| | | | cache. Problem: as |
| | | | soon as new documents |
| | | | come in, they invalidate |
| | | | the cache. |
+----+-------------------------------------+----------------------+--------------------------+
| 3 | q=referenceNumber:{referenceNumber} | 99 calls in >=500ms | The average of |
| | &fq={!cache=false cost=200} | 1 call in >=1000ms | calls is 734.5ms. |
| | startDate:[* to NOW/DAY] | | |
| | AND endDate:[NOW/DAY to *] | | |
+----+-------------------------------------+----------------------+--------------------------+
How is it possible that the additional date range filter creates a 100x slowdown? Based on this blog post, I would have expected the date range filter query to perform similarly to the query without the additional filter: http://yonik.com/advanced-filter-caching-in-solr/
Or is the only option to change the softCommit/hardCommit delays, create 30 warmup fq entries for the past 30 days, and tolerate poor performance on 25% of our queries?
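For reference, here is a simplified sketch of the benchmarking script mentioned above (the Solr URL, core name and the source of the reference numbers are placeholders; the field names and local params match the queries in the tables):

# Fire one query per reference number and report the server-side QTime.
import requests

SOLR = "http://localhost:8983/solr/mycore/select"
reference_numbers = open("reference_numbers.txt").read().split()[:100]

def bench(extra_params):
    times = []
    for ref in reference_numbers:
        params = {"q": "referenceNumber:%s" % ref, "rows": "0", "wt": "json"}
        params.update(extra_params)
        response = requests.get(SOLR, params=params).json()
        times.append(response["responseHeader"]["QTime"])  # reported in ms
    return times

# Query 1: reference number only.
plain = bench({})
# Query 3: uncached date range filter requested as a post-filter.
filtered = bench({"fq": "{!cache=false cost=200}startDate:[* TO NOW/DAY] AND endDate:[NOW/DAY TO *]"})

for label, times in (("no date filter", plain), ("uncached date filter", filtered)):
    print("%s: avg %.1f ms, max %d ms" % (label, sum(times) / float(len(times)), max(times)))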
Edit 1: Thanks for the answer. Unfortunately, using integers instead of tdate does not seem to provide any performance gain; it can only leverage caching, like query ID 2 above. That means we would need a warmup strategy of 30+ fq entries.
+----+-------------------------------------+----------------------+--------------------------+
| ID | Query | Results | Comment |
+----+-------------------------------------+----------------------+--------------------------+
| 4 | fq={!cache=false} | 35 calls in <10ms | |
| | referenceNumber:{referenceNumber} | 65 calls in >10ms | |
+----+-------------------------------------+----------------------+--------------------------+
| 5 | fq={!cache=false} | 9 calls in >100ms | |
| | referenceNumber:{referenceNumber} | 6 calls in >500ms | |
| | AND versionNumber:[2 TO *] | 85 calls in >1000ms | |
+----+-------------------------------------+----------------------+--------------------------+
Edit 2: It seems that moving my referenceNumber from q to fq and setting different costs improves the query time (not perfect, but better). What's weird though is that a cost >= 100 is supposed to be executed as a postFilter, yet changing the cost from 20 to 200 does not seem to impact performance at all. Does anyone know how to see whether an fq param is actually executed as a post filter?
+----+-------------------------------------+----------------------+--------------------------+
| 6 | fq={!cache=false cost=0} | 89 calls in >100ms | |
| | referenceNumber:{referenceNumber} | 11 calls in >500ms | |
| | &fq={!cache=false cost=200} | | |
| | startDate:[* TO NOW] AND | | |
| | endDate:[NOW TO *] | | |
+----+-------------------------------------+----------------------+--------------------------+
| 7 | fq={!cache=false cost=0} | 36 calls in >100ms | |
| | referenceNumber:{referenceNumber} | 64 calls in >500ms | |
| | &fq={!cache=false cost=20} | | |
| | startDate:[* TO NOW] AND | | |
| | endDate:[NOW TO *] | | |
+----+-------------------------------------+----------------------+--------------------------+
Hi, I have another solution for you that should give good performance for the same query in Solr.
My suggestion is to store the dates in int format; please find an example below.
Your start date: 2017-03-01
Your end date: 2029-03-01
Suggested int format:
Start date: 20170301
End date: 20290301
When you fire the same query with int values instead of dates, it works faster, as expected.
So your query will be:
q=referenceNumber:{referenceNumber}
&fq=startNewDate:[* TO YYYYMMDD]
AND endNewDate:[YYYYMMDD TO *]
Hope it will help you.
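A small sketch of how the conversion and the query could look from Python (the URL, core name and reference number are placeholders; startNewDate/endNewDate are the int fields from the example above):

# Index the validity period as YYYYMMDD integers and filter with a plain numeric range.
import datetime
import requests

def to_int(d):
    """Convert a date to its int form, e.g. 2017-03-01 -> 20170301."""
    return d.year * 10000 + d.month * 100 + d.day

today = to_int(datetime.date.today())
params = {
    "q": "referenceNumber:12345",  # example reference number
    "fq": "startNewDate:[* TO %d] AND endNewDate:[%d TO *]" % (today, today),
    "wt": "json",
}
print(requests.get("http://localhost:8983/solr/mycore/select", params=params).json())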
Can someone help clarify the RESTRICT operator?
I understand the Sybase RESTRICT operator is used to evaluate expressions based on columns, but I still can't figure out its exact meaning in the query plan.
For example, below is a query plan snippet from my SQL. For RESTRICT Operator (VA = 1)(4)(0)(0)(0)(0): what does (4)(0)(0)(0)(0) mean?
10 operator(s) under root
|ROOT:EMIT Operator (VA = 10)
|
| |SCALAR AGGREGATE Operator (VA = 9)
| | Evaluate Ungrouped COUNT AGGREGATE.
| |
| | |N-ARY NESTED LOOP JOIN Operator (VA = 8) has 7 children.
| | |
| | | |RESTRICT Operator (VA = 1)(4)(0)(0)(0)(0)
| | | |
| | | | |SCAN Operator (VA = 0)
| | | | | FROM TABLE
| | | | | trade
| | | | | t
| | | | | Index : i1
| | | | | Forward Scan.
| | | | | Positioning by key.
| | | | | Keys are:
| | | | | order_number ASC
| | | | | Using I/O Size 16 Kbytes for index leaf pages.
| | | | | With LRU Buffer Replacement Strategy for index leaf pages.
| | | | | Using I/O Size 16 Kbytes for data pages.
| | | | | With LRU Buffer Replacement Strategy for data pages.
| | |
| | | |SCAN Operator (VA = 2)
| | | | FROM TABLE
| | | | product
| | | | mp
| | | | Index : mp
| | | | Forward Scan.
| | | | Positioning by key.
| | | | Keys are:
| | | | prod_id ASC
| | | | Using I/O Size 16 Kbytes for index leaf pages.
| | | | With LRU Buffer Replacement Strategy for index leaf pages.
| | | | Using I/O Size 16 Kbytes for data pages.
| | | | With LRU Buffer Replacement Strategy for data pages.
| | |
| | | |SCAN Operator (VA = 3)
| | | | FROM TABLE
| | | | Accounts
| | | | a
| | | | Index : i2
| | | | Forward Scan.
| | | | Positioning by key.
| | | | Index contains all needed columns. Base table will not be read.
| | | | Keys are:
| | | | account ASC
| | | | Using I/O Size 16 Kbytes for index leaf pages.
| | | | With LRU Buffer Replacement Strategy for index leaf pages.
Parts of the showplan output, like those numbers behind the operator, are internals of the ASE optimizer. There is no documented information about them; they are included to help tech support resolve issues.
The 'VA = n' part simply reflects the unique number 'n' assigned to every operator in the query plan.
In Cucumber, we can directly validate database table content in tabular format by specifying the values in the format below:
| Type | Code | Amount |
| A | HIGH | 27.72 |
| B | LOW | 9.28 |
| C | LOW | 4.43 |
Do we have something similar in Robot Framework? I need to run a query on the DB, and the output looks like the table given above.
No, there is nothing built in to do exactly what you say. However, it's fairly straightforward to write a keyword that takes a table of data and compares it to another table of data.
For example, you could write a keyword that takes the result of the query followed by rows of information (though the rows must all have exactly the same number of columns):
| | ${ResultOfQuery}= | <do the database query>
| | Database should contain | ${ResultOfQuery}
| | ... | #Type | Code | Amount
| | ... | A | HIGH | 27.72
| | ... | B | LOW | 9.28
| | ... | C | LOW | 4.43
Then it's just a matter of iterating over all of the arguments three at a time, and checking if the data has that value. It would look something like this:
*** Keywords ***
| Database should contain
| | [Arguments] | ${actual} | @{expected}
| | :FOR | ${type} | ${code} | ${amount} | IN | @{expected}
| | | <verify that the values are in ${actual}>
Even easier might be to write a python-based keyword, which makes it a bit easier to iterate over datasets.
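For example, a small Python keyword library along these lines (the file name and keyword name are just illustrative; save it as DatabaseChecks.py and import it with the Library setting):

# Illustrative Python-based keyword: compare rows returned by a database query
# against expected rows passed as a flat list of cells, three per row.
class DatabaseChecks:

    def database_should_contain(self, actual_rows, *expected):
        """Fail unless every expected (type, code, amount) row is in actual_rows."""
        if len(expected) % 3 != 0:
            raise AssertionError("Expected values must come in groups of three.")
        expected_rows = [tuple(expected[i:i + 3]) for i in range(0, len(expected), 3)]
        actual = {tuple(str(cell) for cell in row) for row in actual_rows}
        missing = [row for row in expected_rows if row not in actual]
        if missing:
            raise AssertionError("Rows not found in query result: %s" % (missing,))

It would be called from a test exactly like the user keyword above, with the query result as the first argument and the expected cells as the remaining arguments.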
A subtable or extended table, I don't know the correct name for it.
I need to implement a table with the following structure:
Car    | Number  | Price  | Date     |
Mazda  | 0122335 | $20000 | 01.08.10 |
       |         | $19999 | 02.08.10 |
       |         | $19500 | 03.08.10 |
       |         | $19000 | 04.08.10 |
Toyota | 0254785 | $50000 | 01.08.10 |
BMW    | 1212222 | $80000 | 04.08.10 |
       |         | $75000 | 06.08.10 |
       |         | $70000 | 08.08.10 |
       |         | $65000 | 10.08.10 |
       |         | $60000 | 12.08.10 |
       |         | $55000 | 15.08.10 |
As you can see, for one Car row we have several lines with Price and Date.
I have not found examples of such a structure, so I'm asking for help on the forum. Maybe someone knows how to implement such a table.
Thanks.
Grouping grid?
http://dev.sencha.com/deploy/dev/examples/grid/grouping.html
Yes, perhaps you're right; I'll need to tinker a bit with the standard examples.