I have to search documents in Solr using multiple OR keywords. The number of keywords can reach 5000, which results in an awfully large query with 5000 OR conditions and causes the Solr server to hang. Is there another way to design the query so that it works? A short sample of the query is given below:
tweet_id:337931022601699328 OR 337931064293081089 OR 337931089538584576 OR 337931098761871361 OR 337931138851016704 OR 337931143099854848 OR 337931160082591745 OR 337931163857453056 OR 337931230819516416 OR 337931239996665857 OR 337931287518126080 OR 337931322850951168 OR 337931325648535553 OR 337931331398934528 OR 337931413057830912 OR 337931442363441152 OR 337931448629731329 OR 337931453344129025 OR 337931465016877056 OR 337931482066726912 OR 337931514388029442 OR 337931533149155328 OR 337931645527130114 OR 337931704935256064 OR 337931784459268096 OR 337931845545103360 OR 337931889086185472 OR 337931892668108801 OR 337931963983855617 OR 337932154212319233 OR 337932176454721536 OR 337932193198374912 OR 337932229659459584 OR 337932437290090496 OR 337932436807749632 OR 337932436828725250 OR 337932437449474048 OR 337932448518250496 OR 337932458832035843 OR 337932458634915840 OR 337932458278387712 OR 337932474246119425 OR 337932476209041409 OR 337932477408620544 OR 337932480478842880 OR 337932478775959554 OR 337932480566931456 OR 337932478763376640 OR 337932481841999872 OR 337932479337992192 OR 337932479296045057 OR 337932479333797889 OR 337932484614434816 OR 337932484606038017 OR 337932482777317376 OR 337932484664758272 OR 337932482785718273 OR 337932484589273088 OR 337932487399444481 OR 337932489031032833 OR 337932489114923008 OR 337932486573166592 OR 337932490704560130 OR 337932489144270848 OR 337932488762601472 OR 337932492097069056 OR 337932497780355072 OR 337932498900230144 OR 337932499722321921 OR 337932514431729665 OR 337932561806409731 OR 337932567284154368 OR 337932567300935680 OR 337932574603214848 OR 337932571134533632 OR 337932574674518016 OR 337932575484026881 OR 337932578206121984 OR 337932582215892994 OR 337932586653454336 OR 337932584917024768 OR 337932592986865664 OR 337932597017587712 ....
I intend to facet the result based on a few fields.
I'm not sure whether this solution will help you, but here is something to try for your problem.
Whatever query you send to Solr, it first parses the query into its own internal representation and then executes it to produce the result. You have to do some calculation before querying Solr. Take the following scenario.
Suppose you have 5000 tweet_id values in total and need an OR query over about 4000 of them. In that case it is better to query on the other 1000 (5000 - 4000) tweet_id values with a negated AND query, so that far fewer values are passed in the query.
So, when the excluded set is the smaller one, query on the remaining tweet_id values with a negated AND query instead of an OR query.
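The calculation described above could be sketched in Java roughly as follows. This is only a sketch under a strong assumption: every document in the index has a tweet_id that belongs to the known full set, otherwise the negated form would match extra documents. The class and method names are hypothetical, and the code only builds the query string, picking whichever form lists fewer values.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class ComplementQuery {

    /**
     * Returns field:(a OR b ...) for the wanted ids, or the negated form
     * *:* -field:(x OR y ...) when the excluded set is the smaller one.
     * Assumption: every document's id is contained in allIds.
     */
    static String build(String field, Set<String> allIds, Set<String> wantedIds) {
        if (wantedIds.isEmpty()) {
            throw new IllegalArgumentException("wantedIds must not be empty");
        }
        // Ids that must NOT match: everything known minus the wanted ones.
        Set<String> excluded = new LinkedHashSet<>(allIds);
        excluded.removeAll(wantedIds);

        if (!excluded.isEmpty() && excluded.size() < wantedIds.size()) {
            return "*:* -" + field + ":(" + String.join(" OR ", excluded) + ")";
        }
        return field + ":(" + String.join(" OR ", wantedIds) + ")";
    }

    public static void main(String[] args) {
        Set<String> all = new LinkedHashSet<>(Arrays.asList("1", "2", "3", "4", "5"));
        Set<String> wanted = new LinkedHashSet<>(Arrays.asList("1", "2", "3", "4"));
        System.out.println(build("tweet_id", all, wanted)); // *:* -tweet_id:(5)
    }
}
```

With 4000 wanted ids out of 5000, the negated form carries 1000 values instead of 4000. If the index also holds documents outside the known set, the positive form is the only safe one.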
If I were you, I'd create a new field holding this custom_list_id. Whenever you generate a new list, index the new data, then query by the list ID.
Related
I am trying to upgrade from Solr 4.x to a Solr 5.2.1 SolrCloud implementation. I had written the following code to get all the results from a Solr query, which works well in single-instance mode.
SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.addSort("agent_status", ORDER.desc);
query.addFilterQuery("account_id:\"" + accountId + "\"");
query.set("rows", Integer.MAX_VALUE);
But the code does not work in the SolrCloud implementation. It throws the following exception.
2015-08-14 16:44:45,648 ERROR [solr.core.SolrCore] - [http-8080-8] : java.lang.NegativeArraySizeException
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:58)
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:39)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.<init>(ShardDoc.java:113)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:972)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:750)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:729)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:388)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
I found that it fails because of the query.set("rows", Integer.MAX_VALUE) statement. People suggested that I use pagination.
But I cannot afford pagination, as it would require too many changes on the UI side.
There is one more way: first query with some small number of rows, get the total number of documents using the response.getResults().getNumFound() method, and then pass that value to setRows. But this approach adds one more call to the server.
Is there any other way I can solve this problem?
You can always set your rows to be a large value that would encompass your results. Integer.MAX_VALUE will not work due to the size limits of Java Arrays (see here) and the Lucene Priority Queue (see lines 42 - 58).
SOLR-534 requested essentially what you're asking for; it contains some good discussion of why such a feature would or would not be a good idea.
A better question might be: how many documents can the UI hold without becoming unusable? However many that is would be a good value for your query to return.
My class had been working for months, but now it gives me the following error:
"System.QueryException: Non-selective query against large object type (more than 100000 rows). Consider an indexed filter or contact salesforce.com about custom indexing. Even if a field is indexed a filter might still not be selective when: 1. The filter value includes null (for instance binding with a list that contains null) 2. Data skew exists whereby the number of matching rows is very large (for instance, filtering for a particular foreign key value that occurs many times)"
On this line:
usuarios=[select id,PersonEmail,marca__c,
Marcas_de_las_que_quiere_recibir_ofertas__c,dni__c,
Segmento__c,Datos_Preferencias__c,
Datos_Test_Compatibilidad__c,Datos_Test_Big_Five__c,
Datos_CV_Obligatorio__c, datos_contratacion__c,
Fecha_orientacion_a_marca__c,Resultado_orientacion_a_marca__c,
Fecha_formacion_online__c,Resultado_formacion_online__c
from Account
where
Segmento__c in :setsegmentos
and Marca_relacional__c in: listaMarcas
and Baja__c=false and Ya_estoy_trabajando__c=false
and Back_list__c=false
and Inactivo__c=false limit 3000];
My SOQL query only fetches 3000 records. Can anyone help me, please?
I made some changes following advice from chri (thank you!), but I still get the same error.
I have a graph database with 5M nodes and 10M relationships.
I'm on a MacBook Pro with 4 GB of RAM. I have already tried adjusting the Java heap size and the Neo4j memory settings, without success.
My problem is that I have a simple Cypher query like this:
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
LIMIT 30;
This query takes 100 ms, which is impressive. But when I add an ORDER BY, the query takes much longer, about 8 s:
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
ORDER BY r.date DESC
LIMIT 30;
Does anyone have an idea?
You might want to consider relationship indexes to speed up your query. The date property could be indexed this way. You're using the ORDER BY keyword which will almost always make your query slower as it needs to iterate the entire result set to perform the ordering.
Also consider using a single MATCH statement if that suits your needs:
MATCH (pet:Pet {id:52163})-[:FOLLOWS]->(friend)-[r:POSTED]->(n)
I executed a query like "Address:Jack*". It shows numFound = 5214 and displays 100 documents on the results page (I changed the default number of displayed results from 10 to 100).
How can I get all the documents?
I remember using &rows=2147483647
2,147,483,647 is integer's maximum value. I recall using a number bigger than that once and having a NumberFormatException because it couldn't be parsed into an int. I don't know if they use Long nowadays, but 2 billion rows is normally more than enough.
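The NumberFormatException mentioned above is what plain Java integer parsing produces for any value past that limit. A small sketch follows; whether a given Solr version parses rows exactly this way is an assumption, and the class and method names are hypothetical.

```java
final class RowsParam {

    /** Returns true if the text can be parsed as a Java int, false otherwise. */
    static boolean fitsInInt(String rows) {
        try {
            Integer.parseInt(rows);
            return true;
        } catch (NumberFormatException e) {
            // Thrown for values outside the range of int, e.g. Integer.MAX_VALUE + 1.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsInInt("2147483647")); // true
        System.out.println(fitsInInt("2147483648")); // false
    }
}
```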
Small note:
Be careful if you are planning to do this in production. If you run a query like *:* and your index is big, you could be transferring a couple of gigabytes in that one query.
If you know you won't have many docs, go ahead and use integer's max value.
On the other hand, if you are doing a one-time script and just need to dump all results (for example document ID's) then this approach is valid, if you don't mind waiting 3-5 minutes for a query to return.
Don't use &rows=2147483647
Don't use Integer.MAX_VALUE (2147483647) as the value of rows in production. It will heavily slow down your query even if you have a small result set, because Solr preallocates a queue of this size. See https://issues.apache.org/jira/browse/SOLR-7580
I strongly suggest using Exporting Result Sets:
It’s possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
Alternatively, I suggest using Deep Paging.
Simple pagination is easy when you have few documents to read: all you have to do is play with the start and rows parameters. But this is not feasible when you have many documents, meaning hundreds of thousands or even millions.
This is the kind of thing that could bring your Solr server to its knees.
For typical applications displaying search results to a human user,
this tends to not be much of an issue since most users don’t care
about drilling down past the first handful of pages of search results
— but for automated systems that want to crunch data about all of the
documents matching a query, it can be seriously prohibitive.
This means that on a website with paged search results a real user will not go that far, but consider what can happen if a spider or a scraper tries to read all the pages.
Now we are talking about Deep Paging.
I suggest reading this excellent post:
https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
And take a look at this documentation page:
https://solr.apache.org/guide/pagination-of-results.html
Here is an example that shows how to paginate using cursors.
SolrQuery solrQuery = new SolrQuery();
solrQuery.setRows(500);
solrQuery.setQuery("*:*");
solrQuery.addSort("id", ORDER.asc); // Pay attention to this line
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    solrQuery.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrClient.query(solrQuery);
    String nextCursorMark = rsp.getNextCursorMark();
    for (SolrDocument d : rsp.getResults()) {
        ...
    }
    if (cursorMark.equals(nextCursorMark)) {
        done = true;
    }
    cursorMark = nextCursorMark;
}
Returning all the results is rarely a good option, as it will be very slow.
Can you describe your use case?
The Solr rows parameter lets you tune the number of results returned.
However, I don't think there is a way to make rows return all results. It doesn't accept -1 as a value.
So you would need to set a high value for all the results to be returned.
What you should do is first create a SolrQuery as shown below and set the number of documents you want to fetch in a batch.
int lastResult = 0; // last id seen in the previous batch; used to request the next batch
// Exclusive lower bound ({) so the last document of the previous batch is not returned again
String query = "id:{" + lastResult + " TO *]"; // id is used only for the sake of simplicity
SolrQuery solrQuery = new SolrQuery(query).setRows(500); // setRows sets the batch size; change it to whatever size you want
solrQuery.addSort("id", ORDER.asc); // sort on id so that "last id of the batch" is well defined
SolrDocumentList results = solrClient.query(solrQuery).getResults(); // execute the query
Here I am using a search by id as an example; you can replace it with any other field you want to search on.
"lastResult" is the variable you change after the first 500 records (500 is the batch size) have been processed: set it to the last id in the results.
This lets you run the next batch starting from the last result of the previous batch.
Hope this helps. Leave a comment below if you need any clarification.
To select all documents in dismax/edismax via the Solarium PHP client, the normal query syntax *:* does not work. Instead, set the default query value in the Solarium query to an empty string; this is required because the default query in Solarium is *:*. Then set the alternative query to *:*. The dismax/edismax normal query syntax does not support *:*, but the alternative query syntax does.
For more details, the following book can be consulted:
http://www.packtpub.com/apache-solr-php-integration/book
As the other answers pointed out, you can set rows to the maximum integer value to get back all the results for a query.
I would recommend, though, using Solr's pagination feature and building a function that returns all the results for you using the cursorMark API. The gist of it: you set the cursorMark parameter to '*', you set the page size (the rows parameter), and each response gives you a cursorMark for the next page, so you execute the same query again with the cursorMark taken from the last response. This way you have more flexibility over how many results you want back, and it performs much better.
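The loop just described can be sketched without any Solr dependency. Everything here is hypothetical: Page, fetchAll and fakeSolr are illustration names, and fakeSolr is an in-memory stand-in whose cursor is simply an offset, which is not how real Solr cursor marks are encoded. The only behavior it mirrors is the stopping rule: the returned cursor equals the one that was sent once there are no more results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

final class CursorDrain {

    /** One page of results plus the cursor to send with the next request. */
    static final class Page {
        final List<String> docs;
        final String nextCursor;

        Page(List<String> docs, String nextCursor) {
            this.docs = docs;
            this.nextCursor = nextCursor;
        }
    }

    /** Re-issues the same query with the latest cursor until the cursor stops changing. */
    static List<String> fetchAll(BiFunction<String, Integer, Page> fetchPage, int rows) {
        List<String> all = new ArrayList<>();
        String cursor = "*"; // start marker
        while (true) {
            Page page = fetchPage.apply(cursor, rows);
            all.addAll(page.docs);
            if (cursor.equals(page.nextCursor)) {
                break; // same cursor returned: nothing left to read
            }
            cursor = page.nextCursor;
        }
        return all;
    }

    /** In-memory stand-in for the server: the cursor is just an offset encoded as text. */
    static Page fakeSolr(List<String> index, String cursor, int rows) {
        int start = cursor.equals("*") ? 0 : Integer.parseInt(cursor);
        int end = Math.min(start + rows, index.size());
        String next = (start == end) ? cursor : String.valueOf(end);
        return new Page(index.subList(start, end), next);
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(fetchAll((c, r) -> fakeSolr(index, c, r), 2)); // [a, b, c, d, e]
    }
}
```

In a real client, fetchPage would wrap the SolrJ call, and the query would need a sort that includes the unique key field, as the cursor documentation requires.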
The way I dealt with the problem is by running the query twice:
// Start with your (usually small) default page size
solrQuery.setRows(50);
QueryResponse response = solrResponse(query);
long numFound = response.getResults().getNumFound();
if (numFound > 50) {
    // getNumFound() returns a long, while setRows() expects an int
    solrQuery.setRows((int) Math.min(numFound, Integer.MAX_VALUE));
    response = solrResponse(query);
}
It makes two calls to Solr, but gets you all matching records, at the cost of a small performance penalty.
query.setRows(Integer.MAX_VALUE);
works for me!!
Is it possible to run multiple spatial queries within the same Solr (3.1+) request?
We currently need to let users search for inventory near a location of their choice via a frontend search form. But we also want to add another spatial search behind the scenes so that it includes more inventory. The combined search would look like a Venn diagram of the two areas.
Edit 10.4.2011
Example construct: q=*:*&fq={!geofilt}&sfield=Location&(ClientId:"client1"&pt=40.68063802521456,-74.00390625&d=80.4672)%20OR%20_query_:(ClientId:"client2"&pt=36.1146460,-115.1728160&d=80.4672)
The above construct does not work, but hopefully demonstrates what I am trying to accomplish.
This is old, but it doesn't seem like it ever got a full answer. I had the same issue and found that this syntax works:
q=*:*&fq=(({!geofilt sfield=Location pt=40.68063802521456,-74.00390625 d=80.4672} AND ClientId:"client1") OR ({!geofilt sfield=Location pt=36.1146460,-115.1728160 d=80.4672} AND ClientId:"client2"))
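When this filter is sent over raw HTTP, the braces, quotes and spaces have to be URL-encoded. A minimal sketch of building and encoding the fq value from the answer above with the standard library only; the class and method names are hypothetical, and the filter syntax itself is taken from the answer rather than verified against a particular Solr version.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class GeoFilterQuery {

    /** One "(geofilt AND ClientId)" clause, in the syntax quoted in the answer above. */
    static String clause(String sfield, String pt, String d, String clientId) {
        return "({!geofilt sfield=" + sfield + " pt=" + pt + " d=" + d + "}"
                + " AND ClientId:\"" + clientId + "\")";
    }

    /** ORs the clauses together into a single fq value (not yet URL-encoded). */
    static String anyOf(String... clauses) {
        return "(" + String.join(" OR ", clauses) + ")";
    }

    /** URL-encodes a parameter value for use in a raw HTTP query string. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String fq = anyOf(
                clause("Location", "40.68063802521456,-74.00390625", "80.4672", "client1"),
                clause("Location", "36.1146460,-115.1728160", "80.4672", "client2"));
        System.out.println("q=" + encode("*:*") + "&fq=" + encode(fq));
    }
}
```

A client library such as SolrJ normally does the encoding itself, in which case only the unencoded fq string is needed.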
It looks like you want to run N queries in one request in order to get one result set per query.
If so, Field Collapsing (http://wiki.apache.org/solr/FieldCollapsing) is what you are looking for. Unfortunately, Field Collapsing is only available from Solr 3.3 onwards.
Depending on your needs, counts from different faceted searches might also be useful.
What if you moved your second location query into an additional filter query, like below:
q=*:*&fq={!geofilt}&sfield=Location&(ClientId:"client1"&pt=40.68063802521456,-74.00390625&d=80.4672)&fq={!geofilt}&sfield=Location&(ClientId:"client2"&pt=36.1146460,-115.1728160&d=80.4672)
Will that provide the results that you are looking for? It might end up being too limiting, but thought it was worth trying.
You might also try:
q=*:*&fq={!geofilt}&sfield=Location&((ClientId:"client1"&pt=40.68063802521456,-74.00390625&d=80.4672)%20OR%20(ClientId:"client2"&pt=36.1146460,-115.1728160&d=80.4672))