Cloudant full text search "too many boolean requests" - cloudant

Can anyone tell me what the limit is for boolean requests in a Cloudant full text search?
I have 206 ORs and that appears to be too many.
This is basically my query:
( this:"that" AND ping:pong AND (yes:a1 OR yes:b1 OR yes:c1 OR ...x50 ) ) OR
( this:"that" AND ping:pong AND (yes:a2 OR yes:b2 OR yes:c2 OR ...x50 ) ) OR
( this:"that" AND ping:pong AND (yes:a3 OR yes:b3 OR yes:c3 OR ...x50 ) ) OR
( this:"that" AND ping:pong AND (yes:a4 OR yes:b4 OR yes:c4 OR ...x50 ) )

Cloudant full text search is a front end to Apache Lucene, which has a default limit of 1024 boolean clauses:
https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/BooleanQuery.html#getMaxClauseCount--
Each of your OR groups contains dozens of sub-clauses, which is why you fall foul of this limit.
If you really find yourself writing queries that tickle that limit, it may be worth investigating if you can restructure your documents a bit.
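One query-side restructuring worth trying (a sketch based only on the shape shown above, and assuming every group really does share the same this:"that" AND ping:pong prefix): factor the common terms out so that all the yes: values collapse into a single OR list, which matches the same documents while dropping the nested boolean groups and the duplicated clauses:
this:"that" AND ping:pong AND (yes:a1 OR yes:b1 OR ... OR yes:a4 OR yes:b4 OR ...)
If you still hit the limit after that, splitting the query into several smaller requests and merging the results client-side is another option.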

Related

Max LOB size (16777216) exceeded for array_agg

I have a table with about 30k rows, each of them wrapped in {}.
In the end I would like to get it like this:
[
{Objekt1},
{Objekt2}
]
The solution below worked well as long as we didn't have that many rows, but now we hit this limit.
COPY INTO FROM (
    SELECT array_agg(*) FROM (
        SELECT OBJECT_CONSTRUCT( ......
            OBJECT_CONSTRUCT(.....) )
        FROM
            (SELECT * FROM (SELECT
                REPLACE(parse_json(OFFER):"spec":"im:offerID", '"')::varchar AS ID,
                ...,
                ... )))) )
FILE_FORMAT = (TYPE = JSON COMPRESSION = None)
credentials = (aws_key_id='' aws_secret_key='')
OVERWRITE = TRUE single = true
HEADER = FALSE
max_file_size = 267772160
We deliver this to an external agency, and that format is the only way they can read it.
Is there another solution, or a way around this problem?
Thanks
As you've discovered, there is a hard limit of 16 MB on array_agg (and in a lot of other places in Snowflake, e.g. it's the maximum size of a VARIANT column).
If it is acceptable to create multiple files, then you can probably achieve this in a stored procedure: find some combination of column values that guarantees the data in each partition will produce an array_agg result smaller than 16 MB, then loop through those partitions, running a COPY INTO for each one and outputting to a different file each time.
If you have to produce a single file, then I can't think of a way of achieving this in Snowflake (though someone else may be able to). If you can post-process the file once it is written to S3, then it would be straightforward to copy the data out as JSON and then edit the file to add the '[' and ']' around it.
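For the multiple-file route, a minimal sketch of the idea (not tested; the stage, table and column names, and the bucket count of 10 are placeholders you would size so that each bucket's array_agg stays under 16 MB):
COPY INTO @my_stage/offers_bucket_0.json FROM (
    SELECT array_agg(OBJ)
    FROM (
        SELECT OBJECT_CONSTRUCT(*) AS OBJ,
               MOD(ABS(HASH(ID)), 10) AS BUCKET  -- split the rows into 10 buckets
        FROM source_table
    )
    WHERE BUCKET = 0  -- repeat (or loop in a stored procedure) for buckets 1..9
)
FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE)
OVERWRITE = TRUE SINGLE = TRUE;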

Neo4J / Cypher Query very slow with order by property

I have a graph database with 5M nodes and 10M relationships.
I'm on a MacBook Pro with 4GB RAM. I have already tried to adjust the Java heap size and Neo4j memory settings without success.
My problem is that I have a simple Cypher query like this:
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
LIMIT 30;
This query takes 100 ms, which is impressive. But when I add an ORDER BY, the query takes a long time, around 8 s :/
MATCH (pet:Pet {id:52163})-[r:FOLLOWS]->(friend)
MATCH (friend)-[r:POSTED]->(n)
RETURN friend.id, TYPE(r),LABELS(n),n.id
ORDER BY r.date DESC
LIMIT 30;
Does someone have an idea?
You might want to consider relationship indexes to speed up your query; the date property could be indexed that way. You're using the ORDER BY keyword, which will almost always make your query slower, because the entire result set has to be iterated to perform the ordering.
Also consider using a single MATCH clause (with distinct relationship variables) if that suits your needs:
MATCH (pet:Pet {id:52163})-[f:FOLLOWS]->(friend)-[p:POSTED]->(n)
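As a rough sketch of the index suggestion: in recent Neo4j versions (4.3 and later) a relationship property index can be created as shown below; older releases would need legacy indexes instead, so treat the exact syntax as something to verify against your version:
CREATE INDEX posted_date_idx FOR ()-[r:POSTED]-() ON (r.date)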

How to penalize a document for a particular value in solr?

I am trying to implement a job search in Solr.
What I want is to boost the title and keyword fields,
and also to negatively boost those documents whose location is Anywhere.
For example:
I searched for "Perl" and location "Mumbai".
The result must contain all resumes with Perl in their title or keywords and location "Mumbai" or "Anywhere",
but resumes with the Anywhere location must come last.
I made the following query:
((((perl)) AND ( (perl) ttl:(perl)^5 kw:(perl)^2) )
AND (( pref:(Mumbai) (pref:Anywhere)^0.000000001)) )
But it is not giving the proper results.
Please suggest.
One way to fake a "negative boost" is to give a large boost to everything that does not match. You can do something like this with your query (not tested, so experiment with it):
((((perl)) AND ( (perl) ttl:(perl)^5 kw:(perl)^2) )
AND (( pref:(Mumbai) (*:* -pref:Anywhere)^999 )) )
Here is more about it: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_do_I_give_a_negative_.28or_very_low.29_boost_to_documents_that_match_a_query.3F
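If you move to the dismax/edismax parser, the FAQ above applies the same trick through request parameters instead of the main query string; a sketch (parameter values shown unencoded, and the field names are simply the ones from your query):
q=perl&defType=edismax&qf=ttl^5 kw^2&fq=pref:(Mumbai OR Anywhere)&bq=(*:* -pref:Anywhere)^999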

How to get all results from solr query?

I executed a query like "Address:Jack*". It shows numFound = 5214 and displays 100 documents on the results page (I changed the default number of displayed results from 10 to 100).
How can I get all the documents?
I remember myself doing &rows=2147483647
2,147,483,647 is the maximum value of an int. I recall using a number bigger than that once and getting a NumberFormatException because it couldn't be parsed into an int. I don't know if they use Long nowadays, but 2 billion rows is normally more than enough.
Small note:
Be careful if you are planning to do this in production. If you do a query like *:* and your index is big, you could be transferring a couple of gigabytes in that query.
If you know you won't have many docs, go ahead and use the integer max value.
On the other hand, if you are doing a one-time script and just need to dump all results (for example document IDs), then this approach is valid, if you don't mind waiting 3-5 minutes for the query to return.
Don't use &rows=2147483647
Don't use Integer.MAX_VALUE (2147483647) as the value of rows in production. This will heavily slow down your query even if you have a small result set, because Solr preallocates a queue of this size; see https://issues.apache.org/jira/browse/SOLR-7580
I strongly suggest using Exporting Result Sets instead:
It’s possible to export fully sorted result sets using a special rank query parser and response writer specifically designed to work together to handle scenarios that involve sorting and exporting millions of records.
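In practice that means querying the /export handler; a sketch (the collection name and field list are placeholders, and the fields you export and sort on must have docValues enabled):
http://localhost:8983/solr/mycollection/export?q=*:*&sort=id+asc&fl=id,name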
Or I suggest using deep paging.
Simple pagination is easy when you have few documents to read: all you have to do is play with the start and rows parameters. But this is not feasible when you have many documents, I mean hundreds of thousands or even millions.
This is the kind of thing that could bring your Solr server to its knees.
For typical applications displaying search results to a human user, this tends to not be much of an issue since most users don’t care about drilling down past the first handful of pages of search results — but for automated systems that want to crunch data about all of the documents matching a query, it can be seriously prohibitive.
This means that if you have a website and are paging search results, a real user does not go that far; but consider, on the other hand, what can happen if a spider or a scraper tries to read all of the site's pages.
Now we are talking about deep paging.
I suggest reading this amazing post:
https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
And take a look at this documentation page:
https://solr.apache.org/guide/pagination-of-results.html
And here is an example that tries to explain how to paginate using cursors.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

SolrQuery solrQuery = new SolrQuery();
solrQuery.setRows(500);
solrQuery.setQuery("*:*");
solrQuery.addSort("id", ORDER.asc); // Pay attention to this line: cursors require a sort that includes the uniqueKey field
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
boolean done = false;
while (!done) {
    solrQuery.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
    QueryResponse rsp = solrClient.query(solrQuery);
    String nextCursorMark = rsp.getNextCursorMark();
    for (SolrDocument d : rsp.getResults()) {
        ... // process each document d
    }
    // the cursor stops advancing once every result has been returned
    if (cursorMark.equals(nextCursorMark)) {
        done = true;
    }
    cursorMark = nextCursorMark;
}
Returning all the results is never a good option, as it would be very slow.
Can you mention your use case?
Also, the Solr rows parameter helps you tune the number of results to be returned.
However, I don't think there is a way to tune rows to return all results; it doesn't accept -1 as a value.
So you would need to set a high value for all the results to be returned.
What you should do first is create a SolrQuery as shown below and set the number of documents you want to fetch in a batch.
int lastResult = 0; // this is for processing the next batch
String query = "id:[" + lastResult + " TO *]"; // just considering id for the sake of simplicity
SolrQuery solrQuery = new SolrQuery(query).setRows(500); // setRows sets the batch size; change this to whatever size you want
SolrDocumentList results = solrClient.query(solrQuery).getResults(); // execute this statement
Here I am considering an example of searching by id; you can replace it with any parameter you want to search on.
lastResult is the variable you change after fetching the first 500 records (500 being the batch size), setting it to the last id obtained from the results.
This will help you execute the next batch starting with the last result of the previous batch.
Hope this helps. Shoot up a comment below if you need any clarification.
For selecting all documents in dismax/edismax via the Solarium PHP client, the normal query syntax *:* does not work. To select all documents, set the default query value in the Solarium query to an empty string (this is required because the default query in Solarium is *:*), and set the alternative query to *:*. The dismax/edismax normal query syntax does not support *:*, but the alternative query syntax does.
For more details, the following book can be referred to:
http://www.packtpub.com/apache-solr-php-integration/book
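A rough sketch of what that looks like with Solarium's select query and dismax component (not tested, and assuming an already configured $client instance):
$query = $client->createSelect();
$query->setQuery('');                // empty default query
$dismax = $query->getDisMax();       // switch the query to the dismax parser
$dismax->setQueryAlternative('*:*'); // q.alt=*:* matches all documents
$resultset = $client->select($query);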
As the other answers pointed out, you can configure rows to be the maximum integer value to get back all the results for a query.
I would recommend, though, using Solr's pagination feature and building a function that returns all the results using the cursorMark API. The gist of it is that you set the cursorMark parameter to '*', set the page size (the rows parameter), and with each response you get a cursorMark for the next page, so you execute the same query again with the cursorMark obtained from the last result. This gives you more flexibility over how many of the results you want back, in a much more performant way.
The way I dealt with the problem is by running the query twice:
// Start with your (usually small) default page size
solrQuery.setRows(50);
QueryResponse response = solrResponse(query);
if (response.getResults().getNumFound() > 50) {
    solrQuery.setRows((int) response.getResults().getNumFound()); // getNumFound() returns a long, so cast it for setRows
    response = solrResponse(query);
}
It makes two calls to Solr, but gets you all matching records... with a small performance penalty.
query.setRows(Integer.MAX_VALUE);
works for me!!

Designing a near real time streaming backend

I have the following requirements for designing a streaming backend:
Documents are getting added at 20 docs/sec. Each doc has a timestamp field.
Searches are primarily based on a timestamp range (e.g. show me documents that arrived in the last 20 minutes).
Search queries per second: 100 searches/sec.
Documents older than 2 days can be continuously deleted for optimization purposes (by a cron job).
I am thinking of using Solr (with SolrReplication/NRT). The problem with Solr is basically the frequent updates/deletes. For the freshest data I will need to commit on each update (otherwise the data won't be visible to searchers). Setting pollInterval to ~1 minute might kill both the master and the slave servers. NRT/SolrCloud could be one of the options, but I'm not very sure about their stability.
Any other approaches/suggestions based on SQL/NoSQL architectures?
mysql + memcached. Facebook runs their entire site on these two widely-available, widely supported, free and open source packages.
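To make the mysql side of that concrete, here is a minimal sketch of how the timestamp-driven part could look (the table and column names are made up, and memcached would sit in front of the hot range queries):
CREATE TABLE docs (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    created_at TIMESTAMP NOT NULL,
    KEY idx_created_at (created_at)  -- keeps range scans on the timestamp cheap
);
-- "documents that arrived in the last 20 minutes"
SELECT * FROM docs WHERE created_at >= NOW() - INTERVAL 20 MINUTE;
-- cron job: purge documents older than 2 days
DELETE FROM docs WHERE created_at < NOW() - INTERVAL 2 DAY;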