I am using Salesforce's parameterized search API - https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_search_parameterized.htm - to search my SF instance. However, it sometimes runs quite slowly, and I want to get just the counts to begin with. I see there's a record count API - https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_record_count.htm - but it doesn't accept search terms.
Is there a way to combine the two? Should I just use a SOSL query that returns only the counts? Any pointers on what that SOSL query would look like?
SOSL is not a good choice for counting records matching specific criteria. A SOSL result set maxes out at 2,000 records. Additionally, SOSL results have a latency of up to approximately 15 minutes while indexes are updated, and hence may not be fully up to date at any given time.
Instead, use the Query REST API resource to execute a SOQL query using the filters you're interested in on a single object at a time, using the COUNT() aggregate function in your SELECT clause.
Bear in mind that complex criteria and large data volumes, especially in combination, may cause even a COUNT() query to time out or execute slowly. The fix is situation-specific but likely to involve careful work tuning your query to use indexed fields and efficient comparisons.
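As a concrete sketch, a COUNT() query against the Query resource could be issued like this (Python; the instance URL, API version, object, and filter below are placeholders for illustration, not values from your org):

```python
from urllib.parse import quote

# Hypothetical instance URL and API version -- substitute your own.
INSTANCE = "https://yourInstance.salesforce.com"
API_VERSION = "v58.0"

# COUNT() returns no rows; the count comes back in the response's
# "totalSize" field.
soql = "SELECT COUNT() FROM Contact WHERE LastName LIKE 'Smith%'"
url = f"{INSTANCE}/services/data/{API_VERSION}/query/?q={quote(soql)}"

# The actual request would be something like:
#   requests.get(url, headers={"Authorization": f"Bearer {access_token}"})
# and the count read from response.json()["totalSize"].
print(url)
```

Run one such query per object you need a count for, since SOQL (unlike SOSL) targets a single object at a time.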
I am developing an advanced search engine in .NET where users can build their query from several fields:
Title
Content of the Document
Date From, Date To
From Modified Date, To Modified Date
Owner
Location
Other Metadata
I am using Lucene to index document content and the corresponding IDs. However, the other metadata resides in an MS SQL database (to avoid enlarging the index, and to avoid having to update the index on every modification of the metadata).
How can I perform the search?
When a user searches for a term:
Narrow down the search results according to the criteria selected by the user by looking them up in the SQL DB.
Pass the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned from the advanced-search web service.
Then get the relevant metadata for the document IDs (returned from Lucene) by looking in the DB again.
As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to be displayed in the grid.
Questions:
How can I overcome this situation? I thought of starting with the Lucene search, but that has a drawback once the indexed documents reach 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)
Another issue is passing IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs, and what is the alternative?
I welcome any idea, so please share your thoughts.
Your current solution incurs the following overhead at query-time:
1) Narrowing search space via MS-SQL
Generating query in your app
Sending it over the wire to MS-SQL
Parsing/Optimizing/Execution of SQL query
[!!] I/O overhead of returning 100,000s of IDs
2) Executing bounded full-text search via Lucene.NET
[!!] Lucene memory overhead of generating/executing large BooleanQuery containing 100,000s of ID clauses in app (you'll need to first override the default limit of 1024 clauses to even measure this effect)
Standard Lucene full text search execution
Returning matching IDs
3) Materializing result details via MS-SQL
Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually ~10-25 records)
There are two assumptions you may be making that would be worth reconsidering:
A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.
Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.
Also, the size of your index has mostly to do with the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings will be stored, and each Lucene document will internally reference its owner through a symbol-table lookup (a 4-byte integer * 2MM docs + 500 strings = less than 8 MB additional).
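The back-of-the-envelope arithmetic can be checked directly (assuming a 4-byte ordinal per document plus the unique strings themselves; the average string length is an assumption):

```python
# Rough size estimate for indexing a low-cardinality field (e.g. owner)
# across 2 million documents: each document stores a small integer ordinal
# pointing into a per-field dictionary of unique values.
docs = 2_000_000
unique_owners = 500
avg_owner_len_bytes = 20  # assumed average owner-name length

ordinal_bytes = 4 * docs                                 # per-doc ordinals
dictionary_bytes = unique_owners * avg_owner_len_bytes   # unique strings
total_mb = (ordinal_bytes + dictionary_bytes) / 1_000_000
print(f"~{total_mb:.1f} MB additional")
```

The dictionary of unique values is negligible; the per-document ordinals dominate, and even those stay in the single-digit-megabyte range at 2MM docs.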
B) MS-SQL queries will be the quickest way to filter on non-text metadata.
Reconsider this: With your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS-SQL. (In some cases, Lucene may even be faster.)
Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
So to summarize the best practice:
Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).
Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)
Return IDs
Materialize documents from MS-SQL using returned IDs.
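The best-practice flow above can be sketched as follows (pure-Python stand-ins for both Lucene and MS-SQL; `search_index` and `fetch_rows` are hypothetical helpers written for illustration, not real library calls):

```python
from datetime import date

# --- stand-in for the Lucene index: everything queryable is indexed ---
INDEX = [
    {"id": 1, "text": "quarterly report", "owner": "alice", "modified": date(2023, 5, 1)},
    {"id": 2, "text": "annual report",    "owner": "bob",   "modified": date(2022, 1, 10)},
]

def search_index(text, owner=None, modified_after=None):
    """Filtered full-text search done entirely inside the index;
    returns only document IDs."""
    hits = []
    for doc in INDEX:
        if text not in doc["text"]:
            continue
        if owner is not None and doc["owner"] != owner:
            continue
        if modified_after is not None and doc["modified"] <= modified_after:
            continue
        hits.append(doc["id"])
    return hits

# --- stand-in for MS-SQL: materialize full rows for the returned IDs only ---
DB = {1: ("quarterly report", "alice"), 2: ("annual report", "bob")}

def fetch_rows(ids):
    return [DB[i] for i in ids]

ids = search_index("report", owner="alice", modified_after=date(2023, 1, 1))
rows = fetch_rows(ids)
```

The key point is the direction of data flow: filters are pushed into the index, so only a handful of final IDs cross the boundary back to the database, instead of hundreds of thousands of candidate IDs crossing it the other way.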
I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:
You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
You'll have tools in place to scale/tune/backup/restore search without impacting your application.
We have Solr storing 3 billion records across 23 machines; each machine has 4 shards, and only 230 million documents have the aliasName field populated. Currently the queryCache, documentCache, and filterCache are disabled.
Problem: A query like q=aliasName:[* TO *] AND firstname:ash AND lastName:Coburn returns the matching documents in 4.3 seconds. Basically, we want only those firstname/lastName matches where aliasName is not empty.
I am thinking of moving this to a filter query, fq=aliasName:[* TO *], but I'm not sure it will make things faster, as firstname and lastName differ in almost every query. How much memory should we allocate for the filter cache to perform well? It should not impact existing queries like q=firstname:ash AND lastName:something.
Please don't worry about I/O operations, as we are using flash drives.
I'd really appreciate a reply if you have worked on a similar issue and can suggest the best solution.
According to the Solr documentation:
filterCache
This cache stores unordered sets of document IDs that match the key (usually queries)
URL: https://wiki.apache.org/solr/SolrCaching#filterCache
So I think it comes down to two things:
What percentage of your documents have aliasName populated? If most documents have this field populated, the filter cache might be useless; but if it is only a small percentage of documents, the filter cache will have a huge performance impact while using less memory.
What kind of ID are you using? I assume the documentation refers to internal Lucene document IDs, not Solr IDs, but maybe smaller Solr IDs could result in a smaller cache size as well (I am not sure).
In the end you will have to run a trial and see how it goes; maybe try it on a couple of nodes first and check for a performance improvement.
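As a concrete sketch, the reworked request would move the stable aliasName restriction into fq so that its document set can be cached independently of the ever-changing name terms (field names taken from the question; the URL path is illustrative):

```python
from urllib.parse import urlencode

params = {
    # Changes on every request -- not a good filterCache candidate:
    "q": "firstname:ash AND lastName:Coburn",
    # Stable across requests -- its doc set is computed once and cached:
    "fq": "aliasName:[* TO *]",
    "rows": 10,
}
query_string = urlencode(params)
print(f"/solr/collection/select?{query_string}")
```

The filterCache would then hold a single entry (the bitset of the ~230M docs with aliasName) that every subsequent query reuses, rather than re-evaluating the range clause inside q each time.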
I have an index with something like 60-100 Million documents. We almost always query these documents (in addition to other filter queries and field queries, etc) on a foreign key id, to scope the query to a specific parent object.
So, for example: /solr/q=*:*&fq=parent_id_s:42
Yes, that _s means this is currently a solr.StrField field type.
My question is: should I change this to a TrieIntField? Would that speed up performance? And if so, what would be the ideal precisionStep and positionIncrementGap values, given that I know that I will always be querying for a single specific value, and that the cardinality of that parent_id is in the 10,000-100,000 (maximum) order of magnitude?
Edit for additional detail (from a comment on an answer below):
The way our system is used, it turns out that we end up using that same fq for many queries in a row. And when the cache is populated, the system runs blazing fast. When the cache gets dumped because of a commit, this query (even a test case with ONLY this fq) can take up to 20 seconds. So I'm trying to figure out how to speed up that initial query that populates the cache.
Second Edit:
I apologize, after further testing it turns out that the above poor performance only happens when there are also facet fields being returned (e.g. stuff like &facet=true&facet.field=resolved_facet_facet). With a dozen or so of these fields, that's when the query takes up to 20-30 seconds sometimes, but only with a fresh searcher. It's instant when the cache is populated. So maybe my problem is the facet fields, not the parent_id field.
TrieIntField with a precisionStep is optimized for range queries. As you're only searching for a specific value, your existing field type is optimal.
Have you looked at autowarming queries? These run whenever a new IndexSearcher is being created (on startup, or on an index commit, for example), so that it becomes available with some cache already in place. Depending on your requirements, you can also set the useColdSearcher flag to true, so that the new searcher is only made available once the cache has been warmed. For more details have a look here: https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig#QuerySettingsinSolrConfig-Query-RelatedListeners
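For reference, a newSearcher warming listener in solrconfig.xml could look something like this (the fq and facet field are adapted from the question's examples; treat it as a sketch, not drop-in config):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- Warm the filter cache and facet structures for the most common
         request shape before the new searcher starts serving traffic -->
    <lst>
      <str name="q">*:*</str>
      <str name="fq">parent_id_s:42</str>
      <str name="facet">true</str>
      <str name="facet.field">resolved_facet_facet</str>
    </lst>
  </arr>
</listener>
```

With this in place, the 20-30 second cold-cache cost is paid in the background during warming rather than by the first user query after a commit.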
It sounds like you probably aren't getting much benefit from the caching of result sets from the filter. One of the more important features of filters is that they cache their result sets. This makes the first run of a certain filter take longer while a cache is built, but subsequent uses of the same filter are much faster.
With the cardinality you've described, you are probably just wasting cycles, and polluting the filter cache, by building caches without them ever being of use. You can turn off caching of a filter query like:
/solr/q=*:*&fq={!cache=false}parent_id_s:42
I also think filter query doesn't help in this case.
q=parent_id_s:42 queries the index by the term parent_id_s:42 and gets back a set of document IDs. Since the postings (document IDs) are indexed by term, and assuming you have enough memory to hold them (in either the JVM heap or the OS cache), this lookup should be pretty fast.
Assuming filter cache is already warmed up and you have 100% hit ratio, which one of the following is faster?
q=parent_id_s:42
fq=parent_id_s:42
I think they are very close, but I could be wrong. Does anyone know, or has anyone run a performance test for this?
I have a pretty basic positional inverted index, in which I store a lot of words (search terms) and I use this to implement an efficient general purpose search.
My problem is that the query plan compilation is actually taking notably longer than the execution itself, I wondered if there's something that can be done about that.
I'm using dynamic T-SQL (building up the query from strings)
I'm using a lot of CTEs
There's a bunch of filter check boxes whose population depends on the initial search result (take the search result and get me the count of some property of some entity), e.g. for each person found by the search text, give me the distinct number of organizations involved and their respective frequency (count). These need to be reevaluated a lot.
I've parameterized the queries (giving the parameters default sizes rather than constants, which should be fine, eh?) and qualified all tables, and I rely on views where possible.
The query structurally changes every time I apply a new filter or change the number of search terms, which necessitates recompilation and takes time; other than that, the query plan works really well.
The thing is, these CTEs and filter-box results are virtually identical even when they are not structurally equivalent, so I'm wondering if anything can be done to improve the compilation time.
If you want to see the T-SQL I can provide samples; it's just that it's big, roughly 100 lines of T-SQL per search. I thought I'd ask first before we go down that road - maybe the solution is a lot simpler than I believe it to be?
Have you considered applying the OPTIMIZE FOR query hint?
If you can split the large query into smaller parameterised stored procedures and combine their results, they are more likely to be cached.
There is also the option of optimizing for ad hoc workloads in SQL Server 2008 (although this might be a last resort):
sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'optimize for ad hoc workloads', 1
RECONFIGURE
GO
Say I have a query that returns 10,000 records. When the first record has returned what can I assume about the state of my query?
Has it finished and is just returning records from the server to my instance of SSMS?
Is the query itself still being executed on the server?
What is it that causes the 10,000 records to be slowly returned for one query and nearly instantly for another?
There is potentially some mix of progressive processing on the server side, network transfer of the data, and rendering by the client.
If one query returns 10,000 rows quickly, and another one slowly -- and they are of similar row size, data types, etc., and are both destined for results to grid or results to text -- there is little we can do to analyze the differences unless you show us execution plans and/or client statistics for each one. These are options you can set in SSMS when running a query.
As an aside, if you switch between results to grid and results to text, you might notice slightly different runtimes. This is because in one case Management Studio has to work harder to align the columns etc.
You cannot make a generic assumption: a query plan is composed of a number of different types of operations, or iterators. Some of these are navigational and work like a pipeline, whilst others are set-based operations, such as a sort.
If a query contains a set-based operation, it requires all the records before it can output any results (e.g. an ORDER BY clause in your statement). But if you have no set-based iterators, you can expect the rows to be streamed to you as they become available.
The answer to each of your individual questions is "it depends."
For example, consider if you include an order by clause, and there isn't an index for the column(s) you're ordering by. In this case, the server has to find all the records that satisfy your query, then sort them, before it can return the first record. This causes a long pause before you get your first record, but you (should normally) get them quite quickly once you start getting any.
Without the order by clause, the server will normally send each record as it's found, so the first record will often show up sooner, but you may see a long pause between one record and the next.
As far as simply "why is one query faster than another", a lot depends on what indexes are available, and whether they can be used for a particular query. For example, something like some_column LIKE '%something' will almost always be quite slow. The leading '%' means this won't be able to use an index, even if some_column has one. A search for 'something%' instead of '%something' might easily be 100 or 1000 times faster. If you really need the former, you really want to use full-text searching instead (create a full-text index, and use CONTAINS() instead of LIKE).
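The leading-wildcard effect is easy to demonstrate. Here is a sketch using SQLite rather than SQL Server (the index-usage principle is the same), comparing the query plans for a prefix LIKE and a leading-wildcard LIKE:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Lets SQLite use a default (BINARY-collated) index for LIKE 'x%':
con.execute("PRAGMA case_sensitive_like = ON")
con.execute("CREATE TABLE docs (title TEXT)")
con.execute("CREATE INDEX idx_title ON docs (title)")

def plan(sql):
    # Concatenate the 'detail' column of EXPLAIN QUERY PLAN output.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Prefix pattern: rewritten to a range constraint, so the index is searched.
prefix_plan = plan("SELECT * FROM docs WHERE title LIKE 'abc%'")
# Leading wildcard: every row must be examined.
leading_plan = plan("SELECT * FROM docs WHERE title LIKE '%abc'")
print(prefix_plan)
print(leading_plan)
```

The prefix form shows an index SEARCH while the leading-wildcard form falls back to a SCAN, which is exactly the 100-1000x gap described above once the table gets large.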
Of course, a lot can also depend simply on whether the database has an index for a particular column (or group of columns). With a suitable index, the query will usually be quite a lot faster.