Will a searchable field give me better performance in Azure Cognitive Search?

We use Azure Search, and our index has some collection fields (each holding up to 40 or 50 values), for example:
CacheId:["1","2","1a"].
We may then run a query such as: for items that belong to CacheId 1 or 2, retrieve the facet for the field "Category".
The index has around 500k documents, and we sometimes see slowdowns or throttling when it is busy.
I am wondering if we can change this CacheId field from a Collection to a space-separated string (e.g. "1 2 1a"), and then use the standard analyser for the field.
After that, I can run query such as:
search=CacheId:2b 1&searchMode=any
This will return all the documents that have CacheId 2b or 1, and then I add the facet to the query.
However, I couldn't find any documentation on whether this approach would be any quicker than the current Collection field.
Does anyone have more knowledge on this? Will it make things better, worse or no difference at all?

Azure Search has some documentation on how to analyze, monitor, and improve query performance. You could use those resources to try and optimize your current queries first.
If no optimizations can be made, your best bet will be to test the performance of both setups using your production queries. I'm doubtful that moving from a collection to a string will improve performance, especially if following the best practices mentioned in the linked docs, but you can gather data through testing to be sure.
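One more variant that may be worth including in such a test is a filter on the existing collection field instead of a full-text clause. The following is only a sketch: it assumes CacheId is marked filterable and Category facetable in the index (the question does not confirm either), the helper name is hypothetical, and it builds the query parameters without sending a request.

```python
from urllib.parse import urlencode


def build_cache_facet_query(cache_ids, facet_field="Category"):
    """Build query parameters that filter on a collection field and request a facet.

    Assumes CacheId is a filterable Collection(Edm.String) field and that the
    IDs contain no commas or spaces (search.in splits on both by default).
    """
    id_list = ",".join(cache_ids)
    return {
        "search": "*",  # match everything; the filter does the narrowing
        "$filter": f"CacheId/any(c: search.in(c, '{id_list}'))",
        "facet": facet_field,
        "$top": "0",  # only the facet counts are needed, not the documents
    }


params = build_cache_facet_query(["1", "2"])
print(urlencode(params))
```

Filters are not scored, so this form avoids relevance ranking work that the facet-only use case does not need. Whether it is actually faster on a given index still has to be measured.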

Related

sql | slow queries | avoid many joins

I am currently working with Java Spring and Postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow, due to the number of joins that must be performed, also because there are many elements in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store a list of items flat?
I know there is a JSON datatype. Could I use it to hold the information needed for the search and query it with jsonpath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like graph based. At this point the only problem I have is with this specific table, the rest of the problem is simple queries that adapt very well to relational bases.
Is there any rule of thumb, based on ratios and number of items, for choosing which kind of database to use?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
There are third-party tools that are specifically designed for this use case, including search tools (like Elasticsearch, Solr, etc.) and other document-centric databases. Graph databases are probably not useful in this context: they are focused on traversing relationships, not broad searches.
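The trigger-maintained search table described in the question can be sketched as follows. This is an illustration only: it uses Python's built-in SQLite as a stand-in for Postgres (trigger syntax differs between the two), and the table and column names (item, item_tag, info_search) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_tag (item_id INTEGER, tag TEXT);
    -- Denormalized search table: one row per item, tags flattened into one column.
    CREATE TABLE info_search (item_id INTEGER PRIMARY KEY, name TEXT, tags TEXT);

    -- Rebuild the flat row for an item whenever a tag is attached to it.
    CREATE TRIGGER item_tag_ai AFTER INSERT ON item_tag
    BEGIN
        INSERT OR REPLACE INTO info_search (item_id, name, tags)
        SELECT i.id, i.name, group_concat(t.tag, ' ')
        FROM item i JOIN item_tag t ON t.item_id = i.id
        WHERE i.id = NEW.item_id
        GROUP BY i.id, i.name;
    END;
""")

conn.execute("INSERT INTO item VALUES (1, 'camera')")
conn.executemany("INSERT INTO item_tag VALUES (?, ?)",
                 [(1, "digital"), (1, "compact")])

# The search now reads a single table and needs no join.
rows = conn.execute(
    "SELECT name, tags FROM info_search WHERE tags LIKE '%digital%'"
).fetchall()
print(rows)
```

The cost shows up on the write side: every tag insert rewrites the flat row, which is the time-versus-space tradeoff mentioned above.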

Optimize SOLR for retrieving all search results

Sometimes I don't need just the top X results from a SOLR query, but all results (running into millions). This is easily achievable by searching once with 0 rows as a request parameter, and then re-executing the search with the numFound from the result as the number of rows(*).
Of course we can sort the results by e.g. "id asc" to remove relevancy ranking; however, I would like to be able to disable the entire scoring calculation for these queries, as it is probably quite computationally intensive and we just don't need it in these cases.
My question:
Is there a way to make SOLR work in boolean mode and effectively run faster on these often slow queries, when all we need is just all results?
(*) I actually usually simply do a paged query where a script walks through the pages (multi threaded), to prevent timeouts on large result sets, yet keep it fast as possible, but this is not important for the question.
This looks like a related question, but apparently the user asked the wrong question and was only after retrieving all results: Solr remove ranking or modify ranking feature; This question is not answered there.
Use filters instead of queries; there is no score calculation for filters.
There are a couple of things to be aware of:
Solr deep paging allows you to export a large number of results much more quickly
Using an export format such as CSV could be faster than using an XML format, just due to the formatting and it being more compact
And, as already mentioned, if you are exporting everything, put your queries into a FilterQuery with caching off
For very complex queries, if you can split them into several steps, you can assign different weights (costs) to the filters and have them execute in sequence. This lets you use a cheap first filter that gets rid of most of the results, and only then apply the more expensive, more precise filters
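The points above can be combined roughly as follows. This is a sketch under assumptions: cursorMark deep paging requires Solr 4.7 or later, the uniqueKey field is assumed to be called id, and fetch_page is a hypothetical callable that sends the parameters to the /select handler and returns the parsed JSON response.

```python
def build_filter(query, cost=None, cache=False):
    """Wrap a filter in Solr local params, e.g. {!cache=false cost=10}type:digital."""
    local = [f"cache={'true' if cache else 'false'}"]
    if cost is not None:
        local.append(f"cost={cost}")  # lower-cost uncached filters are applied first
    return "{!" + " ".join(local) + "}" + query


def export_all(fetch_page, filters, page_size=1000):
    """Yield every matching document using cursorMark deep paging."""
    params = {
        "q": "*:*",          # match all; the filters do the narrowing, unscored
        "fq": filters,
        "sort": "id asc",    # cursorMark requires a sort that includes the uniqueKey
        "rows": page_size,
        "cursorMark": "*",   # "*" starts a new cursor
    }
    while True:
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_mark = response["nextCursorMark"]
        if next_mark == params["cursorMark"]:  # unchanged cursor: nothing left
            break
        params["cursorMark"] = next_mark


print(build_filter("type:digital", cost=10))
```

A caller would pass something like `[build_filter("type:digital", cost=10), build_filter("price:[100 TO *]", cost=50)]` as the filters, so the cheaper filter is evaluated before the more expensive one.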

Apply Solr filter query to only part of the search results

I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a filter query of, say, "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, then generate a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data by marking a specific field during indexing, into two different groups and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating a kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a variable higher boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries. The boost can be changed according to the type value.
The other approach is not to display the results for types other than digital. However, you can display the facets for the other types, with their counts, so users know whether the other types exist for the search term. You can look into tagging and excluding filters.
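For the second approach, tagging and excluding filters looks roughly like this. It is a minimal sketch that only builds the request parameters; the tag name typeFilter is an arbitrary, hypothetical label.

```python
from urllib.parse import urlencode

# Tag the filter, then exclude that tag when computing the facet, so the
# facet counts for "type" cover all types while the results stay filtered.
params = [
    ("q", "camera"),
    ("fq", "{!tag=typeFilter}type:digital"),
    ("facet", "true"),
    ("facet.field", "{!ex=typeFilter}type"),
]
query_string = urlencode(params)
print(query_string)
```

The returned documents are restricted to type:digital, while the facet on type is counted as if that filter were absent.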
Result grouping might give you what you want. Just group by that parameter and specify sufficient top number of documents in each group.
But I would test whether its performance is any better than two queries, since its documentation lists performance among its limitations.
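One way to get both result sets from a single grouped request is the group.query parameter, which returns one group per supplied query. This is a sketch that only builds parameters; the helper name is hypothetical, and whether it outperforms two separate requests is not established here.

```python
from urllib.parse import urlencode


def split_group_params(query, split_filter, docs_per_group=10):
    """Parameters for one grouped request: documents matching the filter, and the rest."""
    return [
        ("q", query),
        ("group", "true"),
        ("group.query", split_filter),               # first group: matches the filter
        ("group.query", f"*:* -({split_filter})"),   # second group: everything else
        ("group.limit", str(docs_per_group)),        # top N documents inside each group
    ]


params = split_group_params("camera", "type:digital")
print(urlencode(params))
```

Each group comes back with its own document list, so merging or displaying them separately is left to the client.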

Lucene - few or a lot of indexes

Is it better to use
a lot of indexes (eg. for every user as your application allows that)
in Lucene
or just one, having every document in it
... if you think about:
performance
disk space
health
I am using elasticsearch, therefore I am using Lucene.
In Elasticsearch, based on your information, I think I would use one index. My understanding is that users are only searching their own documents, and the documents seem to be relatively similar.
Performance - When searching you can use a Filtered Query to filter to only the documents matching the user. The user id filter is cache-able, and fast.
Scalable - In Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes; I just think configuring appropriate shards and replicas for the entire index could be valuable.
In a single index, you can still easily wipe away data (see delete by query), and there should be little concern about seeing other users' data unless you write your queries wrong. A filtered query that restricts results to only those associated with a user id is very simple, similar in complexity to searching a different index per user.
Your exact needs might fit a different approach better. Based on what I have so far, though, I would choose one index.
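A user-scoped search body for a single shared index might look like the sketch below. The field names user_id and content are hypothetical, and the bool/filter form shown is what current Elasticsearch versions use in place of the older "filtered" query mentioned above.

```python
import json


def user_scoped_query(user_id, text):
    """Search body restricted to one user's documents in a shared index."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"content": text}},      # scored full-text part
                "filter": {"term": {"user_id": user_id}},  # unscored restriction
            }
        }
    }


body = user_scoped_query("u42", "lucene indexes")
print(json.dumps(body, indent=2))
```

Building every query through one helper like this also reduces the risk of forgetting the user restriction in an individual query.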

Multi Criteria Search Algorithm

Here's the problem: I've got a huge (well, at my level) MySQL database with technical products in it. I've got something like 150k rows of products in my database, plus 10 to 20 other tables with the same number of rows. Each table contains a lot of criteria. Some of the criteria are text values, some are decimal, some are just boolean. I would like to provide web access (PHP) to this database with filters on each criterion, but I don't really know how to do that fast. I started by creating a big table with all columns merged to avoid multiple joins; it's faster than the big join but still very, very slow. Putting an index on every criterion doesn't improve things (and I heard it was a bad idea). I was wondering if there are some algorithms that could help me preprocess the multi-criteria search. Any ideas?
Thanks in advance.
If you're frustrated trying to do this in SQL, you could take a look at Lucene. It lets you do range searches, full text, etc.
Try Full Text Search
You might want to try globbing your text fields together and doing full text search.
Optimize Your Queries
For the other columns, rank them in order of how frequently you expect them to be used.
Write a test suite of queries, and run them all to get a sense of the performance. Then start adding indexes, and see how it affects performance. Keep adding indexes while the performance gets better. Stop when it gets worse.
Use Explain Plan
Since you didn't provide your SQL or table layout, I can't be more specific. But use the Explain Plan command (EXPLAIN in MySQL) to make sure your queries are hitting indexes rather than doing table scans. This can be tricky, since subtle things like the order of the columns in the query can affect whether or not an index is used.
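The check described above can be illustrated with a small before-and-after comparison. This sketch uses Python's built-in SQLite and its EXPLAIN QUERY PLAN statement as a stand-in for MySQL's EXPLAIN (the output format differs), and the products table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price REAL)"
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the readable detail


query = "SELECT * FROM products WHERE category = 'sensor'"

before = plan(query)  # no index yet, so the planner has to scan the table
conn.execute("CREATE INDEX idx_products_category ON products (category)")
after = plan(query)   # the planner can now use the new index

print(before)
print(after)
```

Running each query from the test suite through a check like this after every index change shows whether the new index is actually being picked up.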
