Lucene - few or many indexes - Solr

Is it better to use
a lot of indexes (e.g. one for every user, as your application allows that)
in Lucene,
or just one, with every document in it
... if you think about:
performance
disk space
health
I am using Elasticsearch, therefore I am using Lucene.

In Elasticsearch, based on your information, I think I would use 1 index. My understanding is that users are only searching their own documents, and the documents seem to be relatively similar.
Performance - When searching, you can use a Filtered Query to filter to only the documents matching the user. The user id filter is cacheable, and fast.
Scalable - In Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes, but I think configuring appropriate shard and replica counts for a single index is the more valuable exercise.
In a single index, you can still easily wipe away data (see delete by query), and there should be little concern about seeing others' data unless you write your queries wrong. A filtered query that restricts results to only those associated with a user id is very simple - similar in complexity to searching a different index per user.
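A minimal sketch of that per-user filter, using the Python client and the bool/filter form that newer Elasticsearch versions use in place of the old filtered query. The index name, field names, and user id are illustrative assumptions, not taken from the question.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="documents",  # hypothetical single shared index
    body={
        "query": {
            "bool": {
                "must": {"match": {"body": "quarterly report"}},
                # The term filter is cacheable and does not affect scoring.
                "filter": {"term": {"user_id": 42}},
            }
        }
    },
)
print(resp["hits"]["total"])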
Your exact needs might fit a different approach better, but based on what I have so far, I would choose one index.

SQL | slow queries | avoid many joins

I am currently working with java spring and postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow due to the number of joins that must be performed, and also because there are many rows in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store an item list in a flattened form?
I know there is a JSON datatype; could I use this to hold the information needed for the search and query it with JSONPath? Is this performant with lists?
I would also greatly appreciate any advice on another approach that could be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, like a graph-based one. At this point the only problem I have is with this specific table; the rest of the workload is simple queries that map very well to a relational database.
Is there any guideline, based on ratios and numbers of items, for choosing which kind of database to use?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs. space tradeoff to reduce query time, at the cost of duplicating the data and increasing write/insert time.
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
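As a concrete illustration of the denormalized search table the question describes (the infoSearch idea), here is a minimal sketch using psycopg2 and a jsonb column with a GIN index. The table, column, and attribute names are assumptions, and keeping the table in sync (via triggers or application code) is left out.

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # adjust the DSN for your setup
with conn, conn.cursor() as cur:
    # Flattened copy of the searchable attributes, one row per item.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS info_search (
            item_id    bigint PRIMARY KEY,
            search_doc jsonb NOT NULL
        )
    """)
    # A GIN index lets Postgres answer containment queries (@>) without joins.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS info_search_doc_idx
        ON info_search USING gin (search_doc)
    """)
    # Example filter: items whose flattened document contains these attributes.
    cur.execute(
        "SELECT item_id FROM info_search WHERE search_doc @> %s::jsonb",
        ['{"color": "red", "material": "steel"}'],
    )
    print(cur.fetchall())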

Will a searchable field give me better performance in Azure Cognitive Search?

We use Azure Search and there are some collection fields (size up to 40 or 50), for example:
CacheId: ["1", "2", "1a"].
Then we may have a query like: for items belonging to CacheId 1 or 2, retrieve the facet for the field "Category".
The index has around 500k documents and sometimes we do see slowdowns or throttling when it is busy.
I am wondering if we can change this CacheId field from a Collection to a space-separated string (e.g. "1 2 1a"), and then use the standard analyser for the field.
After that, I can run a query such as:
search=CacheId:2b 1&searchMode=any
This will give all the documents that have CacheId 2b or 1, and then I add the facet in the query.
However, I couldn't find any documentation on whether this would be any quicker compared to the current Collection field.
Does anyone have more knowledge on this? Will it make things better, worse or no difference at all?
Azure Search has some documentation on how to analyze, monitor, and improve query performance. You could use those resources to try and optimize your current queries first.
If no optimizations can be made, your best bet will be to test the performance of both setups using your production queries. I'm doubtful that moving from a collection to a string will improve performance, especially if following the best practices mentioned in the linked docs, but you can gather data through testing to be sure.
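If it helps with that testing, here is a rough sketch of what the two variants could look like side by side with the Python SDK (azure-search-documents). The endpoint, index name, key, and the assumption that CacheId has been reindexed as a searchable string in the second variant are all illustrative, not taken from the question.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="products",  # hypothetical index name
    credential=AzureKeyCredential("<query-key>"),
)

# Variant 1: CacheId as Collection(Edm.String), filtered with an OData lambda.
collection_results = client.search(
    search_text="*",
    filter="CacheId/any(c: search.in(c, '1, 2'))",
    facets=["Category"],
)

# Variant 2: CacheId flattened to a space-separated, searchable string field;
# full Lucene syntax scopes both terms to the field, searchMode=any ORs them.
string_results = client.search(
    search_text="CacheId:(1 2)",
    query_type="full",
    search_mode="any",
    facets=["Category"],
)

print(collection_results.get_facets())
print(string_results.get_facets())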

Partitioning a current Solr index into shards

I've been analyzing the best method to improve the performance of our Solr index and will likely shard the current index so that searches can be distributed.
However, given that our index is over 400GB and contains about 700MM documents, reindexing the data seems burdensome. I've been toying with the idea of duplicating the indexes and deleting documents as a means to more efficiently create the sharded environment.
Unfortunately, it seems that a modulus isn't available to query against the document's internal numeric ID. What other partitioning strategies could I use to delete by query rather than doing a full reindex?
A Lucene tool, IndexSplitter, would do the job; see it mentioned here with a link to an article (Japanese, translate it with Google...).
If you can find a logical key to partition the data, it will be helpful in more than one way. For example, can you have these documents split across shards based on some chronological order?
We have a similar situation. We have an index of 250M docs that are split across various shards based on their created date. A major use case involves searching across these shards based on a range of created dates, so the search is only submitted to the shards that contain the docs within the given date range. There may be other benefits to logically partitioned data - for example, different capacity planning, or applying different qualities of service to search terms, etc.
I answered this in another StackOverflow question. There's a command-line utility I wrote (Hash-Based Index Splitter) to split a Lucene index based on each document's ID hash.
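For illustration, here is a rough Python sketch of the hash-partitioning idea behind such splitter tools, assuming the "duplicate then delete" approach from the question: copy the index once per shard, then on each copy delete the documents that hash elsewhere. Since delete-by-query cannot express hash(id) modulo N, the assignment is computed client-side and the deletes are sent by id through Solr's JSON update handler. The core URL, unique "id" field, and batch size are assumptions.

import hashlib
import requests

NUM_SHARDS = 4

def shard_for(doc_id: str) -> int:
    """Stable shard assignment derived from a hash of the unique document id."""
    return int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

def purge_foreign_docs(shard_index: int, shard_url: str, all_doc_ids):
    """On shard copy `shard_index`, delete every document that hashes elsewhere."""
    to_delete = [d for d in all_doc_ids if shard_for(d) != shard_index]
    for start in range(0, len(to_delete), 1000):
        requests.post(f"{shard_url}/update",
                      json={"delete": to_delete[start:start + 1000]})
    requests.get(f"{shard_url}/update", params={"commit": "true"})

# e.g. purge_foreign_docs(0, "http://localhost:8983/solr/core_shard0", ids)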

Indexes for Google App Engine data models

We have many years of weather data that we need to build a reporting app on. Weather data has many fields of different types, e.g. city, state, country, zipcode, latitude, longitude, temperature (hi/lo), temperature (avg), precipitation, wind speed, date, etc.
Our reports require that we choose combinations of these fields then sort, search and filter on them e.g.
WeatherData.all().filter('avg_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
or
WeatherData.all().filter('lo_temp =',20).filter('city','palo alto').filter('hi_temp',30).order('date').fetch(100)
It may be easy to see that these queries require different indexes. It may also be obvious that the 200-index limit can be crossed very easily with any such data model where a combination of fields will be used to filter, sort and search entities. Finally, the number of entities in such a data model can obviously run into the millions, considering that there are many cities and we could collect hourly data instead of daily.
Can anyone recommend a way to model this data which allows for all the queries to still be run, at the same time staying well under the 200 index limit? The write-cost in this model is not as big a deal but we need super fast reads.
Your best option is to rely on the built-in support for merge join queries, which can satisfy these queries without an index per combination. All you need to do is define one index per field you want to filter on and sort order (if that's always date, then you're down to one index per field). See this part of the docs for details.
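A small sketch of that approach using the old google.appengine.ext.db API already shown in the question. The property list on the model and the exact index.yaml entries are assumptions; the point is that each query keeps only equality filters plus a single sort on date, so the datastore can merge the per-property indexes instead of needing one composite index per filter combination.

from google.appengine.ext import db

class WeatherData(db.Model):
    city = db.StringProperty()
    avg_temp = db.IntegerProperty()
    hi_temp = db.IntegerProperty()
    date = db.DateTimeProperty()

# In index.yaml, keep one entry per filtered property paired with the sort
# field, e.g.:
#   - kind: WeatherData
#     properties:
#     - name: city
#     - name: date
#   - kind: WeatherData
#     properties:
#     - name: avg_temp
#     - name: date
results = (WeatherData.all()
           .filter('avg_temp =', 20)
           .filter('city =', 'palo alto')
           .filter('hi_temp =', 30)
           .order('date')
           .fetch(100))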
I know it seems counter-intuitive but you can use a full-text search system that supports categories (properties/whatever) to do something like this as long as you are primarily using equality filters. There are ways to get inequality filters to work but they are often limited. The faceting features can be useful too.
The upcoming Google Search API
IndexTank is the service I currently use
EDIT:
Yup, this is totally a hackish solution. The documents I am using it for are already in my search index and I am almost always also filtering on search terms.

Creating an efficient search capability using SQL Server (and/or ColdFusion)

I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records, containing product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for ColdFusion to do this? I envision a Google-like single-entry search where a customer can type in the part number, or the sizing, etc., and get hits on any or all relevant results.
Currently, if I run a 'LIKE' comparison query, it seems to take ages (OK, a few seconds, but still), and that is too long - at times making a user sit there and wait up to 10 seconds for queries and page loads.
Or are there any SQL techniques to help accomplish this? I want to use a proven method to search the data, not just a simple SQL LIKE or = comparison operation.
So this is a multi-approach question: should I attack this at the SQL level (as it ultimately looks like I should), or is there a plug-in/module for ColdFusion that I can grab that will give me speedy, advanced search capability?
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could run a Verity index update whenever you update them. If you update the records constantly, that will be a drag on the web server and will certainly offset any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mention of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of the records in your main table to a set of words (one word per row) from the textual field. If it matters, add the field of origin as a third column in the index table, and if you want "relevance" features you may want to consider a word count as well.
Populate the index table either with a trigger (using splitting) or from your app - the latter might be better: simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it will no longer use "LIKE", AND it will be able to use indexes on the index table (no pun intended) without interfering with indexing on SKU and the like on the main table.
Also, ensure that all the relevant fields are fully indexed - not necessarily in the same compound index (SKU, sizing, etc.) - and any field that is searched as a range (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase, or you don't care about insert/update speed as much).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" in the set of text fields. If so, I assume the values are "yes"/"no" or some other very limited data set. In that case, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
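A minimal sketch of the word/index table described in this answer, using pyodbc against SQL Server. The table, column, and connection names (products, product_search_words, DSN=products_db) are assumptions for illustration; the trigger-based variant is left out.

import re
import pyodbc

conn = pyodbc.connect("DSN=products_db")  # adjust the connection string
cur = conn.cursor()

cur.execute("""
    IF OBJECT_ID('product_search_words', 'U') IS NULL
    CREATE TABLE product_search_words (
        product_id   int NOT NULL,
        word         nvarchar(100) NOT NULL,
        source_field nvarchar(50) NOT NULL
    )
""")
cur.execute("""
    IF NOT EXISTS (SELECT 1 FROM sys.indexes WHERE name = 'ix_search_words')
    CREATE INDEX ix_search_words ON product_search_words (word, product_id)
""")
conn.commit()

def index_product(product_id, field_name, text):
    """Split a textual field into words and store one row per word."""
    rows = [(product_id, w.lower(), field_name)
            for w in re.findall(r"\w+", text)]
    cur.executemany(
        "INSERT INTO product_search_words (product_id, word, source_field) "
        "VALUES (?, ?, ?)", rows)
    conn.commit()

def search(words):
    """Exact word matches hit the (word, product_id) index - no LIKE needed."""
    placeholders = ",".join("?" * len(words))
    cur.execute(
        "SELECT DISTINCT product_id FROM product_search_words "
        "WHERE word IN (" + placeholders + ")", [w.lower() for w in words])
    return [row.product_id for row in cur.fetchall()]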
I've done this using SQL Server's full text indexes. This requires some application changes, but no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
from containstable(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
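As a short sketch of the query-building step described above, here is how the pieces could fit together from Python via pyodbc. The Bugs table comes from the answer's own example; the DSN and the BugId key column are assumptions.

import pyodbc

def build_fts_query(user_input: str) -> str:
    """Turn raw user input into '"Word1*" AND "Word2*"' for CONTAINSTABLE."""
    return " AND ".join('"{0}*"'.format(w) for w in user_input.split())

conn = pyodbc.connect("DSN=bugs_db")  # adjust the connection string
cur = conn.cursor()
cur.execute("""
    SELECT b.*, ft.[RANK]
    FROM CONTAINSTABLE(Bugs, *, ?) AS ft
    JOIN Bugs AS b ON b.BugId = ft.[KEY]   -- BugId: assumed full-text key column
    ORDER BY ft.[RANK] DESC
""", [build_fts_query("memory leak crash")])
for row in cur.fetchall():
    print(row)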
If you want a truly plug-in solution, then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup, then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding; it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where your search performance issues are going to arise. Make sure you have indexes on the columns you are searching on. If you use LIKE, the index can't be used when you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE '%FR%'
But it can use an index if you write it like this: SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key here is to keep as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173
