This is on Amazon Cloudsearch, but it probably holds true for any generic Lucene/Solr installation.
I am indexing a bunch of articles and comments on those articles to be searched. When I search for "Trump sucks", I want the ability to get back a list of comments that match, or a list of articles which have comments that match.
I know I can index them in 2 separate domains, but I wonder if there is an easier way to do a "distinct" on a field... in other words...
I have a list of indexed documents for each comment, each of which also contains the article_id as a field, so:
id=1 {'article_id':10}
id=2 {'article_id':10}
right now if both of these comments match, I will get back 2 results (and yes, I can do a distinct on the client side, but it would mess up paging and such). I want to be able to just get back [10]
There is no way to do distinct in CloudSearch so you will need to come up with another solution.
The best I can offer is to concatenate all comments into a single text field on article records and add a type field to differentiate comments and articles (if you don't already have one). You can then query on type=Article while searching over the concatenated comments and article body and will only ever receive one result per article.
Even with thousands of comments concatenated into a single field on each article, I am sure CloudSearch will perform well (maybe even better than with tens of thousands of extra records to consider); however, your update process to concatenate all the comments might get heavy. If you do get thousands of comments, then adding a flag that tracks whether an article's comments have already been concatenated, so you don't have to rebuild them every time, will be helpful.
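To make the shape of that concrete, here is a minimal sketch of the re-index step using boto3's cloudsearchdomain client; the endpoint URL and the field names (doc_type, body, comment_text) are placeholders for illustration, not part of your actual schema.

import json
import boto3

# Sample data standing in for one article and its comments (illustrative only).
article = {"id": 10, "body": "Article body text..."}
comments = [{"text": "First comment."}, {"text": "Second comment."}]

# One CloudSearch document per article: every comment is concatenated into a
# single searchable field, and a type field distinguishes articles from comments.
doc = {
    "type": "add",
    "id": "article-{}".format(article["id"]),
    "fields": {
        "doc_type": "article",
        "body": article["body"],
        "comment_text": " ".join(c["text"] for c in comments),
    },
}

# Upload to the document endpoint of the search domain (placeholder URL).
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-yourdomain.us-east-1.cloudsearch.amazonaws.com",
)
client.upload_documents(documents=json.dumps([doc]), contentType="application/json")

Searches can then add a filter such as fq=doc_type:'article' so each article comes back at most once, no matter how many of its comments match.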
If I have an index with 10,000,000 documents and search text and ask to retrieve the top 1,000 items, is there a way to scope the facets to those 1,000 items?
My current problem is:
We have a very large index with a few different facets, including manufacturer. If I search for a product (WD-40 for instance), that matches a lot of different documents and document fields. It finds the product and it is the top scoring match, but because they only make 1 or 2 products, the manufacturer doesn't show up as a top facet option because the facets are sorted by count.
Is there a way to scope the facets to the top X documents? Or, is there a way to only grab documents which are above a certain @search.score?
The purpose of a refiner is to give users options to narrow down the result set, and I would say the $top parameter and the returned facets work as they should. Trying to limit the refiners to the top 1,000 results is a bad idea when you think about it: you'll end up with confusing usability and recall issues.
Your query for WD-40 returns a large result set. So large that there are 155347 unique manufacturers listed. I'm guessing you have several million hits. The intent of that query is to return the products called WD-40 (my assumption). But, since you search all properties in all types of content, you end up with various products like doors, hinges, and bikes that might have some text saying that "put some WD-40 on it to stop squeaks".
I'm guessing that most of the hits you get are irrelevant. Thus, you should either limit the scope of your initial query by default (for example, search only the title property) or add a filter to exclude a category of documents (like manuals, price lists, etc.).
You could also consider submitting different queries from your frontend application. One narrowly scoped query that retrieves the refiners and another, broader query that returns the results.
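As a rough sketch of that two-query pattern using the azure-search-documents Python SDK (the service endpoint, index name, and field/facet names below are placeholders, and the exact parameters may differ in your setup):

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder endpoint, index, and key.
client = SearchClient(
    endpoint="https://your-service.search.windows.net",
    index_name="products",
    credential=AzureKeyCredential("<api-key>"),
)

# Narrow query: restrict the search to the title field so the manufacturer facet
# reflects products actually called "WD-40", not every document that mentions it.
facet_results = client.search(
    search_text="WD-40",
    search_fields=["title"],
    facets=["manufacturer,count:20"],
)
manufacturer_facet = facet_results.get_facets()["manufacturer"]

# Broad query: search everything for the result list itself.
results = client.search(search_text="WD-40", top=50)
for doc in results:
    print(doc["title"])

The narrow query's hits themselves can be ignored; only its facets are used to build the refiners shown to the user.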
I don't have a relevant data set to test on, but I believe the $top parameter might do what you want. See this link:
https://learn.microsoft.com/en-us/rest/api/searchservice/search-documents#top-optional
That said, there are other approaches to solve your use case.
Normalize your data
I don't know how clean your data is. But, for any data set of this size, it's common that the manufacturer name is not consistent. For example, your manufacturer may be listed as
WD40 Company
WD-40 Company
WDFC
WD 40
WD-40 Inc.
...
Normalizing will greatly reduce the number of values in your refiners. It's probably not enough for your use case, but still worth doing.
Consider adding more refiners
When you have a refiner with too many options, it's always a good idea to consider adding more refiners with coarse values. For example, a category, or perhaps a simple refiner that splits the results in two: "Physical vs. Digital" product as a first choice, consumer vs. professional product, or in stock vs. on back-order. This pattern allows users to quickly reduce the result set without having to use the brand refiner.
Categorize your refiner with too many options
In your case, your manufacturer refiner contained too many options. I have seen examples where people add a search box within the refiner. Another option is to group your refiner options in buckets. For text values like a manufacturer, you could generate a new property with the first character of the manufacturer's name. That way you could present a refiner that lets users select a manufacturer from A-Z.
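A minimal sketch of that bucketing, done at indexing time (the manufacturerFirstLetter property name is made up for illustration):

def add_manufacturer_bucket(doc):
    # Derive an A-Z bucket from the manufacturer name so the UI can offer a
    # coarse first-letter refiner on top of the full manufacturer refiner.
    name = (doc.get("manufacturer") or "").strip()
    first = name[:1].upper()
    doc["manufacturerFirstLetter"] = first if first.isalpha() else "#"
    return doc

doc = {"id": "1", "title": "WD-40 Multi-Use Product", "manufacturer": "WD-40 Company"}
print(add_manufacturer_bucket(doc)["manufacturerFirstLetter"])  # prints W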
I'm fresh out of the nursery as far as Lucene/Solr are concerned, so I may be trying to utilize it completely wrong, but I hope someone can point me in the right direction.
My documents (fewer than 3,000) are short statements from a taxonomy. All are single sentences, some no more than 4-6 words long. There is only one field for each document, so searching across multiple fields is not a route I would be looking into. What I would like to do is query the contents of a work-related document and have the relevant taxonomy statements returned.
Currently I am using the default example setup that came with Solr with added verb synonyms from Wordnet since performed actions are what I am trying to identify (i.e. taxonomy statement of 'Alter garments to specifications').
Basic word matching works as expected, but I would like to make things somewhat more sophisticated. Since the queries are so long, I never end up with high relevancy scores when searching against the tiny documents. I'm sure this can be resolved by normalizing scores in some fashion, so I am not really concerned about the scores themselves, but about the actual statements (documents) that are being identified.
Would I be better off indexing the documents (currently the long queries) on the fly, querying each taxonomy statement against them, and compiling/sorting the results, or can I perform these long queries on the tiny documents effectively in some other fashion? I presume this may present its own difficulties.
I don't see an easy way to do what you are trying to do here. Your short-document index will definitely suffer from a lack of information, and a long query will make every result score almost flat against it. Even expanding the documents by adding WordNet synonyms for every term will be confusing and misleading, I think. My advice is to check other possible forms of the query.
I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort=sort_path+desc&rows=N&start=12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that Solr range queries combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path desc and then reversing the returned chunk, to get the previous N sections. I am going to test the performance of this solution, but it seems viable.
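For reference, here is a rough sketch of those two requests using pysolr (the core URL and page size are placeholders). A curly brace makes a range bound exclusive, which skips the anchor section itself.

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/sections")  # placeholder core URL
PAGE_SIZE = 20
anchor = "/A/B/C"  # sort_path of the last section on the current page

# Next page: everything strictly after the anchor, in ascending order.
next_page = solr.search(
    'sort_path:{"' + anchor + '" TO *]',
    sort="sort_path asc",
    rows=PAGE_SIZE,
)

# Previous page: everything strictly before the anchor, fetched in descending
# order and then reversed so it reads top-to-bottom.
prev_page = list(
    solr.search(
        'sort_path:[* TO "' + anchor + '"}',
        sort="sort_path desc",
        rows=PAGE_SIZE,
    )
)
prev_page.reverse()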
This is not really a Solr-specific problem, but a general problem with pagination of any external data source, because the data source has an independent state from the (web) application. For example, it also happens on relational databases. Here's a good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution: "Repeat the query for each new request" since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get a lot less documents (therefore a slower rate of inserts/deletes) and you can use the highlighting parameters to get snippets of the sections that matched the user query.
Another option would be decreasing your commit rate, but this could end up in less-than-ideal document freshness.
I want to index some articles and show the paragraph number in the search result, so I guess the Solr schema should look like this:
article_id, paragraph_number, paragraph_content
Therefore, I need to parse article first, extract paragraphs and index it one by one.
I'm worried about the performance since one article can contain 100 paragraphs.
Any suggestion?
It is better to do the heavy lifting at index time rather than search time. So parsing the paragraphs out of the document when you index is probably the right way to go.
How many articles do you have? It really shouldn't be a problem to strip paragraphs (we do much more complex pre-processing than that).
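A minimal sketch of that index-time split, assuming pysolr and a simple blank-line paragraph convention (field names follow the schema proposed in the question):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles")  # placeholder core URL

def index_article(article_id, text):
    # Split the article on blank lines and index one Solr document per paragraph.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    docs = [
        {
            "id": "{}-{}".format(article_id, n),
            "article_id": article_id,
            "paragraph_number": n,
            "paragraph_content": p,
        }
        for n, p in enumerate(paragraphs, start=1)
    ]
    solr.add(docs)

index_article(42, "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.")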
If you only need to match individual paragraphs against the fulltext query (as opposed to filters etc.), you could also do this using highlighting -- split up the paragraphs, prefix each one with its paragraph number, and then index the paragraphs as multiple values in a single field in a single document. At search time, you'd do a highlight on the field with a full match (e.g. fragment size of -1) and no decoration of the highlight; so what you'd get back is the paragraph that matched the fulltext query, prefixed by its paragraph number (which you'd probably want to then pull back out).
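Here is a rough outline of that highlighting approach with pysolr; the field name paragraphs is made up, and the exact highlight parameters (especially the fragment size value) may need adjusting for your Solr version and highlighter.

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/articles")  # placeholder core URL

# One document per article; paragraphs go into a multi-valued field, each value
# prefixed with its paragraph number.
paragraphs = ["First paragraph.", "Second paragraph about quarterly results."]
solr.add([{
    "id": "42",
    "paragraphs": ["{} {}".format(n, p) for n, p in enumerate(paragraphs, start=1)],
}])

# Ask for whole-value highlights with no decoration; what comes back is the
# matching paragraph, still carrying its number prefix.
results = solr.search(
    "paragraphs:(quarterly results)",
    **{
        "hl": "true",
        "hl.fl": "paragraphs",
        "hl.fragsize": "0",   # no fragmenting: return the whole field value
        "hl.simple.pre": "",
        "hl.simple.post": "",
    }
)
print(results.highlighting)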
Not sure if this fits your use case exactly but might be an interesting approach to try -- I do something similar to identify photos whose caption matches the fulltext query to display next to article search results.
I am trying to visualize how to create a search for an application that we are building. I would like a suggestion on how to approach 'searching' through large sets of data.
For instance, this particular search would be on a table of at least 750k records, containing product SKUs, sizing, material type, create date, etc.
Is anyone aware of a 'plugin' solution for Coldfusion to do this? I envision a google like single entry search where a customer can type in the part number, or the sizing, etc, and get hits on any or all relevant results.
Currently if I run a 'LIKE' comparison query, it seems to take ages (ok a few seconds, but still), and it is too long. At times making a user sit there and wait up to 10 seconds for queries & page loads.
Or are there any SQL formulas to help accomplish this? I want to use a proven method to search the data, not just a simple SQL like or = comparison operation.
So this is a multi-approach question, should I attack this at the SQL level (as it ultimately looks to be) or is there a plug in/module for ColdFusion that I can grab that will give me speedy, advanced search capability.
You could try indexing your db records with a Verity (or Solr, if CF9) search.
I'm not sure it would be faster, and whether even trying it would be worthwhile depends a lot on how often you update the records you need to search. If you update them rarely, you could do a Verity index update whenever you update them. If you update the records constantly, that's going to be a drag on the webserver, and certainly mitigate any possible gains in search speed.
I've never indexed a database via Verity, but I've indexed large collections of PDFs, Word Docs, etc, and I recall the search being pretty fast. I don't know if it will help your current situation, but it might be worth further research.
If your slowdown is specifically the search of textual fields (as I surmise from your mentioning of LIKE), the best solution is building an index table (not to be confused with DB table indexes, which are also part of the answer).
Build an index table mapping the unique ID of your records from main table to a set of words (1 word per row) of the textual field. If it matters, add the field of origin as a 3rd column in the index table, and if you want "relevance" features you may want to consider word count.
Populate the index table with either a trigger (using splitting) or from your app - the latter might be better, simply call a stored proc with both the actual data to insert/update and the list of words already split up.
This will immediately and drastically speed up textual search, as it will no longer do "LIKE" and will be able to use indexes on the index table (no pun intended) without interfering with the indexing on SKU and the like on the main table.
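As a rough illustration of the shape of that design (sketched here with Python and SQLite purely to keep it self-contained; in your case this would be SQL Server tables populated by a trigger or stored procedure):

import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, sku TEXT, description TEXT);
    -- One row per (product, word): this is the index table.
    CREATE TABLE word_index (product_id INTEGER, word TEXT, field TEXT);
    CREATE INDEX ix_word_index_word ON word_index (word);
""")

def index_product(product_id, sku, description):
    conn.execute("INSERT INTO product VALUES (?, ?, ?)", (product_id, sku, description))
    # Split the textual field into words; one row per distinct word.
    for word in set(re.findall(r"\w+", description.lower())):
        conn.execute(
            "INSERT INTO word_index VALUES (?, ?, ?)", (product_id, word, "description")
        )

index_product(1, "WD40-12OZ", "WD-40 multi-use lubricant, 12 oz aerosol can")
index_product(2, "HINGE-3IN", "3 inch steel door hinge, boxed")

# Word lookups now hit the index on word_index.word instead of a LIKE table scan.
rows = conn.execute(
    "SELECT DISTINCT p.sku FROM product p "
    "JOIN word_index w ON w.product_id = p.id WHERE w.word = ?",
    ("hinge",),
).fetchall()
print(rows)  # [('HINGE-3IN',)]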
Also, ensure that all the relevant fields are fully indexed - not necessarily in the same compound index (SKU, sizing, etc.), and any field that is searched as a range (sizing or date) is a good candidate for a clustered index (as long as the records are inserted in approximate order of that field's increase, or you don't care about insert/update speed as much).
For anything more detailed, you will need to post your table structure, existing indexes, the queries that are slow, and the query plans you currently have for those slow queries.
Another item is to ensure that as few of the fields as possible are textual, especially ones that are "decodable" - your comment mentioned "is it boxed" in the set of text fields. If so, I assume the values are "yes"/"no" or some other very limited data set. If so, simply store a numeric code for the valid values, do the en/de-coding in your app, and search by the numeric code. Not a tremendous speed improvement, but still an improvement.
I've done this using SQL Server's full text indexes. It requires very few application changes and no changes to the database schema except for the addition of the full text index.
First, add the Full Text index to the table. Include in the full text index all of the columns the search should perform against. I'd also recommend having the index auto update; this shouldn't be a problem unless your SQL Server is already being highly taxed.
Second, to do the actual search, you need to convert your query to use a full text search. The first step is to convert the search string into a full text search string. I do this by splitting the search string into words (using the Split method) and then building a search string formatted as:
"Word1*" AND "Word2*" AND "Word3*"
The double-quotes are critical; they tell the full text index where the words begin and end.
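A tiny sketch of that string-building step (shown in Python for brevity; the original context would use ColdFusion or .NET string functions instead):

def build_fulltext_query(search_string):
    # Turn "word1 word2 word3" into '"word1*" AND "word2*" AND "word3*"'.
    # The double-quotes mark word boundaries; the * makes each term a prefix match.
    words = [w for w in search_string.split() if w]
    return " AND ".join('"{}*"'.format(w) for w in words)

print(build_fulltext_query("blue widget bracket"))
# "blue*" AND "widget*" AND "bracket*"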
Next, to actually execute the full text search, use the ContainsTable command in your query:
SELECT *
FROM CONTAINSTABLE(Bugs, *, '"Word1*" AND "Word2*" AND "Word3*"')
This will return two columns:
Key - The column identified as the primary key of the full text search
Rank - A relative rank of the match (1 - 1000 with a higher ranking meaning a better match).
I've used approaches similar to this many times and I've had good luck with it.
If you want a truly plug-in solution then you should just go with Google itself. It sounds like you're doing some kind of e-commerce or commercial site (given the use of the term 'SKU'), so you probably have a catalog of some kind with product pages. If you have consistent markup then you can configure a Google appliance or service to do exactly what you want. It will send a bot in to index your pages and find your fields. No SQL, little coding, and it will not be dependent on your database, or even ColdFusion. It will also be quite fast and familiar to customers.
I was able to do this with a ColdFusion site in about 6 hours, done! The only thing to watch out for is that Google's index is limited to what the bot can see, so if you have a situation where you want to limit access based on a user's role, permissions, or group, then it may not be the solution for you (although you can configure a permission service for Google to check with).
Because SQL Server is where your data is, that is where your search performance issue is most likely to be. Make sure you have indexes on the columns you are searching on. If you use a LIKE, an index cannot be used for a query like SELECT * FROM TABLEX WHERE last_name LIKE '%FR%',
but it can be used for SELECT * FROM TABLEX WHERE last_name LIKE 'FR%'. The key is to keep as many of the leading characters as possible free of wildcards.
Here is a link to a site with some general tips. https://web.archive.org/web/1/http://blogs.techrepublic%2ecom%2ecom/datacenter/?p=173