Preventing certain docs from being indexed in CLucene

I am building a search index with CLucene and I want to make sure docs containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough, since the offensive doc still gets added and would be returned for non-offensive searches.
Instead I am hoping to build up a document, then check whether it contains any offensive words, and add it only if it doesn't.
Cheers!

You can't really access that kind of data through a Document.
What you can do is run the analysis chain manually on the text and check each token individually. You can do this in a simple loop (sketched below), or by adding another analyzer to the chain that just raises a flag you check later.
This introduces some extra work, but it's the best way to achieve that, IMO.
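For illustration, here is a minimal sketch of that loop written against Lucene's Java API (CLucene follows the same analyzer/token-stream pattern in C++, with slightly different names); the field name, the analyzer choice and the OFFENSIVE set are assumptions:

import java.io.IOException;
import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class OffensiveTermCheck {

    // Hypothetical block list; load the real one from configuration.
    private static final Set<String> OFFENSIVE = Set.of("badword1", "badword2");

    // Returns true if the analyzed text contains any blocked token.
    static boolean containsOffensiveTerm(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("content", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                if (OFFENSIVE.contains(term.toString())) {
                    return true;
                }
            }
            ts.end();
        }
        return false;
    }
}

Run this check on each field's text before adding the document, and skip the addDocument call when it returns true.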

Reference in B2C_1A_TrustFrameworkExtensions missing in Identity Experience Framework examples

I'm getting an error when uploading my customized policy, which is based on Microsoft's SocialAccounts example ([tenant] is a placeholder I added):
Policy "B2C_1A_TrustFrameworkExtensions" of tenant "[tenant].onmicrosoft.com" makes a reference to ClaimType with id "client_id" but neither the policy nor any of its base policies contain such an element
I've done some customization to the file, including adding local account signon, but comparing copies of TrustFrameworkExtensions.xml in the examples, I can't see where this element is defined. It is not defined in TrustFrameworkBase.xml, which is where I would expect it.
I figured it out, although it doesn't make sense to me. Hopefully this helps someone else running into the same issue.
The TrustFrameworkBase.xml is not the same in each scenario. When Microsoft's documentation said not to modify it, I assumed that meant the "base" was always the same. The implication of this design: if you try to mix and match between scenarios, you also need to find the supporting pieces in that scenario's TrustFrameworkBase.xml and move them into your extensions document. It also means that if Microsoft updates its reference policies and you want to pick up the update, you need to remember which scenario you originally implemented, and potentially which others you pulled from, or do a line-by-line comparison. Not the end of the world, but also not how I'd design an inheritance structure.
This also explains why I had to work through previous validation errors, including missing <DisplayName> and <Protocol> elements in the <TechnicalProfile> element.
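For example, the missing client_id reference can be resolved by copying its declaration from the scenario's base policy into the ClaimsSchema of the extensions file. A minimal sketch (the DataType shown is an assumption; copy the actual element rather than re-creating it):

<BuildingBlocks>
  <ClaimsSchema>
    <!-- Hypothetical declaration: take the real element from the scenario's TrustFrameworkBase.xml -->
    <ClaimType Id="client_id">
      <DisplayName>client_id</DisplayName>
      <DataType>string</DataType>
    </ClaimType>
  </ClaimsSchema>
</BuildingBlocks>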
Yes - I agree that is a problem.
My suggestion is always to use the "SocialAndLocalAccountsWithMfa" scenario as the sample.
That way you will always have the correct attributes and you know which one to use if there is an update.
It's easy enough to comment out the MFA stuff in the user journeys if you don't want it.
There is one exception. If you want to use "username" instead of "email", the reads/writes etc. are only in the username sample.

Why is it not suggested to implement typeahead using Wildcard search?

Most tutorials suggest implementing autosuggest either with the Suggester component or with primitive typeahead techniques:
https://blog.griddynamics.com/implementing-autocomplete-with-solr/
However, my question is why no one suggests using a simple wildcard search for this, e.g. giving name suggestions when the user types mob:
q=name:(*mob*)
Is this approach feasible for implementing autosuggest compared to the other approaches? What would the repercussions be?
The strategy can work, for simple queries. The problem is that when you're querying with wildcards, the analysis chain is not invoked (a bit of a simplification: most filters are not invoked, only those that are MultiTermAware), so as soon as you type a space, you're out of luck. You can work around this with the ComplexPhraseQueryParser, but that might not be what you're looking for (and it can quickly get expensive as the number of terms grows).
With a leading wildcard, as in your example, the query will also be very expensive, since it requires Lucene (Solr's underlying search library) to in effect look at each generated token and check whether the text mob occurs somewhere inside it. And since no analysis takes place, if you had indexed men's (which in most cases would be processed to a single token matching just men) and then searched for men's*, you wouldn't get a hit.
So it works, kind of, but it's not ideal. That's the reason why the suggester was implemented. The suggester component supports many different configuration options to get the behavior you want, as well as (for some backends) context filtering (which, admittedly, would be easier to implement with a plain wildcard, since it'd be a regular fq). The suggester also supports weights, which wildcards can't really express in a proper way.
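For reference, a minimal suggester setup in solrconfig.xml might look like the following; the field name, field type and handler path are assumptions to adapt to your schema:

<!-- Hypothetical suggester config: adjust field and analyzer names to your schema -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">nameSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">nameSuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>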

Custom Searcher - Blending of hits from different sources

We have a need for "Blending of hits from different sources"; per your documentation it is recommended to write a custom Searcher in Java. Is there a demo of this written somewhere on GitHub? I wouldn't even know where to start :( I understand I can create search "chains", preferably asynchronous, and then blend results in Java before returning them... but then how would I handle pagination, limits, etc.? This all seems very complicated for someone who doesn't even know Java that much. So, I am hoping someone has already written a demo for this? Please? Anyone?
Thank you so much
EDIT to make my question clearer:
We are writing a search engine that fetches data from various websites. Some websites have 10 million indexable items, others only 100,000. When we present the results to the end user, we want to include results from all our sources (when a match applies): say, 10 results from each of the websites we crawl, so that they all get an equal amount of attention on the page. If we don't do custom blending, what happens is that the largest website with the most items wins all our traffic.
I understand that we can send 10 separate queries to Vespa and blend the results in our front end, but that seems very inefficient. Hence the question about a "Custom Searcher". Thank you so much!
That documentation covers some very advanced use cases which you do not have. Are your sources different Vespa schemas or content clusters? If so, Vespa will by default blend the hits returned from each according to their relevance scores, so there's nothing you need to do.
The two other most common use cases are:
Some (or all) of the data sources are external, so you need to write a Searcher component to fetch the external data and turn it into a Result.
You want the data to be blended in some custom way (rather than by relevance score). If so, you need to exclude the default blending Searcher (com.yahoo.prelude.searcher.BlendingSearcher) and write your own; a minimal skeleton for this case is sketched below.
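For the second case, a minimal custom Searcher skeleton might look like this; the class name and the blending logic are placeholders, and in a real deployment the component is added to a search chain in services.xml:

import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

// Hypothetical custom blender: reorders hits after all sources have answered.
public class CustomBlendingSearcher extends Searcher {

    @Override
    public Result search(Query query, Execution execution) {
        // Pass the query down the chain; the Result holds hits from all sources.
        Result result = execution.search(query);
        // Fill summary fields in case the blending logic needs them.
        execution.fill(result);
        // Custom blending would reorder or trim result.hits() here.
        return result;
    }
}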
If you provide some more information about your use cases I can give you some code examples.
EDIT: Use grouping to solve the need explained under "EDIT" in the question:
Create a "siteid" field when feeding (e.g. in document processing).
Use the grouping expression all(group(siteid) each(max(10) output(summary())))
See http://docs.vespa.ai/documentation/grouping.html
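Assuming the siteid field is in place, the grouping expression can be passed with the request along these lines (host, port and query are placeholders):

http://host:port/search/?query=foo&select=all(group(siteid) each(max(10) output(summary())))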

How to make another search call inside SOLR

I would like to implement some kind of fallback querying mechanism inside SOLR. That is, if a first search call doesn't generate enough results, I would like to make another call with different ranking, then combine the results and return them. I guess this can be done on the SOLR client side, but I hope to do it inside SOLR. From reading the documentation, I guess I need to implement a search component and then add it next to the "query" component? Any reference or experience in this regard would be highly appreciated.
SearchHandler calls all the registered search components in the order you define, and there are several stages (prepare, process, etc.).
You know the number of results only after the distributed processing phase (I suppose you work in distributed mode), so your custom search component should check the number of results in the response object and run its own query if necessary; see the sketch below.
Actually, you may inherit from (or wrap) a regular QueryComponent for that, augmenting its process/distributed-process phases.
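A minimal skeleton of such a component might look like this; the class name and threshold are assumptions, and depending on your Solr version you may need to implement further abstract methods such as getSource():

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical fallback component: register it in solrconfig.xml after the query component.
public class FallbackSearchComponent extends SearchComponent {

    private static final long MIN_RESULTS = 10; // assumed threshold

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Nothing to do before the main query runs.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        long found = rb.getResults().docList.matches();
        if (found < MIN_RESULTS) {
            // Re-run the search with relaxed ranking here and merge the
            // extra hits into rb.rsp before the response is written.
        }
    }

    @Override
    public String getDescription() {
        return "Runs a fallback query when the main query returns too few results";
    }
}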

Best Practice of Field Collapsing in SOLR 1.4

I need a way to collapse duplicate results (defined in terms of a string field with an id) in Solr. I know that such a feature is coming in the next version (1.5), but I can't wait for that. What would be the best way to remove duplicates using the current stable version, 1.4?
Given that finding duplicates is really easy in my case (comparison of a string field), should it be a Filter, should I override the existing SearchComponent or write a new Component, or use some external library like Carrot2?
The overall result count should reflect the shortened result.
Well, there is a solution: just apply the field collapsing patch (see http://issues.apache.org/jira/browse/SOLR-236 for the latest news about this feature; I also recommend http://blog.jteam.nl/author/martijn).
Doing this you will get a working CollapseComponent. Note that there is some search performance degradation associated with this feature.
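Once the patch is applied, collapsing is driven by request parameters. The exact parameter names depend on the patch revision you apply (check the SOLR-236 issue for the version you use), but an invocation along these lines was typical; the field name here is a placeholder:

q=your+query&collapse.field=dup_id&collapse.max=1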
