Best Practice of Field Collapsing in Solr 1.4

I need a way to collapse duplicate results (defined in terms of a string field with an ID) in Solr. I know that such a feature is coming in the next version (1.5), but I can't wait for that. What would be the best way to remove duplicates using the current stable version, 1.4?
Given that finding duplicates is really easy in my case (a comparison of a string field), should it be a Filter, should I override the existing SearchComponent or write a new component, or should I use an external library like Carrot2?
The overall result count should reflect the collapsed result set.

Well, there is a solution: just apply the field collapsing patch (see http://issues.apache.org/jira/browse/SOLR-236 for the latest news about this feature; I also recommend http://blog.jteam.nl/author/martijn).
With the patch applied you get a working CollapseComponent. Note that there is some search performance degradation associated with this feature.
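For reference, once the patch is applied, collapsing is driven by request parameters. A request might look roughly like the following (the exact parameter set has varied between patch revisions, and dup_id is an illustrative field name, not one from the question):
q=ipod&collapse.field=dup_id
The response then contains one representative document per distinct dup_id value.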

Related

Polarion document baseline with links in description field

I have a generic LiveDoc in Polarion which contains a series of referenced requirements. Recently I started to insert links into the description of some of the requirements to make it easier to navigate from one requirement to another. However, I've discovered that when I baseline the document, the links in the description don't get updated to point to the baselined version of the requirement, whereas the links (to the same requirement) in the Linked Work Items section are updated to include the baseline revision.
Is there a way to get the links in the description to point to the baselined revision like the ones in the Linked Work Items section?
I'm using Polarion 21 R1 if that matters.
Thanks in advance for your help.
Interesting approach, but I doubt you will get this working 100%. HTML is notoriously hard to parse (completely and correctly), so you should avoid this workflow.
Use Linked Work Items instead, together with the new Collections feature, which most probably does what you need.
Also, while it is possible to link to older/specific revisions (of artifacts) in Polarion, I have never found a scenario in which that was both maintainable and useful at the same time.
Note that revision numbers get big very fast (5-7 digits). Comparing or updating such links is very error-prone and demanding work, full of devastating pitfalls.
We follow the approach of keeping items unchanged after release and creating new items instead of changing existing ones. We end up with more work items, but Polarion's UI (and most people's heads) can deal with a large number of work items better than with versioned links.

Why is it not suggested to implement typeahead using Wildcard search?

Most tutorials suggest implementing autosuggest either with the Suggester component or with primitive typeahead techniques:
https://blog.griddynamics.com/implementing-autocomplete-with-solr/
However, my question is why no one suggests using a simple wildcard search for this, e.g. giving name suggestions when the user types mob:
q=name:(*mob*)
Is it feasible to use this approach for implementing autosuggest, compared to the other approaches? What would the repercussions be?
The strategy can work - for simple queries. The problem is that when you're querying with wildcards, the analysis chain is not invoked (a bit of a simplification - most filters are not invoked, only those that are MultiTermAware) - so as soon as you type a space, you're out of luck. You can work around this with the ComplexPhraseQueryParser, but that might not be what you're looking for (and it can quickly get expensive with regard to the number of terms).
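For illustration, a multi-term prefix query through that parser would look something like this in recent Solr versions (name is just an example field):
q={!complexphrase}name:"mob* pho*"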
In your example with a leading wildcard, the query will also be very expensive - since it requires Lucene (Solr's underlying search library) to, in effect, look at each indexed token and check whether the text mob occurs somewhere inside it. And since you don't have any analysis taking place - if you had indexed men's (which in most cases would be processed to match just men as a single token) and searched for men's* - you wouldn't get a hit.
So it works - kind of - but it's not ideal. That's the reason why the Suggester was implemented. The Suggester component supports many different configuration options to get the behavior you want, as well as (for some backends) context filtering (which, admittedly, would be easier to implement with just a wildcard, since it'd be a regular fq). The Suggester also supports weights, while wildcards wouldn't really do that in a proper way.
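As a rough sketch, a suggester with analyzed input, per-document weights, and infix matching (so mob matches inside longer terms) could be configured along these lines in solrconfig.xml - the field names and types here are illustrative, not taken from the question:
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">nameSuggester</str>
    <!-- Infix lookup: matches the typed text anywhere inside a suggestion -->
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <!-- Optional numeric field used to rank suggestions -->
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">nameSuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
A request would then look like /suggest?suggest.q=mob.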

How to temporarily disable Sitecore indexing while editing items

I am developing a Sitecore project that has several data import jobs running on a daily basis. Every time a job is executed, it may update a large number of Sitecore items (thousands), and I've noticed that all these edits trigger Solr index updates.
My concern is that I'm not really sure whether this is better, or whether updating everything at the end of the job is. So I would love to try both options. Could anyone tell me how I can use code to temporarily disable Lucene/Solr indexing and re-enable it later, when I finish editing all items?
This is a common requirement, and you're right to have such concerns. In general it's considered good practice to disable indexing during big import jobs and then rebuild afterwards.
Assuming you're using Sitecore 7 or above, this is pretty much what you need:
IndexCustodian.PauseIndexing();
IndexCustodian.ResumeIndexing();
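A minimal sketch of how these calls might be wrapped around an import job (the surrounding method and ImportItems() are placeholders; IndexCustodian lives in Sitecore.ContentSearch.Maintenance):
using Sitecore.ContentSearch.Maintenance;

public void RunImportJob()
{
    // Pause index updates for the whole application while the job runs.
    IndexCustodian.PauseIndexing();
    try
    {
        ImportItems(); // placeholder for your actual import logic
    }
    finally
    {
        // Always resume indexing, even if the import throws.
        IndexCustodian.ResumeIndexing();
    }
}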
Here's a comprehensive article discussing this:
http://blog.krusen.dk/disable-indexing-temporarily-in-sitecore-7/
In addition to Martin's answer, you can pass silent=true when you finish editing the item. Something like:
item.Editing.BeginEdit();
// ... change field values here ...
item.Editing.EndEdit(true, true); // updateStatistics: true, silent: true
The second parameter of the EndEdit() method forces a silent update of the item, which means no events/indexing will be triggered when the item is saved.
I feel this is safer than pausing indexing at the whole-application level during the import process; you just skip indexing for the items you are updating.
EDIT:
In case you need to rebuild the index for the updated items after the import process is done, you can use the following code. It will index the content tree starting from RootItemInTree and below:
var index = Sitecore.ContentSearch.ContentSearchManager.GetIndex("Your_Index_Name");
index.Refresh(new SitecoreIndexableItem(RootItemInTree));
To disable indexing during large import/update tasks, you should wrap your logic inside a BulkUpdateContext block. You can also use other wrappers, like the EventDisabler, to stop events from being fired if that is appropriate in your context. Alternatively, you could wrap your code in an EditContext and set it to silent. So your code could end up something like this:
using (new BulkUpdateContext()) // skips index updates while items change
using (new EditContext(targetItem, false, true)) // updateStatistics: false, silent: true
{
    // insert update logic here...
}
Here is an older question that discusses this topic: Optimisation tips when migrating data into Sitecore CMS

Should I use IDs to locate elements?

I've just started with Angular and Protractor.
It just feels wrong to write heavy CSS selectors which break instantly when you change something.
Using IDs would make testing way easier.
I'm not using any id attributes for styling yet. Are there any drawbacks to using IDs for testing that I haven't considered?
The general rule is to use IDs whenever possible, assuming they are unique across the DOM and not dynamically generated. Quoting Jim Holmes:
Whenever possible, use ID attributes. If the page is valid HTML, then IDs are unique on the page. They're extraordinarily fast for resolution in every browser, and the UI can change dramatically but your script will still locate the element.
Sometimes IDs aren't the right choice. Dynamically generated IDs are almost always the wrong choice when you're working with something like a grid control. You rely on an id that is likely tied to the specific row position, and then you're screwed if your row changes.
Also, in general, try to use the "data-oriented" approach: the by.model, by.binding, and by.repeater locators. Or, if you rely on class names, choose them wisely: do not use layout-oriented classes like .col-xs-4 or .container-fluid.
See also these related topics:
Best Practices for Watir and Selenium Locators
best way to detect an element on a web page for seleniumRC in java

Preventing certain docs from being indexed in CLucene

I am building a search index with CLucene and I want to make sure docs containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough, since the offensive doc still gets added and would be returned for non-offensive searches.
Instead I am hoping to build up a document, then check whether it contains any offensive words, and add it only if it doesn't.
Cheers!
You can't really access that type of data in a Document.
What you can do is run the analysis chain manually on the text and check each token individually. You can do this in a simple loop, or by adding another filter to the chain that just raises a flag you check later.
This introduces some more work, but it's the best way to achieve that, IMO.
