I am trying to implement a sane way to search using Solr, but I am stuck at one particular point: I am indexing a bunch of company names. Let's say one of them is Lowe's. When someone types lowes, I want that result to show up, but I am unable to get this working. Does anyone know how to do this?
The problem is that if you manage to configure your analyzers to handle one direction (i.e., searching lowes and matching Lowe's), you'll most probably break the other direction (i.e., searching lowe's and still getting Lowe's).
One quick workaround that doesn't need black magic with your schema is fuzzy searching. Try searching for lowes~.
One possible solution is to add such variants to the synonyms text file. The WordDelimiterFilterFactory documentation also mentions a way to handle a trailing 's by removing it, but that is probably not what you want.
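If you do go down the analyzer route, the catenateWords option of WordDelimiterGraphFilterFactory is what lets Lowe's also be indexed as lowes. Below is only a rough sketch of the equivalent Lucene analysis chain; the specific option values (including disabling stemEnglishPossessive) are my assumptions, and in Solr you would set the same options on the filter in schema.xml rather than in Java:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LowesDemo {
    public static void main(String[] args) throws Exception {
        // Roughly the chain you would declare on the field type in Solr's schema.xml
        // with solr.WordDelimiterGraphFilterFactory; the option values are assumptions.
        Analyzer analyzer = CustomAnalyzer.builder()
                .withTokenizer("standard")
                .addTokenFilter("wordDelimiterGraph",
                        "generateWordParts", "1",
                        "catenateWords", "1",          // "Lowe's" also emits the joined form "Lowes"
                        "stemEnglishPossessive", "0")  // keep the 's so the joined form can be built
                .addTokenFilter("lowercase")
                .build();

        try (TokenStream ts = analyzer.tokenStream("name", "Lowe's")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());   // expect lowe, lowes, s (order may vary)
            }
            ts.end();
        }
    }
}
```

Applying the same chain at query time should let lowes, lowe's and Lowe's all reduce to overlapping tokens.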
Related
I am facing a problem with singular and plural keyword search.
For example, if I search for men, it should return results for "men" and also "man". However, it is not working.
The easiest way is to use a SynonymFilter with those terms that you're aware of - the hard part is thinking of every alternative.
While stemming is usually used to reduce words to a common stem, this problem is known as lemmatization, where you're interested in the different forms of a word rather than just the common stem.
For Solr, your best bet is probably to go for something like Solr Lemmatizer by Nicholas Ding.
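As a concrete illustration of the SynonymFilter approach, here is a minimal sketch using the equivalent Lucene classes; the synonyms.txt contents and parameter values are just examples, and in Solr you would point a SynonymGraphFilterFactory in schema.xml at the same file:

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SynonymDemo {
    public static void main(String[] args) throws Exception {
        // A tiny synonyms file; in Solr this would live next to schema.xml.
        Path confDir = Files.createTempDirectory("conf");
        Files.writeString(confDir.resolve("synonyms.txt"), "man, men\n");

        Analyzer analyzer = CustomAnalyzer.builder(confDir)   // resources resolved relative to confDir
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("synonymGraph", "synonyms", "synonyms.txt", "expand", "true")
                .build();

        try (TokenStream ts = analyzer.tokenStream("body", "men")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());   // expect both men and man at the same position
            }
            ts.end();
        }
    }
}
```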
The official Solr documentation points to keeping Simplified and Traditional Chinese separate by using different tokenizers. I wonder whether people use the ICU Transform Filter for Traditional <-> Simplified conversion and are then able to have one single field for both forms of Chinese.
At the same time, this conversion seems to be a really hard task and doesn't appear to be a solved problem.
The simple question is: what is the recommended way of indexing Traditional and Simplified Chinese in Solr? It would be really convenient to have a single field for both, but I couldn't find a good success story for that.
The truth is, it is possible. This video shows how you could create a field with as many languages as possible, but it looks tricky.
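For reference, the ICU transform mentioned in the question can be tried outside Solr with ICU4J. This is only a sketch (the sample string is mine), but it is the same "Traditional-Simplified" transform that Solr's ICUTransformFilterFactory (in the analysis-extras contrib, if I recall correctly) applies per token:

```java
import com.ibm.icu.text.Transliterator;

public class ChineseFoldingDemo {
    public static void main(String[] args) {
        // The system transform that ICUTransformFilterFactory would be configured
        // with via id="Traditional-Simplified".
        Transliterator toSimplified = Transliterator.getInstance("Traditional-Simplified");

        String traditional = "資訊檢索";   // sample traditional-script text, chosen for illustration
        System.out.println(toSimplified.transliterate(traditional));

        // Folding both scripts to simplified at index *and* query time is what lets
        // a single field serve both, at the cost of the occasional ambiguous mapping.
    }
}
```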
I tried using the Sitecore.Search namespace and it seems to handle the basic stuff. I am now evaluating the AdvancedDatabaseCrawler module by Alex Shyba. What are some of the advantages of using this module instead of writing my own crawler and search functions?
Thanks
Advantages:
You don't have to write anything.
It handles a lot of the code you need to write to even query Sitecore, e.g. basic search, basic search with field-level sorting, field-level searches, relation searches (GUID matches for lookup fields), multi-field searches, numeric range and date range searches, etc.
It handles combined searches with logical operators.
You can access the code.
This video shows samples of the code and front-end running various search types.
Disadvantages:
None that I can think of, because if you find an issue or a way to extend it, you have full access to the code and can amend it per your needs. I've done this before by creating the GetHashCode() and Equals() methods for the SkinnyItem class.
First of all, the "old" way of acecssing the Lucene index was very simple, but unfortunately it's deprecated from Sitecore 6.5.
The "new" way of accessing the Lucene index is very complex as the possibilities are endless. Alex Shyba's implementation is the missing part that makes it sensible to use the "new" way.
Take a look at this blog post: http://briancaos.wordpress.com/2011/10/12/using-the-sitecore-open-source-advanceddatabasecrawler-lucene-indexer/
It's a 3-part description of how to configure the AdvancedDatabaseCrawler, how to make a simple search, and how to make a multi-field search. Without Alex's AdvancedDatabaseCrawler, these tasks would take almost 100 lines of code. With it, they take only 7 lines of code.
So if you need an indexing solution, this is the one to use.
As far as I know, almost all spell checkers work on a single query term at a time and are unable to make changes to the whole input query to increase coverage in the corpus. There is one in LingPipe, but it is very expensive... http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html
So my question is: what is the best Apache alternative to a LingPipe-like spell checker?
The spellcheckers in Lucene treat whitespace like any other character, so in general you can feed them your query logs or whatever and spellcheck/autocomplete full queries.
For Lucene this should just work; for Solr you need to ensure the QueryConverter doesn't split up your terms - see https://issues.apache.org/jira/browse/SOLR-3143
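As an illustration with plain Lucene, you can build the spell index straight from a query log with one full query per line. This is just a sketch (the log contents and index path are made up), but it shows why whitespace being treated like any other character lets you correct whole queries:

```java
import java.io.StringReader;
import java.nio.file.Paths;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class QuerySpellDemo {
    public static void main(String[] args) throws Exception {
        // Pretend query log: one *full* query per line. PlainTextDictionary treats each
        // line as a single entry, so whole queries become the "words" to correct against.
        String queryLog = "harry potter\nhome depot\nlowes hardware\n";

        try (Directory dir = FSDirectory.open(Paths.get("spell-index"));
             SpellChecker spell = new SpellChecker(dir)) {

            spell.indexDictionary(
                    new PlainTextDictionary(new StringReader(queryLog)),
                    new IndexWriterConfig(new KeywordAnalyzer()),
                    true);

            for (String suggestion : spell.suggestSimilar("hary poter", 3)) {
                System.out.println(suggestion);   // should include "harry potter"
            }
        }
    }
}
```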
On the other hand, the current suggesters work on the whole input, so if you want to suggest queries that have never been searched before, you instead want something that only takes the last N words of context, similar to http://googleblog.blogspot.com/2011/04/more-predictions-in-autocomplete.html.
I'm hoping we will also provide that style of suggester as an alternative soon, possibly under https://issues.apache.org/jira/browse/LUCENE-3842.
But keep in mind that it's not suitable for all purposes, so I think it's likely going to be just an option. For example, if you are doing e-commerce there is no sense in suggesting products you don't sell :)
We have millions of simple txt documents containing various data structures we extracted from PDFs. The text is printed line by line, so all formatting is lost (when we tried tools that maintain the format, they just messed it up). We need to extract the fields and their values from these text documents, but there is some variation in the structure of these files (a new line here and there, noise on some sheets so spellings are incorrect).
I was thinking we would create some sort of template structure with information about the coordinates (line, word number) of keywords and values, and use this information to locate and collect the keyword values, with various algorithms to make up for the inconsistent formatting.
Is there any standard way of doing this? Any links that might help? Any other ideas?
The noise can be corrected or ignored by using fuzzy text-matching tools like agrep: http://www.tgries.de/agrep/
However, the problem with extra new-lines will remain.
One technique I would suggest is to limit error propagation in a similar way to how compilers do it. For example, you try to match your template or a pattern and you can't. Later on in the text there is a sure match, but it might be part of the currently unmatched pattern.
In this case, the sure match should be accepted, and the chunk of text that was left unmatched should be set aside for future processing. This will enable you to skip errors that are too hard to parse.
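To make that idea concrete, here is a hypothetical sketch (the field names, anchor patterns and sample lines are all invented): each line either hits one of the keyword anchors and is accepted as a sure match, or is set aside for a later, fuzzier pass instead of derailing the rest of the parse.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateExtractor {

    // Hypothetical "sure match" anchors: field name -> pattern whose last group is the value.
    private static final Map<String, Pattern> ANCHORS = Map.of(
            "invoiceNumber", Pattern.compile("(?i)invoice\\s*(no|number)[.:]?\\s*(\\S+)"),
            "totalAmount",   Pattern.compile("(?i)total(\\s*amount)?[.:]?\\s*([\\d.,]+)"));

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Invoice No: 12345",
                "some garbled noise the template cannot explain",
                "Total: 1,234.56");

        Map<String, String> fields = new LinkedHashMap<>();
        List<String> unmatched = new ArrayList<>();   // set aside instead of failing the whole document

        for (String line : lines) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> anchor : ANCHORS.entrySet()) {
                Matcher m = anchor.getValue().matcher(line);
                if (m.find()) {                       // a "sure match": accept it immediately
                    fields.put(anchor.getKey(), m.group(m.groupCount()));
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                unmatched.add(line);                  // keep for a later, fuzzier pass (e.g. agrep-style)
            }
        }

        System.out.println("fields:    " + fields);
        System.out.println("unmatched: " + unmatched);
    }
}
```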
Larry Wall's Perl is your friend here. This is precisely the sort of problem domain at which it excels.
Sed is OK, but for this sort of thing, Perl is the bee's knees.
While I second the recommendations for the Unix command-line and for Perl, a higher-level tool that may help is Google Refine. It is meant to handle messy real-world data.
I would recommend using a graph regular expression here, with very weak rules and a final acceptance predicate. That way you can write fuzzy matching at the token level, then at the line level, and so on.
I suggest the Talend data integration tool. It is open source (i.e. free!). It is built on Java, and you can customize your data integration project any way you like by modifying the underlying Java code.
I used it and found it very helpful on low-budget, highly complex data integration projects. Here's the link to their website: Talend
Good luck.