lucene ngram tokenizer usage for fuzzy phrase match - solr

I am trying to implement fuzzy phrase search (to match misspelled words) with Lucene. After reading various blogs, I thought I would try n-gram indexes for this.
But I couldn't find an n-gram tokenizer in my Lucene 3.4 JAR. Has it been deprecated and replaced with something else? Currently I am using StandardAnalyzer, which gives decent results for exact term matches.
I have the two requirements below to handle.
My index has a document with the phrase "xyz abc pqr". When I query "abc xyz"~5 I get results, but I also need the same document returned when the query contains an extra word, such as "abc xyz pqr tst" (I understand the match score will be a little lower). With proximity, a phrase with the extra word does not match. If I remove the proximity operator and the double quotes from my query, I get the expected results, but also many false positives, such as documents containing only xyz or only abc.
In the same example, if somebody misspells the query as "abc xxz", I still want to get the same document.
I want to give n-grams a try, but I am not sure they will work as expected.
Any thoughts?

Try using BooleanQuery and FuzzyQuery, like this:
public void fuzzysearch(String querystr) throws Exception{
querystr=querystr.toLowerCase();
System.out.println("\n\n-------- Start fuzzysearch -------- ");
// 3. search
int hitsPerPage = 10;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
BooleanQuery bq = new BooleanQuery();
String[] searchWords = querystr.split(" ") ;
int id=0;
for(String word: searchWords ){
Query query = new FuzzyQuery(new Term(NAME,word));
if(id==0){
bq.add(query, BooleanClause.Occur.MUST);
}else{
bq.add(query, BooleanClause.Occur.SHOULD);
}
id++;
}
System.out.println("query ==> " + bq.toString());
searcher.search(bq, collector );
parseResults( searcher, collector ) ;
searcher.close();
reader.close(); // a searcher built from an existing reader does not close that reader
}
public void parseResults(IndexSearcher searcher, TopScoreDocCollector collector ) throws Exception {
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get(NAME));
}
}
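On the n-gram idea from the question: the sketch below is plain Java with no Lucene dependency and only illustrates why character n-grams tolerate a misspelling like "xxz" for "xyz". The class name NGramOverlap, the '_' padding, and the use of the Dice coefficient are assumptions made for illustration, not what Lucene's NGramTokenizer does internally. As far as I know, in the 3.x line NGramTokenizer lives in the separate analyzers contrib JAR (package org.apache.lucene.analysis.ngram) rather than in lucene-core, but verify that against your distribution.

```java
import java.util.HashSet;
import java.util.Set;

class NGramOverlap {

    // Returns the set of character n-grams of a word. The word is padded with
    // '_' on both sides so the first and last characters also form n-grams.
    static Set<String> ngrams(String word, int n) {
        StringBuilder padded = new StringBuilder();
        for (int i = 0; i < n - 1; i++) padded.append('_');
        padded.append(word);
        for (int i = 0; i < n - 1; i++) padded.append('_');

        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2 * |common n-grams| / (|grams of a| + |grams of b|).
    static double dice(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        Set<String> common = new HashSet<String>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        System.out.println(dice("xyz", "xxz", 2)); // 0.5: "_x" and "z_" are shared
        System.out.println(dice("xyz", "abc", 2)); // 0.0: nothing is shared
    }
}
```

Without the padding, "xyz" and "xxz" share no bigrams at all, so for very short terms the choice of n and of edge handling matters a lot.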

Related

AzureSearch filter not working as expected

I migrated my SQL data into an Azure Search index to try the new search experience, but I'm not able to filter data using the .NET SDK (3.0.4).
public IActionResult Search(string state, string category, string search, short pageNumber = 1, short pageSize = 10)
{
SearchIndexClient indexClient = new SearchIndexClient(searchServiceName, "search", new SearchCredentials(searchServiceApiKey));
DocumentSearchResult<SearchResultDto> results = null;
if (string.IsNullOrWhiteSpace(search))
search = "*";
if (state.Equals("All", StringComparison.InvariantCultureIgnoreCase))
state = string.Empty;
SearchParameters parameters = new SearchParameters()
{
Filter = "state eq " + state,
Top = pageSize,
Skip = (pageNumber - 1) * pageSize,
SearchMode = SearchMode.All,
IncludeTotalResultCount = true
};
try
{
results = indexClient.Documents.Search<SearchResultDto>(search, parameters);
return Ok(results.Results);
}
catch (Exception ex)
{
Console.WriteLine("Error querying index: {0}\r\n", ex.Message.ToString());
throw ex;
}
}
I'm getting the error "Exception has been thrown by the target of an invocation."
The raw value of parameters is: $count=true&$filter=state%20eq%20&queryType=simple&searchMode=all&$skip=0&$top=10
When I used the parameters value in the Azure Search explorer, I got this error:
Invalid expression: Expression expected at position 19 in 'state eq delhi eq '.\r\nParameter name: $filter
What is wrong with my code?
There are a few problems with your filter.
String literals in OData are delimited by single quotes. If you leave out the quotes, the string looks like a field name, but comparing fields to other fields is not allowed in Azure Search (also there is likely no field named delhi in your index). Try state eq 'delhi'.
The filter you tried with Search Explorer has an extra eq operator on the end: “state eq delhi eq ”. If you remove the extra eq and put single quotes around delhi, it should work.
Once you fix the syntax errors, the filter still might not work as intended. Filters are case-sensitive, so if the value you’re trying to match is actually ‘Delhi’ with a capital D, you won’t get a match. If the state field is matched to raw user input that might have the wrong case, it might be better to use the searchText parameter instead of Filter.
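The quoting rule can be sketched as a small helper. This is written in Java purely as an illustration; the class and method names (ODataFilter.eq) are hypothetical and not part of any Azure SDK. It wraps the value in single quotes, doubles any embedded single quote (the OData escape for string literals), and returns null when there is no value so the caller can leave the filter unset instead of sending "state eq ".

```java
class ODataFilter {

    // Builds "field eq 'value'" with the value quoted as an OData string literal.
    // Returns null when the value is missing, meaning "do not apply a filter".
    static String eq(String field, String value) {
        if (value == null || value.trim().isEmpty()) {
            return null;
        }
        // Inside an OData string literal, a single quote is escaped by doubling it.
        String escaped = value.replace("'", "''");
        return field + " eq '" + escaped + "'";
    }

    public static void main(String[] args) {
        System.out.println(eq("state", "delhi")); // state eq 'delhi'
        System.out.println(eq("state", ""));      // null
    }
}
```

In the original C# method the same idea would mean assigning Filter only when the state is non-empty, rather than always concatenating it.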

AppEngine full text search cursors broken in dev/unit test environment

I've noticed an inconsistency in the behavior of App Engine text search cursors in the devserver or unit test environment vs. production environments. The dev and unit test environments appear to exhibit a bug for cursors used in combination with sort expressions. Consider the following unit test code:
@Test
public void testQueryCursor( ) throws Exception
{
testQueryCursor("id_%02d"); // works
testQueryCursor("id_%d"); // fails
}
private void testQueryCursor( final String idFmt ) throws Exception
{
final int TEST_COUNT = 12;
final Index index =
SearchServiceFactory.getSearchService().getIndex(IndexSpec.newBuilder().setName("MY_TEST_IDX").build());
final List<String> docIds = new ArrayList<String>(TEST_COUNT);
try {
// populate some test data into an index
for (int i = 0; i < TEST_COUNT; i++) {
final String docId = String.format(idFmt, i);
final Document.Builder builder = Document.newBuilder().setId(docId);
builder.addField(Field.newBuilder().setName("some_field").setText("str1 " + docId)); // include varied docId in field for sorting
index.put(builder.build());
docIds.add(docId);
}
// for comparison to sorted search results
Collections.sort(docIds);
// define query options
final QueryOptions.Builder optionsBuilder =
QueryOptions
.newBuilder()
.setReturningIdsOnly(true)
.setLimit(10)
.setSortOptions(
SortOptions
.newBuilder()
.setLimit(20)
.addSortExpression(
SortExpression.newBuilder().setExpression("some_field")
.setDirection(SortDirection.ASCENDING).setDefaultValue("")));
// see https://developers.google.com/appengine/docs/java/search/results#Java_Using_cursors
// create an initial per-query cursor
Cursor cursor = Cursor.newBuilder().build();
final Iterator<String> idIter = docIds.iterator();
int batchIdx = 0;
do {
// build options and query
final QueryOptions options = optionsBuilder.setCursor(cursor).build();
final Query query = Query.newBuilder().setOptions(options).build("some_field : str1");
// search at least once
final Results<ScoredDocument> results = index.search(query);
int batchCount = 0;
for (final ScoredDocument match : results) {
batchCount++;
assertTrue(idIter.hasNext());
assertEquals(idIter.next(), match.getId());
System.out.println("Document " + match.getId() + " matched.");
}
System.out.println("Read " + batchCount + " results from batch " + ++batchIdx);
cursor = results.getCursor();
} while (cursor != null);
} finally {
index.delete(docIds);
}
}
If the assertEquals(idIter.next(), match.getId()); line is commented out, the full output of the previously failing call, testQueryCursor("id_%d"), can be observed, and we see that the proper ordering of results appears to be ignored. What's more, the last search performed with the cursor repeats the last two elements retrieved by the previous search call. Since these two elements SHOULD BE the last two returned from the search, perhaps this behavior is simply an artifact of the flaw that causes the improper sort.
This code can be easily run as a unit test as shown here or run from a JSP on the devserver and the behavior is consistent. When run as a JSP on a production instance of App Engine the behavior differs in that the search returns the correctly ordered results in all cases. It would be nice if the devserver environment and unit test tools were fixed to provide correct behavior consistent with production.
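For reference, here is a small standalone sketch of the order the test expects for each ID format. It reuses only the format strings and the count of 12 from the test above; the class name IdOrdering is made up for illustration. It shows that the two formats differ only in whether lexicographic order happens to coincide with numeric order, and it makes no claim about what the dev server does internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class IdOrdering {

    // Generates `count` IDs with the given format and sorts them the same way
    // the test does: plain lexicographic String ordering.
    static List<String> sortedIds(String idFmt, int count) {
        List<String> ids = new ArrayList<String>();
        for (int i = 0; i < count; i++) {
            ids.add(String.format(idFmt, i));
        }
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) {
        // Zero-padded: lexicographic order equals numeric order.
        System.out.println(sortedIds("id_%02d", 12));
        // Unpadded: id_10 and id_11 sort before id_2.
        System.out.println(sortedIds("id_%d", 12));
    }
}
```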

How to get each StateName TotalPopulation #2009

Stateid  StateName  Year  Population
1        andhra     2008  25000
2        andhra     2009  10000
3        ap         2008  15000
2        ap         2009  20000
How can I get the TotalPopulation for each StateName in 2009?
Here is a solution without using LINQ:
Dictionary<string, int> data = new Dictionary<string, int>(); // stores each word and its count
string inputString = "I love red color. He loves red color. She love red kit.";
var details = inputString.Split(' '); // split on spaces; you can also exclude non-alphabetic characters
foreach (var detail in details)
{
    // based on Ron's comment: skip empty entries, which appear when the string has consecutive spaces
    if (string.IsNullOrEmpty(detail))
        continue;
    if (data.ContainsKey(detail))
        data[detail]++;
    else
        data.Add(detail, 1);
}
What I did was break the string into an array using the Split function, then loop through each element and check whether it has already been seen. If it has, increment its count by 1; otherwise add the element to the dictionary with a count of 1.
class Program
{
static void Main(string[] args)
{
string inputString = "I love red color. He loves red color. She love red kit.";
Dictionary<string, int> dict = new Dictionary<string, int>();
var arr = inputString.Split(new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries); // drop the empty strings produced by ". "
foreach (string s in arr)
{
if (dict.ContainsKey(s))
dict[s] += 1;
else
dict.Add(s, 1);
}
foreach (var item in dict)
{
Console.WriteLine(item.Key + "- " + item.Value);
}
Console.ReadKey();
}
}
Try this way
string inputString = "I love red color. He loves red color. She love red kit.";
Dictionary<string, int> wordcount = new Dictionary<string, int>();
var words = inputString.Split(' ');
foreach (var word in words)
{
if (!wordcount.ContainsKey(word))
wordcount.Add(word, words.Count(p => p == word));
}
wordcount will have the output you are looking for. Note that it will have all entries for all words, so if you want for only a subset, then alter it to lookup against a master list.
Check the link given below.
Count Word
This example shows how to use a LINQ query to count the occurrences of a specified word in a string. Note that to perform the count, first the Split method is called to create an array of words. There is a performance cost to the Split method. If the only operation on the string is to count the words, you should consider using the Matches or IndexOf methods instead. However, if performance is not a critical issue, or you have already split the sentence in order to perform other types of queries over it, then it makes sense to use LINQ to count the words or phrases as well.
class CountWords
{
static void Main()
{
string text = @"Historically, the world of data and the world of objects" +
@" have not been well integrated. Programmers work in C# or Visual Basic" +
@" and also in SQL or XQuery. On the one side are concepts such as classes," +
@" objects, fields, inheritance, and .NET Framework APIs. On the other side" +
@" are tables, columns, rows, nodes, and separate languages for dealing with" +
@" them. Data types often require translation between the two worlds; there are" +
@" different standard functions. Because the object world has no notion of query, a" +
@" query can only be represented as a string without compile-time type checking or" +
@" IntelliSense support in the IDE. Transferring data from SQL tables or XML trees to" +
@" objects in memory is often tedious and error-prone.";
string searchTerm = "data";
//Convert the string into an array of words
string[] source = text.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' }, StringSplitOptions.RemoveEmptyEntries);
// Create and execute the query. It executes immediately
// because a singleton value is produced.
// Use ToLowerInvariant to match "data" and "Data"
var matchQuery = from word in source
where word.ToLowerInvariant() == searchTerm.ToLowerInvariant()
select word;
// Count the matches.
int wordCount = matchQuery.Count();
Console.WriteLine("{0} occurrences(s) of the search term \"{1}\" were found.", wordCount, searchTerm);
// Keep console window open in debug mode
Console.WriteLine("Press any key to exit");
Console.ReadKey();
}
}
/* Output:
3 occurrences(s) of the search term "data" were found.
*/
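For the question as asked (total population per StateName for 2009), here is a minimal sketch in Java. It assumes "#2009" means "for the year 2009", hard-codes the four rows from the question, and uses made-up names (PopulationByState, Row, totalsForYear); the Stateid column is ignored because it is not needed for the grouping.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PopulationByState {

    // One row of the table from the question (Stateid omitted).
    static final class Row {
        final String stateName;
        final int year;
        final int population;

        Row(String stateName, int year, int population) {
            this.stateName = stateName;
            this.year = year;
            this.population = population;
        }
    }

    // Sums Population per StateName, keeping only rows of the given year.
    static Map<String, Integer> totalsForYear(List<Row> rows, int year) {
        Map<String, Integer> totals = new TreeMap<String, Integer>();
        for (Row r : rows) {
            if (r.year != year) {
                continue;
            }
            Integer current = totals.get(r.stateName);
            totals.put(r.stateName, current == null ? r.population : current + r.population);
        }
        return totals;
    }

    // The sample rows exactly as given in the question.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("andhra", 2008, 25000),
                new Row("andhra", 2009, 10000),
                new Row("ap", 2008, 15000),
                new Row("ap", 2009, 20000));
    }

    public static void main(String[] args) {
        System.out.println(totalsForYear(sampleRows(), 2009)); // {andhra=10000, ap=20000}
    }
}
```

The same filter-then-accumulate loop carries over to a C# Dictionary<string, int>, much like the word-count loops above.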

Solr / Lucene: Get all field names sorted by number of occurrences in index

I want to get the list of all fields (i.e. field names) sorted by the number of times they occur in the Solr index, i.e.: most frequently occurring field, second most frequently occurring field and so on.
Alternatively, getting all fields in the index and the number of times they occur would also be sufficient.
How do I accomplish this either with a single solr query or through solr/lucene java API?
The set of fields is not fixed and ranges in the hundreds. Almost all fields are dynamic, except for id and perhaps a couple more.
As stated in "Solr: Retrieve field names from a solr index?", you can do this by using the LukeRequestHandler.
To do so, you need to enable the requestHandler in your solrconfig.xml:
<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />
and call it
http://solr:8983/solr/admin/luke?numTerms=0
If you want the fields sorted by something, you have to do the sorting yourself. I suggest using SolrJ if you are in a Java environment.
Fetch fields using Solrj
@Test
public void lukeRequest() throws SolrServerException, IOException {
SolrServer solrServer = new HttpSolrServer("http://solr:8983/solr");
LukeRequest lukeRequest = new LukeRequest();
lukeRequest.setNumTerms(1);
LukeResponse lukeResponse = lukeRequest.process(solrServer );
List<FieldInfo> sorted = new ArrayList<FieldInfo>(lukeResponse.getFieldInfo().values());
Collections.sort(sorted, new FieldInfoComparator());
for (FieldInfo infoEntry : sorted) {
System.out.println("name: " + infoEntry.getName());
System.out.println("docs: " + infoEntry.getDocs());
}
}
The comparator used in the example
public class FieldInfoComparator implements Comparator<FieldInfo> {
@Override
public int compare(FieldInfo fieldInfo1, FieldInfo fieldInfo2) {
if (fieldInfo1.getDocs() > fieldInfo2.getDocs()) {
return -1;
}
if (fieldInfo1.getDocs() < fieldInfo2.getDocs()) {
return 1;
}
return fieldInfo1.getName().compareTo(fieldInfo2.getName());
}
}

Lucene find index in multivalued field?

I have a multivalued field of names and I have to find the index of the matching value in the list.
DOC example:
profile_id: 1
names: [ "My name", "something", "My second name", "My nickname"]
query:
profile_id:1 AND names:"My secon name"~
Expected result:
my doc, and the index of the matched, 2
Is it possible?
SpanTermQuery matches documents just like TermQuery, but it also keeps track of the positions at which the term appears within each document.
Spans positionInfoForAllMatchingDocs = spanQuery.getSpans(..arguments..);
int position = -1;
while(positionInfoForAllMatchingDocs.next()){
position = positionInfoForAllMatchingDocs.start(); // Returns the start position of the current match.
System.out.println("Found match in the document with id: " + positionInfoForAllMatchingDocs.doc() + " at position: " + position); // You obviously want to replace this sysout with something elegant.
}
Make sure that the field, for which you are planning to retrieve the positional information, was indexed with Field.TermVector.WITH_POSITIONS or Field.TermVector.WITH_POSITIONS_AND_OFFSETS.
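A span position is a token position, not the index of the value in the list, so one more step is needed to get the "2" the question asks for. The sketch below is plain Java and rests on a simplified model that is my assumption, not verified Lucene behavior: tokens inside a value get consecutive positions, each following value starts a fixed gap after the previous one ends, and values are tokenized on whitespace only. The gap of 100 and the names ValueIndexFromPosition and valueIndexForPosition are hypothetical; check the model against your analyzer (for example its getPositionIncrementGap setting) before relying on it.

```java
import java.util.Arrays;
import java.util.List;

class ValueIndexFromPosition {

    // Maps a token position back to the index of the value that contains it.
    // Assumed model: value i occupies consecutive positions, and value i + 1
    // starts `gap` positions after the end of value i.
    static int valueIndexForPosition(List<String> values, int gap, int position) {
        int start = 0; // position of the first token of the current value
        for (int i = 0; i < values.size(); i++) {
            int tokens = values.get(i).trim().split("\\s+").length;
            if (position >= start && position < start + tokens) {
                return i;
            }
            start += tokens + gap;
        }
        return -1; // the position falls inside a gap or past the last value
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("My name", "something", "My second name", "My nickname");
        // Under the assumed model with gap = 100, "My second name" covers positions 203..205.
        System.out.println(valueIndexForPosition(names, 100, 204)); // prints 2
    }
}
```

The values themselves would come from the stored field of the matching document, which means the field has to be stored as well as indexed.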
