Why does this code - adding wordnet synonyms to index - fail?

I am writing this code as part of my CustomAnalyzer:
public class CustomAnalyzer extends Analyzer {
    SynonymMap mySynonymMap = null;
    CustomAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        FileReader fr = new FileReader("/home/watsonuser/Downloads/wordnetSynonyms.txt");
        BufferedReader br = new BufferedReader(fr);
        String line = "";
        while ((line = br.readLine()) != null) {
            String[] synset = line.split(",");
            for (String syn : synset)
                builder.add(new CharsRef(synset[0]), new CharsRef(syn), true);
        }
        br.close();
        fr.close();
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            System.out.println("Unable to build synonymMap");
            e.printStackTrace();
        }
    }
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Chain: tokenize -> standard filter -> lowercase -> stop words -> synonyms -> stemming
        TokenStream result = new PorterStemFilter(new SynonymFilter(
                (new StopFilter(true, new LowerCaseFilter
                        (new StandardFilter(new StandardTokenizer
                                (Version.LUCENE_36, reader)
                        )
                ), StopAnalyzer.ENGLISH_STOP_WORDS_SET)), mySynonymMap, true)
        );
        return result;
    }
}
Now, if I use the same CustomAnalyzer as part of my querying, then if I enter the query as
myFieldName: manager
it expands the query with synonyms for manager.
But, I want the synonyms to be part of only my index and I don't want my query to be expanded with synonyms.
So, when I removed the SynonymFilter from my CustomAnalyzer only when querying the index, the query remains as
myFieldName: manager
but, it fails to retrieve documents that have the synonyms of manager.
How do we solve this problem?

If the synonym filter is not applied during query processing, the query term can only match terms that were actually written to the index, so a match depends entirely on what your index-time analyzer mapped each token to. You are not showing the indexing part here.
The best way to troubleshoot this is the Admin/Core/Analysis screen (in Solr 4+). Put your text in, and it shows what happens to the text after each stage of the index and query analysis chains.
You don't even need to reindex. You can define several field types that you want to compare and then run the analysis of sample sentences directly against those types.
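To make the index-time-only idea concrete, here is a minimal sketch in plain Java. It is a toy inverted index, not Lucene or Solr code, and the class name, method names, and sample synset are all hypothetical. It also hints at one possible cause of the problem, assuming each line of wordnetSynonyms.txt lists a head word followed by its synonyms: builder.add(new CharsRef(synset[0]), new CharsRef(syn), true) only maps the first word of a line to the others, so a document that contains one of the later words is never indexed under the head word. The sketch registers each synset symmetrically instead.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (not Lucene) that expands synonyms at index time only. */
final class IndexTimeSynonyms {
    // Hypothetical synonym table: every member of a synset maps to the whole synset.
    private final Map<String, Set<String>> synonyms = new HashMap<>();
    // term -> ids of the documents indexed under that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Registers a synset symmetrically, so any member expands to all members. */
    void addSynset(String... words) {
        Set<String> synset = new TreeSet<>(Arrays.asList(words));
        for (String w : words) {
            synonyms.computeIfAbsent(w, k -> new TreeSet<>()).addAll(synset);
        }
    }

    /** Index-time analysis: lowercase, split on whitespace, then add synonyms. */
    void index(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) continue;
            Set<String> expanded = synonyms.getOrDefault(token, Collections.singleton(token));
            for (String term : expanded) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Query-time analysis: lowercase only, no synonym expansion. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        IndexTimeSynonyms idx = new IndexTimeSynonyms();
        idx.addSynset("manager", "director", "supervisor");
        idx.index(1, "The director approved it");
        idx.index(2, "Ask your manager");
        idx.index(3, "Unrelated text");
        // The unexpanded query still finds document 1, which only says "director".
        System.out.println(idx.search("manager"));
    }
}
```

With the synset registered in both directions, the plain query for manager returns documents 1 and 2 even though the query side never sees the synonym table.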

Related

Reading in all docs (doc id only if possible) from Solr without searching

I know Solr is meant to be used for searching.
However, I am doing some benchmarking and I wonder whether there is a way to retrieve the doc id of every indexed document.
The best option would be retrieving them without searching (if such a way exists).
I guess the alternative is to query all documents but only ask for the doc id.
I will be using SolrJ, so SolrJ operations would be the most useful.

Use the /export endpoint: Exporting result sets.
It supports the same fl parameter as a regular search (although searching for just *:* will probably behave quite similarly when you're using SolrJ).
In SolrJ you'll have to use the CloudSolrStream class instead to properly stream the results (as opposed to the regular behavior when searching for *:*).
From Joel Bernstein's example when introducing the feature:
import org.apache.solr.client.solrj.io.*;
import java.io.IOException;
import java.util.*;
public class StreamingClient {
    public static void main(String args[]) throws IOException {
        String zkHost = args[0];
        String collection = args[1];
        Map props = new HashMap();
        props.put("q", "*:*");
        props.put("qt", "/export");
        props.put("sort", "fieldA asc");
        props.put("fl", "fieldA,fieldB,fieldC");
        CloudSolrStream cstream = new CloudSolrStream(zkHost,
                                                      collection,
                                                      props);
        try {
            cstream.open();
            // Read tuples until the stream signals end-of-file
            while(true) {
                Tuple tuple = cstream.read();
                if(tuple.EOF) {
                    break;
                }
                String fieldA = tuple.getString("fieldA");
                String fieldB = tuple.getString("fieldB");
                String fieldC = tuple.getString("fieldC");
                System.out.println(fieldA + ", " + fieldB + ", " + fieldC);
            }
        } finally {
            cstream.close();
        }
    }
}

solrj returns the same result for different queries

Here is my code:-
SolrClient client = new HttpSolrClient.Builder("http://arlmsendeavour01:8983/solr/ImageMatch").build();
SolrQuery query = new SolrQuery();
query.setRequestHandler("/select");
//System.currentTimeMillis();
String q = "{!cache=false}*:*&debugQuery=true&sort=lirefunc(eh,\"opKg0dKEtZOSsaSBkfPChsTEopGykqHExYTEw5GylbKx8KKXkqHRww==\")+asc";
query.setQuery("q");
QueryResponse response = null;
try {
    response = client.query(query);
} catch (SolrServerException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
SolrDocumentList results = response.getResults();
for (int i = 0; i < results.size(); ++i) {
    System.out.println(results.get(i)/*.getFieldValue("id")*/);
}
I am using a function query, lirefunc, whose first parameter selects the feature type (color, edge, or texture) and whose second parameter is the feature extracted from the image. Every time I run the code, even for different images and different features, I get the same output, as if it were read straight from the Solr XML. The output stays the same for every type of query. Where am I going wrong?

query.setQuery("q"); - this sets the query to the literal string "q" rather than to the contents of your q variable. I'm certain that's not what you meant to do.
The setQuery method doesn't set the whole URL query string either - it only sets the value of the q parameter (the query) sent to Solr.
SolrJ has separate methods for each part of the request to Solr.
To set the sort= parameter, use addSort (with asc, to match the +asc in your original string):
query.addSort(SortClause.asc("lirefunc(eh,\"opKg0dKEtZOSsaSBkfPChsTEopGykqHExYTEw5GylbKx8KKXkqHRww==\")"));
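The reason the extra parameters cannot ride along inside q can be illustrated without SolrJ. The sketch below is plain Java with a hypothetical helper, toQueryString, and it uses "score asc" as a stand-in for the lirefunc sort. It assumes that an HTTP client such as SolrJ percent-encodes each parameter value in roughly this way before sending the request, so every & and = placed inside the q value arrives at Solr as part of the query text instead of as a parameter separator.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryParamEncoding {
    /** Builds a URL query string, percent-encoding each key and value separately. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Everything stuffed into q: the separators are encoded and lose their meaning.
        Map<String, String> stuffed = new LinkedHashMap<>();
        stuffed.put("q", "*:*&debugQuery=true&sort=score asc");
        System.out.println(toQueryString(stuffed));

        // One entry per parameter: the separators are real, so Solr sees three parameters.
        Map<String, String> separate = new LinkedHashMap<>();
        separate.put("q", "*:*");
        separate.put("debugQuery", "true");
        separate.put("sort", "score asc");
        System.out.println(toQueryString(separate));
    }
}
```

The first line printed contains no literal & at all, which is why the sort and debugQuery settings have to be passed through their own setters or parameters rather than appended to the query text.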

Matching entire sentence with spaces in lucene BooleanQuery

I have the following strings:
Tulip INN Riyadhh
Tulip INN Riyadhh LUXURY
Suites of Tulip INN RIYAHdhh
I need a search such that, if I enter the term
*Tulip INN Riyadhh*
it returns all three of the above. I have a restriction: I have to achieve this without QueryParser or an Analyzer, using only BooleanQuery, WildcardQuery, etc.
Regards,
Raghavan

What you need here is a PhraseQuery. Let me explain.
I don't know which analyzer you're using, but for simplicity I'll suppose you have a very basic one that just converts text to lowercase. Don't tell me you're not using an analyzer, since Lucene needs one to do any work, at least at the indexing stage - it is what defines the tokenizer and the token filter chain.
Here's how your strings would be tokenized in this example (assuming RIYAHdhh in the third string is a typo for Riyadhh):
tulip inn riyadhh
tulip inn riyadhh luxury
suites of tulip inn riyadhh
Notice how these all contain the token sequence tulip inn riyadhh. A sequence of tokens is what a PhraseQuery is looking for.
In Lucene.Net building such a query looks like this (untested):
var query = new PhraseQuery();
query.Add(new Term("propertyName", "tulip"));
query.Add(new Term("propertyName", "inn"));
query.Add(new Term("propertyName", "riyadhh"));
Note that the terms need to match those produced by the analyzer (in this example, they're all lowercase). The QueryParser does this job for you by running parts of the query through the analyzer, but you'll have to do it yourself if you don't use the parser.
Now, why wouldn't WildcardQuery or RegexQuery work in this situation? These queries always match a single term, yet you need to match an ordered sequence of terms. For instance a WildcardQuery with the term Riyadhh* would find all words starting with Riyadhh.
A BooleanQuery with a collection of TermQuery MUST clauses would match any text that happens to contain these 3 terms in any order - not exactly what you want either.
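The difference between the two behaviours can be shown on plain token lists. The following is a minimal Java sketch, not Lucene code: tokenize stands in for the simple lowercasing analyzer assumed above, and matchesPhrase and matchesAllTerms are hypothetical helpers that mimic what a phrase query and a set of MUST term clauses would accept. The third sample sentence is invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PhraseVsBoolean {
    /** Minimal stand-in for an analyzer: lowercase and split on whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Phrase semantics: the terms must appear consecutively and in order. */
    static boolean matchesPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    /** Boolean MUST semantics: every term must appear, in any position. */
    static boolean matchesAllTerms(List<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    public static void main(String[] args) {
        List<String> phrase = tokenize("Tulip INN Riyadhh");
        String[] docs = {
            "Tulip INN Riyadhh LUXURY",
            "Suites of Tulip INN Riyadhh",
            "Riyadhh has an INN called Tulip"
        };
        for (String d : docs) {
            List<String> tokens = tokenize(d);
            System.out.println(d + " -> phrase=" + matchesPhrase(tokens, phrase)
                    + ", allTerms=" + matchesAllTerms(tokens, phrase));
        }
    }
}
```

The last sentence contains all three terms but not as a consecutive sequence, so only the all-terms check accepts it.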

Lucas has the right idea, but there is a more specialized MultiPhraseQuery that can be used to build up a query based on the data that is already in the index to get a prefix match as demonstrated in this unit test. The documentation of MultiPhraseQuery reads:
MultiPhraseQuery is a generalized version of PhraseQuery, with an added method Add(Term[]). To use this class, to search for the phrase "Microsoft app*" first use Add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using IndexReader.GetTerms(Term), and use MultiPhraseQuery.Add(Term[] terms) to add them to the query.
As Lucas pointed out, a *something WildCardQuery is the way to do the suffix match, provided you understand the performance implications.
They can then be combined with a BooleanQuery to get the result you want.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.Collections.Generic;
namespace LuceneSQLLikeSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prepare...
            var dir = new RAMDirectory();
            var writer = new IndexWriter(dir,
                new IndexWriterConfig(LuceneVersion.LUCENE_48,
                    new StandardAnalyzer(LuceneVersion.LUCENE_48)));
            WriteIndex(writer);
            // Search...
            var reader = writer.GetReader(false);
            // Get all terms that end with tulip
            var wildCardQuery = new WildcardQuery(new Term("field", "*tulip"));
            var multiPhraseQuery = new MultiPhraseQuery();
            multiPhraseQuery.Add(new Term("field", "inn"));
            // Get all terms that start with riyadhh
            multiPhraseQuery.Add(GetPrefixTerms(reader, "field", "riyadhh"));
            var query = new BooleanQuery();
            query.Add(wildCardQuery, Occur.SHOULD);
            query.Add(multiPhraseQuery, Occur.SHOULD);
            var result = ExecuteSearch(writer, query);
            foreach (var item in result)
            {
                Console.WriteLine("Match: {0} - Score: {1:0.0########}",
                    item.Value, item.Score);
            }
            Console.ReadKey();
        }
    }
}
WriteIndex
public static void WriteIndex(IndexWriter writer)
{
    Document document;
    document = new Document();
    document.Add(new TextField("field", "Tulip INN Riyadhh", Field.Store.YES));
    writer.AddDocument(document);
    document = new Document();
    document.Add(new TextField("field", "Tulip INN Riyadhh LUXURY", Field.Store.YES));
    writer.AddDocument(document);
    document = new Document();
    document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhh", Field.Store.YES));
    writer.AddDocument(document);
    document = new Document();
    document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhhll", Field.Store.YES));
    writer.AddDocument(document);
    document = new Document();
    document.Add(new TextField("field", "myTulip INN Riyadhh LUXURY", Field.Store.YES));
    writer.AddDocument(document);
    document = new Document();
    document.Add(new TextField("field", "some bogus data that should not match", Field.Store.YES));
    writer.AddDocument(document);
    writer.Commit();
}
GetPrefixTerms
Here we scan the index to find all of the terms that start with the passed-in prefix. The terms are then added to the MultiPhraseQuery.
public static Term[] GetPrefixTerms(IndexReader reader, string field, string prefix)
{
    var result = new List<Term>();
    TermsEnum te = MultiFields.GetFields(reader).GetTerms(field).GetIterator(null);
    te.SeekCeil(new BytesRef(prefix));
    do
    {
        string s = te.Term.Utf8ToString();
        if (s.StartsWith(prefix, StringComparison.Ordinal))
        {
            result.Add(new Term(field, s));
        }
        else
        {
            break;
        }
    } while (te.Next() != null);
    return result.ToArray();
}
ExecuteSearch
public static IList<SearchResult> ExecuteSearch(IndexWriter writer, Query query)
{
    var result = new List<SearchResult>();
    var searcherManager = new SearcherManager(writer, true, null);
    // Execute the search with a fresh indexSearcher
    searcherManager.MaybeRefreshBlocking();
    var searcher = searcherManager.Acquire();
    try
    {
        var topDocs = searcher.Search(query, 10);
        foreach (var scoreDoc in topDocs.ScoreDocs)
        {
            var doc = searcher.Doc(scoreDoc.Doc);
            result.Add(new SearchResult
            {
                Value = doc.GetField("field")?.GetStringValue(),
                // Results are automatically sorted by relevance
                Score = scoreDoc.Score,
            });
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.ToString());
    }
    finally
    {
        searcherManager.Release(searcher);
        searcher = null; // Don't use searcher after this point!
    }
    return result;
}
SearchResult
public class SearchResult
{
    public string Value { get; set; }
    public float Score { get; set; }
}
If this seems cumbersome, note that QueryParser can mimic a "SQL LIKE" query. As pointed out here, there is an option to AllowLeadingWildCard on QueryParser to build up the correct query sequence easily. It is unclear why you have a constraint that you can't use it, as it is definitely the simplest way to get the job done.

Selenium Webdriver - passing bulk data with excel sheet by header name- more than 50 fields of form

I am looking for a way to pass hundreds of records to a form that has more than 50 fields. I did some research on TestNG data providers, but it looks like they return only strings, and passing 50 string arguments to a single function does not seem feasible. I also looked into reading an Excel file and found two options, jxl and Apache POI, but with neither was I able to read the data by column header. I cannot use the row-and-column-number approach because there are so many fields to work with: if a field is added to a form in the future, the indices would all have to be reworked, which is not feasible.
I have been following this link:
http://www.softwaretestinghelp.com/selenium-framework-design-selenium-tutorial-21/
for reading data column-wise, but I am still not getting the records based on the column header. Is there any other way to achieve this?
Thanks

"testNG data providers but it looks like that it returns only strings" - incorrect. It allows you to return a multidimensional array of type Object. What kind of object you create is your own code. You may choose to read from the excel, encapsulate all the fields in one object (your own pojo) or multiple objects and then the method argument can have just those object types declared and not the 50 strings.
Both jxl and POI are libraries for interacting with Excel. If you want a specific kind of interaction with Excel, like reading based on the header, then you need to write code for that - it doesn't come out of the box.
If you are concerned about the addition of one more column, build your indices first by reading the header row, put them in a suitable data structure, and then go about reading your data.
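The header-first idea can be sketched independently of jxl or POI. In the plain Java example below, the sheet is represented as a list of rows of strings, which is only a stand-in for whatever the Excel library returns, and the class name, the toRecords helper, and the sample column titles are hypothetical. The first row is read as the header, and every later row becomes a map keyed by column title, so adding a column to the sheet does not shift any lookups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderKeyedRows {
    /** Converts rows (first row = header) into one map per data row, keyed by column title. */
    static List<Map<String, String>> toRecords(List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.isEmpty()) return records;
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> record = new LinkedHashMap<>();
            // Pair each cell with the title at the same position; ignore cells without a title.
            for (int col = 0; col < header.size() && col < row.size(); col++) {
                record.put(header.get(col), row.get(col));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sheet contents; a real test would read these rows from the Excel file.
        List<List<String>> sheet = Arrays.asList(
            Arrays.asList("Testcase", "FirstName", "City"),
            Arrays.asList("Yes", "Ada", "London"),
            Arrays.asList("No", "Linus", "Helsinki"));
        for (Map<String, String> record : toRecords(sheet)) {
            System.out.println(record.get("FirstName") + " / " + record.get("City"));
        }
    }
}
```

A TestNG data provider could then hand such a list of maps (or POJOs built from them) to the test method as a single Object argument instead of 50 separate strings.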

I finally achieved this with the help of Apache POI. I created one centralized function that returns a list of maps, each keyed by the column title.
Here are my main test function, the data provider, and that function (getXLSData):
@Test(dataProvider="dpCreateNewCust")
public void createNewCustomer(List<Map<String, String>> sheetList){
    try{
        //Step 2. Login
        UtilityMethods.SignIn();
        for(Map<String, String> map : sheetList){
            //Step 3. New Customer
            //Constant first, so a row without a "Testcase" value does not throw
            if("Yes".equals(map.get("Testcase")))
            {
                //Process with excel data
                ProcessNewCustomer(map);
            }
        }
    }
    catch(InterruptedException e)
    {
        System.out.println ("Login Exception Raised: <br> The exception get caught" + e);
    }
}
//My data provider
@DataProvider(name = "dpCreateNewCust")
public Object[][] dpCreateNewCust(){
    XLSfilename = System.getProperty("user.dir")+"//src//watts//XLSFiles//testcust.xlsx";
    List<Map<String, String>> arrayObject = UtilityMethods.getXLSData(XLSfilename,Sheetname);
    return new Object[][] { {arrayObject } };
}
//----GetXLSData Method in UtilityMethods Class :
public static List<Map<String, String>> getXLSData(String filename, String sheetname)
{
    List<String> titleList = new ArrayList<String>();
    List<Map<String, String>> sheetList = new ArrayList<Map<String, String>>();
    try {
        FileInputStream file = new FileInputStream(filename);
        //Get the workbook instance for XLS file
        XSSFWorkbook XLSbook = new XSSFWorkbook(file);
        //Get first sheet from the workbook
        //HSSFSheet sheet = workbook.getSheetAt(0);
        WorkSheet = XLSbook.getSheet(sheetname);
        //Iterate through each rows from first sheet
        int i = 0;
        Iterator<Row> rowIterator = WorkSheet.iterator();
        while(rowIterator.hasNext()) {
            Row row = rowIterator.next();
            //For each row, iterate through each columns
            Iterator<Cell> cellIterator = row.cellIterator();
            int j = 0;
            Map<String, String> valueMap = new HashMap<>();
            while(cellIterator.hasNext()) {
                Cell cell = cellIterator.next();
                if(i==0){
                    //First row: collect the column titles
                    titleList.add(cell.getStringCellValue());
                }
                else
                {
                    String cellval = "";
                    switch(cell.getCellType()) {
                        case Cell.CELL_TYPE_BOOLEAN:
                            cellval = cell.getBooleanCellValue()+"";
                            break;
                        case Cell.CELL_TYPE_NUMERIC:
                            cellval = String.valueOf(cell.getNumericCellValue());
                            break;
                        case Cell.CELL_TYPE_STRING:
                            cellval = cell.getStringCellValue();
                            break;
                        default:
                            break;
                    }
                    //Compare string contents, not references
                    if(!cellval.isEmpty())
                    {
                        valueMap.put(titleList.get(j), cellval);
                        valueMap.put("ResultRow",String.valueOf(row.getRowNum()));
                        valueMap.put("ResultCol",String.valueOf(0));
                    }
                }
                j++;
            }
            if(i!=0 && !valueMap.isEmpty()){
                //System.out.println(valueMap);
                sheetList.add(valueMap);
            }
            i++;
        }
        //System.out.println(sheetList); System.exit(0);
        file.close();
        XLSbook.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sheetList;
}

I am not able to create a collection in RavenDB if I do not have any document in it

I created a document in RavenDB. Using session.Advanced.GetMetadataFor (see the code), I set Raven-Entity-Name in the metadata, and then I deleted that document in the same function. After that, the collection was removed as well. If I delete the document manually from Raven Studio, the collection remains in the database. How can a collection persist from code even when it contains no documents? Thanks in advance!
My C# code is:
public CreateCollectionResult CreateCollection(string databaseName, string collectionName)
{
    CreateCollectionResult createCollectionResult = new CreateCollectionResult();
    Collection collection1234 = new Collection();
    try
    {
        using (var session = documentStore.OpenSession(databaseName))
        {
            Guid guid = new Guid("12345678-1111-1111-2222-000000000000");
            session.Store(collection1234, guid, "april-Days/10");
            session.Advanced.GetMetadataFor<Collection>(collection1234)[Constants.RavenEntityName] = collectionName;
            //session.Delete<Collection>(collection1234);
            session.SaveChanges();
            createCollectionResult.IsOperationSuccessfull = true;
        }
    }
    //exception if database not found
    catch (InvalidOperationException ex)
    {
        createCollectionResult.IsOperationSuccessfull = false;
        createCollectionResult.Error = ex;
    }
    return createCollectionResult;
}

In RavenDB, collections are virtual: they only exist as long as there is at least one document in that collection.
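What "virtual" means here can be pictured with a toy model. The Java sketch below is not the RavenDB API; the class and its store, delete, and collections methods are hypothetical. It only assumes, as stated above, that a collection is nothing more than the set of entity names found in the metadata of the documents that currently exist.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not RavenDB) of collections derived from document metadata. */
final class VirtualCollections {
    // document id -> entity name kept in that document's metadata
    private final Map<String, String> entityNameById = new HashMap<>();

    void store(String id, String entityName) {
        entityNameById.put(id, entityName);
    }

    void delete(String id) {
        entityNameById.remove(id);
    }

    /** Collections are computed from the documents; nothing is stored per collection. */
    Set<String> collections() {
        return new TreeSet<>(entityNameById.values());
    }

    public static void main(String[] args) {
        VirtualCollections db = new VirtualCollections();
        db.store("april-Days/10", "AprilDays");
        System.out.println(db.collections()); // the collection is visible
        db.delete("april-Days/10");
        System.out.println(db.collections()); // no documents left, so no collection
    }
}
```

Under this model there is no way to keep an empty collection; a common workaround is to leave a placeholder document in it.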
