Matching entire sentence with spaces in lucene BooleanQuery - solr

I have the following indexed strings:
Tulip INN Riyadhh
Tulip INN Riyadhh LUXURY
Suites of Tulip INN RIYAHdhh
I need a search term such that if I enter
*Tulip INN Riyadhh*
it returns all three of the above. I am restricted to achieving this without QueryParser or an Analyzer; it has to use only BooleanQuery, WildcardQuery, etc.
Regards,
Raghavan

What you need here is a PhraseQuery. Let me explain.
I don't know which analyzer you're using, but for simplicity I'll suppose you have a very basic one that just converts text to lowercase. Don't tell me you're not using an analyzer: Lucene needs one to do any work, at least at the indexing stage, since it defines the tokenizer and the token filter chain.
Here's how your strings would be tokenized in this example:
tulip inn riyadhh
tulip inn riyadhh luxury
suites of tulip inn riyadhh
Notice how these all contain the token sequence tulip inn riyadhh (assuming the RIYAHdhh in your third string is a typo for Riyadhh). A sequence of tokens is what a PhraseQuery is looking for.
In Lucene.Net building such a query looks like this (untested):
var query = new PhraseQuery();
query.Add(new Term("propertyName", "tulip"));
query.Add(new Term("propertyName", "inn"));
query.Add(new Term("propertyName", "riyadhh"));
Note that the terms need to match those produced by the analyzer (in this example, they're all lowercase). The QueryParser does this job for you by running parts of the query through the analyzer, but you'll have to do it yourself if you don't use the parser.
Now, why wouldn't WildcardQuery or RegexQuery work in this situation? These queries always match a single term, yet you need to match an ordered sequence of terms. For instance a WildcardQuery with the term Riyadhh* would find all words starting with Riyadhh.
A BooleanQuery with a collection of TermQuery MUST clauses would match any text that happens to contain these 3 terms in any order - not exactly what you want either.
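To make the difference between a phrase match and a set of MUST clauses concrete, here is a small plain-Java sketch. It does not use Lucene at all: tokenize is a hypothetical stand-in for a lowercasing analyzer, and the two match methods only imitate what PhraseQuery and a BooleanQuery of MUST TermQuery clauses check.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PhraseVsBoolean {
    // Hypothetical stand-in for an analyzer: lowercase, then split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // PhraseQuery-like check: the terms must appear consecutively and in order.
    static boolean matchesPhrase(List<String> tokens, List<String> phrase) {
        return Collections.indexOfSubList(tokens, phrase) >= 0;
    }

    // BooleanQuery-with-MUST-like check: every term present, in any order.
    static boolean matchesAllTerms(List<String> tokens, List<String> terms) {
        return tokens.containsAll(terms);
    }

    public static void main(String[] args) {
        List<String> phrase = tokenize("Tulip INN Riyadhh");
        String[] texts = {
            "Tulip INN Riyadhh LUXURY",
            "Suites of Tulip INN Riyadhh",
            "Riyadhh INN near Tulip garden"
        };
        for (String text : texts) {
            List<String> tokens = tokenize(text);
            System.out.println(text
                    + " -> phrase=" + matchesPhrase(tokens, phrase)
                    + ", allTerms=" + matchesAllTerms(tokens, phrase));
        }
    }
}
```

The last sample text contains all three terms but not as a sequence, so only the all-terms check accepts it.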

Lucas has the right idea, but there is a more specialized MultiPhraseQuery that can be used to build up a query based on the data that is already in the index to get a prefix match as demonstrated in this unit test. The documentation of MultiPhraseQuery reads:
MultiPhraseQuery is a generalized version of PhraseQuery, with an added method Add(Term[]). To use this class, to search for the phrase "Microsoft app*" first use Add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using IndexReader.GetTerms(Term), and use MultiPhraseQuery.Add(Term[] terms) to add them to the query.
As Lucas pointed out, a WildcardQuery of the form *something is the way to do the suffix match, provided you understand the performance implications.
They can then be combined with a BooleanQuery to get the result you want.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.Collections.Generic;
namespace LuceneSQLLikeSearch
{
class Program
{
static void Main(string[] args)
{
// Prepare...
var dir = new RAMDirectory();
var writer = new IndexWriter(dir,
new IndexWriterConfig(LuceneVersion.LUCENE_48,
new StandardAnalyzer(LuceneVersion.LUCENE_48)));
WriteIndex(writer);
// Search...
var reader = writer.GetReader(false);
// Get all terms that end with tulip
var wildCardQuery = new WildcardQuery(new Term("field", "*tulip"));
var multiPhraseQuery = new MultiPhraseQuery();
multiPhraseQuery.Add(new Term("field", "inn"));
// Get all terms that start with riyadhh
multiPhraseQuery.Add(GetPrefixTerms(reader, "field", "riyadhh"));
var query = new BooleanQuery();
query.Add(wildCardQuery, Occur.SHOULD);
query.Add(multiPhraseQuery, Occur.SHOULD);
var result = ExecuteSearch(writer, query);
foreach (var item in result)
{
Console.WriteLine("Match: {0} - Score: {1:0.0########}",
item.Value, item.Score);
}
Console.ReadKey();
}
}
}
WriteIndex
public static void WriteIndex(IndexWriter writer)
{
Document document;
document = new Document();
document.Add(new TextField("field", "Tulip INN Riyadhh", Field.Store.YES));
writer.AddDocument(document);
document = new Document();
document.Add(new TextField("field", "Tulip INN Riyadhh LUXURY", Field.Store.YES));
writer.AddDocument(document);
document = new Document();
document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhh", Field.Store.YES));
writer.AddDocument(document);
document = new Document();
document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhhll", Field.Store.YES));
writer.AddDocument(document);
document = new Document();
document.Add(new TextField("field", "myTulip INN Riyadhh LUXURY", Field.Store.YES));
writer.AddDocument(document);
document = new Document();
document.Add(new TextField("field", "some bogus data that should not match", Field.Store.YES));
writer.AddDocument(document);
writer.Commit();
}
GetPrefixTerms
Here we scan the index to find all of the terms that start with the passed-in prefix. The terms are then added to the MultiPhraseQuery.
public static Term[] GetPrefixTerms(IndexReader reader, string field, string prefix)
{
var result = new List<Term>();
TermsEnum te = MultiFields.GetFields(reader).GetTerms(field).GetIterator(null);
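// Nothing to collect if the seek ran past the last term of the field
if (te.SeekCeil(new BytesRef(prefix)) == TermsEnum.SeekStatus.END)
{
return result.ToArray();
}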
do
{
string s = te.Term.Utf8ToString();
if (s.StartsWith(prefix, StringComparison.Ordinal))
{
result.Add(new Term(field, s));
}
else
{
break;
}
} while (te.Next() != null);
return result.ToArray();
}
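The scan above relies on the term dictionary being sorted: seek to the first term at or after the prefix, then read forward until a term no longer starts with it. A minimal plain-Java sketch of the same idea, using a TreeSet as a stand-in for the term dictionary (the class, method, and sample terms are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class PrefixScan {
    // Collects every term that starts with the prefix from a sorted term set.
    static List<String> prefixTerms(TreeSet<String> sortedTerms, String prefix) {
        List<String> result = new ArrayList<>();
        // tailSet(prefix) plays the role of SeekCeil: it starts at the first term >= prefix.
        for (String term : sortedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can start with the prefix
            }
            result.add(term);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical lowercased terms, roughly what the sample documents would produce.
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "inn", "luxury", "mytulip", "riyadhh", "riyahdhh", "riyahdhhll", "suites", "tulip"));
        System.out.println(prefixTerms(terms, "riya"));
        System.out.println(prefixTerms(terms, "riyadhh"));
        System.out.println(prefixTerms(terms, "zzz"));
    }
}
```

An empty tail set simply yields an empty result, which corresponds to the seek running past the last term.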
ExecuteSearch
public static IList<SearchResult> ExecuteSearch(IndexWriter writer, Query query)
{
var result = new List<SearchResult>();
var searcherManager = new SearcherManager(writer, true, null);
// Execute the search with a fresh indexSearcher
searcherManager.MaybeRefreshBlocking();
var searcher = searcherManager.Acquire();
try
{
var topDocs = searcher.Search(query, 10);
foreach (var scoreDoc in topDocs.ScoreDocs)
{
var doc = searcher.Doc(scoreDoc.Doc);
result.Add(new SearchResult
{
Value = doc.GetField("field")?.GetStringValue(),
// Results are automatically sorted by relevance
Score = scoreDoc.Score,
});
}
}
catch (Exception e)
{
Console.WriteLine(e.ToString());
}
finally
{
searcherManager.Release(searcher);
searcher = null; // Don't use searcher after this point!
}
return result;
}
SearchResult
public class SearchResult
{
public string Value { get; set; }
public float Score { get; set; }
}
If this seems cumbersome, note that QueryParser can mimic a "SQL LIKE" query. As pointed out here, there is an AllowLeadingWildcard option on QueryParser that makes it easy to build up the correct query sequence. It is unclear why you have a constraint that you can't use it, as it is definitely the simplest way to get the job done.

Related

Reading in all docs (doc id only if possible) from Solr without searching

I know Solr is meant to be used for searching.
However, I am doing some benchmarking, and I wonder if there is a way to retrieve the doc id of every indexed document.
The best option would be retrieving them without searching (if such a way exists).
I guess the alternative is to query all documents but only ask for the doc id.
I will be using SolrJ, so SolrJ operations would be most useful.
Use the /export end point: Exporting result sets.
It supports the same fl parameter as a regular search (although searching for just *:* will probably behave quite similarly when you're using SolrJ).
In SolrJ you'll have to use the CloudSolrStream class instead to properly stream the results (as compared to the regular behavior when searching for *:*).
From Joel Bernstein's example when introducing the feature:
import org.apache.solr.client.solrj.io.*;
import java.io.IOException;
import java.util.*;
public class StreamingClient {
public static void main(String args[]) throws IOException {
String zkHost = args[0];
String collection = args[1];
Map props = new HashMap();
props.put("q", "*:*");
props.put("qt", "/export");
props.put("sort", "fieldA asc");
props.put("fl", "fieldA,fieldB,fieldC");
CloudSolrStream cstream = new CloudSolrStream(zkHost,
collection,
props);
try {
cstream.open();
while(true) {
Tuple tuple = cstream.read();
if(tuple.EOF) {
break;
}
String fieldA = tuple.getString("fieldA");
String fieldB = tuple.getString("fieldB");
String fieldC = tuple.getString("fieldC");
System.out.println(fieldA + ", " + fieldB + ", " + fieldC);
}
} finally {
cstream.close();
}
}
}

Entity Framework : Create a model from Dictionary<TKey,TValue> to be mapped to a database table

Earlier I had a table named ApplicationConfiguration which simply had [Key],[Value] columns to store some config data. This was queried straight away using SQL queries.
Now I intend to make use of Entity Framework (EF) Code First approach to query this table. The specialty of this table is that the table will have only a fixed number of rows in its lifetime. Only the Value column can be updated.
So, as per the Code First approach, we first have to write our POCO classes, with properties that will be mapped to columns in the underlying table. However, I wish to have a Dictionary<> structure to represent these configuration key-value pairs. My concern is whether EF will be able to fire update queries when the value of a particular pair is updated.
Also, since I am using the Code First approach, I want some seed data (i.e. the fixed number of rows and their initial content) to be added after the table itself is created on the fly when the application is first executed.
If Dictionary<> cannot be used, please suggest some alternative. Thanks in advance.
Coded this way:
public class ApplicationConfiguration
{
public int Id { get; set; }
public string Key { get; set; }
public int Value { get; set; } // should be string, but I'm lazy
}
class Context : DbContext
{
internal class ContextInitializer : DropCreateDatabaseIfModelChanges<Context>
{
protected override void Seed(Context context)
{
var defaults = new List<ApplicationConfiguration>
{
new ApplicationConfiguration {Key = "Top", Value = 5},
new ApplicationConfiguration {Key = "Bottom", Value = 7},
new ApplicationConfiguration {Key = "Left", Value = 1},
new ApplicationConfiguration {Key = "Right", Value = 3}
};
// foreach (var c in defaults)
// context.ConfigurationMap.Add(c.Key, c); // by design, no IReadOnlyDictionary.Add
foreach (var c in defaults)
context.ApplicationConfigurations.Add(c);
base.Seed(context);
}
}
public Context()
{
Database.SetInitializer(new ContextInitializer());
}
private IDbSet<ApplicationConfiguration> ApplicationConfigurations
{
get { return Set<ApplicationConfiguration>(); }
}
public IReadOnlyDictionary<string, ApplicationConfiguration> ConfigurationMap
{
get { return ApplicationConfigurations.ToDictionary(kvp => kvp.Key, kvp => kvp); }
}
}
Used this way:
using (var context = new Context())
{
ReadConfigurationOnly(context.ConfigurationMap);
}
using (var context = new Context())
{
ModifyConfiguration(context.ConfigurationMap);
context.SaveChanges();
}
static void ReadConfigurationOnly(IReadOnlyDictionary<string, ApplicationConfiguration> configuration)
{
foreach (var k in configuration.Keys)
Console.WriteLine("{0} = {1}", k, configuration[k].Value);
}
static void ModifyConfiguration(IReadOnlyDictionary<string, ApplicationConfiguration> configuration)
{
foreach (var k in configuration.Keys)
configuration[k].Value++; // this is why I was lazy, using an int for a string
}
So, I wrote it up this way — using an int Value property rather than a string — just so I could run the "Used this way" code over and over, and see the database update each time, without having to come up with some other way to change Value in an interesting way.
It's not quite as nifty here to use an IReadOnlyDictionary<string, ApplicationConfiguration> instead of an IReadOnlyDictionary<string, string>, the way we'd really like, but that's more than made up for by the fact that we can easily modify our collection values without resorting to a clumsier Set method taking a dictionary as input. The drawback, of course, is that we have to settle for configuration[key].Value = "new value" rather than configuration[key] = "new value", but, as I say, I think it's worth it.
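The reason a read-only dictionary of entities is enough comes down to reference semantics: the dictionary holds references to the tracked row objects, so changing a value through the dictionary changes the row itself. A minimal plain-Java sketch of that idea only (Row and byKey are hypothetical names, and nothing here involves Entity Framework or change tracking):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ConfigMapSketch {
    // Hypothetical stand-in for one tracked row of the configuration table.
    static final class Row {
        final String key;
        int value;

        Row(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Indexes the rows by key; the map stores references to the rows, not copies.
    static Map<String, Row> byKey(List<Row> rows) {
        Map<String, Row> map = new LinkedHashMap<>();
        for (Row row : rows) {
            map.put(row.key, row);
        }
        return map;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row("Top", 5), new Row("Bottom", 7));
        Map<String, Row> config = byKey(rows);
        config.get("Top").value++; // mutates the same object that the row list holds
        System.out.println(rows.get(0).key + " = " + rows.get(0).value);
    }
}
```

In the C# answer the same effect lets SaveChanges see the modified entities, because the context still tracks the objects handed out through the dictionary.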
EDIT
Dang! I wrote this code up specifically to answer this question, but I think I like it so much, I'm going to add it to my bag of tricks ... this would fit in really well when my company goes from local databases to Azure instances in the cloud, and the current app.config has to go into the database.
Now all I need is a ContextInitializer taking a System.Configuration.ConfigurationManager as a ctor parameter in order to seed a new database from an existing app.config ...
I don't think you can map a table directly to a Dictionary; you will probably have to write your own wrapper to fill a dictionary from the table and update it back to the DB. Entities are each a row of a given table... Something like this (untested):
public Dictionary<string, string> GetDictionary()
{
    var dic = new Dictionary<string, string>();
    using (var db = new Context())
    {
        // Each entity is one row of the table; copy the rows into the dictionary
        foreach (var entry in db.ApplicationConfiguration)
        {
            dic.Add(entry.Key, entry.Value);
        }
    }
    return dic;
}
public void SaveConfig(Dictionary<string, string> dic)
{
    using (var db = new Context())
    {
        foreach (KeyValuePair<string, string> kvp in dic)
        {
            var key = kvp.Key;
            // Load the existing (tracked) row rather than creating a new entity
            var ac = db.ApplicationConfiguration.First(a => a.Key == key);
            if (ac.Value != kvp.Value)
            {
                // Changing a tracked entity marks it as modified
                ac.Value = kvp.Value;
            }
        }
        db.SaveChanges();
    }
}
For your second question, you want to use the Seed() method to add initial values to the database. See here for an example implementation.

Multi-dimension queryable HashMap Array

I have a CSV file that I'd like to store as a Java object. I would like the names of the columns to be the first dimension of the array, and key-value pairs to be the second dimension. I've tried different solutions (mostly LinkedHashMaps) but none seem to work properly.
The CSV looks like this:
TimeStamp;Column 1;Column 2;Column3
1385733406;Value1;Value12;Value13
1385733409;Value21;Value22;Value23
1385733411;Value31;Value32;Value33
I would like the array to look something like this:
["Column 1"]
["1385733406","Value1"]
["1385733409","Value21"]
["1385733411","Value31"]
["Column 2"]
["1385733406","Value12"]
["1385733409","Value22"]
["1385733411","Value32"]
["Column3"]
["1385733406","Value13"]
["1385733409","Value23"]
["1385733411","Value33"]
This way, I would be able to query the object and retrieve all the key-value pairs from a given column, for instance, all the data from Column 1. Using HashMaps doesn't seem to work because they require two type arguments, and nesting them doesn't seem to be the proper way. This is the code I have so far; I don't think it is on the right track, but it's all I could come up with. I'm using OpenJDK 1.7.
public class CsvCollection {
public Map<String,Map<String,Integer>> Results = new LinkedHashMap<String, Map<String,Integer>>();
public CsvCollection(){
}
public void parseCsvResultFile(File csvFile){
CSVReader reader = null;
List myEntries = null;
try {
reader = new CSVReader(new FileReader(csvFile.getAbsolutePath()), ';');
} catch (FileNotFoundException e) {
System.out.println("Error opening [], aborting parsing");
}
try {
myEntries = reader.readAll();
} catch (IOException e) {
System.out.println("Error reading content of CSV file (but file was opened)");
}
for(String header: (String[]) myEntries.get(0)){
Results.put(header, null);
// What now?
}
}
}
You can make the following 2 changes to implement the function you need.
1) Change the following code
public Map<String,Map<String,Integer>> Results = new LinkedHashMap<String, Map<String,Integer>>();
to
public Map<String, List<String[]>> Results = new LinkedHashMap<String, List<String[]>>();
This change is needed because a specific column, like Column 1, has 3 rows, each pairing a timestamp with the corresponding column value. You need a List to store them.
2) Change the following for-loop
for(String header: (String[]) myEntries.get(0)){
Results.put(header, null);
// What now?
}
to
String[] headerColumns = (String[]) myEntries.get(0);
// First column is TimeStamp, skip it
for (int i = 1; i < headerColumns.length; i++) {
List<String[]> list = new ArrayList<String[]>();
for (int rowIndex = 1; rowIndex < myEntries.size(); rowIndex++) {
String[] row = (String[]) myEntries.get(rowIndex);
list.add(new String[] { row[0], row[i] });
}
Results.put(headerColumns[i], list);
}
With the above 2 changes, if you print Results (of type Map<String, List<String[]>>) to the console using the following code,
for(Map.Entry<String, List<String[]>> entry : Results.entrySet())
{
System.out.printf("[%s]\n",entry.getKey());
for(String[] array : entry.getValue())
{
System.out.println(Arrays.toString(array));
}
}
you will get the result you need:
[Column 1]
[1385733406, Value1]
[1385733409, Value21]
[1385733411, Value31]
[Column 2]
[1385733406, Value12]
[1385733409, Value22]
[1385733411, Value32]
[Column3]
[1385733406, Value13]
[1385733409, Value23]
[1385733411, Value33]
Note: the above example is executed using the content from a CSV file below:
TimeStamp;Column 1;Column 2;Column3
1385733406;Value1;Value12;Value13
1385733409;Value21;Value22;Value23
1385733411;Value31;Value32;Value33
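For anyone who wants to try the pivot without the CSVReader dependency, here is a self-contained sketch of the same approach. It assumes well-formed input: every row has as many fields as the header, no field contains a quoted semicolon, and header names are unique. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CsvPivot {
    // Turns semicolon-separated lines into: column name -> list of {timestamp, value}.
    static Map<String, List<String[]>> pivot(List<String> lines) {
        Map<String, List<String[]>> results = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return results;
        }
        String[] header = lines.get(0).split(";");
        // Column 0 is the timestamp, so the data columns start at index 1.
        for (int col = 1; col < header.length; col++) {
            results.put(header[col], new ArrayList<String[]>());
        }
        for (String line : lines.subList(1, lines.size())) {
            String[] row = line.split(";");
            for (int col = 1; col < header.length; col++) {
                results.get(header[col]).add(new String[] { row[0], row[col] });
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "TimeStamp;Column 1;Column 2;Column3",
                "1385733406;Value1;Value12;Value13",
                "1385733409;Value21;Value22;Value23",
                "1385733411;Value31;Value32;Value33");
        for (Map.Entry<String, List<String[]>> entry : pivot(lines).entrySet()) {
            System.out.printf("[%s]%n", entry.getKey());
            for (String[] pair : entry.getValue()) {
                System.out.println(Arrays.toString(pair));
            }
        }
    }
}
```

All the data for one column is then available with a single lookup such as pivot(lines).get("Column 1").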

Why does this code - adding wordnet synonyms to index - fail?

I am writing this code as part of my CustomAnalyzer:
public class CustomAnalyzer extends Analyzer {
SynonymMap mySynonymMap = null;
CustomAnalyzer() throws IOException {
SynonymMap.Builder builder = new SynonymMap.Builder(true);
FileReader fr = new FileReader("/home/watsonuser/Downloads/wordnetSynonyms.txt");
BufferedReader br = new BufferedReader(fr);
String line = "";
while ((line = br.readLine()) != null) {
String[] synset = line.split(",");
for(String syn: synset)
builder.add(new CharsRef(synset[0]), new CharsRef(syn), true);
}
br.close();
fr.close();
try {
mySynonymMap = builder.build();
} catch (IOException e) {
System.out.println("Unable to build synonymMap");
e.printStackTrace();
}
}
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new PorterStemFilter(new SynonymFilter(
(new StopFilter(true,new LowerCaseFilter
(new StandardFilter(new StandardTokenizer
(Version.LUCENE_36,reader)
)
),StopAnalyzer.ENGLISH_STOP_WORDS_SET)), mySynonymMap, true)
);
return result;
}
}
Now, if I use the same CustomAnalyzer as part of my querying, then if I enter the query as
myFieldName: manager
it expands the query with synonyms for manager.
But, I want the synonyms to be part of only my index and I don't want my query to be expanded with synonyms.
So, when I removed the SynonymFilter from my CustomAnalyzer only when querying the index, the query remains as
myFieldName: manager
but, it fails to retrieve documents that have the synonyms of manager.
How do we solve this problem?
If you do not apply your synonym filter during query processing, then the only terms the query will match are the ones you mapped to during indexing. And you are not showing that part here.
The best way to troubleshoot this is to look at the Admin/Core/Analysis screen (in Solr 4+) and put your text in. It will show what happens to the text after each stage of the index-time and query-time analysis chains.
You don't even need to reindex. You can just define a bunch of different field types that you are trying to figure out and then run the analysis of sample sentences directly against those types.
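The point about "what you mapped to during indexing" can be shown with a toy inverted index. This is a plain-Java sketch, not Lucene: the class, its synonym map, and the sample text are all made up for illustration. With index-time-only expansion, an unexpanded query for manager finds a document containing boss only if the map sends boss to manager.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class IndexTimeSynonyms {
    // Toy inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Index-time synonym map: token -> extra terms indexed alongside it.
    private final Map<String, List<String>> synonyms;

    IndexTimeSynonyms(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    void index(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            add(token, docId);
            for (String syn : synonyms.getOrDefault(token, Collections.<String>emptyList())) {
                add(syn, docId);
            }
        }
    }

    private void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
    }

    // Query side: a plain term lookup with no synonym expansion.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        Map<String, List<String>> oneWay = new HashMap<>();
        oneWay.put("manager", Arrays.asList("boss"));
        IndexTimeSynonyms a = new IndexTimeSynonyms(oneWay);
        a.index(1, "the boss arrived");
        System.out.println("manager -> boss only: " + a.search("manager"));

        Map<String, List<String>> bothWays = new HashMap<>(oneWay);
        bothWays.put("boss", Arrays.asList("manager"));
        IndexTimeSynonyms b = new IndexTimeSynonyms(bothWays);
        b.index(1, "the boss arrived");
        System.out.println("both directions: " + b.search("manager"));
    }
}
```

If the synonym file maps only the first word of each line to the others, documents that contain one of the other words never get the first word added, which would produce the symptom described in the question.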

salesforce SOQL : query to fetch all the fields on the entity

I was going through the SOQL documentation, but couldn't find a query to fetch all the field data of an entity, say Account, like
select * from Account [ SQL syntax ]
Is there a syntax like the above in SOQL to fetch all the data of an Account, or is the only way to list all the fields (though there are a lot of fields to be queried)?
Create a map like this:
Map<String, Schema.SObjectField> fldObjMap = schema.SObjectType.Account.fields.getMap();
List<Schema.SObjectField> fldObjMapValues = fldObjMap.values();
Then you can iterate through fldObjMapValues to create a SOQL query string:
String theQuery = 'SELECT ';
for(Schema.SObjectField s : fldObjMapValues)
{
String theLabel = s.getDescribe().getLabel(); // Perhaps store this in another map
String theName = s.getDescribe().getName();
Schema.DisplayType theType = s.getDescribe().getType(); // Perhaps store this in another map
// Continue building your dynamic query string
theQuery += theName + ',';
}
// Trim last comma
theQuery = theQuery.subString(0, theQuery.length() - 1);
// Finalize query string
theQuery += ' FROM Account WHERE ... AND ... LIMIT ...';
// Make your dynamic call
Account[] accounts = Database.query(theQuery);
superfell is correct, there is no way to directly do a SELECT *. However, this little code recipe will work (well, I haven't tested it but I think it looks ok). Understandably Force.com wants a multi-tenant architecture where resources are only provisioned as explicitly needed - not easily by doing SELECT * when usually only a subset of fields are actually needed.
You have to specify the fields, if you want to build something dynamic the describeSObject call returns the metadata about all the fields for an object, so you can build the query from that.
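Once the field names are available from the describe metadata, turning them into a query string is a plain join. A minimal Java sketch of that step alone; the field names below are hard-coded for illustration rather than fetched from Salesforce, and buildSelectAll is a hypothetical helper:

```java
import java.util.Arrays;
import java.util.List;

final class SoqlSelectAll {
    // Builds "SELECT f1, f2, ... FROM objectName" from already-known field names.
    static String buildSelectAll(String objectName, List<String> fieldNames) {
        if (fieldNames.isEmpty()) {
            throw new IllegalArgumentException("SOQL needs at least one field");
        }
        // String.join places separators only between items, so no trailing comma to trim.
        return "SELECT " + String.join(", ", fieldNames) + " FROM " + objectName;
    }

    public static void main(String[] args) {
        // In real code these names would come from the describeSObject result.
        List<String> fields = Arrays.asList("Id", "Name", "Industry");
        System.out.println(buildSelectAll("Account", fields));
    }
}
```

Joining a list avoids the append-then-trim pattern for the last comma.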
I use the Force.com Explorer: within the schema filter you can click the checkbox next to the table name, and it will select all the fields and insert them into your query window. I use this as a shortcut to typing it all out; just copy and paste from the query window. Hope this helps.
In case anyone was looking for a C# approach, I was able to use reflection and come up with the following:
public IEnumerable<String> GetColumnsFor<T>()
{
return typeof(T).GetProperties(System.Reflection.BindingFlags.Public | System.Reflection.BindingFlags.Instance)
.Where(x => !Attribute.IsDefined(x, typeof(System.Xml.Serialization.XmlIgnoreAttribute))) // Exclude the ignored properties
.Where(x => x.DeclaringType != typeof(sObject)) // & Exclude inherited sObject propert(y/ies)
.Where(x => x.PropertyType.Namespace != typeof(Account).Namespace) // & Exclude properties storing references to other objects
.Select(x => x.Name);
}
It appears to work for the objects I've tested (and matches the columns generated by the API test). From there, it's about creating the query:
/* assume: this.service = new sForceService(); */
public IEnumerable<T> QueryAll<T>(params String[] columns)
    where T : sObject
{
    String soql = String.Format("SELECT {0} FROM {1}",
        String.Join(", ", GetColumnsFor<T>()),
        typeof(T).Name
    );
    this.service.QueryOptionsValue = new QueryOptions
    {
        batchSize = 250,
        batchSizeSpecified = true
    };
    ICollection<T> results = new HashSet<T>();
    try
    {
        Boolean done = false;
        QueryResult queryResult = this.service.queryAll(soql);
        while (!done)
        {
            // Guard against a null array when a batch contains no rows
            sObject[] records = queryResult.records ?? new sObject[0];
            foreach (sObject record in records)
            {
                T entity = record as T;
                if (entity != null)
                {
                    results.Add(entity);
                }
            }
            done = queryResult.done;
            if (!done)
            {
                // Fetch the next batch using the server-side query locator
                queryResult = this.service.queryMore(queryResult.queryLocator);
            }
        }
    }
    catch (Exception ex)
    {
        throw; // your exception handling
    }
    return results;
}
Today was my first time with Salesforce, and I came up with this in Java:
/**
 * @param o any class that extends {@link SObject}, e.g. Opportunity.class
 * @return a list of all the objects of this type
 */
@SuppressWarnings("unchecked")
public <O extends SObject> List<O> getAll(Class<O> o) throws Exception {
// get the objectName; for example "Opportunity"
String objectName= o.getSimpleName();
// this will give us all the possible fields of this type of object
DescribeSObjectResult describeSObject = connection.describeSObject(objectName);
// making the query
String query = "SELECT ";
for (Field field : describeSObject.getFields()) { // add all the fields in the SELECT
query += field.getName() + ',';
}
// trim last comma
query = query.substring(0, query.length() - 1);
query += " FROM " + objectName;
SObject[] records = connection.query(query).getRecords();
List<O> result = new ArrayList<O>();
for (SObject record : records) {
result.add((O) record);
}
return result;
}
I used the following to get complete records:
query_all("Select Id, Name From User_Profile__c")
To get all the fields of a record, we have to list those fields, as mentioned here:
https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_select.htm
Hope this helps!
