Lucene Index grown too large - sql-server

I have a Lucene index whose .FDT file is about 5 GB. I add records to it very often (about 1,000 per day) and none are ever deleted. It has 5 fields, and only one of them is the text content of an HTML page. I also run a query parser on this index to look for some keywords. Even though the index is optimized every time I insert, it takes almost a minute to find a keyword in the HTML text content. Has anyone run into this problem, and does anyone have suggestions on how to resolve it?
These are the steps my code performs:
1. Using a SqlDataReader, get the contents of the table, which contains title, EmployeeID, headline (a short description of the employee's department), date (the date this employee was added to the table or their info last changed), and data (an HTML version of the employee details).
2. For each record in the table, do the following:
string body = StripHtmlText(data); // pseudocode: strip the plain text out of the HTML page/data
var doc = new Document();
doc.Add(new Field("title", staticname, Field.Store.YES, Field.Index.ANALYZED)); // title is always "Employee info"
doc.Add(new Field("Employeeid", keyid.Replace(",", " "), Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("headline", head, Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("date", DateTools.DateToString(date, DateTools.Resolution.SECOND), Field.Store.YES, Field.Index.NOT_ANALYZED));
if (data == null)
    data = "";
else if (data.Length > 500)
    data = data.Substring(0, 500);
doc.Add(new Field("body", data, Field.Store.YES, Field.Index.ANALYZED));
indexWriter.AddDocument(doc);
indexWriter.Optimize();
indexWriter.Commit();
indexWriter.Dispose();
In the search program:

string searchword = "disability";
QueryParser queryParser = new QueryParser(VERSION, "body", analyzer);
string word = "+Employeeid:" + Employeeid + " +body:" + searchword;
Query query = queryParser.Parse(word);
try
{
    IndexReader reader = IndexReader.Open(luceneIndexDirectory, true);
    Searcher indexSearch = new IndexSearcher(reader);
    TopDocs hits = indexSearch.Search(query, 1);
    if (hits.TotalHits > 0)
    {
        float score = hits.ScoreDocs[0].Score;
        if (score > MINSCORE)
        {
            results.Add(result); // results is a list that holds EmployeeID, searchwordID, searchword, score
        }
    }
    indexSearch.Dispose();
    reader.Dispose();
    indexWriter.Dispose();
}
catch (Exception ex)
{
    // handle/log the exception
}
Any input is appreciated.
Thanks
M

Do not store the body and headline fields in your index:
doc.Add(new Field("headline", head, Field.Store.NO, Field.Index.ANALYZED));
doc.Add(new Field("body", data, Field.Store.NO, Field.Index.ANALYZED));
Storing them is useless for search; stored fields only grow the stored-fields (.fdt) file.
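A minimal sketch of the document-building step with this change applied (same field names and Lucene.NET 3.x API as in the question); the large text fields stay searchable but are no longer written to the .fdt file:
var doc = new Document();
// Stored + indexed: small fields you want returned with search results
doc.Add(new Field("Employeeid", keyid.Replace(",", " "), Field.Store.YES, Field.Index.ANALYZED));
doc.Add(new Field("date", DateTools.DateToString(date, DateTools.Resolution.SECOND), Field.Store.YES, Field.Index.NOT_ANALYZED));
// Indexed only: searchable, but not stored in the .fdt file
doc.Add(new Field("title", staticname, Field.Store.NO, Field.Index.ANALYZED));
doc.Add(new Field("headline", head, Field.Store.NO, Field.Index.ANALYZED));
doc.Add(new Field("body", data, Field.Store.NO, Field.Index.ANALYZED));
indexWriter.AddDocument(doc);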

Related

Advanced search text box in Angular

I am working on an online search project.
I want to make a search textbox that works like this: say I want to search for a book that Joe wrote, whose title is "my book" and whose publisher is "tia".
If I type "joe my book tia" or "tia joe book" in the search box,
I should get a result for it.
"tia" comes from a table in the SQL database,
"joe" comes from a table in the SQL database,
"my book" comes from a table in the SQL database.
Can somebody help me?
You can easily accomplish that if you add another column to your data table (in your database) that contains the concatenated data from the other three columns, and then search against that column.
writer | title   | publisher | search_column
joe    | my book | tia       | joe my book tia
Then you can make a SQL query that searches that column with LIKE.
Here is an example with ExecuteReader.
var query = "select * from my_table where 1 = 1 " + filterQuery;
Create parameters:
public static SqlParameter AddSqlParameter(string parameterName, object value)
{
    var p = new SqlParameter(parameterName, value);
    return p;
}
List<SqlParameter> sqlParameters = new List<SqlParameter>();
var filterQuery = "";
This will split your search input by spaces:
string[] words = searchInput.Split(' ');
Loop through your search terms and add one parameter for each term found:
for (int i = 0; i < words.Length; i++)
{
    // "%" wildcards make LIKE behave as a contains-match for each term
    sqlParameters.Add(AddSqlParameter("@p" + i.ToString(), "%" + words[i] + "%"));
    filterQuery = filterQuery + " AND search_column LIKE @p" + i.ToString();
}
Add your search query and parameters to ExecuteReader:
public static List<T> ExecuteReader<T>(string commandText, List<SqlParameter> parameters) where T : new()
{
    List<T> output = new List<T>();
    using (SqlConnection con = new SqlConnection(mySetting.ConnectionString))
    using (SqlCommand cmd = new SqlCommand(commandText, con))
    {
        cmd.Parameters.AddRange(parameters.ToArray());
        con.Open();
        using (SqlDataReader rdr = cmd.ExecuteReader())
        {
            while (rdr.Read())
            {
                T t = new T();
                for (int i = 0; i < rdr.FieldCount; i++)
                {
                    Type type = t.GetType();
                    PropertyInfo prop = type.GetProperty(rdr.GetName(i));
                    if (prop != null)
                    {
                        prop.SetValue(t, rdr.GetValue(i) is DBNull ? null : rdr.GetValue(i), null);
                    }
                }
                output.Add(t);
            }
            return output;
        }
    }
}
Call your ExecuteReader like this:
var result = ExecuteReader<myClass>(query, sqlParameters);
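The reflection mapping in ExecuteReader only fills properties whose names match the column names returned by the query, so myClass needs matching public properties. A hypothetical example for the sample table above:
public class myClass
{
    // Property names must match the column names returned by the query
    public string writer { get; set; }
    public string title { get; set; }
    public string publisher { get; set; }
    public string search_column { get; set; }
}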
If you have further questions, just ask.

Matching entire sentence with spaces in lucene BooleanQuery

I have these strings:
Tulip INN Riyadhh
Tulip INN Riyadhh LUXURY
Suites of Tulip INN RIYAHdhh
I need a search term such that, if I specify
*Tulip INN Riyadhh*
it has to return all three of the above. I have a restriction that I have to achieve this without QueryParser or an Analyzer; it has to be done only with BooleanQuery/WildcardQuery/etc.
Regards,
Raghavan
What you need here is a PhraseQuery. Let me explain.
I don't know which analyzer you're using, but for simplicity I'll assume a very basic one that just converts text to lowercase. Don't tell me you're not using an analyzer: one is mandatory for Lucene to do any work, at least at the indexing stage, since it defines the tokenizer and the token filter chain.
Here's how your strings would be tokenized in this example:
tulip inn riyadhh
tulip inn riyadhh luxury
suites of tulip inn riyadhh
Notice how these all contain the token sequence tulip inn riyadhh. A sequence of tokens is exactly what a PhraseQuery looks for.
In Lucene.Net building such a query looks like this (untested):
var query = new PhraseQuery();
query.Add(new Term("propertyName", "tulip"));
query.Add(new Term("propertyName", "inn"));
query.Add(new Term("propertyName", "riyadhh"));
Note that the terms need to match those produced by the analyzer (in this example, they're all lowercase). The QueryParser does this job for you by running parts of the query through the analyzer, but you'll have to do it yourself if you don't use the parser.
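For example, here is a minimal sketch (assuming Lucene.NET 4.8 and that the index-time analyzer is available at search time) of running the raw phrase through the analyzer and feeding the resulting tokens to a PhraseQuery; the field name "propertyName" is just a placeholder:
// using System.IO; using Lucene.Net.Analysis; using Lucene.Net.Analysis.TokenAttributes;
// using Lucene.Net.Index; using Lucene.Net.Search;
static PhraseQuery BuildAnalyzedPhraseQuery(Analyzer analyzer, string field, string phrase)
{
    var query = new PhraseQuery();
    // Tokenize the phrase exactly as the analyzer would at index time
    using (TokenStream ts = analyzer.GetTokenStream(field, new StringReader(phrase)))
    {
        var termAttr = ts.AddAttribute<ICharTermAttribute>();
        ts.Reset();
        while (ts.IncrementToken())
        {
            // Each analyzed token becomes the next term of the phrase, in order
            query.Add(new Term(field, termAttr.ToString()));
        }
        ts.End();
    }
    return query;
}
// Usage: var query = BuildAnalyzedPhraseQuery(analyzer, "propertyName", "Tulip INN Riyadhh");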
Now, why wouldn't WildcardQuery or RegexQuery work in this situation? These queries always match a single term, yet you need to match an ordered sequence of terms. For instance a WildcardQuery with the term Riyadhh* would find all words starting with Riyadhh.
A BooleanQuery with a collection of TermQuery MUST clauses would match any text that happens to contain these 3 terms in any order - not exactly what you want either.
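For contrast, a rough sketch of that MUST-clause BooleanQuery; it only requires all three terms to appear somewhere in the field, so it is not a phrase match:
var anyOrder = new BooleanQuery();
anyOrder.Add(new TermQuery(new Term("propertyName", "tulip")), Occur.MUST);
anyOrder.Add(new TermQuery(new Term("propertyName", "inn")), Occur.MUST);
anyOrder.Add(new TermQuery(new Term("propertyName", "riyadhh")), Occur.MUST);
// Also matches e.g. "riyadhh luxury inn ... tulip" - order and adjacency are ignored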
Lucas has the right idea, but there is a more specialized MultiPhraseQuery that can be used to build up a query based on the data that is already in the index to get a prefix match as demonstrated in this unit test. The documentation of MultiPhraseQuery reads:
MultiPhraseQuery is a generalized version of PhraseQuery, with an added method Add(Term[]). To use this class, to search for the phrase "Microsoft app*" first use Add(Term) on the term "Microsoft", then find all terms that have "app" as prefix using IndexReader.GetTerms(Term), and use MultiPhraseQuery.Add(Term[] terms) to add them to the query.
As Lucas pointed out, a *something WildcardQuery is the way to do the suffix match, provided you understand the performance implications.
They can then be combined with a BooleanQuery to get the result you want.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.Collections.Generic;
namespace LuceneSQLLikeSearch
{
    class Program
    {
        static void Main(string[] args)
        {
            // Prepare...
            var dir = new RAMDirectory();
            var writer = new IndexWriter(dir,
                new IndexWriterConfig(LuceneVersion.LUCENE_48,
                    new StandardAnalyzer(LuceneVersion.LUCENE_48)));
            WriteIndex(writer);

            // Search...
            var reader = writer.GetReader(false);

            // Get all terms that end with tulip
            var wildCardQuery = new WildcardQuery(new Term("field", "*tulip"));

            var multiPhraseQuery = new MultiPhraseQuery();
            multiPhraseQuery.Add(new Term("field", "inn"));
            // Get all terms that start with riyadhh
            multiPhraseQuery.Add(GetPrefixTerms(reader, "field", "riyadhh"));

            var query = new BooleanQuery();
            query.Add(wildCardQuery, Occur.SHOULD);
            query.Add(multiPhraseQuery, Occur.SHOULD);

            var result = ExecuteSearch(writer, query);
            foreach (var item in result)
            {
                Console.WriteLine("Match: {0} - Score: {1:0.0########}",
                    item.Value, item.Score);
            }
            Console.ReadKey();
        }
    }
}
WriteIndex
public static void WriteIndex(IndexWriter writer)
{
    Document document;

    document = new Document();
    document.Add(new TextField("field", "Tulip INN Riyadhh", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "Tulip INN Riyadhh LUXURY", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhh", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "Suites of Tulip INN RIYAHdhhll", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "myTulip INN Riyadhh LUXURY", Field.Store.YES));
    writer.AddDocument(document);

    document = new Document();
    document.Add(new TextField("field", "some bogus data that should not match", Field.Store.YES));
    writer.AddDocument(document);

    writer.Commit();
}
GetPrefixTerms
Here we scan the index to find all of the terms that start with the passed-in prefix. The terms are then added to the MultiPhraseQuery.
public static Term[] GetPrefixTerms(IndexReader reader, string field, string prefix)
{
    var result = new List<Term>();
    TermsEnum te = MultiFields.GetFields(reader).GetTerms(field).GetIterator(null);
    te.SeekCeil(new BytesRef(prefix));
    do
    {
        string s = te.Term.Utf8ToString();
        if (s.StartsWith(prefix, StringComparison.Ordinal))
        {
            result.Add(new Term(field, s));
        }
        else
        {
            break;
        }
    } while (te.Next() != null);
    return result.ToArray();
}
ExecuteSearch
public static IList<SearchResult> ExecuteSearch(IndexWriter writer, Query query)
{
    var result = new List<SearchResult>();
    var searcherManager = new SearcherManager(writer, true, null);

    // Execute the search with a fresh indexSearcher
    searcherManager.MaybeRefreshBlocking();
    var searcher = searcherManager.Acquire();
    try
    {
        var topDocs = searcher.Search(query, 10);
        foreach (var scoreDoc in topDocs.ScoreDocs)
        {
            var doc = searcher.Doc(scoreDoc.Doc);
            result.Add(new SearchResult
            {
                Value = doc.GetField("field")?.GetStringValue(),
                // Results are automatically sorted by relevance
                Score = scoreDoc.Score,
            });
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.ToString());
    }
    finally
    {
        searcherManager.Release(searcher);
        searcher = null; // Don't use searcher after this point!
    }
    return result;
}
SearchResult
public class SearchResult
{
    public string Value { get; set; }
    public float Score { get; set; }
}
If this seems cumbersome, note that QueryParser can mimic a "SQL LIKE" query. As pointed out here, there is an AllowLeadingWildcard option on QueryParser that builds up the correct query sequence easily. It is unclear why you have a constraint that you can't use it, as it is definitely the simplest way to get the job done.
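A minimal sketch of that simpler route (assuming Lucene.NET 4.8 and the same "field"/analyzer used above); the parsed query expands to roughly field:*tulip field:inn field:riyadhh* :
// using Lucene.Net.QueryParsers.Classic;
var parser = new QueryParser(LuceneVersion.LUCENE_48, "field",
    new StandardAnalyzer(LuceneVersion.LUCENE_48));
parser.AllowLeadingWildcard = true; // off by default, since leading wildcards can be slow
Query likeQuery = parser.Parse("*tulip inn riyadhh*");
var likeResult = ExecuteSearch(writer, likeQuery);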

SQL Server 2008 changed table name bizarre behavior

I changed the name of one of my tables, and afterwards encoded some data, then pulled it using a view. To my surprise the data is not showing. I tried renaming the table back to its original name, with no luck; the same thing is happening.
Then finally I tried retyping the data in one of the columns and executed the view again, and there the data is finally showing. Now the problem arises: I would need to re-encode the data in that column every time a row is inserted, which is obviously not a good thing to do.
Here is the code for how I added some data:
tblcsv.Columns.AddRange(new DataColumn[7]
{
    new DataColumn("unit_name", typeof(string)),
    new DataColumn("unit", typeof(string)),
    new DataColumn("adrress", typeof(string)),
    new DataColumn("latitude", typeof(string)),
    new DataColumn("longitude", typeof(string)),
    new DataColumn("region", typeof(string)),
    new DataColumn("linkid", typeof(string))
});
string ReadCSV = File.ReadAllText(forex);
foreach (string csvRow in ReadCSV.Split('\n'))
{
    if (!string.IsNullOrEmpty(csvRow))
    {
        // Adding each row into the datatable
        tblcsv.Rows.Add();
        int count = 0;
        foreach (string FileRec in csvRow.Split(','))
        {
            tblcsv.Rows[tblcsv.Rows.Count - 1][count] = FileRec;
            if (count == 5)
            {
                tblcsv.Rows[tblcsv.Rows.Count - 1][6] = link;
            }
            count++;
        }
    }
}
string consString = ConfigurationManager.ConnectionStrings["diposlConnectionString"].ConnectionString;
using (SqlConnection con = new SqlConnection(consString))
{
    using (SqlBulkCopy sqlBulkCopy = new SqlBulkCopy(con))
    {
        // Set the database table name
        sqlBulkCopy.DestinationTableName = "dbo.FRIENDLY_FORCES";

        // [OPTIONAL]: Map the Excel columns with that of the database table
        sqlBulkCopy.ColumnMappings.Add("unit_name", "unit_name");
        sqlBulkCopy.ColumnMappings.Add("unit", "unit");
        sqlBulkCopy.ColumnMappings.Add("adrress", "adrress");
        sqlBulkCopy.ColumnMappings.Add("latitude", "latitude");
        sqlBulkCopy.ColumnMappings.Add("longitude", "longitude");
        sqlBulkCopy.ColumnMappings.Add("region", "region");
        sqlBulkCopy.ColumnMappings.Add("linkid", "linkid");

        con.Open();
        sqlBulkCopy.WriteToServer(tblcsv);
        con.Close();
    }
}
The region column is where I manually edited the data.
Did renaming the table do something to my data?
Or am I just missing something?
Thank you

Why does this code - adding wordnet synonyms to index - fail?

I am writing this code as part of my CustomAnalyzer:
public class CustomAnalyzer extends Analyzer {

    SynonymMap mySynonymMap = null;

    CustomAnalyzer() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true);
        FileReader fr = new FileReader("/home/watsonuser/Downloads/wordnetSynonyms.txt");
        BufferedReader br = new BufferedReader(fr);
        String line = "";
        while ((line = br.readLine()) != null) {
            String[] synset = line.split(",");
            for (String syn : synset)
                builder.add(new CharsRef(synset[0]), new CharsRef(syn), true);
        }
        br.close();
        fr.close();
        try {
            mySynonymMap = builder.build();
        } catch (IOException e) {
            System.out.println("Unable to build synonymMap");
            e.printStackTrace();
        }
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new PorterStemFilter(new SynonymFilter(
                new StopFilter(true,
                        new LowerCaseFilter(
                                new StandardFilter(
                                        new StandardTokenizer(Version.LUCENE_36, reader))),
                        StopAnalyzer.ENGLISH_STOP_WORDS_SET),
                mySynonymMap, true));
        return result;
    }
}
Now, if I use the same CustomAnalyzer as part of my querying, then a query such as
myFieldName: manager
is expanded with synonyms for manager.
But I want the synonyms to be part of only my index; I don't want my query to be expanded with synonyms.
So I removed the SynonymFilter from my CustomAnalyzer only when querying the index. The query then remains
myFieldName: manager
but it fails to retrieve documents that contain the synonyms of manager.
How do we solve this problem?
If you do not have your synonym filter during query processing, then the only terms a query will match are the ones that were actually written to the index at indexing time - and you are not showing that part here.
The best way to troubleshoot this is to look at the Admin/Core/Analysis screen (in Solr 4+) and put your text in. It will show what happens to the text after each stage of the index-time and query-time analysis chains.
You don't even need to rerun the indexer. You can just define a bunch of different field types you are trying to figure out and then run the analysis of your sample sentences directly against those types.

salesforce SOQL : query to fetch all the fields on the entity

I was going through the SOQL documentation, but couldn't find a query to fetch all the field data of an entity, say Account, like:
select * from Account [ SQL syntax ]
Is there a syntax like the above in SOQL to fetch all the data of an Account, or is the only way to list all the fields (though there are a lot of fields to be queried)?
Create a map like this:
Map<String, Schema.SObjectField> fldObjMap = schema.SObjectType.Account.fields.getMap();
List<Schema.SObjectField> fldObjMapValues = fldObjMap.values();
Then you can iterate through fldObjMapValues to create a SOQL query string:
String theQuery = 'SELECT ';
for (Schema.SObjectField s : fldObjMapValues)
{
    String theLabel = s.getDescribe().getLabel();           // Perhaps store this in another map
    String theName = s.getDescribe().getName();
    Schema.DisplayType theType = s.getDescribe().getType(); // Perhaps store this in another map

    // Continue building your dynamic query string
    theQuery += theName + ',';
}

// Trim last comma
theQuery = theQuery.substring(0, theQuery.length() - 1);

// Finalize query string
theQuery += ' FROM Account WHERE ... AND ... LIMIT ...';

// Make your dynamic call
Account[] accounts = Database.query(theQuery);
superfell is correct; there is no way to directly do a SELECT *. However, this little code recipe will work (well, I haven't tested it, but I think it looks OK). Understandably, Force.com wants a multi-tenant architecture where resources are only provisioned as explicitly needed - not grabbed wholesale with SELECT * when usually only a subset of fields is actually needed.
You have to specify the fields. If you want to build something dynamic, the describeSObject call returns the metadata about all the fields for an object, so you can build the query from that.
I use the Force.com Explorer, and within the schema filter you can click the checkbox next to the table name and it will select all the fields and insert them into your query window - I use this as a shortcut to typing them all out; just copy and paste from the query window. Hope this helps.
In case anyone was looking for a C# approach, I was able to use reflection and come up with the following:
public IEnumerable<String> GetColumnsFor<T>()
{
    return typeof(T)
        .GetProperties(System.Reflection.BindingFlags.Public | System.Reflection.BindingFlags.Instance)
        .Where(x => !Attribute.IsDefined(x, typeof(System.Xml.Serialization.XmlIgnoreAttribute))) // Exclude the ignored properties
        .Where(x => x.DeclaringType != typeof(sObject))                                           // & Exclude inherited sObject propert(y/ies)
        .Where(x => x.PropertyType.Namespace != typeof(Account).Namespace)                        // & Exclude properties storing references to other objects
        .Select(x => x.Name);
}
It appears to work for the objects I've tested (and matches the columns generated by the API test). From there, it's about creating the query:
/* assume: this.service = new sForceService(); */
public IEnumerable<T> QueryAll<T>(params String[] columns)
    where T : sObject
{
    String soql = String.Format("SELECT {0} FROM {1}",
        String.Join(", ", GetColumnsFor<T>()),
        typeof(T).Name
    );

    this.service.QueryOptionsValue = new QueryOptions
    {
        batchSize = 250,
        batchSizeSpecified = true
    };

    ICollection<T> results = new HashSet<T>();
    try
    {
        Boolean done = false;
        QueryResult queryResult = this.service.queryAll(soql);
        while (!done)
        {
            sObject[] records = queryResult.records;
            foreach (sObject record in records)
            {
                T entity = record as T;
                if (entity != null)
                {
                    results.Add(entity);
                }
            }
            done = queryResult.done;
            if (!done)
            {
                queryResult = this.service.queryMore(queryResult.queryLocator);
            }
        }
    }
    catch (Exception ex)
    {
        throw; // your exception handling
    }
    return results;
}
Today was my first time with Salesforce, and I came up with this in Java:
/**
 * @param o any class that extends {@link SObject}, f.ex. Opportunity.class
 * @return a list of all the objects of this type
 */
@SuppressWarnings("unchecked")
public <O extends SObject> List<O> getAll(Class<O> o) throws Exception {
    // get the objectName; for example "Opportunity"
    String objectName = o.getSimpleName();

    // this will give us all the possible fields of this type of object
    DescribeSObjectResult describeSObject = connection.describeSObject(objectName);

    // making the query
    String query = "SELECT ";
    for (Field field : describeSObject.getFields()) { // add all the fields in the SELECT
        query += field.getName() + ',';
    }
    // trim last comma
    query = query.substring(0, query.length() - 1);
    query += " FROM " + objectName;

    SObject[] records = connection.query(query).getRecords();
    List<O> result = new ArrayList<O>();
    for (SObject record : records) {
        result.add((O) record);
    }
    return result;
}
I used the following to get complete records:
query_all("Select Id, Name From User_Profile__c")
To get all the fields of a record, we have to list those fields explicitly, as described here:
https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_select.htm
Hope this helps!
