Indexing PDF documents with addtional search fields using SolrNet?

Indexing PDF documents with addtional search fields using SolrNet? - solr

I found this article useful when indexing documents, however, how can I attach additional fields so I can pass in, say, the ID of the document in our database for use in displaying the search results? I thought by using the Fields (Of the ExtractParameters class) property I could index additional data with the document, but that doesn't seem to work or that is not its function.
Example code:
var solr = ObjectLocator.Instance.Resolve<ISolrOperations<IndexDocument>>();
var guid = Guid.NewGuid().ToString();
using (var fileStream = System.IO.File.OpenRead(Server.MapPath("~/files/") + "greenroof.pdf"))
{
var response =
solr.Extract(
new ExtractParameters(fileStream, "greenRoof1234")
{
ExtractFormat = ExtractFormat.Text,
ExtractOnly = false,
Fields = new[] { new ExtractField("field1", "value1"), new ExtractField("field2", "value2") }
});
}

#aitchnyu is correct, passing the values via the literal.field=value method is the correct way to do this.
However, according to this post on ExtractingRequestHandler support in the SolrNet Google Group, there was a bug with the ExtractParameters.Fields not working properly. This was fixed in the 0.4.0.X versions of SolrNet. Please make sure you are using one of the latest versions of SolrNet. You can obtain that by one of the following means:
Project Site Downloads
NuGet PreRelease Package
Also that discussion has some good examples of using the ExtractingRequestHandler in SolrNet as well as a workaround for adding the additional field values if you cannot upgrade to a newer version of SolrNet.

This is sufficient: http://wiki.apache.org/solr/ExtractingRequestHandler#Literals .
In general use a literal.field=value while uploading.

It turned out not to be an issue with SOLRNet, but my knowledge of SOLR, in general. I needed to specify the fields in my schema. After i added the fields to my schema they were visible in my SOLR query.

Related

How can we initialize DataChangeDetectionPolicy using .netsdk?

I have created a new index that is populated using an indexer. The indexer's datasource is a SQL view that has a Timestamp column of type datetime. Since we don't want a full reindexing each time the indexer runs, this column should be used to determine which data have changed since the last indexer run.
According to the documentation we need to create or update the datasource by setting the HighWatermarkColumnName and ODataType to the DataChangeDetectionPolicy object. The example in the documentation uses the REST API and there is also way to do it using the azure search portal directly.
However I want to do it using .netsdk and so far I haven't been able to do so. I am using Azure.Search.Documents(11.2.0 - beta.2). Here is the part of the code I use to create the datasource:
SearchIndexerDataSourceConnection CreateIndexerDataSource()
{
var ds = new SearchIndexerDataSourceConnection(DATASOURCE,
SearchIndexerDataSourceType.AzureSql,
this._datasourceConStringMaxEvents,
new SearchIndexerDataContainer(SQLVIEW));
//ds.DataChangeDetectionPolicy = new DataChangeDetectionPolicy();
return ds;
}
The commented code is what I tried to do to initialize the DataChangeDetectionPolicy but there is no ctor exposed. Am I missing something?
Thanks in advance.

Instead of using DataChangeDetectionPolicy, you will need to use HighWaterMarkChangeDetectionPolicy which is derived from DataChangeDetectionPolicy.
So your code would be something like:
ds.DataChangeDetectionPolicy = new HighWaterMarkChangeDetectionPolicy("Timestamp");

java code for solr geolocation indexing

I am using solr for fixing my indexing and searching feature and a beginner to solr.
I actually want to index the geolocation into solr index and also want to make queries on it so went through some articles,
http://wiki.apache.org/solr/SpatialSearch
And exactly some schema type are present in my schema.xml.
Now my question is I want to write a java code to index the geolocation while indexing it for dynamic geolocation fields. So how to write it and is there any sample java code for indexing it. I looked for it but didn't found any so please if anybody can help me with it.
I also understand that when indexing we would need to write some thing like :
document.addField(myDynLocFld+"_p", val));
If using this approach what should be val an instance of location object with both lat and lng value embedded in it. So how to counter this or is there any diferent approach in solr java for this?
Thanking in advance.

Check this sample of code,
// Store the index in memory:
//Directory directory = new RAMDirectory();
// To store an index on disk
Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
For more details check Lucene APIs.

Implementing Tagging and Excluding Filters with Solrj / Spring Data Solr

I am trying to implement a Solr Facet search with multi-select on a field. To take this example: http://docs.lucidworks.com/display/solr/Faceting#Faceting-LocalParametersforFaceting, I would like to generate this call to solr:
q=mainquery&fq=status:public&fq={!tag=dt}doctype:pdf&facet=on&facet.field={!ex=dt}doctype
I am not sure how to call this using solrj (java) as I don't want to add a simple filed but I need to include the tagging and excluding (!tag=dt and !ex=dt). Any ideas what the java code should look like?
I am using Spring-Data-Solr which seems to be too basic to make such an advance call. So I think I need to go a level lower and use solrj. Either solution would be great (solrj or spring-data-solr)

The following block should create the querystring you want.
SimpleFacetQuery query = new SimpleFacetQuery(new SimpleStringCriteria("mainquery"))
.addFilterQuery(new SimpleQuery(new Criteria("status").is("public")))
.addFilterQuery(new SimpleQuery(new Criteria("{!tag=dt}doctype").is("pdf")));
query.setFacetOptions(new FacetOptions("{!ex=dt}doctype"));
solrTemplate.queryForFacetPage(query, YourBean.class);

With link I can do this :-
ModifiableSolrParams params = new ModifiableSolrParams();
params.add("fq", "{!tag=dt}doctype:pdf");
params.add("facet.field", "{!ex=dt}doctype");

Using SolrQuery (which is a subclass of ModifiableSolrParams from the other answer):
solrQuery.addFilterQyery("{!tag=dt}doctype:pdf");
solrQuery.addFacetField("{!ex=dt}doctype");

How to use the composite filter of app-engine data store queries?

I have the this function:
#Override
public List<Expense> getExpensesBetween(Date firstDate, Date secondDate) {
Query query = new Query(expenseEntityKind);
query.addFilter("date", Query.FilterOperator.GREATER_THAN_OR_EQUAL, secondDate);
query.addFilter("date", Query.FilterOperator.GREATER_THAN_OR_EQUAL, firstDate);
Iterable<Entity> entities = service.prepare(query).asIterable();
return getReturnedExpenses(entities);
}
I want to return all the expenses between two date i.e greater than or equal 2012-05-01 AND less than or equal 2012-06-01. I took a look at the documentations of Google app-engine. It says that we must use Composite Filters. Google documentations: " However, if you want to set more than one filter on a query, you must use CompositeFilter. You must have at least two filters to use CompositeFilter. However, this documentations seems to be old and i didn't find any function called setFilter();. Any suggestion how to create a composite filter ? I use App-engine sdk 1.6.6. Thanks in advance.

The following is taken from https://cloud.google.com/appengine/docs/java/datastore/queries which might be helpful here
Filter heightMinFilter =
new FilterPredicate("height",
FilterOperator.GREATER_THAN_OR_EQUAL,
minHeight);
Filter heightMaxFilter =
new FilterPredicate("height",
FilterOperator.LESS_THAN_OR_EQUAL,
maxHeight);
//Use CompositeFilter to combine multiple filters
Filter heightRangeFilter =
CompositeFilterOperator.and(heightMinFilter, heightMaxFilter);

The documentation you are looking is for the newest version (1.7.0). That features have been just introduced. Here an example of how to use CompositeFilter (1.7.0 also).

How to export Rich Text fields as HTML from Notes with LotusScript?

I'm working on a data migration task, where I have to export a somewhat large Lotus Notes application into a blogging platform. My first task was to export the articles from Lotus Notes into CSV files.
I created a Agent in LotusScript to export the data into CSV files. I use a modified version of this IBM DeveloperWorks forum post. And it basically does the job. But the contents of the Rich Text field is stripped of any formatting. And this is not what I want, I want the Rich Text field rendered as HTML.
The documentation for the GetItemValue method explicitly states that the text is rendered into plain text. So I began to research for something that would retrieve the HTML. I found the NotesMIMEEntity class and some sample code in the IBM article How To Access HTML in a Rich Text Field Using LotusScript.
But for the technique described in the above article to work, the Rich Text field need to have the property "Store Contents as HTML and MIME". And this is not the case with my Lotus Notes database. I tried to set the property on the fields in question, but it didn't do the trick.
Is it possible to use the NotesMIMEEntity and set the "Store Contents as HTML and MIME" property after the content has been added, to export the field rendered as HTML?
Or what are my options for exporting the Notes database Rich Text fields as HTML?
Bonus information: I'm using IBM Lotus Domino Designer version 8.5

There is this fairly unknown command that does exactly what you want: retrieve the URL using the command OpenField.
Example that converts only the Body-field:
http://SERVER/your%5Fdatabase%5Fpath.nsf/NEW%5FVIEW/docid/Body?OpenField

Here is how I did it, using the OpenField command, see D.Bugger's post above
Function GetHtmlFromField(doc As NotesDocument, fieldname As String) As String
Dim obj
Set obj = CreateObject("Microsoft.XMLHTTP")
obj.open "GET", "http://www.mydomain.dk/database.nsf/0/" + doc.Universalid + "/" + fieldname + "?openfield&charset=utf-8", False, "", ""
obj.send("")
Dim html As String
html = Trim$(obj.responseText)
GetHtmlFromField = html
End Function

I'd suggest looking at Midas' Rich Text LSX (http://www.geniisoft.com/showcase.nsf/MidasLSX)
I haven't used the personally, but I remember them from years ago being the best option for working with Rich Text. I'd bet it saves you a lot of headaches.
As for the NotesMIMEEntity class, I don't believe there is a way to convert RichText to MIME, only MIME to RichText (or retain the MIME within the document for emailing purposes).

If you upgrade to Notes Domino 8.5.1 then you can use the new ConvertToMIME method of the NotesDocument class. See the docs. This should do what you want.
Alternativly the easiest way to get the Domino server to render the RichText will be to actually retrieve it via a url call. Set up a simple form that just has the RichText field and then use your favourite HTTP api to pull in the page. It should then be pretty straight forward to pull out the body.

Keep it simple.
Change the BODY field to Store contents as HTML and MIME
Open the doc in editmode.
Save.
Close.
You can now use the NotesMIMEEntity to get what you need from script.

You can use the NotesDXLExporter class to export the Rich Text and use an XSLT to transform the output to what you need.

I know you mentioned using LotusScript, but if you don't mind writing a small Java agent (in the Notes client), this can be done fairly easily - and there is no need to modify the existing form design.
The basic idea is to have your Java code open a particular document through a localhost http request (which is simple in Java) and to have your code capture that html output and save it back to that document. You basically allow the Domino rendering engine to do the heavy lifting.
You would want do this:
Create a form which contains only the rich-text field you want to convert, and with Content Type of HTML
Create a view with a selection formula for all of the documents you want to convert, and with a form formula which computes to the new form
Create the Java agent which just walks your view, and for each document gets its docid, opens a URL in the form http://SERVER/your_database_path.nsf/NEW_VIEW/docid?openDocument, grabs the http response and saves it.
I put up some sample code in a similar SO post here:
How to convert text and rich text fields in a document to html using lotusscript?

Works in Domino 10 (have not tested with 9)
HTMLStrings$ = NotesRichTextItem .Converttohtml([options] ) As String
See documentation :
https://help.hcltechsw.com/dom_designer/10.0.1/basic/H_CONVERTOHTML_METHOD_NOTESRICHTEXTITEM.html
UPDATE (2022)
HCL no longer support this method since version 11. The documentation does not include any info about the method.
I have made some tests and it still works in v12 but HCL recommended to not use it.

Casper's recommendation above works well, but make sure the ACL is such to allow Anonymous Access otherwise your HTML will be the HTML from your login form

If you do not need to get the Richtext from the items specifically, you can use ?OpenDocument, which is documented (at least) here: https://www.ibm.com/developerworks/lotus/library/ls-Domino_URL_cheat_sheet/
https://www.ibm.com/support/knowledgecenter/SSVRGU_9.0.1/com.ibm.designer.domino.main.doc/H_ABOUT_URL_COMMANDS_FOR_OPENING_DOCUMENTS_BY_KEY.html
OpenDocument also allows you to expand sections (I am unsure if OpenField does)
Syntax is:
http://Host/Database/View/DocumentUniversalID?OpenDocument
But be sure to include the charset parameter as well - Japanese documents were unreadable without specifying utf-8 as the charset.
Here is the method I use that takes a NotesDocument and returns the HTML for the doc as a string.
private string ConvertDocumentToHml(Domino.NotesDocument doc, string sectionList = null)
{
var server = doc.ParentDatabase.Server.Split('/')[0];
var dbPath = doc.ParentDatabase.FilePath;
string viewName = "0";
string documentId = doc.UniversalID.ToUpper();
var ub = new UriBuilder();
ub.Host = server;
ub.Path = dbPath.Replace("\\", "/") + "/" + viewName + "/" + documentId;
if (string.IsNullOrEmpty(sectionList))
{
ub.Query = "OpenDocument&charset=utf-8";
}
else
{
ub.Query = "OpenDocument&charset=utf-8&ExpandSection=" + sectionList;
}
var url = ub.ToString();
var req = HttpWebRequest.CreateHttp(url);
try
{
var resp = req.GetResponse();
string respText = null;
using (var sr = new StreamReader(resp.GetResponseStream()))
{
respText = sr.ReadToEnd();
}
return respText;
}
catch (WebException ex)
{
return "";
}
}

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Indexing PDF documents with addtional search fields using SolrNet? - solr

This is sufficient: http://wiki.apache.org/solr/ExtractingRequestHandler#Literals . In general use a literal.field=value while uploading.

It turned out not to be an issue with SOLRNet, but my knowledge of SOLR, in general. I needed to specify the fields in my schema. After i added the fields to my schema they were visible in my SOLR query.

Related

How can we initialize DataChangeDetectionPolicy using .netsdk?

java code for solr geolocation indexing

Implementing Tagging and Excluding Filters with Solrj / Spring Data Solr

How to use the composite filter of app-engine data store queries?

How to export Rich Text fields as HTML from Notes with LotusScript?

Categories

Resources