I want to index a large number of PDF documents.
I have found a reference showing that it could be done using Apache Tika, but unfortunately I cannot find any reference that describes how to configure Apache Tika in Solr 1.4.1.
Once I have it configured, how can I send documents to Solr directly without using curl?
I am using SolrNet for indexing.
See ExtractingRequestHandler
Support for ExtractingRequestHandler in SolrNet is not yet complete. You can either finish implementing it, or work around it and craft your own HttpWebRequests.
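As a rough illustration of the "craft your own request" route, here is a minimal sketch in Python (the same request could be built with HttpWebRequest in .NET). It assumes the ExtractingRequestHandler is registered at `/update/extract` in solrconfig.xml, as in Solr's example configuration, and that the schema's unique key field is called `id`. The function names and the base URL are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_url, doc_id, pdf_bytes, commit=True):
    """Build (but do not send) a POST for Solr's ExtractingRequestHandler.

    Assumes the handler is mapped to /update/extract and that the
    unique key field in the schema is named "id".
    """
    params = {"literal.id": doc_id}  # literal.<field> sets a field value directly
    if commit:
        params["commit"] = "true"
    url = solr_url.rstrip("/") + "/update/extract?" + urlencode(params)
    # The raw PDF bytes go in the request body; Tika parses them on the Solr side.
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def index_pdf(solr_url, doc_id, path):
    """Read a PDF from disk, send it to Solr, and return the HTTP status code."""
    with open(path, "rb") as f:
        request = build_extract_request(solr_url, doc_id, f.read())
    with urlopen(request) as response:
        return response.status
```

Calling `index_pdf("http://localhost:8983/solr", "doc1", "manual.pdf")` would post one file; for a large batch it is probably better to pass `commit=False` and commit once at the end.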
I am going to use Apache Solr for the first time. I just want to write some basic code for Apache Solr in Java with Eclipse. Can you please suggest how to get familiar with Apache Solr?
Basically, Solr is a full-text search engine. You can also use it as a data store for unstructured data. You can use SolrJ to communicate with Solr from Java.
My objective is to:
- crawl on urls and
- extract micro data and
- save to solr
I used this guide to set up Nutch, HBase, and Solr.
I'm using Nutch with HBase to crawl the URLs, and the Tika plugin for Nutch to parse the pages, but it only gets metadata.
Did I miss something in the configuration? Please guide me or suggest alternatives.
You need to implement your own ParseFilter and implement the extraction logic there. You will get a DocumentFragment generated by the Tika parser and could use e.g. XPath to get the micro data.
Note that the DOM generated by Tika is heavily normalised / modified, so your XPath expressions might not match. It may be better to rely on the old HTML parser instead.
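To give an idea of what the extraction logic itself might look like, here is a simplified sketch in Python using only the standard library. It is not Nutch's ParseFilter API (in Nutch this would be Java code working on the DocumentFragment); the class name is hypothetical, it ignores `itemscope` nesting, and it only handles a subset of the elements whose value comes from an attribute.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collects (itemprop, value) pairs from HTML. Simplified: no item nesting."""

    # Subset of elements whose microdata value is read from an attribute.
    VALUE_ATTRS = {"meta": "content", "a": "href", "link": "href",
                   "img": "src", "time": "datetime"}
    VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.properties = []
        self._open = []  # stack of (tag, itemprop or None, collected text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        value_attr = self.VALUE_ATTRS.get(tag)
        if prop and value_attr and attrs.get(value_attr) is not None:
            # Value comes from the attribute, so record it immediately.
            self.properties.append((prop, attrs[value_attr]))
            prop = None
        if tag not in self.VOID_TAGS:
            self._open.append((tag, prop, []))

    def handle_data(self, data):
        # Text belongs to every open element that is waiting for a text value.
        for _, prop, parts in self._open:
            if prop:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _, _ in self._open]:
            return  # stray or void end tag, nothing to close
        while self._open:
            open_tag, prop, parts = self._open.pop()
            if prop:
                self.properties.append((prop, "".join(parts).strip()))
            if open_tag == tag:
                break


def extract_microdata(html):
    """Return the list of (itemprop, value) pairs found in the HTML string."""
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties
```

The resulting pairs could then be mapped to Solr fields before indexing. For anything beyond simple pages, a dedicated library such as Apache Any23 is likely the more robust choice.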
One generic way of doing it would be to use Apache Any23, as done for instance in this storm-crawler module.
BTW, there is an open JIRA for a MicroDataHandler in Tika which hasn't been committed yet.
HTH
I am trying to integrate Solr with Sitecore but could not find any way to create a Solr instance for it. I have a few PDFs from SDN, but I could not find a way to create a Solr instance in any of them. Since I am new to this CMS, I hope I can get some help here. Thank you.
There are lots of resources available for setting up Solr and integrating it with Sitecore.
Essentially, Sitecore is agnostic with respect to how you set up Solr (barring a few exceptions), so you need to follow standard methods to set Solr up. If you are doing this on your local machine, then I recommend you simply download Solr and get it running through the provided Jetty app server.
Once Solr is running, download the Solr Extensions from SDN, then follow the search scaling guide to integrate Solr. This really only boils down to the following:
- Remove Lucene config files
- Add Solr config files and binaries
- Add Solr endpoint into relevant config
- Generate Solr Schema via Sitecore -> Control Panel -> Search (within Sitecore)
- Add Schema file to Solr Core configuration
Et voilà.
There is a great guide here: http://www.dansolovay.com/2013/05/setting-up-solr-with-sitecore-7.html
I have recently moved all my Solr documents into Elasticsearch after creating an exact equivalent mapping of the schema.xml. To test the accuracy, I created about 120 Lucene queries and ran them against both Solr and Elasticsearch.
However, the hit counts for 17 of the 120 queries differed between Solr and Elasticsearch. Could there be any reasons for this apart from the analyzers, tokenizers, and filters defined in schema.xml / the Elasticsearch mappings? The Solr version is 4.3.0, whereas the Elasticsearch version is 1.3.2.
The Elasticsearch query I used is:
{"query_string":{"query":lucene_query}}
Please let me know if there is any alternative way to test the query accuracy between Solr and Elasticsearch.
First, make sure that you are using the same analysis on both sides, for example the same filters, tokenizers, and stemmers.
Also, Apache Solr 4.3.0 is built on Apache Lucene 4.3.0, while Elasticsearch 1.3.2 is built on Apache Lucene 4.9.0.
I don't know whether this is the issue, to be honest. But if I were you, I would check the release notes of the Apache Lucene versions after 4.3.0 and see what has changed.
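One way to narrow down which queries differ is to run each query against both systems with the default field and default operator spelled out explicitly, since those defaults are configured separately on each side. The following is a minimal sketch under several assumptions: Solr's `/select` handler with the standard Lucene query parser (`df`, `q.op`), Elasticsearch's `query_string` options `default_field` and `default_operator`, and a 1.x-style response where `hits.total` is a plain integer. The helper names and URLs are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def solr_count_url(core_url, query, default_field, default_operator="OR"):
    """URL for a Solr query that returns only the hit count (rows=0)."""
    params = {"q": query, "df": default_field, "q.op": default_operator,
              "rows": 0, "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def es_count_body(query, default_field, default_operator="OR"):
    """Body for an Elasticsearch _search request that returns no documents."""
    return {"size": 0,
            "query": {"query_string": {"query": query,
                                       "default_field": default_field,
                                       "default_operator": default_operator}}}


def solr_hits(response):
    return response["response"]["numFound"]


def es_hits(response):
    # Assumes the 1.x response format, where hits.total is an integer.
    return response["hits"]["total"]


def fetch_json(url, body=None):
    """GET the URL, or POST the body as JSON when one is given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    with urlopen(Request(url, data=data, headers=headers)) as response:
        return json.load(response)


def mismatches(queries, solr_counts, es_counts):
    """Return (query, solr_count, es_count) for every query whose counts differ."""
    return [(q, s, e) for q, s, e in zip(queries, solr_counts, es_counts) if s != e]
```

If the mismatching queries cluster around a particular field or around queries with no explicit field or operator, that points at field-level analysis or at differing defaults rather than at the Lucene version.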
Which is the best integration for Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)?
I understand that Tika can be integrated within Nutch and/or Solr, but which one is the better choice?
Set up the Tika plugin with Nutch; Nutch will then parse the data and do all the hard work for you.
I would suggest setting it up on Solr as well: you may wish to send documents to Solr via the curl command, and it would help to have it set up on Solr too. It needs little extra configuration and has no performance cost:
There is a guide to setting up Tika and the ExtractingRequestHandler here.
Apply the Tika parser in Nutch's parsing phase.