Key Points/Challenges while working with Apache Tika and Solr

Recently I got involved in a task, part of which requires using Apache Solr (for document search) and Apache Tika (to extract metadata or plain text from documents).
I haven't integrated Solr and Tika yet, but I have worked with both of them individually, so I have a set of questions related to Apache Solr and Apache Tika, at beginner to intermediate level.
With Solr I have done the following kinds of practical work: created a dummy database, configured schema.xml, ran the Solr server, wrote a program that fetches documents from the database and stores them in the Solr document index, made a simple client to fetch data from Solr via the JSON interface, and made a program that keeps a MySQL database in sync with Solr's document index.
With Tika I have done the following kinds of practical work: compiled and installed Tika, and explored its document parsing capabilities.
My sample task statement:
Part of my project requires storing around 100,000 documents (the text of these 100,000 DOC, PDF, and TXT files is extracted by Apache Tika, pushed to a MySQL database, and later pushed to Solr's document index) for full-text search, and then searching those documents via a client interface (a browser).
At a simple programming level this task will get done; what I would like to understand are the challenges related to managing the index, or anything else in Solr, e.g.:
** At an advanced level, does it require optimizing Solr's open source code?
** Even when Solr is working properly, does it present any specific challenges?
** What key things need to be considered initially so that Solr works properly?
** Do you think any extra tool needs to be developed to monitor Solr's operation?
I hope this gives you an idea of the questions I have.
** I would also like to know if you have any experience of using Apache Tika with Apache Solr, and any challenges or key things to consider.
Would you like to recommend any specific sources? If you have any document or anything else you feel would be helpful, please share it.
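For concreteness, a minimal sketch of the extraction-and-indexing pipeline described above, using Tika's AutoDetectParser together with the SolrJ client (the Solr URL and field names are assumptions and must match your schema.xml; the intermediate MySQL step is omitted):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaToSolr {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        File doc = new File(args[0]);
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(doc)) {
            // AutoDetectParser picks a suitable parser for DOC, PDF, TXT, etc.
            new AutoDetectParser().parse(in, handler, metadata);
        }

        SolrInputDocument sdoc = new SolrInputDocument();
        sdoc.addField("id", doc.getAbsolutePath());      // hypothetical unique key
        sdoc.addField("title", metadata.get("title"));   // extracted metadata
        sdoc.addField("text", handler.toString());       // extracted plain text
        solr.add(sdoc);
        solr.commit();
    }
}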

Related

Apache Solr XPathEntityProcessor DIH - Index Update

I am relatively new to Apache Solr and have recently been working with DIH, specifically the XPathEntityProcessor. I need a way to periodically index new XML files; however, it appears the delta-import command is only supported by the SqlEntityProcessor [1].
I am working with an increasingly large dataset of XML files and was hoping Solr could determine which files are new and index them...
A potential solution that came to mind is to do a full-import from a staging area consisting of documents that have not previously been indexed, before moving the documents to their respective permanent locations.
Is there a workaround to mimic delta-import using XPathEntityProcessor?
What sort of approaches do people using XPathEntityProcessor take to index newer documents?
[1] http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command-1
I've resorted to using the UpdateRequestHandler; it's perfect for what I want to do.
[1] http://wiki.apache.org/solr/XsltUpdateRequestHandler
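For reference, a hedged sketch of doing the same from Java with SolrJ's ContentStreamUpdateRequest; the /update/xslt path and the stylesheet name follow the example configuration on the wiki page above and may differ in your setup:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class XsltPost {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Post a raw XML file; the "tr" parameter names an XSLT stylesheet
        // (in conf/xslt/) that rewrites it into Solr's <add><doc> format.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/xslt");
        req.setParam("tr", "updateXml.xsl"); // hypothetical stylesheet name
        req.setParam("commit", "true");
        req.addFile(new File(args[0]), "application/xml");
        solr.request(req);
    }
}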

How to index data from a database using Apache Solr with GlassFish server on Linux?

I want to create a search box in my web app using Apache Lucene and Apache Solr. I am using a Postgres database and have to do it with Java.
As I am new to these concepts (Solr, Lucene), I am struggling with this. I have already installed and configured Apache Solr with GlassFish. Now I don't know how to start: do I have to create a Java project in Eclipse, or do I have to use the Solr admin GUI?
Can anyone help me with this?
Thanks in advance.
In order to make data searchable, you first have to index it. You can use one of the following ways to index data:
By using Solr clients such as SolrJ
If you store your data in a relational DB, then you can use the DataImportHandler (DIH)
By posting XML or JSON messages. Check here for documentation.
When new data is added, you can index it using Solr clients (SolrJ). You can also search your data using SolrJ or any other client library; a minimal sketch follows.
You can find other client libraries here.
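For example, a minimal SolrJ sketch of indexing and then searching a document (the URL and field names are placeholders and must match your schema.xml):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrjExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Index one document; the field names must exist in schema.xml.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Hello Solr");
        solr.add(doc);
        solr.commit();

        // Search for it.
        QueryResponse rsp = solr.query(new SolrQuery("title:solr"));
        System.out.println(rsp.getResults().getNumFound() + " hit(s)");
    }
}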
You can start with the Solr DataImportHandler (DIH) to index the data from Postgres into Solr.
For a more detailed understanding you can refer to:
how-to-import-data-from-sql-databases-part-1
how-to-import-data-from-sql-databases-part-2
how-to-import-data-from-sql-databases-part-3
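Once DIH is configured (a data-config.xml with a JDBC data source pointing at your Postgres database), an import is triggered over HTTP. A hedged SolrJ sketch, assuming the handler is registered at the conventional /dataimport path:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class DihTrigger {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Equivalent to requesting /solr/dataimport?command=full-import
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");

        QueryRequest req = new QueryRequest(params);
        req.setPath("/dataimport");
        solr.request(req);
    }
}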

Index my own data in Solr

I am new to Solr and have a couple of questions on which I'd like help from more experienced people:
I am able to get the example running; however, what exactly is start.jar?
I know that by running "java -jar start.jar" I can start Solr. But do I run this command after I index my own data, rather than the given sample data? If not, what should I do to run my own Solr instance with my own indexed data?
I need to index my own sample data, not related to the given Solr example at all. How exactly should I do it? Should I copy the example directory and then modify the fields in schema.xml? Should I then run post.sh accordingly to index the data, as I did to set up the example?
Thanks a lot for your help!
Steps:
Decide what the structure of the documents you store in Solr will be (somewhat like creating the schema of a relational DB for one table).
Remove the example core and create your own core with that schema.
Once the schema works with no errors (check the logs of the server that hosts the Solr app), you can start feeding your data into Solr. You POST it via HTTP in a specific structure, which is documented in the Solr wiki. Various frameworks have classes to handle that.
Marked as Wiki as this is too broad an answer for someone who did not bother to RTFM...
Custom indexing is not a difficult task; I worked on it just a few days ago. First you need to write your document in XML, CSV, or JSON (the formats supported by Solr), containing fields according to your schema.xml; then run the following command in example/exampledocs.
For a document mydoc.xml
./post.sh mydoc.xml
If, in the output, the status value is 0, then indexing was successful and you can search for your document in Solr.
Reference: http://www.solrtutorial.com/solr-in-5-minutes.html
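For illustration, a minimal mydoc.xml in Solr's XML update format might look like this (the field names are hypothetical and must match your schema.xml):

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">My first document</field>
    <field name="text">Some searchable body text.</field>
  </doc>
</add>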
Though the question is old, I am writing for new visitors with the same issue. The question can't be answered in a few words: you must understand what Solr is, what the Solr Admin UI is, and why we need Solr instead of a relational database. Then you can understand how to import sample data. I have recently published two articles, Solr Introduction and Importing Sample Data, which might be helpful for you.
http://www.devtrainings.com/2017/03/apache-solr-introduction-and-server.html
http://www.devtrainings.com/2017/03/apache-solr-index-data-and-run-search.html

Updating Solr from a Lucene index

I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using the Heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using NutchWAX (a customised version of Apache Nutch, tailored to index .warc files as generated by Heritrix). NutchWAX dumps out a Lucene index, and to use it in Solr, all that has to be done is to generate a correct schema.
This is all done and it's running like it should; however, the archive is not static, and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one, and import it back into Solr. However, to do that, Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first shot at this was generating .xml files out of the Lucene index and posting them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably leverage multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could use the MergeIndexes capability, or the ability to swap cores, for a better experience in your scenario.
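A hedged SolrJ sketch of both CoreAdmin approaches mentioned above (the core names and index path are hypothetical, and the exact request classes vary a little between Solr versions):

import java.util.Arrays;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

public class CoreAdminExample {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests go to the container root, not to a specific core.
        SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

        // Option 1: merge a freshly generated Lucene index into the live core,
        // then commit on that core to make the merged documents visible.
        CoreAdminRequest.MergeIndexes merge = new CoreAdminRequest.MergeIndexes();
        merge.setCoreName("archive");                         // target core
        merge.setIndexDirs(Arrays.asList("/data/new-index")); // new Lucene index
        merge.process(admin);

        // Option 2: build the new index in a staging core, then swap it
        // with the live core; searches move over without a restart.
        CoreAdminRequest swap = new CoreAdminRequest();
        swap.setAction(CoreAdminAction.SWAP);
        swap.setCoreName("archive");
        swap.setOtherCoreName("archive-staging");
        swap.process(admin);
    }
}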

Using Solr to read OpenGrok's database and failing with "no segments* file found"

I need a simple way to read OpenGrok's DB from a PHP script to do some weird searches (doing that in Java in OpenGrok itself isn't within my abilities). So I decided to use Solr as a way to query the Lucene DB directly from another language (probably PHP or C).
The problem is that when I point Solr to /var/opengrok/data, it bombs out with:
java.lang.RuntimeException: org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory#/var/opengrok/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory#3a329572: files: [] at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1103)
(etc, etc, the backtrace is about three screens long)
I tried to point it somewhere inside data with no luck. The structure looks like this:
/var/opengrok/data/index/$projname/segment*
/var/opengrok/data/spelling...
and it seems like whatever Solr is using expects the segments files directly in the index directory.
I checked to see if there's any version discrepancy, but OpenGrok 0.11 is using Lucene 3.0.2 and I've set Solr to LUCENE_30 as the database version.
Any pointers will be greatly appreciated; Google didn't seem to be able to help with this.
OpenGrok's web interface can consume any well-formed search query (through the URL) and reply with XHTML results which are easily parseable, so you're probably making it too complex by hacking inside the Lucene index rather than using the provided UI...
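A minimal Java sketch of that approach (the host, context path, and the "q" query parameter are assumptions based on a typical OpenGrok deployment; any HTML/XML parser can then pull the hits out of the returned page):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class OpenGrokSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical OpenGrok URL; the search page takes the query via "q".
        String query = URLEncoder.encode("my search term", "UTF-8");
        URL url = new URL("http://localhost:8080/source/search?q=" + query);

        // Fetch the XHTML result page instead of reading the Lucene DB directly.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}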
