SOLR - Tika - Store binary version of file

I am using Tika integrated with Solr to index documents and allow search on those documents. This works pretty smoothly (right now my setup is exactly the same as the example that ships with Solr) and I can indeed index and search documents. As well as indexing the document, I would like to store the binary version in Solr so that when a search returns a result I can return the full PDF/Word/etc. document for download. Is this possible?

Nope.
Solr is a full-text search engine and does not provide any out-of-the-box implementation for storing binary files.
Instead, you can easily host the binary files outside Solr and have them served over HTTP, linked to the search result through its id.
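A minimal sketch of that approach, assuming a SolrJ 4.x-era client and illustrative field names (id, title, url) that would need matching entries in your schema.xml:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithLink {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("title", "Quarterly report");
        // Index only metadata plus a link; the PDF itself lives on a separate web server
        doc.addField("url", "http://files.example.com/docs/doc-42.pdf");

        solr.add(doc);
        solr.commit();
    }
}

When a search hit comes back, the client simply reads the url field and lets the user download the original file from there.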

Related

Can we compare Lucene and Solr index files

We are upgrading Sitecore 8 to 9.3, and as part of that we moved from Lucene to Solr.
Can we compare Lucene and Solr index files, so that we can tell whether the newly generated Solr index contains the same data?
It seems technically possible, as you could use Luke to explore the content of the Lucene index folder, while Solr data can be queried via either the Sitecore UI or the Solr admin.
No. The indexes are very different even though the underlying technology is similar. What I find best is to have an old and a new version of the same site with the same data. Then you can compare the site search pages and any part of the site that runs on search.

Tika installation

I integrated Tika with Solr following the instructions provided in this link
Correct me if I am wrong, but it seems it can index document files (PDF, DOC, audio) located on my own system (given the path of the directory in which those files are stored), yet cannot index files located on the internet when I crawl sites using Nutch.
Can I index document files (PDF, audio, DOC, ZIP) located on the web using Tika?
There are basically two ways to index binary documents within Solr, both with Tika:
Using Tika on the client side to extract information from binary files and then manually indexing the extracted text within Solr (see the sketch after this list)
Using ExtractingRequestHandler, through which you can upload the binary file to the Solr server so that Solr can do the work for you. This way Tika is not required on the client side.
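To make the first option concrete, here is a minimal sketch, assuming Tika's AutoDetectParser and a SolrJ 4.x-era client (the file name and field names are illustrative):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ClientSideExtract {
    public static void main(String[] args) throws Exception {
        // 1. Extract plain text from the binary file with Tika on the client side
        InputStream in = new FileInputStream("report.pdf");
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the default write limit
        Metadata metadata = new Metadata(); // Tika fills this with e.g. title, author, content type
        new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
        in.close();

        // 2. Index the extracted text in Solr like any other text document
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "report.pdf");
        doc.addField("text", handler.toString());
        solr.add(doc);
        solr.commit();
    }
}

With the second option you would instead post the raw file to Solr's /update/extract handler (the ExtractingRequestHandler) and let server-side Tika do the extraction.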
In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to generate text content out of them, and then index the data in Solr as it would normally do with text documents. Nutch already uses Tika; I guess it's just a matter of configuring the types of documents you want to index by changing the regex-urlfilter.txt Nutch config file, removing from the following lines the file extensions that you want indexed.
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
This way you would use the first option I mentioned. Then you need to enable the Tika plugin in Nutch within your nutch-site.xml, as sketched below; have a look at this discussion from the nutch mailing list.
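For illustration, enabling the plugin usually means making sure parse-tika appears in the plugin.includes property of nutch-site.xml; the value below is only indicative, since the default plugin list varies by Nutch version:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>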
This should theoretically work, let me know if it doesn't.

Interface for Solr

I have a large number of documents (mainly PDFs) that I want to index and query on.
I want to store all these docs in a filesystem structure by year.
I currently have this set up in Solr, but I have to run scripts to extract metadata from the PDFs and then update the index.
Is there a product out there that basically lets me drop a new PDF into a folder and have it auto-indexed by Solr?
I have seen that Alfresco does this, but it has some drawbacks; is there anything else along these lines?
Or would I use Nutch to crawl my filesystem and post updates to Solr? I'm not sure how I should do this.
Solr is a search server, not a crawler. As you noted, Nutch can do this (I have used it for a similar use case, indexing a knowledge-base dump).
Essentially, you would host a web server with the root of the folder structure as its document root, and allow directory listing on this web server. Nutch could then crawl the top-level URL of this document dump, as sketched below.
Once Nutch has created this index, you can then expose it through Solr as well.
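A rough sketch of the crawl side, assuming a Nutch 1.x-era install (the host name, depth, and topN values are illustrative; check bin/nutch for the flags your version supports):

mkdir urls
echo "http://fileserver.example.com/docs/" > urls/seed.txt
bin/nutch crawl urls -solr http://localhost:8983/solr -depth 3 -topN 1000

The -solr option tells Nutch to post the parsed documents straight into the given Solr instance instead of only building its own index.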

Path of Solr Document

I would like to know where the indexed document is saved in Solr.
I have installed the Solr server at C:\solr and am using Solr 1.4. By making the necessary changes in the configuration files I am able to search data using a Solr client.
Just wondering where that indexed document is saved.
Indexed documents are saved in the index, which is located in the solr/data/index folder.
Here you can find more details about those files.
From the Lucene FAQ:
The index database is composed of 'segments', each stored in a separate file. When you add documents to the index, new segments may be created. These are periodically merged together.
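For example, with Solr 1.4 (Lucene 2.9) a small index directory typically looks something like this (exact file names vary with version and settings):

solr/data/index/
  segments.gen   <- pointer to the current commit point
  segments_2     <- the active commit point, listing the live segments
  _0.cfs         <- segment 0, stored as a single compound file
  _1.cfs         <- segment 1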
EDIT:
If you want to examine the contents of your index and tweak or troubleshoot your schema (analysis), see the instructions about the greatest Lucene tool ever, called Luke, in this recent post.

Key Points/Challenges while working with Apache Tika and Solr

Recently I got involved in a task, part of which requires Apache Solr (for document search) and Apache Tika (to extract metadata or plain text from documents).
I haven't integrated Solr and Tika yet, but I have worked with both of them individually. I have a set of questions related to Apache Solr and Apache Tika; they might be at beginner or intermediate level.
With Solr, the kinds of hands-on work I have done include: creating a dummy database, configuring schema.xml, running the Solr server, writing a program that fetches documents from the database and stores them in the Solr document index, building a simple client to fetch data from Solr via the JSON interface, and writing a program that keeps a MySQL database in sync with Solr's document index.
With Tika, I have compiled and installed it and explored its document parsing capabilities.
My sample task statement:
Part of my project requires storing around 100,000 documents for full-text search (the data of these 100,000 DOC, PDF, and TXT files is extracted by Apache Tika, pushed into a MySQL database, and later pushed into Apache Solr's document index) and searching them via a client interface (browser).
At a simple programming level this task will get done.
I would like to understand the challenges related to managing the index (or anything else) in Solr, e.g.:
** At an advanced level, does it require modifying Solr's open source code?
** Even when Solr works properly, does it present any specific challenges?
** What key things need to be considered up front so that Solr works properly?
** Do you think an extra tool needs to be developed to monitor Solr's operation?
I hope this gives you an idea of the questions I have.
** Also, I would like to know if you have any experience using Apache Tika with Apache Solr, and any challenges or key things to consider.
Would you recommend any specific sources, or any documents or other material you feel would be helpful?
