I integrated Tika with Solr following the instructions provided in this link
Correct me if I am wrong, but it seems that it can index document files (pdf, doc, audio) located on my own system (given the path of the directory in which those files are stored), yet cannot index such files located on the internet when I crawl sites using Nutch.
Can I index document files (pdf, audio, doc, zip) located on the web using Tika?
There are basically two ways to index binary documents within Solr, both with Tika:
Using Tika on the client side to extract information from binary files and then manually indexing the extracted text within Solr
Using the ExtractingRequestHandler, through which you can upload the binary file to the Solr server so that Solr does the extraction work for you. This way Tika is not required on the client side.
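For the first option, a rough SolrJ/Tika sketch might look like the following. This is only an illustration, not a drop-in solution: the Solr URL, core name and field names are assumptions, and the class names shown are the Solr 4.x-era ones (HttpSolrServer was later renamed to HttpSolrClient).

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class IndexBinaryFile {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Option 1: extract the text on the client with Tika, then index it manually.
        String text = new Tika().parseToString(new File("report.pdf"));
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "report.pdf");
        doc.addField("content", text);   // assumes a "content" text field exists in the schema
        solr.add(doc);
        solr.commit();

        // Option 2: upload the raw file to the ExtractingRequestHandler and let Solr run Tika.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("report.pdf"), "application/pdf");
        req.setParam("literal.id", "report.pdf");
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(req);
    }
}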
In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to generate text content out of them, and then index that data in Solr as it would normally do with text documents. Nutch already uses Tika; I guess it's just a matter of configuring the types of documents you want to index by editing the regex-urlfilter.txt Nutch config file and removing from the following line the file extensions that you want indexed:
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
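For example, if you want Nutch to fetch and parse PDF and Word documents, the same line with doc|DOC and pdf|PDF removed would look like this:
-\.(swf|SWF|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$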
This way you would be using the first option I mentioned. Then you need to enable the Tika plugin in Nutch within your nutch-site.xml; have a look at this discussion from the Nutch mailing list.
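For reference, enabling it usually comes down to making sure parse-tika appears in the plugin.includes property of nutch-site.xml. The value below is only a sketch; the exact plugin list differs between Nutch versions, so start from the default in your own nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>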
This should theoretically work, let me know if it doesn't.
I'm using Nutch (2.2.1) to crawl and index a set of web pages. These pages contain many .zip files, and each .zip contains many documents. I'll be searching the crawled data with Solr (4.7), and, within Solr, I'd like each document (within each zip) to have its own record.
Can anyone suggest a good way to set this up?
Is it possible to decompress .zip files in Nutch, and to get Nutch to send multiple records to Solr, one for each file inside the .zip? If so, how? Would I need to write a plugin, or can this be done through configuration options alone?
On the other hand, would it make more sense to expand and index the zip files outside of Nutch, using a separate app?
Any advice would be much appreciated.
Thanks!
I wanted to index text files. After a lot of searching I got to know about Apache Tika. Now on some sites where I studied Apache Tika, I read that it converts the text into XML format and then sends it to Solr. But while converting, it creates only one tag, for example
.......
Now the text file I wish to index is a Tomcat localhost access log. This file is gigabytes in size. I cannot store it as a single indexed document. I want each line to have a line-id
.......
so that I can easily retrieve the matching line.
Can this be done in Apache Tika?
Solr with Tika supports extraction of data from multiple file formats.
The complete list of supported file formats can be found at this link.
You can provide as input any of the above file formats, and Tika will be able to autodetect the file format, extract the text, and provide it to Solr for indexing.
Edit :-
Tika does not convert the text file to XML before sending it to Solr.
Tika would just extract the metadata and the content of the file and populate fields in Solr as per the mapping defined.
You either have to feed the entire file as input to Solr, in which case it would be indexed as a single document, OR you have to read the file line by line and provide each line to Solr as a separate document.
Solr and Tika would not handle this for you.
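If you go the do-it-yourself route, a minimal SolrJ sketch could read the access log line by line and send each line as its own document. The field names, id scheme and Solr URL below are assumptions, and HttpSolrServer is the Solr 4.x-era class name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexLogLines {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        try (BufferedReader reader = new BufferedReader(new FileReader("localhost_access_log.txt"))) {
            List<SolrInputDocument> batch = new ArrayList<>();
            String line;
            long lineId = 0;
            while ((line = reader.readLine()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "access-log-" + (++lineId)); // the per-line id the question asks for
                doc.addField("line_number", lineId);
                doc.addField("content", line);
                batch.add(doc);
                if (batch.size() == 1000) {        // send in batches, since the file is GBs in size
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
        }
    }
}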
You may want to look at DataImportHandler to parse the file into lines or entries. It is a better match than running Tika on something that already has internal structure.
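A DataImportHandler configuration built around LineEntityProcessor (which emits one row per line of the file, with the text in a rawLine column) would look roughly like the sketch below. The file path and field names are assumptions, and you would still need a transformer or an auto-generated id to get a unique per-line id:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="logLines"
            processor="LineEntityProcessor"
            url="/var/log/tomcat/localhost_access_log.txt"
            rootEntity="true">
      <field column="rawLine" name="content" />
    </entity>
  </document>
</dataConfig>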
I have a large number of documents (mainly PDFs) that I want to index and query on.
I want to store all these docs in a filesystem structure by year.
I currently have this setup in Solr, but I have to run scripts to extract metadata from the PDFs and then update the index.
Is there a product out there that basically lets me pop a new PDF into a folder and have it automatically indexed by Solr?
I have seen that Alfresco does this, but it's got some drawbacks. Is there anything else along these lines?
Or would I use Nutch to crawl my filesystem and post updates to Solr? I'm not sure how I should do this.
Solr is a search server, not a crawler. As you noted, Nutch can do this (I have used it for a similar use case, indexing a knowledge base dump).
Essentially, you would host a web server with the root of the folder structure as its document root, and allow directory listing on that web server. Nutch could then crawl the top-level URL of this document dump.
Once you have this Nutch-created index, you can then expose it through Solr as well.
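With Nutch 1.x this could look something like the lines below; the host name is made up and the exact crawl command and flags differ between Nutch versions, so check the wiki for your release:

echo "http://fileserver.example.com/docs/" > urls/seed.txt
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 -solr http://localhost:8983/solr/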
I am looking for a way to configure Nutch to crawl the web, but only index certain types of files (XML to be specific) into Solr. I'm pretty sure a custom plugin would do the job, probably based on the index-more code, but I'd rather not do that unless I have to. I'm also sure I could suck everything into Solr then delete unwanted content with Solr's API, but this is a bit hacky. Is there a way to configure Nutch to only index certain filetypes in Solr?
In Nutch you can define filters for URLs. What about filtering by file extension?
You can filter the file type according to the extension.
You can specify the extensions you want to include or exclude in regex-urlfilter.txt
e.g. for exclusion (-) :-
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
With + you can instead specify an inclusion list.
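For example, to accept only XML files you could use an accept rule like the one below. Keep in mind that regex-urlfilter.txt controls what Nutch fetches, not just what gets indexed, so if you reject everything else the crawler will also stop following the HTML pages that link to those XML files:

+\.(xml|XML)$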
I am using Tika integrated in SOLR to index documents and allow search on said documents. This works pretty smoothly (right now my setup is exactly the same as the example that ships with SOLR) and I can indeed index and search documents. As well as indexing the document, I would like to store the binary version in SOLR so that when a search returns a result I can return the full PDF/Word/etc. document for download. Is this possible?
Nope.
Solr is a full-text search engine and does not provide any out-of-the-box implementation for storing binary files.
Instead, you can easily host the binary files outside Solr and serve them over HTTP, linked to the indexed documents through their id.