All:
What is the best way to upload a folder of PDF files into Solr for indexing?
Right now I generate a list of files and send a separate indexing request to Solr for each file, but that seems to add a lot of overhead. Can I upload all of the files in a single request?
Thanks
If you are worried about performance, your best bet is to run Apache Tika on the client side and send only the final extracted content to Solr. That is the most efficient way, and it lets you batch multiple extracted documents into one request.
Solr's extraction code just runs Tika under the covers anyway.
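A minimal sketch of the batching idea, under several assumptions: the extraction step is passed in as a hypothetical `extract_text` callable (whatever wraps Tika on your client), the field names `id` and `content` are placeholders for your own schema, and the update URL (core name included) is illustrative. It only shows how extracted documents could be grouped and posted to Solr's JSON update handler.

```python
import json
import urllib.request


def batched(items, size):
    """Yield lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_update_request(solr_update_url, docs):
    """Build one POST request carrying a JSON array of documents."""
    return urllib.request.Request(
        solr_update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def http_send(request):
    """Default sender: perform the HTTP call and discard the response body."""
    with urllib.request.urlopen(request) as response:
        response.read()


def index_folder(paths, extract_text, solr_update_url, batch_size=100, send=http_send):
    """Extract text client-side and send it to Solr in batches.

    `extract_text` is a hypothetical callable (path -> str), e.g. a Tika wrapper.
    The `id` and `content` field names are assumptions about the schema.
    Returns the number of documents sent.
    """
    docs = ({"id": str(p), "content": extract_text(p)} for p in paths)
    count = 0
    for batch in batched(docs, batch_size):
        send(build_update_request(solr_update_url, batch))
        count += len(batch)
    return count
```

With this shape, a folder of 1,000 PDFs and a batch size of 100 would result in 10 requests instead of 1,000. A commit still has to be issued at the end, for example by adding `commit=true` to the final request.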
I want to upload lots of source files (say, Java) to Solr so that they can be indexed and searched.
They should be posted as plain text files.
No special parsing is required.
When I try to upload a single Java file, I get an "Unknown Source" related error:
java.lang.NoClassDefFoundError: com/uwyn/jhighlight/renderer/XhtmlRendererFactory
When I rename the file by appending .txt to its name, it uploads successfully.
I have thousands of files to upload on a daily basis and need to keep their original names.
How do I tell Solr to treat all files in the directory as .txt?
Thanks in advance!
For googlers, concerning the Solr error:
java.lang.NoClassDefFoundError: com/uwyn/jhighlight/renderer/XhtmlRendererFactory
You can fix this by adding the jar "jhighlight-1.0.jar" to Solr. To do so:
1. Download the older Solr 4.9 release. Recent versions do not include jhighlight.
2. Extract solr-4.9.0\contrib\extraction\lib\jhighlight-1.0.jar
3. Copy jhighlight-1.0.jar to the Solr installation under solr/server/lib/ext/
4. Restart the server.
You can achieve the same by integrating Solr with Tika.
Apache Tika will help you extract the text of the source files.
It has a source code parser which supports C, C++ and Java.
Here is a link with more details:
https://tika.apache.org/1.12/formats.html
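Since the question states that no special parsing is required, another option is to skip extraction altogether: read each source file as plain text on the client and index it as an ordinary document, keeping the original name. The sketch below only builds the documents. The field names `id`, `filename` and `content` are assumptions about the schema, and the resulting list would still need to be posted to Solr's JSON update handler.

```python
from pathlib import Path


def source_files_to_docs(root, pattern="*.java"):
    """Read every matching file under `root` as plain text.

    The original file name is preserved in the document; no renaming to .txt
    is needed because the file never passes through type detection.
    Field names are assumptions, not something Solr prescribes.
    """
    docs = []
    for path in sorted(Path(root).rglob(pattern)):
        docs.append(
            {
                # Relative path keeps ids unique across sub-directories.
                "id": path.relative_to(root).as_posix(),
                "filename": path.name,
                # Undecodable bytes are replaced rather than aborting the run.
                "content": path.read_text(encoding="utf-8", errors="replace"),
            }
        )
    return docs
```

Because the files are sent as already-extracted text, the jhighlight-based source code parser is never invoked.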
I'm using Nutch (2.2.1) to crawl and index a set of web pages. These pages contain many .zip files, and each .zip contains many documents. I'll be searching the crawled data with Solr (4.7), and, within Solr, I'd like each document (within each zip) to have its own record.
Can anyone suggest a good way to set this up?
Is it possible to decompress .zip files in Nutch, and to get Nutch to send multiple records to Solr, one for each file inside the .zip? If so, how? Would I need to write a plugin, or can this be done through configuration options alone?
On the other hand, would it make more sense to expand and index the zip files outside of Nutch, using a separate app?
Any advice would be much appreciated.
Thanks!
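For the "separate app" option raised in the question, here is a rough sketch of what expanding a zip into one record per contained file could look like. It says nothing about how Nutch itself handles archives. The record layout (`id`, `container`, `name`, `raw`) and the `!/` separator in the id are made up for illustration, and the raw bytes of each entry would still need text extraction before being indexed.

```python
import io
import zipfile


def zip_to_records(zip_bytes, zip_url):
    """Return one record per file stored in the archive.

    `zip_url` is the URL the archive was downloaded from; it is combined
    with the entry name to form a unique id (an illustrative convention).
    """
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content of their own
            records.append(
                {
                    "id": f"{zip_url}!/{info.filename}",
                    "container": zip_url,
                    "name": info.filename,
                    "raw": archive.read(info),
                }
            )
    return records
```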
I want to send multiple files to Solr using curl. How can I do it?
So far I can only do it with a single file, using a command such as:
curl "http://localhost:8983/solr/update/extract?literal.id=paas2&commit=true" -F "file=@cloud.pdf"
Can anyone help me?
Thanks
The API does not support passing multiple files for extraction in a single request.
Usually the last file will be the only one that gets uploaded and added.
Instead, you can index the files individually, as separate entities in Solr.
Alternatively, one way to upload multiple files is to zip them and upload the zip file.
There is a known issue with Solr indexing zip files; you can try the SOLR-2332 patch.
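Indexing the files individually can be scripted. The sketch below builds one request per file against the `/update/extract` URL from the question, gives each its own `literal.id`, and commits only on the last request. It assumes the handler accepts the file as the raw request body with a matching Content-Type (the equivalent of curl's `--data-binary`) and uses the file name as the id, which is an illustrative choice only.

```python
import mimetypes
import urllib.parse
import urllib.request
from pathlib import Path


def build_extract_requests(paths, extract_url):
    """Yield one POST request per file for the extract handler.

    Each file gets its own literal.id (here: the file name, so names must
    be unique). Only the final request carries commit=true.
    """
    files = [Path(p) for p in paths]
    for position, path in enumerate(files):
        params = {"literal.id": path.name}
        if position == len(files) - 1:
            params["commit"] = "true"
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        yield urllib.request.Request(
            f"{extract_url}?{urllib.parse.urlencode(params)}",
            data=path.read_bytes(),  # file content is read only when needed
            headers={"Content-Type": content_type},
            method="POST",
        )
```

Each request could then be sent with `urllib.request.urlopen(request)`. Committing once at the end rather than per file avoids a commit for every document.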
I am using Apache Solr 4.0 Beta, which can upload multiple files and generate an id for each uploaded file using post.jar, and it is very helpful for me.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29
Thanks all :)
My problem is solved :)
I integrated Tika with Solr following the instructions provided in this link
Correct me if I am wrong, but it seems to me that it can index document files (pdf, doc, audio) located on my own system (given the path of the directory in which they are stored), but cannot index files located on the internet when I crawl sites using Nutch.
Can I index document files (pdf, audio, doc, zip) located on the web using Tika?
There are basically two ways to index binary documents within Solr, both with Tika:
1. Using Tika on the client side to extract information from binary files and then manually indexing the extracted text within Solr.
2. Using ExtractingRequestHandler, through which you can upload the binary file to the Solr server so that Solr does the work for you. This way Tika is not required on the client side.
In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to extract text content from them, and then index the data in Solr as it would normally do with text documents. Nutch already uses Tika, so I guess it's just a matter of configuring which document types you want to index: edit the regex-urlfilter.txt Nutch config file and remove the file extensions that you want to index from the following lines.
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
This way you would be using the first option I mentioned. You then need to enable the Tika plugin in Nutch within your nutch-site.xml; have a look at this discussion from the Nutch mailing list.
This should theoretically work, let me know if it doesn't.
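To illustrate the effect of editing the suffix rule, here is a small check written with Python's `re` module. It assumes that the leading `-` marks a reject rule and that the remainder is matched anywhere in the URL. Nutch evaluates the rule with Java regular expressions, so treat this only as an approximation for simple patterns like this one. The helper names are hypothetical.

```python
import re

# The rule as it appears in regex-urlfilter.txt.
ORIGINAL_RULE = (
    r"-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV"
    r"|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG"
    r"|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)


def allow_extensions(rule, extensions):
    """Return the rule with the given extensions removed (case-insensitive)."""
    prefix, rest = rule.split("(", 1)
    body, suffix = rest.rsplit(")", 1)
    wanted = {ext.lower() for ext in extensions}
    kept = [alt for alt in body.split("|") if alt.lower() not in wanted]
    return f"{prefix}({'|'.join(kept)}){suffix}"


def is_skipped(rule, url):
    """True if a '-' (reject) rule matches the URL."""
    return rule.startswith("-") and re.search(rule[1:], url) is not None


# Stop skipping PDF, DOC and ZIP files so that they get fetched and parsed.
EDITED_RULE = allow_extensions(ORIGINAL_RULE, ["pdf", "doc", "zip"])
```

With the edited rule, `is_skipped(EDITED_RULE, "http://example.com/report.pdf")` is false, while image and stylesheet URLs are still rejected.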
I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using NutchWAX (a customised version of Apache Nutch, tailored to index .warc files as generated by Heritrix). NutchWAX dumps out a Lucene index, and to use it in Solr all that has to be done is to generate a correct schema.
This is all done and it's running like it should. However, the archive is not static, and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one, and import it back into Solr. However, to do that Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first shot at this was generating .xml files out of the Lucene index file and posting them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably leverage the use of multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could leverage the MergeIndexes capability or the ability to Swap cores for a better experience in your scenario.
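As a rough illustration, the two CoreAdmin calls mentioned above can be expressed as plain HTTP URLs. The sketch only builds the URLs. The base URL, core names and index directory are placeholders, and the helper functions themselves are hypothetical.

```python
from urllib.parse import urlencode


def core_admin_url(base, action, **params):
    """Build a CoreAdmin URL such as <base>/admin/cores?action=..."""
    query = {"action": action, **params}
    return f"{base}/admin/cores?{urlencode(query)}"


def merge_indexes_url(base, target_core, index_dir):
    """Merge an on-disk Lucene index (e.g. a freshly built one) into a core."""
    return core_admin_url(base, "MERGEINDEXES", core=target_core, indexDir=index_dir)


def swap_url(base, core, other):
    """Swap two cores, e.g. a freshly built 'staging' core with 'live'."""
    return core_admin_url(base, "SWAP", core=core, other=other)
```

For example, `swap_url("http://localhost:8983/solr", "live", "staging")` gives the URL to request once the staging core holds the new index. Neither call requires restarting Solr. After a merge, a commit on the target core is typically needed before the merged documents become visible.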