I want to upload lots of source files (say, Java) to Solr to allow indexed search on them.
They should be posted as plain text files; no special parsing is required.
When I try to upload a single Java file I get an "Unknown Source" related error:
java.lang.NoClassDefFoundError: com/uwyn/jhighlight/renderer/XhtmlRendererFactory
When I rename the file, adding .txt at the end, it is uploaded successfully.
I have thousands of files to upload on a daily basis and need to keep the original names.
How do I tell Solr to treat all files in the directory as .txt?
Thanks in advance!
For googlers, concerning the Solr error:
java.lang.NoClassDefFoundError: com/uwyn/jhighlight/renderer/XhtmlRendererFactory
You can correct this by adding the jar "jhighlight-1.0.jar" to Solr. To do so:
Download the old Solr 4.9 (in recent versions, jhighlight is not present).
Extract solr-4.9.0\contrib\extraction\lib\jhighlight-1.0.jar
Copy jhighlight-1.0.jar into the Solr installation under solr/server/lib/ext/
Restart the server.
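On a Unix-like system the fix is two commands; a minimal sketch, assuming the old distribution was unpacked to ~/solr-4.9.0 and the running installation lives in /opt/solr (adjust both paths to your layout):

# copy the missing jar from the Solr 4.9 distribution into the running install
cp ~/solr-4.9.0/contrib/extraction/lib/jhighlight-1.0.jar /opt/solr/server/lib/ext/
# restart so the jar is picked up
/opt/solr/bin/solr restart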
You can achieve the same by integrating Solr with Tika.
Apache Tika will help you extract the text of the source files.
It has a source code parser which supports C, C++ and Java.
Here is the link which will give you more details:
https://tika.apache.org/1.12/formats.html
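To see what Tika's source-code parser extracts before involving Solr, you can run Tika standalone; a minimal sketch, assuming the tika-app jar for the 1.12 release linked above and a hypothetical MyClass.java:

# print the plain text Tika extracts from a Java source file
java -jar tika-app-1.12.jar --text MyClass.java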
I have some information in a text file. I want to index it in Solr. What should the procedure be, and is there any tool that can be used for indexing in Solr? Please guide me in detail, as I am not very familiar with Solr.
I'd refer you to the Solr DataImportHandler page; it has a comprehensive tutorial on how to import data from various sources. Importing text files is covered under FileDataSource.
One way would be to convert the plain text into a CSV file. You can then use the CSV file uploading process to index the data in Solr; check the documentation here for more configuration options.
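A minimal sketch of that upload, assuming a data.csv whose header row names fields that exist in your schema (on Solr 4.x the handler is mapped at /update/csv; newer versions take the same request at /update with a text/csv content type):

# stream the CSV to Solr and commit in one request
curl "http://localhost:8983/solr/update/csv?commit=true" --data-binary @data.csv -H "Content-type: text/csv; charset=utf-8"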
I wanted to index text files. After searching a lot I got to know about Apache Tika. On some sites where I studied Apache Tika, I read that it converts the text into XML format and then sends it to Solr. But while converting, it creates only one tag, for example
.......
Now the text file I wish to index is a Tomcat localhost access log. This file is gigabytes in size; I cannot index it as a single document. I want each line to have a line-id
.......
so that I can easily retrieve the matching line.
Can this be done in Apache Tika?
Solr with Tika supports extraction of data from multiple file formats.
The complete list of supported file formats can be found on the Tika formats page (https://tika.apache.org/1.12/formats.html).
You can provide any of these file formats as input; Tika will autodetect the format, extract the text from the files, and provide it to Solr for indexing.
Edit:
Tika does not convert the text file to XML before sending it to Solr.
Tika just extracts the metadata and the content of the file and populates fields in Solr as per the mapping defined.
You either have to feed the entire file as input to Solr, in which case it is indexed as a single document, or you have to read the file line by line and send each line to Solr as a separate document.
Solr and Tika will not handle this for you.
You may want to look at DataImportHandler to parse the file into lines or entries. It is a better match than running Tika on something that already has internal structure.
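A minimal sketch of such a DIH configuration, assuming a hypothetical log path of /var/log/tomcat/localhost_access_log.txt and a Solr field named rawLine; LineEntityProcessor emits one document per line of the file, with the line text under the column rawLine:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <!-- one Solr document per line of the access log -->
    <entity name="line"
            processor="LineEntityProcessor"
            url="/var/log/tomcat/localhost_access_log.txt">
      <field column="rawLine" name="rawLine" />
    </entity>
  </document>
</dataConfig>

The per-line id the question asks for is not generated by this alone; you would still add a transformer (or a UUID update processor) to stamp each document with one.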
I'm exploring Solr 4 and polygons/linestrings.
There is some info on it here, but not a how-to/installation guide for a basic user like me.
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4
As far as I understand, you need to install the Spatial4j code into Solr. (I'm a hack at best.)
https://github.com/spatial4j/spatial4j/tree/master/src/main/java
Does anyone know where I upload this code to inside the Solr 4 installation? Keep in mind I'm using the /example/solr/collection1 directory.
"Due to a combination of things, JTS can't simply be referenced by a <lib/> entry in solrconfig.xml; it needs to be in WEB-INF/lib in Solr's war file, basically." Does anyone know what that means as an installation instruction? I'm after some guidance on what goes where. I use start.jar to start Solr on my Apache server.
After that, I understand that I simply need to add a field type and a field to the schema, and as far as that goes it should be installed.
I'm trying to send it polygon and linestring queries to find all documents within a polygon or within a radius of a line.
Solr includes Spatial4j already; what it doesn't have is JTS, which is a Java library (.jar file). Download JTS from https://sourceforge.net/projects/jts-topo-suite/ (the .jar is within the .zip distro).
WEB-INF/lib is a Java webapp reference within a WAR file; example/webapps/solr.war is where that is. A .war file is really a zip, and can either be in its '.war' file form or be uncompressed into a plain directory layout. So if you rename the '.war' to '.zip', on OSX it's trivial to double-click it in order to expand it. Then rename the resulting directory to 'solr.war', and put the original war file aside somewhere else, as you won't be using it for now.
Take the JTS jar and put it in solr.war/WEB-INF/lib/. When you start Solr, it'll have access to JTS. If it doesn't have access for whatever reason, you'll get a ClassNotFoundException pertaining to a JTS-related Java class.
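The same steps from a shell; a rough sketch, assuming you are in Solr's example/ directory and the downloaded jar is named jts-1.13.jar (the name depends on the JTS release):

cd webapps
mv solr.war solr-original.war     # put the packed war aside
mkdir solr.war && cd solr.war
jar xf ../solr-original.war       # unpack the war into a directory also named solr.war
cp ~/Downloads/jts-1.13.jar WEB-INF/lib/
cd ../..
java -jar start.jar               # Jetty now serves the exploded directory, JTS included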
I want to send multiple files to Solr using curl. How can I do it?
I can do it with only one file, with a command like:
curl "http://localhost:8983/solr/update/extract?literal.id=paas2&commit=true" -F "file=@cloud.pdf"
Can anyone help me?
Thanks
The API does not support passing multiple files for extraction.
Usually the last file will be the only one that gets uploaded and added.
You can have the individual files indexed as separate documents in Solr, as sketched below.
Alternatively, one way to upload multiple files at once is to zip them and upload the zip file.
There is an issue with Solr indexing zip files, though; you can try the SOLR-2332 patch.
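A minimal sketch of the one-document-per-file approach, assuming PDFs in the current directory, each filename usable directly as the document id, and the default core URL:

# post each file as its own document, committing once at the end
for f in *.pdf; do
  curl "http://localhost:8983/solr/update/extract?literal.id=$f" -F "file=@$f"
done
curl "http://localhost:8983/solr/update?commit=true"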
I am using Apache Solr 4.0 Beta, which can upload multiple files and generate an id for each uploaded file using post.jar, and it's very helpful for me.
See:
http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29
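A minimal sketch of that usage, assuming you run post.jar from the exampledocs directory of the Solr 4.x distribution; auto mode detects each file's content type and derives an id from its path:

# index several rich documents in one run
java -Dauto=yes -jar post.jar *.pdf *.doc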
Thanks all :)
My problem is solved :)
I integrated Tika with Solr following the instructions provided in this link.
Correct me if I am wrong, but it seems to me that it can index document files (pdf, doc, audio) located on my own system (given the path of the directory in which those files are stored), but cannot index files located on the internet when I crawl sites using Nutch.
Can I index document files (pdf, audio, doc, zip) located on the web using Tika?
There are basically two ways to index binary documents within Solr, both with Tika:
Using Tika on the client side to extract information from binary files and then manually indexing the extracted text within Solr
Using ExtractingRequestHandler, through which you can upload the binary file to the Solr server so that Solr does the work for you. This way Tika is not required on the client side.
In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to generate text content out of them, and then index the data in Solr as it normally does with text documents.
Nutch already uses Tika; I guess it's just a matter of configuring the type of documents you want to index. Change the regex-urlfilter.txt Nutch config file by removing from the following line the file extensions that you want to index:
# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
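For example, to let Nutch fetch PDF and Word documents, the same line with pdf|PDF and doc|DOC removed would be:

# skip some suffixes (pdf and doc now allowed through)
-\.(swf|SWF|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$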
This way you would use the first option I mentioned. Then you need to enable the Tika plugin in Nutch within your nutch-site.xml; have a look at this discussion from the nutch mailing list, and at the sketch below.
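A sketch of the relevant nutch-site.xml property; the exact plugin list varies by Nutch version and setup, the key point being that parse-tika appears in plugin.includes:

<!-- enable Tika parsing alongside the usual Nutch plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>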
This should theoretically work, let me know if it doesn't.