Uploaded data is not visible in Solr

I am using Solr 4.7. I created a new core by copying "collection1" (the default example provided by Solr) to a different name, say "wiki", and updated core.properties with the new name. The new core is visible in the Solr admin panel.
After starting Solr, I tried to import data into the new core like this:
$ java -jar post.jar ../../../enwiki-20150602-pages-articles1.xml -Durl='http://localhost:8983/solr/#/wiki/update'
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file enwiki-20150602-pages-articles1.xml
SimplePostTool: WARNING: No files or directories matching -Durl=http:/localhost:8983/solr/#/wiki/update
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:03.671
I also tried
$ java -jar post.jar ../../../enwiki-20150602-pages-articles1.xml
But when I query from the Solr admin panel I still don't get any data. So my question is: if the data has been indexed, why can't I see it? Where exactly am I going wrong?

Not sure whether this has been resolved.
I had the exact same problem. You need to specify the path/location of the file to be ingested.
C:\test-solr>java -Durl=http://localhost:8983/solr/testdemo/update -Dtype=text/csv -jar C:/test-solr/exampledocs/post.jar "C:\test-solr\exampledocs\ingestMeFile.csv"
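For the original command in the question there are likely two separate problems worth calling out. A corrected sketch, assuming the core really is named wiki and Solr runs on the default port:

```shell
# (1) -D system properties must come BEFORE -jar; placed after post.jar they
#     are handed to the tool as file arguments, which is exactly why the tool
#     printed "WARNING: No files or directories matching -Durl=..." and then
#     posted to the default /solr/update instead.
# (2) The update endpoint is /solr/<core>/update; the "#/wiki" form is the
#     admin UI's browser route, not an HTTP API path.
java -Durl='http://localhost:8983/solr/wiki/update' \
     -jar post.jar ../../../enwiki-20150602-pages-articles1.xml
```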

Related

How to specify file types when indexing Solr

I've been indexing a directory of folders/files containing HTML pages, docs, PPTs, PDFs, etc. I noticed files with a LOG extension being indexed, and I don't want them indexed because their contents aren't needed.
To index to Solr I've been using this command (I am a Windows user, so I use the simple post tool):
java -Dc=collection -Dport=4983 -Drecursive -Dauto -jar example/exampledocs/post.jar c:/folder
Instead, I tried the following command to exclude LOG files:
java -Dc=collection -Dport=4983 -Drecursive -Dfiletypes=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt -jar example/exampledocs/post.jar c:/folder
Solr refuses to index and throws errors (HTTP 400). -Dfiletypes should be a valid option, but Solr doesn't seem to like it. I even tried wrapping the list of file types in [] and it won't work. Is my syntax wrong?
If I add -Dauto, it works! (It seems -Dfiletypes only takes effect in auto mode; without -Dauto the tool posts everything with a single default content type, which is presumably what triggered the 400 errors.)
java -Dc=collection -Dport=4983 -Drecursive -Dauto -Dfiletypes=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt -jar example/exampledocs/post.jar c:/folder

SimplePostTool: FATAL: specifying either url or core/collection is mandatory

I'm new to Solr and I want to run an example from the exampledocs folder, but when I start it from the Windows prompt I get the error message in the title.
Can someone help me?
When running the Solr Quick Start on Windows I faced the same problem. In the comments of the guide I found what worked for me:
C:\opt\solr-5.2.1>java -Dc=gettingstarted -jar example\exampledocs\post.jar example\exampledocs\*.xml
That got me the following output:
C:\opt\solr-5.2.1>java -Dc=gettingstarted -jar example\exampledocs\post.jar example\exampledocs\*.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update using content-type application/xml...
POSTing file gb18030-example.xml to [base]
POSTing file hd.xml to [base]
POSTing file ipod_other.xml to [base]
POSTing file ipod_video.xml to [base]
POSTing file manufacturers.xml to [base]
POSTing file mem.xml to [base]
POSTing file money.xml to [base]
POSTing file monitor.xml to [base]
POSTing file monitor2.xml to [base]
POSTing file mp500.xml to [base]
POSTing file sd500.xml to [base]
POSTing file solr.xml to [base]
POSTing file utf8-example.xml to [base]
POSTing file vidcard.xml to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:22.043
The -Dc parameter does the trick.
On the other hand, running it with only the docs folder did not work, but that is outside the scope of this answer; I only wanted to show how to get it running at all.
Start your Solr Admin.
Create a core using the following command:
solr create -c <name>
Select the core from the core selector dropdown.
Click on Documents, then add the data to the core. You can specify the document format.
OR
Type this command in solr-(versionname)\example\exampledocs:
java -Dc=my_core -Dtype=text/csv -jar post.jar test.csv
where my_core is the core name, the type is CSV, and test.csv is the file to be imported.
With version 5, the default collection has effectively gone away. So, there is no way for the tool to know which URL to use to connect to your collection.
If you are using examples, then your server is most probably default at localhost:8983 and you only need to specify the collection by name. If you are doing something more tricky, you may need to specify the whole URL.
You need to mention the collection name; with newer versions you can simply write this command:
bin/post -c gettingstarted example/exampledocs/*.xml
Executing the command line given in the tutorial (for Unix) helped. On Windows with Solr 5.4.1 my command line looked like this:
c:\solr-5.4.1> java -classpath dist/solr-core-5.4.1.jar -Dauto=yes -Dc=gettingstarted -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool docs/

Need help understanding Solr

I'm just getting started with Nutch and Solr. I ran the crawl once with just one seed URL.
I ran this command:
bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3 -topN 5
Everything goes fine, and I'm assuming Solr indexes the pages? So how do I go about searching now? I went to localhost:8983/solr/admin/ but when I enter a search query and click search I get this:
HTTP ERROR 400
Problem accessing /solr/select/.
Reason: undefined field text
I also tried an example from the tutorial but when I run this command:
java -jar post.jar solr.xml monitor.xml
I get this:
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solr.xml
SimplePostTool: FATAL: Solr returned an error #400 ERROR: [doc=SOLR1000] unknown field 'name'
My ultimate goal is to somehow add this data into Accumulo and use it for a search engine.
I'm assuming you are using Nutch 1.4 or later. If so, you need to change the type of the fields you added in the solr/conf/schema.xml file from "text" to "text_general" (without the quotes).
I am working toward a similar goal right now and used that fix to at least get Solr working properly, although I still cannot get Solr to search the indexed sites. Hope this helps; let me know if you get it working.
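As a sketch, the change described above looks like this in solr/conf/schema.xml (the field names below are illustrative; apply it to whichever fields Nutch added to your schema):

```xml
<!-- Before: <field name="content" type="text" stored="true" indexed="true"/> -->
<field name="content" type="text_general" stored="true" indexed="true"/>
<field name="title"   type="text_general" stored="true" indexed="true"/>
```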

#500 Internal Server Error when trying to add PDF to Solr index with extraction

I am a first-time Solr user, using v3.5 with Tomcat 7 on a Windows 7 system. I went through the XML example in example-docs with no problems. However, I'm going to need to use extraction with HTML and PDF files, and when I try to POST a PDF file for indexing I get the following:
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8080/solr/update/extract?literal.id=doc2..
SimplePostTool: POSTing file test.pdf
SimplePostTool: FATAL: Solr returned an error #500 Internal Server Error
The command I used is:
java -Durl=http://localhost:8080/solr/update/extract?literal.id=doc2 -Dtype=application/pdf -jar post.jar test.pdf
My solr home directory is C:\solr, where I have done the following so far:
Copied the contents of the solr download package's example/solr folder
Copied the solr download package's contrib/extraction/lib folder to C:\solr\lib
Copied the solr download package's dist/apache-solr-cell-3.5.0.jar to C:\solr\dist\apache-solr-cell-3.5.0.jar
Modified the appropriate "lib" tags in C:\solr\conf\solrconfig.xml to <lib dir="lib" /> and <lib dir="dist/" regex="apache-solr-cell-\d.*\.jar" />
What else do I need to do to make this work for PDF and HTML files? I've read multiple tutorials and "Getting Started" guides but can't seem to understand what's wrong. I'm also a Tomcat beginner and as far as I can tell, none of this is showing up in Tomcat's logs ... so I'm pretty much stuck. Again, I'm not having any problem with the XML example, so Tomcat itself is running fine and recognizes solr (I can see the solr admin page). Any help is appreciated.
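For reference, in Solr 3.5 the extracting handler also has to be registered in solrconfig.xml alongside the lib directives listed above; a minimal sketch (the fmap.content target field is an assumption, adjust it to your schema):

```xml
<!-- solrconfig.xml: load Solr Cell and register /update/extract -->
<lib dir="lib" />
<lib dir="dist/" regex="apache-solr-cell-\d.*\.jar" />
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map extracted body text into an existing indexed field -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
  </lst>
</requestHandler>
```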

Parsing (using Tika) on a remote Glassfish

I'm using the Tika parser to index my files into Solr. I created my own parser (which extends XMLParser); it uses my own MIME type.
I created a jar file whose contents look like this:
src
|- main
|  |- some_packages
|     |- MyParser.java
|- resources
   |- META-INF
   |  |- services
   |     |- org.apache.tika.parser.Parser  (which contains the line: some_packages.MyParser)
   |- org
      |- apache
         |- tika
            |- mime
               |- custom-mimetypes.xml
In custom-mimetypes.xml I put the definition of the new MIME type, because my XML files have some special tags.
Now here is the problem: I had been testing parsing and indexing with Solr on Glassfish installed on my local machine, and it worked just fine. Then I wanted to install it on a remote server running the same version of Glassfish (3.1.1). I copied over the Solr application and its home directory with all libraries (including the Tika jars and the jar with my custom parser). Unfortunately it doesn't work: after posting files to Solr I can see from the content-type field that my custom MIME type was detected, but the fields my parser is supposed to produce are missing, as if the MyParser class was never run. The only fields I get are the ones from Dublin Core. I checked (by simply adding some printlns) that Tika is only using XMLParser.
Has anyone had a similar problem? How can I handle this?
The problem was that I was using Java 7 to compile my parser, but Apache Tika was compiled with Java 5...
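A sketch of a compile command that pins the bytecode target to match an older runtime (the jar name and target version here are assumptions; match them to your environment):

```shell
# Compile the custom parser for an older bytecode target, so the remote JVM
# can load the class; if the class file version is too new for the runtime,
# Tika's service loader skips it and silently falls back to XMLParser.
javac -source 1.5 -target 1.5 \
      -cp tika-core.jar \
      some_packages/MyParser.java
```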