Solr Error Uploading File using ContentStreamUpdateRequest

I am using the following SolrJ code to index a document:
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
ContentStreamBase.FileStream fs = new ContentStreamBase.FileStream(new File(filename));
req.setWaitSearcher(false);
req.setMethod(METHOD.POST );
//req.addFile(new File(filename), null);
req.addContentStream(fs);
req.setParam("literal.id", filename);
req.setParam("resource.name", filename);
//req.setParam("uprefix", "attr_");
//req.setParam("fmap.content", "attr_content");
//req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
NamedList<Object> result = this.mHttpSolrClient.request(req);
However, I always get a SolrException right after the request, which says:
4767120 [qtp559670971-21] INFO org.apache.solr.update.processor.LogUpdateProcessor [ test] – [test] webapp=/solr path=/update/extract params={waitSearcher=false&resource.name=/tmp/RUNNING.txt&literal.id=/tmp/RUNNING.txt&wt=javabin&version=2} {} 0 6
4767121 [qtp559670971-21] ERROR org.apache.solr.core.SolrCore [ test] – org.apache.solr.common.SolrException: ERROR: [doc=/tmp/RUNNING.txt] Error adding field 'stream_size'='null' msg=For input string: "null"
I am using the latest version of Solr, 5.1.0. Any ideas?

I fixed the issue by enabling multipart POST on the HttpSolrClient. Don't ask me why it works, because I don't know; the Solr documentation just does not explain it well.
mHttpSolrClient.setUseMultiPartPost(true);
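For completeness, here is a minimal sketch of the working sequence with multipart POST enabled. The core URL, the filename, and the commit at the end are my own assumptions for illustration; the request itself mirrors the code in the question.
import java.io.File;
import org.apache.solr.client.solrj.SolrRequest.METHOD;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.common.util.NamedList;

String filename = "/tmp/RUNNING.txt";

// Assumed core URL; point this at your own core.
HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/test");
client.setUseMultiPartPost(true); // the fix: send the content stream as a multipart POST

ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.setMethod(METHOD.POST);
req.addContentStream(new ContentStreamBase.FileStream(new File(filename)));
req.setParam("literal.id", filename);
req.setParam("resource.name", filename);

NamedList<Object> result = client.request(req);
client.commit();
client.close();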

Related

Solr 8 upgrade and stream.body

I'm upgrading Solr from 6.x to 8.x. In the past, we used to build our request like this in our PHP script:
$aPostData = array(
    'stream.body'  => '{"add": {"doc":{...stuff here...}}',
    'commit'       => 'true',
    'collection'   => 'mycollection',
    'expandMacros' => 'false'
);
$oBody = new \http\Message\Body();
$oBody->addForm($aPostData);
and then send it to our Solr server at /solr/mycollection/update/json. That worked just fine in 6.x, but now that I've upgraded to 8.x, I'm receiving the following response from Solr:
{
  "responseHeader":{
    "status":400,
    "QTime":1
  },
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"missing content stream",
    "code":400
  }
}
Digging around, I ran across the following:
https://issues.apache.org/jira/browse/SOLR-10748
and
Solr error - Stream Body is disabled
I tried following the suggestions of both answers. For the first one, I now see a file called "configoverlay.json" in my ./conf directory and it has those settings. For the second answer, I set it up so my requestParsers node had those attributes. However, neither worked. I've searched around, but at this point I'm at my wit's end. How can I make it so that I can continue using "stream.body"? If I shouldn't be using "stream.body", is there some other request variable that I can/should use when sending my data? I couldn't find anything in the documentation. Perhaps I was looking in the wrong place?
Any help would be greatly appreciated.
thnx,
Christoph
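(No answer was recorded here, but for reference: SOLR-10748, linked above, is the change that disabled stream.body by default. One way to switch it back on is a set-property call to Solr's Config API; the property name is the requestParsers setting described in the Solr Reference Guide, while the host and collection below are illustrative. The usual alternative is to stop using stream.body and send the JSON document as the actual request body of the POST instead of as a form field. A minimal Java sketch of the Config API call:)
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Host and collection are illustrative; adjust to your setup.
URL url = new URL("http://localhost:8983/solr/mycollection/config");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("POST");
conn.setRequestProperty("Content-Type", "application/json");
conn.setDoOutput(true);

// Re-enable stream.body, which SOLR-10748 turned off by default.
String body = "{\"set-property\": {\"requestDispatcher.requestParsers.enableStreamBody\": true}}";
try (OutputStream out = conn.getOutputStream()) {
    out.write(body.getBytes(StandardCharsets.UTF_8));
}
System.out.println("HTTP " + conn.getResponseCode());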

Error while indexing documents in solr - SolrException

I am using the following code to index documents on the Solr server:
String urlString = "http://localhost:8080/solr";
SolrServer solr = new CommonsHttpSolrServer(urlString);
java.io.File file=new java.io.File("C:\\Users\\Guruprasad\\Desktop\\Search\\47975832.doc");
if (file.canRead()) {
    System.out.println("adding " + file);
    try {
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        String parts[] = file.getName().split("\\.");
        String type = "text";
        if (parts.length > 1) {
            type = parts[1];
        }
        req.addFile(file);
        req.setParam("literal.id", file.getAbsolutePath());
        req.setParam("literal.name", file.getName());
        req.setParam("literal.content_type", type);
        req.setParam("uprefix", "attr_");
        req.setParam("fmap.content", "attr_content");
        req.setAction(ACTION.COMMIT, true, true);
        solr.request(req); // line 36: the exception is thrown here
    } catch (Exception e) {
        e.printStackTrace();
    }
}
While executing this code I am getting the following exception:
Exception: org.apache.solr.common.SolrException
Exception message:
Internal Server Error Internal Server Error request:
http://localhost:8080/solr/update/extract?literal.id=C:\Users\Guruprasad\Desktop\Search\47975832.doc&literal.name=47975832.doc&literal.content_type=doc&uprefix=attr_&fmap.content=attr_content&commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=2
Exception trace:
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at com.solr.search.test.IndexFiles.indexDocs(IndexFiles.java:36)
Any help will be useful.
I don't suggest you use DIH to index your database data; you can use SolrJ to index it instead. SolrJ is simple: if you can use JDBC, then things are simple. You can use SolrJ to build Solr documents and commit the data to the Solr server in batches. There is a SolrJ wiki; I hope it can help you: solrj wiki
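(This sketch is mine, not the answerer's: a bare-bones JDBC-to-SolrJ loop with batched adds. The JDBC URL, table, and field names are invented, and the Solr URL is taken from the question.)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr");
try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
     Statement st = con.createStatement();
     ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles")) {

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (rs.next()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", rs.getString("id"));
        doc.addField("title", rs.getString("title"));
        doc.addField("content", rs.getString("body"));
        batch.add(doc);

        if (batch.size() == 1000) { // send in batches rather than one document at a time
            solr.add(batch);
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        solr.add(batch);
    }
    solr.commit();
}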
Solr 5.0 comes with the built-in DIH (DataImportHandler) utility for indexing data from a database, which is what you are using, but its configuration is important and tricky. Could you please post your DIH handler configuration or share the import logs? It looks like a configuration problem to me.
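(For illustration only, since no configuration was posted: a bare-bones data-config.xml of the kind DIH expects, with an invented JDBC URL, table, and field names.)
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <entity name="article" query="SELECT id, title, body FROM articles">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="content"/>
    </entity>
  </document>
</dataConfig>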

fixing lucene 4.1/solr corrupted index

My Lucene index - built with Solr using Lucene 4.1 - is, I think, corrupted. Upon trying to read the index using the following code, I get an org.apache.solr.common.SolrException: No such core: collection1 exception:
File configFile = new File(cacheFolder + File.separator + "solr.xml");
CoreContainer container = new CoreContainer(cacheFolder, configFile);
SolrServer server = new EmbeddedSolrServer(container, "collection1");
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", idFieldName + ":" + ClientUtils.escapeQueryChars(queryId));
params.set("fl",idFieldName+","+valueFieldName);
QueryResponse response = server.query(params);
I used "checkindex" util to check the integrity of the index and it seems not able to perform the task by throwing the following error:
Opening index # /....../solrindex_cache/zookeeper/solr/collection1/data/index
ERROR: could not read any segments file in directory
java.io.FileNotFoundException: /....../solrindex_cache/zookeeper/solr/collection1/data/index/segments_b5tb (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:223)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:285)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:347)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:383)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1777)
The file segments_b5tb that index checker is looking for is indeed missing in the index folder. The only file that looks similar is segments.gen.
Is there any way to diagnose what has gone wrong and, if at all possible, to fix it? It took me 2 weeks to build this index...
Many many thanks for your kind advice!
If the segments.gen file is the only file you see, you are likely out of luck, but otherwise you can try using CheckIndex to check for errors and repair the index. Since the tool fixes the index by removing problematic segments, there may certainly be some lost data as a result.
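(For reference, CheckIndex is normally run from the command line against the index directory shown in the stack trace. The jar name below is an assumption for a Lucene 4.1 setup and the path is a placeholder; run it once without -fix to get a read-only report, and add -fix only if you accept that unreadable segments, and the documents in them, will be dropped.)
java -cp lucene-core-4.1.0.jar org.apache.lucene.index.CheckIndex /path/to/collection1/data/index -fix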

Solrj IOException occured when talking to server

I am using basic authentication. My Solr version is 4.1. I can get query results, but when I try to index documents I get the following error message:
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://192.168.0.1:8983/solr/my_core
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:416)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
at warningletter.Process.run(Process.java:128)
at warningletter.WarningLetter.parseListPage(WarningLetter.java:81)
at warningletter.WarningLetter.init(WarningLetter.java:47)
at warningletter.WarningLetter.main(WarningLetter.java:21)
Caused by: org.apache.http.client.ClientProtocolException
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:822)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353)
... 8 more
Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:625)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
... 11 more
This is my piece of code:
DefaultHttpClient httpclient = new DefaultHttpClient();
httpclient.getCredentialsProvider().setCredentials(AuthScope.ANY, new UsernamePasswordCredentials("user", "password"));
HttpSolrServer server = new HttpSolrServer("http://192.168.0.1:8983/solr/warning_letter/", httpclient);
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", "id1");
solrDoc.addField("letter", "letter");
server.add(solrDoc);
server.commit();
What am I doing wrong?
The trick is to use preemptive authentication, so as to avoid having to repeat the request after the "unauthorized" response is sent.
Here is an example:
Preemptive Basic authentication with Apache HttpClient 4
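(A minimal sketch of that approach, adapted to the code in the question. The interceptor-based variant shown here is only one common way to do preemptive auth with HttpClient 4: the Authorization header is attached up front to every request, so the non-repeatable POST body never has to be replayed after a 401 challenge. Credentials and the URL reuse the question's values.)
import org.apache.commons.codec.binary.Base64;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HttpContext;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

DefaultHttpClient httpclient = new DefaultHttpClient();
// Add the Basic credentials to every outgoing request before it is sent.
httpclient.addRequestInterceptor(new HttpRequestInterceptor() {
    public void process(HttpRequest request, HttpContext context) {
        String credentials = new String(Base64.encodeBase64("user:password".getBytes()));
        request.addHeader("Authorization", "Basic " + credentials);
    }
});

HttpSolrServer server = new HttpSolrServer("http://192.168.0.1:8983/solr/warning_letter/", httpclient);
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("id", "id1");
solrDoc.addField("letter", "letter");
server.add(solrDoc);
server.commit();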

Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using Solr 4.0 and DIH (data import handler) with a TikaProcessor for extracting text from PDF files stored in a database. When I run indexing, it fails to parse some PDF files and I get the stack trace mentioned below.
Since Solr 4.0 uses Tika 1.2, I wrote a unit test to parse the same PDF file using the Tika 1.2 API, and I got the same error.
The same problem occurs with the Tika 1.3 jars as well, but when I tried the Tika 1.1 jars it works fine. Please let me know if any of you have seen this error and how to fix it.
(I have posted the same question to the Tika mailing list, but without much luck.)
When I open the PDF file it shows PDF/A mode; not sure if this is related to the problem.
Here is the exception:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser#1fbfd6
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 3 more
Here is the code snippet in Java:
String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
File file = new File(fileString );
URL url = file.toURI().toURL();
ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
Metadata metadata = new Metadata();
context.set(Parser.class, parser); // ppt, word, xlsx, pdf, html
ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
outputstream.close();
