Fixing a corrupted Lucene 4.1/Solr index

My Lucene index - built with Solr using Lucene 4.1 - is, I think, corrupted. When trying to read the index with the following code, I get an org.apache.solr.common.SolrException: No such core: collection1 exception:
File configFile = new File(cacheFolder + File.separator + "solr.xml");
CoreContainer container = new CoreContainer(cacheFolder, configFile);
SolrServer server = new EmbeddedSolrServer(container, "collection1");
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", idFieldName + ":" + ClientUtils.escapeQueryChars(queryId));
params.set("fl",idFieldName+","+valueFieldName);
QueryResponse response = server.query(params);
I used "checkindex" util to check the integrity of the index and it seems not able to perform the task by throwing the following error:
Opening index # /....../solrindex_cache/zookeeper/solr/collection1/data/index
ERROR: could not read any segments file in directory
java.io.FileNotFoundException: /....../solrindex_cache/zookeeper/solr/collection1/data/index/segments_b5tb (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:223)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:285)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:347)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:383)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1777)
The file segments_b5tb that the index checker is looking for is indeed missing from the index folder. The only file with a similar name is segments.gen.
Is there any way to diagnose what has gone wrong and, if at all possible, to fix it? It took me two weeks to build this index...
Many thanks for your kind advice!

If segments.gen is the only segments file you see, you are likely out of luck. Otherwise, you can try using CheckIndex to check for errors and repair the index with its -fix option. Since the tool fixes the index by removing problematic segments, some data may well be lost as a result.
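For diagnosis, here is a minimal sketch of running CheckIndex programmatically against the index directory; the path below is a placeholder, and if no segments_N file can be read at all, CheckIndex cannot open the index and recovery is unlikely. To attempt a repair, the tool is normally run from the command line with the -fix flag.

import java.io.File;
import java.io.PrintStream;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DiagnoseIndex {
    public static void main(String[] args) throws Exception {
        // Placeholder path: point this at your collection1/data/index directory
        Directory dir = FSDirectory.open(new File("/path/to/solr/collection1/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(new PrintStream(System.out, true, "UTF-8")); // print per-segment diagnostics
        CheckIndex.Status status = checker.checkIndex(); // fails if no segments file can be read
        System.out.println(status.clean ? "Index is clean." : "Index has problems; a -fix run would drop broken segments.");
        dir.close();
    }
}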

Related

Does anybody know if OrcTableSource supports the S3 file system?

I'm running into trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS); the code fragment is provided below:
OrcTableSource soORCTableSource = OrcTableSource.builder()
    // path to the ORC file, e.g. s3://orders/so.orc (the CSV lives at s3://orders/so.csv)
    .path("s3://orders/so.orc")
    // schema of the ORC files
    .forOrcSchema(OrderHeaderORCSchema)
    .withConfiguration(orcconfig)
    .build();
It seems this path is not being resolved correctly; can anyone help out? Much appreciated! Here is the stack trace:
Caused by: java.io.FileNotFoundException: File /so.orc does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
By the way, I've already set up flink-s3-fs-presto-1.6.2 and have the following code running correctly, so the question is limited to OrcTableSource only.
DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
    env.readCsvFile("s3://orders/so.csv")
       .types(String.class, String.class, String.class, String.class, String.class);
The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need to add the following snippet to Hadoop's core-site.xml configuration:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
See this link for more information about setting up S3 for Hadoop.
This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
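For illustration only, and assuming the Configuration handed to withConfiguration is the one picked up by the Hadoop-side reader, the same fs.s3.impl setting can also be applied programmatically; the S3A endpoint and credential keys below are placeholders for an IBM COS setup, and core-site.xml remains the place recommended above.

import org.apache.flink.orc.OrcTableSource;
import org.apache.hadoop.conf.Configuration;

// Hadoop configuration used by the ORC reader; mirrors the core-site.xml entry above
Configuration orcconfig = new Configuration();
orcconfig.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// Placeholder S3A settings for an IBM COS endpoint; adjust to your account
orcconfig.set("fs.s3a.endpoint", "<cos-endpoint>");
orcconfig.set("fs.s3a.access.key", "<access-key>");
orcconfig.set("fs.s3a.secret.key", "<secret-key>");

OrcTableSource soORCTableSource = OrcTableSource.builder()
    .path("s3://orders/so.orc")          // path to the ORC file
    .forOrcSchema(OrderHeaderORCSchema)  // schema string from the question
    .withConfiguration(orcconfig)
    .build();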

Error while indexing documents in solr - SolrException

I am using the following code to index documents on the Solr server.
String urlString = "http://localhost:8080/solr";
SolrServer solr = new CommonsHttpSolrServer(urlString);
java.io.File file=new java.io.File("C:\\Users\\Guruprasad\\Desktop\\Search\\47975832.doc");
if (file.canRead()) {
System.out.println("adding " + file);
try {
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
String parts[] = file.getName().split("\\.");
String type = "text";
if (parts.length>1) {
type = parts[1];
}
req.addFile(file);
req.setParam("literal.id", file.getAbsolutePath());
req.setParam("literal.name", file.getName());
req.setParam("literal.content_type", type);
req.setParam("uprefix", "attr_");
req.setParam("fmap.content", "attr_content");
req.setAction(ACTION.COMMIT, true, true);
solr.request(req); // line 36: the exception is thrown here
} catch (Exception e) {
e.printStackTrace();
}
}
While executing this code I get the following exception.
Exception: org.apache.solr.common.SolrException
Exception message:
Internal Server Error Internal Server Error request:
http://localhost:8080/solr/update/extract?literal.id=C:\Users\Guruprasad\Desktop\Search\47975832.doc&literal.name=47975832.doc&literal.content_type=doc&uprefix=attr_&fmap.content=attr_content&commit=true&waitFlush=true&waitSearcher=true&wt=javabin&version=2
Exception trace:
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at com.solr.search.test.IndexFiles.indexDocs(IndexFiles.java:36)
Any help will be useful.
I don't suggest using DIH to index your database data; you can use SolrJ to index it instead. SolrJ is simple: if you can use JDBC, then things are straightforward. You can use SolrJ to build Solr documents and commit the data to the Solr server in batches. There is a SolrJ wiki that may help you.
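For illustration, here is a minimal sketch of that approach, assuming SolrJ 4.x's HttpSolrServer and a hypothetical documents table; the JDBC URL, field names, and batch size are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JdbcToSolr {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8080/solr");
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, body FROM documents")) {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            while (rs.next()) {
                // Build one Solr document per database row
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("content", rs.getString("body"));
                batch.add(doc);
                if (batch.size() == 1000) { // send documents in batches
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit(); // single commit at the end
        }
    }
}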
Solr 5.0 comes with the built-in DIH (DataImportHandler) utility for indexing data from the database you are using, but its configuration is important and tricky. Could you please post your DIH handler configuration or share the import logs? It looks like a configuration problem to me.
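For reference, a DIH setup is usually driven by a data-config.xml referenced from solrconfig.xml; the sketch below is purely hypothetical (driver, URL, table, and column names are placeholders) and only shows the general shape such a configuration takes.

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="user" password="pass"/>
  <document>
    <entity name="doc" query="SELECT id, title, body FROM documents">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="content"/>
    </entity>
  </document>
</dataConfig>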

Getting parsing errors in Solr?

I am using Solr 4.6, and while indexing documents I see a couple of errors/warnings in the logs. I want to know whether, despite these messages, the documents will still be indexed into Solr or whether they will be skipped.
Finally, can I upgrade the relevant jars, i.e. PDFBox and Tika, to resolve the issue without breaking anything else?
Errors:
ERROR PDCIDFont Error: Could not parse predefined CMAP file for '¢¬%?Â-ª¬/3Ó~Œ[-UCS2'
Error: Could not parse predefined CMAP file for 'PDFAUTOCAD-Indentity0-UCS2'
I can also see the warnings below.
ExtractingDocumentLoader
skip extracting text due to TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser#37e5a2db. metadata=stream_source_info=TMSD SS_FI006 - Fixed Assets Course Slides.pptx stream_content_type=application/vnd.openxmlformats-officedocument.presentationml.presentation stream_size=9780764 stream_name=TMSD SS_FI006 - Fixed Assets Course Slides.pptx Content-Type=application/vnd.openxmlformats-officedocument.presentationml.presentation resourceName=TMSD SS_FI006 - Fixed Assets Course Slides.pptx
And
ExtractingDocumentLoader
skip extracting text due to Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser#6900efc8. metadata=stream_source_info=PMO IW Enduring Employees_Secondment Tracker.xlsx stream_content_type=application/x-tika-ooxml-protected stream_size=52736 custom:_dlc_DocIdItemGuid=9523d6bd-d1cf-40b5-b5b3-ca1ce43c4eb0 stream_name=PMO IW Enduring Employees_Secondment Tracker.xlsx custom:ContentTypeId=0x010100B98D2353323F5D4F8163D5A4670906C0 Content-Type=application/x-tika-ooxml-protected resourceName=PMO IW Enduring Employees_Secondment Tracker.xlsx

java.lang.OutOfMemoryError when processing large pgp file

I want to use Camel 2.12.1 to decrypt some potentially large PGP files. The following flow results in an out-of-memory exception, and the call stack shows that the PGPDataFormat.unmarshal() function is trying to build a byte array, which is destined to fail if the file is large. Is there a way to pass streams around during unmarshalling?
My route:
from("file:///home/cps/camel/sftp-in?"
+ "include=.*&" // find files using this pattern
+ "move=/home/cps/camel/sftp-archive&" // after done adding records to queue, move file to archive
+ "delay=5000&"
+ "readLock=rename&" // readLock parameters prevent picking up file which is currently changing
+ "readLockCheckInterval=5000")
.choice()
.when(header(Exchange.FILE_NAME_ONLY).regex(".*pgp$|.*PGP$|.*gpg$|.*GPG$")).to("direct:decrypt")
.otherwise()
.to("file:///home/cps/camel/input");
from("direct:decrypt").unmarshal().pgp("file:///home/cps/.gnupg/secring.gpg", "developer", "set42now")
.setHeader(Exchange.FILE_NAME).groovy("request.headers.get('CamelFileNameOnly').replace('.gpg', '')")
.to("file:///home/cps/camel/input/")
.to("log:done");
The exception, which shows the converter trying to create a byte array:
java.lang.OutOfMemoryError: Java heap space
at org.apache.commons.io.output.ByteArrayOutputStream.needNewBuffer(ByteArrayOutputStream.java:128)
at org.apache.commons.io.output.ByteArrayOutputStream.write(ByteArrayOutputStream.java:158)
at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1026)
at org.apache.commons.io.IOUtils.copy(IOUtils.java:999)
at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:218)
at org.apache.camel.converter.crypto.PGPDataFormat.unmarshal(PGPDataFormat.java:238)
at org.apache.camel.processor.UnmarshalProcessor.process(UnmarshalProcessor.java:65)
Try 2.13 or 2.12-SNAPSHOT, as we have improved the data formats and streaming recently, so it is likely to be better in the next release.

Tika 1.2 PDF parse error - org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary

I am using Solr 4.0 and DIH (the data import handler) with a TikaProcessor to extract text from PDF files stored in a database. When I run indexing, it fails to parse some PDF files with the stack trace below.
Since Solr 4.0 uses Tika 1.2, I wrote a unit test to parse the same PDF file with the Tika 1.2 API and got the same error.
The same problem occurs with the Tika 1.3 jars, but when I tried the Tika 1.1 jars it works fine. Please let me know if any of you have seen this error and how to fix it.
(I have posted the same question to the Tika mailing list, but without much luck.)
When I open the PDF file it shows PDF/A mode. Not sure if this is related to the problem.
Here is the exception:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser#1fbfd6
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at com.pc.TikaWithIndexing.main(TikaWithIndexing.java:53)
Caused by: java.lang.ClassCastException: org.apache.pdfbox.cos.COSString cannot be cast to org.apache.pdfbox.cos.COSDictionary
at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink.getAction(PDAnnotationLink.java:93)
at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:148)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 3 more
Here is the code snippet in Java:
String fileString = "C:/Bernard A J Am Coll Surg 2009.pdf";
File file = new File(fileString);
URL url = file.toURI().toURL();
ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
Metadata metadata = new Metadata();
context.set(Parser.class, parser); // handles ppt, word, xlsx, pdf, html
ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
outputstream.close();
