I am using Solr 4.6, and while indexing documents I can see a couple of errors/warnings in the logs. I want to make sure that, despite these messages, the documents will still be indexed into Solr rather than skipped.
Also, can I upgrade the relevant jars, i.e. PDFBox and Tika, to resolve the issue without breaking anything else?
Errors:
ERROR PDCIDFont Error: Could not parse predefined CMAP file for '¢¬%?Â-ª¬/3Ó~Œ[-UCS2'
Error: Could not parse predefined CMAP file for 'PDFAUTOCAD-Indentity0-UCS2'
I can also see the warnings below.
ExtractingDocumentLoader
skip extracting text due to TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@37e5a2db. metadata=stream_source_info=TMSD SS_FI006 - Fixed Assets Course Slides.pptx stream_content_type=application/vnd.openxmlformats-officedocument.presentationml.presentation stream_size=9780764 stream_name=TMSD SS_FI006 - Fixed Assets Course Slides.pptx Content-Type=application/vnd.openxmlformats-officedocument.presentationml.presentation resourceName=TMSD SS_FI006 - Fixed Assets Course Slides.pptx
And
ExtractingDocumentLoader
skip extracting text due to Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6900efc8. metadata=stream_source_info=PMO IW Enduring Employees_Secondment Tracker.xlsx stream_content_type=application/x-tika-ooxml-protected stream_size=52736 custom:_dlc_DocIdItemGuid=9523d6bd-d1cf-40b5-b5b3-ca1ce43c4eb0 stream_name=PMO IW Enduring Employees_Secondment Tracker.xlsx custom:ContentTypeId=0x010100B98D2353323F5D4F8163D5A4670906C0 Content-Type=application/x-tika-ooxml-protected resourceName=PMO IW Enduring Employees_Secondment Tracker.xlsx
Related
We are struggling to import certain files into Solr occasionally. It seems like certain documents have weird metadata (values); we are not sure if it comes from an eccentric word processor or something else. See two examples here:
Type: Solarium\Exception\HttpException
Message: Solr HTTP error: OK (400)
{"responseHeader":{"status":400,"QTime":49},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],"msg":"ERROR: [doc=3932487729] Error adding field 'brightness_value'='6.18' msg=For input string: \"6.18\"","code":400}}
And
Type: Solarium\Exception\HttpException
Severity: error --> Exception: Solr HTTP error: OK (400)
{"responseHeader":{"status":400,"QTime":72},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],"msg":"ERROR: [doc=16996] Error adding field 'version'='5.3.1' msg=For input string: \"5.3.1\"","code":400}}
How do we prevent these issues? We are not in control of the documents, so we need to fix it on the server.
Define the field type explicitly in the schema instead of relying on Solr to create the field type for you. The first document that contains the field will make Solr guess the type of the field, and if later documents don't match that same expected format, you'll get an error like this.
Always define the schema for a collection when using it in production or in an actual application - the schemaless mode is really neat for prototyping and experimenting, but in an actual application you want the types to be well defined.
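For example, a minimal sketch of what that could look like in schema.xml, using the two field names from the error messages above (the "string" type here assumes the StrField definition that ships with the default schema; if you actually need to sort or range-query on brightness_value, pick a suitable numeric type for your Solr version instead):
<!-- declared up front so the schemaless type guessing never applies to these metadata fields -->
<field name="brightness_value" type="string" indexed="true" stored="true"/>
<field name="version" type="string" indexed="true" stored="true"/>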
My requirement is: I get an input XML and, based on some condition checking (using choice), I need to send it into 2 different files. As I'm getting big files (100-400 MB), I'm sending them in streaming mode (by enabling streaming in the File and DataMapper components).
It works fine for small input XML (10-20 MB). But when I give it a large input XML, the condition checking and the XML-to-CSV conversion work fine, but while writing the CSV data I get the error message below.
INFO 2015-09-08 12:03:49,227 [[simplebatch_1].simplebatchFlow.stage1.02] org.mule.api.processor.LoggerMessageProcessor: default logger java.io.PipedInputStream@1bca8e6
INFO 2015-09-08 12:03:49,258 [[simplebatch_1].File1.dispatcher.01] org.mule.lifecycle.AbstractLifecycleManager: Initialising: 'File1.dispatcher.29118412'. Object is: FileMessageDispatcher
INFO 2015-09-08 12:03:49,258 [[simplebatch_1].File1.dispatcher.01] org.mule.lifecycle.AbstractLifecycleManager: Starting: 'File1.dispatcher.29118412'. Object is: FileMessageDispatcher
INFO 2015-09-08 12:03:49,258 [[simplebatch_1].File1.dispatcher.01] org.mule.transport.file.FileConnector: Writing file to: D:\MulePOC's\output\myoutput1
ERROR 2015-09-08 12:03:54,999 [XML_READER0_0] org.jetel.graph.Node: java.lang.OutOfMemoryError: Java heap space
ERROR 2015-09-08 12:03:55,000 [WatchDog_0] org.jetel.graph.runtime.WatchDog: Component [XML READER:XML_READER0] finished with status ERROR.
Java heap space
Please suggest me a way forward on this. Thanks.
You need to increase the JVM memory for Mule. You can find the config file in $MULE_HOME/conf/wrapper.conf
There you will find something like this:
# Increase Permanent Generation Size from default of 64m
# Increase this value if you get "Java.lang.OutOfMemoryError: PermGen space error"
# This property is not used when running java 8 and may cause a warning.
wrapper.java.additional.7=-XX:PermSize=256m
wrapper.java.additional.8=-XX:MaxPermSize=256m
# GC settings
wrapper.java.additional.9=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.10=-XX:+AlwaysPreTouch
wrapper.java.additional.11=-XX:+UseParNewGC
wrapper.java.additional.12=-XX:NewSize=512m
wrapper.java.additional.13=-XX:MaxNewSize=512m
wrapper.java.additional.14=-XX:MaxTenuringThreshold=8
You can change these configurations as you like.
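Note that the lines above tune PermGen and GC; the error in your log is about Java heap space, which is governed by the wrapper's heap settings. In a typical Mule wrapper.conf they look roughly like this (values are in MB; the defaults and exact entries vary between Mule versions, so treat this as a sketch rather than the exact file contents):
# Initial and maximum Java heap size in MB
wrapper.java.initmemory=1024
wrapper.java.maxmemory=2048
Increase wrapper.java.maxmemory until the flow no longer runs out of heap while writing the large CSV.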
My Lucene index - built with Solr using Lucene 4.1 - is, I think, corrupted. Upon trying to read the index using the following code, I get an org.apache.solr.common.SolrException: No such core: collection1 exception:
File configFile = new File(cacheFolder + File.separator + "solr.xml");
CoreContainer container = new CoreContainer(cacheFolder, configFile);
SolrServer server = new EmbeddedSolrServer(container, "collection1");
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", idFieldName + ":" + ClientUtils.escapeQueryChars(queryId));
params.set("fl",idFieldName+","+valueFieldName);
QueryResponse response = server.query(params);
I used "checkindex" util to check the integrity of the index and it seems not able to perform the task by throwing the following error:
Opening index # /....../solrindex_cache/zookeeper/solr/collection1/data/index
ERROR: could not read any segments file in directory
java.io.FileNotFoundException: /....../solrindex_cache/zookeeper/solr/collection1/data/index/segments_b5tb (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:223)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:285)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:347)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:383)
at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1777)
The file segments_b5tb that the index checker is looking for is indeed missing from the index folder. The only file that looks similar is segments.gen.
Is there any way to diagnose what has gone wrong and, if at all possible, to fix it? It took me 2 weeks to build this index...
Many many thanks for your kind advice!
If the segments.gen file is the only segments file you see, you are likely out of luck, but otherwise you can try using CheckIndex to check for errors and repair the index. Since the tool fixes the index by removing problematic segments, there may certainly be some lost data as a result.
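Here is a rough sketch of that repair path in code, assuming Lucene 4.x's CheckIndex API (where the repair method is fixIndex(Status); it was renamed exorciseIndex in Lucene 5). The command-line equivalent is running org.apache.lucene.index.CheckIndex against the index directory with the -fix flag. Back up the index directory first, since the repair permanently drops any segment it cannot read:
import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RepairIndex {
    public static void main(String[] args) throws Exception {
        // adjust to the index path from the stack trace above
        Directory dir = FSDirectory.open(new File("/path/to/collection1/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out);               // print per-segment diagnostics
        CheckIndex.Status status = checker.checkIndex(); // read-only check
        if (!status.clean) {
            checker.fixIndex(status);                    // rewrites the segments file, dropping broken segments
        }
        dir.close();
    }
}
Note that this only helps if at least one segments_N file is readable; if segments.gen really is the only segments file left, CheckIndex has nothing to start from.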
The short question is:
I want to disable stored field compression on a Solr 4.3.0 index. After reading:
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
http://wiki.apache.org/solr/SimpleTextCodecExample
http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/
I've decided to follow the path described there and make my own codec. I'm pretty sure I've followed all the steps; however, when I actually try to use my codec (affectionately named "UncompressedStorageCodec"), I get the following error in the Solr log:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
From the output I gather that Solr is not picking up the jar with my custom codec, and I don't get why.
Here are all the horrific details:
I've created a class like this:
public class UncompressedStorageCodec extends FilterCodec {
    private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();

    protected UncompressedStorageCodec() {
        super("UncompressedStorageCodec", new Lucene42Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return fieldsFormat;
    }
}
in package: "fr.company.project.solr.transformers.utils"
The FQDN of "FilterCodec" is: "org.apache.lucene.codecs.FilterCodec"
I've created a basic jar file out of this (exported it as jar from Eclipse).
The Solr installation I'm using to test this is the basic Solr 4.3.0 unzipped, started via its embedded Jetty server and using the example core.
I've placed my jar with the codec in [solrDir]\dist
In:
[solrDir]\example\solr\myCore\conf\solrconfig.xml
I've added the line:
<lib dir="../../../dist/" regex="myJarWithCodec-1.10.1.jar" />
Then in the schema.xml file, I've declared some fieldTypes that should use this codec like so:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" postingsFormat="UncompressedStorageCodec"/>
<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" postingsFormat="UncompressedStorageCodec">
<!--...-->
</fieldType>
Now, if I use the DataImportHandler component to import some data into Solr, at commit time it tells me:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
What I find strange is that the above-mentioned codec jar also contains some Transformers for the DataImportHandler component, and those are picked up fine. Also, other jars placed in the dist folder (and declared in the same way in solrconfig.xml), like the JDBC driver, are picked up fine. I'm guessing that for the codec there's this SPI mechanism which loads things differently, and there's something it's missing...
I've also tried placing the codec jar in:
[solrDir]\example\solr-webapp\webapp\WEB-INF\lib\
as well as inside the WEB-INF\lib folder of the solr.war file, which is found in:
[solrDir]\example\webapps\
but I'm still getting the same error.
So basically, my question is, what's missing so that my codec jar is picked up by Solr?
Thanks
I'm going to answer this question myself, since it has sort of become moot due to some benchmarks I've made. Long story short, I had arrived at the (wrong) conclusion that for really large stored fields, Solr 3.x and 4.0 (without field compression) is faster than Solr 4.1 and above (with field compression). However, that was mostly due to some errors in my benchmarks. After repeating them, I've obtained results showing that going from non-compressed to compressed fields, even for very large stored fields, makes indexing only 0% to 15% slower, which is really not bad at all, considering that afterwards queries on the compressed-field indexes are 10-20% faster (the document-fetching part).
Also, here's some remarks on how to speed up indexing:
Use the DataImportHandler plugin. It bypasses the Solr REST (HTTP-based) API and writes directly to the Lucene index.
Check out said plugin's sources to see how it accomplishes this, and write your own plugin if the DataImportHandler doesn't meet your needs.
If for whatever reason you want to stick to the Solr REST API, use ConcurrentUpdateSolrServer and play around with the queue size and number of threads parameters. It will normally be a lot faster (up to 200% in my case) than the basic HttpSolrServer.
Don't forget to enable the javabin data serialization like this:
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer("http://some.solr.host:8983/solr", 100, 4);
solrServer.setRequestWriter(new BinaryRequestWriter());
I'm explicitly showing the code because I believe there might be a small bug here:
If you look at the ConcurrentUpdateSolrServer constructor, you'll see that by default it already sets the request writer to binary:
//the ConcurrentUpdateSolrServer initializes HttpSolrServer objects using this constructor:
public HttpSolrServer(String baseURL, HttpClient client) {
this(baseURL, client, new BinaryResponseParser());
}
However, after debugging I've noticed that if you don't explicitly call the setRequestWriter method with a BinaryRequestWriter argument, it will still serialize requests as XML.
Going from XML to binary serialization reduces the size of my documents by about 3x as they are sent to the server, which makes my indexing for this case about 150-200% faster.
I recently tried to get something very similar to work, and succeeded. The only difference is that I want to enable the best compression instead of no compression, and Solr defaults to the fastest compression. I also got the "SPI class [...] does not exist" error at some point, and here is what I have found out from various articles, including the ones you have linked to.
Lucene uses SPI to find the codec classes to load. Lucene requires the list of codec classes to be declared in a file named "org.apache.lucene.codecs.Codec", and that file must be on the classpath. To get Solr to load the file: when you create your JAR file "myJarWithCodec-1.10.1.jar", make sure that it contains a file at "META-INF/services/org.apache.lucene.codecs.Codec". The file should have one full class name per line, like this:
org.apache.lucene.codecs.lucene3x.Lucene3xCodec
org.apache.lucene.codecs.lucene40.Lucene40Codec
org.apache.lucene.codecs.lucene41.Lucene41Codec
org.apache.lucene.codecs.lucene42.Lucene42Codec
fr.company.project.solr.transformers.utils.UncompressedStorageCodec
And in solrconfig.xml, replace:
<codecFactory class="solr.SchemaCodecFactory" />
with:
<codecFactory class="fr.company.project.solr.transformers.utils.UncompressedStorageCodec" />
You might also need to remove postingsFormat="UncompressedStorageCodec" from schema.xml if Solr complains. I think this particular parameter is for specifying the postings format, not the codec. Hope it helps.
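For reference, the string field type from the question would then simply lose that attribute:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
The codec choice then comes from the codecFactory declared in solrconfig.xml rather than from the schema.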
I'm trying to crawl using Nutch 1.4, but I'm facing an error in parsing. This is the log file:
2012-01-09 09:12:02,696 INFO parse.ParseSegment - ParseSegment: starting at 2012-01-09 09:12:02
2012-01-09 09:12:02,697 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20120109091153
2012-01-09 09:12:03,416 WARN parse.ParseUtil - Unable to successfully parse content http://sujitpal.blogspot.com/ of type application/xhtml+xml
2012-01-09 09:12:03,417 INFO parse.ParseSegment - Parsing: http://sujitpal.blogspot.com/
2012-01-09 09:12:03,418 WARN parse.ParseSegment - Error parsing: http://sujitpal.blogspot.com/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
2012-01-09 09:12:03,419 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
By checking config/nutch-site.xml, I found that html|text|xhtml|xml are included in the plugin.includes property:
<property>
<name>plugin.includes</name>
<value>myplugins|protocol-httpclient|query-(basic|site|url)|summary-basic|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|response-(json|xml)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
Why can't it parse xhtml/xml or even text/xml?
Which plugins have you configured? If you are using Tika, then Tika has a mapping from a mime type like xhtml/xml to a parser. If there is no entry in the config file, nothing happens.
You could disable tika and only use the parse-html plugin.
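For instance, a sketch based on the plugin.includes value from the question (untested), with the Tika and XML parsers dropped so that parse-html handles the pages:
<value>myplugins|protocol-httpclient|query-(basic|site|url)|summary-basic|urlfilter-regex|parse-(html)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|response-(json|xml)</value>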
I tested your site with our default plugin config.
protocol-http|urlfilter-regex|parse-(html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
And got your page parsed.
Parsed (32ms):http://sujitpal.blogspot.com/
Greetings
JPee