I'm running into trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS); the code fragment is provided below:
OrcTableSource soORCTableSource = OrcTableSource.builder()
    // path to ORC file
    .path("s3://orders/so.orc") // s3://orders/so.csv
    // schema of ORC files
    .forOrcSchema(OrderHeaderORCSchema)
    .withConfiguration(orcconfig)
    .build();
It seems this path is not handled correctly; can anyone help out? Much appreciated!
Caused by: java.io.FileNotFoundException: File /so.orc does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
By the way, I've already set up flink-s3-fs-presto-1.6.2 and have the following code running correctly. The question is limited to OrcTableSource only.
DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
    env.readCsvFile("s3://orders/so.csv")
       .types(String.class, String.class, String.class,
              String.class, String.class);
The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need Hadoop's core-site.xml to contain the following snippet:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
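For illustration, here is a rough sketch of how that setting could be wired into the ORC source from the question. The explicit core-site.xml location and the programmatic conf.set(...) alternative are assumptions of mine, not something the answer prescribes:

import org.apache.flink.orc.OrcTableSource;
import org.apache.hadoop.conf.Configuration;

// Hadoop Configuration handed to the ORC reader. A core-site.xml on the
// classpath is picked up automatically; addResource() shows how to load it
// from an explicit location (the /etc/hadoop/conf path is only a placeholder).
Configuration orcconfig = new Configuration();
orcconfig.addResource(new org.apache.hadoop.fs.Path("/etc/hadoop/conf/core-site.xml"));
// Mirrors the <property> snippet above (assumption: setting it on this
// Configuration has the same effect, since withConfiguration() passes it
// on to the Hadoop file system used when the splits are read).
orcconfig.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

OrcTableSource soORCTableSource = OrcTableSource.builder()
    .path("s3://orders/so.orc")
    .forOrcSchema(OrderHeaderORCSchema) // schema string from the question
    .withConfiguration(orcconfig)
    .build();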
See this link for more information about setting up S3 for Hadoop.
This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
Related
I am new to the Flink Streaming framework and am trying to understand its components and flow. I am running the basic word count example using the DataStream API in my IDE. The code runs with no issues when I feed data from a collection, as below:
DataStream<String> text = env.fromElements(
    "To be, or not to be,--that is the question:--",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune",
    "Or to take arms against a sea of troubles,"
);
But it fails every time I try to read data either from the socket or from a file, as below:
DataStream<String> text = env.socketTextStream("localhost", 9999);
or
DataStream<String> text = env.readTextFile("sample_file.txt");
For both the socket and the text file, I am getting the following error:
Exception in thread "main" java.lang.reflect.InaccessibleObjectException: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not "opens java.lang" to unnamed module
Flink currently (as of Flink 1.14) only supports Java 8 and Java 11. Install a suitable JDK and try again.
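As a quick sanity check, you can print the JVM version your IDE actually launches the job with; a minimal sketch:

public class JavaVersionCheck {
    public static void main(String[] args) {
        // Flink 1.14 supports Java 8 and 11. Newer JVMs enforce strong module
        // encapsulation by default, which is what produces the
        // InaccessibleObjectException shown above.
        System.out.println("java.version = " + System.getProperty("java.version"));
    }
}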
I've added Flink Hadoop Compatibility to my project, which reads a sequence file from an HDFS path:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-hadoop-compatibility_2.11</artifactId>
    <version>1.5.6</version>
</dependency>
Here's the Java code snippet:
DataSource<Tuple2<NullWritable, BytesWritable>> input = env.createInput(HadoopInputs.readHadoopFile(
new org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat<NullWritable, BytesWritable>(),
NullWritable.class, BytesWritable.class, path));
This works fine when I run it inside Eclipse, but when I submit it via the command line ('flink run ...'), it complains:
The type returned by the input format could not be automatically determined. Please specify the TypeInformation of the produced type explicitly by using the 'createInput(InputFormat, TypeInformation)' method instead.
OK, so I updated my code to add type information:
DataSource<Tuple2<NullWritable, BytesWritable>> input = env.createInput(HadoopInputs.readHadoopFile(
new org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat<NullWritable, BytesWritable>(),
NullWritable.class, BytesWritable.class, path),
TypeInformation.of(new TypeHint<Tuple2<NullWritable, BytesWritable>>() {}));
Now it complains:
Caused by: java.lang.RuntimeException: Could not load the TypeInformation for the class 'org.apache.hadoop.io.Writable'. You may be missing the 'flink-hadoop-compatibility' dependency.
Some people suggest copying flink-hadoop-compatibility_2.11-1.5.6.jar to FLINK_HOME/lib, but it doesn't help; I still get the same error.
Does anyone have any clue?
My Flink is a standalone installation, version 1.5.6.
UPDATE:
Sorry, I had copied flink-hadoop-compatibility_2.11-1.5.6.jar to the wrong place; after fixing that, it works.
Now my question is: is there any other way to do this? Copying that jar file to FLINK_HOME/lib is definitely not a good option to me, especially for a big Flink cluster.
Fixed in version 1.9.0; see https://issues.apache.org/jira/browse/FLINK-12163 for details.
I have a scenario wherein I need to read files from a particular folder, so I have a File inbound endpoint as below. It reads all non-empty files, but empty files are not read and sit in the same location as-is.
<file:inbound-endpoint path="${file.path}" responseTimeout="10000" doc:name="File" moveToDirectory="${audit.location}">
    <file:filename-regex-filter pattern="file.employee(.*).xml,file.client(.*).xml" caseSensitive="true"/>
</file:inbound-endpoint>
I removed the File filter, but it still doesn't read empty files.
Is there a way to enable the File inbound endpoint to read empty files too?
According to the Mule File Connector documentation:
The File connector as inbound endpoint does not process empty (0 bytes) files.
So this behavior is expected. There is no documented way to process empty files with the File inbound endpoint.
However, you can still write your own connector to do this, or use a workaround such as filling your "empty" file with a single character (such as a space) to make it non-empty, as sketched below.
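For example, a minimal sketch of that workaround as a plain Java pre-processing step run outside Mule (the /data/inbox directory and the *.xml glob are placeholders; point them at the same location and patterns as your inbound endpoint):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PadEmptyFiles {
    public static void main(String[] args) throws IOException {
        Path inbox = Paths.get("/data/inbox"); // placeholder for ${file.path}
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox, "*.xml")) {
            for (Path file : files) {
                // Append a single space to zero-byte files so the File
                // inbound endpoint no longer skips them.
                if (Files.size(file) == 0) {
                    Files.write(file, new byte[] {' '}, StandardOpenOption.APPEND);
                }
            }
        }
    }
}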
If you want to read a file with a size of 0 KB, you can't achieve this with the File connector, but you can read such a file by using the MuleRequester module in the flow. I will share a sample snippet soon. Please let me know if you need any help.
The Mule File connector does not process empty (0 byte) files as an inbound endpoint.
As far as I know, the File inbound connector will not process 0 KB files.
In the File connector, the class org.mule.transport.file.FileMessageReceiver, in its poll method, has:
if (file.length() == 0)
{
    if (logger.isDebugEnabled())
    {
        logger.debug("Found empty file '" + file.getName() + "'. Skipping file.");
    }
    continue;
}
which prevents it from processing empty files.
But you can create your own CustomFileMessageReceiver.java. Create your package:
package com.mycompany.mule.transport.file;
And a class that extends AbstractPollingMessageReceiver:
public class CustomFileMessageReceiver extends AbstractPollingMessageReceiver
Copy the original FileMessageReceiver.java methods, but comment out the lines above and change FileMessageReceiver to CustomFileMessageReceiver where needed.
The call fileConnector.move(file, workFile) invokes a protected method from the original package; comment it out, and be aware that you then cannot use the work directory.
In the same package, create a copy of org.mule.transport.file.ReceiverFileInputStream.java.
Configure your connector:
<file:connector name="FILE" readFromDirectory="${incoming.directory}" autoDelete="true" streaming="false" recursive="true" validateConnections="true" doc:name="File" writeToDirectory="${processed.directory}">
<service-overrides messageReceiver="com.mycompany.mule.transport.file.CustomFileMessageReceiver" />
</file:connector>
Or you may implement your own file connector, as stated in the above answers.
The short question is:
I want to disable stored field compression on a Solr 4.3.0 index. After reading:
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
http://wiki.apache.org/solr/SimpleTextCodecExample
http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/
I've decided to follow the path described there and make my own codec. I'm pretty sure I've followed all the steps; however, when I actually try to use my codec (affectionately named "UncompressedStorageCodec"), I get the following error in the Solr log:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
From the output I gather that Solr is not picking up the jar with my custom codec, and I don't understand why.
Here are all the horrific details:
I've created a class like this:
public class UncompressedStorageCodec extends FilterCodec {
    private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();

    protected UncompressedStorageCodec() {
        super("UncompressedStorageCodec", new Lucene42Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return fieldsFormat;
    }
}
in package: "fr.company.project.solr.transformers.utils"
The fully qualified name of "FilterCodec" is "org.apache.lucene.codecs.FilterCodec".
I've created a basic jar file out of this (exported it as a jar from Eclipse).
The Solr installation I'm using to test this is the basic Solr 4.3.0 unzipped, started via its embedded Jetty server and using the example core.
I've placed my jar with the codec in [solrDir]\dist
In:
[solrDir]\example\solr\myCore\conf\solrconfig.xml
I've added the line:
<lib dir="../../../dist/" regex="myJarWithCodec-1.10.1.jar" />
Then in the schema.xml file, I've declared some fieldTypes that should use this codec like so:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" postingsFormat="UncompressedStorageCodec"/>
<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" postingsFormat="UncompressedStorageCodec">
<!--...-->
</fieldType>
Now, if I use the DataImportHandler component to import some data into Solr, at commit time it tells me:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
What I find strange is that the above-mentioned codec jar also contains some Transformers for the DataImportHandler component, and those are picked up fine. Also, other jars placed in the dist folder (and declared in the same way in solrconfig.xml), like the JDBC driver, are picked up fine. I'm guessing that for the codec there's this SPI mechanism which loads things differently, and something it needs is missing...
I've also tried placing the codec jar in:
[solrDir]\example\solr-webapp\webapp\WEB-INF\lib\
as well as inside the WEB-INF\lib folder of the solr.war file, which is found in:
[solrDir]\example\webapps\
but I'm still getting the same error.
So basically, my question is: what's missing so that my codec jar is picked up by Solr?
Thanks
I'm going to answer this question myself since it has sort of become moot due to some benchmarks I've made. Long story short, I had arrived at the (wrong) conclusion that for really large stored fields, Solr 3.x and 4.0 (without field compression) are faster than Solr 4.1 and above (with field compression). However, that was mostly due to some errors in my benchmarks. After repeating them, I've obtained results showing that going from non-compressed to compressed fields, even for very large stored fields, makes indexing only between 0% and 15% slower, which is really not bad at all, considering that afterwards queries on the compressed-field indexes are 10-20% faster (the document-fetching part).
Also, here are some remarks on how to speed up indexing:
Use the DataImportHandler plugin. It bypasses the Solr REST (HTTP-based) API and writes directly to the Lucene index.
Check out said plugin's sources to see how it accomplishes this, and write your own plugin if the DataImportHandler doesn't meet your needs.
If for whatever reason you want to stick to the Solr REST API, use ConcurrentUpdateSolrServer and play around with the queue size and number of threads parameters. It will normally be a lot faster (up to 200% in my case) than the basic HttpSolrServer.
Don't forget to enable javabin data serialization, like this:
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer("http://some.solr.host:8983/solr", 100, 4);
solrServer.setRequestWriter(new BinaryRequestWriter());
I'm explicitly showing the code because I believe there might be a small bug here:
If you look at the ConcurrentUpdateSolrServer constructor, you'll see that by default it already sets the request writer to binary:
// the ConcurrentUpdateSolrServer initializes HttpSolrServer objects using this constructor:
public HttpSolrServer(String baseURL, HttpClient client) {
    this(baseURL, client, new BinaryResponseParser());
}
However, after debugging, I've noticed that if you don't explicitly call the setRequestWriter method with the binary writer argument, it will still use the XML serializer.
Going from XML to binary serialization reduces the size of my documents by about 3x as they are sent to the server. This makes my indexing in this case about 150-200% faster.
I have recently tried, and succeeded in getting, something very similar to work. The only difference is that I want to enable the best compression instead of no compression, and Solr defaults to the fastest compression. I also got the "SPI class [...] does not exist" error at some point, and here is what I have found out from various articles, including the ones you have linked to.
Lucene uses SPI to find the codec classes to load. Lucene requires the list of codec classes to be declared in a file named "org.apache.lucene.codecs.Codec", and that file must be on the classpath. To get Solr to load the file: when you create your JAR file "myJarWithCodec-1.10.1.jar", make sure that it contains a file at "META-INF/services/org.apache.lucene.codecs.Codec". The file should have one full class name per line, like this:
org.apache.lucene.codecs.lucene3x.Lucene3xCodec
org.apache.lucene.codecs.lucene40.Lucene40Codec
org.apache.lucene.codecs.lucene41.Lucene41Codec
org.apache.lucene.codecs.lucene42.Lucene42Codec
fr.company.project.solr.transformers.utils.UncompressedStorageCodec
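Before dropping the jar into Solr, you can verify that SPI resolves the codec by name with a small standalone check (a sketch of mine; run it with your codec jar and the Lucene core jar on the classpath):

import org.apache.lucene.codecs.Codec;

public class CodecSpiCheck {
    public static void main(String[] args) {
        // Throws IllegalArgumentException ("A SPI class of type
        // org.apache.lucene.codecs.Codec with name ... does not exist") if
        // META-INF/services/org.apache.lucene.codecs.Codec is missing from
        // the jar or the jar is not on the classpath.
        Codec codec = Codec.forName("UncompressedStorageCodec");
        System.out.println("Loaded codec: " + codec.getName());
    }
}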
And in solrconfig.xml, replace:
<codecFactory class="solr.SchemaCodecFactory" />
with:
<codecFactory class="fr.company.project.solr.transformers.utils.UncompressedStorageCodec" />
You might also need to remove postingsFormat="UncompressedStorageCodec" from schema.xml if Solr complains. I think this particular parameter is for specifying the postings format, not the codec. Hope it helps.
I've been unable to find any easy way of figuring out the version string for a WAR file deployed with Tomcat 7 versioned naming (i.e. app##version.war). You can read about it here and what it enables here.
It'd be nice if there were a somewhat more supported approach than the usual Swiss Army knife of reflection-powered ribcage cracking:
final ServletContextEvent event ...
final ServletContext applicationContextFacade = event.getServletContext();
final Field applicationContextField = applicationContextFacade.getClass().getDeclaredField("context");
applicationContextField.setAccessible(true);
final Object applicationContext = applicationContextField.get(applicationContextFacade);
final Field standardContextField = applicationContext.getClass().getDeclaredField("context");
standardContextField.setAccessible(true);
final Object standardContext = standardContextField.get(applicationContext);
final Method webappVersion = standardContext.getClass().getMethod("getWebappVersion");
System.err.println("WAR version: " + webappVersion.invoke(standardContext));
I think the simplest solution is to use the same version (SVN revision + padding, as an example) in the .war file name, web.xml, and a META-INF/MANIFEST.MF property, so you can retrieve the version later in your app or with any standard tool that reads the version from a JAR/WAR.
See MANIFEST.MF version-number
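For illustration, a minimal sketch of reading such a version at runtime from the deployed WAR's manifest (assuming your build writes it to the Implementation-Version attribute; that attribute name is a common convention, not something from the question):

import java.io.IOException;
import java.io.InputStream;
import java.util.jar.Manifest;
import javax.servlet.ServletContext;

public final class ManifestVersion {

    private ManifestVersion() {
    }

    // Reads Implementation-Version from the WAR's own META-INF/MANIFEST.MF.
    public static String read(ServletContext ctx) throws IOException {
        try (InputStream in = ctx.getResourceAsStream("/META-INF/MANIFEST.MF")) {
            if (in == null) {
                return null; // no manifest packaged in the WAR
            }
            return new Manifest(in).getMainAttributes().getValue("Implementation-Version");
        }
    }
}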
Another solution described here uses the path name on the server of the deployed WAR. You'd extract the version number from the string between the "##" and the "/":
runningVersion = StringUtils.substringBefore(
        StringUtils.substringAfter(
                servletConfig.getServletContext().getRealPath("/"),
                "##"),
        "/");
Starting from Tomcat versions 9.0.32, 8.5.52 and 7.0.101, the webapp version is exposed as a ServletContext attribute with the name org.apache.catalina.webappVersion.
Link to the closed enhancement request: https://bz.apache.org/bugzilla/show_bug.cgi?id=64189
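For illustration, reading that attribute from a listener might look like this (a minimal sketch; the attribute is only present on the Tomcat versions listed above, and may be an empty string when the WAR is deployed without a ##version suffix):

import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

public class WebappVersionLogger implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent event) {
        ServletContext ctx = event.getServletContext();
        // Set by Tomcat 9.0.32+, 8.5.52+ and 7.0.101+: the "version" part of
        // an app##version.war deployment.
        Object version = ctx.getAttribute("org.apache.catalina.webappVersion");
        ctx.log("WAR version: " + version);
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        // nothing to clean up
    }
}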
The easiest way would be for Tomcat to make the version available via a ServletContext attribute (org.apache.catalina.core.StandardContext.webappVersion) or similar. The patch to do that would be trivial. I'd suggest opening an enhancement request in Tomcat's Bugzilla. If you include a patch then it should get applied fairly quickly.