I'm trying to process potentially large files using Camel, and am worried about them "fitting" in the body of a Camel Message. Is there a way I can just pass the name (path) of the file as the body of the message, and then in a processor use that to read from disk?
You can just pass in a java.io.File instance. This is essentially what the Camel file component does itself (although it places the file inside a WrappedFile, due to sharing code with the FTP components).
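For example, a minimal sketch of that idea (the path and the two processors are hypothetical); Camel's built-in type converters can turn the File into a stream on demand downstream:

import java.io.File;
import java.io.InputStream;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;

Processor setFileBody = (Exchange exchange) ->
    exchange.getIn().setBody(new File("/path/to/large-file.bin")); // hypothetical path

Processor readFileBody = (Exchange exchange) -> {
    // Camel's type converters open the File as a stream only when asked for one,
    // so the whole file never has to sit in memory
    InputStream is = exchange.getIn().getBody(InputStream.class);
    // read from the stream here, then close it
};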
You can of course also just store the name of the file as a String, and then access the file from the processor, for example:
String name = exchange.getIn().getBody(String.class);
File file = new File(name);
...
// try-with-resources ensures the stream is closed even if reading fails
try (FileInputStream fis = new FileInputStream(file)) {
    // read the file from the stream, etc.
}
I am reading the S3 URL from a Kafka producer, then going to that S3 URL to process all the files inside that folder. After grabbing the info from each file, I will pass that data to a sink.
Initially, I have a DataStream<String> that reads and grabs the nested JSON value from the Kafka source's ObjectNode, using the JSONKeyValueDeserializationSchema. So the path exists as a String inside the DataStream. How do I pass this string to a FileSource? The FileSource object takes in a Path object for the location of the folder.
I'm planning to use FileSource.forRecordStreamFormat to go through all the files and then all the lines of each file. However, this outputs a FileSource<String>, and then a DataStream<String> by calling env.fromSource.
The example I'm looking at now is: https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/sideoutput/SideOutputExample.java
I see that FileSource takes in a Path object and then eventually yields a DataStream<String>, but is there a way for me to grab that String from the initial Kafka-source DataStream<String> and then use it for a FileSource?
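For reference, a minimal sketch of the static FileSource.forRecordStreamFormat usage the question refers to, assuming a recent Flink release where TextLineInputFormat is available and using a hypothetical hardcoded S3 path. Note that a FileSource is constructed when the job graph is built, so it cannot directly consume a path that arrives later as a stream element:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;

// env is an existing StreamExecutionEnvironment
// read every line of every file under the (hypothetical) folder
FileSource<String> source = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://myBucket/myFolder/"))
        .build();
DataStream<String> lines =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");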
I have file1.jpg, file2.jpg, file3.jpg in DirA.
I have file1.json, file2.json, file3.json in DirB.
How can I create an Apache Camel file route such that the first route picks file1.jpg from DirA, processes it, and passes the file name to a second route so that it can read file1.json and process it?
CamelContext context = new DefaultCamelContext();
context.addRoutes(new RouteBuilder() {
    public void configure() {
        from("file:///DirA/?noop=true")
            .bean(MyBean.class, "doSomeThingWithJPG(${file:absolute.path})")
            .from("file:///DirB/?noop=true&fileName=${file:name}.json")
            .bean(AnotherBean.class, "doSomeThingWithJSON(${file:absolute.path})");
    }
});
The second from("file:///...") endpoint is also picking up files in DirA instead of files in DirB.
You could:
Stage the JSON file
Write a processor or bean that copies a named file from one directory to another
Call that processor after processing the JPG, to copy the JSON to another directory
Have the second file listener poll that other directory (a sketch of this approach follows below)
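A minimal sketch of that staging approach, assuming a hypothetical staging directory DirStaged and that each JPG and its JSON share a base name; the include filter, the FILE_NAME_ONLY header, and the copy logic are illustrative rather than the only way to wire this up:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class StagingRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // route 1: process the JPG, then copy the matching JSON into a staging directory
        from("file:///DirA/?noop=true&include=.*\\.jpg")
            .bean(MyBean.class, "doSomeThingWithJPG(${file:absolute.path})")
            .process(exchange -> {
                String jpgName = exchange.getIn().getHeader(Exchange.FILE_NAME_ONLY, String.class);
                String jsonName = jpgName.replaceAll("\\.jpg$", ".json");
                Files.copy(Paths.get("/DirB", jsonName), Paths.get("/DirStaged", jsonName),
                        StandardCopyOption.REPLACE_EXISTING);
            });

        // route 2: poll the staging directory for the JSON files copied above
        from("file:///DirStaged/?delete=true")
            .bean(AnotherBean.class, "doSomeThingWithJSON(${file:absolute.path})");
    }
}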
In my code I am prompting the user to load a json file.
I am then attempting to copy this file into an sqlite database.
Once I have the data I am then able to manipulate it as needed - but I need to get it there in the first place.
So step 1 is to get the data in.
I have progressed as far as prompting the user to navigate to the file they want, but when I try to read the file I get this error:
ERROR: resources must reside in the root directory thus must start with a '/' character in Codename One! Invalid resource: file:///tmp/temp3257201851214246357..json
So I think that I need to copy this file to the root directory
I cannot find a link that shows me how to do this.
Here is my code so far ...
case "Import Script":
try
{
JSONParser json = new JSONParser();
if (FileChooser.isAvailable()) {
FileChooser.showOpenDialog(".json", e2-> {
String file = (String)e2.getSource();
if (file == null) {
home.add("No file was selected");
home.revalidate();
} else {
home.add("Please wait - busy importing");
home.revalidate();
String extension = null;
if (file.lastIndexOf(".") > 0) {
extension = file.substring(file.lastIndexOf(".")+1);
}
if ("json".equals(extension)) {
FileSystemStorage fs = FileSystemStorage.getInstance();
try {
InputStream fis = fs.openInputStream(file);
try(Reader r = new InputStreamReader(Display.getInstance().getResourceAsStream(getClass(), file), "UTF-8"))
{
Map<String, Object> data = json.parseJSON(r);
Result result = Result.fromContent(data);
...... I progress from here
The error is occurring on this line ...
try(Reader r = new InputStreamReader(Display.getInstance().getResourceAsStream(getClass(), file), "UTF-8"))
If I hard code a filename and manually place it in the /src folder it works ... like this ...
try (Reader r = new InputStreamReader(Display.getInstance().getResourceAsStream(getClass(), "/test.json"), "UTF-8"))
But that defeats the purpose of them selecting a file
Any help would be appreciated
Thanks
I suggest watching this video.
It explains the different ways data is stored. One of the core sources of confusion is the 3 different ways to store files:
Resources
File System
Storage
getResourceAsStream returns a read-only stream for a resource that's physically embedded in the JAR. The resource hierarchy is flat, so every path passed to getResourceAsStream must start with / and must contain only one of those. I would suggest avoiding more than one . as well, although that should work in theory.
The SQLite database must be stored in the file system, which is encapsulated as FileSystemStorage and is really the OS-native file system. But you can't store it anywhere you want: you need to give the DB name to the system, and it notifies you where the file is stored, which is what's explained in the code above.
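A minimal sketch of what that implies for the code in the question, assuming the path returned by FileChooser lives in FileSystemStorage (so the stream you already opened from it should be used, rather than getResourceAsStream):

// 'file' is the path string handed back by FileChooser.showOpenDialog
FileSystemStorage fs = FileSystemStorage.getInstance();
try (Reader r = new InputStreamReader(fs.openInputStream(file), "UTF-8")) {
    Map<String, Object> data = new JSONParser().parseJSON(r);
    Result result = Result.fromContent(data);
    // continue processing as before
} catch (IOException ex) {
    Log.e(ex); // Codename One's logging helper
}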
I am trying to use Hadoop in java with multiple input files. At the moment I have two files, a big one to process and a smaller one that serves as a sort of index.
My problem is that I need to keep the whole index file unsplit while the big file is distributed to each mapper. Is there any way provided by the Hadoop API to do such a thing?
In case I have not expressed myself correctly, here is a link to a picture that represents what I am trying to achieve: picture
Update:
Following the instructions provided by Santiago, I am now able to insert a file (or the URI, at least) from Amazon's S3 into the distributed cache like this:
job.addCacheFile(new Path("s3://myBucket/input/index.txt").toUri());
However, when the mapper tries to read it, a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems to be fine. I have used other S3 locations for the input and output files.
Error (note the single slash after the s3:)
FileNotFoundException: s3:/myBucket/input/index.txt (No such file or directory)
The following is the code I use to read the file from the distributed cache:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
//Do stuff
}
I am using Amazon's EMR, S3 and the version 2.4.0 of Hadoop.
As mentioned above, add your index file to the distributed cache and then access it in your mapper. Behind the scenes, the Hadoop framework will ensure that the index file is sent to all the task trackers before any task is executed, so it will be available for your processing. In this case, the data is transferred only once and will be available to all the tasks related to your job.
However, instead of hard-coding the index file into the distributed cache in your driver code, make your driver class implement the Tool interface (executed through ToolRunner) and override its run method. This provides the flexibility of passing the index file to the distributed cache from the command prompt while submitting the job, as shown in the sketch below.
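A minimal driver sketch of that pattern (the class and job names are hypothetical); ToolRunner invokes GenericOptionsParser, which is what consumes the -files option before your run method sees the arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // -files has already been stripped from args and handled by ToolRunner
        Job job = Job.getInstance(getConf(), "index-join");
        job.setJarByClass(MyDriver.class);
        // ... set mapper, reducer, and input/output paths from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}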
If you are using ToolRunner, you can add files to the distributed cache directly from the command line when you run the job. There is no need to copy the file to HDFS first. Use the -files option to add files (the list is comma-separated, with no spaces):
hadoop jar yourjarname.jar YourDriverClassName -files cachefile1,cachefile2,cachefile3,...
You can access the files in your Mapper or Reducer code as below:
File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
You could push the index file to the distributed cache, and it will be copied to the nodes before the mapper is executed.
See this SO thread.
Here's what helped me solve the problem.
Since I am using Amazon's EMR with S3, I needed to change the syntax a bit, as stated on the following site.
It was necessary to add the name the system would use to read the file from the cache, as follows:
job.addCacheFile(new URI("s3://myBucket/input/index.txt" + "#index.txt"));
This way, the program understands that the file introduced into the cache is named just index.txt. I also needed to change the syntax for reading the file from the cache: instead of the entire path stored in the distributed cache, only the filename has to be used, as follows:
URI[] cacheFile = output.getCacheFiles();
// the fragment after '#' becomes the local name, so the file is read as "index.txt"
BufferedReader br = new BufferedReader(new FileReader("index.txt"));
while ((line = br.readLine()) != null) {
    //Do stuff
}
I am using JDeveloper version 11.1.1.5.0. In my use case I have created a mail-client send-mail program where I used the ADF InputFile component to attach a file to the mail.
But the problem is that the InputFile component only returns the name of the file, not its full path, and my mail program's DataSource class uses the full path to access the file.
UploadedFile uploadfile = (UploadedFile) actionEvent.getNewValue();
String fname = uploadfile.getFilename(); // this line only gets the file name
So how can I get the full path using the ADF InputFile component, or is there any other way to fulfill my requirement?
You could save the uploaded file to a path on the server. Just take care about naming that file: because of concurrent users you should follow a naming policy, for example adding the time in milliseconds to the name of the file. Like this...
private String writeToFile(UploadedFile file) {
    ServletContext servletCtx =
        (ServletContext) FacesContext.getCurrentInstance().getExternalContext().getContext();
    String fileDirPath = servletCtx.getRealPath("/files/tmp");
    // prefix with a timestamp so concurrent uploads of the same name don't collide
    String fileName = getTimeInMilis() + file.getFilename();
    try (InputStream is = file.getInputStream();
         OutputStream os = new FileOutputStream(fileDirPath + "/" + fileName)) {
        // copy with a buffer instead of byte-by-byte
        byte[] buffer = new byte[8192];
        int readData;
        while ((readData = is.read(buffer)) != -1) {
            os.write(buffer, 0, readData);
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return fileName;
}
This method also returns the new name of the uploaded file. You can replace getTimeInMilis() with any naming policy you like.
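For completeness, one plausible implementation of the getTimeInMilis() helper referenced above (the helper is not shown in the original answer, so this is only an assumption):

private String getTimeInMilis() {
    // millisecond timestamp used as a filename prefix
    return String.valueOf(System.currentTimeMillis());
}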
It would be a security issue if a web app were able to see anything other than the data stream of an uploaded file. The directory structure of the client is not exposed to the web app. As such, unless you plan to upload the file from the same host as the server, you will not have access to the file path on the client.
Note: Using answer instead of comment due to reputation threshold