MiNiFi GetFile processor fails to get large files

I'm running Apache MiNiFi C++, and the flow starts with a GetFile processor.
The input directory contains some large files, and when I run MiNiFi the files above ~1.5 GB fail and are not queued.
The log file states:
[org::apache::nifi::minifi::processors::GetFile] [Warning] failed to stat large_file_path_here
The other smaller files are queued as expected.
Does anyone have a clue what might be wrong? Why can't the processor handle the larger files?
Thanks in advance.

What you found appears to be a bug that is still present in the current MiNiFi C++ implementation. At the file sizes you mention, a narrowing exception occurs when the processor tries to determine the length of the file to be written into the content repository.
We will try to fix this issue as soon as possible.
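For illustration only (this is not MiNiFi's actual code, which is C++), the following minimal Java sketch shows the class of problem described above: a file length is a 64-bit value, and narrowing it to a 32-bit integer silently produces a wrong value once the file is large enough.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NarrowingDemo {
    public static void main(String[] args) throws IOException {
        Path file = Path.of(args[0]);     // pass the path of a multi-gigabyte file
        long size = Files.size(file);     // correct 64-bit length in bytes
        int narrowed = (int) size;        // lossy narrowing to 32 bits
        // For sufficiently large files the narrowed value wraps around (it can even
        // become negative), which is the kind of failure the answer above describes
        // when determining the length written into the content repository.
        System.out.printf("actual size: %d bytes, after 32-bit narrowing: %d%n", size, narrowed);
    }
}
```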

Related

Snowflake - snowsql PUT command upload very slow

I've been having issues with the PUT command using Snowflake's SnowSQL Client. My network's upload speed is 100 Mbps, but the PUT command takes almost a minute and a half for a ~30MB file. I have already gzipped the file, so the compression isn't taking this long. Does anyone know how to solve this problem?
PUT command speed
There is always encryption occurring on the local machine, so whether the file is compressed or not, the machine the PUT is running on will still affect the overall performance of the PUT. If that machine is under-powered or busy, that can affect performance. If you are watching the PUT happen, you can tell how long this step takes by noting when the bytes-transferred counter actually starts moving.
Try increasing the number of threads by setting PARALLEL to a value >1.
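If it helps to see where the option goes, here is a minimal sketch of issuing PUT with PARALLEL through the Snowflake JDBC driver (the same PUT text can be run in SnowSQL directly); the account URL, credentials, stage name, and file path are placeholders, not values from the original post.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class PutWithParallel {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");          // placeholder credentials
        props.put("password", "MY_PASSWORD");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:snowflake://myaccount.snowflakecomputing.com", props);
             Statement stmt = conn.createStatement()) {
            // PARALLEL raises the number of upload threads; AUTO_COMPRESS=FALSE skips
            // client-side compression since the file is already gzipped (as in the post).
            stmt.execute("PUT file:///tmp/data.csv.gz @my_stage PARALLEL=8 AUTO_COMPRESS=FALSE");
        }
    }
}
```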
It looks like whatever the issue was has been resolved. I've been getting about 10-12 MB/s upload again. Not sure what the issue was, but I submitted a ticket to support a while back and it looks like they were able to fix it.
Thanks for the suggestions, but it looks like it was just something on their (or Azure's) end.

Omnimark file processing fails

We have an OmniMark script that takes a 2 GB SGML file as input and outputs a file of around 2.2 GB. The script is called from a Unix shell script, and we are facing an issue where sometimes the script runs successfully and sometimes it just aborts with no error. Any ideas or suggestions on how to debug this?
I have seen this type of issue before when running OmniMark v5.3, where the script bombs due to a lack of server resources/memory.
If you've specified writing to a log file, e.g. using -log logfilename.txt, then you would see something like an error code #3000 "insufficient memory error".
http://developers.omnimark.com/docs/html/error/3000.htm
If there is no log file, then an initial step would be to run the script in a console session so that any such abort message is visible.
Stilo have a page listing fixes in various versions of OmniMark
http://developers.omnimark.com/docs/html/concept/806.htm
This mentions a variety of memory-related issues in various versions of the software (e.g. use of certain translate rules) which may help some investigation.
Alternatively, you could add debug logging to the script itself, with a global switch to turn debugging on or off so you don't waste further I/O resources when you don't need it. The debug log file should be unbuffered. Add a message at certain breakpoints in the script; the more verbose, the better for narrowing down where and when the error occurs. With a file of that size, though, I suspect it's an I/O or memory error.
It also depends on which version of OmniMark you're using.

Application calls old source functions

There is an application on a remote machine running Linux (Fedora) that writes to a log file when certain events occur. Some time ago I changed the format of the messages written to the log file, but recently it turned out that, in some rare cases, log files with old-format messages still appear there. I know for sure that no part of my code can write such strings, and there is no instance of the old application running. Does anyone have any ideas why this can happen? It's not possible to check which process writes those files, because nothing like auditctl is installed there, and there is neither a package manager nor yum available to get or install it. The application is written in C.
You can use the fuser command to find all the processes that are currently using that file:
`fuser file.log`

Data loss on concurrent file write in camel

I am using Camel for my file operations. My system is a clustered environment.
Let's say I have 4 instances:
Instance A
Instance B
Instance C
Instance D
Folder structure:
Input Folder: C:/app/input
Output Folder: C:/app/output
All four instances point to the input folder location. As per my business requirement, 8 files will be placed in the input folder and the output will be one consolidated file. Here Camel is losing data when writing to the output file concurrently.
Route:
from("file://C:/app/input")
.setHeader(Exchange.File_Name,simple("output.txt"))
.to("file://C:/app/output?fileExist=Append")
.end();
Kindly help me resolve this issue. Is there anything like a write lock in Camel to avoid concurrent file writes? Thanks in advance.
You can use the doneFile option of the file component; see http://camel.apache.org/file2.html for more information.
Avoid reading files currently being written by another application
Beware that the JDK file I/O API is a bit limited in detecting whether another application is currently writing or copying a file, and the implementation can differ depending on the OS platform as well. This could lead to Camel thinking the file is not locked by another process and starting to consume it. Therefore you have to do your own investigation into what suits your environment. To help with this, Camel provides different readLock options and a doneFileName option that you can use. See also the section "Consuming files from folders where others drop files directly".
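As a rough sketch of where those options go (the endpoint URIs below mirror the route in the question, but the specific readLock settings and the .done suffix are assumptions for illustration, not a confirmed fix):

```java
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class ConsolidateRoute extends RouteBuilder {
    @Override
    public void configure() {
        // readLock=changed makes the consumer wait until the file's size/timestamp has
        // stopped changing before picking it up; readLockCheckInterval controls how often
        // that check runs. Alternatively, doneFileName=${file:name}.done tells the consumer
        // to wait until the producer has written a companion ".done" file.
        from("file://C:/app/input?readLock=changed&readLockCheckInterval=1000")
            .setHeader(Exchange.FILE_NAME, simple("output.txt"))
            .to("file://C:/app/output?fileExist=Append");
    }
}
```

Note that these options govern how the input files are picked up; they do not by themselves serialize concurrent appends from multiple instances to the shared output file.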

Foreach Loop Container with Foreach File Enumerator option iterates all files twice

I am using the SSIS Foreach Loop Container to iterate through files with a certain pattern on a network share.
I am encountering a kind of unreproducible malfunction of the Loop Container:
Sometimes the loop is executed twice: after all files have been processed, it starts over with the first file.
Has anyone encountered a similar bug?
Maybe not directly using SSIS but accessing files on a Windows share with some kind of technology?
Could this error relate to some network issues?
Thanks.
I found this to be the case whilst working with Excel files and using the *.xlsx wildcard to drive the foreach.
Once I put logging in place, I noticed that when the Excel file was opened it produced a file prefixed with ~$. This was picked up by the foreach loop.
So I used a trick similar to http://geekswithblogs.net/Compudicted/archive/2012/01/11/the-ssis-expression-wayndashskipping-an-unwanted-file.aspx to exclude files with a ~$ in the filename.
What error message (SSIS log / Eventvwr messages) do you get?
Like #Siva, I've not come across this, but here are some ideas you could use to try to diagnose it. You may be doing some of these already; I've just written them down for completeness from my thought process...
Log all files processed: write a line to a log file/table pre-processing (each file), then post-processing (each file), and keep the full path of each file. This is actually something we do as standard with our ETL implementations, as users often come back to us with questions about when/what has been loaded. This will let you see whether files are actually being processed twice.
Perhaps try moving each file to a different directory after it is processed. That will make it more difficult for a file to be processed a second time, and the problem may disappear. (If you are processing them from a "master" area (and so can't move them), consider copying the files to a "waiting" folder, then processing them and moving them to a "processed" folder.)
#Siva's comment is interesting - look at the "traverse subfolders" check box.
Check your Event Viewer (eventvwr) for odd network events, or application events (SQL Server restarting?).
Use Perfmon to see if there is anything odd happening in terms of network load on your server (a bit of a random idea!).
Try running your whole process with the files on a local disk instead of a network disk. If your mean time between failures is roughly one in 10 runs, you could run the load locally 20-30 times; if you don't get an error, it may be a network issue.
Nothing helped, so I implemented the following workaround: a script task in the foreach iterator which tracks all files. If a file was already loaded, a warning is fired and the file is not processed again. Anyway, it seems to be some network-related problem...
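For what it's worth, an SSIS script task would normally be written in C# or VB.NET; the sketch below just illustrates the same duplicate-tracking idea in Java (the tracking-file location, method names, and warning text are assumptions, not details from the workaround above).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class ProcessedFileTracker {
    private final Path trackingFile;              // persisted list of already-processed files
    private final Set<String> seen = new HashSet<>();

    public ProcessedFileTracker(Path trackingFile) throws IOException {
        this.trackingFile = trackingFile;
        if (Files.exists(trackingFile)) {
            seen.addAll(Files.readAllLines(trackingFile, StandardCharsets.UTF_8));
        }
    }

    /** Returns true if the file should be processed; fires a warning and returns false if it was seen before. */
    public boolean shouldProcess(String fullPath) throws IOException {
        if (!seen.add(fullPath)) {
            System.err.println("WARNING: file already loaded, skipping: " + fullPath);
            return false;
        }
        Files.writeString(trackingFile, fullPath + System.lineSeparator(),
                StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }
}
```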
