I'm running Apache MiNiFi C++. The flow starts with a GetFile processor.
The input directory includes some large files, and when I run MiNiFi the files above ~1.5 GB fail and do not get queued.
The log file states:
[org::apache::nifi::minifi::processors::GetFile] [Warning] failed to stat large_file_path_here
The other smaller files are queued as expected.
Does anyone have a clue what can be wrong? Why can't the processor manage the larger files?
Thanks in advance.
What you found looks like a bug that is still present in the current MiNiFi implementation. The issue is that, for the file sizes you mentioned, a narrowing exception happens here when trying to determine the length of the file to be written into the content repository.
We will try to fix this issue asap.
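To illustrate what a narrowing failure looks like in general, here is a minimal Python sketch. It assumes the target type is a signed 32-bit integer, which is only a guess for illustration; the actual type and threshold in MiNiFi may differ, and the function name `narrow_to_int32` is hypothetical.

```python
import ctypes

def narrow_to_int32(value: int) -> int:
    """Convert to a signed 32-bit integer; raise if the value changes."""
    # ctypes wraps silently on overflow, much like an unchecked C cast.
    narrowed = ctypes.c_int32(value).value
    if narrowed != value:
        raise OverflowError(f"{value} does not fit in a signed 32-bit integer")
    return narrowed

small_size = 500 * 1024**2  # 500 MiB, fits
large_size = 3 * 1024**3    # 3 GiB, does not fit in 32 bits (assumed width)

print(narrow_to_int32(small_size))
try:
    narrow_to_int32(large_size)
except OverflowError as exc:
    print("narrowing failed:", exc)
```

A checked conversion like this fails loudly instead of producing a wrong length, which would match a stat call failing only for the large files.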
I have a project I've been working on for a long time and a few hours ago when I tried to pick it up from where I left last night I realized that one of the processing files may be corrupted.
The last time I opened this file, yesterday, it had over 500 lines of code; now it's completely blank. The weird part is that in Properties its size is displayed as around 17 KB. I tried opening it with Notepad, but I only got a text file full of spaces.
Does anyone have any idea how I can recover my work? I can't imagine having to rewrite everything.
I'm not really sure how we can help you.
Processing doesn't have auto-backup or anything like that. I'd say make sure you don't have the file open in multiple editors. You could also try sending the file to another computer to see if that works. Other than that, the best thing you can do is check whether your computer has backups of some kind.
I am using the SSIS Foreach Loop Container to iterate through files with a certain pattern on a network share.
I am encountering a malfunction of the Loop Container that I cannot reliably reproduce:
Sometimes the loop is executed twice. After all files were processed it starts over with the first file.
Has anyone encountered a similar bug?
Maybe not directly using SSIS but accessing files on a Windows share with some kind of technology?
Could this error relate to some network issues?
Thanks.
I found this to be the case whilst working with Excel files and using the *.xlsx wildcard to drive the foreach.
Once I put logging in place I noticed that when an Excel file was opened, Excel created a temporary file prefixed with ~$. This was picked up by the foreach loop.
So I used a trick similar to http://geekswithblogs.net/Compudicted/archive/2012/01/11/the-ssis-expression-wayndashskipping-an-unwanted-file.aspx to exclude files with a ~$ in the filename.
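The linked trick is an SSIS expression; the same filtering logic can be sketched outside SSIS. This is only an illustration in Python, and the function name `wanted_files` and the sample file names are made up.

```python
import fnmatch

def wanted_files(names, pattern="*.xlsx", lock_prefix="~$"):
    """Keep names matching the wildcard, but skip Excel's ~$ temporary files."""
    return [
        name for name in names
        if fnmatch.fnmatch(name, pattern) and not name.startswith(lock_prefix)
    ]

# Hypothetical directory listing: the ~$ file also matches *.xlsx.
listing = ["report.xlsx", "~$report.xlsx", "notes.txt"]
selected = wanted_files(listing)
print(selected)
```

The point is that the wildcard alone is not enough, because the temporary file has the same extension as the real one.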
What error message (SSIS log / Eventvwr messages) do you get?
Similar to #Siva, I've not come across this, but some ideas you could use to try and diagnose. You may be doing some of these already, I've just written them down for completeness from my thought processes...
log all files processed. write a line to a log file/table pre-processing (each file), then post-process (each file). Keep the full path of each file. This is actually something we do as standard with our ETL implementations, as users are often coming back to us with questions about when/what has been loaded. This will allow you to see if files are actually being processed twice.
perhaps try moving each file after it is processed to a different directory. That will make it more difficult to have a file processed a second time and the problem may disappear. (If you are processing them from a "master" area and so can't move them, consider copying the files to a "waiting" folder, then processing them and moving them to a "processed" folder.)
#Siva's comment is interesting - look at the "traverse subfolders" check box.
check your eventvwr for odd network events, or application events (SQL Server restarting?)
use perf mon to see if there is anything odd happening in terms of network load on your server (a bit of a random idea!)
try running your whole process with files on a local disk instead of a network disk. If the failure shows up roughly once every 10 runs, you could do this load locally 20-30 times; if you don't get an error, it may be a network error
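The first two ideas in the list above (log before and after each file, then move it to a "processed" folder) could look roughly like this outside SSIS. It is a sketch only; the folder names, the logger name, and `process_and_archive` are assumptions made for the example.

```python
import logging
import shutil
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def process_and_archive(waiting: Path, processed: Path, handler) -> list:
    """Process each file in `waiting`, logging full paths, then move it away."""
    processed.mkdir(parents=True, exist_ok=True)
    done = []
    for path in sorted(waiting.iterdir()):
        if not path.is_file():
            continue
        log.info("start %s", path.resolve())      # pre-processing log line
        handler(path)
        target = processed / path.name
        shutil.move(str(path), str(target))       # file can no longer be picked up again
        log.info("done  %s", target.resolve())    # post-processing log line
        done.append(path.name)
    return done

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    waiting = Path(root) / "waiting"
    processed = Path(root) / "processed"
    waiting.mkdir()
    (waiting / "a.csv").write_text("1\n")
    (waiting / "b.csv").write_text("2\n")
    moved = process_and_archive(waiting, processed, handler=lambda p: p.read_text())
    left_behind = list(waiting.iterdir())
```

If a file ever shows two "start" lines in the log, you know the loop really visited it twice.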
Nothing helped. I implemented the following workaround: a script task in the foreach iterator tracks all files. If a file was already loaded, a warning is fired and the file is not processed again. Anyway, it seems to be some network-related problem...
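The original script task is not shown, so here is only a rough Python sketch of the same idea: remember every file seen and skip repeats with a warning. The class name `LoadTracker` and the share paths are hypothetical.

```python
import logging
import os

class LoadTracker:
    """Remembers which files were already loaded during this run."""

    def __init__(self):
        self._seen = set()

    def should_process(self, path: str) -> bool:
        # Normalise so the same file is recognised under different spellings.
        key = os.path.normcase(os.path.abspath(path))
        if key in self._seen:
            logging.warning("%s was already loaded; skipping", path)
            return False
        self._seen.add(key)
        return True

tracker = LoadTracker()
# Simulates the loop starting over with the first file after the last one.
visited = ["/share/in/a.csv", "/share/in/b.csv", "/share/in/a.csv"]
loaded = [p for p in visited if tracker.should_process(p)]
print(loaded)
```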
I have a program written in C that allows the user to scroll through a display of about a zillion small files. Each file needs to undergo a certain amount of processing (read only) before it's displayed to the user. I've implemented a buffer that preprocesses the files in a certain radius around the user's position, so if they're working linearly through them, there's not much delay. For various reasons, I can only actually run my processing algorithm on one file at a time (though I can have multiple files open, and I can read from them) so my buffer loads sequentially.
My processing algorithms are as optimized as they're going to get, but I'm running into I/O problems. At first, my loading process is slow, but when the files have been accessed a few times, it speeds up by about 5x. Therefore I strongly suspect that what's slowing me down is waiting for the Windows page cache to pull my files into memory. I know very little about that sort of thing. If I could ensure my files were in the cache before my processing algorithm needed them, I'd be in business.
My question is: is there a way to persuade/cajole/trick/intimidate Windows into loading my files into the page cache before I actually get around to reading/processing them?
There's only one way to get a file into the file system cache: reading it. That's a chicken-and-egg problem. You can get the egg first by using a helper thread that reads files. It would have to have some kind of smarts about what file is likely to be next. And not read too much.
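A rough sketch of such a helper thread, written in Python rather than C to keep it short. The names `warm_file` and `Prefetcher` are made up, and the bounded queue is just one assumed way of making sure it does not read too much.

```python
import os
import queue
import tempfile
import threading

def warm_file(path, chunk_size=1 << 20):
    """Read the whole file and throw the data away; returns bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return total
            total += len(chunk)

class Prefetcher:
    """Background thread that reads requested files so the OS caches them."""

    def __init__(self, max_pending=8):
        self._queue = queue.Queue(maxsize=max_pending)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def request(self, path) -> bool:
        """Ask for a file to be warmed; returns False if the queue is full."""
        try:
            self._queue.put_nowait(path)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            path = self._queue.get()
            if path is None:          # sentinel from close()
                return
            try:
                warm_file(path)
            except OSError:
                pass                  # a failed prefetch is harmless

    def close(self):
        self._queue.put(None)
        self._thread.join()

# Demonstration with a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100_000)
prefetcher = Prefetcher()
accepted = prefetcher.request(tmp.name)
prefetcher.close()
warmed_bytes = warm_file(tmp.name)
os.remove(tmp.name)
```

The caller would request the files just outside the user's current position; deciding which ones those are is the "smarts" part and is left out here.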
On a POSIX system, you'd use posix_fadvise:
POSIX_FADV_WILLNEED
Specifies that the application expects to access the specified data in the near future.
However, that doesn't seem to exist on Windows. The Stack Overflow question "What is fadvise/madvise equivalent on windows?" has some alternatives.
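For reference, this is roughly what the POSIX call looks like from Python, which exposes it as `os.posix_fadvise` on platforms that have it. The wrapper name `advise_willneed` is made up, and the sketch simply returns False where the call is unavailable (for example on Windows).

```python
import os
import tempfile

def advise_willneed(path) -> bool:
    """Hint that `path` will be read soon. Returns True if the hint was issued."""
    if not hasattr(os, "posix_fadvise"):
        return False                      # not a POSIX platform with fadvise
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset 0 and length 0 mean "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return True
    except OSError:
        return False                      # the hint is advisory; ignore failures
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"data" * 1024)
issued = advise_willneed(tmp.name)
os.remove(tmp.name)
print("hint issued:", issued)
```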
Is it possible to read damaged media (cd, hdd, dvd,...) even if windows explorer bombs out?
What I mean to ask is, whether there is a set of APIs or something that can access the disk at a very low level (below explorer?) and read whatever can be retrieved even if it is only partial, especially if you can still see the file is there from explorer, but can't do anything with it because it is damaged somehow (scratch on cd, etc)?
The main problem with Windows Explorer is that it doesn't support resuming copying after a read error. Most superficially scratched CDs, for example, will fail on different areas of the disk every time you eject and reinsert them.
Therefore, with a utility that supports resuming copy operations, it is possible to read the entire contents of a damaged CD by doing "eject/reload/resume" a few times.
In fact, this is what a utility I wrote does, and I've never needed anything fancier to read scratched disks. (It simply uses ReadFile and WriteFile.)
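The utility itself is not shown, so the following is only a guess at the general shape of a resumable copy, in Python instead of raw ReadFile/WriteFile. The function name `copy_resumable` is made up; it resumes from however many bytes the destination already holds.

```python
import os
import tempfile

def copy_resumable(src, dst, chunk_size=64 * 1024):
    """Continue copying src to dst from dst's current size.

    Returns (bytes_in_dst, finished). On a read error it stops and reports
    how far it got, so the caller can reinsert the disc and call it again.
    """
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)
        while True:
            try:
                chunk = fin.read(chunk_size)
            except OSError:
                return offset, False      # read error: keep what we have
            if not chunk:
                return offset, True       # reached end of source
            fout.write(chunk)
            offset += len(chunk)

# Demonstration: pretend an earlier attempt stopped after 70,000 bytes.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src.bin")
    dst = os.path.join(root, "dst.bin")
    payload = os.urandom(200_000)
    with open(src, "wb") as f:
        f.write(payload)
    with open(dst, "wb") as f:
        f.write(payload[:70_000])
    copied, finished = copy_resumable(src, dst)
    with open(dst, "rb") as f:
        identical = f.read() == payload
```

In real use you would call it in a loop, ejecting and reloading the disc whenever it returns with `finished` set to False.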
One step lower would be opening the raw partition (i.e. disk image) by passing a string such as "\.\F:" (note: slashes are literal here) to CreateFile. It would allow you to read raw sectors from a drive, but reconstructing files from that data would be hard.
In fact, the "\.\" syntax allows you to open devices in the "\GLOBAL??" branch of the Windows Object Manager namespace as if they were files. It's not unlike calling dd with /dev/x as a parameter. There is also a "\Device" branch, but that's only accessible via DeviceIoControl() (i.e. ioctl()), meaning there's no simple ReadFile()/WriteFile() interface.
Anything lower level than that would be device-specific, I guess; like reading raw CD-ROM data (including ECC bits) the way some CD-burning programs do. You'd have to do some research on the specific media (CD, flash, DVD) and what your hardware allows you to do on them.
Note: The backslashes seem to get lost on the way to the web page; you need to pass "backslash backslash dot backslash DeviceName" to CreateFile. You need to escape them, too, of course.
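To make the escaping concrete, here is a small Python sketch that builds that device string. The helper name `volume_device_path` is made up, and the drive letter is just the one from the example above.

```python
def volume_device_path(drive_letter: str) -> str:
    """Build the device path for a drive letter such as 'F' or 'F:'."""
    letter = drive_letter.rstrip(":\\")
    # Each backslash is doubled in the source code: this literal is \\.\
    return "\\\\.\\" + letter + ":"

path = volume_device_path("F")
print(path)

# A raw string literal avoids the doubling altogether.
same_path = r"\\.\F:"

# On Windows, opening this path for reading would look like open(path, "rb");
# it is assumed here to need administrator rights, so it is not run.
```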
If you want to do it, do it from the Linux side - see: http://sourceforge.net/projects/monkeycity/ opensource
or ready made app and freeware too: http://www.theabsolute.net/sware/dskinv.html
the first step is dd_rescue. After that, you're free to try anything to reconstruct the data.
And there's GNU ddrescue
GNU ddrescue is a data recovery tool. It copies data from one file or block device (hard disc, cdrom, etc) to another, trying to rescue the good parts first in case of read errors.
Make sure to use the 3-arg version (manual):
ddrescue [options] infile outfile [mapfile]
That is, do use a mapfile even if it's optional, because:
If you use the mapfile feature of ddrescue, the data is rescued very efficiently, (only the needed blocks are read). Also you can interrupt the rescue at any time and resume it later at the same point. The mapfile is an essential part of ddrescue's effectiveness. Use it unless you know what you are doing.
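If you script the rescue, a tiny wrapper can make the mapfile mandatory so it is never forgotten. This Python sketch only builds the argument list in the order from the synopsis above; the device and file names are hypothetical, and nothing is executed.

```python
import shlex

def ddrescue_command(infile, outfile, mapfile, options=()):
    """Build the 3-argument ddrescue invocation; the mapfile is required here."""
    return ["ddrescue", *options, infile, outfile, mapfile]

# Hypothetical source device and output names.
cmd = ddrescue_command("/dev/sr0", "disc.img", "disc.map")
print(shlex.join(cmd))

# To actually run it (not done here):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Running the same command again with the same mapfile is what lets ddrescue resume where it stopped.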
And it's also included in Cygwin and Homebrew.
I don't know what layer exists between Windows Explorer and the Win32 APIs. You can try to write a program with the Win32 File I/O stuff. If that doesn't work, then you have to write your own device driver to get any lower.
I've had some luck from the Linux side, or using BartPE (http://www.nu2.nu/pebuilder/), but just seeing the file doesn't always mean the file is going to be recoverable, whether you're trying from Windows or Linux. Your best bet might be to use a trial of a recovery program.
I have had two disks start to disintegrate on me. From the pattern of unreadable sectors I think they had internal flaking of their emulsion. WinXP Explorer just threw up its hands and said the drive didn't even exist.
In both cases I used "GetDataBack for NTFS" from Runtime Software (http://www.runtime.org/). You can download a free trial which will show you what you could get back if you paid for it. When I bought it it was $49, but I see it is now $79.
This program is amazing. It's not necessarily fast as it will reread some sectors over and over, trying to get a consensus value from multiple tries, but when it's done you can get back stuff that you thought was gone forever. I had one drive that it took over 10 hours to analyze, but when it was done I got back over 97% of a 500GB drive. Definitely worth the price.
Another great tool is Beyond Compare. I have rev 2.5.3, but it is currently at 3.?? and costs $30. They have a full-functionality, 30-day trial. It does a great job of copying large quantities of files (and only those that need to be copied) and, unlike Explorer, it doesn't blow up if something fails. It's sort of like a visual rsync for Windows, if you're familiar with that program from the Samba people.
I have no connection with either of the companies mentioned other than being a very satisfied customer.
The gold standard for recovering data from a magnetic storage device would have to be SpinRite. It's a commercial app though, so you probably wouldn't learn much from it.
If you have a Linux machine around, I can recommend dvdisaster. It is originally meant for creating error correction files, but it also reads DVDs into an image and ignores read errors; and you can use different drives one after another to get missing sectors filled in the image.
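The idea of filling in missing sectors from several drives can be sketched as a simple merge. This is not how dvdisaster is implemented, just an illustration of the principle in Python; `merge_reads` and the sample data are made up.

```python
def merge_reads(sector_count, reads):
    """Combine partial reads into one image.

    `reads` is a list of dicts, one per drive or attempt, mapping a sector
    index to the bytes that were read successfully. Returns the merged image
    (None for sectors nobody could read) and the list of missing indices.
    """
    image = [None] * sector_count
    for good_sectors in reads:
        for index, data in good_sectors.items():
            if image[index] is None:      # keep the first good copy
                image[index] = data
    missing = [i for i, sector in enumerate(image) if sector is None]
    return image, missing

# Drive A could not read sector 2; drive B could not read sectors 0 and 3.
drive_a = {0: b"s0", 1: b"s1", 3: b"s3"}
drive_b = {1: b"s1", 2: b"s2"}
image, missing = merge_reads(4, [drive_a, drive_b])
print(image, missing)
```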