Get the latest/recent file from FTP to local using Talend

I have to create a job in Talend that connects to an FTP server. The server receives several files each day with the same prefix but a different timestamp (yyyymmddhhmmss) appended to the filename.
Example -
MyFile20151123142020.xml
MyFile20151123154748.xml
My requirement is to pick the latest (most recent) file and copy it to my local machine.
I understand that this could be achieved either by referring to the latest timestamp in the filename or by referring to the last modified time. I thought of proceeding with the latter, and my job looks like below -
I don't know how to proceed further, or how to use the latest mtime value to pick the most recent file.

After getting the file properties, we need to sort the files by mtime (or by basename) and then pick the first one:
tSortRow : sort by mtime, or by basename if the filenames follow the same pattern
tSampleRow : "1" to keep only the first row
tFTPGet : file mask = row3.basename (row3 being the output flow of tSampleRow)
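For orientation, here is a rough sketch of the whole job; the components before tSortRow are assumptions, since the original job screenshot is not visible:

tFTPConnection
   |
tFTPFileList --(iterate)--> tFTPFileProperties --(row1: basename, mtime)--> tSortRow --(row2)--> tSampleRow --(row3)--> tFlowToIterate --(iterate)--> tFTPGet

In tSortRow, sort on the mtime column in descending order so the most recent file comes out first. If tFTPGet cannot read row3 directly, pass the value through tFlowToIterate and reference it in the file mask as (String)globalMap.get("row3.basename").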

Related

Date in NLog file name and limit the number of log files

I'd like to achieve the following behaviour with NLog for rolling files:
1. prevent renaming or moving the file when starting a new file, and
2. limit the total number or size of old log files to avoid capacity issues over time
The first requirement can be achieved e.g. by adding a timestamp like ${shortdate} to the file name. Example:
logs\trace2017-10-27.log <-- today's log file to write
logs\trace2017-10-26.log
logs\trace2017-10-25.log
logs\trace2017-10-24.log <-- keep only the last 2 files, so delete this one
According to other posts, however, it is not possible to use a date in the file name together with archive parameters like maxArchiveFiles. If I use maxArchiveFiles, I have to keep the log file name constant:
logs\trace.log <-- today's log file to write
logs\archive\trace2017-10-26.log
logs\archive\trace2017-10-25.log
logs\archive\trace2017-10-24.log <-- keep only the last 2 files, so delete this one
But in this case, every day on the first write it moves yesterday's trace to the archive and starts a new file.
The reason I'd like to prevent moving the trace file is that we use a Splunk log monitor that watches the files in the log folder for updates, reads the new lines and feeds them to Splunk.
My concern is that if I have an event written at 23:59:59.567, the next event at 00:00:00.002 clears the previous content before the log monitor is able to read it in that fraction of a second.
To be honest I haven't tested this scenario, as it would be complicated to set up (my team doesn't own Splunk, etc.), so please correct me if this cannot happen.
Note also that I know it is possible to feed Splunk directly in other ways, e.g. via a network connection, but the current setup for Splunk at our company is reading from log files, so it would be easier that way.
Any idea how to solve this with NLog?
When using NLog 4.4 (or older) then you have to go into Halloween mode and make some trickery.
This example makes hourly log-files in the same folder, and ensures archive cleanup is performed after 840 hours (35 days):
fileName="${logDirectory}/Log.${date:format=yyyy-MM-dd-HH}.log"
archiveFileName="${logDirectory}/Log.{#}.log"
archiveDateFormat="yyyy-MM-dd-HH"
archiveNumbering="Date"
archiveEvery="Year"
maxArchiveFiles="840"
archiveFileName - Using {#} allows the archive cleanup to generate a proper file wildcard.
archiveDateFormat - Must match the ${date:format=} of the fileName (so remember to correct both date formats if a change is needed).
archiveNumbering=Date - Configures the archive cleanup to support parsing of filenames as dates.
archiveEvery=Year - Activates the archive cleanup, but also the archive file operation. Because the configured fileName automatically ensures the archive file operation, we don't want any additional archive operations (e.g. this avoids generating extra empty files at midnight).
maxArchiveFiles - How many archive files to keep around.
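For context, here is a minimal sketch of how these attributes would sit in a full NLog.config; the logDirectory value and the target/logger names are placeholders, not from the original:

<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <variable name="logDirectory" value="C:/logs" />
  <targets>
    <!-- Hourly file with the date in the name; the archive attributes exist only to drive cleanup -->
    <target xsi:type="File" name="hourlyFile"
            fileName="${logDirectory}/Log.${date:format=yyyy-MM-dd-HH}.log"
            archiveFileName="${logDirectory}/Log.{#}.log"
            archiveDateFormat="yyyy-MM-dd-HH"
            archiveNumbering="Date"
            archiveEvery="Year"
            maxArchiveFiles="840" />
  </targets>
  <rules>
    <logger name="*" minlevel="Trace" writeTo="hourlyFile" />
  </rules>
</nlog>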
With NLog 4.5 (still in beta), it will be a lot easier, as one just has to specify maxArchiveFiles. See also https://github.com/NLog/NLog/pull/1993

UNIX Shell script: file reading issue

I have to read a file in my shell script. I was using PL/SQL's UTL_FILE to open the file.
But now I have to handle a new change that appends a timestamp to the file name,
e.g. the import.data file becomes import_20152005101200.data.
The timestamp is the time at which the file arrives at the server.
Since the file name keeps changing, I can't use the old way of accessing the file.
I came up with below solution:
UTL_FILE.FOPEN ('path','import_${file_date}.data','r');
To achieve this I have to get the filename, trim it using SUBSTR to extract the timestamp, and pass that to the file_date variable.
However, I am not able to find how to get the filename in a particular path. I could use basename, but my file name keeps changing because of the timestamp.
Any help/ alternate ideas are welcome.
PL/SQL isn't a good tool to solve this problem; UTL_FILE doesn't have any tools to list all the files in a folder.
A better solution is to define a stored procedure which uses UTL_FILE, and to pass the file name to process as an argument to the procedure. That way, you use the shell (which has many powerful commands and tools to examine folders and files) or a scripting language like Python to determine which file to process.
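A minimal shell sketch of that approach; the paths, the procedure name process_import, and the sqlplus credentials are placeholders, not from the original:

#!/bin/sh
# Pick the newest import_*.data file in the drop directory (newest mtime first)
latest=$(ls -t /path/to/drop/import_*.data 2>/dev/null | head -1)
[ -n "$latest" ] || { echo "no import file found" >&2; exit 1; }

# Extract the timestamp part if it is needed separately
name=$(basename "$latest" .data)   # e.g. import_20152005101200
file_date=${name#import_}          # e.g. 20152005101200

# Hand the file name to a stored procedure that wraps UTL_FILE
sqlplus -s user/password@db <<EOF
exec process_import('$(basename "$latest")');
EOF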

Strange timestamp duplication when renaming and recreating a file

I'm trying to rename a log file named appname.log into the form appname_DDMMYY.log for archiving purposes and recreate an empty appname.log for further writing. When doing this in Windows 7 using C++ and either WinAPI or Qt calls (which may be the same internally) the newly created .log file strangely inherits the timestamps (last modified, created) from the renamed file.
This behaviour is also observable when renaming a file in Windows Explorer and creating a file with the same name quickly afterwards in the same directory. But it has to be done fast: after clicking "New Text File" the timestamps are normal, but after renaming it they change to the timestamps the renamed file had (or still has).
Is this some sort of bug? How can I rename a file and recreate it shortly afterwards without getting the timestamps messed up?
This looks like it is by design, perhaps to try to preserve the time for "atomic saving." If an application does something like (save to temp, delete original, rename temp to original) to eliminate the risk of a mangled file, every time you saved a file the create time would increase. A file you have been editing for years would appear to have been created today. This kind of save pattern is very common.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms724320(v=vs.85).aspx
If you rename or delete a file, then restore it shortly thereafter, Windows searches the cache for file information to restore. Cached information includes its short/long name pair and creation time.
Notice that modification time is not restored. So after saving the file appears to have been modified and the creation time is the same as before.
If you create "a-new" and rename it back to "a" you get the old creation time of "a". If you delete "a" and recreate "a" you get the old creation time of "a".
This behaviour is called "File Tunneling". File tunneling exists "...to enable compatibility with programs that rely on file systems being able to hold onto file meta-info for a short period of time". Basically it is backward compatibility for older Windows programs that used a "safe save" function: saving a copy of the new file to a temp file, deleting the original and then renaming the temp file to the original file.
Please see the following KB article: https://support.microsoft.com/en-us/kb/172190 (archive)
As a test example, create FileA, rename FileA to FileB, Create FileA again (within 15 seconds) and the creation date will be the same as FileB.
This behaviour can be disabled in the registry as per the KB article above. This behaviour is also quite annoying when "forensicating" Windows machines.
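As far as I recall, the switch described in that KB article is the MaximumTunnelEntries value (a DWORD set to 0 disables tunneling); verify against the article before relying on this sketch:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"MaximumTunnelEntries"=dword:00000000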
Regards
Adam B
Here's a simple Python script that reproduces the issue on my Windows 7 64-bit system:
import time
import os
def touch(path):
    # Open for append (creating the file if needed) and update its timestamps
    with open(path, 'ab'):
        os.utime(path, None)

touch('a')
print " 'a' timestamp: ", os.stat('a').st_ctime
os.rename('a', 'a-old')
time.sleep(15)
touch('a')
print "new 'a' timestamp: ", os.stat('a').st_ctime
os.unlink('a')
os.unlink('a-old')
With the sleep time ~15 seconds I'll get the following output:
'a' timestamp: 1436901394.9
new 'a' timestamp: 1436901409.9
But with the sleep time <= ~10 seconds one gets this:
'a' timestamp: 1436901247.32
new 'a' timestamp: 1436901247.32
Both files... created 10 seconds apart have the same created-timestamp!

How to process only the last file in a directory using Apache Camel's file component

I have a directory with files likes this:
inbox/
data.20130813T1921.json
data.20130818T0123.json
data.20130901T1342.json
I'm using Apache Camel 2.11 and on process start, I only want to process one file: the latest. The other files can actually be ignored. Alternatively, the older files can be deleted once a new file has been processed.
I'm configuring my component using the following, but it obviously doesn't do what I need:
file:inbox/?noop=true
noop does keep the last file, but also all other files. On startup, Camel processes all existing files, which is more than I need.
What is the best way to only process the latest file?
You can use the sorting options and sort by file name, possibly reversed so that the latest file is first (or last; try it out to see which one you need). Then set maxMessagesPerPoll=1 to only pick up one file. You also need to set eagerMaxMessagesPerPoll=false to allow sorting before limiting the number of files.
You can find details at: http://camel.apache.org/file2. See the section Sorting using sortBy for the sorting.
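Putting those options together, the endpoint could look like the following sketch; this assumes the timestamped names sort lexically, so reversing the name order puts the newest file first:

file:inbox/?noop=true&sortBy=reverse:file:name&maxMessagesPerPoll=1&eagerMaxMessagesPerPoll=false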
An alternative would be to still use the sorting to ensure the latest file is last. Then you can use the Aggregator EIP to aggregate all the files, with org.apache.camel.processor.aggregate.UseLatestAggregationStrategy as the aggregation strategy, so only the last (i.e. the latest) file is kept. You can then set delete=true on the file endpoint to delete the files when done. You would also need to configure the aggregator with completionFromBatchConsumer=true.
The aggregator eip is documented here: http://camel.apache.org/aggregator2
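A rough Java DSL sketch of that alternative; the outbox endpoint is a placeholder, not from the original:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.UseLatestAggregationStrategy;

public class LatestFileRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Sort so the latest file is last; the aggregator keeps only the newest
        // exchange, and delete=true removes the files once the batch is done.
        from("file:inbox/?delete=true&sortBy=file:name")
            .aggregate(constant(true), new UseLatestAggregationStrategy())
                .completionFromBatchConsumer()
            .to("file:outbox"); // placeholder target
    }
}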

Get last modified file without enumerating all files

In C#, given a folder path, is there a way to get the last modified file without enumerating all files?
I need to quickly find folders that have been updated after a certain time, and if the last modified file is from before that time, I want to skip the folder entirely.
I noticed that a folder's last modified time does not get updated when one of its files is updated, so this approach doesn't work.
No; this is why Windows comes with indexing to speed up searching. The NTFS file system wasn't designed with fast searching in mind.
In any case, you can monitor file changes, which is not difficult to do (see the sketch below). If your program can run in the background and monitor changes, this would work. If you need past history, you can do an initial scan once and then build up your hierarchy from there. As long as your program is always running, it will have a current snapshot and never have to do the slow scan.
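A minimal sketch of that monitoring idea using FileSystemWatcher; the folder path is a placeholder:

using System;
using System.IO;

class Watcher
{
    static void Main()
    {
        // Watch the folder tree and record changes as they happen,
        // instead of rescanning everything later.
        using (var fsw = new FileSystemWatcher(@"C:\data") { IncludeSubdirectories = true })
        {
            fsw.Created += (s, e) => Console.WriteLine("{0}: {1} created", DateTime.Now, e.FullPath);
            fsw.Changed += (s, e) => Console.WriteLine("{0}: {1} changed", DateTime.Now, e.FullPath);
            fsw.EnableRaisingEvents = true;
            Console.ReadLine(); // keep the process alive while monitoring
        }
    }
}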
You can also use Windows Search itself to find the files. If indexing is available, it's probably as fast as you'll get.
Try this.
DirectoryInfo di = new DirectoryInfo(strPath);
DateTime dt = di.LastWriteTime;
Then you should use
Directory.EnumerateFiles(strPath, "*.*", SearchOption.TopDirectoryOnly);
Then loop over that collection and get a FileInfo for each file.
I don't see a way to get the modified date of a file without getting a FileInfo reference for that file.
I don't think FileInfo will get this file as far as I know.
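Putting the above together, here is a sketch that scans one folder for its newest write time; the path is a placeholder:

using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string strPath = @"C:\data"; // placeholder folder

        // Enumerate lazily and keep the newest LastWriteTime seen
        DateTime newest = Directory.EnumerateFiles(strPath, "*.*", SearchOption.TopDirectoryOnly)
                                   .Select(f => new FileInfo(f).LastWriteTime)
                                   .DefaultIfEmpty(DateTime.MinValue)
                                   .Max();

        Console.WriteLine("Newest write time in {0}: {1}", strPath, newest);
    }
}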
