Working with large files with a particular extension using directory scan operator - data-ingestion

I have a file of 1 GB+ coming into my directory from MQ. The transfer takes some time to complete, but the file appears in that directory even while it is still incomplete.
I am afraid my directoryScan operator will pick up an incomplete file.
Also, I cannot add an initial delay, because I am not sure how much time the transfer will take.
PS: I read somewhere that some file transfer protocols handle this by giving the file a different extension until it is complete. So if my directoryScan operator is waiting for files with a .txt extension, the transfer protocol would write the file with, say, an .abc extension until the transfer finishes.
How should I go ahead with this?

If you were going to use the regular expression route, here is an example of invoking the operator to only read files of a certain extension:
// DirectoryScan operator with an absolute directory argument and a file name pattern
stream<rstring name> Dir2 = DirectoryScan()
{
    param
        directory : "/tmp/work";
        pattern   : "\\.txt$";
}
If that doesn't work, is it possible to set up MQ to write the file to a different directory and then move it into your target directory when it is complete?
One thing you could do, if you know the size of the file, is use the Size() function to ignore the file until it is the right size. This snippet uses a Custom operator to wait until the file size is at least 2000 bytes.
graph
    stream<rstring filename, uint64 size> DirScanOutput = DirectoryScan() {
        param
            directory : "test1";
            sleepTime : 10.0; // wait 10 s between scans
            pattern   : ".*\\.txt";
        output
            DirScanOutput : size = Size();
    }

    stream<rstring file> FileNameStream = Custom(DirScanOutput as In) {
        logic
            onTuple In : {
                if (size < 2000ul) {
                    printStringLn("Required size not met yet.");
                } else {
                    printStringLn("Size of file reached.");
                    submit({file = filename}, FileNameStream);
                }
            }
    }

    stream<cityData> CityDataRecord = FileSource(FileNameStream) {
        param
            format : csv;
    }
I hope one of these suggestions works for you.


How to interact with an external text editor in C

I am developing a command line application in C (linux environment) to edit a particular file format. This file format is a plain XML file, which is compressed, then encrypted, then cryptographically signed.
I'd like to offer an option to the user to edit this kind of file in an easy way, without the hassle of manually extracting the file, editing it, and then compressing, encrypting and signing it.
Ideally, when called, my application should do the following:
Open the encrypted/compressed file and extract it to a temporary location (like /tmp)
Call an external text editor like nano or sublime-text or gedit, depending on which is installed and maybe on the user's preferences. Wait until the user has edited the file and closed the text editor.
Read the modified temporary file and encrypt/compress it, replacing the old encrypted/compressed file
How can I achieve point no. 2?
I thought about calling nano with system() and waiting for it to return, or placing an inotify() on the temp file to know when it is modified by the graphical text editor.
Which solution is better?
How can I call the user's default text editor?
Anything that can be done in a better way?
First, consider not writing an actual application or wrapper yourself, which calls another editor, but rather writing some kind of plugin for some existing editor which is flexible enough to support additional formats and passing its input through decompression.
That's not the only solution, of course, but it might be easier for you.
With your particular approach, you could:
Use the EDITOR and/or VISUAL environment variables (as also pointed out by @KamilCuk) to determine which editor to use.
Run the editor as a child process, so that you know when it ends execution, rather than having to otherwise communicate with it. Being notified of changes to the file, or even of its opening or closing, is not good enough, since the editor may make changes to the file multiple times, and some editors don't even keep the file open while you work on it in them.
Remember to handle the cases of the editor failing to come up, hanging, or you getting some notification to stop waiting for the editor, and so on.
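The child-process advice could be sketched like this in C, using fork(), execlp() and waitpid(); the VISUAL-before-EDITOR order is the usual convention, and the vi fallback is an arbitrary choice, not part of the answer above:

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the user's preferred editor on `path` as a child process and
 * block until it exits.  Returns the editor's exit status, or -1 on
 * fork/wait failure or abnormal termination. */
int edit_file(const char *path) {
    const char *editor = getenv("VISUAL");
    if (editor == NULL || *editor == '\0')
        editor = getenv("EDITOR");
    if (editor == NULL || *editor == '\0')
        editor = "vi";                  /* arbitrary fallback */

    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */
    if (pid == 0) {                     /* child: become the editor */
        execlp(editor, editor, path, (char *)NULL);
        _exit(127);                     /* exec failed */
    }
    int status;                         /* parent: wait for the editor */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the editor is a direct child, there is no need to watch the file for changes just to learn when editing is done.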
Call an external text editor like nano or sublime-text or gedit, depending on which is installed and maybe on the user's preferences. Wait until the user has edited the file and closed the text editor.
Interesting question. One way to open the XML file with the user's default editor is xdg-open, but it doesn't give you the PID of the application in which the user will edit the file.
You can use xdg-mime query default application/xml to find out the .desktop file of the default editor, but then you have to parse this file to figure out the executable path of the program. This is exactly how xdg-open works: in its search_desktop_file() function, the line starting with Exec= is simply extracted from the *.desktop file to call the editor executable, passing the target file as an argument. What I am trying to say is: after you find the editor executable, you can start it, wait until it's closed, and then check whether the file content has changed. That looks like a lot of unnecessary work, though...
Instead, you can try a fixed, well-known editor such as gedit to achieve the desired workflow. You can also give the user a way (e.g. a prompt or a config file) to set a default XML editor, e.g. /usr/bin/sublime_text, which your program can then use on the next run.
However, the key here is to open the editor in a way that blocks the calling process until the user closes it. After the editor is closed, you can simply check whether the file has changed and, if so, perform further operations.
To find out whether the file contents have been modified, you can use the stat system call to get the inode change time of the file before you open it, and then compare that timestamp with the current one once the editor is closed.
e.g.:
stat -c %Z filename
Output: 1558650334
Wrapping up:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Run a shell command; if result is non-NULL, capture its first token. */
void execute_command(char *cmd, char *result) {
    FILE *fp = popen(cmd, "r");
    if (fp == NULL)
        return;
    if (result != NULL)
        fscanf(fp, "%s", result);
    pclose(fp); /* also waits for the command to finish */
}

int get_changetime(char *filename) {
    char cmd[4096];
    char output[32]; /* enough for a Unix timestamp */
    sprintf(cmd, "stat -c %%Z %s", filename);
    execute_command(cmd, output);
    return atoi(output);
}

int main() {
    char cmd[4096];
    char *filename = "path/to/xml-file.xml";
    int ctime_before = get_changetime(filename);
    sprintf(cmd, "gedit %s", filename);
    execute_command(cmd, NULL); /* popen + pclose block until gedit exits */
    if (ctime_before != get_changetime(filename)) {
        printf("file modified!\n");
        /* do your work here... */
    }
    return 0;
}

Fatfs significant slow down in directories with many files

I have a data logging system running on an STM32F7 which is storing data using FatFs by ChaN to an SD card:
http://elm-chan.org/fsw/ff/00index_e.html
Each new set of data is stored in a separate file within a directory. During post-processing on the device, each file is read and then deleted. After testing the open, read, delete sequence in a directory with 5000 files, I found that the further through the directory I scanned, the slower it got.
At the beginning this loop would take around 100-200 ms; 2000 files in, it now takes 700 ms. Is there a quicker way of storing, reading, and deleting the data, or of configuring FatFs?
edit: Sorry, I should have specified: I am using FAT32 as the FAT file system.
f_opendir(&directory, "log");
while (1) {
    f_readdir(&directory, &fInfo);
    if (fInfo.fname[0] == 0) {
        // end of the directory
        break;
    }
    if (fInfo.fname[0] == '.') {
        // ignore the dot entries
        continue;
    }
    if (fInfo.fattrib & AM_DIR) {
        // it's a directory (shouldn't be here), ignore it
        continue;
    }
    sprintf(path, "log/%s", fInfo.fname);
    f_open(&file, path, FA_READ);
    f_read(&file, rBuf, btr, &br);
    f_close(&file);
    // process data...
    f_unlink(path); // delete after processing
}
You can keep the directory chains shorter by splitting your files into more than one directory (simply create a new subdirectory for every 500 files or so). This can make access to a specific file quite a bit faster, as the chains to walk become shorter on average. (This is just assuming that you are not searching for files with a specific name, but rather process files in the order they have been created - In this case the search algorithm can be pretty straightforward).
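The bucketing scheme can be as simple as deriving the subdirectory from a running file number. A sketch (the bucket size of 500 and the log/NNNN naming are arbitrary choices, not something FatFs prescribes):

```c
#include <stdio.h>

/* Keep each FAT directory chain short by putting file number N into
 * subdirectory log/<N / FILES_PER_DIR>.  With sequential processing,
 * each f_readdir/f_open walk then scans at most FILES_PER_DIR entries. */
#define FILES_PER_DIR 500

int make_log_path(char *buf, size_t len, unsigned fileno, const char *name) {
    return snprintf(buf, len, "log/%04u/%s", fileno / FILES_PER_DIR, name);
}
```

On the writer side you would call f_mkdir() whenever fileno crosses a bucket boundary; the reader then processes one subdirectory at a time and removes it when empty.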
Other than that, there is not much hope to get a simple FAT file system any faster. This is a principal problem of the old FAT technology.

Program to compile files in a directory in openedge

Could someone help me write a program that compiles all the files in a directory and reports errors, if any? The program has to get the list of all files under the folder with their full paths and store it in a temp-table, and then loop through the temp-table and compile the files.
Below is a very rough start.
Look for more info around the COMPILE statement and the COMPILER system handle in the online help (F1).
Be aware that compiling requires you to have a developer license installed. Without it the COMPILE statement will fail.
DEFINE VARIABLE cDir  AS CHARACTER NO-UNDO.
DEFINE VARIABLE cFile AS CHARACTER NO-UNDO FORMAT "x(30)".

ASSIGN
    cDir = "c:\temp\".

INPUT FROM OS-DIR(cDir).
REPEAT:
    IMPORT cFile.
    IF cFile MATCHES "*..p" THEN DO:
        COMPILE VALUE(cDir + cFile) SAVE NO-ERROR.
        IF COMPILER:ERROR THEN DO:
            DISPLAY
                cFile
                COMPILER:GET-MESSAGE(1) FORMAT "x(60)"
            WITH FRAME frame1 WIDTH 300 20 DOWN.
        END.
    END.
END.
INPUT CLOSE.
Since the comment wouldn't let me paste this much into it: INPUT FROM OS-DIR returns all of the files and directories under a directory. You can use this information to keep going down the directory tree and find all subdirectories.
OS-DIR documentation:
Sometimes, rather than reading the contents of a file, you want to read a list of the files in a directory. You can use the OS-DIR option of the INPUT FROM statement for this purpose.
Each line read from OS-DIR contains three values:
* The simple (base) name of the file.
* The full pathname of the file.
* A string value containing one or more attribute characters. These characters indicate the type of the file and its status.
Every file has one of the following attribute characters:
* F - Regular file or FIFO pipe
* D - Directory
* S - Special device
* X - Unknown file type
In addition, the attribute string for each file might contain one or more of the following attribute characters:
* H - Hidden file
* L - Symbolic link
* P - Pipe file
The tokens are returned in the standard ABL format that can be read by the IMPORT or SET statements.

How to quickly create large files in C?

I am doing research on file system performance, and I am stumped on how to create a very large file very quickly in C. Basically, I am trying to re-create a file system's folders and files by taking this metadata and storing it into a file. This is the extraction process. Later, I want to restore those folders and files into an existing freshly-made file system (in this case, ext3) using the metadata I previously extracted.
In the restore process, I have already succeeded in creating all the folders. However, I am a little confused on how to create the files instantly. For every file that I want to create, I have a file size and a file path. I am just confused on how to set the size of the file very quickly.
I used truncate, but this does not seem to affect the disk space usage from the point of view of the file system.
Thanks!
#include <stdio.h>
#include <stdlib.h>

int main() {
    int i;
    FILE *fp;
    fp = fopen("bigfakefile.txt", "w");
    for (i = 0; i < (1024 * 1024); i++) {
        fseek(fp, (1024 * 1024), SEEK_CUR);
        fprintf(fp, "C");
    }
    fclose(fp);
    return 0;
}
There is no way to do it instantly.
You need to have each block of the file written on disk and this is going to take a significant period of time, especially for a large file.

Check if a file equals to other

Well, I have one file on my server and another on my computer. What I want to do is a simple updater that checks whether the file on my computer is equal to the one uploaded to the server. (If they are equal, the file hasn't been updated; if not, download it.)
I'm using QNetworkAccessManager to download files. Any idea?
You can generate a checksum from a file in the following way:
QCryptographicHash hash( QCryptographicHash::Sha1 );
QFile file( fileName );
if ( file.open( QIODevice::ReadOnly ) ) {
    hash.addData( file.readAll() );
} else {
    // Handle "cannot open file" error
}

// Retrieve the SHA1 signature of the file
QByteArray sig = hash.result();
Do this for both files (while somehow getting the signature from one machine to the other) and compare the results.
You could calculate the SHA-1 checksum of the file and then compare the two checksums. If they are equal, the files have the same contents.
You will need something on your server (a web service, or a plain servlet/PHP script) that takes a file name (or ID or something) as a parameter and replies with that file's checksum (SHA1, MD5).
If your local file checksum differs from the remote one - download it.
