I have looked up various methods of converting wide strings to single-byte characters, but am unsure how to approach the char d_name[260] field of a struct dirent entry pertaining to a file with a wide character name.
Is the d_name field completely invalid or is there any way to obtain a valid wide string or single-byte string from it?
Any help would be much appreciated, thank you.
Updates:
Platform: Windows 10
Compiler: MinGW
Problematic section of code (minimum reproducible example):
struct dirent *entry;
DIR *dir = opendir(srcDir);
time_t mtime = 0;
size_t PDFlen = 260 + HPlen + 12 + 1;
char *PDF = malloc(PDFlen);
char finalPDF[264] = {0};
memset_c(PDF, 0, PDFlen);
while ((entry = readdir(dir)) != NULL) {
if (endswith(entry->d_name, ".pdf")) {
strcpy_c(PDF, srcDir); strcat_c(PDF, "\\"); strcat_c(PDF, entry->d_name);
stat64(PDF, &buffer);
if (buffer.st_mtime > mtime) {
mtime = buffer.st_mtime;
#ifdef _WIN32
strcpy_c(finalPDF, "\\");
#else
strcpy_c(finalPDF, "/");
#endif
strcat_c(finalPDF, entry->d_name);
}
}
}
... then later on I am unable to open finalPDF with fopen(), hence my question (finalPDF is a PDF file in Spanish whose filename contains two non-ASCII characters).
P.S. srcDir is my Downloads folder and HPlen is the number of characters in my home path.
I'm trying to list all the contents of a folder (including subfolder and its files)
Like ls -R with Linux
(I am using windows 10)
I already have this basic code with "dirent.h"
#include <stdio.h>
#include <dirent.h>
int main()
{
DIR *rep ;
struct dirent *file ;
rep = opendir ("c:\\test") ; /* backslash must be escaped; "\t" would be a tab */
if (rep != NULL)
{
while (file = readdir(rep))
printf ("%s\n", file->d_name) ;
(void) closedir (rep) ;
}
return 0;
}
It lists the contents of a folder well, but it does not descend into the sub-folders.
For example, it should be able to walk a whole hard drive,
such as C:\
I can't use d_type to detect whether an entry is a file or a folder,
because on Windows the struct is:
struct dirent
{
long d_ino; /* Always zero. */
unsigned short d_reclen; /* Always zero. */
unsigned short d_namlen; /* Length of name in d_name. */
char d_name[260]; /* [FILENAME_MAX] */ /* File name. */
};
So I'm stuck on this problem. If anyone has an idea, or even some code, I would appreciate it.
COMPILER: MinGW32 1.5.0
Here is an example of directory list for Windows.
I used Microsoft Visual Studio Community 2019 to build it. It is built as a Unicode Windows application, so files and folders whose names contain non-ASCII characters are handled correctly.
To achieve that, I used the usual Windows wide-character data types and functions:
char -> WCHAR
strcpy -> wcscpy
strcat -> wcscat
strncmp -> wcsncmp
printf -> wprintf
Depending on the compiler you use, you may use the standard data types and functions.
String constants are prefixed with L to make them wide (Unicode) strings; on Windows a wide character is 16 bits.
The main function is ScanDir which take the starting directory and a file mask. Example of call:
ScanDir(L"C:\\Users\\fpiette\\Documents", L"*.jpg");
ScanDir will scan the specified folder for all files and then scan again for all directories, calling ScanDir recursively. For each file, the size and filename are displayed (Of course you may display other properties like time stamp and attributes). For each directory, the name is displayed.
Basically, iterating a directory is done using the Windows functions FindFirstFile and FindNextFile.
Source code:
#include <stdio.h>
#include <stdlib.h>
#include <io.h>
#include <fcntl.h>
#include <Windows.h>
BOOL ScanDir(
WCHAR* srcDir,
WCHAR* fileMask)
{
WIN32_FIND_DATA fd;
HANDLE fh;
BOOL more;
WCHAR fromDir[MAX_PATH];
BOOL found;
_int64 fileSize;
wcscpy(fromDir, srcDir);
wcscat(fromDir, L"\\");
wcscat(fromDir, fileMask);
// First step: process files in current dir
fh = FindFirstFile(fromDir, &fd);
more = fh != INVALID_HANDLE_VALUE;
found = FALSE;
while (more) {
// Ignore directories in first step
if (0 == (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
if (!found) {
// For the first file found, display the title
found = TRUE;
wprintf(L"\nDirectory %s\n\n", srcDir);
}
fileSize = ((_int64)fd.nFileSizeHigh << 32) + fd.nFileSizeLow;
// display file information
wprintf(L"%12lld %s\n", fileSize, fd.cFileName);
}
more = FindNextFile(fh, &fd);
}
FindClose(fh);
// Second step: recursively process subfolders
wcscpy(fromDir, srcDir);
wcscat(fromDir, L"\\*.*");
fh = FindFirstFile(fromDir, &fd);
more = fh != INVALID_HANDLE_VALUE;
while (more) {
// Ignore files in second step
if (0 != (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
// We have a directory, process it recursively
if (wcsncmp(fd.cFileName, L".", 2) && // Ignore current directory "."
wcsncmp(fd.cFileName, L"..", 3)) { // Ignore parent directory ".."
wcscpy(fromDir, srcDir);
wcscat(fromDir, L"\\");
wcscat(fromDir, fd.cFileName);
if (!ScanDir(fromDir, fileMask)) {
FindClose(fh); // Do not leak the search handle on early exit
return FALSE;
}
}
}
more = FindNextFile(fh, &fd);
}
FindClose(fh);
return TRUE;
}
int main()
{
// Change console output to unicode 16 bit (default is OEM)
_setmode(_fileno(stdout), _O_U16TEXT);
ScanDir(L"C:\\Users\\fpiette\\Documents", L"*.jpg");
return 0;
}
I generate a list of files using dirent, but I am getting worried about directories and files that contain Unicode characters.
void recurse_dir(char *dir)
{
DIR* d;
d = opendir(dir);
struct dirent* ent;
unsigned short int dir_size = strlen(dir), tmp_dir_size;
if(d != NULL)
{
while((ent = readdir(d)) != NULL)
{
if(ent->d_type == DT_DIR)
{
if(!strcmp(ent->d_name,".") || !strcmp(ent->d_name,".."))
continue;
folder_count++;
char tmp_dir[dir_size + strlen(ent->d_name) + 2];
tmp_dir[0] = '\0';
strcat(tmp_dir,dir);
strcat(tmp_dir,"/");
strcat(tmp_dir,ent->d_name);
recurse_dir(tmp_dir);
}
else
{
file_count++;
file_strs_size += dir_size + strlen(ent->d_name) + 2;
fprintf(list_fp, "%s/%s\n",dir, ent->d_name);
}
}
}
closedir(d);
}
Is there a way that I can get the ent->d_name in wide string format?
You can store all Unicode characters in a char array, using UTF-8 format. That is probably the way your OS is storing that name, so if you want the name in UTF-16 or UTF-32 you can do the conversion using a function that takes care of that, for example iconv.
Just run mbstowcs() over the file name. On most Linux systems the names are UTF-8; provided the program has selected a UTF-8 locale with setlocale(), mbstowcs() will convert them to a wchar_t string.
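A minimal sketch of the mbstowcs() approach described above. The helper name to_wide is hypothetical, and the sketch relies on the POSIX behaviour that mbstowcs() with a null destination returns the required length. The conversion follows the current locale, so the caller is assumed to have called setlocale(LC_ALL, "") (with a UTF-8 locale active) before passing non-ASCII names.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns a malloc'd wide copy of the multibyte
   string s, or NULL on an invalid byte sequence or allocation failure.
   The caller frees the result. Decoding uses the current locale. */
wchar_t *to_wide(const char *s)
{
    /* With a null destination, POSIX mbstowcs() only counts characters. */
    size_t n = mbstowcs(NULL, s, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* n + 1 leaves room for the terminating L'\0'. */
    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}
```

Typical use inside the readdir() loop would be wchar_t *w = to_wide(ent->d_name); followed by free(w) once the name has been used. On Windows the C runtime locale is usually not UTF-8, so this sketch is aimed at POSIX systems.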
As a side note, struct dirent.d_type is not very portable. It's useful as a shortcut/performance optimization, but:
some file systems (XFS is the most well known example) will always store DT_UNKNOWN in that member, so your code will fail there;
It's not part of POSIX, so some operating systems (e.g., Solaris) don't even have it, so your code won't compile there.
In my case, I've used a switch and a bit of preprocessor magic to handle both in the same code.
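The code behind that remark is not shown, so the following is only a sketch of one way to handle both cases, not the answerer's actual implementation. It assumes a POSIX system; entry_is_dir and count_subdirs are hypothetical names, and the _DIRENT_HAVE_D_TYPE guard is the glibc feature macro (other C libraries may need a different check). d_type is used when it is present and informative, and stat() is the fallback.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: 1 if parent/ent is a directory, 0 if not, -1 on error. */
static int entry_is_dir(const char *parent, const struct dirent *ent)
{
#if defined(_DIRENT_HAVE_D_TYPE) && defined(DT_DIR) && defined(DT_UNKNOWN)
    /* Fast path: trust d_type unless the file system left it unset. */
    if (ent->d_type != DT_UNKNOWN)
        return ent->d_type == DT_DIR;
#endif
    /* Fallback: build the full path and ask stat(). */
    char path[4096];
    struct stat st;
    int n = snprintf(path, sizeof path, "%s/%s", parent, ent->d_name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    if (stat(path, &st) != 0)
        return -1;
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

/* Counts the immediate sub-directories of dir; returns -1 if dir cannot be opened. */
int count_subdirs(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        if (entry_is_dir(dir, ent) == 1)
            count++;
    }
    closedir(d);
    return count;
}
```

Note that on the fast path a symbolic link to a directory is reported as a link, not a directory, while the stat() fallback follows it; a real program should pick one behaviour deliberately.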
#include<stdio.h>
#include<stdlib.h>
#include<dirent.h>
#include<string.h>
int main()
{
FILE *fin,*fout;
char dest[80]="/home/vivs/InexCorpusText/";
char file[30];
DIR *dir;
char c,state='1';
int len;
struct dirent *ent;
if((dir=opendir("/home/vivs/InexCorpus"))!=NULL)
{
while((ent=readdir(dir))!=NULL)
{
if(strcmp(ent->d_name,".") &&
strcmp(ent->d_name,"..") &&
strcmp(ent->d_name,".directory"))
{
len=strlen(ent->d_name);
strcpy(file,ent->d_name);
file[len-3]=file[len-1]='t';
file[len-2]='x';
//strcat(source,ent->d_name);
strcat(dest,file);
printf("%s\t%s\n",ent->d_name,dest);
fin=fopen(ent->d_name,"r");
fout=fopen(dest,"w");
while((c=fgetc(fin))!=EOF)
{
if(c=='<')
{
fputc(' ',fout);
state='0';
}
else if(c=='>')
state='1';
else if(state=='1')
{
if(c!='\n')
fputc(c,fout);
if(c=='.')
{
c=fgetc(fin);
if(c==' '||c=='\n'||c=='<')
{
fputc('\n',fout);
ungetc(c,fin);
}
else fputc(c,fout);
}
}
}
}
close(fin);
close(fout);
strcpy(dest,"/home/vivs/InexCorpusText/");
}
closedir(dir);
}
else
{
printf("Error in opening directory\n");
}
return 0;
}
I was trying to convert XML files to text. This code simply removes the tags and nothing else.
When I execute this code on around 300 files, it doesn't show any error, but when the number goes to 500 or more I get a segmentation fault after processing around 300 files.
At least one reason 'right from the start':
Here is struct dirent declaration from man:
On Linux, the dirent structure is defined as follows:
struct dirent {
ino_t d_ino; /* inode number */
off_t d_off; /* offset to the next dirent */
unsigned short d_reclen; /* length of this record */
unsigned char d_type; /* type of file; not supported
by all file system types */
char d_name[256]; /* filename */
};
You are in trouble on any name longer than 29 characters. file has only 30 bytes, one of which is needed for the '\0' terminator, so strcpy() writes past the end of the array:
char file[30];
...
strcpy(file,ent->d_name);
There are two structures within XML that it does not appear that you account for.
Attribute contents can contain unescaped > characters, which could throw off your count. See http://www.w3.org/TR/REC-xml/#NT-AttValue.
CDATA sections can contain both < and > characters as literal text, as long as they do not appear as part of the closing ]]> string. See http://www.w3.org/TR/REC-xml/#NT-CharData. This could seriously throw off your logic.
Why don't you look in your files to see if any contain the text CDATA?
You might want to consider using xsltproc or libxslt; a very simple XSLT transform would give you exactly what you want. See Extract part of an XML file as plain text using XSLT for such a transform engine.
OK, another problematic place:
len=strlen(ent->d_name);
....
file[len-3]=file[len-1]='t';
file[len-2]='x';
Because d_name could have fewer than 3 characters, len-3 can become negative, which again leads to a memory overwrite.
You should be careful with functions like strlen() and always validate their result before using it in index arithmetic.
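Both problems above (the fixed 30-byte buffer and the unchecked length) can be avoided with a bounded, validated helper. This is a sketch under assumptions: replace_ext_txt is a hypothetical name, and unlike the original code it replaces whatever follows the last dot instead of blindly overwriting the last three characters.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes name with its extension replaced by ".txt"
   into out. Returns 0 on success, -1 if name has no extension (including
   dotfiles such as ".directory") or if out is too small. */
int replace_ext_txt(const char *name, char *out, size_t outsz)
{
    const char *dot = strrchr(name, '.');
    if (dot == NULL || dot == name)
        return -1;

    size_t stem = (size_t)(dot - name);   /* characters before the last dot */

    /* snprintf never writes more than outsz bytes and reports truncation
       through its return value. */
    int n = snprintf(out, outsz, "%.*s.txt", (int)stem, name);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}
```

In the loop from the question, the buffer would be sized from the declared d_name length, for example char file[sizeof ent->d_name + 4], and entries for which the helper returns -1 would be skipped.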
There should be something elegant in Linux API/POSIX to extract base file name from full path
See char *basename(char *path).
Or run the command "man 3 basename" on your target UNIX/POSIX system.
Use basename (which has odd corner case semantics) or do it yourself by calling strrchr(pathname, '/') and treating the whole string as a basename if it does not contain a '/' character.
Here's an example of a one-liner (given char * whoami) which illustrates the basic algorithm:
(whoami = strrchr(argv[0], '/')) ? ++whoami : (whoami = argv[0]);
An additional check is needed if argv[0] can be NULL. Also note that the result just points into the original string, so a strdup() may be appropriate.
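The same strrchr() idea, written out as a function with the NULL check included. The name path_basename is hypothetical, and only '/' is treated as a separator, as in the one-liner.

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the last path component
   inside path itself (no copy is made), or NULL if path is NULL.
   A path ending in '/' yields an empty string. */
const char *path_basename(const char *path)
{
    if (path == NULL)
        return NULL;

    const char *slash = strrchr(path, '/');
    /* No separator: the whole string is the base name. */
    return slash ? slash + 1 : path;
}
```

Because the result aliases the input, it stays valid only as long as the original string does; copy it if it has to outlive the path.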
You could use strstr in case you are interested in the directory names too:
const char *path = "ab/cde/fg.out";
const char *ssc;
ssc = strstr(path, "/");
while (ssc) {                 /* also safe when the path contains no '/' */
    path = ssc + 1;           /* step past the separator just found */
    ssc = strstr(path, "/");
}
printf("%s\n", path);
The basename() function returns the last component of a path, which could be a folder name and not a file name. There are two versions of the basename() function: the GNU version and the POSIX version.
The GNU version is declared in string.h if you define _GNU_SOURCE before including it:
#define _GNU_SOURCE
#include <string.h>
The GNU version uses const and does not modify the argument.
char * basename (const char *path)
This function is overridden by the XPG (POSIX) version if libgen.h is included.
char * basename (char *path)
This function may modify the argument by removing trailing '/' bytes. The result may be different from the GNU version in this case:
basename("foo/bar/")
will return the string "bar" if you use the XPG version and an empty string if you use the GNU version.
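Since the XPG (POSIX) version may modify its argument, it should not be handed a string literal or a buffer you still need. A minimal sketch of a wrapper that works on a private copy; posix_basename_copy is a hypothetical name, and libgen.h is assumed to be available.

```c
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: copies path, runs the POSIX basename() on the copy,
   and stores the result in out. Returns 0 on success, -1 on allocation
   failure or if out is too small. */
int posix_basename_copy(const char *path, char *out, size_t outsz)
{
    size_t len = strlen(path);
    char *tmp = malloc(len + 1);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, path, len + 1);

    /* basename() may strip trailing '/' bytes from tmp and may return a
       pointer into tmp, so the result is copied out before tmp is freed. */
    const char *base = basename(tmp);
    int n = snprintf(out, outsz, "%s", base);

    free(tmp);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

With this wrapper, "foo/bar/" yields "bar", matching the XPG behaviour described above.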
References:
basename (3) - Linux Man Pages
Function: char * basename (const char *filename), Finding Tokens in a String.
Of course if this is a Gnu/Linux only question then you could use the library functions.
https://linux.die.net/man/3/basename
And though some may disapprove, the POSIX version of this library function does not take a const argument, because it may modify the string. If that is important to you, I guess you will have to stick to your own functionality, or maybe the following will be more to your taste?
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
char *fn;
char *input;
if (argc > 1)
input = argv[1];
else
input = argv[0];
/* handle trailing '/' e.g.
input == "/home/me/myprogram/" */
if (input[0] != '\0' && input[strlen(input) - 1] == '/') /* guard against an empty string */
input[strlen(input) - 1] = '\0';
(fn = strrchr(input, '/')) ? ++fn : (fn = input);
printf("%s\n", fn);
return 0;
}
template<typename charType>
charType* getFileNameFromPath( charType* path )
{
if( path == NULL )
return NULL;
charType * pFileName = path;
for( charType * pCur = path; *pCur != '\0'; pCur++)
{
if( *pCur == '/' || *pCur == '\\' )
pFileName = pCur+1;
}
return pFileName;
}
call:
wchar_t * fileName = getFileNameFromPath < wchar_t > ( filePath );
(This is C++.)
You can escape the backslashes in the string literal and use this code:
#include <stdio.h>
#include <string.h>
int main(void)
{
char path[] = "C:\\etc\\passwd.c"; //string with escaped backslashes
char temp[256]; //result here
char *ch; //define this
ch = strtok(path, "\\"); //first split
while (ch != NULL) {
strcpy(temp, ch);//copy result
printf("%s\n", ch);
ch = strtok(NULL, "\\");//next split
}
printf("last filename: %s", temp);//result filename
return 0;
}
I used a simpler way to get just the filename or last part in a path.
char *extract_file_name(char *path)
{
    size_t len = strlen(path);
    /* Walk backwards; i is one past the character tested, so index 0 is covered too. */
    for (size_t i = len; i > 0; i--)
    {
        if (path[i-1] == '\\' || path[i-1] == '/')
            return path + i;
    }
    return path; /* no separator: the whole string is the file name */
}
Input path = "C:/Users/me/Documents/somefile.txt"
Output = "somefile.txt"
@Nikolay Khilyuk offers the best solution, with two caveats.
1) Go back to using char *; there is absolutely no good reason for using const.
2) This code is not portable and is likely to fail on non-POSIX systems where / is not the path delimiter, depending on the compiler implementation. For some Windows compilers you might want to test for '\\' instead of '/'. You might even test for the system and set the delimiter based on the result.
The function name is long but descriptive, so no problem there. There is no way to be sure that a function will return a filename; you can only be sure that it can if the function is coded correctly, which you achieved. If someone uses it on a string that is not a path, it will obviously fail. I would probably have named it basename, as that would convey its purpose to many programmers, but that is just my preference; your name is fine. As for the length of the string this function can handle, and why anyone thought that would be a concern: you are unlikely to deal with a path name longer than this function can handle. The index is a size_t, an unsigned integer type whose width is implementation-defined; it is commonly 32 bits (0 to 4,294,967,295) or 64 bits.
I proofed your function with the following.
#include <stdio.h>
#include <string.h>
char* getFileNameFromPath(char* path);
int main(int argc, char *argv[])
{
char *fn;
fn = getFileNameFromPath(argv[0]);
printf("%s\n", fn);
return 0;
}
char* getFileNameFromPath(char* path)
{
for(size_t i = strlen(path); i > 0; i--) /* safe for "" and checks index 0 */
{
if (path[i-1] == '/')
{
return &path[i];
}
}
return path;
}
Worked great, though Daniel Kamil Kozar did find a 1 off error that I corrected above. The error would only show with a malformed absolute path but still the function should be able to handle bogus input. Do not listen to everyone that critiques you. Some people just like to have an opinion, even when it is not worth anything.
I do not like the strstr() solution as it will fail if filename is the same as a directory name in the path and yes that can and does happen especially on a POSIX system where executable files often do not have an extension, at least the first time which will mean you have to do multiple tests and searching the delimiter with strstr() is even more cumbersome as there is no way of knowing how many delimiters there might be. If you are wondering why a person would want the basename of an executable think busybox, egrep, fgrep etc...
strrchr() would be cumbersome to implement as it searches for characters, not strings, so I do not find it nearly as viable or succinct as this solution. I stand corrected by Rad Lexus: this would not be as cumbersome as I thought, since strrchr() returns a pointer to the last occurrence of the character, so the file name starts one character past it.
Take Care
My example (improved; this is C++, because of the default argument and nullptr):
#include <string.h>
const char* getFileNameFromPath(const char* path, char separator = '/')
{
if(path != nullptr)
{
for(size_t i = strlen(path); i > 0; --i)
{
if (path[i-1] == separator)
{
return &path[i];
}
}
}
return path;
}
How would one search for files on a computer?
Maybe looking for certain extensions.
I need to iterate through all the files and examine file names.
Say I wanted to find all files with an .code extension.
For Windows, you would want to look into the FindFirstFile() and FindNextFile() functions. If you want to implement a recursive search, you can use GetFileAttributes() to check for FILE_ATTRIBUTE_DIRECTORY. If the file is actually a directory, continue into it with your search.
A nice wrapper for FindFirstFile is dirent.h for windows (google dirent.h Toni Ronkko)
#define S_ISREG(B) (((B) & _S_IFMT) == _S_IFREG) /* compare the whole file-type field */
#define S_ISDIR(B) (((B) & _S_IFMT) == _S_IFDIR)
static void
scan_dir(DirScan *d, const char *adir, BOOL recurse_dir)
{
DIR *dirfile;
int adir_len = strlen(adir);
if ((dirfile = opendir(adir)) != NULL) {
struct dirent *entry;
char path[MAX_PATH + 1];
char *file;
while ((entry = readdir(dirfile)) != NULL)
{
struct stat buf;
if(!strcmp(".",entry->d_name) || !strcmp("..",entry->d_name))
continue;
sprintf(path,"%s/%.*s", adir, MAX_PATH-2-adir_len, entry->d_name);
if (stat(path,&buf) != 0)
continue;
file = entry->d_name;
if (recurse_dir && S_ISDIR(buf.st_mode) )
scan_dir(d, path, recurse_dir);
else if (match_extension(path) && _access(path, R_OK) == 0) // e.g. match .code
strs_find_add_str(&d->files,&d->n_files,_strdup(path));
}
closedir(dirfile);
}
return;
}
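The match_extension() helper called in scan_dir is not shown in the answer. The following is only a hypothetical sketch of such a check, with the extension passed as a parameter and compared case-insensitively, since Windows file names are usually treated that way; has_extension is an assumed name, not part of the original code.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if path ends with ext (for example ".code"),
   ignoring ASCII case, and 0 otherwise. A name that consists of the
   extension alone, such as ".code", does not count as a match. */
int has_extension(const char *path, const char *ext)
{
    size_t plen = strlen(path);
    size_t elen = strlen(ext);
    if (elen == 0 || plen <= elen)
        return 0;

    const char *tail = path + (plen - elen);
    for (size_t i = 0; i < elen; i++) {
        /* Cast to unsigned char: tolower() is undefined for negative values. */
        if (tolower((unsigned char)tail[i]) != tolower((unsigned char)ext[i]))
            return 0;
    }
    return 1;
}
```

In scan_dir the test would then read has_extension(path, ".code") in place of match_extension(path).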
Use the FindFirstFile() and FindNextFile() functions and a recursive algorithm to traverse sub-folders.
FindFirstFile()/FindNextFile() will do the job of listing the files in a directory. To search recursively through the sub-directories, you might use _splitpath to split a path into its directory and file-name parts, and then use the resulting directory to continue the search.