How to know character encoding of file names depending on the filesystem - filesystems

I would like to know the character encoding of the file names in a filesystem in order to display them correctly in a GUI.
How should I do this?
I suppose the character encoding differs depending on the file system (FAT, NTFS, ext3, etc.).
Thank you
(I work in C++ but this topic is not language related)

NTFS is Unicode (UTF-16). exFAT is Unicode as well.
The original FAT and FAT32 use the OEM character set for short (8.3) names, while VFAT long file names are stored as UTF-16 (read more on MSDN).
On Linux and Unix a file name may contain any bytes except NUL and '/', and the character set is not defined. Consequently each application decides for itself which one to use. Many applications use UTF-8. See more in this question.
The above Unix approach is used on most filesystems (mainly because the "charset" concept has more meaning on the OS level than on the storage level). You can check FS capabilities and requirements regarding filename characters here (table 2 column 3).

On Linux, run the following command: locale | egrep "LANG=" | cut -d . -f 2 (or, more directly, locale charmap).
On Unix-like systems, the encoding of file names is not set at the filesystem level, but rather in the user environment. For instance, UTF-8 is the default setting in Ubuntu.
On Windows the default ANSI code page for Western European locales is CP-1252 (often loosely called Latin-1, although it differs from ISO-8859-1 in the 0x80-0x9F range), but the file system stores names as Unicode in UTF-16. See http://en.wikipedia.org/wiki/Filename.
If you use Qt, you can build the following program with Qt Creator; it prints the name of the current user's locale encoding.
#include <QTextCodec>
#include <iostream>
using namespace std;
int main(int argc, char *argv[])
{
    Q_UNUSED(argc); Q_UNUSED(argv);
    QTextCodec* tc = QTextCodec::codecForLocale();
    cout << "Current names text codec: " << tc->name().data() << endl;
    return 0;
}

Related

What does the RCSID with dollar signs as the first and last characters mean in FreeBSD code like touch.c?

See https://opensource.apple.com/source/file_cmds/file_cmds-82/touch/touch.c Line36:
__RCSID("$FreeBSD: src/usr.bin/touch/touch.c,v 1.20 2002/09/04 23:29:07 dwmalone Exp $");
What does this line mean? What is __RCSID and what is the meaning of the string? Is this some standard message for version control?
In cdefs.h I found
#ifndef __RCSID
#define __RCSID(s) __IDSTRING(rcsid,s)
#endif
and
#define __IDSTRING(name, string) static const char name[] __used = string
But I still don't know what they are for.
This comes from the Revision Control System (RCS), one of the earliest version control systems; the convention was later adopted by some other version control systems. If a file contains a string of the form $keyword:...$, the ... portion is replaced automatically with information about the version of the file when the file is checked in and out.
This is typically put into a static variable so that you can then search the resulting object file for the string, to find out what version of the source code was used to generate it. See the ident command for how this is used.
I checked some of my Linux systems and they don't have ident, but you can simply use strings:
strings /usr/bin/touch | grep FreeBSD:

IBM COBOL on AIX file access

We are migrating a bunch of COBOL programs from z/OS to AIX 7. We are using the IBM COBOL Compiler (5.1) on AIX. Now I don't understand how the file access and the file system work for COBOL on AIX.
The COBOL code is straightforward, with
SELECT :FILE: ASSIGN TO :FILE:
ORGANIZATION IS SEQUENTIAL
ACCESS MODE IS SEQUENTIAL
STATUS IS S-STATUS.
and then doing an OPEN INPUT
This compiled fine on AIX:
PP 5724-Z87 IBM COBOL for AIX 5.1.0 in progress ...
LineID Message code Message text
91 IGYGR1216-I A "RECORDING MODE" of "F" was assumed for file
"CUSTOMERS".
94 IGYGR1216-I A "RECORDING MODE" of "F" was assumed for file
"LIST1".
97 IGYGR1216-I A "RECORDING MODE" of "F" was assumed for file "UINPUT".
Messages Total Informational Warning Error Severe Terminating
Printed: 3 3
End of compilation 1, program xxxxx, highest severity: Informational.
Now the problem is that when running the program the file is not found. It gives status code 37.
I know that I have to provide a file system for the file in the shell (ksh), like this:
export CUSTOMERS="STL-CUSTOMERS". The file is in the same directory as the program.
My question is this: Which file system should I use? I tried "STL", which seemed to me like the "standard" AIX file system (which is JFS2). But that doesn't work. The other options are (from the COBOL on AIX Programming Guide 5.1):
DB2
SdU (SMARTdata Utilities)
SFS (Encina Structured File Server)
STL (standard language)
QSAM (queued sequential access method)
RSD (record sequential delimited)
We tried them all, and the file system that worked was "QSAM". The file is a text file (ASCII) that was transferred from the mainframe. But not directly: it was first copied by FTP and then converted to ASCII on Windows (we had to fix the line breaks). When playing around with it to make it work, we edited the file with vi to make the lines 80 characters each. So it was edited on AIX and looks like an ordinary text file on AIX.
Why would COBOL still want QSAM as the "file system"? What does the term "file system" mean here anyway? It seems that it is not meant in the sense of a real file system such as JFS.
I can see where this would be confusing... Especially coming from a Non-POSIX environment.
I'll start with the answer, and then provide more information.
For a normal text file, you want RSD, and to make sure that the \n is column 81.
A record length of 80 is just the data portion, not including the delimiter.
QSAM (fixed length) would appear to work, but would return the \n as part of the data!
Your FS=37 means that the attributes of the file don't match what the program is asking for (more esoteric than, say, FS=39 - invalid fixed attributes). In this case, it means the file you wanted to open was not really an STL file.
By filesystem we mean how the data is physically stored on the platter, SSD, RAM, ... More accurately, how we format the record before handing it off to the next lower level of I/O.
There are two basic types of filesystems:
Native (on top of JFS2, JFS, NFS, CIFS, ...) RSD, QSAM, LSQ, STL, SdU. These filesystems can be operated on by standard OS utilities
Non-Native (on top of another product) DB2 (DB2) and SFS (TxSeries/CICS). These filesystems are invisible to standard OS utilities
Then, there are groupings by type of COBOL organization (preferred):
Sequential: all filesystems (z/OS: QSAM, VSAM)
Relative: STL, SdU, SFS, DB2 (z/OS: VSAM)
Indexed: STL, SdU, SFS, DB2 (z/OS: VSAM)
Of the Native filesystems, QSAM (variable), STL and SdU contain metadata, making them not really viewable in vi, cat, ... OK, you can open them in vi, but they will look like utter garbage.
QSAM is a faithful implementation of z/OS:
Fixed length records: Raw data, no BDW/RDW, no line delimiters (\n).
Variable length records: RDW + raw data (no \n)... However there is no BDW.
RSD is a normal stream (text) file; each record is terminated with \n, which is not counted in the record length (the program will never see them).
LSQ (Line Sequential) is the same as on z/OS - messy semantics.
VSAM is an alias for a filesystem that provides all the features of z/OS VSAM.
Unfortunately, for historical reasons, it points to SdU...
STL is, by far and away, better at everything than is SdU.
SdU was the first filesystem to cover QSAM and VSAM, but is old and decrepit, compared to STL.

Read / Write special characters (like tilde, ñ,...) in a console application C

I'm trying to make a C console application read special Spanish characters (accented letters, 'ñ', etc.) from the keyboard with scanf or gets and then print them with printf.
I have managed to display these characters correctly (whether stored in a variable or printed directly by printf) thanks to locale.h. Here is an example:
#include <stdio.h>
// Add language package
#include <locale.h>
int main(void)
{
    char string[254];
    // Set language to Spanish
    setlocale(LC_ALL, "spanish");
    // Show Spanish special chars correctly
    printf("¡Success!. It is shown special chars like 'ñ' or 'á'.\n\n\n");
    // Gets special chars by keyboard
    printf("Input spanish special chars (such 'ñ'): ");
    gets(string);
    printf("Your string is: %s", string);
    return 0;
}
but I have not yet managed to read them correctly with the functions mentioned above.
Does anyone know how to do it?
Thank you.
EDIT 1:
In testing, I observed that:
setlocale(LC_ALL, "spanish"); displays the Spanish characters correctly, but does not read them correctly from the keyboard.
setlocale(LC_ALL, "es_ES"); reads the Spanish characters correctly from the keyboard, but does not display them well.
EDIT 2:
I have also tried setlocale(LC_ALL, "");, setlocale(LC_ALL, "es_ES.UTF-8"); and setlocale(LC_ALL, "es_ES.ISO_8859-15"); with the same results as in EDIT 1 (either the characters are read correctly from the keyboard or they are displayed correctly in the console, but never both at the same time).
Microsoft's C runtime library (CRT) does not support UTF-8 as the locale encoding. It only supports Windows codepages. Also, "es_ES" isn't a valid CRT locale string, so setlocale would fail, leaving you in the default C locale. Newer versions of Microsoft's CRT support Windows locale names such as "es-ES" (hyphen, not underscore). Otherwise the CRT uses the full names or the old 3-letter abbreviations, e.g. "spanish_spain", "esp_esp" or "esp_esp.1252".
But that's not the end of the story. When reading from and writing to the console using legacy text encodings instead of Unicode, there's another layer of translation in the console itself. To avoid mojibake, you have to set the console input and output codepages (i.e. SetConsoleCP and SetConsoleOutputCP) to match the locale codepage. If you're limited to Spanish or Latin-1, then it should work to set the locale to "spanish" and set the console codepages via SetConsoleCP(1252) and SetConsoleOutputCP(1252). More generally you could look up the ANSI codepage for a given locale name, set the console codepages, and save them in order to reset the console at exit. For example:
wchar_t *locale_name = L"es-ES";
if (_wsetlocale(LC_ALL, locale_name)) {
    int codepage;
    gPrevConsoleCP = GetConsoleCP();
    if (gPrevConsoleCP) { // The process is attached to a console.
        gPrevConsoleOutputCP = GetConsoleOutputCP();
        if (GetLocaleInfoEx(locale_name,
                            LOCALE_IDEFAULTANSICODEPAGE |
                            LOCALE_RETURN_NUMBER,
                            (LPWSTR)&codepage,
                            sizeof(codepage) / sizeof(wchar_t))) {
            if (!codepage) { // The locale doesn't have an ANSI codepage.
                codepage = GetACP();
            }
            SetConsoleCP(codepage);
            SetConsoleOutputCP(codepage);
            atexit(reset_console);
        }
    }
}
That said, when working with the console you will be better off in general if you set stdin and stdout to use _O_U16TEXT mode and use wide-character functions such as fgetws and wprintf. Ultimately, if supported by the C runtime library, this should use the wide-character console I/O functions ReadConsoleW and WriteConsoleW. The downside of using UTF-16 wide-character mode is that it would entail a complete rewrite of your code to use wchar_t strings and wide-character functions and also would require implementing adapters for libraries that work with multibyte encoded strings (preferably UTF-8).

Which path format to use in C on Windows, "D:\\source.txt" or "D:/source.txt"?

Until now I only knew that we can't write D:\demo.txt in a string literal, because \d would be treated as an escape sequence, so we have to use D:\\demo.txt. But minutes ago I found out that D:/demo.txt works just as well, and with / we don't have to worry about escape characters. I am using CodeBlocks on Windows, and I want to know which of these path formats is valid for C on my platform. Here's my code; the commented-out lines work just as well.
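#include <stdio.h>
int main(void)
{
    int ch;  /* int, not char, so that EOF can be told apart from data */
    FILE *fp, *tp;
    fp = fopen("D:\\source.txt", "r");
    //fp = fopen("D:/source.txt", "r");
    tp = fopen("D:\\encrypt.txt", "w");
    //tp = fopen("D:/encrypt.txt", "w");
    if (fp == NULL || tp == NULL) {
        printf("ERROR");
        return 1;  /* do not go on to use a NULL stream */
    }
    while ((ch = getc(fp)) != EOF)
        putc(~ch, tp);
    fclose(fp);
    fclose(tp);
    return 0;
}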
Windows (like MS-DOS before it) requires back-slashes as the path separator for the command line tools built into/provided by Windows.
Internal functions, however, have always accepted forward or backward slashes interchangeably. Personally, I prefer forward slashes as a general rule, but it's mostly personal preference -- either works fine.
It's true that Windows and MS-DOS accept either the forward slash / or the backslash \ as a directory path delimiter. And there are good arguments for using the forward slash in C code, because it doesn't have to be escaped in string and character literals.
But my own preference is to use the backslash (and remember to escape it properly), because most Windows users likely don't know that you can use / as a directory delimiter. It doesn't matter for an fopen call; these are equivalent (on Windows):
fopen("D:\\foo\\bar\\blah.txt", "r");
fopen("D:/foo/bar/blah.txt", "r");
But if that file name is ever shown to a user, IMHO it's a lot better if the message refers to D:\foo\bar\blah.txt.
You could use forward slashes for paths that are used only internally, and backslashes for paths that appear in the user interface, but that's going to be more difficult and error-prone than using one or the other consistently.
Incidentally, the C language says nothing about which character is used as a path delimiter; the language standard doesn't even specify directory support. It's determined by the operating system and file system.

prepend the "\\?\" string to the path - DriverPackageUninstall

I used DriverPackageUninstall, to uninstall my driver. For this API I need to give "Inf Path" as the input. And I need to give this path as UNICODE string. To do this, I took the following statement from MSDN as reference.
For a Unicode string, the maximum length is 32,767 characters. If you
use the Unicode version, prepend the "\\?\" string to the path. For
general information about the format of file path strings, see Naming
a File in the MSDN Library.
But when I try the same in my code it's not working. Can someone give me some examples of how to prepend "\\?\" to the path? Thanks.
UPDATE :
I tried with the below code as sample
#define UNICODE
#define _UNICODE
#define WINVER 0x501
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <tchar.h>
int main () {
    PTCHAR DriverPackageInfPath = TEXT("\\?\\c:\\Documents and Settings\\Desktop\\My.inf");
    FILE * Log;
    Log = _wfopen( DriverPackageInfPath, TEXT("a") );
    if ( Log == NULL ) {
        MessageBox(NULL, TEXT ( "Unable to open INF file\n" ),
                   TEXT ( "Installation Error" ), 0 | MB_ICONSTOP );
        exit ( 1 );
    } else {
        printf ("INF file opened successfully\n");
    }
    return 0;
}
UPDATE:
".\dist\Driver\My.inf" How to add "\\?\" before this kind of paths? "\\?\.\dist\Driver\My.inf" is not working.
You have an error in the string constant:
TEXT("\\?\\c:\\Documents ...."
should be
TEXT("\\\\?\\c:\\Documents ...."
Read carefully, escape carefully : http://msdn.microsoft.com/en-us/library/windows/hardware/ff552316%28v=vs.85%29.aspx
UPDATE:
From http://msdn.microsoft.com/en-us/library/aa365247.aspx :
Win32 File Namespaces
The Win32 namespace prefixing and conventions are summarized in this section and the following section, with descriptions of how they are used. Note that these examples are intended for use with the Windows API functions and do not all necessarily work with Windows shell applications such as Windows Explorer. For this reason there is a wider range of possible paths than is usually available from Windows shell applications, and Windows applications that take advantage of this can be developed using these namespace conventions.
For file I/O, the "\\?\" prefix to a path string tells the Windows APIs to disable all string parsing and to send the string that follows it straight to the file system. For example, if the file system supports large paths and file names, you can exceed the MAX_PATH limits that are otherwise enforced by the Windows APIs. For more information about the normal maximum path limitation, see the previous section Maximum Path Length Limitation.
Because it turns off automatic expansion of the path string, the "\\?\" prefix also allows the use of ".." and "." in the path names, which can be useful if you are attempting to perform operations on a file with these otherwise reserved relative path specifiers as part of the fully qualified path.
Win32 Device Namespaces
The "\\.\" prefix will access the Win32 device namespace instead of the Win32 file namespace. This is how access to physical disks and volumes is accomplished directly, without going through the file system, if the API supports this type of access. You can access many devices other than disks this way (using the CreateFile and DefineDosDevice functions, for example).
For example, if you want to open the system's serial communications port 1, you can use "COM1" in the call to the CreateFile function. This works because COM1–COM9 are part of the reserved names in the NT namespace, although using the "\\.\" prefix will also work with these device names. By comparison, if you have a 100 port serial expansion board installed and want to open COM56, you cannot open it using "COM56" because there is no predefined NT namespace for COM56. You will need to open it using "\\.\COM56" because "\\.\" goes directly to the device namespace without attempting to locate a predefined alias.
Another example of using the Win32 device namespace is using the CreateFile function with "\\.\PhysicalDiskX" (where X is a valid integer value) or "\\.\CdRomX". This allows you to access those devices directly, bypassing the file system. This works because these device names are created by the system as these devices are enumerated, and some drivers will also create other aliases in the system. For example, the device driver that implements the name "C:\" has its own namespace that also happens to be the file system.
APIs that go through the CreateFile function generally work with the "\\.\" prefix because CreateFile is the function used to open both files and devices, depending on the parameters you use.
If you're working with Windows API functions, you should use the "\\.\" prefix to access devices only and not files.
Most APIs won't support "\\.\"; only those that are designed to work with the device namespace will recognize it. Always check the reference topic for each API to be sure.
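A relative path such as .\dist\Driver\My.inf cannot take the prefix directly: because "\\?\" turns off path parsing, the "." is handed to the file system as a literal name instead of being resolved, which is why \\?\.\dist\Driver\My.inf does not work. Convert the path to an absolute one first (for example with GetFullPathName) and prepend the prefix to the result. With C:\full\path\to standing in as a placeholder for your actual directory, that gives
\\?\C:\full\path\to\dist\Driver\My.inf
whose escaped form in C source is
\\\\?\\C:\\full\\path\\to\\dist\\Driver\\My.inf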
You only need to prepend \\?\ to the path if it is longer than MAX_PATH characters.

Resources