tl;dr: How do I ask Windows what the current directory separator character on the system is?
Different versions of Windows seem to behave differently (e.g. \ and / both work on the English versions, ¥ is apparently used on the Japanese version, ₩ is apparently used on the Korean version, etc.).
Is there any way to avoid hard-coding this, and instead ask Windows at run time?
Note:
Ideally, the solution should not depend on a high-level DLL like ShlWAPI.dll, because lower-level libraries also depend on this. So it should really depend on kernel32.dll or ntdll.dll or the like... although I'm having trouble finding anything at all, whether at a high level or at a low level.
Edit:
A little experimentation told me that it's the Win32 subsystem (i.e. kernel32.dll... or is it perhaps RtlDosPathNameToNtPathName_U in ntdll.dll? not sure, didn't test...) which converts forward slashes to backslashes, not the kernel. (Prefixing \\?\ makes it impossible to use forward slashes later in the path -- and the NT native user-mode API also fails with forward slashes.)
So apparently it's not quite "built into" Windows, but rather just a compatibility feature, which means you can't just blindly substitute forward slashes for backslashes: any program that prefixes \\?\ to paths will automatically break on forward slashes.
I have mixed feelings about what conclusions to draw from this, but I just thought I'd mention it.
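For anyone who wants to reproduce this, here is a rough sketch (the win.ini path is just an arbitrary example file) contrasting the plain and \\?\-prefixed forms with CreateFileW; based on the behavior described above, the prefixed call with forward slashes is expected to fail:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // Forward slashes are accepted here because the Win32 layer normalizes them.
    HANDLE h1 = CreateFileW(L"C:/Windows/win.ini", GENERIC_READ,
                            FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
    // With the \\?\ prefix the path is passed through without normalization,
    // so the forward slashes are expected to make this call fail.
    HANDLE h2 = CreateFileW(L"\\\\?\\C:/Windows/win.ini", GENERIC_READ,
                            FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
    printf("plain: %s, prefixed: %s\n",
           h1 != INVALID_HANDLE_VALUE ? "ok" : "failed",
           h2 != INVALID_HANDLE_VALUE ? "ok" : "failed");
    if (h1 != INVALID_HANDLE_VALUE) CloseHandle(h1);
    if (h2 != INVALID_HANDLE_VALUE) CloseHandle(h2);
    return 0;
}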
(I tagged this as "path separator" even though that's technically incorrect because the path separator is used for separating paths, not directories (; vs. \). Hopefully people get what I meant.)
While the ₩ and ¥ characters are shown as directory separator symbols in the Korean and Japanese Windows versions respectively, they are only how those versions of Windows render the same Unicode code point, U+005C, as a glyph. The underlying code point for the backslash is the same across the English, Japanese, and Korean versions of Windows.
Extra confirmation for this can be found on this page: http://msdn.microsoft.com/en-us/library/dd374047(v=vs.85).aspx
Security Considerations for Character Sets in File Names
Windows code page and OEM character sets used on Japanese-language systems contain the Yen symbol (¥) instead of a backslash (\). Thus, the Yen character is a prohibited character for NTFS and FAT file systems. When mapping Unicode to a Japanese-language code page, conversion functions map both backslash (U+005C) and the normal Unicode Yen symbol (U+00A5) to this same character. For security reasons, your applications should not typically allow the character U+00A5 in a Unicode string that might be converted for use as a FAT file name.
Also, I don't know of any Windows API function that gets you the system's path separator, but you can rely on it being \ in all circumstances.
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx#naming_conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
...
Use a backslash (\) to separate the components of a path. The backslash divides the file name from the path to it, and one directory name from another directory name in a path. You cannot use a backslash in the name for the actual file or directory because it is a reserved character that separates the names into components.
...
About /
Windows should support the use of / as a directory separator in the API functions, though not necessarily in the command prompt (command.com).
Note File I/O functions in the Windows API convert "/" to "\" as part of converting the name to an NT-style name, except when using the "\\?\" prefix as detailed in the following sections.
It's 'tough' to figure out the truth of all this, but this might be a really helpful link about / in Windows paths: http://bytes.com/topic/python/answers/23123-when-did-windows-start-accepting-forward-slash-path-separator
The original poster added the phrase "kernel-mode" in a comment to someone else's answer.
If the original question intended to ask about kernel mode, then it probably isn't a good idea to depend on / being a path separator. Different file systems allow different character sets on disk. Different file system drivers in Windows can also allow different character sets; these normally can't include characters that the underlying file system doesn't accept on disk, but the drivers can sometimes behave strangely. For example, POSIX mode allows a component name in an NTFS partition to contain some characters that NTFS ordinarily doesn't allow. (But obviously / isn't one of them, in POSIX.)
In kernel mode in Unicode, U+005C is always a backslash and it is always the path separator. Unicode code points for yen and won are not U+005C and are not path separators.
In kernel mode in ANSI, complications arise depending on which ANSI code page. In code pages that are sufficiently similar to ASCII, 0x5C is a backslash and it is the path separator. In ANSI code pages 932 and 949, 0x5C is not a backslash but 0x5C might be a path separator depending on where it occurs. If 0x5C is the first byte of a multibyte character, then it's a yen sign or won sign and it is a path separator. If 0x5C is the second byte of a multibyte character, then it's not a character by itself, so it's not a yen sign or won sign and it's not a path separator. You have to start parsing from the beginning of the string to figure out if a particular char is actually a whole character or not. Also in Chinese and UTF-8, multibyte characters can be longer than two chars.
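To illustrate the "parse from the beginning of the string" point, here is a minimal sketch (assuming code page 932, and using IsDBCSLeadByteEx from the Win32 API) of how you might locate the last byte that is really a path separator in a multibyte ANSI string:

#include <windows.h>

// Returns a pointer to the last byte that is really a path separator
// (backslash or forward slash) in an ANSI string, or NULL if there is none.
// Bytes that are the trail byte of a double-byte character are skipped.
const char *FindLastSeparatorA(const char *path, UINT codePage)
{
    const char *last = NULL;
    for (const char *p = path; *p != '\0'; ++p) {
        if (IsDBCSLeadByteEx(codePage, (BYTE)*p)) {
            ++p;                 // skip the trail byte; it is not a separator
            if (*p == '\0') break;
        } else if (*p == '\\' || *p == '/') {
            last = p;
        }
    }
    return last;
}

Calling FindLastSeparatorA(path, 932) will skip over any 0x5C that happens to be the trail byte of a two-byte character, which is exactly the pitfall described above.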
The standard forward slash (/) has always worked in all versions of DOS and Windows. If you use it, you don't have to worry about issues with how the backslash is displayed on Japanese and Korean versions of Windows, and you also don't have to special-case the path separator for Windows as opposed to POSIX (including Mac). Just use forward slash everywhere.
Related
I have been trying to check the importance of and reason for using the W WinAPI functions vs. the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example that retrieves the temp path for the current user like this:
#include <windows.h>
#include <stdio.h>

CHAR  pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };

GetTempPathA(MAX_PATH - 1, pszUserTempPathA);   // ANSI (-A) version
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);  // wide-char (-W) version
printf("pathA=%s\r\npathW=%ws\r\n", pszUserTempPathA, pwszUserTempPathW);
My current user has a Russian name, so it's written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would get an error, or a mess of symbols, from GetTempPathA since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So my question is: if my software extracts data into the temp folder of the current user, and that user is Chinese (assuming there are Chinese symbols in the user name), will I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions for production software that works with the WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
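As a rough sketch of that pattern (the file name here is a made-up placeholder, and error handling is minimal), you convert the UTF-8 string with MultiByteToWideChar and then call the -W function:

#include <windows.h>

int main(void)
{
    // Hypothetical UTF-8 path; \xD0\x9F is the Cyrillic letter П.
    const char *utf8Name = "C:\\temp\\\xD0\x9F.txt";

    // First call asks how many wide characters are needed (including the terminator).
    int needed = MultiByteToWideChar(CP_UTF8, 0, utf8Name, -1, NULL, 0);
    if (needed <= 0 || needed > MAX_PATH) return 1;

    WCHAR wideName[MAX_PATH];
    MultiByteToWideChar(CP_UTF8, 0, utf8Name, -1, wideName, needed);

    // Now call the -W version of the API with the converted string.
    HANDLE h = CreateFileW(wideName, GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
    return 0;
}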
UTF-8 is similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course, you need the wide version. The -A ("ASCII") version can't technically handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with non-Cyrillic localization the text will be displayed according to whatever the localization settings define.
I am writing an application in C that will be run in a terminal, and it would be handy, but not necessary, to use some of the less-used Unicode characters. In my experimentation, I have not had any trouble rendering them. However, I would not use any non-ASCII characters if they were a likely source of trouble in the future.
So, in short, can I count on just about any terminal or terminal emulator in the modern *nix world (mainly linux, freebsd, and osx) to properly render arbitrary utf-8 characters?
If I cannot make such an assumption, there are particular subsets of unicode characters defined for various purposes, so would some such subset at least be reliably rendered in any likely modern *nix terminal or terminal emulator?
NOTE: When I say arbitrary, I do mean arbitrary: any unicode characters. But for completeness of my question, I will note that I am primarily interested in arrows and mathematical characters, this link has lists of both: https://en.wikipedia.org/wiki/Unicode_symbols.
No, you should not assume that. Even in a modern system, the set of fonts installed, the font used by the terminal application, and environment variables such as LANG, LC_*, etc. may influence whether certain characters can be displayed correctly on the terminal or not.
You might be able to make reasonable guesses, based on the values of the TERM, LANG, and LC_* environment variables, as to what is supported, but it's still going to be a guess. I'd suggest either not relying on it at all or providing some means of enabling/disabling the use (via an environment variable and/or command-line flags to the application).
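For what it's worth, a minimal sketch of that kind of guess on POSIX systems (it only checks the configured locale, not the terminal or its fonts) could look like this:

#include <locale.h>
#include <langinfo.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    // Adopt the locale from the environment (LANG / LC_* variables).
    setlocale(LC_ALL, "");

    // nl_langinfo(CODESET) reports the character encoding of the current locale.
    const char *codeset = nl_langinfo(CODESET);
    int looks_like_utf8 = (strcmp(codeset, "UTF-8") == 0 ||
                           strcmp(codeset, "utf8") == 0);

    printf("codeset=%s, UTF-8 output is %s\n",
           codeset, looks_like_utf8 ? "plausible" : "risky");
    return 0;
}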
For the most part, this depends on the font, not the terminal. But there are a couple of things the terminal software has to take into account. For example, halfwidth and fullwidth forms of CJK characters.
Also, Unicode characters are added on a regular basis. There's no way that every font and terminal software is automatically updated as soon as a new version of the Unicode standard is released.
In general, you should assume that there are always Unicode characters that are not rendered correctly, even on a modern terminal.
I have written a bunch of web apps and know how to protect against mysql injections and such. I am writing a log storage system for a project in C and I was advised to make sure that it was hack free in the sense that the user could not supply bad data like foo\b\b\b and try to hack into the OS with some rm -rf /* kind of crud. I looked online and found a similar question here: how to check for the "backspace" character in C
This is at least what I thought of, but I know there are probably other things I need to protect against. Can someone who has a bit more experience help me list out the things I need to validate when I am saving files onto a server using user input as part of the hierarchical file naming system?
Example file: /home/webapp/data/{User input}/{Machine-ID}/{hostname}/{tag} where all of these fields could be "faked" when submitted to our log storing system.
Instead of checking for bad characters, turn the problem on its head and specify the good characters. E.g. require {User Input} be a single directory name made of [[:alnum:]_] characters; {Machine-ID} must be made of [[:xdigit:]] to your liking, etc. That gets rid of all the injection stuff quickly.
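A minimal sketch of that whitelist idea in C (the field names and allowed character sets are just examples to adapt):

#include <ctype.h>
#include <stddef.h>

/* Accept only [A-Za-z0-9_] for the {User input} component; reject empty strings. */
int valid_user_component(const char *s)
{
    if (s == NULL || *s == '\0')
        return 0;
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (!isalnum(c) && c != '_')
            return 0;
    }
    return 1;
}

/* Accept only hex digits for {Machine-ID}. */
int valid_machine_id(const char *s)
{
    if (s == NULL || *s == '\0')
        return 0;
    for (; *s != '\0'; ++s) {
        if (!isxdigit((unsigned char)*s))
            return 0;
    }
    return 1;
}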
If you're only ever using these inputs as file names inside your program, and you're storing them on a native Linux filesystem, then the critical things to watch for are:
Absolutely proscribe any file name that starts with ../, contains /../, or ends with /.. (such file names could allow the user to reach files outside the directory tree that you're working in).
Be wary of any file name containing / as these allow the user to name subdirectories, possibly with unintended consequences.
Other things that could cause trouble include:
Non-ASCII characters that may have a different meaning if used in a different locale.
Some ASCII punctuation characters may have a special meaning in parts of your processing system or may be invalid in some filesystems.
Some parts of your system may be case-sensitive with other parts being case-insensitive. Consider normalizing the case.
If applicable, restrict each field to something that isn't going to cause any trouble. For example:
A machine ID should probably consist of only ASCII lowercase letters and digits (or only ASCII uppercase letters and digits).
A hostname should consist of only ASCII lowercase letters and digits, plus - but not in an initial position (use Punycode for non-ASCII host names). If these are fully qualified host names, as opposed to host names in a network, then . is also valid, but not in initial position.
No field should be empty or contain a / or start with a . (an initial . could be . or .. — see above — and would be a dot file that ls doesn't show by default and isn't included in the pattern * in shells, so they're best avoided).
While control characters such as backspace aren't directly harmful, they can be indirectly harmful in that if you're investigating an issue on the command line, they can cause you to make mistakes. Do not allow them.
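Putting those rules together, a minimal per-component check might look like the sketch below (the exact policy is of course yours to decide, and this is not a complete defense on its own):

#include <ctype.h>

/* Returns 1 if `name` is acceptable as a single path component:
   not empty, no leading dot (which also rejects "." and ".."),
   no slashes, and no control characters. */
int safe_component(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return 0;
    if (name[0] == '.')                    /* rejects ".", "..", and dot files */
        return 0;
    for (const char *p = name; *p != '\0'; ++p) {
        unsigned char c = (unsigned char)*p;
        if (c == '/' || c == '\\')         /* no sub-directories */
            return 0;
        if (iscntrl(c))                    /* no backspace, newline, escape, ... */
            return 0;
    }
    return 1;
}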
I am working on a small text replacement application that basically lets the user select a file and replace text in it without ever having to open the file itself. However, I want to make sure that the function only runs for files that are text-based. I thought I could accomplish this by checking the encoding of the file, but I've found that Notepad .txt files use Unicode UTF-8 encoding, and so do MS Paint .bmp files. Is there an easy way to check this without placing restrictions on the file extensions themselves?
Unless you get a huge hint from somewhere, you're stuck. Purely by examining the bytes there's a non-zero probability you'll guess wrong given the plethora of encodings ("ASCII", Unicode, UTF-8, DBCS, MBCS, etc). Oh, and what if the first page happens to look like ASCII but the next page is a btree node that points to the first page...
Hints can be:
extension (not likely that foo.exe is editable)
something in the stream itself (like BOM [byte-order-marker])
user direction (just edit the file, goshdarnit)
Windows used to provide an API IsTextUnicode that would do a probabilistic examination, but there were well-known false-positives.
My take is that trying to be smarter than the user has some issues...
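If you do want to act on the BOM hint above, a minimal sketch that just inspects the first few bytes (UTF-8 and UTF-16 only; remember that many perfectly good text files have no BOM at all) could be:

#include <stdio.h>

/* Returns a short description of a leading BOM, or NULL if none was recognized. */
const char *detect_bom(FILE *f)
{
    unsigned char b[3] = { 0 };
    size_t n = fread(b, 1, 3, f);
    rewind(f);
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)                 return "UTF-16 LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)                 return "UTF-16 BE";
    return NULL;   /* no BOM: could still be ANSI, UTF-8 without BOM, or binary */
}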
Honestly, given the Windows environment that you're working with, I'd consider a whitelist of known text formats. Windows users are typically trained to stick with extensions. However, I would personally relax the requirement that it not function on non-text files, instead checking with the user for a go-ahead if the file does not match the internal whitelist. The risk of changing a binary file would be mitigated if your search string is long - that is assuming you're not performing Y2K conversion (a la sed 's/y/k/g').
It's pretty costly to determine if a file is text-based or not (i.e. a binary file). You would have to examine each byte in the file to determine if it is a valid character, irrespective of the file encoding.
Others have said to look at all the bytes in the file and see if they're alphanumeric. Some UNIX/Linux utils do this, but just check the first 1K or 2K of the file as an "optimistic optimization".
Well, a text file contains text, right? So a really easy way to check whether a file contains only text is to read it and check whether every character is alphanumeric (or allowed whitespace).
So basically, the first thing you have to do is check the file encoding. If it's pure ASCII, you have an easy task: just read the whole file into a char array (I'm assuming you are doing it in C/C++ or similar) and check every char in that array with functions like isalpha and isdigit. Of course you have to take care of special exceptions like the tab '\t', the space ' ', and the newline ('\n' on Linux, '\r''\n' on Windows).
In the case of a different encoding the process is the same, except that you have to use different functions to check whether the current character is alphanumeric. Also note that in the case of UTF-16 or greater, a simple char array is simply too small... but if you are doing it in, for example, C#, you don't have to worry about the size :)
You can write a function that will try to determine if a file is text based. While this will not be 100% accurate, it may be just enough for you. Such a function does not need to go through the whole file, about a kilobyte should be enough (or even less). One thing to do is to count how many whitespaces and newlines are there. Another thing would be to consider individual bytes and check if they are alphanumeric or not. With some experiments you should be able to come up with a decent function. Note that this is just a basic approach and text encodings might complicate things.
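As a concrete sketch of that heuristic (the 1 KB sample size and the "mostly printable" threshold are arbitrary choices, and as noted above, encodings can complicate things):

#include <ctype.h>
#include <stdio.h>

/* Crude guess: sample the first 1 KB and call the file "text" if it contains
   no NUL bytes and at least ~90% of the sampled bytes are printable or whitespace. */
int looks_like_text(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;

    unsigned char buf[1024];
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    if (n == 0)
        return 1;                      /* empty file: treat as text */

    size_t printable = 0;
    for (size_t i = 0; i < n; ++i) {
        if (buf[i] == 0)
            return 0;                  /* NUL byte: almost certainly binary */
        if (isprint(buf[i]) || isspace(buf[i]))
            ++printable;
    }
    return printable * 10 >= n * 9;    /* at least 90% printable */
}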
Could someone provide (or point me to a list) of all the illegal characters in the XFS filesystem? I'm writing an app that needs to sanitize filenames.
EDIT:
Okay, so POSIX filesystems should allow all characters except the NUL character and the forward slash, and the '.' and '..' filenames are reserved. All other exceptions are application-level. Thanks!
POSIX filesystems (including XFS) allow every character in file names, with the exception of NUL (0x00) and forward-slash (/; 0x2f).
NUL marks the end of a C-string; so it is not allowed in file names.
/ is the directory separator, so it is not allowed.
File names starting with a dot (.; 0x2e) are considered hidden files. This is a userland convention, not a kernel or filesystem one.
There may be conventions you're following — for example, UTF-8 file names — in which case, there are many, many more restrictions including which normalization form to use.
Now, you probably want to disallow other things too; file name with all kinds of weird characters are no fun to deal with. I strongly suggest the whitelist approach.
Also, when handling file names, beware of the .. entry in every directory. You don't want to traverse it and allow an arbitrary path.
Source: Single Unix Spec v. 3, §3.169, "the characters composing the name may be selected from the set of all character values excluding the slash character and the null byte."
According to Wikipedia, any character except NUL is legal in an XFS filesystem file name. Of course, POSIX typically doesn't allow the forward slash '/' in a filename. Other than this, anything should be good, including international characters.