What are all the illegal characters in the XFS filesystem? - filesystems

Could someone provide (or point me to a list) of all the illegal characters in the XFS filesystem? I'm writing an app that needs to sanitize filenames.
EDIT:
Okay, so POSIX filesystems should allow all characters except NUL and the forward slash, and the filenames '.' and '..' are reserved. All other exceptions are application-level. Thanks!

POSIX filesystems (including XFS) allow every character in file names, with the exception of NUL (0x00) and forward-slash (/; 0x2f).
NUL marks the end of a C-string; so it is not allowed in file names.
/ is the directory separator, so it is not allowed.
File names starting with a dot (.; 0x2e) are considered hidden files. This is a userland convention, not a kernel or filesystem one.
There may be conventions you're following — for example, UTF-8 file names — in which case, there are many, many more restrictions including which normalization form to use.
Now, you probably want to disallow other things too; file names with all kinds of weird characters are no fun to deal with. I strongly suggest the whitelist approach.
Also, when handling file names, beware of the .. entry in every directory. You don't want to traverse it and allow an arbitrary path.
Source: Single Unix Spec v. 3, §3.169, "the characters composing the name may be selected from the set of all character values excluding the slash character and the null byte."
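The two levels of checking described above can be sketched in Python. This is only an illustration: the function names, the 255-byte length limit (a common NAME_MAX value), and the exact whitelist are assumptions, not anything XFS or POSIX mandates for your application.

```python
import re

def is_valid_posix_component(name: bytes, max_len: int = 255) -> bool:
    """Return True if `name` is acceptable to the filesystem as one path component.

    max_len is an assumed limit; check NAME_MAX on the target system.
    """
    if not name or len(name) > max_len:
        return False
    if name in (b".", b".."):
        # Reserved directory entries, never usable as new names.
        return False
    # NUL ends the C string, '/' separates components.
    return b"\x00" not in name and b"/" not in name

# Whitelist policy (an application-level choice): ASCII letters, digits,
# '.', '_' and '-', with an alphanumeric first character. The first-character
# rule also rules out '.', '..', hidden files, and names that look like options.
_SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")

def is_safe_name(name: str) -> bool:
    """Return True if `name` passes the stricter whitelist policy."""
    return _SAFE_NAME.fullmatch(name) is not None
```

The first function answers "will the filesystem take this?", while the second answers "do I want to deal with this?". Most applications want the second.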

According to Wikipedia, any character except NUL is legal in an XFS filesystem file name. Of course, POSIX also doesn't allow the forward slash '/' in a filename, since it is the directory separator. Other than this, anything should be good, including international characters.

Related

Can an empty string be a legal file name?

I would like to know if an empty string is a legal file name in any of the commonly used operating and file systems.
I guess not - but I can only guess.
Context: If the empty string is a universally illegal filename, it would be a simple-to-use test case for non-existent files. However, tests may err if there are situations where the empty string is a valid file name.
Perhaps unfortunately for you, it depends! Different filesystems firmly disagree about what a "legal file name" is (a frequent source of confusion is that some are case-insensitive).
I suspect that some filesystem you encounter will support it, but that case may not really matter for you.
Your best bet would be to explicitly prevent silly names in your program (see below) or to test on a subset of filesystems with known versions.
Some helpful information
at least ext4 probably explicitly prevents it https://unix.stackexchange.com/questions/66965/files-with-empty-names
What characters are forbidden in Windows and Linux directory names?
However, for program design, I would instead try to follow these rules:
Restrict filenames to an exact set of characters (perhaps with a regex).
Explicitly prevent empty or blank strings and reserved names (" ", \0, localhost, null, ...).
Store whatever "name" users cook up to trouble your programs and filesystem in a field in a meta-file or database you control.
Name the file or entry yourself, perhaps with a UUID to guarantee uniqueness everywhere. This could also (or instead) be a SQLite db entry or some config in a new directory for their work. Here you are effectively creating your own filesystem with whatever rules you like.
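A minimal sketch of the last two rules in Python, assuming a JSON file is good enough as the metadata store. The function name `store_upload` and the `index.json` file are hypothetical, and a real system would want locking or a proper database.

```python
import json
import uuid
from pathlib import Path

def store_upload(root: Path, display_name: str, data: bytes) -> str:
    """Write `data` under a generated name and record the user's name in an index."""
    root.mkdir(parents=True, exist_ok=True)

    # The on-disk name is generated by us: 32 hex digits, safe on any filesystem.
    file_id = uuid.uuid4().hex
    (root / file_id).write_bytes(data)

    # The user-supplied name is only ever stored as data, never used as a path.
    index_path = root / "index.json"  # hypothetical metadata file
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    index[file_id] = display_name
    index_path.write_text(json.dumps(index), encoding="utf-8")
    return file_id
```

With this layout, an empty or hostile display name is harmless, because the filesystem never sees it.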

What character encoding is used by fopen() or open()?

When you use a function like fopen(), you have to pass it a string argument for the filename. I want to know what the character encoding of this string should be.
This question has already been asked here, but it has contradictory answers. One answer says the following:
It depends on the system locale. Look at the output of the "locale"
command. If the variables end in UTF-8, then your locale is UTF-8.
Most modern linuxes will be using UTF-8. Although Andrew is correct
that technically it's just a byte string, if you don't match the
system locale some programs may not work correctly and it will be
impossible to get correct user input, etc. It's best to stick with
UTF-8.
While another answer says the following:
Filesystem calls on Linux are encoding-agnostic, i.e. they do not
(need to) know about the particular encoding. As far as they are
concerned, the byte-string pointed to by the filename argument is
passed down to the filesystem as-is. The filesystem expects that
filenames are in the correct encoding (usually UTF-8, as mentioned by
Matthew Talbert).
This means that you often don't need to do anything (filenames are
treated as opaque byte-strings), but it really depends on where you
receive the filename from, and whether you need to manipulate the
filename in any way.
Which answer is the correct one?
They're both correct in some ways.
The strings passed to the file system calls are a string of bytes, with a null byte marking the end of the string and '/' used to separate path components. Within the file name segments, the meaning of the bytes is immaterial to the file system — they're just a sequence of bytes.
How the bytes that form the file name are displayed depends on the equipment used to display them. If the names use UTF-8 with non-ASCII characters, printing that data using ISO 8859-15 (or 8859-1 for intransigent residents of the USA) yields gibberish, often including C1 control bytes from the byte range 0x80 .. 0x9F. If the names use 8859-15 with non-ASCII characters, there will be sequences that are not valid UTF-8 and you will get illegible or meaningless data displayed (question marks, or other indications of invalid UTF-8 sequences).
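Both failure modes are easy to reproduce. The following Python sketch uses a made-up file name and only shows what happens when the same bytes are interpreted under the wrong encoding; no filesystem call is involved.

```python
name = "café.txt"  # example name, chosen for its single non-ASCII character

utf8_bytes = name.encode("utf-8")          # 'é' becomes the two bytes C3 A9
latin_bytes = name.encode("iso-8859-15")   # 'é' becomes the single byte E9

# UTF-8 bytes displayed as ISO 8859-15: each byte of 'é' is shown as its own character.
misread = utf8_bytes.decode("iso-8859-15")

# ISO 8859-15 bytes displayed as UTF-8: E9 announces a multi-byte sequence that
# never arrives, so a lenient decoder substitutes U+FFFD (the replacement character).
replaced = latin_bytes.decode("utf-8", errors="replace")

# The kernel would accept either byte string as a file name without complaint.
print(repr(misread), repr(replaced))
```

If you need to move between `str` and the bytes the OS sees in Python, `os.fsencode()` and `os.fsdecode()` apply the interpreter's filesystem encoding for you.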

Parsing stdin for files provided by ls

TL;DR: Is the output of ls standardised so that there is a perfect way to parse it into an array of file names?
I have to write a program that processes some files, the program specification states this:
Your program should read a list of files from the standard entry
And an example is given of how the program will be used:
ls /usr/include/std*.h | ./distribuer 3
Where distribuer is the name of my program.
From my tests, I see that ls separates the file names with tabs when called with this sort of argument containing a wildcard. Is this behaviour standard? Or might ls sometimes use simple whitespace characters or even newlines when called with similar wildcard arguments?
Finally, while this might be an edge case, I am also worried that since Unix allows tabs and spaces in filenames, it could actually be impossible to reliably parse the output of ls. Is that correct?
Is the output of ls standardised so that there is a perfect way to parse it into an array of files names?
The output of ls is certainly standardised, by the Posix standard. In the section STDOUT, the standardised formats are described:
The default format shall be to list one entry per line to standard output; the exceptions are to terminals or when one of the -C, -m, or -x options is specified.
As well as a cautionary note about an important context in which the output is not standardised:
If the output is to a terminal, the format is implementation-defined.
(There is quite a lot of specification of how the format changes with different command-line parameters, which I'm not quoting because it is not immediately relevant here.)
So the standardised format, applicable if stdout is not directed to a terminal and if no command-line options are provided (or if the -1 option is provided, even if stdout is a terminal) is to print one entry per line.
Unfortunately, that does not provide a "perfect way" to parse the output, because it is legal for filenames to include newline characters, and a filename which includes a newline character will obviously span more than one line. If all you have is the ls output, there is no 100% reliable way to tell whether a newline (other than the last one) indicates the end of a filename or is a newline character in the filename.
For the purposes of your assignment, the simple strategy would be just to ignore that imperfection (or, better, document it and then ignore it), which is the same strategy that many Unix utilities use. Files whose names include newlines are extremely rare in the wild, and people who create files with newlines in their names probably deserve the problems they will cause themselves. However, you will find a lot of people here (including me, sometimes) suggesting that scripts should work correctly with all legal filenames. So the rest of this answer discusses some of the possible responses to this pedantry. Note that none of them are "perfect".
One imperfect solution is to try to figure out whether a given newline is embedded or not. If you know the list was produced by ls without any sorting options, you might be able to guess correctly in most cases by using the fact that ls presents files sorted by the current locale's collation rules. So if a line is out of sequence (either less than the preceding line or greater than the following one) then it is appropriate to guess that it is a continuation of the filename. That won't always work, and I don't know any utility which tries it, but it might be worth mentioning.
If you were running ls yourself, you could take advantage of the -q option, which causes non-printing characters (including tabs and newlines) to be replaced with ? in the output. That forces the filename to be printed on a single line, but has the disadvantage that you no longer know what the filename was before the substitution, since there are a variety of characters which could be replaced with a question mark (including a question mark itself). You might be able to query the filesystem to find the real name of the file, but there are a lot of corner cases I'm not going to go into since the premise of this paragraph is not applicable to the actual problem.
The most common solution is to allow the user to tell your utility that filenames are separated with a NUL character rather than a newline. This is 100% reliable because filenames cannot contain NUL characters -- in fact, NUL is the only byte that cannot appear somewhere in a pathname. Unfortunately, ls does not provide an option to produce output in this format, but the user could use the find utility to generate the same listing as ls and then use the non-standard but widely-implemented -print0 option to write out the filenames with NUL terminators. (If only Posix standard options to find are available, you can still produce the output by using -exec with an appropriate command to output the name.)
Many utilities which accept lists of filenames on standard input have (non-standard) options to specify a delimiter character, or to specify that the delimiter is NUL instead of newline. See, for example, xargs -0, sort -z (Gnu or BSD) or read -d (bash). So this is probably a reasonable enhancement if you're interested in coding it.
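As a rough sketch of what such an option involves, here is the splitting step in Python. The function name `split_names` is made up, and input is handled as bytes because file names are not guaranteed to be valid text in any encoding.

```python
def split_names(data: bytes, delimiter: bytes = b"\n") -> list:
    """Split raw standard-input bytes into file names.

    One trailing delimiter is treated as a terminator, not as the start
    of an extra empty name.
    """
    if data.endswith(delimiter):
        data = data[: -len(delimiter)]
    return data.split(delimiter) if data else []

# Newline mode: a name containing a newline is silently cut in two.
ambiguous = split_names(b"one\ntwo\nthree.h\n")

# NUL mode: the embedded newline survives, because NUL cannot occur in a name.
exact = split_names(b"one\ntwo\x00three.h\x00", delimiter=b"\x00")
```

A program would read `sys.stdin.buffer.read()` and pick the delimiter from a flag of its own choosing, so that something like `find /usr/include -name 'std*.h' -print0 | ./distribuer -0 3` works (the `-0` flag here is hypothetical, modelled on `xargs -0`).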
It's worth noting that most standard shell utilities do not provide an option to take a list of filenames through standard input. Most utilities prefer to receive filenames as command-line arguments. This works well because when the shell expands "globs" (like *) specified on a command-line, it does not rerun word-splitting on the output; each filename becomes a single argument. That means that
./distribuer *
is almost perfect as a way of passing a list of filenames to a utility. But it is still not quite perfect because there is a limit on the total size of the command-line arguments you can provide in a single command line. So if the directory has a really large number of files, the expansion of * might exceed that limit, causing the utility execution to fail. find also just passes filenames through to -exec as single arguments without word-splitting, and the use of {} + as an -exec command terminator will split the filenames into sets which are small enough that they will not exceed the command-line limit. That's safer than ./distribuer *, but it does mean that the utility may be called several times, once for each set. (And it's also a bit annoying getting the find predicates to give you exactly what you want.)

Validation for user input on file system

I have written a bunch of web apps and know how to protect against mysql injections and such. I am writing a log storage system for a project in C and I was advised to make sure that it was hack free in the sense that the user could not supply bad data like foo\b\b\b and try to hack into the OS with some rm -rf /* kind of crud. I looked online and found a similar question here: how to check for the "backspace" character in C
This is at least what I thought of, but I know there are probably other things I need to protect against. Can someone who has a bit more experience help me list out the things I need to validate when I am saving files onto a server using user input as part of the hierarchical file naming system?
Example file: /home/webapp/data/{User input}/{Machine-ID}/{hostname}/{tag} where all of these fields could be "faked" when submitted to our log storing system.
Instead of checking for bad characters, turn the problem on its head and specify the good characters. E.g. require {User Input} be a single directory name made of [[:alnum:]_] characters; {Machine-ID} must be made of [[:xdigit:]] to your liking, etc. That gets rid of all the injection stuff quickly.
If you're only ever using these inputs as file names inside your program, and you're storing them on a native Linux filesystem, then the critical things to watch for are:
Absolutely proscribe any file name that is exactly .., starts with ../, contains /../, or ends with /.. (such file names could allow the user to reach files outside the directory tree that you're working in).
Be wary of any file name containing / as these allow the user to name subdirectories, possibly with unintended consequences.
Other things that could cause trouble include:
Non-ASCII characters that may have a different meaning if used in a different locale.
Some ASCII punctuation characters may have a special meaning in parts of your processing system or may be invalid in some filesystems.
Some parts of your system may be case-sensitive with other parts being case-insensitive. Consider normalizing the case.
If applicable, restrict each field to something that isn't going to cause any trouble. For example:
A machine ID should probably consist of only ASCII lowercase letters and digits (or only ASCII uppercase letters and digits).
A hostname should consist of only ASCII lowercase letters and digits, plus - but not in an initial position (use Punycode for non-ASCII host names). If these are fully qualified host names, as opposed to host names in a network, then . is also valid, but not in initial position.
No field should be empty or contain a / or start with a . (an initial . could be . or .. — see above — and would be a dot file that ls doesn't show by default and isn't included in the pattern * in shells, so they're best avoided).
While control characters such as backspace aren't directly harmful, they can be indirectly harmful in that if you're investigating an issue on the command line, they can cause you to make mistakes. Do not allow them.
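The question is about C, but the whitelist idea is language-independent; here is a compact Python sketch of it. The per-field patterns and length limits are assumptions chosen for illustration, and `build_log_path` is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath

BASE = PurePosixPath("/home/webapp/data")

# One whitelist pattern per field; anything not matched in full is rejected.
# The character sets and lengths are illustrative, not requirements.
FIELD_RULES = {
    "user": re.compile(r"[A-Za-z0-9_]{1,64}"),
    "machine_id": re.compile(r"[0-9a-f]{1,64}"),
    "hostname": re.compile(r"[a-z0-9][a-z0-9-]{0,62}"),
    "tag": re.compile(r"[A-Za-z0-9_]{1,64}"),
}

def build_log_path(user: str, machine_id: str, hostname: str, tag: str) -> PurePosixPath:
    """Validate every field against its whitelist, then join them under BASE."""
    fields = {"user": user, "machine_id": machine_id, "hostname": hostname, "tag": tag}
    for key, value in fields.items():
        # fullmatch() must consume the whole string, so a trailing newline,
        # a '/', a leading '.', or a control character all fail here.
        if FIELD_RULES[key].fullmatch(value) is None:
            raise ValueError(f"invalid {key}: {value!r}")
    return BASE.joinpath(user, machine_id, hostname, tag)
```

Because none of the patterns admits `/` or `.`, the traversal cases (`..`, `../x`, `x/../y`) never need separate checks.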

Get directory separator char on Windows? ('\', '/', etc.)

tl;dr: How do I ask Windows what the current directory separator character on the system is?
Different versions of Windows seem to behave differently (e.g. \ and / both work on the English versions, ¥ is apparently on the Japanese version, ₩ is apparently on the Korean version, etc.).
Is there any way to avoid hard-coding this, and instead ask Windows at run time?
Note:
Ideally, the solution should not depend on a high-level DLL like ShlWAPI.dll, because lower-level libraries also depend on this. So it should really either depend on kernel32.dll or ntdll.dll or the like... although I'm having trouble finding anything at all, whether at a high level or at a low level.
Edit:
A little experimentation told me that it's the Win32 subsystem (i.e. kernel32.dll... or is it perhaps RtlDosPathNameToNtPathName_U in ntdll.dll? not sure, didn't test...) which converts forward slashes to backslashes, not the kernel. (Prefixing \\?\ makes it impossible to use forward slashes later in the path -- and the NT native user-mode API also fails with forward slashes.)
So apparently it's not quite "built into" Windows, but rather just a compatibility feature -- which means you can't just blindly substitute slashes instead of backslashes, because any program which randomly prefixes \\?\ to paths will automatically break on forward slashes.
I have mixed feelings on what conclusions to make regarding this, but I just thought I'd mention it.
(I tagged this as "path separator" even though that's technically incorrect because the path separator is used for separating paths, not directories (; vs. \). Hopefully people get what I meant.)
While the ₩ and ¥ characters are shown as directory separator symbols in the respective Korean and Japanese Windows versions, they are only how those versions of Windows represent the same Unicode code point U+005C as a glyph. The underlying code point for backslash is still the same across English Windows and the Japanese and Korean Windows versions.
Extra confirmation for this can be found on this page: http://msdn.microsoft.com/en-us/library/dd374047(v=vs.85).aspx
Security Considerations for Character Sets in File Names
Windows code page and OEM character sets used on Japanese-language systems contain the Yen symbol (¥) instead of a backslash (\). Thus, the Yen character is a prohibited character for NTFS and FAT file systems. When mapping Unicode to a Japanese-language code page, conversion functions map both backslash (U+005C) and the normal Unicode Yen symbol (U+00A5) to this same character. For security reasons, your applications should not typically allow the character U+00A5 in a Unicode string that might be converted for use as a FAT file name.
Also, I don't know of any Windows API function that gets you the system's path separator, but you can rely on it being \ in all circumstances.
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx#naming_conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
...
Use a backslash (\) to separate the components of a path. The backslash divides the file name from the path to it, and one directory name from another directory name in a path. You cannot use a backslash in the name for the actual file or directory because it is a reserved character that separates the names into components.
...
About /
Windows should support the use of / as a directory separator in the API functions, though not necessarily in the command prompt (command.com).
Note: File I/O functions in the Windows API convert "/" to "\" as part of converting the name to an NT-style name, except when using the "\\?\" prefix as detailed in the following sections.
It's 'tough' to figure out the truth of all this, but this might be a really helpful link about / in Windows paths: http://bytes.com/topic/python/answers/23123-when-did-windows-start-accepting-forward-slash-path-separator
The original poster added the phrase "kernel-mode" in a comment to someone else's answer.
If the original question intended to ask about kernel mode, then it probably isn't a good idea to depend on / being a path separator. Different file systems allow different character sets on disk. Different file system drivers in Windows can also allow different character sets, which normally cannot include characters which the underlying file systems don't accept on disk, but sometimes they can behave strangely. For example Posix mode allows a component name to contain some characters in a path name in an NTFS partition, even though NTFS ordinarily doesn't allow those characters. (But obviously / isn't one of them, in Posix.)
In kernel mode in Unicode, U+005C is always a backslash and it is always the path separator. Unicode code points for yen and won are not U+005C and are not path separators.
In kernel mode in ANSI, complications arise depending on which ANSI code page. In code pages that are sufficiently similar to ASCII, 0x5C is a backslash and it is the path separator. In ANSI code pages 932 and 949, 0x5C is not a backslash but 0x5C might be a path separator depending on where it occurs. If 0x5C is a single-byte character on its own, then it's a yen sign or won sign and it is a path separator. If 0x5C is the second byte of a multibyte character, then it's not a character by itself, so it's not a yen sign or won sign and it's not a path separator. You have to start parsing from the beginning of the string to figure out if a particular char is actually a whole character or not. Also in Chinese and UTF-8, multibyte characters can be longer than two chars.
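The "parse from the beginning" rule can be illustrated with a short Python sketch for code page 932. The lead-byte ranges used here (0x81-0x9F and 0xE0-0xFC) are stated as an assumption about that code page, and the function names are made up. In real Win32 code you would more likely rely on `IsDBCSLeadByteEx` or `CharNextExA` than hand-roll the ranges.

```python
def is_cp932_lead_byte(b: int) -> bool:
    """Assumed lead-byte ranges for code page 932 (Windows Shift-JIS)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def separator_offsets(path: bytes) -> list:
    """Return offsets of 0x5C bytes that are whole characters, i.e. real separators."""
    offsets = []
    i = 0
    while i < len(path):
        if is_cp932_lead_byte(path[i]):
            i += 2  # skip the lead byte and its trail byte, even if the trail is 0x5C
            continue
        if path[i] == 0x5C:
            offsets.append(i)
        i += 1
    return offsets

# U+8868 encodes as 95 5C in code page 932, so its second byte looks like a separator.
sample = "C:\\\u8868\\a".encode("cp932")
print(sample, separator_offsets(sample))
```

A naive `path.split(b"\\")` on the same bytes would split inside the two-byte character and corrupt the directory name.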
The standard forward slash (/) has always worked in all versions of DOS and Windows. If you use it, you don't have to worry about issues with how the backslash is displayed on Japanese and Korean versions of Windows, and you also don't have to special-case the path separator for Windows as opposed to POSIX (including Mac). Just use forward slash everywhere.
