Is there a Maximum Length for userPrincipalName in Active Directory? - active-directory

I am writing an application that is linked to Active Directory, and I need to store the userPrincipalName in a database table, but I do not know how big the field would need to be.
On MSDN, no length is given, and neither is one given in RFC 822. Now, before I revert to DOMAIN\Username, which has a defined length (sAMAccountName is at most 20 chars, NetBIOS domain name is max. 15 chars), I wonder if anyone knows what the limit is, either by standard or by the implementation within both Windows 2003 and Windows 2008 domains.

On Win2k3 SP2 the longest userPrincipalName it allows me to create is 1013 characters long.

While trying to answer this question for myself today I stumbled across a documented answer.
2.381 Attribute userPrincipalName defines the userPrincipalName in the following way:
cn: User-Principal-Name
ldapDisplayName: userPrincipalName
attributeId: 1.2.840.113556.1.4.656
attributeSyntax: 2.5.5.12
omSyntax: 64
isSingleValued: TRUE
schemaIdGuid: 28630ebb-41d5-11d1-a9c1-0000f80367c1
systemOnly: FALSE
searchFlags: fATTINDEX
rangeUpper: 1024
attributeSecurityGuid: e48d0154-bcf8-11d1-8702-00c04fb96050
isMemberOfPartialAttributeSet: TRUE
systemFlags: FLAG_SCHEMA_BASE_OBJECT | FLAG_ATTR_REQ_PARTIAL_SET_MEMBER
This specifies the maximum length as 1024 according to the rangeUpper attribute definition.
In this context 1024 means octets (bytes), as opposed to Unicode code points, as omSyntax: 64 is defined as String(Unicode) in LDAP Representations and references RFC 2252 LDAPv3 Attributes 6.10 Directory String which describes it as the following:
6.10. Directory String
( 1.3.6.1.4.1.1466.115.121.1.15 DESC 'Directory String' )
A string in this syntax is encoded in the UTF-8 form of ISO 10646 (a
superset of Unicode). Servers and clients MUST be prepared to
receive encodings of arbitrary Unicode characters, including
characters not presently assigned to any character set.
Because UTF-8 is a variable-length encoding, the maximum string length is however many code points you can UTF-8 encode into 1024 octets (bytes). For purely ASCII strings that is 1024 code points; for anything containing non-ASCII characters it is fewer than 1024 code points.
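To make the octet-versus-code-point distinction concrete, here is a minimal Python sketch. It follows the reading given in this answer, namely that rangeUpper counts UTF-8 octets; that reading is an assumption here, and the helper names (utf8_octets, upn_fits) and the example.com suffix are made up for illustration.

```python
# Assumption taken from the answer above: the 1024 limit counts UTF-8 octets.
UPN_MAX_OCTETS = 1024


def utf8_octets(value: str) -> int:
    """Return the number of octets needed to UTF-8 encode value."""
    return len(value.encode("utf-8"))


def upn_fits(upn: str, limit: int = UPN_MAX_OCTETS) -> bool:
    """True if the UTF-8 encoding of upn is no longer than limit octets."""
    return utf8_octets(upn) <= limit


if __name__ == "__main__":
    ascii_upn = "a" * 1012 + "@example.com"   # 1024 code points, 1024 octets
    umlaut_upn = "ä" * 1012 + "@example.com"  # 1024 code points, 2036 octets
    for upn in (ascii_upn, umlaut_upn):
        print(len(upn), utf8_octets(upn), upn_fits(upn))
```

For sizing a database column, the practical consequence is that the column has to be sized according to how the database itself counts length (characters or bytes), which varies between products.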

No, it is not possible to change the maximum length of the logon name in Active Directory.

Is there a query in Snowflake to identify the characters in a file that are invalid UTF-8?

I have a file that fails to load into Snowflake with an error about invalid UTF-8 characters. I managed to load it into a table using another encoding, by creating a file format with the option ENCODING = 'iso-8859-1', but I would like to find a way to query for those characters.
I tried the TO_BINARY(col,'UTF-8') function, hoping it would fail on the column that has invalid UTF-8, but I could not get a usable result that captures those characters. Has anyone faced the same issue?
Please note that ALL character data within Snowflake is encoded using UTF-8. There is no other option. A while back, this was not strictly true, and it was possible to have character data in Snowflake that was NOT valid UTF-8. But that should not be possible now.
Specifying the ENCODING = 'iso-8859-1' option instructed Snowflake (during the COPY INTO operation) to perform character set translation on the file (which was then interpreted as being encoded in ISO-8859-1), mapping all characters into their UTF-8 equivalent as it was written into Snowflake. As a result, all data in Snowflake is UTF-8 encoded, and therefore there should not be ANY non-UTF-8 characters to discover. That said, the result of character set translation might not end up translating to the correct/expected UTF-8 characters if the underlying (source) file was not truly encoded with the encoding that you specified during the COPY INTO (in this case, ISO-8859-1).
Given this, what is the ultimate problem that you are trying to resolve here? Did you load a source file with ENCODING = 'iso-8859-1' that was not actually ISO-8859-1? Or are you saying that the source file WAS truly encoded as ISO-8859-1, and yet somehow the resulting characters in Snowflake are either (1) incorrect or (2) invalid UTF-8? Or are you trying to determine the actual encoding of a source file (ignoring the whole ISO-8859-1 aspect altogether)?
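If the goal is the last of these, the check has to happen on the source file rather than in Snowflake, since anything already loaded is valid UTF-8. The following is a minimal Python sketch of such a check, not a Snowflake query; the function name find_invalid_utf8 and the sample bytes are illustrative assumptions.

```python
from typing import Iterator, Tuple


def find_invalid_utf8(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, byte) for each byte that cannot be decoded as UTF-8."""
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything from the current position onward.
            data[pos:].decode("utf-8")
            return  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            bad = pos + err.start  # absolute offset of the offending byte
            yield bad, data[bad:bad + 1]
            pos = bad + 1  # skip the bad byte and keep scanning


if __name__ == "__main__":
    # "café" encoded as ISO-8859-1: the byte 0xE9 is not valid UTF-8 here.
    sample = "café ok".encode("iso-8859-1")
    for offset, byte in find_invalid_utf8(sample):
        print(offset, byte)
```

Re-slicing the remaining data on every error is simple but slow for files with many bad bytes; for a real file, read it with open(path, "rb") and consider scanning it in chunks.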
A detailed answer can be found here: How to find rows with non utf8 characters in Snowflake?
Please mark my question as a duplicate and refer to that link.

Word2vec C++ training on German wikipedia

I am using the C version of word2vec (as found at https://code.google.com/archive/p/word2vec/) and training it on a filtered dump of the German version of Wikipedia (~17 GB raw text, ~1.4 B words). I am using the following settings:
-cbow 1 -size 300 -window 5 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 -min-count 1000
The resulting output file contains ~50k words, however none of them contain the letters ä, ö, ü or ß. I verified that word2vec can handle them by making a small corpus containing words with those letters and they appeared in the output.
What could be causing the words containing these characters to not appear in the output file? Is it somehow related to the large size of the corpus or any of the settings that I am using?
It should not be related to the size of the corpus. I've trained a German model (see link below) with similar settings on a Wikipedia dump and German news articles (600k words in the vocabulary) and generated word vectors for words with German umlauts as well.
Things you can do:
Check that both your corpus file and your training environment use UTF-8 as the character encoding
Avoid the problem by converting the umlauts to their respective two-letter replacements (ä → ae, ß → ss etc.) during preprocessing
Check out this project where word2vec was applied to a German corpus (but with gensim using the C implementation)
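The second suggestion can be sketched in a few lines of Python. This is only an illustration of the preprocessing idea, not code from the linked project; the name transliterate and the exact mapping table are assumptions.

```python
import unicodedata

# Replacement table for German umlauts and sharp s (an illustrative choice).
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate(text: str) -> str:
    """Replace umlauts and sharp s with ASCII letter pairs."""
    # Normalize first so that 'a' + combining diaeresis is treated like 'ä'.
    return unicodedata.normalize("NFC", text).translate(UMLAUT_MAP)


if __name__ == "__main__":
    print(transliterate("Größe über Äpfel"))
```

When applying this to a corpus file, open it with an explicit encoding="utf-8" so that a wrongly encoded file fails loudly instead of silently producing garbage tokens.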

Special characters in Google App Engine task queue task names

Are dashes (-) or any other special characters allowed in task names in task queues?
Yes, you can use dashes. You can also have letters (any case) or numbers. The name must be between 1 and 500 characters.
This is all from looking at the source code, which specifies that task names must match this regular expression:
MAX_TASK_NAME_LENGTH = 500
r'^[a-zA-Z0-9-]{1,%s}$' % MAX_TASK_NAME_LENGTH
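As a rough sketch, the quoted pattern can be turned into a small standalone check. The function name is_valid_task_name is hypothetical and not part of the App Engine SDK; only the constant and the pattern come from the snippet above.

```python
import re

MAX_TASK_NAME_LENGTH = 500
# Same pattern as quoted above, compiled once.
TASK_NAME_RE = re.compile(r'^[a-zA-Z0-9-]{1,%s}$' % MAX_TASK_NAME_LENGTH)


def is_valid_task_name(name: str) -> bool:
    """True if name satisfies the quoted task-name pattern."""
    # fullmatch is used because '$' alone would also accept a trailing newline.
    return TASK_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("my-task-01", "my_task", "", "a" * 501):
        print(repr(candidate[:20]), is_valid_task_name(candidate))
```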

What characters are allowed in ClearCase activity name?

I want to write a script for an internal issue tracking system, integrated with ClearCase, that checks an activity name (typed by the user) for illegal characters. Unfortunately, I can't find a list of the characters allowed by ClearCase. Does anybody know where to get it?
UPD: I'm looking for a link to a document, that specifies the allowed characters (or says that all characters are allowed).
Regarding mkactivity (the command used to create an activity):
there is no special limitation for the activity headline
the activity name follows the same limitations as any other ClearCase object name (see below):
cmd-context mkactivity -headline "Create directories" create_directories
Created activity "create_directories".
Set activity "create_directories" in view "webo_integ".
The cleartool man page about arguments in cleartool command is clear:
In object-creation commands, you must compose the object name according to these rules:
It must contain only letters, digits, and the special characters underscore (_), period (.), and hyphen (-).
A hyphen cannot be used as the first character of a name.
It must not be an integer; this restriction includes octal and hexadecimal integer values. However, noninteger names are allowed.
It must not be one of the special names ".", "..", or "...".
cleartool supports object names of up to 1024 bytes in length, although Windows imposes a limit of 260 bytes on object names.
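The rules quoted above could be checked in a script roughly as follows. This is a sketch under stated assumptions, not an official ClearCase validator: "letters" is taken to mean ASCII letters, "integer" is taken to mean a run of decimal digits (which also covers octal) or a 0x-prefixed hexadecimal number, and the function name is made up.

```python
import re

# Assumption: "letters" means ASCII letters only.
ALLOWED_RE = re.compile(r"[A-Za-z0-9_.-]+")
# Assumption: an "integer" is all digits (decimal or octal) or 0x-prefixed hex.
INTEGER_RE = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")
SPECIAL_NAMES = {".", "..", "..."}
MAX_NAME_BYTES = 1024  # the cleartool limit quoted above


def is_valid_object_name(name: str) -> bool:
    """Check name against the quoted cleartool object-naming rules."""
    if ALLOWED_RE.fullmatch(name) is None:
        return False  # empty, or contains a character outside the allowed set
    if name.startswith("-"):
        return False  # a hyphen cannot be the first character
    if INTEGER_RE.fullmatch(name) is not None:
        return False  # must not be an integer
    if name in SPECIAL_NAMES:
        return False
    return len(name.encode("utf-8")) <= MAX_NAME_BYTES


if __name__ == "__main__":
    for candidate in ("create_directories", "-fix", "123", "0x1F", "fix-123"):
        print(candidate, is_valid_object_name(candidate))
```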

What are reserved filenames for various platforms?

I'm not asking about general syntactic rules for file names. I mean gotchas that jump out of nowhere and bite you. For example, trying to name a file "COM<n>" on Windows?
From: http://www.grouplogic.com/knowledge/index.cfm/fuseaction/view_Info/docID/111.
The following characters are invalid as file or folder names on Windows using NTFS: / ? < > \ : * | " and any character you can type with the Ctrl key.
In addition to the above illegal characters the caret ^ is also not permitted under Windows Operating Systems using the FAT file system.
Under Windows using the FAT file system file and folder names may be up to 255 characters long.
Under Windows using the NTFS file system file and folder names may be up to 255 characters long.
Under Windows the length of a full path under both systems is limited to 260 characters.
In addition to these characters, the following conventions are also illegal:
Placing a space at the end of the name
Placing a period at the end of the name
The following file names are also reserved under Windows:
aux,
com1,
com2,
...
com9,
lpt1,
lpt2,
...
lpt9,
con,
nul,
prn
Full description of legal and illegal filenames on Windows: http://msdn.microsoft.com/en-us/library/aa365247.aspx
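A rough Python sketch of the Windows rules listed in this answer is shown below. It is not exhaustive and the function name windows_name_problems is made up; treating a reserved name followed by an extension (such as NUL.TXT) as a problem is based on the Microsoft documentation linked above.

```python
from typing import List

# Reserved device names; compared case-insensitively.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)
INVALID_CHARS = set('/?<>\\:*|"')


def windows_name_problems(name: str) -> List[str]:
    """Return a list of reasons why name is a bad Windows file name."""
    problems = []
    # Control characters (code points below 32) are rejected as well.
    bad = sorted({c for c in name if c in INVALID_CHARS or ord(c) < 32})
    if bad:
        problems.append("invalid characters: " + " ".join(repr(c) for c in bad))
    if name.endswith(" ") or name.endswith("."):
        problems.append("ends with a space or period")
    # The part before the first period decides whether the name is reserved.
    stem = name.split(".", 1)[0].upper()
    if stem in RESERVED_NAMES:
        problems.append("reserved device name: " + stem)
    return problems


if __name__ == "__main__":
    for candidate in ("report.txt", "NUL.TXT", "com1", "a:b", "name."):
        print(candidate, windows_name_problems(candidate))
```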
A tricky Unix gotcha if you are not aware of it:
Files which start with - or -- are legal but a pain in the butt to work with, as many command line tools think you are providing options to them.
Many of those tools have a special marker "--" to signal the end of the options:
gzip -9vf -- -mydashedfilename
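The same convention is honored by many argument parsers, not only by C tools. As a small illustration, Python's argparse treats everything after "--" as positional arguments; the parser below, including its option and argument names, is a made-up example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with one flag and a list of file names."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("files", nargs="+")
    return parser


if __name__ == "__main__":
    # Everything after "--" is taken as a file name, even with a leading dash.
    args = build_parser().parse_args(["-v", "--", "-mydashedfilename"])
    print(args.verbose, args.files)
```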
As others have said, device names like COM1 are not possible as filenames under Windows because they are reserved devices.
However, there is an escape method to create and access files with these reserved names, for example, this command will redirect the output of the ver command into a file called COM1:
ver > "\\?\C:\Users\username\COM1"
Now you will have a file called COM1 that 99% of programs won't be able to open, and that will probably make them freeze if they try to access it.
Here's the Microsoft article that explains how this "file namespace" works. Basically it tells Windows not to do any string processing on the text and to pass it straight through to the filesystem. This trick can also be used to work with paths longer than 260 characters.
The boost::filesystem Portability Guide has a lot of good info.
Well, for MSDOS/Windows, NUL, PRN, LPT<n> and CON. They even cause problems if used with an extension: "NUL.TXT"
Unless you're touching special directories, the only illegal names on Linux are '.' and '..'. Any other name is possible, although accessing some of them from the shell requires using escape sequences.
EDIT: As Vinko Vrsalovic said, files starting with '-' and '--' are a pain from the shell, since those character sequences are interpreted by the application, not the shell.
