Why do disk UUIDs come in several different forms?

Here are some of my hard disk UUIDs:
/dev/disk/by-uuid/B0EA1359EA131B64 -> ../../sda1
/dev/disk/by-uuid/C814-23EC -> ../../sda2
/dev/disk/by-uuid/6A1016BD10169065 -> ../../sda4
/dev/disk/by-uuid/e1e5f4ec-094f-4cf1-b0a6-4ed7a0498b25 -> ../../sda5
/dev/disk/by-uuid/86e26b30-af5a-47a9-a57d-cddf785e1979 -> ../../sda7
/dev/disk/by-uuid/5DC0908E72D3BB2A -> ../../sdb2
/dev/disk/by-uuid/760630a7-223f-42e4-aecf-de92e32f12b9 -> ../../sdb3
These forms are semi-regular. Some have 16 uppercase characters, some have 32 lowercase characters plus four hyphens, and one has the form 4-4.
I can't see any rhyme or reason as to why some have one form and some another. What is the scheme which decides this?

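It depends on the filesystem on the partition. An identifier in 4-4 format is normally FAT, the 16-character form is NTFS, and all other filesystems should use the standard UUID format. Strictly speaking, the FAT and NTFS values are volume serial numbers (32 and 64 bits respectively) rather than true UUIDs, which is why they look different.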
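To make the rule of thumb concrete, here is a minimal Python sketch that guesses the filesystem family from the shape of the identifier alone. The function name and the returned labels are made up for illustration, and the result is only a heuristic based on the three forms listed in the question, not a replacement for asking a tool such as blkid.

```python
import re

# Shapes seen under /dev/disk/by-uuid in the question (heuristic only).
FAT_SERIAL = re.compile(r"[0-9A-F]{4}-[0-9A-F]{4}", re.IGNORECASE)
NTFS_SERIAL = re.compile(r"[0-9A-F]{16}", re.IGNORECASE)
STANDARD_UUID = re.compile(
    r"[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
    re.IGNORECASE,
)


def guess_filesystem_family(identifier: str) -> str:
    """Guess the filesystem family from the form of a by-uuid name."""
    if FAT_SERIAL.fullmatch(identifier):
        return "FAT"            # 32-bit volume serial, written as 4-4
    if NTFS_SERIAL.fullmatch(identifier):
        return "NTFS"           # 64-bit volume serial, 16 hex digits
    if STANDARD_UUID.fullmatch(identifier):
        return "standard UUID"  # 128-bit UUID, e.g. ext4, swap, XFS
    return "unknown"


if __name__ == "__main__":
    for name in (
        "B0EA1359EA131B64",
        "C814-23EC",
        "e1e5f4ec-094f-4cf1-b0a6-4ed7a0498b25",
    ):
        print(name, "->", guess_filesystem_family(name))
```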

Related

Filesystems and upper/lower case characters in filenames

I need to make the names of a lot of files in the same directory as compact as possible, so I would like to use digits and letters in these filenames.
0.png
1.png
..
9.png
A.png
B.png
..
00.png
..
It would be better if I could use both upper- and lower-case characters in these filenames, so I wonder whether all current filesystems support that. This concerns workstations, laptops, tablets, smartphones, etc., with every OS.
Thanks

Is there a tool to search differences in hex files?

I have two hex files, and want to search for values that have changed from 9C in one file to 9D in the next one.
Even a tool that synchronizes the windows would work, as I can't seem to find any that enable me to do this. All I've found are tools that let me search for the same value and same address between two files.
Thanks!
"There's no such thing as a hex file."
Oh well perhaps I should delete all my hex files.
I use Motorola and Intel formats of hex files (they are just text files representing bytes, but include a record structure for addresses, checksums, etc. per line). These files need to be treated as binary (in .gitattributes) since they cannot be merged. I use KDiff3 to see differences (or any other text diff tool).
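If no existing tool fits, the specific search from the question is short enough to script. The following is a rough Python sketch, assuming both files are raw binary dumps of the same layout (not Intel HEX or Motorola S-record text, which would have to be converted to bytes first). The function names and the command-line handling are hypothetical.

```python
import sys
from pathlib import Path


def find_byte_changes(old: bytes, new: bytes,
                      before: int = 0x9C, after: int = 0x9D) -> list[int]:
    """Return offsets where `old` holds `before` and `new` holds `after`.

    Only the overlapping part of the two buffers is compared.
    """
    return [
        offset
        for offset, (a, b) in enumerate(zip(old, new))
        if a == before and b == after
    ]


def find_changes_in_files(path_old: str, path_new: str) -> list[int]:
    """Convenience wrapper that reads both files fully into memory."""
    return find_byte_changes(Path(path_old).read_bytes(),
                             Path(path_new).read_bytes())


if __name__ == "__main__" and len(sys.argv) == 3:
    for offset in find_changes_in_files(sys.argv[1], sys.argv[2]):
        print(f"0x{offset:08X}: 9C -> 9D")
```

Run it as `python script.py old.bin new.bin` (file names are placeholders) to list every address where the value went from 9C to 9D.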

Word2vec C++ training on German wikipedia

I am using the C version of word2vec (as found at https://code.google.com/archive/p/word2vec/) and training it on a filtered dump of the German version of Wikipedia (~17 GB raw text, ~1.4 B words). I am using the following settings:
-cbow 1 -size 300 -window 5 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15 -min-count 1000
The resulting output file contains ~50k words, but none of them contain the letters ä, ö, ü or ß. I verified that word2vec can handle them by making a small corpus containing words with those letters, and they appeared in the output.
What could be causing the words containing these characters to be missing from the output file? Is it somehow related to the large size of the corpus or to any of the settings that I am using?
It should not be related to the size of the corpus. I've trained a German model (see link below) with similar settings on a Wikipedia dump and German news articles (600k words in the vocabulary) and generated word vectors for words with German umlauts as well.
Things you can do:
Check that the character encoding of your corpus file, as well as of your training environment, is UTF-8
Avoid this problem by converting the umlauts to their respective bigram tokens (ä → ae, ß → ss etc.) in the preprocessing
Check out this project where word2vec was applied on a German corpus (but with gensim using the C implementation)
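The first two suggestions can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names are made up, and the mapping covers just the German letters mentioned in the question. The text is normalised to NFC first so that an umlaut stored as a base letter plus a combining diaeresis is also caught.

```python
import unicodedata

# Replacement table for the letters from the question (ä, ö, ü, ß) plus capitals.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def transliterate_umlauts(text: str) -> str:
    """Replace German umlauts and ß with their ASCII digraphs."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(UMLAUT_MAP)


if __name__ == "__main__":
    print(transliterate_umlauts("Die Straße führt über Köln"))
    print(is_valid_utf8("Köln".encode("latin-1")))
```

A corpus that was saved as Latin-1 fails the UTF-8 check, which would be one plausible reason for umlaut words going missing.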

Weird directory entries in FAT file system

So I'm trying to figure out how the FAT FS works and got confused by the root directory table. I have two files in the partition: test.txt and innit.eh which results in the following table:
The entries starting with 0xE5 are deleted, so I assume these were created due to renaming. The entries for the actual files look like this:
TEST TXT *snip*
INNIT EH *snip*
What I don't understand is where the entries like
At.e.s.t......t.x.t
Ai.n.n.i.t.....e.h.
are coming from and what are they for. They do not start with 0xE5, so should be treated as existing files.
By the way, I'm using Debian Linux to create filesystems and files, but I noticed similar behaviour on FS and files created on Windows.
The ASCII parts of the name (where the letters are close to each other) are the legacy 8.3 DOS short name. Note that it uses only capital letters. In DOS, only these entries would be there.
The longer parts (with 0x00 in between) are the long name (shown in Windows), which is Unicode and uses 16 bits per character.
The intervening bytes are all 0x00, which strongly suggests that the names are stored in UTF-16 rather than UTF-8. Perhaps they are there as an extension, similar to the other VFAT extensions for long filenames?
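To show how such an entry is laid out, here is a small Python sketch that decodes one 32-byte VFAT long-filename entry. It assumes the usual layout (sequence byte at offset 0, attribute 0x0F at offset 11, short-name checksum at offset 13, and 13 UTF-16LE code units split over offsets 1-10, 14-25 and 28-31). The function names and the hand-built example entry are mine, not taken from the dump in the question. The leading "A" in "At.e.s.t" is the sequence byte 0x41, i.e. entry number 1 with the "last entry" flag 0x40 set.

```python
import struct


def lfn_checksum(short_name: bytes) -> int:
    """Checksum of the 11-byte 8.3 name that each long-name entry carries."""
    total = 0
    for byte in short_name:
        # Rotate right by one bit, then add the next byte (8-bit arithmetic).
        total = ((((total & 1) << 7) | (total >> 1)) + byte) & 0xFF
    return total


def decode_lfn_entry(entry: bytes) -> dict:
    """Decode a single 32-byte long-filename directory entry."""
    if len(entry) != 32 or entry[11] != 0x0F:
        raise ValueError("not a long-filename directory entry")
    raw = entry[1:11] + entry[14:26] + entry[28:32]   # 13 UTF-16LE units
    units = []
    for unit in struct.unpack("<13H", raw):
        if unit in (0x0000, 0xFFFF):                   # terminator / padding
            break
        units.append(unit)
    name = struct.pack(f"<{len(units)}H", *units).decode("utf-16-le")
    return {
        "sequence": entry[0] & 0x1F,
        "is_last": bool(entry[0] & 0x40),
        "checksum": entry[13],
        "name_part": name,
    }


def make_example_entry() -> bytes:
    """Build an entry for 'test.txt' by hand (illustrative data only)."""
    raw = "test.txt".encode("utf-16-le") + b"\x00\x00" + b"\xff\xff" * 4
    checksum = lfn_checksum(b"TEST    TXT")
    return (bytes([0x41]) + raw[0:10] + bytes([0x0F, 0x00, checksum])
            + raw[10:22] + b"\x00\x00" + raw[22:26])


if __name__ == "__main__":
    print(decode_lfn_entry(make_example_entry()))
```

A reader that does not understand long names sees attribute 0x0F (read-only, hidden, system and volume label all at once) and skips the entry, which is why these entries are not treated as ordinary files even though they do not start with 0xE5.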

Is there a Maximum Length for userPrincipalName in Active Directory?

I am writing an application that is linked to Active Directory, and I need to store the userPrincipalName in a database table, but I do not know how big the field would need to be.
MSDN gives no length, and neither does RFC 822. Now, before I revert to DOMAIN\Username, which has a defined length (sAMAccountName is less than 20 chars, the NetBIOS domain name is max. 15 chars), I wonder if anyone knows what the limit is, either by standard or by the implementation within both Windows 2003 and Windows 2008 domains.
On Win2k3 SP2, the longest userPrincipalName I can create is 1013 characters long.
While trying to answer this question for myself today I stumbled across a documented answer.
2.381 Attribute userPrincipalName defines the userPrincipalName in the following way:
cn: User-Principal-Name
ldapDisplayName: userPrincipalName
attributeId: 1.2.840.113556.1.4.656
attributeSyntax: 2.5.5.12
omSyntax: 64
isSingleValued: TRUE
schemaIdGuid: 28630ebb-41d5-11d1-a9c1-0000f80367c1
systemOnly: FALSE
searchFlags: fATTINDEX
rangeUpper: 1024
attributeSecurityGuid: e48d0154-bcf8-11d1-8702-00c04fb96050
isMemberOfPartialAttributeSet: TRUE
systemFlags: FLAG_SCHEMA_BASE_OBJECT | FLAG_ATTR_REQ_PARTIAL_SET_MEMBER
This specifies the maximum length as 1024 according to the rangeUpper attribute definition.
In this context 1024 means octets (bytes), as opposed to Unicode code points, as omSyntax: 64 is defined as String(Unicode) in LDAP Representations and references RFC 2252 LDAPv3 Attributes 6.10 Directory String which describes it as the following:
6.10. Directory String
( 1.3.6.1.4.1.1466.115.121.1.15 DESC 'Directory String' )
A string in this syntax is encoded in the UTF-8 form of ISO 10646 (a
superset of Unicode). Servers and clients MUST be prepared to
receive encodings of arbitrary Unicode characters, including
characters not presently assigned to any character set.
With UTF-8 being a variable-length encoding, this means that the maximum string length is however many code points you can UTF-8 encode into 1024 octets (bytes). For purely ASCII strings that is 1024 code points; for anything with non-ASCII characters it is fewer than 1024 code points.
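The difference between counting code points and counting octets is easy to check in Python. This sketch takes the answer's reading at face value, namely that the 1024 limit is measured in UTF-8 octets. The function names and the default limit parameter are only for illustration.

```python
def utf8_octets(value: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def fits_in_limit(value: str, limit_octets: int = 1024) -> bool:
    """True if the UTF-8 encoding of `value` is within the octet limit."""
    return utf8_octets(value) <= limit_octets


if __name__ == "__main__":
    ascii_upn = "a" * 1024      # 1024 code points, 1 octet each
    umlaut_upn = "ü" * 513      # 513 code points, 2 octets each
    print(len(ascii_upn), utf8_octets(ascii_upn), fits_in_limit(ascii_upn))
    print(len(umlaut_upn), utf8_octets(umlaut_upn), fits_in_limit(umlaut_upn))
```

For a database column, this suggests sizing the field by the byte limit if the column is byte-counted, or by 1024 characters if it is character-counted, since no valid value can have more code points than octets.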
No, it is not possible to change the maximum length of the logon name in Active Directory.
