Why is the 17th digit of version 4 GUIDs limited to only 4 possibilities? - uuid

I understand that this doesn't take a significant chunk out of the entropy involved, and that even if a whole other character of the GUID were reserved (for any purpose), we would still have more than enough for every insect to have one, so I'm not worried, just curious.
As this great answer shows, the Version 4 algorithm for generating GUIDs has the following format:
xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx
x is random
4 is constant, this represents the version number.
y is one of: 8, 9, A, or B
The RFC spec for UUIDs says that these bits must be set this way, but I don't see any reason given.
Why is the third bullet (the 17th digit) limited to only those four digits?

Bits, not hex
Focusing on hexadecimal digits is confusing you.
A UUID is not made of hex. A UUID is made of 128 bits.
Humans would resent reading a series of 128 bits presented as a long string of 1 and 0 characters. So for the benefit of reading and writing by humans, we present the 128-bits in hex.
Always keep in mind that when you see the series of 36 hex characters with hyphens, you are not looking at a UUID. You are looking at some text generated to represent the 128 bits that are actually in the UUID.
Version & Variant
The first special meaning you mention, the “version” of UUID, is recorded using 4 bits. See section 4.1.3 of your linked spec.
The second special meaning you indicate is the “variant”. This value takes 1-3 bits. See section 4.1.1 of your linked spec.
A hex character represents 4 bits (half an octet).
The Version number, being 4 bits, takes an entire hex character to itself.
Version 4 specifically uses the bits 0100, which is 4 in hex, just as it is in decimal (base 10).
The Variant, being 1-3 bits, does not take an entire hex character.
Outside the Microsoft world of GUIDs, the rest of the industry nowadays uses two bits, 10 (decimal value 2), as the variant. This pair of bits lands in the most significant bits of octet # 8. That octet looks like this, where ‘n’ means 0 or 1: 10 nn nn nn. One hex character represents each half of that octet. So your 17th hex digit, the first half of that 8th octet, 10 nn, can only have four possible values (a short sketch after this list makes the arithmetic concrete):
10 00 (hex 8)
10 01 (hex 9)
10 10 (hex A)
10 11 (hex B)
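To make the nibble arithmetic concrete, here is a minimal C sketch (my own illustration, not from the linked answer) of how a version-4 generator stamps the version and variant onto 16 random bytes; rand() stands in for a proper random source:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    uint8_t uuid[16];
    for (int i = 0; i < 16; i++)
        uuid[i] = (uint8_t)(rand() & 0xFF); /* stand-in for a real CSPRNG */

    /* Octet 6 (hex digits 13-14): top nibble becomes the version, 0100 = 4. */
    uuid[6] = (uuid[6] & 0x0F) | 0x40;

    /* Octet 8 (hex digits 17-18): top two bits become the variant, 10. */
    uuid[8] = (uuid[8] & 0x3F) | 0x80;

    /* The 17th hex digit is the top nibble of octet 8: 10nn = 8, 9, A, or B. */
    printf("13th hex digit (version) = %X\n", uuid[6] >> 4); /* always 4 */
    printf("17th hex digit (variant) = %X\n", uuid[8] >> 4); /* 8, 9, A, or B */
    return 0;
}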

Quoting the estimable Mr. Lippert
First off, what bits are we talking about when we say “the bits”? We already know that in a “random” GUID the first hex digit of the third section is always 4....there is additional version information stored in the GUID in the bits in the fourth section as well; you’ll note that a GUID almost always has 8, 9, a or b as the first hex digit of the fourth section. So in total we have six bits reserved for version information, leaving 122 bits that can be chosen at random.
(from https://ericlippert.com/2012/05/07/guid-guide-part-three/)
tl;dr - it's more version information. To get more specific than that, I suspect you're going to have to track down the author of the spec.

Based upon what I've tried to learn today, I've attempted to put together a C#/.NET LINQPad snippet to break down (some small part of) the GUID/UUID, in case it helps:
void Main()
{
    var guid =
        Guid.Parse(
            //"08c8fbdc-ff38-402e-b0fd-353392a407af" // v4 - Microsoft/.NET
            "7c2a81c7-37ce-4bae-ba7d-11123200d59a" // v4
            //"493f6528-d76a-11ec-9d64-0242ac120002" // v1
            //"5f0ad0df-99d4-5b63-a267-f0f32cf4c2a2" // v5
        );
    Console.WriteLine($"UUID = '{guid}' :");
    Console.WriteLine();

    var guidBytes = guid.ToByteArray();

    // Version # - top nibble of the time_hi_and_version field.
    // (Guid.ToByteArray() stores the first three fields little-endian,
    // which is why that octet lands at index 7.)
    const int timeHiAndVersionOctetIdx = 7;
    var timeHiAndVersionOctet = guidBytes[timeHiAndVersionOctetIdx];
    var versionNum = (timeHiAndVersionOctet & 0b11110000) >> 4; // 0xF0

    // Variant # - 9th octet (index 8)
    const int clkSeqHiResOctetIdx = 8;
    var clkSeqHiResOctet = guidBytes[clkSeqHiResOctetIdx];
    var msVariantNum = (clkSeqHiResOctet & 0b11100000) >> 5; // 0xE0/3 bits
    var variantNum = (clkSeqHiResOctet & 0b11000000) >> 5;   // 0xC0/2 bits - 0x8/0x9/0xA/0xB

    Console.WriteLine(
        $"\tVariant # = '{variantNum}' ('0x{variantNum:X}') - '0b{((variantNum & 0b00000100) > 0 ? '1' : '0')}{((variantNum & 0b00000010) > 0 ? '1' : '0')}{((variantNum & 0b00000001) > 0 ? '1' : '0')}'");
    Console.WriteLine();

    if (variantNum < 4)
    {
        Console.WriteLine($"\t\t'0 x x' - \"Reserved, NCS backward compatibility\"");
    }
    else if (variantNum == 4 || variantNum == 5)
    {
        Console.WriteLine($"\t\t'1 0 x' - \"The variant specified in this {{RFC4122}} document\"");
    }
    else if (variantNum == 6)
    {
        Console.WriteLine($"\t\t'1 1 0' - \"Reserved, Microsoft Corporation backward compatibility\"");
    }
    else if (variantNum == 7)
    {
        Console.WriteLine($"\t\t'1 1 1' - \"Reserved for future definition\"");
    }
    Console.WriteLine();

    Console.WriteLine(
        $"\tVersion # = '{versionNum}' ('0x{versionNum:X}') - '0b{((versionNum & 0b00001000) > 0 ? '1' : '0')}{((versionNum & 0b00000100) > 0 ? '1' : '0')}{((versionNum & 0b00000010) > 0 ? '1' : '0')}{((versionNum & 0b00000001) > 0 ? '1' : '0')}'");
    Console.WriteLine();

    string[] versionDescriptions =
        new[]
        {
            "The time-based version specified in this {RFC4122} document",
            "DCE Security version, with embedded POSIX UIDs",
            "The name-based version specified in this {RFC4122} document that uses MD5 hashing",
            "The randomly or pseudo-randomly generated version specified in this {RFC4122} document",
            "The name-based version specified in this {RFC4122} document that uses SHA-1 hashing"
        };
    Console.WriteLine(
        $"\t\t'{versionNum}' = \"{versionDescriptions[versionNum - 1]}\"");
    Console.WriteLine();

    Console.WriteLine(
        "'RFC4122' document - <https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.1>");
    Console.WriteLine();
}

Related

Reading CAN message (PCAN-Router Pro FD)

Please, I have a problem with writing code that will read a CAN message, edit it (limit it to some maximum value) and then send it back with the same ID.
I'm using a PCAN-Router Pro FD and will show you their example of such a thing - basically the same as mine, but I have no idea what some of the numbers or operations are. [1]: https://i.stack.imgur.com/6ZDHn.jpg
My task is to: 1) Read a CAN message with these parameters (ID = 0x120, startbit 8, length 8 bit and factor 0.75)
2) Limit this value to 100 (because the message should carry info about coolant temperature.)
3) If the value was below 100, don't change anything. If it was higher, change it to 100.
Thanks for any help!
Original code:
// catch ID 180h and limit a signal to a maximum
else if (RxMsg.id == 0x180 && RxMsg.msgtype == CAN_MSGTYPE_STANDARD)
{
    uint32_t speed;

    // get the signal (intel format)
    speed = (RxMsg.data32[0] >> 12) & 0x1FFF;

    // limit value
    if (speed > 6200)
    {
        speed = 6200;
    }

    // replace the original value
    RxMsg.data32[0] &= ~(0x1FFF << 12);
    RxMsg.data32[0] |= speed << 12;
}
After consulting the matter in person, we have found the answer.
The structure type of RxMsg contains a union allowing the data to be accessed in 4-byte chunks (RxMsg.data32), 2-byte chunks (RxMsg.data16), or 1-byte chunks (RxMsg.data8). Since the temperature is located at the 8th bit and is 1 byte long, it can be accessed without using binary masks, bit shifts, or bitwise-logical-assignment operators at all (a rough sketch of such a union follows at the end of this answer).
// more if-else statements...
else if (RxMsg.id == 0x120 && RxMsg.msgtype == CAN_MSGTYPE_STANDARD)
{
    uint8_t temperature = RxMsg.data8[1];
    float factor = 0.75;

    if (temperature * factor > 100.0)
    {
        temperature = (int)(100 / factor);
    }

    RxMsg.data8[1] = temperature;
}
The answer assumes that the startbit is the most significant bit in the message buffer and that the temperature value must be scaled down by the mentioned factor. Should the startbit mean the least significant bit, the [1] index could just be swapped out for [62], as the message buffer contains 64 bytes in total.
The question author was not provided with a reference sheet for the data format, so the answer is based purely on the information mentioned in the question. The temperature scaling factor is yet to be tested (I will edit this after confirming it works).
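For reference, the kind of union the answer describes might be declared roughly like this; this is a hypothetical sketch, since the real type (and the CAN_MSGTYPE_* constants) come from the PCAN-Router Pro FD firmware headers:
#include <stdint.h>

/* Hypothetical sketch of a CAN FD message buffer with overlaid data views. */
typedef struct
{
    uint32_t id;      /* CAN identifier, e.g. 0x120 */
    uint32_t msgtype; /* e.g. CAN_MSGTYPE_STANDARD */
    union
    {
        uint32_t data32[16]; /* 64-byte payload viewed as 4-byte chunks */
        uint16_t data16[32]; /* ... as 2-byte chunks */
        uint8_t  data8[64];  /* ... as single bytes */
    }; /* anonymous union (C11), so RxMsg.data8[1] works directly */
} CANRxMsg;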

How many possible sequences of 16 symbols are there, given some restrictions?

How many possible sequences can be formed that obey the following rules:
Each sequence is formed from the symbols 0-9a-f.
Each sequence is exactly 16 symbols long.
0123456789abcdef ok
0123456789abcde XXX
0123456789abcdeff XXX
Symbols may be repeated, but no more than 4 times.
00abcdefabcdef00 ok
00abcde0abcdef00 XXX
A symbol may not appear three times in a row.
00abcdefabcdef12 ok
000bcdefabcdef12 XXX
There can be at most two pairs.
00abcdefabcdef11 ok
00abcde88edcba11 XXX
Also, how long would it take to generate all of them?
In combinatorics, counting is usually pretty straightforward, and can be accomplished much more rapidly than exhaustive generation of each alternative (or worse, exhaustive generation of a superset of possibilities, in order to filter them). One common technique is to reduce a given problem to combinations of a small(ish) number of disjoint sub-problems, where it is possible to see how many times each subproblem contributes to the total. This sort of analysis can often result in dynamic programming solutions, or, as below, in a memoised recursive solution.
Because combinatoric results are usually enormous numbers, brute-force generation of every possibility, even if it can be done extremely rapidly for each sequence, is impractical in all but the most trivial of cases. For this particular question, for example, I made a rough back-of-the-envelope estimate in a comment (since deleted):
There are 18446744073709551616 possible 64-bit (16 hex-digit) numbers, which is a very large number, about 18 billion billion. So if I could generate and test a billion of them per second, it would take me 18 billion seconds, or about 571 years. So with access to a cluster of 1000 96-core servers, I could do it all in about 54 hours, just a bit over two days. Amazon will sell me one 96-core server for just under a dollar an hour (spot prices), so a thousand of them for 54 hours would cost a little under 50,000 US dollars. Perhaps that's within reason. (But that's just to generate.)
Undoubtedly, the original question is part of an exploration of the possibility of trying every possible sequence by way of cracking a password, and it's not really necessary to produce a precise count of the number of possible passwords to demonstrate the impracticality of that approach (or its practicality for organizations with a budget sufficient to pay for the necessary computing resources). As the above estimate shows, a password with 64 bits of entropy is not really that secure if what it is protecting is sufficiently valuable. Take that into account when generating a password for things you treasure.
Still, it can be interesting to compute precise combinatoric counts, if for no reason other than the intellectual challenge.
The following is mostly a proof-of-concept; I wrote it in Python because Python offers some facilities which would have been time-consuming to reproduce and debug in C: hash tables with tuple keys and arbitrary-precision integer arithmetic. It could be rewritten in C (or, more easily, in C++), and the Python code could most certainly be improved, but given that it only takes 70 milliseconds to compute the count requested in the original question, the effort seems unnecessary.
This program carefully groups the possible sequences into different partitions and caches the results in a memoisation table. For the case of sequences of length 16, as in the OP, the cache ends up with 2540 entries, which means that the core computation is only done 2540 times:
# The basis of the categorization are symbol usage vectors, which count the
# number of symbols used (that is, present in a prefix of the sequence)
# `i` times, for `i` ranging from 1 to the maximum number of symbol uses
# (4 in the case of this question). I tried to generalise the code for different
# parameters (length of the sequence, number of distinct symbols, maximum
# use count, maximum number of pairs). Increasing any of these parameters will,
# of course, increase the number of cases that need to be checked and thus slow
# the program down, but it seems to work for some quite large values.

# Because constantly adjusting the index was driving me crazy, I ended up
# using 1-based indexing for the usage vectors; the element with index 0 always
# has the value 0. This creates several inefficiencies but the practical
# consequences are insignificant.

### Functions to manipulate usage vectors

def noprev(used, prevcnt):
    """Decrements the use count of the previous symbol"""
    return used[:prevcnt] + (used[prevcnt] - 1,) + used[prevcnt + 1:]

def bump1(used, count):
    """Registers that one symbol (with supplied count) is used once more."""
    return (used[:count]
            + (used[count] - 1, used[count + 1] + 1)
            + used[count + 2:]
            )

def bump2(used, count):
    """Registers that one symbol (with supplied count) is used twice more."""
    return (used[:count]
            + (used[count] - 1, used[count + 1], used[count + 2] + 1)
            + used[count + 3:]
            )

def add1(used):
    """Registers a new symbol used once."""
    return (0, used[1] + 1) + used[2:]

def add2(used):
    """Registers a new symbol used twice."""
    return (0, used[1], used[2] + 1) + used[3:]

def count(NSlots, NSyms, MaxUses, MaxPairs):
    """Counts the number of sequences of length NSlots over an alphabet
    of NSyms symbols where no symbol is used more than MaxUses times,
    no symbol appears three times in a row, and there are no more than
    MaxPairs pairs of symbols.
    """
    cache = {}

    # Canonical description of the problem, used as a cache key:
    #   pairs:   the number of pairs in the prefix
    #   prevcnt: the use count of the last symbol in the prefix
    #   used:    for i in [1, NSyms], the number of symbols used i times
    # Note: used[0] is always 0. This problem is naturally 1-based.
    def helper(pairs, prevcnt, used):
        key = (pairs, prevcnt, used)
        if key not in cache:
            # avail_slots: Number of remaining slots.
            avail_slots = NSlots - sum(i * count for i, count in enumerate(used))
            if avail_slots == 0:
                total = 1
            else:
                # avail_syms: Number of unused symbols.
                avail_syms = NSyms - sum(used)
                # We can't use the previous symbol (which means we need
                # to decrease the number of symbols with prevcnt uses).
                adjusted = noprev(used, prevcnt)[:-1]
                # First, add single repeats of already used symbols
                total = sum(count * helper(pairs, i + 1, bump1(used, i))
                            for i, count in enumerate(adjusted)
                            if count)
                # Then, a single instance of a new symbol
                if avail_syms:
                    total += avail_syms * helper(pairs, 1, add1(used))
                # If we can add pairs, add the already not-too-used symbols
                if pairs and avail_slots > 1:
                    total += sum(count * helper(pairs - 1, i + 2, bump2(used, i))
                                 for i, count in enumerate(adjusted[:-1])
                                 if count)
                    # And a pair of a new symbol
                    if avail_syms:
                        total += avail_syms * helper(pairs - 1, 2, add2(used))
            cache[key] = total
        return cache[key]

    rv = helper(MaxPairs, MaxUses, (0,) * (MaxUses + 1))
    # print("Cache size: ", len(cache))
    return rv

# From the command line, run this with the command:
#   python3 SLOTS SYMBOLS USE_MAX PAIR_MAX
# There are defaults for all four arguments.
if __name__ == "__main__":
    from sys import argv
    NSlots, NSyms, MaxUses, MaxPairs = 16, 16, 4, 2
    if len(argv) > 1: NSlots = int(argv[1])
    if len(argv) > 2: NSyms = int(argv[2])
    if len(argv) > 3: MaxUses = int(argv[3])
    if len(argv) > 4: MaxPairs = int(argv[4])
    print(NSlots, NSyms, MaxUses, MaxPairs,
          count(NSlots, NSyms, MaxUses, MaxPairs))
Here's the result of using this program to compute the count of valid sequences of every length from 1 to 65 (a sequence longer than 64 is impossible given the constraints), taking less than 11 seconds in total:
$ time for i in $(seq 1 65); do python3 -m count $i 16 4; done
1 16 4 2 16
2 16 4 2 256
3 16 4 2 4080
4 16 4 2 65040
5 16 4 2 1036800
6 16 4 2 16524000
7 16 4 2 263239200
8 16 4 2 4190907600
9 16 4 2 66663777600
10 16 4 2 1059231378240
11 16 4 2 16807277588640
12 16 4 2 266248909553760
13 16 4 2 4209520662285120
14 16 4 2 66404063202640800
15 16 4 2 1044790948722393600
16 16 4 2 16390235567479693920
17 16 4 2 256273126082439298560
18 16 4 2 3992239682632407024000
19 16 4 2 61937222586063601795200
20 16 4 2 956591119531904748877440
21 16 4 2 14701107045788393912922240
22 16 4 2 224710650516510785696509440
23 16 4 2 3414592455661342007436384000
24 16 4 2 51555824538229409502827923200
25 16 4 2 773058043102197617863741843200
26 16 4 2 11505435580713064249590793862400
27 16 4 2 169863574496121086821681298457600
28 16 4 2 2486228772352331019060452730124800
29 16 4 2 36053699633157440642183732148192000
30 16 4 2 517650511567565591598163978874476800
31 16 4 2 7353538304042081751756339918288153600
32 16 4 2 103277843408210067510518893242552998400
33 16 4 2 1432943471827935940003777587852746035200
34 16 4 2 19624658467616639408457675812975159808000
35 16 4 2 265060115658802288611235565334010714521600
36 16 4 2 3527358829586230228770473319879741669580800
37 16 4 2 46204536626522631728453996238126656113459200
38 16 4 2 595094456544732751483475986076977832633088000
39 16 4 2 7527596027223722410480884495557694054538752000
40 16 4 2 93402951052248340658328049006200193398898022400
41 16 4 2 1135325942092947647158944525526875233118233702400
42 16 4 2 13499233156243746249781875272736634831519281254400
43 16 4 2 156762894800798673690487714464110515978059412992000
44 16 4 2 1774908625866508837753023260462716016827409668608000
45 16 4 2 19556269668280714729769444926596793510048970792448000
46 16 4 2 209250137714454234944952304185555699000268936613376000
47 16 4 2 2169234173368534856955926000562793170629056490849280000
48 16 4 2 21730999613085754709596718971411286413365188258316288000
49 16 4 2 209756078324313353775088590268126891517374425535395840000
50 16 4 2 1944321975918071063760157244341119456021429461885104128000
51 16 4 2 17242033559634684233385212588199122289377881249323872256000
52 16 4 2 145634772367323301463634877324516598329621152347129008128000
53 16 4 2 1165639372591494145461717861856832014651221024450263064576000
54 16 4 2 8786993110693628054377356115257445564685015517718871715840000
55 16 4 2 61931677369820445021334706794916410630936084274106426433536000
56 16 4 2 404473662028342481432803610109490421866960104314699801413632000
57 16 4 2 2420518371006088374060249179329765722052271121139667645435904000
58 16 4 2 13083579933158945327317577444119759305888865127012932088217600000
59 16 4 2 62671365871027968962625027691561817997506140958876900738150400000
60 16 4 2 259105543035583039429766038662433668998456660566416258886520832000
61 16 4 2 889428267668414961089138119575550372014240808053275769482575872000
62 16 4 2 2382172342138755521077314116848435721862984634708789861244239872000
63 16 4 2 4437213293644311557816587990199342976125765663655136187709235200000
64 16 4 2 4325017367677880742663367673632369189388101830634256108595793920000
65 16 4 2 0
real 0m10.924s
user 0m10.538s
sys 0m0.388s
This program counts 16,390,235,567,479,693,920 passwords.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

enum { RLength = 16 }; // Required length of password.
enum { NChars = 16 };  // Number of characters in alphabet.

typedef struct
{
    /* N[i] counts how many instances of i are left to use, as constrained
       by rule 3.
    */
    unsigned N[NChars];

    /* NPairs counts how many more pairs are allowed, as constrained by
       rule 5.
    */
    unsigned NPairs;

    /* Used counts how many characters have been distinguished by choosing
       them as a representative. Symmetry remains unbroken for NChars - Used
       characters.
    */
    unsigned Used;
} Supply;

/* Count the number of passwords that can be formed starting with a string
   (in String) of length Length, with state S.
*/
static uint64_t Count(int Length, Supply *S, char *String)
{
    /* If we filled the string, we have one password that obeys the rules.
       Return that. Otherwise, consider suffixing more characters.
    */
    if (Length == RLength)
        return 1;

    // Initialize a count of the number of passwords beginning with String.
    uint64_t C = 0;

    // Consider suffixing each character distinguished so far.
    for (unsigned Char = 0; Char < S->Used; ++Char)
    {
        /* If it would violate rule 3, limiting how many times the character
           is used, do not suffix this character.
        */
        if (S->N[Char] == 0) continue;

        // Does the new character form a pair with the previous character?
        unsigned IsPair = String[Length-1] == Char;
        if (IsPair)
        {
            /* If it would violate rule 4, a character may not appear three
               times in a row, do not suffix this character.
            */
            if (String[Length-2] == Char) continue;

            /* If it would violate rule 5, limiting how many times pairs may
               appear, do not suffix this character.
            */
            if (S->NPairs == 0) continue;

            /* If it forms a pair, and our limit is not reached, count the
               pair.
            */
            --S->NPairs;
        }

        // Count the character.
        --S->N[Char];

        // Suffix the character.
        String[Length] = Char;

        // Add as many passwords as we can form by suffixing more characters.
        C += Count(Length+1, S, String);

        // Undo our changes to S.
        ++S->N[Char];
        S->NPairs += IsPair;
    }

    /* Besides all the distinguished characters, select a representative from
       the pool (we use the next unused character in numerical order), count
       the passwords we can form from it, and multiply by the number of
       characters that were in the pool.
    */
    if (S->Used < NChars)
    {
        /* A new character cannot violate rule 3 (has not been used 4 times
           yet), rule 4 (has not appeared three times in a row), or rule 5
           (does not form a pair that could exceed the pair limit). So we
           know, without any tests, that we can suffix it.
        */

        // Use the next unused character as a representative.
        unsigned Char = S->Used;

        /* By symmetry, we could use any of the remaining NChars - S->Used
           characters here, so the total number of passwords that can be
           formed from the current state is that number times the number that
           can be formed by suffixing this particular representative.
        */
        unsigned Multiplier = NChars - S->Used;

        // Record another character is being distinguished.
        ++S->Used;

        // Decrement the count for this character and suffix it to the string.
        --S->N[Char];
        String[Length] = Char;

        // Add as many passwords as can be formed by suffixing a new character.
        C += Multiplier * Count(Length+1, S, String);

        // Undo our changes to S.
        ++S->N[Char];
        --S->Used;
    }

    // Return the computed count.
    return C;
}

int main(void)
{
    /* Initialize our "supply" of characters. There are no distinguished
       characters, two pairs may be used, and each character may be used at
       most 4 times.
    */
    Supply S = { .Used = 0, .NPairs = 2 };
    for (unsigned Char = 0; Char < NChars; ++Char)
        S.N[Char] = 4;

    /* Prepare space for a string of RLength characters preceded by a sentinel
       (-1). The sentinel permits us to test for a repeated character without
       worrying about whether the indexing goes outside array bounds.
    */
    char String[RLength+1] = { -1 };

    printf("There are %" PRIu64 " possible passwords.\n",
           Count(0, &S, String+1));
}
The number of possibilities is fixed. You could either come up with an algorithm to generate only valid combinations, or you could just iterate over the entire problem space and check each combination using a simple function that checks for the validity of the combination.
How long it takes depends on the computer and on the efficiency of the implementation. You could easily make it a multithreaded application.
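To make "a simple function that checks for the validity of the combination" concrete, here is a rough sketch (my own illustration, assuming the 16 symbols are stored as values 0-15):
#include <stdbool.h>

/* Check one 16-symbol candidate against the question's rules:
   no symbol more than 4 times, no symbol 3 times in a row, at most 2 pairs. */
static bool is_valid(const unsigned char seq[16])
{
    unsigned counts[16] = {0};
    unsigned pairs = 0;

    for (int i = 0; i < 16; i++)
    {
        if (++counts[seq[i]] > 4)  /* rule: each symbol at most 4 times */
            return false;
        if (i >= 2 && seq[i] == seq[i-1] && seq[i] == seq[i-2])
            return false;          /* rule: no symbol 3 times in a row */
        if (i >= 1 && seq[i] == seq[i-1] && ++pairs > 2)
            return false;          /* rule: at most two pairs */
    }
    return true;
}
Brute force would then enumerate all 2^64 candidates and call this for each one, which is exactly the multi-year job estimated in the answer above.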

How can I determine if a file contains UTF-8 like characters

I am trying to write a program which takes a file as input, iterates the file and then check if the file contains UTF-8 encoded characters.
However, I am unsure how to approach the problem of UTF-8 encoding. I understand the basic concept behind the encoding: a character can be stored in 1-4 bytes, where a 1-byte character is just its ASCII representation (0-127).
1 byte: 0xxxxxxx
For the remainder I believe the pattern to be as such:
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
However, I struggle to see how to implement this in C code. I know how I would iterate the file, and do something if the predicate of UTF-8 encoding holds:
while ((check = fgetc(fp)) != EOF) {
    if (*) {
        // do something to the code
    }
}
However, I am unsure how to actually implement the UTF-8 encoding rules in C (or any language which does not have a built-in function for this, such as C#'s UTF8Encoding).
As a simple example, using similar logic for ASCII would just have me iterating over each character (pointed to by the check variable) and verifying whether it is within the ASCII limits:
if (check >= 0 && check <= 127) {
    // do something to the code
}
Can anyone try and explain to me how I would engage a similar logic, only when trying to determine if the check variable is pointing to a UTF-8 encoded character instead?
if ((ch & 0x80) == 0x0) {
    // ascii byte
}
else if ((ch & 0xe0) == 0xc0) {
    // 2 bytes
}
else if ((ch & 0xf0) == 0xe0) {
    // 3 bytes
}
else if ((ch & 0xf8) == 0xf0) {
    // 4 bytes
}
You want to bitwise-AND the leading bits and check that they match the patterns above: for an n-byte sequence (n >= 2), the lead byte starts with n 1-bits followed by a 0. It helps to write out the numbers in binary and follow along.
UTF-8 is not hard, but it is stricter than what you realize and what jpsalm's answer suggests. If you want to test that it's valid UTF-8, you need to determine that it conforms to the definition, expressed in ABNF in RFC 3629:
UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
Alternatively, you can do a bunch of math checking for "non-shortest form" and other stuff (surrogate ranges), but that's a huge pain, and highly error-prone. Almost every single implementation I've ever seen done this way, even in major widely-used software, has been outright wrong on at least one thing. A state machine that accepts UTF-8 is easy to write and easy to verify against the formal definition. One nice, clean, readable one is described in detail at https://bjoern.hoehrmann.de/utf-8/decoder/dfa/
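For instance, here is a minimal range-tracking validator (my own sketch, not the Höhrmann DFA linked above) that enforces exactly the RFC 3629 grammar quoted earlier, including the tighter bounds on the first continuation byte:
#include <stdbool.h>
#include <stddef.h>

// Validate a buffer as UTF-8 per the RFC 3629 grammar quoted above.
static bool is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n)
    {
        unsigned char b = s[i++];
        int follow = 0;                      // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // bounds for the next byte

        if (b <= 0x7F)                   continue;                // UTF8-1
        else if (b >= 0xC2 && b <= 0xDF) follow = 1;              // UTF8-2
        else if (b == 0xE0)            { follow = 2; lo = 0xA0; } // no overlongs
        else if (b >= 0xE1 && b <= 0xEC) follow = 2;
        else if (b == 0xED)            { follow = 2; hi = 0x9F; } // no surrogates
        else if (b >= 0xEE && b <= 0xEF) follow = 2;
        else if (b == 0xF0)            { follow = 3; lo = 0x90; } // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) follow = 3;
        else if (b == 0xF4)            { follow = 3; hi = 0x8F; } // <= U+10FFFF
        else return false;  // 0x80-0xC1 and 0xF5-0xFF never start a character

        while (follow-- > 0)
        {
            if (i >= n || s[i] < lo || s[i] > hi)
                return false;
            i++;
            lo = 0x80;      // later continuation bytes are always 80-BF
            hi = 0xBF;
        }
    }
    return true;
}
Feeding it the whole file (or restarting at a character boundary) also avoids the classic bug of checking bytes in isolation with fgetc and missing a sequence truncated at EOF.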

How does bit flag parameters of a method works (Ansi C/C)?

I'm programming a POS (point of sale) in C (ANSI C)
I have this function
GetString(uchar *str,uchar mode,uchar minlen,uchar maxlen)
It's something like readln, but on the POS.
In the API, the mode parameter is described in terms of bits D1, D2, D3, ...
But in the example (from the API) I have this:
if (!GetString(buf, 0x26, 0, 10))
{
    buf[buf[0]+1] = 0;
    amt = atol(buf+1);
}
else
{
    /* user pressed CANCEL button */
}
So what is the relation between 0x26 (the mode parameter of the function) and the binary numbers or bit flags, or even, I don't know, hexadecimal?
In the API there's another thing explaining the mode input parameter:
1. Input mode definition: (priority order is bit3>bit4>bit5, and mode&0x38≠0);
2. When inputting, the digit will be displayed on the screen in turns as plaintext or cryptograph (according to bit3).
3. The initial cursor position is determined by ScrGotoxy(x, y).
4. If bit7 of mode =1, then this function could bring initial digit string, and the string is displayed on initial cursor position as input digit string.
5. Output string does not record and contain function keys.
6. Press CLEAR button, if it is plaintext display, then CLEAR is considered as BACKSPACE key; if it is cryptograph display, then CLEAR is considered as the key to clear all contents.
7. Use function key to switch to Capital mode. S80 uses the Alpha key to select the corresponding character on a key, however SP30 uses the UP and DOWN keys, T52 uses the "#" key, and T620 uses key F2.
8. In MT30, the switch method between uppercase, lowercase and number characters is to keep pressing...
In the page you linked there are 8 bit flags specified. In your example you have the hex value 0x26 which is the binary value 00100110. This specifies 8 bit-flags, of which 3 (D1, D2, D5) are set and 5 (D0, D3, D4, D6, D7) are clear.
If you refer to your linked table (it's a graphic, so I can't paste it), it tells you how the mode argument instructs GetString to behave, for each of the 8 bit-flags set (1) or clear (0).
For example D2 set to 1 indicates left alignment.
Combining the individual flags gives a binary number, which in your example is passed as the hexadecimal number 0x26.
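As a quick illustration (my own, not from the API documentation), you can decompose the mode byte into its D-flags like this:
#include <stdio.h>

int main(void)
{
    unsigned char mode = 0x26; /* 0b00100110 */

    /* Print every D-flag that is set; for 0x26 this prints D5, D2, D1. */
    for (int bit = 7; bit >= 0; --bit)
        if (mode & (1u << bit))
            printf("D%d is set\n", bit);

    return 0;
}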
The D1, D2, D3 ... D7 are bits. I suppose the parameter is used as a set of bit flags. Since it's just 1 byte, there are 8 flags, and they can all be combined together.
The 0x26 is decimal 38 or binary
00100110
which means D1, D2 and D5 are set and all the other D's are not.
This will help you: I defined a macro to manipulate the D7 bit, which according to the documentation is the bit of the ENTER mode.
Continue in the same way with the other modes.
// bit manipulation
// idx stands for bit index
#define CHECK_BIT(var,idx) ((var >> idx) & 1)
#define SET_BIT(var,idx,n) (var ^= (-(n?1:0) ^ var) & (1 << idx))

// helping macros
// check if Enter mode is set (D7 is used)
#define IS_ENTER_MODE_SET(mode) CHECK_BIT(mode,7)
// set Enter mode to opt (true/false)
#define SET_ENTER_MODE(mode,opt) SET_BIT(mode,7,opt)
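A small stand-alone usage sketch of those macros (repeated here so the example compiles on its own):
#include <stdio.h>

#define CHECK_BIT(var,idx) ((var >> idx) & 1)
#define SET_BIT(var,idx,n) (var ^= (-(n?1:0) ^ var) & (1 << idx))
#define IS_ENTER_MODE_SET(mode) CHECK_BIT(mode,7)
#define SET_ENTER_MODE(mode,opt) SET_BIT(mode,7,opt)

int main(void)
{
    unsigned char mode = 0x26; /* ENTER-mode bit D7 is clear */
    printf("enter mode: %d\n", IS_ENTER_MODE_SET(mode)); /* prints 0 */

    SET_ENTER_MODE(mode, 1);   /* set D7: 0x26 becomes 0xA6 */
    printf("enter mode: %d, mode=0x%02X\n", IS_ENTER_MODE_SET(mode), mode);
    return 0;
}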

Correct way to unpack a 32 bit vector in Perl to read a uint32 written in C

I am parsing a Photoshop raw, 16 bit/channel, RGB file in C and trying to keep a log of exceptional data points. I need a very fast C analysis of up to 36 MPix images with 16 bit quanta or 216 MB Photoshop .RAW files.
<1% of the points have weird skin tones and I want to graph them with PerlMagick or Perl GD to see where they are coming from.
The first 4 bytes of the C data file contain the unsigned image width as a uint32_t. In Perl, I read the whole file in binary mode and extract the first 32 bits:
Xres=1779105792l = 0x6a0b0000
It looks a lot like the C log file:
DA: Color anomalies=14177=0.229%:
DA: II=1) raw PIDX=0x10000b25, XCols=[0]=0x00000b6a
Dec(0x00000b6a) = 2922, the Exact X_Columns_Width of a small test file.
Clearly a case of Intel's 1972 8008 NUXI architecture. How hard could it possibly be to translate 0x6a0b0000 to 0x00000b6a; swap 2 bytes and 2 nibbles and you're done. Slicing the 8 characters and rearranging them could be done, but that is the kind of ugly hack I am trying to avoid.
Grab the same 32-bit vector from file offset zero and unpack it as a "VAX" unsigned long:
$xres = vec($bdat, 0, 32); # vec EXPR,OFFSET,BITS
$vul = unpack("V", vec($bdat, 0, 32));
printf("Length (\$bdat)=%d, xres=0x%08x, Vax ulong=%ul=0x%08x\n",
       length($bdat), $xres, $vul, $vul);
Length ($bdat) = 56712, xres=0x6a0b0000, Vax ulong=959919921l=0x39373731
Every single hex character is mangled. Obviously the wrong endianness; it is not VAX. The "other" one is network big-endian:
http://perldoc.perl.org/functions/pack.html
N An unsigned long (32-bit) in "network" (big-endian) order.
V An unsigned long (32-bit) in "VAX" (little-endian) order.
$nul = unpack("N", vec($bdat, 0, 32)); # Network Unsigned Long 32b
printf("Xres=0x%08x, NET ulong=%ul=0x%08x\n", $xres, $nul, $nul);
Xres=0x6a0b0000, NET ulong=825702201l=0x31373739
The $xres still shows the right hex in the wrong order. The "network" long 32-bit uint extracted from the same bits is unrecognizable. Try binary:
$bits = unpack("b*", vec($bdat, 0, 32));
printf("bits=$bits, len=%d\n", length $bits);
bits=10001100111011001110110010011100100011000000110010101100111011001001110001001100, len=80
I clearly asked for 32 bits and got 80 bits. What gives?
Try for 4 unsigned 8-bit bytes, which can NOT be swapped:
for ($ii = 0; $ii < 4; $ii++) {
    $bit_off = $ii * 8;                         # Bit offset
    $uc = unpack("C", vec($bdat, $bit_off, 8)); # C An unsigned char
    printf("II $ii, bo $bit_off, d=%d, u=%u, x=0x%x\n",
           $uc, $uc, $uc);
}
II 0, bo 0, d=49, u=49, x=0x31
II 1, bo 8, d=51, u=51, x=0x33
II 2, bo 16, d=49, u=49, x=0x31
II 3, bo 24, d=49, u=49, x=0x31
I am looking for hex 0, 6, a or b. There are no "3"s or "1"s in the right answer. Try pirating from a C file:
http://cpansearch.perl.org/src/MHX/Convert-Binary-C-0.76/tests/include/include/bits/byteswap.h
$x = $xres;
$x = (((($x) & 0xff000000) >> 24) | ((($x) & 0x00ff0000) >>  8) |
      ((($x) & 0x0000ff00) <<  8) | ((($x) & 0x000000ff) << 24));
printf("\$xres=0x%08x -> \$x=0x%08x = %u\n", $xres, $x, $x);
$xres=0x6a0b0000 -> $x=0x00000b6a = 2922
It WORKS! But this is uglier than converting the original, wrong-order hex number to a string to untangle it:
$stupid_str = sprintf("%08x", $xres);
$stupid_num = join('', reverse($stupid_str =~ m/../g));
printf("Stupid_num '%s'->0x%08x=%d\n", $stupid_num, $dec = hex $stupid_num, $dec);
Stupid_num '00000b6a'->0x00000b6a=2922
It's like judging the Ugliest Dog contest, but I would still rather have to maintain the text version than the even more abominable C version.
I know there are ways to do this in Java/Python/Go/Ruby/.....
I know there are command line utilities that do exactly this.
I must figure out how I am misusing either VEC or Unpack, both of which I have used a zillion times. It is the Brain Teasing aspect which is driving me nuts! EndianNess == EndianMess!!!
TYVM!
=================================================
Borodin,
Thanks for lookin' at this.
My intel processor is little-endian. When I read it back, it was trans-mutilated by vec to the "correct" big-endian, network format.
I just tried reading it VERBATIM from a BINARY file read and it works fine:
($b4 = $bdat) =~ s/^(....).*$/$1/msg; # Give me my 4 bytes back without mutilation!
printf("B4='%s'=>0x%08x=<0x%08x\n", $b4, unpack("L>", $b4), unpack("L<", $b4));
B4='j...' = >0x6a0b0000 = <0x00000b6a <<< THE RIGHT ANSWER!!!
If you try unpack 'V', $bdat then you will find that it works
That was my first attempt:
$vul = unpack("V", vec($bdat, 0, 32)); # UNPACK V!
printf("Length (\$bdat)=%d, xres=0x%08x, Vax ulong=%ul=0x%08x\n",
       length($bdat), $xres, $vul, $vul);
Length ($bdat) = 56712, xres=0x6a0b0000, Vax ulong=959919921l=0x39373731 <<<< TOTALLY WRONG!
I had already verified that the $BDAT info was the right data in the wrong format. It just needed some rearrangement.
I just used vec() to generate 1 bit and 4 bit graphics files and it worked faithfully, returning the exact bits I wrote. It must have mistaken my Intel i7 for my IBM System/370. I7/37??? Easy mistake to make. :)
I read the [confusing] part about "converted to a number as with pack ...". That's why my number was backward. The >>unpack("V", vec($bdat"<< ... was my ill-fated attempt to byte-swap the backward number in $BDAT from the WRONG VEC()-preferred FORMAT to the native format supported by my architecture.
Now I understand why I saw so many examples of people extracting byte by byte, to avoid Big Brother's helping hand!
Data::BitStream::Vec "uses a Perl vec to store the data. The vector is accessed in 1-bit units"
Thanks 1E6,
B
You are confusing things by combining vec with unpack
The correct way is simply
unpack 'V', $bdat
which returns a value of 0x00000B6A as you expect
vec($bdat, 0, 32) is equivalent to unpack 'N', $bdat as you can see from the value of $xres in your first code block, and the documentation for vec confirms this with
If BITS is 16 or more, bytes of the input string are grouped into chunks of size BITS/8, and each group is converted to a number as with pack()/unpack() with big-endian formats n/N
The line
$vul = unpack("V", vec($bdat, 0, 32))
is very wrong, because the decimal value of vec($bdat, 0, 32) is 1779105792, so you are then calling unpack on the string "1779105792", which doesn't do anything useful at all.
