Time-based UUID does not follow creation order when implementing RFC 4122 - uuid

I am creating a custom algorithm to embed info into a timeUUID. Studying RFC 4122, I see that a version 1 UUID has the following structure:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          time_low                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       time_mid                |         time_hi_and_version   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|clk_seq_hi_res |  clk_seq_low  |         node (0-1)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         node (2-5)                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
I've found that the low part of the timestamp (its least significant 32 bits) goes at the front of the ID, making it the most significant part when sorting UUIDs.
What I do not understand is how this layout is supposed to make UUIDs sort in creation order.
To illustrate the question, here are two examples where timestamp t1 > t2, but the UUIDs created from those timestamps sort in the reverse order.
t1 = 137601405637595834 // 0x1e8dbbfd79f92ba
t2 = 3617559227 // 0xd79f92bb
are transformed to the following parts
t1_low: Uint = 3617559226 // 0xd79f92ba
t1_mid: Ushort = 56255 // 0xdbbf
t1_hi: Ushort = 488 // 0x1e8
t2_low: Uint = 3617559227 // 0xd79f92bb
t2_mid: Ushort = 0 // 0x0
t2_hi: Ushort = 0 // 0x0
Since the least significant bytes are not relevant to the order in this case, I will ignore them for the sake of simplification.
The UUIDs generated from these timestamps are
UUID1 = d79f92ba-dbbf-11e8-8808-000000000002
UUID2 = d79f92bb-0000-1000-a68b-000000000004
Clearly UUID1 < UUID2 even though their timestamps are in the reverse order.
What is wrong with my analysis?

The UUIDv1 spec deliberately puts the most entropy in the high-order bits, so keys do not sort the way you expected; instead, they will be scattered seemingly at random, yet roughly evenly, across the full number range regardless of creation order, just like UUIDv3/v4/v5.
If you want a sortable timestamp, add another column; using UUID as anything but an opaque identifier will end up biting you later.
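To see concretely why the textual order differs from the timestamp order, here is a small C sketch (the function names are mine, not from the RFC) that splits a 60-bit version 1 timestamp into the RFC 4122 fields and reassembles it; sorting by the reassembled value follows creation time, whereas sorting the UUID string compares time_low first.

#include <stdint.h>
#include <stdio.h>

/* Split a 60-bit v1 timestamp into the RFC 4122 fields. */
static void split_v1_timestamp(uint64_t ts, uint32_t *time_low,
                               uint16_t *time_mid, uint16_t *time_hi)
{
    *time_low = (uint32_t)(ts & 0xFFFFFFFFu);      /* bits 0-31, leftmost in the string */
    *time_mid = (uint16_t)((ts >> 32) & 0xFFFFu);  /* bits 32-47 */
    *time_hi  = (uint16_t)((ts >> 48) & 0x0FFFu);  /* bits 48-59, shares a field with the version */
}

/* Reassemble the timestamp; comparing these values follows creation order. */
static uint64_t join_v1_timestamp(uint32_t time_low, uint16_t time_mid, uint16_t time_hi)
{
    return ((uint64_t)time_hi << 48) | ((uint64_t)time_mid << 32) | time_low;
}

int main(void)
{
    uint32_t lo;
    uint16_t mid, hi;

    split_v1_timestamp(137601405637595834ULL, &lo, &mid, &hi);  /* t1 from the question */
    printf("%08x-%04x-1%03x-...\n",
           (unsigned)lo, (unsigned)mid, (unsigned)hi);          /* d79f92ba-dbbf-11e8-... */
    printf("%llu\n",
           (unsigned long long)join_v1_timestamp(lo, mid, hi)); /* back to t1 */
    return 0;
}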

Related

An array is given and it can be of four types:

An array is given and it can be of four types:
increasing
decreasing
first increasing then decreasing
first decreasing then increasing
Without traversing the array we need to tell its type.
Example:
a. increasing. eg 1 2 3 4 5 6 7 8 9
b. decreasing. eg 9 8 7 6 5 4 3 2 1
c. incr-decr. eg 1 2 3 4 9 8 7 6 5
d. decr-inc. eg 9 8 7 6 1 2 3 4 5
First, for the third and fourth cases, there must be at least three array elements in order to have an increase followed by a decrease, or vice-versa.
Assuming three or more elements, you can answer the question by doing the following two checks:
compare the first and second element (Comparison 1)
compare the second to last and last element (Comparison 2)
Here is a table showing how the results of these two comparisons can be used to determine the array type:
Comparison 1 | Comparison 2 | Type
< | < | increasing
> | > | decreasing
< | > | increasing then decreasing
> | < | decreasing then increasing
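A minimal C sketch of that rule (the function and names are illustrative only), assuming at least three elements and strictly increasing or decreasing runs:

#include <stdio.h>

enum array_type { INCREASING, DECREASING, INC_THEN_DEC, DEC_THEN_INC };

/* Classify using only the two comparisons from the table above:
   a[0] vs a[1] (Comparison 1) and a[n-2] vs a[n-1] (Comparison 2). */
static enum array_type classify(const int *a, int n)
{
    int first_rising = a[0] < a[1];
    int last_rising  = a[n - 2] < a[n - 1];

    if (first_rising && last_rising)   return INCREASING;
    if (!first_rising && !last_rising) return DECREASING;
    if (first_rising)                  return INC_THEN_DEC;   /* rises, then falls */
    return DEC_THEN_INC;                                      /* falls, then rises */
}

int main(void)
{
    int d[] = {9, 8, 7, 6, 1, 2, 3, 4, 5};
    printf("%d\n", classify(d, 9));   /* prints 3, i.e. DEC_THEN_INC */
    return 0;
}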

Nested loop with increasing number of elements

I have a longitudinal dataset of 18 time periods. For reasons not to be discussed here, this dataset is in the wide shape, not in the long one. More precisely, time-varying variables have an alphabetic prefix which identifies the time it belongs to. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, until rpay.
Importantly, different observations have missing values in this variable in different periods, in an unpredictable way. In consequence, running a panel over the full number of periods reduces my number of observations considerably. Hence, I would like to know precisely how many observations a panel of a given length will have. To evaluate this, I want to create variables that, for each period and for each number of consecutive periods, count how many respondents have nonmissing pay over that sequence. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in both the first period and the second. This can be achieved with something like this:
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
Now, since I want to do this automatically, this has to be in a loop. Moreover, there are different numbers of sequences for each period. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2 and 3). Thus, the number of variables to create is 1+2+3+...+17 = 153. This variability has to be reflected in the loop. I propose code below, but there are bits that are wrong, or of which I'm unsure, as highlighted in the comments.
local list b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
    local counter = 1 // counter to update; reflects sequence length
    while `counter' > 0 { // loop over sequence lengths
        gen _`var'_counter_`counter' = 0 // generate variable with counter
        if `var'pay != . { // HERE IS PROBLEM 1. NEED TO MAKE THIS TO CHECK CONDITIONS WITH INCREASING NUMBER OF ELEMENTS
            recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
            local counter = `counter' - 1 // update counter to look for a longer sequence in the next iteration
        }
    }
    local counter = `counter' + 1 // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}
An example of the result of the above code (if right) is the following. Consider a dataset of five observations, for four periods (denoted a, b, c, and d):
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
where 1 means value is observed in that period, and . is not. The objective of the code is to create 1+2+3=6 new variables such that the new dataset is:
Obs a b c d b_count_2 c_count_2 c_count_3 d_count_2 d_count_3 d_count_4
1 1 1 . 1 1 0 0 0 0 0
2 1 1 . . 1 0 0 0 0 0
3 . . 1 1 0 0 0 1 0 0
4 . 1 1 . 0 1 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1
Now, why is this helpful? Well, because now I can run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this:
local list a b c d e f g h i j k l m n o p q r // periods over which iterate
foreach var of local list { // loop over periods
    local list `var'_counter_* // group of sequence variables for each period
    foreach var2 of local list { // loop over each element of the list
        quietly sum `var'_counter_`var2' if `var'_counter_`var2' == 1 // sum the number of individuals with value = 1 with sequence of length var2 in period var
        di as text "Wave `var' has a sequence of length `var2' with " as result r(N) as text " observations." // print result
    }
}
For the above example, this produces the following output:
"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."
This gives me a nice summary of the trade-offs I'm having between a wider panel and a longer panel.
If you insist on doing this with data in wide form, it is very inefficient to create extra variables just to count patterns of missing values. You can create a single string variable that contains the pattern for each observation. Then, it's just a matter of extracting from this pattern variable what you are looking for (i.e. patterns of consecutive periods up to the current wave). You can then loop over lengths of the matching patterns and do counts. Something like:
* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}
* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
    qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list
list
qui foreach pre in b c d e f g {
    noi dis "{hline 80}" _n as res "Wave `pre'"
    // the longest substring without a space up to the wave
    gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
    noi tab temp
    // loop over the various substring lengths, from 2 to max length
    gen len = length(temp)
    sum len, meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        count if length(temp) >= `i'
        noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
    drop temp len
}
If you are open to working in long form, then here is how you would identify spells with contiguous data and how to loop to get the info you want (the data setup is exactly the same as above):
* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}
* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string
* identify spells of contiguous periods
egen wavegroup = group(wave), label
tsset id wavegroup
tsspell, cond(pay < .)
drop if mi(pay)
foreach pre in b c d e f g {
    dis "{hline 80}" _n as res "Wave `pre'"
    sum _seq if wave == "`pre'", meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        qui count if _seq >= `i' & wave == "`pre'"
        dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
}
I echo @Dimitriy V. Masterov in genuine puzzlement that you are using this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as you have, working with it in Stata is at best awkward and at worst impracticable.
First, note specifically that
local b_count_2 = 0
if apay != . & bpay != . {
local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}
will only ever be evaluated in terms of the first observation, i.e. as if you had coded
if apay[1] != . & bpay[1] != .
This is documented here. Even if it is what you want, it is not usually a pattern for others to follow.
Second, and more generally, I haven't tried to understand all of the details of your code, as what I see is the creation of a vast number of variables even for tiny datasets as in your sketch. For a series T periods long, you would create a triangular number [(T - 1)T]/2 of new variables; in your example (17 x 18)/2 = 153. If someone had series 100 periods long, they would need 4950 new variables.
Note that, because of the first point just made, these new variables would under your strategy pertain only to individual variables like pay and to individual panels. Presumably that limitation to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have for working with these hundreds or thousands of new variables except writing yet more nested loops?
Your main need seems to be to identify spells of non-missing and missing values. There is easy machinery for this long since developed. General principles are discussed in this paper and an implementation is downloadable from SSC as tsspell.
On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That's entirely equivalent to the long-standing requests here for an MCVE.
Despite all that advice, I would start by looking at the Stata command xtdescribe and associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.
Let me add another answer based on the example now added to the question.
Obs a b c d
1 1 1 . 1
2 1 1 . .
3 . . 1 1
4 . 1 1 .
5 1 1 1 1
The aim of this answer is not to provide what the OP asks but to indicate how many simple tools are available to look at patterns of non-missing and missing values, none of which entail the creation of large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of those tools require a reshape long.
. clear
. input a b c d
a b c d
1. 1 1 . 1
2. 1 1 . .
3. . . 1 1
4. . 1 1 .
5. 1 1 1 1
6. end
. rename (a b c d) (y1 y2 y3 y4)
. gen id = _n
. reshape long y, i(id) j(time)
(note: j = 1 2 3 4)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 5 -> 20
Number of variables 5 -> 3
j variable (4 values) -> time
xij variables:
y1 y2 ... y4 -> y
-----------------------------------------------------------------------------
. xtset id time
panel variable: id (strongly balanced)
time variable: time, 1 to 4
delta: 1 unit
. preserve
. drop if missing(y)
(7 observations deleted)
. xtdescribe
id: 1, 2, ..., 5 n = 5
time: 1, 2, ..., 4 T = 4
Delta(time) = 1 unit
Span(time) = 4 periods
(id*time uniquely identifies each observation)
Distribution of T_i: min 5% 25% 50% 75% 95% max
2 2 2 2 3 4 4
Freq. Percent Cum. | Pattern
---------------------------+---------
1 20.00 20.00 | ..11
1 20.00 40.00 | .11.
1 20.00 60.00 | 11..
1 20.00 80.00 | 11.1
1 20.00 100.00 | 1111
---------------------------+---------
5 100.00 | XXXX
* ssc inst xtpatternvar
. xtpatternvar, gen(pattern)
* ssc inst groups
. groups pattern
+------------------------------------+
| pattern Freq. Percent % <= |
|------------------------------------|
| ..11 2 15.38 15.38 |
| .11. 2 15.38 30.77 |
| 11.. 2 15.38 46.15 |
| 11.1 3 23.08 69.23 |
| 1111 4 30.77 100.00 |
+------------------------------------+
. restore
. egen nmissing = total(missing(y)), by(time)
. tabdisp time, c(nmissing)
----------------------
time | nmissing
----------+-----------
1 | 2
2 | 1
3 | 2
4 | 2
----------------------

Comparisons across multiple rows in Stata (household dataset)

I'm working on a household dataset and my data looks like this:
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
What I want to do is identify the mother in each family. A mother is a member of the family whose id is equal to one of the mother_id's of another family member. In the example above, for the family with id_family=3, individual 5 has mother_id=4, which makes individual 4 her mother.
I create a family size variable that tells me how many members there are per family. I also create a rank variable for each member within a family. For families of three, I then have the following piece of code that works:
bysort id_family: gen family_size=_N
bysort id_family: gen rank=_n
gen mother=.
bysort id_family: replace mother=1 if male==0 & rank==1 & family_size==3 & (id[_n]==mother_id[_n+1] | id[_n]==mother_id[_n+2])
bysort id_family: replace mother=1 if male==0 & rank==2 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n+1])
bysort id_family: replace mother=1 if male==0 & rank==3 & family_size==3 & (id[_n]==mother_id[_n-1] | id[_n]==mother_id[_n-2])
What I get is:
id id_family mother_id male family_size rank mother
1 2 12 0 2 1 .
2 2 13 1 2 2 .
3 3 15 1 3 1 .
4 3 17 0 3 2 1
5 3 4 0 3 3 .
However, in my real data set, I have to get the mother for families of size 4 and higher (up to 9), which makes this procedure very inefficient (in the sense that there are too many row elements to compare "manually").
How would you obtain this in a cleaner way? Would you make use of permutations to index the rows? Or would you use a for-loop?
Here's an approach using merge.
// create sample data
clear
input id id_family mother_id male
1 2 12 0
2 2 13 1
3 3 15 1
4 3 17 0
5 3 4 0
end
save families, replace
clear
// do the job
use families
drop id male
rename mother_id id
sort id_family id
duplicates drop
list, clean abbreviate(10)
save mothers, replace
use families, clear
merge 1:1 id_family id using mothers, keep(master match)
generate byte is_mother = _merge==3
list, clean abbreviate(10)
The second list yields
id id_family mother_id male _merge is_mother
1. 1 2 12 0 master only (1) 0
2. 2 2 13 1 master only (1) 0
3. 3 3 15 1 master only (1) 0
4. 4 3 17 0 matched (3) 1
5. 5 3 4 0 master only (1) 0
where I retained _merge only for expositional purposes.

In C, how does the length in an array definition map to addressing?

In C, when I define an array like int someArray[10], does that mean that the accessible range of that array is someArray[0] to someArray[9]?
Yes. Indexing in C is zero-based, so for an array of n elements the valid indices are 0 through n-1.
Yes, because C computes element addresses with a simple offset:
myArray[5] = 3
roughly translates to
store in the address myArray + 5 * sizeof(myArray's base type)
the number 3.
Which means that if we permitted
myArray[1]
to be the first element, we would have to compute
store in the address myArray + (5 - 1) * sizeof(myArray's base type)
the number 3
which would require an extra computation to subtract the 1 from the 5, and would slow the program down a little bit (as it would require an extra trip through the ALU).
Modern CPUs could be architected around such issues, and modern compilers could compile these differences out; however, when C was crafted they didn't consider it a must-have nicety.
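As a small illustration (my own example, not from the answers above), the equivalence between indexing and the offset computation, and the valid range for a ten-element array, look like this:

#include <stdio.h>

int main(void)
{
    int someArray[10];                 /* valid indices: 0 through 9 */

    someArray[5] = 3;                  /* same element as *(someArray + 5) */
    *(someArray + 5) = 3;

    /* &someArray[i] is the base address plus i * sizeof(int) bytes */
    printf("%p == %p\n", (void *)&someArray[5], (void *)(someArray + 5));

    /* someArray[10] would be one element past the end; reading or
       writing it is undefined behaviour. */
    return 0;
}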
Think of an array like this:
 *   0   1   2   3   4   5   6   7   8   9
   +---+---+---+---+---+---+---+---+---+----+
   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
   +---+---+---+---+---+---+---+---+---+----+
                      DATA
 * = array indices
So the range of access would be [0, 9] (inclusive).

inet_pton() counterpart for link layer address

I have two problems related to my implementation:
I need a function which can convert a given link-layer address from text to its standard binary format, like the similar function we have at the network layer for IP addresses, inet_pton(), which converts a given IP address from text to the standard IPv4/IPv6 format.
Is there any difference between a link-layer address and a 48-bit MAC address (in the case of IPv6 specifically)?
If not, then a link-layer address should always be 48 bits in length as well, if I am not wrong.
Thanks in advance. Please excuse me if I am missing something trivial.
EDIT:
OK, I am now clear on the difference between a link-layer address and an Ethernet MAC address. There are several types of data-link-layer addresses; an Ethernet MAC address is just one of them.
Now, this raises one more problem. As I said in my first question, I need to convert a link-layer address given on the command line to its standard form. The solution provided here will only work for Ethernet MAC addresses.
Isn't there any standard function for this purpose? What I would like to do is create an application where the user enters values for the different options present in an ICMP router advertisement message, as stated in RFC 4861.
Option Formats
Neighbor Discovery messages include zero or more options, some of
which may appear multiple times in the same message. Options should
be padded when necessary to ensure that they end on their natural
64-bit boundaries. All options are of the form:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Length     |              ...              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
~                              ...                               ~
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Fields:
Type 8-bit identifier of the type of option. The
options defined in this document are:
Option Name                       Type
Source Link-Layer Address            1
Target Link-Layer Address            2
Prefix Information                   3
Redirected Header                    4
MTU                                  5
Length 8-bit unsigned integer. The length of the option
(including the type and length fields) in units of
8 octets. The value 0 is invalid. Nodes MUST
silently discard an ND packet that contains an
option with length zero.
4.6.1. Source/Target Link-layer Address
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |    Length     |    Link-Layer Address ...
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Fields:
Type
1 for Source Link-layer Address
2 for Target Link-layer Address
Length The length of the option (including the type and
length fields) in units of 8 octets. For example,
the length for IEEE 802 addresses is 1
[IPv6-ETHER].
Link-Layer Address
The variable length link-layer address.
The content and format of this field (including
byte and bit ordering) is expected to be specified
in specific documents that describe how IPv6
operates over different link layers. For instance,
[IPv6-ETHER].
One more thing: I am not quite handy with C++, so can you please provide a C alternative?
Thanks.
For your first question: it's not that hard to write, and since a MAC address is represented by a 6-byte array you don't need to take machine dependency into account (like endianness and such):
#include <string>

using std::string;

// Parse a colon-separated, lowercase-hex MAC string (e.g. "01:23:45:67:89:ab")
// into a 6-byte array.
void str2MAC(string str, char* mac) {
    for (int i = 0; i < 5; i++) {
        // take the text up to the next ':' and strip it from the input
        string b = str.substr(0, str.find(':'));
        str = str.substr(str.find(':') + 1);
        mac[i] = 0;
        for (size_t j = 0; j < b.size(); j++) {
            mac[i] *= 0x10;
            mac[i] += (b[j] > '9' ? b[j] - 'a' + 10 : b[j] - '0');
        }
    }
    // whatever is left after the last ':' is the sixth byte
    mac[5] = 0;
    for (size_t i = 0; i < str.size(); i++) {
        mac[5] *= 0x10;
        mac[5] += (str[i] > '9' ? str[i] - 'a' + 10 : str[i] - '0');
    }
}
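If you want this in plain C instead (as asked above), a rough equivalent, assuming a colon-separated hex string and doing only minimal validation, can be built on sscanf:

#include <stdio.h>

/* Sketch: parse "01:23:45:67:89:ab" into 6 bytes; returns 0 on success,
   -1 if the string does not have the expected shape. */
static int mac_pton(const char *text, unsigned char mac[6])
{
    unsigned int b[6];

    if (sscanf(text, "%2x:%2x:%2x:%2x:%2x:%2x",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5]) != 6)
        return -1;

    for (int i = 0; i < 6; i++)
        mac[i] = (unsigned char)b[i];

    return 0;
}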
About your second question: IP (and IPv6 specifically) is a network-layer protocol and sits above the link layer, so it does not dictate the link-layer address format.
If by link layer you mean Ethernet, then yes, an Ethernet address is always 48 bits, but there are other link-layer protocols which may use other formats.
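And for building the RFC 4861 option itself, here is a minimal sketch of one possible layout, assuming an Ethernet (48-bit) link layer so that the whole option occupies exactly one 8-octet unit (other link layers would need a different size):

#include <stdint.h>
#include <string.h>

/* Source/Target Link-Layer Address option for Ethernet:
   1 (type) + 1 (length) + 6 (MAC) = 8 octets, i.e. Length = 1. */
struct nd_opt_lladdr_eth {
    uint8_t type;       /* 1 = Source Link-Layer Address, 2 = Target */
    uint8_t length;     /* in units of 8 octets; 1 for Ethernet */
    uint8_t lladdr[6];  /* the 48-bit MAC address */
};

/* Fill the option from an already-parsed 6-byte MAC. */
static void make_slla_option(struct nd_opt_lladdr_eth *opt,
                             const unsigned char mac[6])
{
    opt->type = 1;      /* Source Link-Layer Address */
    opt->length = 1;    /* 8 octets total */
    memcpy(opt->lladdr, mac, 6);
}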

Resources