I've got a string with eight characters in it, e.g. abcdefgh. I need to generate all possible 10-character combinations of this string.
For example, all 2-character combinations of this string would be ab bc cd ef gh ac ad ae af ah, etc.
I thought of doing something like this but I couldn't figure out how to get it working.
What should I do? Is there a simple algorithm I'm missing?
You can use two pointers. The first sits on the letter at the start of your string and is incremented each time the second one reaches '\0'. The second is simply incremented on each turn of your loop, with a condition so that you don't rewrite an older combination.
aa ab ac ... bb bc ...
Edit :
No condition is needed: when the second pointer is reset, it only has to restart at the first pointer's position rather than at the start of the string.
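A minimal Python sketch of that two-index idea for the 2-character case, assuming a character may be paired with itself (as in the aa ab ac ... listing above). The function name `pairs` is made up for illustration:

```python
def pairs(s):
    """Return every 2-character combination of s, with first index <= second."""
    out = []
    for i in range(len(s)):            # first "pointer"
        for j in range(i, len(s)):     # second "pointer" restarts at the first
            out.append(s[i] + s[j])
    return out


print(" ".join(pairs("abc")))  # aa ab ac bb bc cc
```

For longer combinations, `itertools.combinations_with_replacement("abcdefgh", 10)` from the standard library should give the same kind of result without hand-written loops.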
I have been trying out some regular expressions lately. Now, I have 3 symbols a, b and c.
I first looked at a case where I don't want 2 consecutive a's. The regex would be something like:
((b|c) + a(b|c))*(a + epsilon)
Now I'm wondering if there's a way to generalize this problem to say something like:
A regular expression with no two consecutive a's and no two consecutive b's. I tried stuff like:
(a(b|c) + b(a|c) + c)* (a + b + epsilon)
But this accepts inputs such as "abba" or "baab", which have 2 consecutive b's (or a's), and that is not what I want. Can anyone suggest a way out?
If you can't do a negative match then perhaps you can use negative lookahead to exclude strings matching aa and bb? Something like the following (see Regex 101 for more information):
(?!.*(aa|bb).*)^.*$
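As a quick check, here is a small Python sketch that applies the pattern above with the standard `re` module. The helper name `is_valid` is only illustrative, and this assumes your regex engine supports lookahead (a purely theoretical regular expression does not):

```python
import re

# The pattern from this answer: fail if "aa" or "bb" occurs anywhere.
NO_DOUBLES = re.compile(r"(?!.*(aa|bb).*)^.*$")


def is_valid(s):
    """True when s contains neither "aa" nor "bb"."""
    return NO_DOUBLES.search(s) is not None


for word in ["abcab", "abba", "baab", "cc"]:
    print(word, is_valid(word))
```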
I (think I) solved this by hand-drawing a finite state machine, then generating a regex using FSM2Regex. The state machine is written below (with the syntax from the site):
#states
s0
s1
s2
s3
#initial
s0
#accepting
s1
s2
s3
#alphabet
a
b
c
#transitions
s0:a>s1
s0:b>s2
s0:c>s3
s1:b>s2
s1:c>s3
s2:a>s1
s2:c>s3
s3:c>s3
s3:a>s1
s3:b>s2
If you look at the transitions, you'll notice it's fairly straightforward: I have states that correspond to a "sink" for each letter of the alphabet, and I only allow transitions out of that state for other letters (not the "sink" letter). For example, s1 is the "sink" for a. From all other states, you can get to s1 with an a. Once you're in s1, though, you can only get out of it with a b or a c, which have their own "sinks" s2 and s3 respectively. Because we can repeat c, s3 has a transition to itself on the character c. Paste the block text into the site, and it'll draw all this out for you, and generate the regex.
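If you'd rather test the state machine directly than trust a generated regex, the transition table above can be simulated in a few lines. This is only a sketch in Python; the names `TRANSITIONS`, `ACCEPTING` and `accepts` are mine, not something the site produces:

```python
# Transition table copied from the #transitions block above.
TRANSITIONS = {
    ("s0", "a"): "s1", ("s0", "b"): "s2", ("s0", "c"): "s3",
    ("s1", "b"): "s2", ("s1", "c"): "s3",
    ("s2", "a"): "s1", ("s2", "c"): "s3",
    ("s3", "a"): "s1", ("s3", "b"): "s2", ("s3", "c"): "s3",
}
ACCEPTING = {"s1", "s2", "s3"}


def accepts(s):
    """Run the machine on s; a missing transition means rejection."""
    state = "s0"
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts("abcab"), accepts("abba"))  # True False
```

Note that, as written, s0 is not an accepting state, so this machine rejects the empty string.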
The regex it generated for me is:
c+cc*(c+$+b+a)+(b+cc*b)(cc*b)*(c+cc*(c+$+b+a)+$+a)+(a+cc*a+(b+cc*b)(cc*b)*(a+cc*a))(cc*a+(b+cc*b)(cc*b)*(a+cc*a))*(c+cc*(c+$+b+a)+(b+cc*b)(cc*b)*(c+cc*(c+$+b+a)+$+a)+b+$)+b+a
Which, I'm pretty sure, is not optimal :)
EDIT: The generated regex uses + as the choice operator (usually known to us coders as |), which means it's probably not suitable to pasting into code. However, I'm too scared to change it and risk ruining my regex :)
You can use backreferences to match the previous character:
using System.Text.RegularExpressions;

string input = "acbbaacbba";
string pattern = @"([ab])\1";  // verbatim string; \1 refers back to group 1
var matchList = Regex.Matches(input, pattern);
This pattern will match: bb, aa and bb. If there is no match in your input, it means that the input does not contain a repeated a or b.
Explanation:
([ab]): define a group, you can extend your symbols here
\1: back referencing the group, so for example, when 'a' is matched, \1 would be 'a'
check this page: http://www.regular-expressions.info/backref.html
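The same backreference idea, sketched in Python for comparison (the standard `re` module supports `\1` as well):

```python
import re

text = "acbbaacbba"
# group(0) is the whole match, i.e. the doubled character pair.
doubles = [m.group(0) for m in re.finditer(r"([ab])\1", text)]
print(doubles)  # ['bb', 'aa', 'bb']
```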
I have 26 variables and each of them contains numbers ranging from 1 to 61. For each value (1, 2, and so on up to 61) I want a new variable that contains 1 if that value occurs in any of the 26 variables. If the value does not occur, the new variable should contain 2.
So 26 variables with data like:
1 15 28 39 46 1 12 etc.
And I want 61 variables with:
1 2 1 2 2 1 etc.
I have been reading about creating vectors, loops, do if's etc but I can't find the right way to code it. What I have done is just creating 61 variables and writing
do if V1=1 or V2=1 or (etc until V26).
recode newV1=1.
end if.
exe.
**repeat this for all 61 variables.
recode newV1 to newV61(missing=2).
So this is a lot of code and quite a detour from what I imagine it could be.
Anyone who can help me out with this one? Your help is much appreciated!
noumenal is correct, you could do it with two loops. Another way, though, is to access the VECTOR using the original value as the index, writing that element as 1, and setting all other values to 2 afterwards.
To illustrate, first I make some fake data (with 4 original variables instead of 26) named X1 to X4.
*Fake Data.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 20.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
VECTOR X(4,F2.0).
LOOP #i = 1 TO 4.
COMPUTE X(#i) = TRUNC(RV.UNIFORM(1,62)).
END LOOP.
EXECUTE.
Now what this code does is create four vector sets to go along with each variable, then uses DO REPEAT to actually refer to the VECTOR stub. Then finishes up with RECODE - if it is missing it should be coded a 2.
VECTOR V1_ V2_ V3_ V4_ (61,F1.0).
DO REPEAT orig = X1 TO X4 /V = V1_ V2_ V3_ V4_.
COMPUTE V(orig) = 1.
END REPEAT.
RECODE V1_1 TO V4_61 (SYSMIS = 2).
It is a little painful, as for the original VECTOR command you need to write out all of the stubs, but then you can copy-paste that into the DO REPEAT subcommand (or make a macro to do it for you).
For a simpler illustration, if we have our original variable, say A, that can take on integer values from 1 to 61, and we want to expand to our 61 dummy variables, we would then make a vector and then access the location in that vector.
VECTOR DummyVec(61,F1.0).
COMPUTE DummyVec(A) = 1.
For a record with A = 10, DummyVec10 will equal 1, and all the other DummyVec variables will still be system missing by default. No need to use DO IF for 61 values.
The rest of the code is just extra to do it in one swoop for multiple original variables.
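For readers who don't know SPSS, the index-into-a-vector idea looks roughly like this in Python. This is only an analogy, not SPSS syntax; `dummy_vector` and its parameters are hypothetical names, and the sample row is taken from the question:

```python
def dummy_vector(values, n_levels=61, present=1, absent=2):
    """Use each observed value as a (1-based) index into a vector of flags."""
    vec = [absent] * n_levels      # start with everything "missing" -> 2
    for v in values:
        vec[v - 1] = present       # the value itself picks the slot
    return vec


row = [1, 15, 28, 39, 46, 1, 12]
flags = dummy_vector(row)
print(flags[:3])  # [1, 2, 2]
```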
This should do it:
do repeat NewV=NewV1 to NewV61/vl=1 to 61.
compute NewV=any(vl,v1 to v26).
end repeat.
EXPLANATION:
This syntax will go through values 1 to 61, for each one checking whether any of the variables v1 to v26 has that value. If any of them do, the right NewV will receive the value of 1. If none of them do, the right NewV will receive the value of 0.
Just make sure v1 to v26 are consecutively ordered in the file. If not, then change to:
compute NewV=any(vl,v1, v2, v3, v4 ..... v26).
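The logic of that `any` loop, sketched in Python for a single case (again only an analogy; `any_flags` is a made-up name). Note that, like the SPSS syntax above, it gives 0 rather than 2 when a value is absent:

```python
def any_flags(row, n_levels=61):
    """For each level 1..n_levels: 1 if any variable in the row equals it, else 0."""
    return [int(level in row) for level in range(1, n_levels + 1)]


flags_any = any_flags([1, 15, 28, 39, 46, 1, 12])
print(flags_any[:3])  # [1, 0, 0]
```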
You need a nested loop: two loops - one outer and one inner.
In Stata, I am trying to use a foreach loop where I am looping over numbers from, say, 05-11. The problem is that I wish to keep the 0 as part of the value. I need to do this because the 0 appears in variable names. For example, I may have variables named Y2005, Y2006, Var05, Var06, etc. Here is an example of the code that I tried:
foreach year of numlist 05/09 {
...do stuff with Y20`year' or with Var`year'
}
This gives me an error that e.g. Y205 is not found. (I think that what is happening is that it is treating 05 as 5.)
Also note that I can't add a 0 in at the end of e.g. Y20 to get Y200 because of the 10 and 11 values.
Is there a work-around or an obvious thing I am not doing?
Another work-around is
forval y = 5/11 {
local Y : di %02.0f `y'
<code using local Y, which must be treated as a string>
}
The middle line could be based on
`: di %02.0f `y''
so that using another macro can be avoided, but at the cost of making the code more cryptic.
Here I've exploited the extra fact that foreach over such a simple numlist is replaceable with forvalues.
The main trick here is documented here. This trick avoids the very slight awkwardness of treating 5/9 differently from 10/11.
Note. To understand what is going on, it often helps to use display interactively on very simple examples. The detail here is that Stata is happily indifferent to leading zeros when presented with numbers. Usually this is immaterial to you, or indeed a feature as when you appreciate that Stata does not insist on a leading zero for numbers less than 1.
. di 05
5
. di 0.3
.3
. di .3
.3
Here we really need the leading zero, and the art is to see that the problem is one of string manipulation, the strings such as "08" just happening to contain numeric characters. Agreed that this is obvious only when understood.
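The same point, that this is string formatting rather than arithmetic, can be seen outside Stata too. A tiny Python sketch for comparison (not Stata code), where `02d` plays the role of `%02.0f`:

```python
# Zero-pad each number to two digits before gluing it onto the name stub.
names = [f"Y20{y:02d}" for y in range(5, 12)]
print(names[0], names[-1])  # Y2005 Y2011
```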
There's probably a better solution but here's how this one goes:
clear
set more off
*----- example data -----
input ///
var2008 var2009 var2010 var2011 var2012
0 1 2 3 4
end
*----- what you want -----
numlist "10(1)12"
local nums 08 09 `r(numlist)'
foreach x of local nums {
display var20`x'
}
The 01...09 you can insert manually. The rest you build with numlist. Put all that in a local, and finally use it in the loop.
As you say, the problem with your code is that Stata will read 5 when given 05 if you've told it that it is a number (which you do by using numlist in the loop).
Another solution would be to use an if command to count the number of characters in the looping value, and then if needed you can add a leading zero by reassigning the local.
clear
input var2008 var2009 var2010 var2011 var2012
0 1 2 3 4
end
foreach year of numlist 08/12{
if length("`year'") == 1 local year 0`year'
di var20`year'
}
I have an array of unsigned integers, each corresponding to a string with 12 characters, that can contain 4 different characters, namely 'A','B','C','D'. Thus the array will contain 4^12 = 16777216 elements. The ordering of the elements in the array is arbitrary; I can choose which one corresponds to each string. So far, I have implemented this as simply as that:
unsigned int my_array[16777216];
char my_string[12];
int index = string_to_index(my_string);
my_array[index] = ...;
string_to_index() simply assigns 2 bits per character like this:
A --> 00, B --> 01, C --> 10, D --> 11
For example, ABCDABCDABCD corresponds to the index (000110110001101100011011)_2 = (1776411)_10
However, I know for a fact that each string that is used to access the array is the previous string shifted once to the left with a new last character. For example after I access with ABCDABCDABCD, the next access will use BCDABCDABCDA, or BCDABCDABCDB, BCDABCDABCDC, BCDABCDABCDD.
So my question is:
Is there a better way to implement the string_to_index function to take under consideration this last fact, so that elements that are consecutively accessed are closer in the array? I am hoping to improve my caching performance by doing so.
edit: Maybe I was not very clear: I am looking for a completely different string to index correspondence scheme, so that the indexes of ABCDABCDABCD and BCDABCDABCDA are closer.
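To make the setup concrete, here is a rough Python sketch of the current 2-bits-per-character scheme and of how the index of the next string follows from the previous one. The names `CODE`, `MASK` and `next_index` are illustrative and not part of my actual code:

```python
CODE = {"A": 0, "B": 1, "C": 2, "D": 3}   # A -> 00, B -> 01, C -> 10, D -> 11
LENGTH = 12
MASK = (1 << (2 * LENGTH)) - 1            # keep only the low 24 bits


def string_to_index(s):
    """Pack the string into an integer, 2 bits per character."""
    index = 0
    for ch in s:
        index = (index << 2) | CODE[ch]
    return index


def next_index(index, new_char):
    """Index after shifting the string left by one and appending new_char."""
    return ((index << 2) & MASK) | CODE[new_char]


print(string_to_index("ABCDABCDABCD"))  # 1776411
```

With this scheme the next index is roughly four times the current one (modulo 4^12), which is why consecutive accesses usually land far apart.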
If the following assumptions are true for your problem, then the solution you implemented is the best one.
1. The rightmost character of the next string is selected at random, with equal probability for each valid character.
2. The start of the sequence is not always the same (it is random).
Reason:
When I first read your question I came up with the following tree (for simplicity I reduced your problem to strings of three characters with only 2 possible characters, A and B). Note that the leftmost child of the root node (AAA in this case) is always the same as the root node itself (AAA), hence I am not building that branch further.
        AAA
       /   \
            AAB
           /   \
       ABA       ABB
      /   \     /   \
    BAA   BAB BBA   BBB
In this tree each node has its possible next strings as child nodes. To improve cache behaviour you need to traverse this tree breadth-first and store the strings in the array in that order. For the above tree we get the following string-to-index mapping.
AAA 0
AAB 1
ABA 2
ABB 3
BAA 4
BAB 5
BBA 6
BBB 7
Assuming value(A) = 0 and value(B) = 1, index can be calculated as
index = 2^0 * (value(string[2])) + 2^1 * (value(string[1])) + 2^2 * (value(string[0]))
This is the same solution as the one you are using.
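A minimal sketch of that breadth-first numbering in Python, so you can reproduce the table for the small case. This is not the script mentioned below; `bfs_order` is a hypothetical helper, and I only claim the result for the 3-character A/B example shown here:

```python
from collections import deque


def bfs_order(length, alphabet):
    """Visit strings breadth-first; children are the left-shifted successors."""
    start = alphabet[0] * length
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch in alphabet:
            child = s[1:] + ch          # drop first char, append a new last char
            if child not in seen:       # skip branches that were already built
                seen.add(child)
                queue.append(child)
    return order


for index, s in enumerate(bfs_order(3, "AB")):
    print(s, index)
```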
I have written a python script to check this for other combinations too (like string of length 4 characters with A B C as possible characters). Script link
So unless the 2 assumptions made at the beginning are false, your solution already takes care of cache optimisation.
I think we could define "closer" first.
For example, we could define a function F which takes a method of calculating the indices of strings. Then F will check every string's index and return a certain value based on the distance of neighbor strings' indices.
Then we can compare various ways of calculating the index and find the best one.
Of course we could examine shorter strings first.
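One possible choice of F, sketched in Python under the assumption that "closer" means a small average absolute index difference between a string and each of its successors. The names `mean_jump` and `base_k_index` are made up, and other definitions of F (for example, counting cache-line crossings) may suit the real problem better:

```python
from itertools import product


def mean_jump(index_of, length, alphabet):
    """Average |index(successor) - index(s)| over all strings and successors."""
    total = 0
    count = 0
    for chars in product(alphabet, repeat=length):
        s = "".join(chars)
        for ch in alphabet:
            total += abs(index_of(s[1:] + ch) - index_of(s))
            count += 1
    return total / count


def base_k_index(s, alphabet="AB"):
    """The straightforward positional encoding, used here as the baseline."""
    index = 0
    for ch in s:
        index = index * len(alphabet) + alphabet.index(ch)
    return index


print(mean_jump(base_k_index, 3, "AB"))  # 2.0
```

Any alternative indexing function can be passed in place of `base_k_index` and compared on short strings first.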
There are several algorithms out there that print all combinations of a string, but I need one that prints them out in a specific order. Currently I am using a standard permutation algorithm similar to the one in the top answer (not the question itself) of this question: C++ recursive permutation algorithm for strings -> not skipping duplicates
For example, for the input "ABC", the output will be: ABC ACB BAC BCA CAB CBA
For the input "ACC", it will be: ACC CAC CCA
The outputs are all correct, however I need them in a different order. The input will only consist of the characters 'A' and 'C', and for convenience I sort the string alphabetically before passing it to the recursive function, so the input string will always have the same characters together (i.e. AACCC). As for the order, I want to treat the collection of 'C's as a single entity which I shift left for each set of permutations of the characters to the right of the first 'C' only. So for input "ACC", the first output is "ACC" which is OK, the next output should be "CCA" because I shifted all the 'C's one step to the left, and then permuting the characters to the right of the first 'C' in "CCA" gives the final output, which is just "CAC".
I need it to look like this for these inputs:
Input: ACC
Output: ACC CCA CAC
Input: AACC
Output:
AACC ACCA ACAC CCAA CACA CAAC
Any idea how I should modify my algorithm to produce the combinations in this order?
For a string with two distinct characters A and C, given n is the number of A's, it sounds like what you're looking for is a concatenation of these sequences: All permutations beginning with exactly n A's in reverse lexicographic order, all permutations beginning with exactly n-1 A's in reverse lexicographic order, etc. So, you could take your existing output which is in lexicographic order, and iterate over it in reverse order, selecting elements matching /^A{n}C/, /^A{n-1}C/ through /^A{0}C/ and adding them to a new collection.
You could generate this output directly by generating strings of A's of each length from n A's to zero and then for each one, append the permutations of the remaining characters in reverse lexicographic order.
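A sketch of that direct generation in Python, assuming the input contains only A's and C's as stated in the question. The function names are made up, and this replaces rather than modifies the recursive permutation routine you linked:

```python
def arrangements_desc(n_a, n_c):
    """Yield distinct arrangements of n_a A's and n_c C's, reverse lexicographic."""
    if n_a == 0 and n_c == 0:
        yield ""
        return
    if n_c > 0:                                   # 'C' sorts after 'A', so it goes first
        for rest in arrangements_desc(n_a, n_c - 1):
            yield "C" + rest
    if n_a > 0:
        for rest in arrangements_desc(n_a - 1, n_c):
            yield "A" + rest


def ordered_permutations(n_a, n_c):
    """Group by number of leading A's (n_a down to 0), each group reverse-sorted."""
    if n_c == 0:
        return ["A" * n_a]
    out = []
    for k in range(n_a, -1, -1):
        # Prefix: exactly k A's followed by the first C.
        for rest in arrangements_desc(n_a - k, n_c - 1):
            out.append("A" * k + "C" + rest)
    return out


print(" ".join(ordered_permutations(1, 2)))  # ACC CCA CAC
print(" ".join(ordered_permutations(2, 2)))  # AACC ACCA ACAC CCAA CACA CAAC
```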