Here is a typical problem: given a list of values, check whether they are present in an array.
In awk, the trick val in array works pretty well for this. Hence, the typical idea is to store all the data in an array and then keep doing the check. For example, this will print all lines in which the first column value is present in the array:
awk 'BEGIN {<<initialize the array>>} $1 in array_var' file
However, initializing the array takes some work, because val in array checks whether the index val is in array, and what we normally have stored in array is a set of values.
This becomes more relevant when the values are provided on the command line, since those are the elements that we want to use as indexes of an array. For example, consider this basic case (based on a recent answer of mine, which triggered my curiosity):
$ cat file
hello 23
bye 45
adieu 99
$ awk -v values="hello adieu" 'BEGIN {split(values,v); for (i in v) names[v[i]]} $1 in names' file
hello 23
adieu 99
split(values,v) splits the variable values into an array: v[1]="hello"; v[2]="adieu".
for (i in v) names[v[i]] initializes another array names[] with names["hello"] and names["adieu"] set to an empty value. This way, we are ready for
$1 in names, which checks if the first column is any of the indexes in names[].
As you see, we split into a temporary array v in order to initialize the final, useful array names[].
Is there any faster way to initialize the indexes of an array, instead of setting one up and then using its values as the indexes of the definitive one?
No, that is the fastest (due to hash lookup) and most robust (due to string comparison) way to do what you want.
This:
BEGIN{split(values,v); for (i in v) names[v[i]]}
happens once on startup and takes close to no time, while this:
$1 in array_var
which happens once for every line of input (and so is the place that needs optimal performance), is a hash lookup and therefore the fastest way to compare a string value against a set of strings.
Not an array solution, but one trick is to use pattern matching. To eliminate partial matches, wrap both the search string and the array values with the delimiter. For your example:
$ awk -v values="hello adieu" 'FS values FS ~ FS $1 FS' file
hello 23
adieu 99
I've got a character variable which holds a delimited list of strings, like so:
data lists;
format list_val $75.;
list_val = "PDC; QRS; OLN; ABC";
run;
I need to alphabetize the elements of each list (so the desired result when applied to the above string is "ABC; OLN; PDC; QRS;").
I adapted the solution here for my purposes as follows:
data lists_sorted;
set lists;
array_size = count(list_val,";") + 1; /* Cannot be used as array length must be specified at creation */
array t(50) $ 8 _TEMPORARY_;
call missing(of t(*));
do _n_=1 to array_size;
t(_n_)=scan(list_val,_n_,";");
end;
call sortc(of t(*));
new_list_val =catx("; ", of t(*));
put "original: " list_val " new: " new_list_val;
run;
When I run this code I get the following output:
original: PDC; QRS; OLN; ABC new: ABC; OLN; QRS; PDC
Which was not expected or desired. In general, the result of the above code applied to any list is a new list which is sorted alphabetically, except that the first element of the original list becomes the last element of the new list, regardless of its alphabetical ordering.
I can't find anything in the documentation of sortc which would explain this behavior, so I'm wondering if the issue is somehow the way I've set up the temporary array (I don't have much experience with these).
Does anyone know why sortc behaves this way? Side question: is there any way I can dynamically determine the size of the array, rather than hard-coding a value such as 50?
It is because you included the leading spaces when assigning the values to the array elements: a blank sorts before any letter, so " QRS", " OLN" and " ABC" all sort ahead of "PDC". Remove those spaces:
t(_n_)=left(scan(list_val,_n_,";"));
If you want to know the minimum array size you could use for your data step, you would need to process the dataset twice.
proc sql ;
select max(count(list_val,";") + 1) into :max_size trimmed from lists;
quit;
....
array t[&max_size] $ 8 _temporary_;
But there is probably not much harm in just using some large constant value.
Let's say I have an array of n elements. Each element is a string of comma-separated x,y coordinate pairs, e.g. "581,284". There is no set character length to these x,y values.
Say I wanted to subtract 8 from each x value, and 5 from each y value.
What would be the simplest way to modify x and y, independently of each other, without permanently splitting the x and y values apart?
e.g. the first array element "581,284" becomes "573,279", the second array element "1013,562" becomes "1005,557", and so forth.
I worked on this problem for a couple of hours (I'm an amateur at bash), and it seemed as if my approach was awfully convoluted.
Please note that the quotation marks above are only added for emphasis, and are not part of the problem.
Thank you in advance, I've been racking my head over this for a while now!
Edit: The following excerpt is the approach I was taking. I don't know much about bash, as you can tell.
while read value
do
if [[ -z $offset_list ]]
then
offset_list="$value"
else
offset_list="$offset_list,$value"
fi
done < text.txt
new_offset=${offset_list//,/ }
read -a new_array <<< $new_offset
for value in "${new_array[@]}"
do
if [[ $((value%2)) -eq 1 ]]
then
value=$((value-8));
new_array[$counter]=$value
counter=$((counter+1));
elif [[ $((value%2)) -eq 0 ]]
then
value=$((value-5));
new_array[$counter]=$value
counter=$((counter+1));
fi
done
Essentially I had originally read the coordinate pairs, and stripped the commas from them, and then planned on modifying odd/even values which were populated into the new array. At this point I realized that there had to be a more efficient way.
I believe the following should achieve what you are looking for:
#!/bin/bash
input=("581,284" "1013,562")
echo "Initial array ${input[#]}"
for index in ${!input[#]}; do
value=${input[$index]}
x=${value%%,*}
y=${value##*,}
input[$index]="$((x-8)),$((y+5))"
done
echo "Modified array ${input[#]}"
${!input[@]} allows us to loop over the indexes of the bash array.
${value%%,*} and ${value##*,} rely on bash parameter expansion to remove everything after or before the comma (respectively). This effectively splits your string into two variables.
From there, it's your required math and variable reassignment to mutate the array.
my-script.awk
#!/usr/bin/awk -f
BEGIN {
toggleValues="U+4E00,U+9FFF,U+3400,U+4DBF,U+20000,U+2A6DF,U+2A700,U+2B73F,U+2B740,U+2B81F,U+2B820,U+2CEAF,U+F900,U+FAFF"
split(toggleValues, boundaries, ",")
if ("U+4E00" in boundaries) {print "inside"}
}
Run
echo '' | awk -f my-script.awk
Question
Why don't I see inside printed?
awk stores arrays differently than you might expect. An array is a set of key/value pairs, where the key (from split()) is the integer index starting at 1 and the value is the substring that split() placed at that element.
The awk in operator tests keys, not values. So your "U+4E00" in boundaries condition isn't going to pass. Instead you'll need to iterate over the array and look for the value:
for (boundary in boundaries) { if (boundaries[boundary] == "U+4E00") { print "inside" } }
Either that or you can create a new array based on the existing one, but with the values stored as the key so the in operator will work as is.
for (i in boundaries) {boundaries2[boundaries[i]] = ""}
if ("U+4E00" in boundaries2){print "inside"}
This second method is a little hacky since all the element values are set to "", but it's useful if you are going to iterate through a large file and just want to use the in operator to test whether a field is in your array (as opposed to iterating over the array on each record, which might be more expensive).
I am trying to access some numeric values that a regular 'cat' outputs in an array.
If I do: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
I get: 51000 102000 204000 312000 ...
So I wrote the script below to get all the elements into an array, and I tried to get the number of elements.
vAvailableFrequencies=$(sudo cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies)
nAvailableFrequencies=${#vAvailableFrequencies[@]}
The problem is that nAvailableFrequencies is equal to the number of characters in the array, not the number of elements.
The idea is to be able to access each element as:
for (( i=0;i<$nAvailableFrequencies;i++)); do
element=${vAvailableFrequencies[$i]}
done
Is this possible in bash without doing something like a sequential read and inserting elements in the array one by one?
You can use an array like this:
arr=($(</sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies))
nAvailableFrequencies=${#arr[@]}
$(<file) reads and outputs the file's content, while ( ... ) creates an array.
You just need another set of parentheses around the vAvailableFrequencies assignment:
vAvailableFrequencies=($(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies))
nAvailableFrequencies=${#vAvailableFrequencies[@]}
Now you can access the elements within your for loop, or individually with ${vAvailableFrequencies[i]}, where i is the index of an element.
If you are using bash 4 or later, use
readarray -t vAvailableFrequencies < /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
(or, if you need to use sudo,
readarray -t vAvailableFrequencies < <(sudo cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies)
)
I want to check if any string in an array of strings is a prefix of any other string in the same array. I'm thinking radix sort, then single pass through the array.
Anyone have a better idea?
I think radix sort can be modified to retrieve prefixes on the fly. All we have to do is sort the lines by their first letter, storing in each cell a copy of each line with its first letter removed. Then, if a cell contains an empty line, that line corresponds to a prefix. And if a cell contains only one entry, then of course there are no possible prefix lines in it.
Here, this might be cleaner than my English:
lines = [
    "qwerty",
    "qwe",
    "asddsa",
    "zxcvb",
    "zxcvbn",
    "zxcvbnm"
]

line_lines = [(line, line) for line in lines]

def find_sub(line_lines):
    # One cell per letter of [a-z]; each cell collects (remainder, original) pairs.
    cells = [[] for i in range(26)]
    for (rest, line) in line_lines:
        if rest == "":
            # The whole line was consumed along this path, so it is a
            # prefix of every other line that reached the same cell.
            print(line)
        else:
            index = ord(rest[0]) - ord('a')
            cells[index] += [(rest[1:], line)]
    for cell in cells:
        if len(cell) > 1:
            find_sub(cell)

find_sub(line_lines)
If you sort them, you only need to check whether each string is a prefix of the next.
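For example, a minimal sketch of that idea in Python (the function name find_prefix_pairs is mine, not from the answer):

def find_prefix_pairs(strings):
    s = sorted(strings)
    # After sorting, a prefix lands immediately before some string it
    # prefixes, so comparing neighbors is enough to detect one.
    return [(a, b) for a, b in zip(s, s[1:]) if b.startswith(a)]

print(find_prefix_pairs(["qwerty", "qwe", "asddsa", "zxcvb", "zxcvbn", "zxcvbnm"]))
# [('qwe', 'qwerty'), ('zxcvb', 'zxcvbn'), ('zxcvbn', 'zxcvbnm')]

Sorting costs O(N log N) comparisons, and the scan then does one startswith per adjacent pair.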
To achieve a time complexity close to O(N²), compute hash values for each string.
Come up with a good hash function that looks something like:
A mapping from [a-z] -> [1,26]
A modulo operation (use a large prime) to prevent integer overflow
So something like "ab", whose letters map to 1 and 2, gets computed as 1*27 + 2 = 29
A point to note:
Be careful what base you compute the hash value on. For example, if you take a base less than 27, two different strings can give the same hash value, and we don't want that.
Steps:
Compute the hash value for each string
Compare the hash value of the current string with those of the other strings; I'll let you figure out how to do that comparison. Once two hashes match, you are still not sure it is really a prefix (due to the modulo operation that we did), so do an extra check to see if the strings are prefixes; see the sketch after these steps.
Report the answer
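A minimal sketch of this approach in Python, assuming lowercase [a-z] input; the names (BASE, MOD, has_prefix_pair) and the exact bookkeeping are illustrative, not prescribed above:

BASE = 27         # base > 26 so distinct strings differ before the modulo
MOD = 10**9 + 7   # large prime to prevent integer overflow

def char_val(c):
    return ord(c) - ord('a') + 1   # the [a-z] -> [1,26] mapping

def string_hash(s):
    h = 0
    for c in s:
        h = (h * BASE + char_val(c)) % MOD
    return h

def has_prefix_pair(strings):
    # Index every full string by (length, hash) so candidates are O(1) lookups.
    by_len_hash = {}
    for s in strings:
        by_len_hash.setdefault((len(s), string_hash(s)), []).append(s)
    for s in strings:
        # Roll the hash over s, checking each proper prefix against the index.
        h = 0
        for i, c in enumerate(s):
            h = (h * BASE + char_val(c)) % MOD
            if i + 1 == len(s):
                break   # skip s itself; duplicates would need a separate check
            for t in by_len_hash.get((i + 1, h), []):
                if s.startswith(t):   # extra check: the modulo can cause collisions
                    return (t, s)
    return None

print(has_prefix_pair(["qwerty", "qwe", "asddsa"]))   # prints ('qwe', 'qwerty')

The rolling hash makes each candidate comparison O(1), with a real string comparison only when the hashes match.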