How do I create a string of array combinations given a list of "source code" strings? - arrays

Basically, I’m given a list of strings such as:
["structA.structB.myArr[6].myVar",
"structB.myArr1[4].myArr2[2].myVar",
"structC.myArr1[3][4].myVar",
"structA.myArr1[4]",
"structA.myVar"]
These strings are describing variables/arrays from multiple structs. The integers in the arrays describe the size each array. Given a string has a/multiple arrays (1d or 2d), I want to generate a list of strings which go through each index combination in the array for that string. I thought of using for loops but issue is I don’t know how many arrays are in a given string before running the script. So I couldn’t do something like
for i in range (0, idx1):
for j in range (0, idx2):
for k in range (0, idx3):
arr.append(“structA.myArr1[%i][%i].myArr[%i]” %(idx1,idx2,idx3))
but the issue is that I don’t know how I can create multiple/dynamic for loops based on how many indexes and how I could create a dynamic append statement that changes per each string from the original list since each string will have a different number of indexes and the arrays will be in different locations of the string.
I was able to write a regex to find all the index for each string in my list of strings:
indexArr = re.findall('\[(.*?)\]', myString)
//after looping, indexArr = [['6'],['4','2'],['3','4'],['4']]
however I'm really stuck on how to achieve the "dynamic for loops" or use recursion for this. I want to get my ending list of strings to look like:
[
["structA.structB.myArr[0].myVar",
"structA.structB.myArr[1].myVar",
...
"structA.structB.myArr[5].myVar”],
[“structB.myArr1[0].myArr2[0].myVar",
"structB.myArr1[0].myArr2[1].myVar",
"structB.myArr1[1].myArr2[0].myVar",
…
"structB.myArr1[3].myArr2[1].myVar”],
[“structC.myArr1[0][0].myVar",
"structC.myArr1[0][1].myVar",
…
"structC.myArr1[2][3].myVar”],
[“structA.myArr1[0]”,
…
"structA.myArr1[3]”],
[“structA.myVar”] //this will only contain 1 string since there were no arrays
]
I am really stuck on this, any help is appreciated. Thank you so much.

The key is to use itertools.product to generate all possible combinations of a set of ranges and substitute them as array indices of an appropriately constructed string template.
import itertools
import re
def expand(code):
p = re.compile('\[(.*?)\]')
ranges = [range(int(s)) for s in p.findall(code)]
template = p.sub("[{}]", code)
result = [template.format(*s) for s in itertools.product(*ranges)]
return result
The result of expand("structA.structB.myArr[6].myVar") is
['structA.structB.myArr[0].myVar',
'structA.structB.myArr[1].myVar',
'structA.structB.myArr[2].myVar',
'structA.structB.myArr[3].myVar',
'structA.structB.myArr[4].myVar',
'structA.structB.myArr[5].myVar']
and expand("structB.myArr1[4].myArr2[2].myVar") is
['structB.myArr1[0].myArr2[0].myVar',
'structB.myArr1[0].myArr2[1].myVar',
'structB.myArr1[1].myArr2[0].myVar',
'structB.myArr1[1].myArr2[1].myVar',
'structB.myArr1[2].myArr2[0].myVar',
'structB.myArr1[2].myArr2[1].myVar',
'structB.myArr1[3].myArr2[0].myVar',
'structB.myArr1[3].myArr2[1].myVar']
and the corner case expand("structA.myVar") naturally works to produce
['structA.myVar']

Related

How to structure multiple python arrays for sorting

A fourier analysis I'm doing outputs 5 data fields, each of which I've collected into 1-d numpy arrays: freq bin #, amplitude, wavelength, normalized amplitude, %power.
How best to structure the data so I can sort by descending amplitude?
When testing with just one data field, I was able to use a dict as follows:
fourier_tuples = zip(range(len(fourier)), fourier)
fourier_map = dict(fourier_tuples)
import operator
fourier_sorted = sorted(fourier_map.items(), key=operator.itemgetter(1))
fourier_sorted = np.argsort(-fourier)[:3]
My intent was to add the other arrays to line 1, but this doesn't work since dicts only accept 2 terms. (That's why this post doesn't solve my issue.)
Stepping back, is this a reasonable approach, or are there better ways to combine & sort separate arrays? Ultimately, I want to take the data values from the top 3 freqs and associated other data, and write them to an output data file.
Here's a snippet of my data:
fourier = np.array([1.77635684e-14, 4.49872050e+01, 1.05094837e+01, 8.24322470e+00, 2.36715913e+01])
freqs = np.array([0. , 0.00246951, 0.00493902, 0.00740854, 0.00987805])
wavelengths = np.array([inf, 404.93827165, 202.46913583, 134.97942388, 101.23456791])
amps = np.array([4.33257766e-16, 1.09724890e+00, 2.56328871e-01, 2.01054261e-01, 5.77355886e-01])
powers% = np.array([4.8508237956526163e-32, 0.31112370227749603, 0.016979224022185751, 0.010445983875848858, 0.086141014686372669])
The last 4 arrays are other fields corresponding to 'fourier'. (Actual array lengths are 42, but pared down to 5 for simplicity.)
You appear to be using numpy, so here is the numpy way of doing this. You have the right function np.argsort in your post, but you don't seem to use it correctly:
order = np.argsort(amplitudes)
This is similar to your dictionary trick only it computes the inverse shuffling compared to your procedure. Btw. why go through a dictionary and not simply a list of tuples?
The contents of order are now indices into amplitudes the first cell of order contains the position of the smallest element of amplitudes, the second cell contains the position of the next etc. Therefore
top5 = order[:-6:-1]
Provided your data are 1d numpy arrays you can use top5 to extract the elements corresponding to the top 5 ampltiudes by using advanced indexing
freq_bin[top5]
amplitudes[top5]
wavelength[top5]
If you want you can group them together in columns and apply top5 to the resulting n-by-5 array:
np.c_[freq_bin, amplitudes, wavelength, ...][top5, :]
If I understand correctly you have 5 separate lists of the same length and you are trying to sort all of them based on one of them. To do that you can either use numpy or do it with vanilla python. Here are two examples from top of my head (sorting is based on the 2nd list).
a = [11,13,10,14,15]
b = [2,4,1,0,3]
c = [22,20,23,25,24]
#numpy solution
import numpy as np
my_array = np.array([a,b,c])
my_sorted_array = my_array[:,my_array[1,:].argsort()]
#vanilla python solution
from operator import itemgetter
my_list = zip(a,b,c)
my_sorted_list = sorted(my_list,key=itemgetter(1))
You can then flip the array with my_sorted_array = np.fliplr(my_sorted_array) if you wish or if you are working with lists you can reverse it in place with my_sorted_list.reverse()
EDIT:
To get first n values only, you have to simply slice the array similarly to what #Paul is suggesting. Slice is done in a similar manner to classic list slicing by specifying start:stop:step (you can omit the step) arguments. In your case for 5 top columns it would be [:,-5:]. So in the example above you can take top 2 columns from each row like this:
my_sliced_sorted_array = my_sorted_array[:,-2:]
result will be:
array([[15, 13],
[ 3, 4],
[24, 20]])
Hope it helps.

Apply a string value to several positions of a cell array

I am trying to create a string array which will be fed with string values read from a text file this way:
labels = textread(file_name, '%s');
Basically, for each string in each line of the text file file_name I want to put this string in 10 positions of a final string array, which will be later saved in another text file.
What I do in my code is, for each string in file_name I put this string in 10 positions of a temporary cell array and then concatenate this array with a final array this way:
final_vector='';
for i=1:size(labels)
temp_vector=cell(1,10);
temp_vector{1:10}=labels{i};
final_vector=horzcat(final_vector,temp_vector);
end
But when I run the code the following error appears:
The right hand side of this assignment has too few values to satisfy the left hand side.
Error in my_example_code (line 16)
temp_vector{1:10}=labels{i};
I am too rookie in cell strings in matlab and I don't really know what is happening. Do you know what is happening or even have a better solution to my problem?
Use deal and put the left hand side in square brackets:
labels{1} = 'Hello World!'
temp_vector = cell(10,1)
[temp_vector{1:10}] = deal(labels{1});
This works because deal can distribute one value to multiple outputs [a,b,c,...]. temp_vector{1:10} alone creates a comma-separated list and putting them into [] creates the output array [temp_vector{1}, temp_vector{2}, ...] which can then be populated by deal.
It is happening because you want to distribute one value to 10 cells - but Matlab is expecting that you like to assign 10 values to 10 cells. So an alternative approach, maybe more logic, but slower, would be:
n = 10;
temp_vector(1:n) = repmat(labels(1),n,1);
I also found another solution
final_vector='';
for i=1:size(labels)
temp_vector=cell(1,10);
temp_vector(:,1:10)=cellstr(labels{i});
final_vector=horzcat(final_vector,temp_vector);
end

Concise way to create an array filled within a range in Matlab

I need to create an array filled within a range in Matlab
e.g.
from=2
to=6
increment=1
result
[2,3,4,5,6]
e.g.
from=15
to=25
increment=2
result
[15,17,19,21,23,25]
Obviously I can create a loop to perform this action from scratch but I wondering if there is a coincise and efficent way to do this with built-in matlab commands since seems a very common operation
EDIT
If I use linspace the operation is weird since the spacing between the points is (x2-x1)/(n-1).
This can be handled simply by the : operator in the following notation
array = from:increment:to
Note that the increment defaults to 1 if written with only one colon seperator
array = from:to
Example
array1 = 2:6 %Produces [2,3,4,5,6]
array2 = 15:2:25 %Produces [15,17,19,21,23,25]

Efficient allocation of cell array in matlab

I have some which converts a cell array of strings into a cell array of characters.
Note. For a number of reasons, both the input (C) and the output (C_itemised) must be cell arrays.
The cell array of strings (C) is as follows:
>> C(1:10)
ans =
't1416933446'
''
't1416933446'
''
't1416933446'
''
't1416933446'
''
't1416933446'
''
I have only shown a portion of the array here. In reality it is ~28,000 rows in length.
I have some code which does this, although it is very inefficient. The cellstr function takes up 72% of the code's time, as it is currently called thousands of times. The code is as follows:
C_itemised=cell(length(C),500);
for i=3:length(C)
temp=char(C{i});
for j=1:length(temp)
C(i-2,j)=cellstr(temp(j));
end
end
I have a feeling that some minor modifications could take out the inner loop, thus cutting down the overall running time substantially. I have tried a number of ways to do this, but I think I keep getting confused about whether to use {} or (), and haven't been able to find anything online that can help me. Can anyone see a way to make the code more efficient?
Please also note that this function is used in conjunction with other functions, and does work, although it is running slower than would be ideal. Therefore, I do not wish to change the format of C_itemised.
EDIT:
(A sample of) the output of my current function is:
C_itemised(1,1:12)
ans =
Columns 1 through 12
't' '1' '4' '1' '6' '9' '3' '3' '4' '4' '6' []
One thing I can suggest is to use the undocumented function sprintfc. This function is hidden from normal use in MATLAB, but it is used internally with a variety of other functions. Mainly, if you tried doing help sprintfc, it'll say that there's no function found! It's cool to sniff around the source sometimes!
How sprintfc works is that you provide it a formatting string, much like printf, and the data you want printed. It will take each individual element in the data and place them into individual cell arrays. As an example, supposing I had a string D = 'abcdefg';, if we did:
out = sprintfc('%c', D);
We get:
>> celldisp(out)
out{1} =
a
out{2} =
b
out{3} =
c
out{4} =
d
out{5} =
e
out{6} =
f
out{7} =
g
As such, it takes each element in your string and places them as individual characters serving as individual elements in a new cell array. The %c formatting string means that we want to print a single character per element. Check out the link to Undocumented MATLAB that I posted above if you want to learn more!
Therefore, try simplifying your loop to this:
C_itemised=cell(length(C));
for i=1:length(C)
C_itemised{i} = sprintfc('%c', C{i});
end
C_itemised will be a cell array, where each element C_itemised{i} is another cell array, with each element in this cell array being a single character that is composed of the string C{i}.
Minor Note
You said you were confused about {} and () in MATLAB for cells. {} is used to access individual elements inside the cell. So doing C{1} for example will grab whatever is stored in the first element of the cell array. () is used to slice and index into the cells. For example, if you wanted to make another cell array that is a subset of the current one, you would do something like C(1:3). This will create a three element cell array which is composed of the first three cells in C.

Algorithm - check if any string in an array of strings is a prefix of any other string in the same array

I want to check if any string in an array of strings is a prefix of any other string in the same array. I'm thinking radix sort, then single pass through the array.
Anyone have a better idea?
I think, radix sort can be modified to retrieve prefices on the fly. All we have to do is to sort lines by their first letter, storing their copies with no first letter in each cell. Then if the cell contains empty line, this line corresponds to a prefix. And if the cell contains only one entry, then of course there are no possible lines-prefices in it.
Here, this might be cleaner, than my english:
lines = [
"qwerty",
"qwe",
"asddsa",
"zxcvb",
"zxcvbn",
"zxcvbnm"
]
line_lines = [(line, line) for line in lines]
def find_sub(line_lines):
cells = [ [] for i in range(26)]
for (ine, line) in line_lines:
if ine == "":
print line
else:
index = ord(ine[0]) - ord('a')
cells[index] += [( ine[1:], line )]
for cell in cells:
if len(cell) > 1:
find_sub( cell )
find_sub(line_lines)
If you sort them, you only need to check each string if it is a prefix of the next.
To achieve a time complexity close to O(N2): compute hash values for each string.
Come up with a good hash function that looks something like:
A mapping from [a-z]->[1,26]
A modulo operation(use a large prime) to prevent overflow of integer
So something like "ab" gets computed as "12"=1*27+ 2=29
A point to note:
Be careful what base you compute the hash value on.For example if you take a base less than 27 you can have two strings giving the same hash value, and we don't want that.
Steps:
Compute hash value for each string
Compare hash values of current string with other strings:I'll let you figure out how you would do that comparison.Once two strings match, you are still not sure if it is really a prefix(due to the modulo operation that we did) so do a extra check to see if they are prefixes.
Report answer

Resources