Comparison of two array elements and calculation - arrays

I have an issue with a section of code I wish to write. My problem is based around two arrays and the elements they encompass.
I have two arrays filled with numbers (relating to positions in a string). I wish to select the substrings between the positions. The elements in the first array are the start of the substrings and the elements in the second array are the ends of the substrings.
The code I have supplied reads in the file and makes it a string:
>demo_data
theoemijono
milotedjonoted
dademimamted
String:
theoemijonomilotedjonoteddademimamted
so what I want to happen is to extract the substring
emijonomiloted
emimamted
The code I have written takes each element of the first array and compares it with the corresponding element of the second array, to ensure that there is no crossover and hence that the substring starts with emi and ends with ted, as seen in the provided sequences:
for ($i = 0; $i <= 10; $i++)
{
    if ($rs1_array[$i] < $rs2_array[$i] && $rs1_array[$i+1] > $rs2_array[$i])
    {
        my $size = $rs2_array[$i] - $rs1_array[$i] + 3;
        my $substr = substr($seq, $rs1_array[$i], $size);
        print $substr."\n";
    }
}
Using this code works for the first substring, but the second substring is ignored as the first array has fewer elements and hence the comparison cannot be completed.
UPDATE
Array structures:
@rs1_array = (4, 28);
@rs2_array = (15, 22, 34);
Hi borodin, you were absolutely correct. I have edited the code now! Thank you for spotting the length issue. The reason for the strange offset is that each value in @rs2_array is the start position of "ted" and does not account for the rest of the word, which I need to complete the string. The arrays are built correctly: the elements of @rs1_array are the start positions of each "emi", and the elements of @rs2_array are the start positions of each "ted". As there are two emi's and three ted's in the string, the arrays are unbalanced.

my @starts = ( 4, 28 );
my @ends = map $_+3, ( 15, 22, 34 );
my $starts_idx = my $ends_idx = 0;
while ($starts_idx < @starts && $ends_idx < @ends) {
    # Skip any end that falls before the current start.
    if ($starts[$starts_idx] > $ends[$ends_idx]) {
        ++$ends_idx;
        next;
    }
    my $length = $ends[$ends_idx] - $starts[$starts_idx];
    say substr($seq, $starts[$starts_idx], $length);
    ++$ends_idx;
    ++$starts_idx;
}
Which, of course, gives the same output as:
say for $seq =~ /(emi(?:(?!emi|ted).)*ted)/sxg;
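For readers who do not use Perl, the pairing of start and end positions can be sketched in Python. This is only an illustration under stated assumptions: the function name `extract_substrings` and the parameter `end_len` (the length of the end marker, 3 for "ted") are made up for this example, and both position lists are assumed to be sorted in ascending order.

```python
def extract_substrings(seq, starts, ends, end_len=3):
    """Pair each start with the first end marker that follows it."""
    results = []
    si = ei = 0
    while si < len(starts) and ei < len(ends):
        if starts[si] > ends[ei]:
            ei += 1                      # this end lies before the current start: skip it
            continue
        stop = ends[ei] + end_len        # one past the last character of the end marker
        results.append(seq[starts[si]:stop])
        si += 1
        ei += 1
    return results


seq = "theoemijonomilotedjonoteddademimamted"
for piece in extract_substrings(seq, [4, 28], [15, 22, 34]):
    print(piece)
```

Because each index only moves forward, the loop runs in time linear in the combined length of the two position lists.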

Related

How to replace a specific character in an array with two characters

So I just came back from a job interview, and one of the questions I had to face was:
"Given an array of characters and three characters for example :
Array : [a,b,c,z,s,w,y,z,o]
Char 1: 'z'
Char 2 : 'R'
Char 3 : 'R'
Your goal is to replace each 'z' in the array with two 'R' characters in O(N) time.
so your input will be Array : [a,b,c,z,s,w,y,z,o]
and your output array will be : [a,b,c,R,R,s,w,y,R,R,o]
assume that there is no 'R' in the array before.
You are not allowed to use other arrays or other variables.
The algorithm should be in-line algorithm.
Your final array must be a characters array."
My solution had O(N^2) time complexity, but there is a solution with O(N) time complexity.
The interview is over but I am still thinking about this problem. Can anyone help me solve it?
First scan the input to count how many occurrences of char 1 exist. This has a linear time complexity.
From that you know that the length of the final array will be the input length + the number of occurrences.
Then extend the array to its new length, leaving the new slots empty (or whatever value). The exact nature of the operation depends on how the array data structure is implemented. This can surely be done with at worst a linear time complexity.
Use two indexes, i and j, where i references the last character of the input array and j references the very last index in the array (potentially to an empty slot).
Copy from i to j, decreasing both indices by one after each copy. If the character at i is the matching letter, write the two replacement characters instead (at j and j-1), reducing j by two and i by one. This again has linear time complexity.
The algorithm will end with both i and j equal to -1.
Do two iterations.
First, count the number of char1s ('z' in your example).
Now you know how long your array should be at the end: array.size() + num_char1s
Then, go from last to first with input and output iterators. If the element is char1, insert to the end iterator the new chars, otherwise - just copy.
Pseudo code:
num_char1s = 0
for x in array:
    if x == char1:
        num_char1s++
// Assuming array has sufficient memory already allocated.
out_iterator = num_char1s + size - 1
in_iterator = size - 1
while (in_iterator >= 0):
    if (array[in_iterator] == char1):
        array[out_iterator--] = char3
        array[out_iterator--] = char2
    else:
        array[out_iterator--] = array[in_iterator]
    in_iterator--
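A runnable version of the two-pass idea, as a Python sketch. The function name `expand_in_place` is made up for this example. Python lists grow dynamically, so `extend` stands in for the "sufficient memory already allocated" assumption; it does build a short temporary list of placeholders, which a strict reading of the interview constraints might not allow.

```python
def expand_in_place(chars, target, first, second):
    """Replace each `target` in the list `chars` with `first`, `second`, in place."""
    count = sum(1 for c in chars if c == target)   # pass 1: count the matches
    read = len(chars) - 1                          # index of the last original character
    chars.extend([None] * count)                   # grow the same list by one slot per match
    write = len(chars) - 1                         # index of the last slot of the grown list
    while read >= 0:                               # pass 2: copy from right to left
        if chars[read] == target:
            chars[write] = second
            chars[write - 1] = first
            write -= 2
        else:
            chars[write] = chars[read]
            write -= 1
        read -= 1
    return chars


print("".join(expand_in_place(list("abczswyzo"), "z", "R", "R")))
```

Copying from the right is what makes this safe: the write index never drops below the read index, so no unread character is overwritten.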
In your question, two constraints are very important:
you can't use a new variable
you can't use a new array
So we must use the given array.
First, double the size of the given array. Why? Because the new array is at most given_array_size*2 long (if every character equals char 1).
Next, shift the contents of the array right by n positions, where n = given_array_size.
Now iterate over the shifted data, from i = n to 2*n-1.
Use j = 0 as the write position. If array[i] is char 1,
set array[j++] = char 2 and array[j++] = char 3.
If the character is not char 1, simply copy it: array[j++] = array[i].
At the end, positions 0 to j-1 hold the answer.
Complexity: O(n)
No additional array is needed, only the index variables i and j.

Picking random indexes into a sorted array

Let's say I have a sorted array of values:
int n = 4; // always less than or equal to the number of unique values in the array
int i[256] = {};
int v[] = {1, 1, 2, 4, 5, 5, 5, 5, 5, 7, 7, 9, 9, 11, 11, 13};
// EX 1 ^ ^ ^ ^
// EX 2 ^ ^ ^ ^
// EX 3 ^ ^ ^ ^
I would like to generate n random index values i[0] ... i[n-1], so that:
v[i[0]] ... v[i[n-1]] each point to a unique number (i.e. must not point to 5 twice)
Each index must point to the rightmost occurrence of its number (i.e. must point to the last 5)
An index to the final number (13 in this case) should always be included.
What I've tried so far:
Getting the indexes to the last of the unique values
Shuffling the indexes
Pick out the n first indexes
I'm implementing this in C, so the more standard C functions I can rely on and the shorter code, the better. (For example, shuffle is not a standard C function, but if I must, I must.)
Create an array of the last index values
int last[] = { 1, 2, 3, 8, 10, 12, 14 };
Fisher-Yates shuffle the array.
Take the first n-1 elements from the shuffled array.
Add the index to the final number.
Sort the resulting array, if desired.
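The steps above can be sketched in Python (the question asks for C, so treat this as an illustration of the logic rather than a drop-in solution). The function name `pick_indices` and the `rng` parameter are made up for this example; the sketch assumes `v` is sorted and non-empty, and that `n` is between 1 and the number of unique values.

```python
import random


def pick_indices(v, n, rng=random):
    """Return n sorted indices into sorted list v, each the last of a distinct value."""
    # Last index of every value except the final one, which is always included.
    last = [i for i in range(len(v) - 1) if v[i] != v[i + 1]]
    if not v or not 1 <= n <= len(last) + 1:
        raise ValueError("n must be between 1 and the number of unique values")
    # Partial Fisher-Yates shuffle: only the first n-1 positions need shuffling.
    for i in range(n - 1):
        j = rng.randrange(i, len(last))
        last[i], last[j] = last[j], last[i]
    return sorted(last[:n - 1] + [len(v) - 1])


v = [1, 1, 2, 4, 5, 5, 5, 5, 5, 7, 7, 9, 9, 11, 11, 13]
print(pick_indices(v, 4))
```

Stopping the shuffle after n-1 swaps gives the same distribution for the chosen prefix as shuffling the whole array, and it saves work when n is small.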
This algorithm is called reservoir sampling, and can be used whenever you know how big a sample you need but not how many elements you're sampling from. (The name comes from the idea that you always maintain a reservoir of the correct number of samples. When a new value comes in, you mix it into the reservoir, remove a random element, and continue.)
Create the return value array sample of size n.
Start scanning the input array. Each time you find a new value, add its index to the end of sample, until you have n sampled elements.
Continue scanning the array, but now when you find a new value:
a. Choose a random number r in the range [0, i) where i is the number of unique values seen so far.
b. If r is less than n, overwrite element r with the new element.
When you get to the end, sort sample, assuming you need it to be sorted.
To make sure you always have the last element in the sample, run the above algorithm to select a sample of size n-1. Only consider a new element when you have found a bigger one.
The algorithm is linear in the size of v (plus an n log n term for the sort in the last step.) If you already have the list of last indices of each value, there are faster algorithms (but then you would know the size of the universe before you started sampling; reservoir sampling is primarily useful if you don't know that.)
In fact, it is not conceptually different from collecting all the indices and then finding the prefix of a Fisher-Yates shuffle. But it uses O(n) temporary memory instead of enough to store the entire index list, which may be considered a plus.
Here's an untested sample C implementation (which requires you to write the function randrange()):
/* Produces (in `out`) a uniformly distributed sample of maximum size
 * `outlen` of the indices of the last occurrences of each unique
 * element in `in` with the requirement that the last element must
 * be in the sample.
 * Requires: `in` must be sorted.
 * Returns: the size of the generated sample, which will be `outlen`
 * unless there were not enough unique elements.
 * Note: `out` is not sorted, except that the last element in the
 * generated sample is the last valid index in `in`
 */
size_t sample(int* in, size_t inlen, size_t* out, size_t outlen) {
    size_t found = 0;
    if (inlen && outlen) {
        // The last output is fixed so we need outlen-1 random indices
        --outlen;
        int prev = in[0];
        for (size_t curr = 1; curr < inlen; ++curr) {
            if (in[curr] == prev) continue;
            // Add curr - 1 to the output
            if (found < outlen) {
                // Reservoir not yet full: append
                out[found++] = curr - 1;
            } else {
                // Reservoir full: overwrite a random slot with probability outlen/found
                size_t r = randrange(0, ++found);
                if (r < outlen) out[r] = curr - 1;
            }
            prev = in[curr];
        }
        // Add the last index to the output
        if (found > outlen) found = outlen;
        out[found++] = inlen - 1;
    }
    return found;
}

Selecting an element of an array without specifying an array in awk

I'd like to select a specific element of an array from a file with awk, where the file is not set up with every entry specified as part of an array. I plan to put this in a for loop or assign it to a variable to be used for arithmetic operations. However, I am finding that I cannot use my way of selecting the element when assigning it to a variable or using it in a for loop.
1 2 3 4
5 6 7 8
9 8 7 6
If these elements are not specified in awk as being part of an array, referencing them could be done with
FNR == 1 {print $3}
However, I cannot assign this as a variable to be used later, nor can I put this in a loop.
Is there another way to reference a single element of an array without having to restructure the input file?
You can read the file into an array, then access the array. When accessing the array, use split:
{ array[NR] = $0 }
After the input scanning is complete, array[42] gives you the contents of record #42, usually the 42nd line of the input. We can put in an END { ... } block where we process the array.
To get the third element of array[1], we can do this:
split(array[1], fields)
Now we have an array called fields. fields[3] holds the same datum that $3 held while the first record, which we assigned to array[1], was being processed.
In Awk we can also simulate two-dimensional arrays, by catenating multiple indices together with some unambiguous separator, like a space or dash.
{ for (i = 1; i <= NF; i++)
    array[NR "-" i] = $i }
After this executes for every input record, we can access $3 from record 1 as array["1-3"]. The key 1-3 is a character string.
The expression NR "-" i in the loop body places several expressions next to each other with no operators in between. That denotes string catenation. When NR is 17 and i is 5, we get the string "17-5" and so on.
Since the number of fields per record is variable, we could have another array which gives the NF value for each element of array.
{ nf[NR] = NF;
  for (i = 1; i <= NF; i++)
    array[NR "-" i] = $i }
Now we know that if nf[17] is 5, the fields array["17-1"] through array["17-5"] are valid.
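For comparison, the same composite-key idea can be written in Python, where a dictionary keyed by a (record, field) pair plays the role of the simulated two-dimensional awk array. This is only a sketch: the function name `load_fields` is made up, and numbering is kept 1-based to match awk's NR and field numbers.

```python
def load_fields(lines):
    """Map (record_number, field_number) to field text, both 1-based as in awk."""
    table, nf = {}, {}
    for nr, line in enumerate(lines, start=1):
        fields = line.split()            # whitespace splitting, like awk's default FS
        nf[nr] = len(fields)             # number of fields in this record
        for i, value in enumerate(fields, start=1):
            table[nr, i] = value
    return table, nf


table, nf = load_fields(["1 2 3 4", "5 6 7 8", "9 8 7 6"])
print(table[1, 3], nf[1])
```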

Error when copying a word to array character by character

I'm trying to copy an unknown length of characters into an array, but I keep getting an error. I'm getting this from a website converted to text. sites holds the position of the first character of each word (I want to copy 4 words), and result is the whole text file.
I keep getting this error:
Subscript indices must either be real positive integers or logicals.
for this line: webget = result(sites(i)+n);
for i = 0:3; %for finding first 4
    webget = 'p'; %placeholder
    website = []; %blank
    while strcmp(webget,' ') == 0;
        for n = 0:150; %letter by letter, arbitrary search length
            webget = result(sites(i)+n);
            website = strcat(website,webget);
        end
    end
    website(i) = website;
end
Could anyone help?
MATLAB array indices start at 1, not 0. On your first loop iteration i = 0, so sites(0) is not a valid index.
Consider using i = 1:4.

Search Algorithm with Incomplete Input

I need an algorithm which will search an array for a string, but the string may not be exactly the same as one of the items in the array.
For example,
Array = {"Stack", "Over", "Flow", "Stake"}
input = "Sta"
It will need to recognize that Stack and Stake both match the parameters and then choose the one which is first in alphabetical order.
How can I do this?
I would use a List and do a binarySearch on that list.
List<String> arr = new ArrayList<>();
Add the elements; to keep the list sorted while adding, you can do the following:
int x = Collections.binarySearch(arr, key);
if (x < 0)
    arr.add(-x-1, key);
// for n elements this takes O(n log n) comparisons
Then you can binary search the list. If the result of binarySearch is >= 0, the key exists in your list; otherwise (-x-1) is the position where the key would be inserted. From there, go through each element that begins with the input string.
For example, arr is your array and you are searching for input.
List<String> arr = Arrays.asList("Flow", "Over", "Stack", "Stake"); // must be sorted
String input = "Sta";
int x = Collections.binarySearch(arr, input);
if (x < 0)
    x = -x - 1; // insertion point: index of the first element greater than input
if (x < arr.size() && arr.get(x).startsWith(input))
    System.out.println(arr.get(x));
else
    System.out.println("there is no element starting with input string");
Time complexity is O(logn) where n is array's length.
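The same binary-search lookup can be sketched in Python with the standard `bisect` module. The function name `first_with_prefix` is made up for this example, and the word list is assumed to be sorted already.

```python
from bisect import bisect_left


def first_with_prefix(sorted_words, prefix):
    """Return the alphabetically first word starting with prefix, or None."""
    i = bisect_left(sorted_words, prefix)    # first position whose word is >= prefix
    if i < len(sorted_words) and sorted_words[i].startswith(prefix):
        return sorted_words[i]
    return None


words = sorted(["Stack", "Over", "Flow", "Stake"])
print(first_with_prefix(words, "Sta"))
```

This works because every word that starts with the prefix compares greater than or equal to the prefix itself, so the first such word, if any, sits exactly at the insertion point.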
Loop over the sorted array, compute the Levenshtein distance between each string and your target string, and if it is sufficiently small, return.
What constitutes "sufficiently small" is up to you. You'll probably have to do some testing.
Simply loop through each element in the array and compare it to the input, determining if the input is contained in the element. Remove any element that does not meet this prerequisite. Finally go through the remaining elements and pick the one that is first alphabetically.
Loop through all the index values of the array and find the substring match of the input. Find all the matches and print the one whose index value is the lowest.
For example, you will find a substring match for Array[0] and Array[3], so you have two matches, at 0 and 3. Compare the next letter after the matched substring: in Array[0] the letter after Sta is 'c', while in Array[3] it is 'k'. Since c < k, the output is Array[0].
You may find Trie data structure useful. It is very efficient to find all words you need.
But memory overhead can be significant if you have many words in the list.
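A minimal trie sketch in Python, using nested dictionaries as nodes. The names `build_trie` and `first_match` are made up for this example, and storing the complete word under an empty-string key is just one possible way to mark where a word ends.

```python
def build_trie(words):
    """Build a trie of nested dicts; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word
    return root


def first_match(trie, prefix):
    """Return the alphabetically first stored word starting with prefix, or None."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return None
        node = node[ch]
    # Keep following the smallest child until a node that ends a word is reached.
    while node and "" not in node:
        node = node[min(node)]
    return node.get("")


trie = build_trie(["Stack", "Over", "Flow", "Stake"])
print(first_match(trie, "Sta"))
```

The lookup cost depends on the length of the words rather than on how many words are stored, which is the efficiency the answer refers to. The price is one dictionary per character position.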

Resources