SORTC always moves first element of array to the end - arrays

I've got a character variable which holds a delimited list of strings, like so:
data lists;
format list_val $75.;
list_val = "PDC; QRS; OLN; ABC";
run;
I need to alphabetize the elements of each list (so the desired result when applied to the above string is "ABC; OLN; PDC; QRS;").
I adapted the solution here for my purposes as follows:
data lists_sorted;
set lists;
array_size = count(list_val,";") + 1; /* Cannot be used as array length must be specified at creation */
array t(50) $ 8 _TEMPORARY_;
call missing(of t(*));
do _n_=1 to array_size;
t(_n_)=scan(list_val,_n_,";");
end;
call sortc(of t(*));
new_list_val =catx("; ", of t(*));
put "original: " list_val " new: " new_list_val;
run;
When I run this code I get the following output:
original: PDC; QRS; OLN; ABC new: ABC; OLN; QRS; PDC
Which was not expected or desired. In general, the result of the above code applied to any list is a new list which is sorted alphabetically, except that the first element of the original list becomes the last element of the new list, regardless of its alphabetical ordering.
I can't find anything in the documentation of sortc which would explain this behavior, so I'm wondering if the issue is somehow the way I've set up the temporary array (I don't have much experience with these).
Does anyone know why sortc behaves this way? Side question: is there anyway I can dynamically determine the size of the array, rather than hard-coding a value such as 50?

It is because you included the leading spaces when assigning the values to the array elements. Remove those.
t[_n_]=left(scan(list_val,_n_,";"));
If you want to know what the minimum size array you could use for your data step you would need to process the dataset twice.
proc sql ;
select max(count(list_val,";") + 1) into :max_size trimmed from have;
quit;
....
array t[&max_size] $ 8 _temporary_;
But there is probably not much harm in just using some large constant value.

Related

Grab string from the last cell that isn't missing in SAS

HAVE is a dataset of authors on pubs:
^note that auth50 is the last variable in the dataset.
WANT is a dataset where the first and last authors for each observation ofHAVE are their own variables:
I do the following in a data step to create lastauth...
array auths {*} auth1:auth50;
do i=1 to dim(auths);
if auths{1} NE " " then firstauth = auths{1};
if auths{i} = " " then lastauth = auths{i-1};
else lastauth = auths{i};
end;
...which I thought would iteratively write over lastauth until encountering the last authX variable, but the code does not retrieve the last non-missing value. Any thoughts on what I'm misunderstanding? Thanks!
Check the logic carefully. You want to set the value of FIRST when FIRST is missing and set the value of LAST when the current is NOT missing.
array auths auth1-auth50;
firstauth=auths(1);
do i=1 to dim(auths);
if missing(firstauth) then firstauth = auths{i};
if not missing(auths{i}) then lastauth = auths{i};
end;
Note: The extra assignment before the DO loop is to force SAS to define the new variable since otherwise the first use will be in the IF condition. If you already have a LENGTH or other statement that defines FIRSTAUTH then the extra assignment statement is not needed.
Or skip the array and just use the COALESCEC() function to find the first value. And just reverse the order of the variables to also use it to find the last value.
firstauth = coalescec(of auth1-auth50);
lastauth = coalescec(of auth50-auth1);

Finding specific instance in a list when the list starts with a comma

I'm uploading a spreadsheet and mapping the spreadsheet column headings to those in my database. The email column is the only one that is required. In StringB below, the ,,, simply indicates that a column was skipped/ignored.
The meat of my question is this:
I have a string of text (StringA) comes from a spreadsheet that I need to find in another string of text (StringB) which matches my database (this is not the real values, just made it simple to illustrate my problem so hopefully this is clear).
StringA: YR,MNTH,ANNIVERSARIES,FIRSTNAME,LASTNAME,EMAIL,NOTES
StringB: ,YEAR,,MONTH,LastName,Email,Comments <-- this list is dynamic
MNTH and MONTH are intentionally different;
excelColumnList = 'YR,MNTH,ANNIV,FIRST NAME,LAST NAME,EMAIL,NOTES';
mappedColumnList= ',YEAR,,MONTH,,First Name,Last Name,Email,COMMENTS';
mappedColumn= 'Last Name';
local.index = ListFindNoCase(mappedColumnList, mappedColumn,',', true);
local.returnValue = "";
if ( local.index > 0 )
local.returnValue = ListGetAt(excelColumnList, local.index);
writedump(local.returnValue); // dumps "EMAIL" which is wrong
The problem I'm having is the index returned when StringB starts with a , returns the wrong index value which affects the mapping later. If StringB starts with a word, the process works perfectly. Is there a better way to to get the index when StringB starts with a ,?
I also tried using listtoarray and then arraytolist to clean it up but the index is still off and I cannot reliably just add +1 to the index to identify the correct item in the list.
On the other hand, I was considering this mappedColumnList = right(mappedColumnList,len(mappedColumnList)-1) to remove the leading , which still throws my index values off BUT I could account for that by adding 1 to the index and this appears to be reliably at first glance. Just concerned this is a sort of hack.
Any advice?
https://cfdocs.org/listfindnocase
Here is a cfgist: https://trycf.com/gist/4b087b40ae4cb4499c2b0ddf0727541b/lucee5?theme=monokai
UPDATED
I accepted the answer using EDIT #1. I also added a comment here: Finding specific instance in a list when the list starts with a comma
Identify and strip the "," off the list if it is the first character.
EDIT: Changed to a while loop to identify multiple leading ","s.
Try:
while(left(mappedColumnList,1) == ",") {
mappedColumnList = right( mappedColumnList,(len(mappedColumnList)-1) ) ;
}
https://trycf.com/gist/64287c72d5f54e1da294cc2c10b5ad86/acf2016?theme=monokai
EDIT 2: Or even better, if you don't mind dropping back into Java (and a little Regex), you can skip the loop completely. Super efficient.
mappedColumnList = mappedColumnList.replaceall("^(,*)","") ;
And then drop the while loop completely.
https://trycf.com/gist/346a005cdb72b844a83ca21eacb85035/acf2016?theme=monokai
<cfscript>
excelColumnList = 'YR,MNTH,ANNIV,FIRST NAME,LAST NAME,EMAIL,NOTES';
mappedColumnList= ',,,YEAR,MONTH,,First Name,Last Name,Email,COMMENTS';
mappedColumn= 'Last Name';
mappedColumnList = mappedColumnList.replaceall("^(,*)","") ;
local.index = ListFindNoCase(mappedColumnList, mappedColumn,',', true);
local.returnValue = ListGetAt(excelColumnList,local.index,",",true) ;
writeDump(local.returnValue);
</cfscript>
Explanation of the Regex ^(,*):
^ = Start at the beginning of the string.
() = Capture this group of characters
,* = A literal comma and all consecutive repeats.
So ^(,*) says, start at the beginning of the string and capture all consecutive commas until reaching the next non-matched character. Then the replaceall() just replaces that set of matched characters with an empty string.
EDIT 3: I fixed a typo in my original answer. I was only using one list.
writeOutput(arraytoList(listtoArray(mappedColumnList))) will get rid of your leading commas, but this is because it will drop empty elements before it becomes an array. This throws your indexing off because you have one empty element in your original mappedColumnList string. The later string functions will both read and index that empty element. So, to keep your indexes working like you see to, you'll either need to make sure that your Excel and db columns are always in the same order or you'll have to create some sort of mapping for each of the column names and then perform the ListGetAt() on the string you need to use.
By default many CF list functions ignore empty elements. A flag was added to these function so that you could disable this behavior. If you have string ,,1,2,3 by default listToArray would consider that 3 elements but listToArray(listVar, ",", true) will return 5 with first two as empty strings. ListGetAt has the same "includeEmptyValues" flag so your code should work consistently when that is set to true.

How to slice a variable into array indexes?

There is this typical problem: given a list of values, check if they are present in an array.
In awk, the trick val in array does work pretty well. Hence, the typical idea is to store all the data in an array and then keep doing the check. For example, this will print all lines in which the first column value is present in the array:
awk 'BEGIN {<<initialize the array>>} $1 in array_var' file
However, it is initializing the array takes some time because val in array checks if the index val is in array, and what we normally have stored in array is a set of values.
This becomes more relevant when providing values from command line, where those are the elements that we want to include as indexes of an array. For example, in this basic example (based on a recent answer of mine, which triggered my curiosity):
$ cat file
hello 23
bye 45
adieu 99
$ awk -v values="hello adieu" 'BEGIN {split(values,v); for (i in v) names[v[i]]} $1 in names' file
hello 23
adieu 99
split(values,v) slices the variable values into an array v[1]="hello"; v[2]="adieu"
for (i in v) names[v[i]] initializes another array names[] with names["hello"] and names["adieu"] with empty value. This way, we are ready for
$1 in names that checks if the first column is any of the indexes in names[].
As you see, we slice into a temp variable v to later on initialize the final and useful variable names[].
Is there any faster way to initialize the indexes of an array instead of setting one up and then using its values as indexes of the definitive?
No, that is the fastest (due to hash lookup) and most robust (due to string comparison) way to do what you want.
This:
BEGIN{split(values,v); for (i in v) names[v[i]]}
happens once on startup and will take close to no time while this:
$1 in array_var
which happens once for every line of input (and so is the place that needs to have optimal performance) is a hash lookup and so the fastest way to compare a string value to a set of strings.
not an array solution but one trick is to use pattern matching. To eliminate partial matches wrap the search and array values with the delimiter. For your example,
$ awk -v values="hello adieu" 'FS values FS ~ FS $1 FS' file
hello 23
adieu 99

Apply a string value to several positions of a cell array

I am trying to create a string array which will be fed with string values read from a text file this way:
labels = textread(file_name, '%s');
Basically, for each string in each line of the text file file_name I want to put this string in 10 positions of a final string array, which will be later saved in another text file.
What I do in my code is, for each string in file_name I put this string in 10 positions of a temporary cell array and then concatenate this array with a final array this way:
final_vector='';
for i=1:size(labels)
temp_vector=cell(1,10);
temp_vector{1:10}=labels{i};
final_vector=horzcat(final_vector,temp_vector);
end
But when I run the code the following error appears:
The right hand side of this assignment has too few values to satisfy the left hand side.
Error in my_example_code (line 16)
temp_vector{1:10}=labels{i};
I am too rookie in cell strings in matlab and I don't really know what is happening. Do you know what is happening or even have a better solution to my problem?
Use deal and put the left hand side in square brackets:
labels{1} = 'Hello World!'
temp_vector = cell(10,1)
[temp_vector{1:10}] = deal(labels{1});
This works because deal can distribute one value to multiple outputs [a,b,c,...]. temp_vector{1:10} alone creates a comma-separated list and putting them into [] creates the output array [temp_vector{1}, temp_vector{2}, ...] which can then be populated by deal.
It is happening because you want to distribute one value to 10 cells - but Matlab is expecting that you like to assign 10 values to 10 cells. So an alternative approach, maybe more logic, but slower, would be:
n = 10;
temp_vector(1:n) = repmat(labels(1),n,1);
I also found another solution
final_vector='';
for i=1:size(labels)
temp_vector=cell(1,10);
temp_vector(:,1:10)=cellstr(labels{i});
final_vector=horzcat(final_vector,temp_vector);
end

Bring the length of an array to another array in SAS?

I have a big SAS table, let's describe the columns as, A nd B columns in character format and all other columns are vairable in numerical format (every variable has a different name) with unknow amounth length N, like:
A B Name1 Name2 Name3 .... NameN
-------------------------------------------------
Char Char Number1 Number2 Number3 ..... NumberN
.................................................
.................................................
The goal is that the numerical array Name1-NameN will sum up downward through the Class=B (By B),
So the final table will look like this:
A B Name1 Name2 Name3 .... NameN
----------------------------------------
Char Char Sum1 Sum2 Sum3 ..... SumN
........................................
........................................
To do this sum-up, I described 2 arrays. The first one is:
array Varr {*} _numeric_; /* it reads only numerical columns */
Then I described another array with the same length (Summ1-SummN) to do the sum-up process.
The thing is that I can only describe the length of this new array manually. For example, if there are 80 numerical values, then I have to write manually like:
array summ {80} Summ1-Summ80;
The code works when I write it manually. But instead I want to write something like
array summ {&N} Summ1-Summ&N; /* &N is the dimension of the array Varr */
I tried with do-loop and dim(Varr) under the array in many different ways like:
data want;
array Varr {*} _numeric_;
do i=1 to dim(Varr);
N+1 ;
end;
%put &N;
array Summ {&N} Summ1-Summ&N;
retain Summ;
if first.B then do i=1 to dim(varr); summ(i)=varr(i) ;end;
else do i =1 to dim(varr); summ(i) = summ(i) + varr(i) ; varr(i)=summ(i); end;
drop Summ1-Summ&N;
run;
But it doesn't work. Any idea about how to bring the length of the first array to the second array?
You need to calculate and store the number of numeric variables in a previous step. The easiest way is to use the dictionary.columns metadata table, available in proc sql. This contains all column details for a given dataset, including the type (num or char), you therefore just need to count the number of columns where the type is 'num'.
The code below does just that and stores the result in a macro variable, &N. using the into : functionality. I've also used the functions left and put to remove leading blanks from the macro variable, otherwise you'll encounter problems when putting summ1-summ&N.
I've also added a 2nd solution based on your answer, but will be more efficient as it doesn't read in any records, only the column details
proc sql noprint;
select left(put(count(*),best12.)) into :N
from dictionary.columns
where libname='SASHELP' and memname='CLASS' and type='num';
quit;
%put Numeric variables = &N.;
/*****************************************/
/* alternative solution */
data _null_;
set sashelp.class (obs=0);
array temp{*} _numeric_;
call symputx('N',dim(temp));
run;
%put Numeric variables = &N.;
Now I found another solution with a little modification of the solution from #kl78
Before when I tried with call symput ('N',dim(varr)); I forgot to change the numeric format and to remove the uneccessary spaces. When I run it without format, the code tried to find Summ_____87, so it gave error.
Now I run it with format, call symput ('N',put(dim(varr),2.)); the code can find Summ87, so it is totally sucessfull now.

Resources