SAS, array code, two indices, dropping records - arrays

I am looking through some code and wondering what it does. The code comments are below, but even with them I'm still not sure what the code does. I have used arrays, but I'm not familiar with this pattern. It looks like the code dedupes by using two indices. Is that correct? So if a combination of CCS_DR_IDX and TXN_IDX has already appeared, it will delete those records?
Now handle cases where the dollar matches. If ccs_dr_idx has already been used then delete the record. Dropped txns here will be added back in with the claim data called missing.
PROC SORT DATA=OUT.REQ_1_9_F_AMT_MATCH; BY CCS_DR_IDX DATEDIF; RUN;
DATA OUT.REQ_1_9_F_AMT_MATCH_V2;
  SET OUT.REQ_1_9_F_AMT_MATCH;
  ARRAY id_one{40000} id_one1-id_one40000;
  ARRAY id_two{40000} id_two1-id_two40000;
  RETAIN id_one1-id_one40000 id_two1-id_two40000;
  IF _n_=1 then i=1;
  else i+1;
  do j=1 to i;
    if CCS_DR_IDX=id_one{j} then delete;
  end;
  do k = 1 to i;
    if TXN_IDX = id_two{k} then delete;
  end;
  id_one{i}=CCS_DR_IDX;
  id_two{i}=TXN_IDX;
  drop i j k id_one1-id_one40000 id_two1-id_two40000;
run;

The sort is
BY CCS_DR_IDX DATEDIF;
The filtering or selecting occurs when control reaches the bottom of the data step and implicitly OUTPUTs. That happens only if CCS_DR_IDX and TXN_IDX form a combination in which neither value has appeared previously.
Since you have sorted by CCS_DR_IDX, there is implicit grouping: at most one record per CCS_DR_IDX is output, and for the first group it must be the first record in the group. Each subsequent row in a CCS_DR_IDX group, after that output, will match an entry in id_one and be tossed away by DELETE.
When you start processing the next CCS_DR_IDX group, rows will be processed until you reach the next TXN_IDX that is distinct with respect to those tracked in id_two. Because the sort has a second key, DATEDIF, you can say the output is "a selection of the first-occurring combinations of unique pair items CCS_DR_IDX and TXN_IDX" (somewhat akin to pair sampling without repeats).
There could be a case where some CCS_DR_IDX is not in the output -- that would happen when the group contains only TXN_IDX values that occurred in prior CCS_DR_IDX groups.
Without seeing the data model and the reason for the combinations (probably some sort of Cartesian join) it's hard to make a less vague statement of what is being selected.

Related

ABAP loop multiple tables in sync

I have run into some issues writing mass upload programs in ABAP. I have gathered all the data from the web into one SAP table, but then I have to move the content of that table into 3 different tables in order to use other functions. The problem is that if I loop through one table, the other ones do not loop in sync.
TABLES
  T_HEADER STRUCTURE ZCST5000
  T_DETAIL STRUCTURE ZCSS5001
  T_TIME   STRUCTURE ZCSS5002
LOOP AT T_HEADER
  PERFORM XTY TABLES T_DETAIL
              USING LS_HEADER_TMP
  PERFORM ZCC TABLES T_TIME
              USING LS_HEADER_TMP
ENDLOOP
This is example code. If I loop through T_HEADER, it only loops through T_HEADER and won't loop through T_DETAIL and T_TIME in sync. They all have the same row counts, since these structures originally come from one table. So when T_HEADER row 1 is being processed, row 1 of T_DETAIL and T_TIME should be picked up, and when T_HEADER moves to row 2, row 2 of T_DETAIL and T_TIME should be picked up. How do I tackle this? :(
You can use the classic syntax READ TABLE itab INDEX n INTO struct or the new syntax struct = itab[ n ] to read a specific line of a table into a structure by line-number.
So when you have three tables, and you can guarantee that they all have the same number of rows and that rows with the same row number correspond to each other, you can do it like this (modern syntax):
DO lines( t_header ) TIMES.
  DATA(header) = t_header[ sy-index ].
  DATA(detail) = t_detail[ sy-index ].
  DATA(time) = t_time[ sy-index ].
  "You now have the corresponding lines in the structure-variables header, detail and time
ENDDO.
Or this (obsolete syntax using tables with header lines like in the question)
DATA num_lines TYPE i.
DESCRIBE TABLE t_header LINES num_lines.
DO num_lines TIMES.
  READ TABLE t_header INDEX sy-index.
  READ TABLE t_detail INDEX sy-index.
  READ TABLE t_time INDEX sy-index.
  "You now have the corresponding lines in the header-lines
ENDDO.
Note that these two snippets will behave slightly differently in the error-case when t_detail or t_time have fewer lines than t_header. The first snippet will throw an exception, which will abort the program unless caught. The second snippet will keep running with the header-line from the previous loop iteration.

Random sample from another table's column

I am trying to figure out how to populate a "fake" column by choosing randomly from another table column.
So far this was easy using an array and the rantbl() function as there were not a lot of modalities.
data want;
  set have;
  array values[2] $10 _temporary_ ('NO','YES');
  value=values[rantbl(0,0.5,0.5)];
  array start_dates[4] _temporary_ (1735689600,1780358400,1798848000,1798848000);
  START_DATE=start_dates[rantbl(0,0.25,0.25,0.25,0.25)];
  format START_DATE datetime20.;
run;
However, my question is what happens if there are, for example, more than 150 modalities in the other table? Is there a way to put into an array all the modalities that are in another table? Or better, to populate the new "fake" column with modalities from another table's column according to the modalities' distribution in that table?
I'm not entirely sure, but here's how I interpret your request and how I would solve it.
You have a table one. You want to create a new data set want with an additional column. This column should have values that are sampled from a pool of values given in yet another data set two in column y. You want to simulate the new column in the want data set according to the distribution of y in the two data set.
So, in the example below, there should be a .5 chance of simulating y = 3 and .25 for 1 and 2 respectively.
I think the way to go is not using arrays at all. See if this helps you.
data one;
  do x = 1 to 1e4;
    output;
  end;
run;

data two;
  input y;
  datalines;
1
2
3
3
;

data want;
  set one;
  p = ceil(rand('uniform')*n);
  set two(keep = y) nobs = n point = p;
run;
To verify that the new column resembles the distribution from the two data set:
proc freq data = want;
tables y / nocum;
run;
There are probably a dozen good ways to do this; which one is ideal depends on various details of your data - in particular, how performance-sensitive this is.
The most SASsy way to do this, I would say, is to use PROC SURVEYSELECT. This generates a random sample of the size you want, which you then merge on. It is not the fastest way, but it is very easy to understand and is fast-ish as long as you aren't talking humongous data sizes.
data _null_;
  set sashelp.cars nobs=nobs_cars;
  call symputx('nobs_cars',nobs_cars);
  stop;
run;

proc surveyselect data=sashelp.class sampsize=&nobs_cars out=names(keep=name)
                  seed=7 method=urs outhits outorder=random;
run;

data want;
  merge sashelp.cars names;
run;
In this example, we are taking the dataset sashelp.cars, and appending an owner's name to each car, which we choose at random from the dataset sashelp.class.
What we're doing here is first determining how many records we need - the number of observations in the to-be-merged-to dataset. This step can be skipped if you know that already, but it takes basically zero time no matter what the dataset size.
Second, we use proc surveyselect to generate the random list. We use method=urs to ask for unrestricted random sampling (sampling with replacement), meaning we take 428 (in this case) separate draws, with every row equally likely to be chosen each time. We use outhits and outorder=random to get a dataset with one row per desired output dataset row and in a random order (without outhits it gives one row per input dataset row plus a number-of-times-sampled variable, and without outorder=random it gives them in sorted order). sampsize is used with our created macro variable that stores the number of observations in the eventual output dataset.
Third, we do a side-by-side merge (with no by statement, intentionally). Please note that in some installations the mergenoby system option is set to give a warning or error for this particular usage; if so you may need to do this slightly differently, though it is easy to achieve identical results using two set statements (set sashelp.cars; set names;).

Returning multiple adjacent cell results from a min array which may include multiple duplicate values

I'm trying to set up a formula that will return the contents of a related cell (my related cell is on another sheet) from the smallest 2 results in an array. This is what I'm using right now.
=INDEX('Sheet1'!$A$40:'Sheet1'!$A$167,MATCH(SMALL(F1:F128,1),F1:F128,0),1)
And
=INDEX('Sheet1'!$A$40:'Sheet1'!$A$167,MATCH(SMALL(F1:F128,2),F1:F128,0),1)
The problem I've run into is twofold.
First, if there are multiple lowest results I get whichever one appears first in the array for both entries.
Second, if the second lowest result is duplicated but the first is not I get whichever one shows up on the list first, but any subsequent duplicates are ignored. I would like to be able to display the names associated with the duplicated scores.
You will have to adjust the k parameter of the SMALL function, raising k to account for duplicates. The COUNTIF function should be sufficient for this. Once all occurrences of the top two scores are retrieved, standard 'lookup multiple values' formulas can be applied. Retrieving successive row positions with the AGGREGATE¹ function and passing those into an INDEX of the names works well.
The formulas in H2:I2 are,
=IF(SMALL(F$40:F$167, ROW(1:1))<=SMALL(F$40:F$167, 1+COUNTIF(F$40:F$167, MIN(F$40:F$167))), SMALL(F$40:F$167, ROW(1:1)), "") '◄ H2
=IF(LEN(H40), INDEX(A$40:A$167, AGGREGATE(15, 6, ROW($1:$128)/(F$40:F$167=H40), COUNTIF(H$40:H40, H40))), "") '◄ I2
Fill down as necessary. The scores are designed to terminate after the last second place so it would be a good idea to fill down several rows more than is immediately necessary for future duplicates.
¹ The AGGREGATE function was introduced with Excel 2010². It is not available in earlier versions.
² Related article for pre-xl2010 functions - see Multiple Ranked Returns from INDEX().
The following formula will do what I think you want:
=IF(OR(ROW(1:1)=1,COUNTIF($E$1:$E1,INDEX(Sheet1!$A$40:$A$167,MATCH(SMALL($F$1:$F$128,ROW(1:1)),$F$1:$F$128,0)))>0,ROW(1:1)=2),INDEX(Sheet1!$A$40:$A$167,MATCH(1,INDEX(($F$1:$F$128=SMALL($F$1:$F$128,ROW(1:1)))*(COUNTIF($E$1:$E1,Sheet1!$A$40:$A$167)=0),),0)),"")
NOTE:
This is an array formula and must be confirmed with Ctrl-Shift-Enter.
There are two references $E$1:$E1. This formula assumes that it will be entered in E2 and copied down. If it is going in a different column, change these two references. It must go in the second row or it will throw a circular reference.
What it will do
If there is a tie for first place it will only list those teams that are tied for first.
If there is only one first place but multiple tied for second places it will list all those in second.
So make sure you copy the formula down far enough to cover all possible ties. It will put "" in any that do not fill, so err on the high side.
To get the scores, use this simple formula; I put mine in column F:
=IF(E2<>"",SMALL($F$1:$F$128,ROW(1:1)),"")
Again change the E reference to the column you use for the output.
I did a small test:

Matching DB records within a certain tolerance for error

I have a reference table (REF), and a table of incomplete data (ICD). Both record structures are the same. In this case it happens to contain person related data. I am trying to fill in records in ICD with data from the best matching row in REF, if a best matching row exists within reason. The ICD table may have or be missing first name, middle name, last name, address, city, state, zip, dob, and some others.
I need to ensure that the data I fill in is 100% accurate, so I already know I will not be able to fill it in on ICD rows where there isn't a reasonable match.
To this point I have written close to a dozen variations of matching queries, running over currently unmatched rows, and bringing over data in the case of a match. I know there must be a better way.
What I would like to do is to set up a weight on certain criteria, and then take the summed weight for all satisfied criteria, and if above a certain threshold, honor the match. For example:
first name match = 15 points
first name soundex match = 7 points
last name match = 25 points
last name soundex match = 10 points
middle name match = 5 points
address match = 25 points
dob match = 15 points
Then I search for previously unmatched rows, and any match above a sum of 72 points is declared a confident match.
I can't seem to get my head around how to do this in a single query. I also have a unique id in each table (they don't align or overlap, but it may prove useful in keeping track of a subquery result).
Thanks!!!
Here are some basic ideas:
If there is no column which is guaranteed to match you will have to consider every row in REF for every row in ICD. Basically this is just a join of the two tables without any join condition.
This join gives you pairs of rows, i.e. rows with twice as many columns as either table. Now you'll have to apply your ranking system and compute an additional weight column. You can do this with a stored procedure or lots of decodes, something like
decode(REF.firstName, ICD.firstName, 15, 0) + decode ...
Now you select the ones which have a weight greater than 72. The problem is that one ICD row may have more than one confident match. If you want the best match only, order your result by weight so that the best match is applied last, and perform your updates. If an ICD row has more than one confident REF row, you will do some redundant updates, but the best match will always come last and overwrite the previous updates.
Doing it all in a single update is yet another challenge.
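To make the above concrete, here is a minimal sketch of that cross join plus scoring, filtered at the threshold from the question. The table and column names (icd, ref, id, first_name, and so on) are assumptions standing in for the real schema, and CASE is used in place of DECODE so it reads the same on any dialect:
SELECT *
FROM (
  SELECT icd.id AS icd_id,
         ref.id AS ref_id,
           CASE WHEN icd.last_name   = ref.last_name                   THEN 25 ELSE 0 END
         + CASE WHEN SOUNDEX(icd.last_name)  = SOUNDEX(ref.last_name)  THEN 10 ELSE 0 END
         + CASE WHEN icd.first_name  = ref.first_name                  THEN 15 ELSE 0 END
         + CASE WHEN SOUNDEX(icd.first_name) = SOUNDEX(ref.first_name) THEN  7 ELSE 0 END
         + CASE WHEN icd.middle_name = ref.middle_name                 THEN  5 ELSE 0 END
         + CASE WHEN icd.address     = ref.address                     THEN 25 ELSE 0 END
         + CASE WHEN icd.dob         = ref.dob                         THEN 15 ELSE 0 END AS weight
  FROM   icd
  CROSS JOIN ref              -- no join condition: every ICD row paired with every REF row
) scored
WHERE  weight > 72;           -- the confidence threshold from the question
Note that an exact name match also satisfies the corresponding soundex test, so those point values add up; adjust the branches if that is not the intent.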
The comment by @tbone was the most correct, but I cannot mark it as an answer because it was just a comment.
I did in fact create a package to deal with this, which contains two functions.
The first function queries any record from the reference table that matches any of a variety of indexes. It then calculates a score associated with each field. For example, a last name exact match is worth 25, a last name soundex match is worth 16, a null is worth 0 and a total mismatch is worth -15. These are summed together, and then I ran various queries to tell how low the threshold could be before letting in junk. The first function returned each row from REF, and the score associated with it, in descending order.
The second function uses lead and lag to specifically consider the highest score returned by the first function, and then I added a second check, that the highest score was at least a reasonable number of points in the lead from the second highest score. This trick eliminated my false matches. The second function returns a single unique id of the match if there is a match.
From there it was pretty easy to write an update statement on the ICD table.
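A hedged sketch of that second step (not the actual package): assuming the scoring query's output is available as scored_candidates with columns icd_id, ref_id and score, analytic functions can pick the best candidate per ICD row and require it to clear the runner-up by a margin (the 10-point margin here is purely illustrative):
SELECT icd_id, ref_id, score
FROM (
  SELECT c.icd_id,
         c.ref_id,
         c.score,
         ROW_NUMBER() OVER (PARTITION BY c.icd_id ORDER BY c.score DESC) AS rn,
         LEAD(c.score) OVER (PARTITION BY c.icd_id ORDER BY c.score DESC) AS runner_up
  FROM   scored_candidates c
) ranked
WHERE  rn = 1                                  -- best-scoring REF candidate per ICD row
  AND  score > 72                              -- absolute confidence threshold
  AND  score - COALESCE(runner_up, 0) >= 10;   -- must clearly beat the second-best score
The result is one confident ref_id per icd_id (or none), which is straightforward to feed into the final UPDATE.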

Finding arrays that contain a subset of another array without using #> with postgreSQL

I have a table with 1.5 million records. Each record has a row number and an array with between 1 and 1,000 elements. I am trying to find all of the arrays that are a subset of the larger arrays.
When I use the code below, I get ERROR: statement requires more resources than resource queue allows (possibly because there are over a trillion possible combinations):
select a.array as dup
from   table a
left join table b
  on   b.array #> a.array
  and  a.row_number <> b.row_number
Is there a more efficient way to identify which arrays are subsets of the other arrays and mark them for removal other than using #>?
Your example code suggests that you are only interested in finding arrays that are subsets of any other array in another row of the table.
However, your query with a JOIN returns all combinations, possibly multiplying results.
Try an EXISTS semi-join instead, returning qualifying rows only once:
SELECT a.array AS dup
FROM   table a
WHERE  EXISTS (
   SELECT 1
   FROM   table b
   WHERE  a.array <# b.array
   AND    a.row_number <> b.row_number
);
With this form, Postgres can stop iterating rows as soon as the first match is found. If this won't go through either, try partitioning your query. Add a clause like
AND table_id BETWEEN 0 AND 10000
and iterate through the table. Should be valid for this case.
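Combining the two, one batch might look like the sketch below; table_id is just a placeholder for whatever key you can range over (the question's row_number would do), and the identifiers otherwise follow the question's naming:
SELECT a.array AS dup
FROM   table a
WHERE  a.table_id BETWEEN 0 AND 10000   -- shift this window on each pass
AND    EXISTS (
   SELECT 1
   FROM   table b
   WHERE  a.array <# b.array
   AND    a.row_number <> b.row_number
);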
Aside: it's a pity that your derivative (Greenplum) doesn't seem to support GIN indexes, which would make this operation much faster. (The index itself would be big, though.)
Well, I don't see how to do this efficiently in a single declarative SQL statement without appropriate support from an index. I don't know how well this would work with a GIN index, but using a GIN index would certainly avoid the need to compare every possible pair of rows.
The first thing I would do is to carefully investigate the types of indexes you do have at your disposal, and try creating one as necessary.
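For reference, on stock PostgreSQL the GIN route mentioned in the aside above would look roughly like the sketch below. The identifiers keep the question's placeholder names (table and array would need to be replaced or quoted, as both are reserved words), PostgreSQL spells the containment operators @> / <@, and the question's Greenplum reportedly lacks GIN, so treat this as what would be possible where the index type is available:
-- A GIN index on the array column supports the containment operators:
CREATE INDEX table_array_gin ON table USING gin (array);

-- The semi-join can then test containment against the indexed column:
SELECT a.array AS dup
FROM   table a
WHERE  EXISTS (
   SELECT 1
   FROM   table b
   WHERE  b.array @> a.array            -- b's array contains a's array
   AND    a.row_number <> b.row_number
);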
If that doesn't work, the first thing that comes to my mind, procedurally speaking, would be to sort all the arrays, then sort the rows into a graded lexicographic order on the arrays. Then start with the shortest arrays and work upwards as follows: e.g. for [1,4,9], check whether any of the arrays with length <= 3 that start with 1 are a subset, then check the arrays with length <= 2 that start with 4, and then the arrays of length <= 1 that start with 9, removing any subsets you find from consideration as you go so that you don't keep re-checking the same rows over and over again.
I'm sure you could tune this algorithm a bit, especially depending on the particular nature of the data involved. I wouldn't be surprised if there was a much better algorithm; this is just the first thing that I thought of. You may be able to work from this algorithm backwards to the SQL you want, or you may have to dump the table for client-side processing, or some hybrid thereof.
