Trying to concatenate values from data sets into an array in SAS

I am trying to add a Data step that creates the work.orders_fin_qtr_tot data set from the work.orders_fin_tot data set. This new data set should contain new variables for quarterly sales and profit. Use two arrays to create the new variables: QtrSales1-QtrSales4 and QtrProfit1-QtrProfit4. These represent total sales and total profit for the quarter (1-4). Use the quarter number of the year in which the order was placed to index into the correct variable to add either the TotalSales or TotalProfit to the new appropriate variable.
Add a Proc step that displays the first 10 observations of the work.orders_fin_qtr_tot data set.
My issue is that I can't seem to get the two different arrays to combine without leaving gaps.
proc sort data=work.orders_fin_tot_qtr;
by workqtr;
run;
data work.orders_fin_tot_qtr;
set work.orders_fin_tot_qtr;
array QtrSales{4} quarter1-quarter4 ;
do i = 1 by 1 until (last.order_id);
if workqtr=i then QtrSales{i}=totalsales;
end;
drop totalsales totalprofit _TYPE_ _FREQ_;
run;
proc print data=work.orders_fin_tot_qtr;
run;

The syntax last.order_id is only appropriate if there is a BY statement in the DATA Step -- if not present, the last. reference is always missing and the loop will never end; so you have coded an infinite loop!
The step has drop totalsales totalprofit _TYPE_ _FREQ_. Those underscored variables indicate the incoming data set was probably created with a Proc SUMMARY.
Your orders_fin_tot data set should have columns order_id, quarter (valid values 1, 2, 3, 4), and totalsales. If the data is multi-year, it should also have a column named year.
The missing BY and present last.id indicate you are reshaping the data from a categorical vector going down a column to one that goes across a row -- this is known as a pivot or transpose. The DO construct you show in the question is incorrect, but it is similar to a technique known in SAS circles as a DOW loop -- the special feature of that technique is that the SET and BY are coded inside the loop.
Try adjusting your code to the following pattern:
data want;
do _n_ = 1 by 1 until (last.order_id);
SET work.orders_fin_tot; * <--- presumed to have data 'down' a column for each quarter of an order_id;
BY order_id; * <--- ensures data is sorted and makes automatic flag variable LAST.ORDER_ID available for the until test;
array QtrSales quarter1-quarter4 ; * <--- define array for step and creates four variables in the program data vector (PDV);
* this is where the pivot magic happens;
* the (presumed) quarter value (1,2,3,4) from data going down the input column becomes an
* index into an array connected to variables going across the PDV (the output row);
QtrSales{quarter} = totalsales;
end;
run;
Notice there is no OUTPUT statement inside or outside the loop. When the loop completes its iterations, the code flow reaches the bottom of the DATA step and does an implicit OUTPUT (because there is no explicit OUTPUT elsewhere in the step).
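A sketch applying this pattern to the assignment's two-array requirement (assuming, per the question's code, that work.orders_fin_tot has order_id, workqtr with values 1-4, totalsales, and totalprofit; if an order can have several rows per quarter, use a sum statement such as QtrSales{workqtr} + totalsales instead of plain assignment):
data work.orders_fin_qtr_tot;
do _n_ = 1 by 1 until (last.order_id);
SET work.orders_fin_tot;
BY order_id;
array QtrSales QtrSales1-QtrSales4; * <--- creates QtrSales1-QtrSales4 in the PDV;
array QtrProfit QtrProfit1-QtrProfit4; * <--- creates QtrProfit1-QtrProfit4 in the PDV;
QtrSales{workqtr} = totalsales;
QtrProfit{workqtr} = totalprofit;
end;
drop workqtr totalsales totalprofit;
run;
proc print data=work.orders_fin_qtr_tot(obs=10);
run;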
Also, for any data set specified in code, you can use data set option OBS= to select which observation numbers are used.
proc print data=MyData(obs=10);
OBS is a tricky option name because it really means the last observation number to use. FIRSTOBS is another data set option, specifying the first observation number to use; when not present it defaults to 1. So the above is equivalent to
proc print data=MyData(firstobs=1 obs=10);
OBS= should be thought of conceptually as LASTOBS=; there is no actual option name LASTOBS. The following would log an ERROR, because OBS < FIRSTOBS:
proc print data=MyData(firstobs=10 obs=5);
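With OBS at or above FIRSTOBS you get a window of rows instead; for example, this prints observations 10 through 50:
proc print data=MyData(firstobs=10 obs=50);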

Related

Long calculation times with XLOOKUP vs INDEX-MIN-COLUMN

I'm using this formula =IF(B24="","",IFERROR(INDEX(Sheet3!$C$3:$EE$3,,MIN(IF(Sheet3!$C$4:$EE$23=(Sheet2!C24&$K$18),COLUMN(Sheet3!$C:$EE)))-2),"NF")) to return a cell value in the top row of an array - a date in this case.
The search criteria is a combination of a unique project number and a 2 digit status alphanumerical code for the project. The array consists of 23 rows where combinations of the unique numbers are found, each with different status codes.
So essentially, I'm building a FILTERED project status dashboard that returns dates linked to the relevant project status.
The code above is inspired from ( LINK ) that uses a very similar layout, but it uses town suburbs linked to postal codes instead of project numbers and status codes. The formula works well (though, not entered as an array formula), but I don't have a single formula in the sheet, I have 3 300 occurrences of this formula.
The problem comes in when the user changes the FILTER - Excel recalculates the entire dashboard and that takes anywhere from 2 to 5 minutes to run. You hit the escape button and cancel the calculation after setting the filter, but Excel just starts calculating again after a few seconds. After that, Excel's response is sluggish and almost unusable. Yes - our hardware is pretty weak ...
I tried XLOOKUP as well, but I can't set the "lookup_array" to an array ( Sheet3!$C$4:$EE$23 ) because it doesn't match the "return_array" ( Sheet3!$C$3:$EE$3 ). Concatenating the lookup arrays with & works, but then you'd have to do that for all 23 rows, and again, multiply that by 3 300.
I thought of creating a UDF, but the function will still be called every time Excel recalculates after filtering... 3 300 calls ...
Any ideas on how to make the INDEX version run faster, or make the XLOOKUP accept the lookup_array as Sheet3!$C$4:$EE$23 in the hopes that it'll run faster?
Thank you!
Not really an elegant solution, but it works.
I imported the dataset into a helper sheet, where I combined each cell value with the corresponding value in column A for its row ( a name in this case ) and the date from row 1 for its column, using underscore as a delimiter.
This new data range was then given a unique name, EE in this case.
On a second helper sheet, I used this formula =INDEX(Filtered,1+INT((ROW('Sheet1'!C3)-1)/COLUMNS(Filtered)),MOD(ROW('Sheet1'!C3)-1+COLUMNS(Filtered),COLUMNS(Filtered))+1) and dragged it down until it returned a #REF! error, then deleted back to one row before the error.
This transposes all the data into a single column G. Using =UNIQUE(SORT(FILTER(B3:B3240,B3:B3240<>"",""))) then gives me a filtered list of unique values in column H. I then run
=IF(H3="","",LEFT(H3, SEARCH("_",H3,1)-1)) for the first data value in I, and
=IF(H3="","",MID(H3, SEARCH("_",H3) + 1, SEARCH("_",H3,SEARCH("_",H3)+1) - SEARCH("_",H3) - 1)) for the middle data value in J, and
=IF(H3="","",IFERROR(TEXT(RIGHT(H3,5),"yyyy-mm-dd"),"NF")) for the last data value in K.
Then just run XLOOKUP across columns I, J and K.
It runs quickly and solves a few of the other issues I had as well.
The second data set has just over 35 000 rows - still works well and fast.

SAS DO LOOP with specific dates

I want to create a data set where I only want to keep 5 specific dates.
So my &date is 31mar2020 and &enddate is 31mar2025 and I only want to keep 31mar every year until 2025.
With my code below it creates dates for every day up to 31mar2025, and that's too much, so I only want to keep the 5 specific dates.
How can I do that?
Thank you
DATA LOOP;
FORMAT ROLL_BASE_DT DATE9.;
DO ROLL_BASE_DT = &DATE TO &ENDdate;
OUTPUT;
END;
RUN;
You can use commas in the DO statement to list multiple values.
do date='31mar2021'd,'31mar2022'd,'31mar2023'd,'31mar2024'd,'31mar2025'd;
...
end;
You could loop over the YEAR value instead.
do year=2021 to 2025;
date=mdy(3,31,year);
...
end;
You could use INTNX() to increment the date by YEAR. You can use INTCK() to figure out how many times to run the loop.
do index=0 to intck('year',&DATE,&ENDdate);
date=intnx('year',&date,index,'s');
...
end;
If it's just the 5 dates you want, you could use the cards input (I know of it but have never used it personally).
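A minimal sketch of that idea, using DATALINES (the newer name for CARDS) and assuming the five dates are simply typed in:
data want;
format roll_base_dt date9.;
input roll_base_dt :date9.;
datalines;
31mar2021
31mar2022
31mar2023
31mar2024
31mar2025
;
run;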
Alternatively, rather than using a loop, just set the values individually, with an OUTPUT statement after each assignment. That should do it.
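A sketch of that approach, again assuming the same five dates:
data want;
format roll_base_dt date9.;
roll_base_dt = '31mar2021'd; output;
roll_base_dt = '31mar2022'd; output;
roll_base_dt = '31mar2023'd; output;
roll_base_dt = '31mar2024'd; output;
roll_base_dt = '31mar2025'd; output;
run;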

Postgres ordering table by element in large data set

I have a tricky problem trying to find an efficient way of ordering a set of objects (~1000 rows) that contain a large (~5 million) number of indexed data points. In my case I need a query that allows me to order the table by a specific datapoint. Each datapoint is a 16-bit unsigned integer.
I am currently solving this problem by using a large array:
Object Table:
id serial NOT NULL,
category_id integer,
description text,
name character varying(255),
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
data integer[],
GIST index:
CREATE INDEX object_rdtree_idx
ON object
USING gist
(data gist__intbig_ops)
This index is not currently being used when I do a select query, and I am not certain it would help anyway.
Each day the array field is updated with a new set of ~5 million values.
I have a webserver that needs to list all objects ordered by the value of a particular data point:
Example Query:
SELECT name, data[3916863] as weight FROM object ORDER BY weight DESC
Currently, it takes about 2.5 seconds to perform this query.
Question:
Is there a better approach? I am happy for the insertion side to be slow as it happens in the background, but I need the select query to be as fast as possible. In saying this, there is a limit to how long the insertion can take.
I have considered creating a lookup table where every value has its own row - but I'm not sure how the insertion/lookup time would be affected by this approach, and I suspect entering 1000+ records with ~5 million data points each as individual rows would be too slow.
Currently inserting a row takes ~30 seconds which is acceptable for now.
Ultimately I am still on the hunt for a scalable solution to the base problem, but for now I need this solution to work, so this solution doesn't need to scale up any further.
Update:
I was wrong to dismiss having a giant table instead of an array; while insertion time massively increased, query time is reduced to just a few milliseconds.
I am now altering my generation algorithm to only save a datum if it is non-zero and changed from the previous update. This has reduced insertions to just a few hundred thousand values, which only takes a few seconds.
New Table:
CREATE TABLE data
(
object_id integer,
data_index integer,
value integer
);
CREATE INDEX index_data_on_data_index
ON data
USING btree
("data_index");
New Query:
SELECT name, coalesce(value,0) as weight FROM objects LEFT OUTER JOIN data on data.object_id = objects.id AND data_index = 7731363 ORDER BY weight DESC
Insertion Time: 15,000 records/second
Query Time: 17ms
First of all, do you really need a relational database for this? You do not seem to be relating some data to some other data. You might be much better off with a flat-file format.
Secondly, your index on data is useless for the query you showed. You are querying for a datum (a position in your array) while the index is built on the values in the array. Dropping the index will make the inserts considerably faster.
If you have to stay with PostgreSQL for other reasons (bigger data model, MVCC, security) then I suggest you change your data model: make data a bytea column and set its storage to EXTERNAL. Since the data column is about 4 x 5 million = 20MB it will be stored out-of-line anyway, but if you explicitly set it, then you know exactly what you have.
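A sketch of that change (it assumes you can regenerate the column's contents afterwards, since there is no implicit int[]-to-bytea cast):
ALTER TABLE object DROP COLUMN data;
ALTER TABLE object ADD COLUMN data bytea;
ALTER TABLE object ALTER COLUMN data SET STORAGE EXTERNAL;
EXTERNAL keeps the value out-of-line and uncompressed, which is what makes fetching a small byte range cheap.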
Then create a custom function in C that fetches your data value "directly" using the PG_GETARG_BYTEA_P_SLICE() macro and that would look somewhat like this (I am not a very accomplished PG C programmer so forgive me any errors, but this should help you on your way):
// Function get_data_value() -- Get a 4-byte value from a bytea
// Arg 0: bytea* The data
// Arg 1: int32 The position of the element in the data, 1-based
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(get_data_value);
Datum
get_data_value(PG_FUNCTION_ARGS)
{
    int32  element = PG_GETARG_INT32(1) - 1;      // second argument, make 0-based
    bytea *data = PG_GETARG_BYTEA_P_SLICE(0,      // first argument
        element * sizeof(int32),                  // offset into data
        sizeof(int32));                           // get just the required 4 bytes
    PG_RETURN_INT32(*((int32 *) VARDATA(data)));  // skip the varlena header
}
The PG_GETARG_BYTEA_P_SLICE() macro retrieves only a slice of data from the disk and is therefore very efficient.
There are some samples of creating custom C functions in the docs.
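To call the function from SQL you also need a declaration along these lines (the library name my_module is a placeholder for wherever you compile the function):
CREATE FUNCTION get_data_value(bytea, integer) RETURNS integer
AS 'my_module', 'get_data_value'
LANGUAGE C STRICT;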
Your query now becomes:
SELECT name, get_data_value(data, 3916863) AS weight FROM object ORDER BY weight DESC;

How to loop through a range of week status variables in SAS and pull status according to specific week given by other variable

I have a range of weekly variables describing a person's "status" (from week 1, 2010 to 2012 week 17).
The variables are given by:
y_1001, y_1002, ..., y_1052, y_1101, y_1102, ..., y_1217
I define the period of variables like this:
%let period = y_1001-y_1052 y_1101-y_1148;
I also have a treatment period given as a start date and an end date. My challenge is to find the status given by the y_ variables in the week after the person stops the treatment.
I am not too familiar with SAS, but my idea was to "pick" the correct y_ variable based on a week counter, say by counting the number of weeks since the beginning of the period (week 1 in 2010) and until the date where the treatment ends.
I get the weeks until end of treatment like this
week_count = 1 + intck( 'week.2', '1JAN2010'd, end_treatment_date, 'd');
But how can I retrieve the corresponding y_ variable based on this count?
After a fruitless search on how to loop over the period variables and pick the one corresponding to the week_count variable for each person, I thought about going a different way... something like this:
array weeks(*) &period;
do i = 1 to dim(weeks) by 1;
if week_count = i then end_status = y_10&i;
end;
...but with modifications to take into account that there is a mismatch between the dimension of the array and the number of weeks and years.
But then my challenge is to make the following part work...
if week_count = i then end_status = y_10&i;
How can I make SAS pick the right y_ variable based on the loop index? This seems like a really simple problem, but somehow I have not managed to find a solution. Is there no way to use the variable "i" as input in defining the correct y_ variable?
Would really appreciate if somebody could throw some hints.
I think you want:
if week_count = i then end_status = weeks{i};
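In fact, since week_count is already the position in the array, you can skip the loop entirely and index directly (a sketch, assuming &period lists the weekly variables in chronological order with no gaps):
array weeks(*) &period;
if 1 <= week_count <= dim(weeks) then end_status = weeks{week_count};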
VVALUEX is a little-known function in SAS which can help you extract a value using a SAS variable name held in a character string. Here your problem is to construct the SAS variable name from the number of weeks between 1st Jan 2010 and the end of treatment. You can avoid running a DO loop for every observation by using the ideas in the example below -
data _null_;
end_treatment_date = "09FEB2010"d;
y_1007 = 'D';
status = vvaluex(compress("y_" || substr(strip(year(end_treatment_date)), 3, 2) || put(1 + intck("week.2", "01JAN2010"d, end_treatment_date), z2.)));
put status;
run;
The variable name is constructed as follows - you start with the string "y_", append the last two digits of the year, and then append the week number, computed with the same logic as your week_count variable. That gives the variable name you want; applying VVALUEX then returns its value. Running a DO loop for every observation can be inefficient if you have millions of them.

Using a sort order column in a database table

Let's say I have a Product table in a shopping site's database that keeps the description, price, etc. of the store's products. What is the most efficient way to let my client re-order these products?
I created an Order column (integer) to use for sorting records, but that gives me performance headaches, due to the primitive method I use: changing the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with ID=26 to 3?
What I did was create a procedure which checks whether there is already a record at the target order (3) and, if not, updates the order of the row (ID=26). If there is a record at the target order, the procedure calls itself with that row's ID and target order + 1 as parameters.
That causes an update of every single record after the one I want to change, to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the order column of an item to be enough for sorting with no secondary keys involved. Order column alone must specify a unique place for its record.
In addition to all this, I wonder if I could implement something like a linked list: a 'Next' column instead of an 'Order' column that keeps the next item's ID. But I have no idea how to write the query that retrieves the records in the correct order. If anyone has an idea about this approach as well, please share.
UPDATE Product SET [Order] = [Order] + 1 WHERE [Order] >= @NewOrderVal
([Order] is bracketed because ORDER is a reserved word in T-SQL.) Though over time you'll get larger and larger "spaces" in your order, it will still "sort".
This adds 1 to the value being changed and to every value after it in one statement, but the caveat above still holds: larger and larger "spaces" will form in your order, possibly to the point of exceeding an INT's range.
Alternate solution, given the desire for no spaces:
Imagine a procedure UpdateSortOrder with parameters @NewOrderVal, @IDToChange, @OriginalOrderVal.
It is a two-step process, depending on whether the row is moving up or down the sort order.
IF @NewOrderVal < @OriginalOrderVal -- moving down the chain
BEGIN
-- create space for the movement; no point in changing the original
UPDATE Product SET [Order] = [Order] + 1
WHERE [Order] BETWEEN @NewOrderVal AND @OriginalOrderVal - 1;
END
IF @NewOrderVal > @OriginalOrderVal -- moving up the chain
BEGIN
-- create space for the movement; no point in changing the original
UPDATE Product SET [Order] = [Order] - 1
WHERE [Order] BETWEEN @OriginalOrderVal + 1 AND @NewOrderVal;
END
-- finally, move the row we're changing to its new value
UPDATE Product SET [Order] = @NewOrderVal WHERE Id = @IDToChange;
Regarding best practice: most environments I've been in typically want items grouped by category and sorted alphabetically or by "popularity on sale", thus negating the need for a user-defined sort.
Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.
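A sketch of that trick against the question's table (the values are hypothetical):
-- number rows 10 apart when they are first created
INSERT INTO Product (Id, [Order]) VALUES (8, 10), (26, 20), (5, 30);
-- moving Id = 5 between the other two requires no renumbering
UPDATE Product SET [Order] = 15 WHERE Id = 5;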
Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE @id int = 26
DECLARE @sortOrder int = 3
;WITH Sorted AS (
SELECT Id,
ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
FROM Product
WHERE Id <> @id
)
UPDATE p
SET p.SortOrder =
(CASE
WHEN p.Id = @id THEN @sortOrder
WHEN s.RowNumber >= @sortOrder THEN s.RowNumber + 1
ELSE s.RowNumber
END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id
It is very simple: you need to leave a "cardinality hole".
Structure: you need to have 2 columns:
pk = 32-bit int
order = 64-bit bigint (BIGINT, NOT DOUBLE!!!)
Insert/Update:
When you insert the first record, set order = round(max_bigint / 2).
If you insert at the beginning of the table, set order = round("order of first record" / 2).
If you insert at the end of the table, set order = round(("order of last record" + max_bigint) / 2).
If you insert in the middle, set order = round(("order of record before" + "order of record after") / 2); for example, inserting between rows with order 1000 and 2000 gives (1000 + 2000) / 2 = 1500.
This method leaves very large gaps. If you hit a constraint error, or you see the gaps getting small, you can rebuild (normalize) the order column.
Even in the worst case, after normalizing, this structure still leaves a "cardinality hole" of about 32 bits between neighbours.
It is very simple and fast!
Remember NO DOUBLE!!! Only integers - order must be an exact value!
One solution I have used in the past, with some success, is to use a 'weight' instead of an 'order'. Weight behaves as the name suggests: the heavier an item (i.e. the lower the number), the further it sinks to the bottom; the lighter (the higher the number), the more it rises to the top.
In the event I have multiple items with the same weight, I assume they are of equal importance and order them alphabetically.
This means your SQL will look something like this:
ORDER BY Weight, ItemName
hope that helps.
I am currently developing a database with a tree structure that needs to be ordered. I use a linked-list kind of method, where the ordering is resolved on the client (not in the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in postgresql. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing
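For the 'Next' column idea raised in the question, such a recursive query might look roughly like this (a sketch, assuming a hypothetical NextId column that is NULL for the last item):
WITH Chain AS (
SELECT p.Id, p.NextId, 1 AS Position
FROM Product p
WHERE NOT EXISTS (SELECT 1 FROM Product x WHERE x.NextId = p.Id) -- head of the list
UNION ALL
SELECT p.Id, p.NextId, c.Position + 1
FROM Product p
JOIN Chain c ON p.Id = c.NextId
)
SELECT Id, Position FROM Chain ORDER BY Position;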
