Loop a regression with increasing sample sizes

I am still learning Stata and am not sure how to get this to work. I need to run a regression over increasing sample sizes. I know how to run it for a specific sample size:
reg y x1 x2 in 1/10
but I need to run it for sample size 10, then 11, then 12, and so on up to 1000. I tried the following:
foreach var in varlist x1 x2 {
reg y x1 x2 in 10/_n+1
}
but that did not work. How do I get it to loop the regression increasing sample size by 1 each time?

forvalues i = 10/1000 {
    reg y x1 x2 if _n <= `i'
}
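If you also want to keep the results from each pass rather than scroll through the output, a minimal sketch along the same lines (variable names y, x1, x2 as in the question; an illustration, not tested against your data) posts the sample size and slope estimates to a new dataset:
* sketch: collect the sample size and slope estimates from each regression
tempname memhold
tempfile results
postfile `memhold' n b_x1 b_x2 using `results'
forvalues i = 10/1000 {
    quietly regress y x1 x2 if _n <= `i'
    post `memhold' (`i') (_b[x1]) (_b[x2])
}
postclose `memhold'
use `results', clear
The saved dataset then holds one row per sample size, ready for listing, graphing, or further analysis.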

The first answer does exactly what you asked, but you then have the output from (in your example) 991 separate regressions to process. The next question is how to select what you want for further analysis. You could check out the official command rolling, which provides various machinery for this, or rangestat from SSC. Here's a token reproducible example with a listing of results. The dataset in question is panel data; note that an option like by(company) would insist on separate regressions for each panel. If you wanted more or different results to be kept, there are related commands.
. webuse grunfeld
. rangestat (reg) invest mvalue, int(year . 0)
+------------------------------------------------------------------------------------------+
| year reg_nobs reg_r2 reg_adj~2 b_mvalue b_cons se_mvalue se_cons |
|------------------------------------------------------------------------------------------|
| 1935 10 .86526037 .84841791 .10253446 .20584135 .01430537 16.364981 |
| 1936 20 .75028331 .73641016 .0893724 7.3202446 .01215286 17.885845 |
| 1937 30 .70628424 .69579439 .08388549 11.163111 .01022308 17.600837 |
| 1938 40 .69788155 .68993107 .08333349 10.545486 .00889458 14.396864 |
| 1939 50 .70399484 .69782807 .08056278 9.3450069 .00754013 12.312899 |
|------------------------------------------------------------------------------------------|
| 1940 60 .72557587 .72084442 .0842278 7.6507759 .0068016 11.294786 |
| 1941 70 .73233666 .72840043 .08992373 7.496037 .00659263 11.034693 |
| 1942 80 .72656572 .72306015 .094125 7.7007269 .00653803 10.706012 |
| 1943 90 .73520447 .73219543 .0968378 6.7626664 .00619519 10.093819 |
| 1944 100 .74792336 .74535115 .09900035 6.0220131 .00580579 9.4585067 |
|------------------------------------------------------------------------------------------|
| 1945 110 .76426375 .762081 .1001161 5.3512756 .00535037 8.8046084 |
| 1946 120 .77316485 .77124251 .10424112 3.9977716 .00519777 8.6559782 |
| 1947 130 .76829138 .76648116 .10701191 4.703102 .0051944 8.549975 |
| 1948 140 .75348635 .75170002 .10927737 6.1833536 .00532076 8.6413437 |
| 1949 150 .75420863 .75254788 .11128353 6.3261435 .00522201 8.4080085 |
|------------------------------------------------------------------------------------------|
| 1950 160 .7520656 .75049639 .114046 5.7698694 .00520945 8.3379321 |
| 1951 170 .75387998 .75241498 .11796668 5.0385173 .00520028 8.4011668 |
| 1952 180 .74822014 .74680565 .12304588 3.702257 .00534999 8.728181 |
| 1953 190 .74683845 .74549185 .1322075 -1.296652 .00561387 9.387601 |
| 1954 200 .734334 .73299225 .14138597 -6.9762842 .00604359 10.272724 |
+------------------------------------------------------------------------------------------+

Related

Sybase 15.7: Aggregating data in a large table

We have a fairly simple stored procedure that aggregates all the data in a large table (and then puts the results in another table). For a 5M-row table this process takes around 4 minutes, which is perfectly reasonable, but for a 13M-row table the same process takes around 60 minutes.
It looks like we are crossing some kind of threshold and I am struggling to find a simple workaround. Manually rewriting this as several "threads", each aggregating a small portion of the table, results in a reasonable run time of 10-15 minutes.
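For reference, the manual "threads" workaround mentioned above amounts to something like the following sketch, with one statement per key range and a final pass to combine the partial results (all table and column names here are placeholders, not the actual procedure):
-- Hypothetical slice: aggregate one key range at a time; the bounds change per slice.
INSERT INTO AggStage (grp_key, total_amount, row_cnt)
SELECT grp_key, SUM(amount), COUNT(*)
FROM TableName
WHERE row_id >= 1 AND row_id < 5000000
GROUP BY grp_key

-- Final pass: combine the partial aggregates into the target table.
INSERT INTO #AggTableName (grp_key, total_amount, row_cnt)
SELECT grp_key, SUM(total_amount), SUM(row_cnt)
FROM AggStage
GROUP BY grp_key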
Is there a way to see the actual bottleneck here?
Update: The query plans are of course identical for both queries and they look like this:
|ROOT:EMIT Operator (VA = 3)
|
| |INSERT Operator (VA = 2)
| | The update mode is direct.
| |
| | |HASH VECTOR AGGREGATE Operator (VA = 1)
| | | GROUP BY
| | | Evaluate Grouped SUM OR AVERAGE AGGREGATE.
| | | Evaluate Grouped COUNT AGGREGATE.
| | | Evaluate Grouped SUM OR AVERAGE AGGREGATE.
[---//---]
| | | Evaluate Grouped SUM OR AVERAGE AGGREGATE.
| | | Using Worktable1 for internal storage.
| | | Key Count: 10
| | |
| | | |SCAN Operator (VA = 0)
| | | | FROM TABLE
| | | | TableName
| | | | Table Scan.
| | | | Forward Scan.
| | | | Positioning at start of table.
| | | | Using I/O Size 16 Kbytes for data pages.
| | | | With MRU Buffer Replacement Strategy for data pages.
| |
| | TO TABLE
| | #AggTableName
| | Using I/O Size 2 Kbytes for data pages.
Update 2:
Some statistics:
                   5M rows      13M rows
CPUTime            263,350     1,180,700
WaitTime           577,574     1,927,399
PhysicalReads    2,304,977    13,704,583
LogicalReads    11,479,123    27,911,085
PagesRead        5,550,737    19,518,030
PhysicalWrites     131,924     5,557,143
PagesWritten       263,640     6,103,708
Update 3:
set statistics io,time on results reformatted for easier comparison:
5M rows 13M rows
+-------------------------+-----------+------------+
| #AggTableName | | |
| logical reads regular | 81 114 | 248 961 |
| apf | 0 | 0 |
| total | 81 114 | 248 961 |
| physical reads regular | 0 | 2 |
| apf | 0 | 0 |
| total | 0 | 2 |
| apf IOs | 0 | 0 |
+-------------------------+-----------+------------+
| Worktable1 | | |
| logical reads regular | 1 924 136 | 8 200 130 |
| apf | 0 | 0 |
| total | 1 924 136 | 8 200 130 |
| physical reads regular | 1 621 916 | 11 906 846 |
| apf | 0 | 0 |
| total | 1 621 916 | 11 906 846 |
| apf IOs | 0 | 0 |
+-------------------------+-----------+------------+
| TableName | | |
| logical reads regular | 5 651 318 | 13 921 342 |
| apf | 52 | 20 |
| total | 5 651 370 | 13 921 362 |
| physical reads regular | 38 207 | 345 156 |
| apf | 820 646 | 1 768 064 |
| total | 858 853 | 2 113 220 |
| apf IOs | 819 670 | 1 754 171 |
+-------------------------+-----------+------------+
| Total writes | 0 | 5 675 198 |
| Execution time | 4 339 | 18 678 |
| CPU time | 211 321 | 930 657 |
| Elapsed time | 316 396 | 2 719 108 |
+-------------------------+-----------+------------+
5M rows:
Total writes for this command: 0
Execution Time 0.
Adaptive Server cpu time: 0 ms. Adaptive Server elapsed time: 0 ms.
Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Table: #AggTableName scan count 0, logical reads: (regular=81114 apf=0 total=81114), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Table: Worktable1 scan count 1, logical reads: (regular=1924136 apf=0 total=1924136), physical reads: (regular=1621916 apf=0 total=1621916), apf IOs used=0
Table: TableName scan count 1, logical reads: (regular=5651318 apf=52 total=5651370), physical reads: (regular=38207 apf=820646 total=858853), apf IOs used=819670
Total writes for this command: 0
Execution Time 4339.
Adaptive Server cpu time: 211321 ms. Adaptive Server elapsed time: 316396 ms.
13M rows:
Total writes for this command: 0
Execution Time 0.
Adaptive Server cpu time: 0 ms. Adaptive Server elapsed time: 0 ms.
Parse and Compile Time 1.
Adaptive Server cpu time: 50 ms.
Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Table: #AggTableName scan count 0, logical reads: (regular=248961 apf=0 total=248961), physical reads: (regular=2 apf=0 total=2), apf IOs used=0
Table: Worktable1 scan count 1, logical reads: (regular=8200130 apf=0 total=8200130), physical reads: (regular=11906846 apf=0 total=11906846), apf IOs used=0
Table: TableName scan count 1, logical reads: (regular=13921342 apf=20 total=13921362), physical reads: (regular=345156 apf=1768064 total=2113220), apf IOs used=1754171
Total writes for this command: 5675198
Execution Time 18678.
Adaptive Server cpu time: 930657 ms. Adaptive Server elapsed time: 2719108 ms.

Database schema for product prices which can change daily

I am looking for suggestions to design a database schema based on following requirements.
1. A product can have variants
2. Each variant can have a different price
3. The price can differ for specific days of the week
4. The price can differ for specific times of the day
5. All prices are valid for specific dates only
6. Prices can be defined for peak, high, or medium seasons
7. Suppliers offering a product can define their own prices, and the above rules still apply
What would be the best possible schema where the data is easy to retrieve without impacting performance?
Thanks in advance for your suggestions.
Regards
Harmeet
SO isn't really a forum for suggestions as much as answers. No answer anyone gives can be definitively correct. With that said, I would keep the tables as granular as possible to allow for easy changes across products.
Regarding #5, I would place start and end dates on products. If the price is no longer valid, the product should no longer be available.
The design below includes relations for different seasonal prices; however, you would either need to hardcode the seasons or create another table to define them.
For prices, if this covers more than one region you may want a regions table, in which case a currency column would be appropriate.
This is operational data, not temporal data. If you want it available for historical analysis of pricing you would need to create temporal tables as well.
Product Table
+-------------+-----------+------------------+----------------+
| ProductName | ProductID | ProductStartDate | ProductEndDate |
+-------------+-----------+------------------+----------------+
| Product1 | 1 | 01/01/2017 | 01/01/2018 |
| Product2 | 2 | 01/01/2017 | 01/01/2018 |
+-------------+-----------+------------------+----------------+
Variant Table
+-----------+-----------+-------------+---------------+-------------+-------------+
| ProductID | VariantID | VariantName | NormalPriceID | HighPriceID | PeakPriceID |
+-----------+-----------+-------------+---------------+-------------+-------------+
| 1 | 1 | Blue | 1 | 3 | 5 |
| 1 | 2 | Black | 2 | 4 | 5 |
+-----------+-----------+-------------+---------------+-------------+-------------+
Price Table
+---------+-----+-----+-----+-----+-----+-----+-----+
| PriceID | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
+---------+-----+-----+-----+-----+-----+-----+-----+
| 1 | 30 | 30 | 30 | 30 | 35 | 35 | 35 |
| 2 | 35 | 35 | 35 | 35 | 40 | 40 | 40 |
| 3 | 33 | 33 | 33 | 33 | 39 | 39 | 39 |
| 4 | 38 | 38 | 38 | 38 | 44 | 44 | 44 |
| 5 | 40 | 40 | 40 | 40 | 50 | 50 | 50 |
+---------+-----+-----+-----+-----+-----+-----+-----+
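If it helps to see the same design as DDL, here is a minimal sketch of the three tables above (column types, lengths, and keys are assumptions added for illustration):
-- Sketch only: data types, lengths, and keys are assumptions.
CREATE TABLE Product (
    ProductID        INT          NOT NULL PRIMARY KEY,
    ProductName      VARCHAR(100) NOT NULL,
    ProductStartDate DATE         NOT NULL,
    ProductEndDate   DATE         NOT NULL
);

CREATE TABLE Price (
    PriceID INT NOT NULL PRIMARY KEY,
    Mon DECIMAL(10,2), Tue DECIMAL(10,2), Wed DECIMAL(10,2), Thu DECIMAL(10,2),
    Fri DECIMAL(10,2), Sat DECIMAL(10,2), Sun DECIMAL(10,2)
);

CREATE TABLE Variant (
    ProductID     INT          NOT NULL REFERENCES Product (ProductID),
    VariantID     INT          NOT NULL,
    VariantName   VARCHAR(100) NOT NULL,
    NormalPriceID INT          REFERENCES Price (PriceID),
    HighPriceID   INT          REFERENCES Price (PriceID),
    PeakPriceID   INT          REFERENCES Price (PriceID),
    PRIMARY KEY (ProductID, VariantID)
);
Further seasonal price columns, or a separate Season table, would follow the same pattern.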

SQL Server : Islands And Gaps

I'm struggling with an "Islands and Gaps" issue. This is for SQL Server 2008 / 2012 (we have databases on both).
I have a table which tracks "available" Serial-#'s for a Pass Outlet; i.e., Bus Passes, Admission Tickets, Disneyland Tickets, etc. Those Serial-#'s are VARCHAR, and can be any combination of numbers and characters... any length, up to the max value of the defined column... which is VARCHAR(30). And this is where I'm mightily struggling with the syntax/design of a VIEW.
The table (IM_SER) which contains all this data has a primary key consisting of:
ITEM_NO...VARCHAR(20),
SERIAL_NO...VARCHAR(30)
In many cases... particularly with different types of the "Bus Passes" involved, those Serial-#'s could easily track into the TENS of THOUSANDS. What is needed... is a simple view in SQL Server... which simply outputs the CONSECUTIVE RANGES of Available Serial-#'s...until a GAP is found (i.e. a BREAK in the sequences). For example, say we have the following Serial-#'s on hand, for a given Item-#:
123
124
125
139
140
ABC123
ABC124
ABC126
XYZ240003
XYY240004
In my example above, the output would be displayed as follows:
123 -to- 125
139 -to- 140
ABC123 -to- ABC124
ABC126 -to- ABC126
XYZ240003 to XYZ240004
In total, there would be 10 Serial-#'s... but since we're outputting the sequential ranges... only 5 lines of output would be necessary. Does this make sense? Please let me know... and, again, THANK YOU!... Mark
This should get you started (note that LAG and LEAD require SQL Server 2012 or later)... the fun part will be determining whether there are gaps. You will have to handle each serial format a little differently to decide whether two values are actually consecutive...
select x.item_no, x.s_format, x.s_length, x.serial_no,
       LAG(x.serial_no) OVER (PARTITION BY x.item_no, x.s_format, x.s_length
                              ORDER BY x.item_no, x.s_format, x.s_length, x.serial_no) PreviousValue,
       LEAD(x.serial_no) OVER (PARTITION BY x.item_no, x.s_format, x.s_length
                               ORDER BY x.item_no, x.s_format, x.s_length, x.serial_no) NextValue
from (
      select item_no, serial_no,
             len(serial_no) as S_LENGTH,
             case
                 WHEN PATINDEX('%[0-9]%', serial_no) > 0 AND
                      PATINDEX('%[a-z]%', serial_no) = 0 THEN 'NUMERIC'
                 WHEN PATINDEX('%[0-9]%', serial_no) > 0 AND
                      PATINDEX('%[a-z]%', serial_no) > 0 THEN 'ALPHANUMERIC'
                 ELSE 'ALPHA'
             end as S_FORMAT
      from table1
     ) x
order by item_no, s_format, s_length, serial_no
http://sqlfiddle.com/#!3/5636e2/7
| item_no | s_format | s_length | serial_no | PreviousValue | NextValue |
|---------|--------------|----------|-----------|---------------|-----------|
| 1 | ALPHA | 4 | ABCD | (null) | ABCF |
| 1 | ALPHA | 4 | ABCF | ABCD | (null) |
| 1 | ALPHANUMERIC | 6 | ABC123 | (null) | ABC124 |
| 1 | ALPHANUMERIC | 6 | ABC124 | ABC123 | ABC126 |
| 1 | ALPHANUMERIC | 6 | ABC126 | ABC124 | (null) |
| 1 | ALPHANUMERIC | 9 | XYY240004 | (null) | XYZ240003 |
| 1 | ALPHANUMERIC | 9 | XYZ240003 | XYY240004 | (null) |
| 1 | NUMERIC | 3 | 123 | (null) | 124 |
| 1 | NUMERIC | 3 | 124 | 123 | 125 |
| 1 | NUMERIC | 3 | 125 | 124 | 139 |
| 1 | NUMERIC | 3 | 139 | 125 | 140 |
| 1 | NUMERIC | 3 | 140 | 139 | (null) |
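Building on that, here is a hedged sketch of the next step for the NUMERIC group only, using the usual gaps-and-islands trick of subtracting ROW_NUMBER() from the serial value (this assumes the numeric serials fit in a bigint; the ALPHA and ALPHANUMERIC groups need their own notion of "consecutive"):
-- Sketch: collapse purely numeric serials into consecutive ranges.
SELECT item_no,
       MIN(CAST(serial_no AS BIGINT)) AS range_start,
       MAX(CAST(serial_no AS BIGINT)) AS range_end
FROM (
      SELECT item_no, serial_no,
             CAST(serial_no AS BIGINT)
               - ROW_NUMBER() OVER (PARTITION BY item_no
                                    ORDER BY CAST(serial_no AS BIGINT)) AS grp
      FROM table1
      WHERE serial_no NOT LIKE '%[^0-9]%'   -- keep numeric serials only
     ) x
GROUP BY item_no, grp
ORDER BY item_no, range_start;
For the sample data this would return 123-125 and 139-140 as two rows for the item.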

Aggregate JSON arrays with column values as keys

I've got a scenario where I'm trying to aggregate data and insert that aggregated data into another table, all from inside of a function. The data is being inserted into the other table as arrays and JSON. I've been able to aggregate into arrays perfectly fine, but I'm running into some trouble trying to aggregate the data into JSON the way I want.
Basically here is a sample of the data I'm aggregating:
id_1 | id_2 | cat_ids_array
------+------+---------------
201 | 4232 | {9,10,11,13}
201 | 4236 | {11}
201 | 4249 | {12}
201 | 4251 | {9,10}
202 | 4245 | {11}
202 | 4249 | {12}
202 | 4251 | {9,10}
202 | 4259 | {9}
203 | 4232 | {9,10,11,13}
203 | 4236 | {11}
203 | 4249 | {12}
203 | 4251 | {9,10}
203 | 4377 | {14}
204 | 4232 | {15,108}
204 | 4236 | {15}
205 | 4232 | {17,109}
205 | 4245 | {17}
205 | 4377 | {18}
206 | 4253 | {20}
When I use json_agg() to aggregate the id_2 and cat_ids_array into a JSON string, here is what I get:
id_1 | json_agg
------+----------------------------------
201 | [{"f1":4232,"f2":[9,10,11,13]}, +
| {"f1":4236,"f2":[11]}, +
| {"f1":4249,"f2":[12]}, +
| {"f1":4251,"f2":[9,10]}]
202 | [{"f1":4245,"f2":[11]}, +
| {"f1":4249,"f2":[12]}, +
| {"f1":4251,"f2":[9,10]}, +
| {"f1":4259,"f2":[9]}]
203 | [{"f1":4232,"f2":[9,10,11,13]}, +
| {"f1":4236,"f2":[11]}, +
| {"f1":4249,"f2":[12]}, +
| {"f1":4251,"f2":[9,10]} +
| {"f1":4377,"f2":[14]}]
204 | [{"f1":4232,"f2":[15,108]}, +
| {"f1":4236,"f2":[15]}]
205 | [{"f1":4232,"f2":[17,109]}, +
| {"f1":4245,"f2":[17]}, +
| {"f1":4377,"f2":[18]}]
206 | [{"f1":4253,"f2":[20]}]
Here is what I'm trying to get:
id_1 | json_agg
------+-------------------------------------------------------------
201 | [{"4232":[9,10,11,13],"4236":[11],"4249":[12],"4251":[9,10]}]
202 | [{"4245":[11],"4249":[12],"4251":[9,10],"4259":[9]}]
203 | [{"4232":[9,10,11,13],"4236":[11],"4249":[12],"4251":[9,10],"4377":[14]}]
204 | [{"4232":[15,108],"4236":[15]}]
205 | [{"4232":[17,109],"4245":[17],"4377":[18]}]
206 | [{"4253":[20]}]
I'm thinking that I will have to do some kind of string concatenation, but I'm not entirely sure the best way to go about this. As stated before, I'm doing this from inside of a function, so I've got some flexibility in what I can do since I'm not limited to just SELECT syntax magic.
Also pertinent, I'm running PostgreSQL 9.3.4 and cannot upgrade to 9.4 in the near future.
It's a pity you cannot upgrade: Postgres 9.4 has jsonb and much added functionality for JSON. In particular, json_build_object() would be perfect for you:
Return multiple columns of the same row as JSON array of objects
Almost, but not quite
While stuck with Postgres 9.3, you can get help from hstore to construct an hstore value with id_2 as key and cat_ids_array as value:
hstore(id_2::text, cat_ids_array::text)
Or:
hstore(id_2::text, array_to_json(cat_ids_array)::text)
Then:
json_agg(hstore(id_2::text, array_to_json(cat_ids_array)::text))
But the array is not recognized as an array: once cast to hstore, it is just a text string to Postgres. There is hstore_to_json_loose(), but it only identifies boolean and numeric types.
Solution
So I ended up with string manipulation, as you predicted. There are various ways to construct the JSON string, each more or less fast / elegant:
format('{"%s":[%s]}', id_2::text, translate(cat_ids_array::text, '{}',''))::json
format('{"%s":%s}', id_2::text, to_json(cat_ids_array))::json
replace(replace(to_json((id_2, cat_ids_array))::text, 'f1":',''),',"f2', '')::json
I picked the second variant, which seems to be the most reliable and also works for array types other than a simple int[], which might need escaping:
SELECT id_1
, json_agg(format('{"%s":%s}', id_2::text, to_json(cat_ids_array))::json)
FROM tbl
GROUP BY 1
ORDER BY 1;
Result as desired.
SQL Fiddle demonstrating all.
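For contrast only, since the question is pinned to 9.3: on Postgres 9.4 or later the aggregation collapses to json_object_agg(), which builds the object directly from key/value pairs (note this yields the plain object per id_1, without the enclosing array shown above):
-- 9.4+ only; shown for comparison, not usable on the 9.3 instance in the question.
SELECT id_1, json_object_agg(id_2::text, cat_ids_array) AS js
FROM   tbl
GROUP  BY 1
ORDER  BY 1;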

Transform ranged data in an Access table

I have a table in Access database as below;
Name | Range | X | Y | Z
------------------------------
A | 100-200 | 1 | 2 | 3
A | 200-300 | 4 | 5 | 6
B | 100-200 | 10 | 11 | 12
B | 200-300 | 13 | 14 | 15
C | 200-300 | 16 | 17 | 18
C | 300-400 | 19 | 20 | 21
I have been trying to write a query that converts this into the following format.
Name | X_100_200 | Y_100_200 | Z_100_200 | X_200_300 | Y_200_300 | Z_200_300 | X_300_400 | Y_300_400 | Z_300_400
A | 1 | 2 | 3 | 4 | 5 | 6 | | |
B | 10 | 11 | 12 | 13 | 14 | 15 | | |
C | | | | 16 | 17 | 18 | 19 | 20 | 21
After trying for a while, the best method I could come up with is to write a bunch of short queries that select the data for each Range and then put them back together using a union query. The problem is that for this example I have shown 3 columns (X, Y and Z), but I actually have many more. Access is starting to strain under the amount of SQL I have come up with.
Is there a better way to achieve this?
The answer was simple: just use the Access pivot view. I am finding it hard to export the results to Excel, though.
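If the pivot view route stalls, an untested query-only sketch is to unpivot X/Y/Z with a union query and then crosstab the result (the table name tblRanges and the saved query name qryUnpivot are made-up placeholders). Step 1, saved as qryUnpivot, turns each of X, Y and Z into its own row:
SELECT [Name], [Range], "X" AS Metric, X AS Val FROM tblRanges
UNION ALL SELECT [Name], [Range], "Y", Y FROM tblRanges
UNION ALL SELECT [Name], [Range], "Z", Z FROM tblRanges;
Step 2 is a crosstab over that saved query, producing one column per metric/range combination such as X_100_200:
TRANSFORM First(Val)
SELECT [Name]
FROM qryUnpivot
GROUP BY [Name]
PIVOT Metric & "_" & Replace([Range], "-", "_");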
