KDB: issue using recursive function with two variables - loops

I have tried to do the following in several different ways but haven't succeeded so far.
I have a list of tables that goes like this:
.rsk.list
extract1.csv | +`date`code etc.
...
I can call the table like this .rsk.list[`extract1.csv]
Or I can raze .rsk.list to get a table of all extracts. Except it's not really a table (it's a mixed list, type 0h)
My goal is to copy some extracts, let's say toload:`extract1.csv`extract5.csv to a second table, let's call it .rsk.list2
I have tried many variations of this
tmp:{.rsk.list[x]} each toload  / which works
{.rsk.list2[x]:y}[each toload;each tmp]  / which does not
If you have any idea how to make the above work or how to cast raze .rsk.list to a table, I'll be forever grateful.

In order to join extract1.csv and extract5.csv effectively, it would help to see what they look like.
When you raze a list of non-conformant tables, it will generate a mixed list (type 0h) of dictionaries,
e.g.:
q).rsk.list:`1.csv`2.csv!(([]a:`a`b`c;b:til 3);([]c:`d`e`f;d:2*til 3))
q).rsk.list[`1.csv]
a b
---
a 0
b 1
c 2
q).rsk.list[`2.csv]
c d
---
d 0
e 2
f 4
q)raze .rsk.list
`a`b!(`a;0)
`a`b!(`b;1)
`a`b!(`c;2)
`c`d!(`d;0)
`c`d!(`e;2)
`c`d!(`f;4)
q)type raze .rsk.list
0h
One method is using uj with over (/), but whether this achieves the desired result depends on your use case:
q)(uj/) .rsk.list[`1.csv`2.csv]
a b c d
-------
a 0
b 1
c 2
    d 0
    e 2
    f 4
This is a table, but not necessarily a useful one.
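As a language-neutral illustration of what the asker is after (copying selected extracts from one keyed collection into another), here is a minimal Python sketch with toy data; all names mirror the question but the values are made up. In q itself, take on a dictionary (roughly `.rsk.list2:toload#.rsk.list`) performs the same selection without the zipped-each that fails above.

```python
# .rsk.list modelled as a dict mapping file names to tables
# (here, tables are just lists of row-dicts; toy data).
rsk_list = {
    "extract1.csv": [{"a": "a", "b": 0}, {"a": "b", "b": 1}],
    "extract5.csv": [{"c": "d", "d": 0}, {"c": "e", "d": 2}],
    "extract9.csv": [{"e": "x"}],
}

toload = ["extract1.csv", "extract5.csv"]

# Copy only the requested extracts into a second collection,
# analogous to q's take on a dictionary (toload#.rsk.list).
rsk_list2 = {k: rsk_list[k] for k in toload}
```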

Related

Alternative solutions to an array search in PostgreSQL

I am not sure if my database design is good for this tricky case, and I would also like help with what the query for this could look like.
I plan a query with the following table:
search_array | value | id
-----------------------+-------+----
{XYa,YZb,WQb} | b | 1
{XYa,YZb,WQb,RSc,QZa} | a | 2
{XYc,YZa} | c | 3
{XYb} | a | 4
{RSa} | c | 5
There are 5 main elements in the search_array: XY, YZ, WQ, RS, QZ, and 3 values: a, b, c that are concatenated to each element.
Each row has also one value: a, b or c.
My aim is to find all rows that fit to a specific row in this sense: At first it should be checked if they have any same main elements in their search_arrays (yellow marked in the example).
As example:
Row id 4 and row id 5 wouldn't match because XY != RS.
Row id 1, 2 and 3 would match two times because they all have XY and YZ.
Row id 1 and 2 would even match three times because they also have WQ in common.
And second: if there is a main element match, it should be 'crosschecked' whether the lowercase letters after the main elements fit the value of the other row.
As example: the only match for row id 1 in the table would be row id 4, because they both search for XY and the lowercase letters after the elements match each value of the two rows.
Another match would be row id 2 and 5, with RS and search c to value c and search a to value a (green and orange marked).
My idea was to cut the search_array elements into two parts in the query with the RIGHT and LEFT string functions. But I don't know how to combine the subqueries for this search.
Or would a completely different solution be faster? Such as splitting the search_array into another table with the columns 'foreign key' (to the main table), 'main element' and 'searched_value'. I am not sure this is the best solution, because the program would constantly have to switch back to the main table to find two rows out of 3 million and compare their searched_values to the values.
Thank you very much for your answers and your time!
You'll have to represent the data in a normalized fashion. I'll do it in a WITH clause, but it would be better to store the data in this fashion to begin with.
WITH unravel AS (
    SELECT t.id, t.value,
           substr(u.val, 1, 2) AS arr_main,
           substr(u.val, 3, 1) AS arr_val
    FROM mytable AS t
        CROSS JOIN LATERAL unnest(t.search_array) AS u(val)
)
SELECT a.id AS first_id,
       a.value AS first_value,
       b.id AS second_id,
       b.value AS second_value,
       a.arr_main AS main_element
FROM unravel AS a
    JOIN unravel AS b
      ON a.arr_main = b.arr_main
     AND a.arr_val = b.value
     AND b.arr_val = a.value;
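To make the normalize-then-self-join logic concrete, here is a hedged Python sketch of the same computation over the question's example data. One assumption I have added: a guard `a_id != b_id`, since without it a row whose own element value matches its own `value` column would pair with itself (the SQL above may likewise want `a.id <> b.id`).

```python
# The question's table: (id, value, search_array)
rows = [
    (1, "b", ["XYa", "YZb", "WQb"]),
    (2, "a", ["XYa", "YZb", "WQb", "RSc", "QZa"]),
    (3, "c", ["XYc", "YZa"]),
    (4, "a", ["XYb"]),
    (5, "c", ["RSa"]),
]

# "unravel": one record per array element, split into the 2-char
# main element and the trailing lowercase value.
unravel = [
    (rid, value, elem[:2], elem[2:])
    for rid, value, arr in rows
    for elem in arr
]

# Self-join: same main element, and each side's element value must
# match the other side's row value (the crosscheck from the question).
matches = sorted(
    {(a_id, b_id)
     for a_id, a_val, a_main, a_arrval in unravel
     for b_id, b_val, b_main, b_arrval in unravel
     if a_main == b_main
     and a_arrval == b_val
     and b_arrval == a_val
     and a_id != b_id}
)
```

With this data, `matches` contains the pairs (1, 4) and (2, 5) from the question, in both directions.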

T-SQL: How to break a column with concatenated string into multiple rows?

I'm working with a dataset where most columns are normal, but one has one or more concatenated values jammed into a single string, using a '|' as a delimiter between values. I need to reshape it so that there's one row per existing row, per concatenated value. There are 60 potential values--that I know of-- in the concatenated string, and most rows have between 0 and 10 values smashed into the string. It's also going to be necessary to repeat this process over the next few months, and it's possible the list will change/ add new members.
I'm going to have to do this on an unknown number of future tables--at least 4 more--so if there's an approach I can easily repurpose it will be MUCH better. Also, I'm using T-SQL, but I could probably bring in R or something if that would help. Any ideas?
If you have a table containing the 60 possible values, you could join to it with tsql something like this:
select table1.id, potentialvalues.value
from table1
inner join potentialvalues
    on charindex('|' + potentialvalues.value + '|', '|' + table1.concatField + '|') > 0
Note: Added the pipes to beginning and end of the concatfield so that it can match the first and last values in the field. So, if your concatfield is something like '1|2|10' on a record it would be able to match '|1|', '|2|' and '|10|'.
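The delimiter-wrapping trick is language-independent; a minimal Python sketch of the same exact-match test (function name is my own):

```python
def contains_value(concat_field: str, value: str) -> bool:
    """Exact membership test on a '|'-delimited string.

    Wrapping both sides in '|' prevents partial matches,
    so '1' does not falsely match inside '10'.
    """
    return "|" + value + "|" in "|" + concat_field + "|"
```

For example, `contains_value("1|2|10", "1")` is true, while a naive `"1" in "1|2|10"` would also wrongly "find" the 1 inside 10.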
In R, you could use dplyr and tidyr functions to expand your rows by separating each combined string at the pipe symbol. This has the advantage that it can be applied to your table without knowing what the piped combinations are in advance.
library(dplyr)
library(tidyr)
separate_rows(df, string, sep = "[|]") %>%
    mutate(string = trimws(string))
The trimws function from base R is used to remove any extra whitespace that may be between your piped string components. Toy test data and results shown below.
Test data
df = data.frame(key = c("A", "B", "C", "D"),
                string = c("Simple", "Piped 1 | Piped 2", "Simple 2", "Piped A1 | Piped A2 | Piped A3"),
                stringsAsFactors = FALSE)
> df
key string
1 A Simple
2 B Piped 1 | Piped 2
3 C Simple 2
4 D Piped A1 | Piped A2 | Piped A3
Result
key string
1 A Simple
2 B Piped 1
3 B Piped 2
4 C Simple 2
5 D Piped A1
6 D Piped A2
7 D Piped A3

Can I set rules for string comparison in SQL? (or do I need to hardcode using CASE WHEN)

I need to make a comparison between ratings at two points in time and indicate whether the change was upwards, downwards, or stayed the same.
For example:
This would be a table with four columns:
ID T0 T0+1 Status
1 AAA AA Lower
2 BB A Higher
3 C C Same
However, this does not work when applying regular string comparison, because in SQL
A<B
B<BBB
I need
A>B
B<BBB
So my order(highest to lowest): AAA,AA,A,BBB,BB,B
SQL order(highest to lowest): BBB,BB,B,AAA,AA,A
Now I have 2 options in mind, but I wonder if someone know a better one:
1) Use CASE WHEN statements for all the possibilities of ratings going up and down (I have more values than indicated above):
CASE WHEN T0 = T0+1 THEN 'Same'
     WHEN T0 = 'AAA' AND T0+1 <> 'AAA' THEN 'Lower'
     ...address all other options for the rating going down
     ELSE 'Higher'
However, this generates a very large number of CASE WHEN statements.
2) My other option requires generating 2 tables. In table 1 I use CASE WHEN statements to assign values/ranks to the ratings.
For example:
CASE WHEN T0 = 'AAA' THEN 6
     WHEN T0 = 'AA'  THEN 5
     WHEN T0 = 'A'   THEN 4
     WHEN T0 = 'BBB' THEN 3
     WHEN T0 = 'BB'  THEN 2
     WHEN T0 = 'B'   THEN 1
END
The same for T0+1.
Then in table 2 I use a regular comparison between column T0 and column T0+1 on the numeric values.
However, I am looking for a solution where I can do it in one table (with as few lines as possible), and ideally never actually show the ranking column.
I think a nested statement would be the best option, but it did not work for me.
Does anybody have suggestions?
I use SQL Server 2008.
If you are using credit ratings, it is very likely that this is not just about AAA > AA or BBB > BB.
Whether you are using one agency or another, it could also be AA+ or Aa1 for long term, F1+ for short term, or something else in different contexts or with other agencies.
It is also often required to convert data from one agency to another agency's rating.
Therefore it is better to use a mapping table such as:
Id | Rating
0 | AAA
1 | AA+
2 | AA
3 | AA-
4 | A+
5 | A
6 | A-
7 | BBB+
Using this table, you only have to join the rating in your data table with the rating in the mapping table:
SELECT d.Rating_T0, d.Rating_T1,
       CASE WHEN d.Rating_T0 = d.Rating_T1 THEN '='
            WHEN m0.id < m1.id THEN '<'
            WHEN m0.id > m1.id THEN '>'
       END
FROM yourData d
INNER JOIN RatingMapping m0
        ON m0.Rating = d.Rating_T0
INNER JOIN RatingMapping m1
        ON m1.Rating = d.Rating_T1
If you only store the rating id in your data table, you will not only save space (1 byte for tinyint versus up to 4 chars) but will also be able to compare without the JOIN to the mapping table.
SELECT d.Rating_Id0, d.Rating_Id1,
       CASE WHEN d.Rating_Id0 = d.Rating_Id1 THEN '='
            WHEN d.Rating_Id0 < d.Rating_Id1 THEN '<'
            WHEN d.Rating_Id0 > d.Rating_Id1 THEN '>'
       END
FROM yourData d
FROM yourData d
The JOIN would only be required when you want to display the actual rating value, such as AAA for Rating_Id = 0.
You could also add an agency_Id to the mapping table. This way, you can easily choose which agency's notation you want to display, and easily convert between agency 1, agency 2 or agency 3 (i.e. Id 1 => S&P, Id 2 => Fitch, Id 3 => ...).
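The mapping-table idea can be sketched outside SQL as well. In this hypothetical Python version the rank numbers follow the question's ordering (highest first: AAA, AA, A, BBB, BB, B, then C); a fuller table would include the +/- notches shown in the answer.

```python
# Assumed rank mapping: lower number = higher rating,
# mirroring the answer's mapping table (id 0 = AAA).
RANK = {"AAA": 0, "AA": 1, "A": 2, "BBB": 3, "BB": 4, "B": 5, "C": 6}

def compare(t0: str, t1: str) -> str:
    """Classify the move from rating t0 to rating t1."""
    if RANK[t0] == RANK[t1]:
        return "Same"
    # A larger rank number at T0+1 means the rating went down.
    return "Lower" if RANK[t1] > RANK[t0] else "Higher"
```

This reproduces the question's example rows: AAA -> AA is "Lower", BB -> A is "Higher", and C -> C is "Same".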

Setting Up a Dynamic Stopping Point for a Loop

Data is setup with a bunch of information corresponding to an ID, which can show-up more than once.
ID Data
1 X
1 Y
2 A
2 B
2 Z
3 X
I want a loop that signifies which instance of the ID I am looking at: is it the first time, second time, etc.? I want it as a string in the form _#, so to my knowledge I have to go beyond the simple _n function in Stata. If someone knows a way to do what I want without the loop, let me know, but I would still like the answer.
I have the following loop in Stata
by ID: gen count_one = _n
gen count_two = ""
quietly forval j = 1/3 {
    replace count_two = "_`j'" if count_one == `j'
}
The output now looks like this:
ID Data count_one count_two
1 X 1 _1
1 Y 2 _2
2 A 1 _1
2 B 2 _2
2 Z 3 _3
3 X 1 _1
The question is how can I replace the 16 above with to tell Stata to take the max of the count_one column because I need to run this weekly and that max will change and I want to reduce errors.
It's hard to understand why you want this, but it is one line whether you want numeric or string:
bysort ID : gen nummax = _N
bysort ID : gen strmax = "_" + string(_N)
Note that the sort order within ID is irrelevant to the number of observations for each.
Some parts of your question aren't clear ("...replace the 16 above with to tell Stata...") but:
Why don't you just use _n with tostring?
gsort +ID +data
bys ID: g count_one=_n
tostring count_one, gen(count_two)
replace count_two="_"+count_two
Then to generate the max (answering the partial question at the end there) -- although note this value will be repeated across instances of each ID value:
bys ID: egen maxcount1=max(count_one)
or more elegantly:
bys ID: g maxcount2=_N
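For readers outside Stata, the within-group numbering (_n) and group size (_N) can be sketched in plain Python; this toy version assumes, like bysort, that the data is already sorted by ID.

```python
from itertools import groupby

# The question's data: (ID, Data), sorted by ID.
data = [(1, "X"), (1, "Y"), (2, "A"), (2, "B"), (2, "Z"), (3, "X")]

rows = []
for _, grp in groupby(data, key=lambda r: r[0]):
    grp = list(grp)
    for n, (id_, val) in enumerate(grp, start=1):
        # n plays the role of _n, len(grp) the role of _N;
        # the "_#" string needs no hard-coded maximum.
        rows.append((id_, val, f"_{n}", len(grp)))
```

Because the group size comes from the data itself, nothing breaks when the weekly maximum count changes.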

Moving Data from a Grid into a Database

I have the following lookup grid
x A B C D
A 0 2 1 1
B 2 0 1 1
C 1 1 0 1
D 1 1 1 0
Think of this as similar to the travelling salesman problem with point-to-point distances, although the algorithm isn't relevant to this problem. It is more like a lookup from A->B.
What would be the best way to store this in a database, given that the time is the same in both directions? A to B is 2, and B to A is 2:
Start End Time
A B 2
A C 1
B A 2
etc
Doing this seems like it would duplicate all the data, which wouldn't be a good design.
Any thoughts on the best way to implement this?
Don't store the duplicate rows. Just do a select like this:
select *
from LookupTable
where (Start = 'A' and End = 'B')
   or (Start = 'B' and End = 'A')
Agree with OrbMan. You may adopt a convention of storing either the upper triangle or the lower triangle, and after loading that triangle from the database just mirror it. Doing this in the database reader and loader should encapsulate/localize the behavior in one place.
Oh, another thing: you should probably use a matrix implementation such that a[i,j] returns a[j,i] if i > j and 0 if i == j. You get the point... Then you only have to save and load the items where i < j.
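The mirrored-lookup convention can be sketched as a small class (names are my own): only the i < j entries are stored, and both directions resolve to the same value.

```python
class SymmetricLookup:
    """Stores only one triangle of a symmetric distance grid."""

    def __init__(self):
        self._data = {}  # keys are ordered pairs (lo, hi)

    def set(self, a, b, time):
        if a == b:
            return  # diagonal is implicitly 0
        self._data[tuple(sorted((a, b)))] = time

    def get(self, a, b):
        if a == b:
            return 0
        # A->B and B->A map to the same stored key.
        return self._data[tuple(sorted((a, b)))]

grid = SymmetricLookup()
grid.set("A", "B", 2)
grid.set("A", "C", 1)
```

Here `grid.get("A", "B")` and `grid.get("B", "A")` return the same stored value, and only one row per pair needs to be persisted.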
