How to index a dataframe with a user-defined string?

Assumptions: I have a Julia DataFrame with a column titled article_id.
Normally, I can declare a DataFrame using some syntax like df = DataFrame(CSV.File(dataFileName; delim = ",")). If I wanted to get the column pertaining to a known attribute, I could do something like df.article_id. I could also index that specific column by doing df."article_id".
However, if I create a string holding the column name, such as str = "article_id", I cannot index the DataFrame via df.str; doing so raises an error. This makes sense, as str itself is not an attribute of the DataFrame, yet the value of str is. How can I index the DataFrame to get the column corresponding to the value of str? I'm looking for some syntax similar to df.valueof(str).
Are there any solutions to this?

From the DataFrames.jl manual's "Getting started" page:
Columns can be directly (i.e. without copying) accessed via df.col, df."col", df[!, :col] or df[!, "col"]. The two latter syntaxes are more flexible as they allow passing a variable holding the name of the column, and not only a literal name.
So you can write df[!, str], and that will be equivalent to df.article_id if str == "article_id".
The Indexing section of the manual goes into even more detail, for when you need more advanced types of indexing or want a deeper understanding of the options.
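A minimal sketch of that, using a small hand-built DataFrame in place of the CSV file:
using DataFrames
df = DataFrame(article_id = [101, 102, 103], title = ["a", "b", "c"])
str = "article_id"
df[!, str]                     # the article_id column, accessed through the variable
df[!, str] == df.article_id    # true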

As an additional reference, when you write:
df.colname
it is equivalent to writing getproperty(df, :colname). Therefore, if you have the column name stored in the str variable, you can write getproperty(df, str).
However, as Sundar R noted, it is usually more convenient to use indexing instead of property access. The two most common patterns are df[!, str], which is equivalent to getproperty(df, str) and gets you the column without copying it, and df[:, str], which gets you a copy of the column.
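For example, with the same hypothetical DataFrame as above, both forms accept the variable, and ! versus : controls whether a copy is made:
getproperty(df, str) === df.article_id   # true - same underlying vector, no copy
df[!, str] === df.article_id             # true - indexing with ! does not copy
df[:, str] === df.article_id             # false - indexing with : returns a copy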

Related

How to split a search string into parts, then check parts against a database

Here's what I'm dealing with:
We have a database of machines and their part lists are specified using strings. For example, one machine might be specified with the string &XXX&YYY-ZZZ, meaning the machine contains parts XXX and YYY and not ZZZ.
We use &XXX to specify that a part exists in a machine, and -XXX to specify that a part does not exist in a machine.
It's also possible that a part is not listed (i.e. not specified whether or not it exists in the machine). For example I might only have &XXX&YYY (ZZZ is not specified).
Additionally, the codes can be in any order, for example I might have &XXX&YYY-ZZZ or &XXX-ZZZ&YYY.
In order to search for machines, I get a string like this: &XXX-YYY/&YYY&ZZZ (/ is an OR operator), meaning "I want to find all machines that either a) contain XXX and do not contain YYY, or b) contain both YYY and ZZZ."
I'm having trouble parsing the string given the variable ordering, the possibility that parts may not be listed, and the handling of the / operator. Note: we use Microsoft 365.
Looking for some suggestions!
When I search for &XXX-YYY/&YYY&ZZZ, I should get the following results:
Machine          Result
&XXX-YYY&ZZZ     TRUE  (because XXX exists and YYY does not exist)
&XXX-YYY-ZZZ     TRUE  (because XXX exists and YYY does not exist)
&XXX&YYY&ZZZ     TRUE  (because YYY exists and ZZZ exists)
&XXX&ZZZ         FALSE (because YYY is specified in the search, but this machine doesn't specify it)
&ZZZ&YYY         TRUE  (showing that parts can be in any order)
You can try it in cell C2 with the following formula:
=LET(query, A2, queries, TEXTSPLIT(query,, "/"), input, B2:B7,
qryNum, ROWS(queries),
SPLIT, LAMBDA(txt,LET(str, SUBSTITUTE(SUBSTITUTE(txt, "&",";1_"),
"-",";0_"), TEXTSPLIT(str,,";",TRUE))),
lkUps, DROP(REDUCE("", queries, LAMBDA(acc,qry, HSTACK(acc, SPLIT(qry)))),,1),
MAP(input, LAMBDA(txt, LET(str, SPLIT(txt),
out, REDUCE("", SEQUENCE(qryNum, 1), LAMBDA(acc,idx,
LET(cols, INDEX(lkUps,,idx), qry, FILTER(cols, cols<>""),
matches, SUM(N(ISNUMBER(XMATCH(str, qry)))),
result, IF(ROWS(qry)=matches,1,0),IF(acc="", result, MAX(acc, result))
))), IF(out=1, TRUE, FALSE)
)))
)
and here is the corresponding output:
Assumptions:
String values (operation and part) should be unique, i.e. the case &XXX-YYY&XXX is not considered, because &XXX is duplicated.
Explanation
The main idea is to transform the input information so that the comparisons can be done at the array level via XMATCH. The first thing to do is to identify each OR condition in the search string, because we need to test each one of them against the Input column. The name queries is an array with all the OR conditions.
We also transform each machine string so that it can be split into an array. SPLIT is a user-defined LAMBDA function that does this:
LAMBDA(txt, LET(str, SUBSTITUTE(SUBSTITUTE(txt, "&",";1_"), "-",";0_"), TEXTSPLIT(str,,";",TRUE)))
What it does is convert, for example, the input &XXX-YYY&ZZZ into the following array:
1_XXX
0_YYY
1_ZZZ
We change the original operators &,- into 1,0 just for convenience; you could keep the original operator characters, as the exact values are irrelevant to the calculation. It is important to set the fourth TEXTSPLIT argument (ignore_empty) to TRUE to ensure no empty rows are generated.
The name lkUps is an array with all the OR conditions organized into one column per query, in the format we want. For example:
1_XXX 1_YYY
0_YYY 1_ZZZ
Note: For creating lkUps we use the DROP/REDUCE/HSTACK pattern; for more information about it, check the answer to the question "how to transform a table in Excel from vertical to horizontal but with different length" provided by @DavidLeal.
Now we have all the elements we need to build the recurrence. We use MAP to iterate over all Input column values. For each element (txt) we transform it into the convenient format via the SPLIT LAMBDA function and name the result str.
We use the REDUCE function inside MAP to iterate over all columns of lkUps and check them against str. We use SEQUENCE(qryNum, 1) as the input to REDUCE so we can iterate over each lkUps column (qry).
Now we are going to use the above variables in XMATCH and name the variable matches as follows:
SUM(N(ISNUMBER(XMATCH(str, qry))))
If all values from qry are found in str, then we have a match. Each element of str that is found in qry contributes 1 to the SUM (via ISNUMBER and N) and the rest contribute 0, so in the match case the SUM equals the number of rows of qry.
Because we include both the parts and the operations (1,0) in the XMATCH, we ensure not just that the same parts are found, but also that their corresponding operations match. The order of the parts is not relevant; XMATCH takes care of that.
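To make the matching step concrete, here is the same check collapsed into a stand-alone formula, with literal arrays standing in for str (machine &XXX-YYY&ZZZ) and qry (condition &XXX-YYY):
=SUM(N(ISNUMBER(XMATCH({"1_XXX";"0_YYY";"1_ZZZ"}, {"1_XXX";"0_YYY"}))))
It returns 2, which equals ROWS(qry), so this machine satisfies that OR condition.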
The REDUCE recurrence keeps the maximum value across iterations (i.e. across the OR conditions), since we only need at least one match among all OR conditions. Therefore, once the recurrence finishes, a REDUCE result of 1 means at least one match was found. Finally, we transform the result into TRUE/FALSE.
Note: For a large list of operators, instead of the approach above with two nested SUBSTITUTE calls, the SPLIT function can be defined as follows:
LAMBDA(txt,tks, LET(seq, SEQUENCE(COLUMNS(tks),1),
out, REDUCE("", seq, LAMBDA(acc,idx, LET(str, IF(acc="", txt, acc),
SUBSTITUTE(str, INDEX(tks,1,idx), INDEX(tks,2,idx))))),
TEXTSPLIT(out,,";",TRUE)))
and the input tks (tokens) can be defined as follows: {"&","-";";1_",";0_"}, i.e. the old values in the first row and the new values in the second row (the new values include the leading ";" separator so TEXTSPLIT can still split on it).
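As a quick sanity check, assuming the alternative LAMBDA above has been saved as SPLIT in the Name Manager, calling it directly should produce the same array as the two-SUBSTITUTE version:
=SPLIT("&XXX-YYY&ZZZ", {"&","-";";1_",";0_"})
which spills 1_XXX, 0_YYY, 1_ZZZ into a column.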

EXCEL 365 - Use wildcard with the lookup_array part of the INDEX MATCH function

I am attempting to use Excel INDEX MATCH with the MATCH lookup_array (refdata!$A$2:refdata!$A$150) containing a string, e.g. 'AMAZON', which might be part of the MATCH lookup_value ($C2), which is a longer string, e.g. 'AMZNMKTPLACE AMAZON.CO AMAZON.CO.UK GBR'.
=INDEX(refdata!$A$2:refdata!$C$150,MATCH($C2,refdata!$A$2:refdata!$A$150,0),3)
Is it possible to have the MATCH lookup_array string values use wildcards (producing '*AMAZON*' from 'AMAZON') and have this then successfully compared to (or found in?) the MATCH lookup_value 'AMZNMKTPLACE AMAZON.CO AMAZON.CO.UK GBR'?
I'm not sure I understand your structure, but this might work:
=INDEX(FILTER(refdata!$A$2:refdata!$C$150,ISNUMBER(FIND(refdata!$A$2:refdata!$A$150,$C2))),,3)
It looks for each Column A string (such as AMAZON) inside $C2 and filters to the rows where it is found. It will spill if there are multiple matches, so if you don't want that, you can do:
=INDEX(FILTER(refdata!$A$2:refdata!$C$150,ISNUMBER(FIND(refdata!$A$2:refdata!$A$150,C2))),1,3)
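Note that FIND is case-sensitive; if a case-insensitive match is preferred, SEARCH can be swapped in, a sketch of the same idea rather than a different approach:
=INDEX(FILTER(refdata!$A$2:refdata!$C$150,ISNUMBER(SEARCH(refdata!$A$2:refdata!$A$150,$C2))),1,3)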

SQL Server: STRING_SPLIT() result in a computed column

I couldn't find good documentation on this, but I have a table that has a long string as one of its columns. Here's some example data of what it looks like:
Hello:Goodbye:Apple:Orange
Example:Seagull:Cake:Chocolate
I would like to create a new computed column using the STRING_SPLIT() function to return the third value in the string.
Result #1: "Apple"
Result #2: "Cake"
What is the proper syntax to achieve this?
At this time, what you are asking for is not possible.
The output rows might be in any order. The order is not guaranteed to
match the order of the substrings in the input string.
STRING_SPLIT reference
There is no way to guarantee which item was the third item in the list using string_split and the order may change without warning.
If you're willing to build your own, I'd recommend reading up on the work done by
Brent Ozar and Jeff Moden.
You shouldn't be storing data like that in the first place; this points to a potentially serious database design problem. BUT you could convert this string into JSON by replacing : with ",", surrounding it with [" and "], and retrieving the third array element, e.g.:
declare @value nvarchar(200)='Example:Seagull:Cake:Chocolate'
select json_value('["' + replace(@value, ':', '","') + '"]', '$[2]')
The string manipulations convert the string value to:
["Example","Seagull","Cake","Chocolate"]
After that, JSON_VALUE parses the JSON string and retrieves the 3rd item in the array using a JSON PATH expression.
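Tied back to the original question, the same expression can be dropped into a computed column definition. A minimal sketch, assuming a hypothetical table dbo.Items with the delimited string stored in a column named parts:
-- hypothetical table and column names; the expression is the same as above
ALTER TABLE dbo.Items
ADD third_part AS JSON_VALUE('["' + REPLACE(parts, ':', '","') + '"]', '$[2]');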
Needless to say, this will be slow and can't take advantage of indexing. If those values are meant to be read or written individually, they should be stored in separate columns. They'll probably take less space than one long string.
If you have a lot of optional fields but only a subset contain values at any time, you could use sparse columns. This way you could have thousands of columns, only a few of which contain data in any given row.
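For reference, a sparse column is declared like any other column, just with the SPARSE keyword; a hypothetical sketch:
CREATE TABLE dbo.Machines (
    machine_id int PRIMARY KEY,
    part_a varchar(50) SPARSE NULL,  -- NULLs in sparse columns take no storage
    part_b varchar(50) SPARSE NULL
);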

How to compare numeric in PostgreSQL JSONB

I ran into a strange situation while working with the jsonb type.
Expected behavior
Using a short jsonb structure:
{"price": 99.99}
I wrote a query like this:
SELECT * FROM table t WHERE t.data->'price' > 90.90
And it fails with the error operator does not exist: jsonb > numeric; the same happens with the text operator (->>): operator does not exist: text > numeric.
Then I wrote the comparison as mentioned in many resources:
SELECT * FROM table t WHERE (t.data->>'price')::NUMERIC > 90.90
And it works as expected.
What's strange:
SELECT * FROM table t WHERE t.data->'price' > '90.90';
It is a little weird, but the query above works correctly.
EXPLAIN: Filter: ((data -> 'price'::text) > '90.90'::jsonb)
But if I change the jsonb value to text, as in {"price": "99.99"},
there is no result any more - the query returns nothing.
Question: how does PostgreSQL actually compare this data, and what is the preferred way to do this kind of comparison?
But you aren't comparing numeric data, are you.
I can see that you think price contains a number, but it doesn't. It contains a JSON value. That might be a number, or it might be text, or an array, or an object, or an object containing arrays of objects containing...
You might say "but the key is called 'price' of course it is a number" but that's no use to PostgreSQL, particularly if I come along and sneakily insert an object containing arrays of objects containing...1
So - if you want a number to compare against, you need to convert it to a number, (t.data->>'price')::NUMERIC, or convert your target value to JSON and let PostgreSQL do a JSON-based comparison (which might do what you want, or might not - I don't know what the exact rules are for JSON).
1 And that's exactly the sort of thing I would do, even though it is Christmas. I'm a bad person.
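To make the jsonb-vs-jsonb case concrete, here is a small sketch; jsonb ordering ranks JSON numbers above JSON strings, which is why the {"price": "99.99"} variant returns no rows:
SELECT '99.99'::jsonb   > '90.90'::jsonb;  -- true: two JSON numbers, compared numerically
SELECT '"99.99"'::jsonb > '90.90'::jsonb;  -- false: a JSON string sorts below any JSON number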

Create postgresql index on text column casted to array

I have a postgresql table that has a column with data type = 'text' in which I need to create an index which involves this column being type casted to integer[]. However, whenever I try to do so, I get the following error:
ERROR: functions in index expression must be marked IMMUTABLE
Here is the code:
create table test (a integer[], b text);
insert into test values ('{10,20,30}','{40,50,60}');
CREATE INDEX index_test on test USING GIN (( b::integer[] ));
Note that one potential workaround is to create a function that is marked as IMMUTABLE that takes in a column value and performs the type casting within the function, but the problem (aside from adding overhead) is that I have many different 'target' array data types (EG: text[], int2[], int4[], etc...), and it would not be possible to create a separate function for each potential target array data type.
Answered in this thread on the PostgreSQL mailing lists. Click on "Follow-ups" or "next by thread" in the links after the post to follow the (short) thread on the topic.
There's no recipe given there, but Tom's just talking about defining an explicit cast from text[] to integer[]. If time permits I'll flesh this answer out with an example.
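Since that example never materialized, here is a minimal sketch of the workaround the question itself alludes to: a wrapper function marked IMMUTABLE, assuming the text values are always valid integer-array literals:
-- Marking this IMMUTABLE is only safe if b always parses cleanly as integer[]
CREATE FUNCTION text_to_int_array(t text) RETURNS integer[]
    LANGUAGE sql IMMUTABLE STRICT
    AS $$ SELECT t::integer[] $$;
CREATE INDEX index_test ON test USING GIN (text_to_int_array(b));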
