How to query specific information from a JSON column? - arrays

I have searched Stack Overflow to get an answer to my question, but while I found many interesting cases, none of them quite address mine.
I have a column called fields in my data that contains JSON, as shown below:
Row Fields
1 [{"label":"Label 1","key":"label_1","description":"Value of label_1"},{"label":"Label 2","key":"label_2","error":"Something"}]
2 [{"description":"something","label":"Row 1","key":"row_1"},{"label":"Row 2","message":"message_1","key":"row_2"}]
In essence, I have many rows of JSON that contain label and key, plus a bunch of other parameters. From every {}, I want to extract only label and key, and then (optionally, but ideally) expand every label/key pair in every {} into its own row. As a result, I would have the following output:
Row Label Key
1 Label 1 label_1
1 Label 2 label_2
2 Row 1 row_1
2 Row 2 row_2
Please note, the contents of label and key within the JSON can be anything (strings, integers, special characters, a mix of everything, etc.). In addition, key and label can appear anywhere relative to the other parameters within each {}.
Here is the BigQuery SQL dummy data for convenience:
SELECT '1' AS Row, '[{"label":"Label 1","key":"label_1","description":"Value of label_1"},{"label":"Label 2","key":"label_2","error":"Something"}]' AS Fields
UNION ALL
SELECT '2' AS Row, '[{"description":"something","label":"Row 1","key":"row_1"},{"label":"Row 2","message":"message_1","key":"row_2"}]' AS Fields
I first thought of using a REGEX to isolate all the brackets and show only the information with label and key. Then I looked into the BigQuery documentation of JSON functions and got very stuck on the json_path parameters, specifically because their examples don't match my case.

Consider the approach below:
select `row`,
  json_extract_scalar(el, '$.label') label,
  json_extract_scalar(el, '$.key') key
from your_table, unnest(json_extract_array(fields)) el
If applied to the sample data in your question, the output is:
Row Label Key
1 Label 1 label_1
1 Label 2 label_2
2 Row 1 row_1
2 Row 2 row_2
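The same query can also be written with the current standard names for these functions, JSON_VALUE and JSON_QUERY_ARRAY (a sketch against the same dummy data):
-- json_value / json_query_array are the current equivalents of
-- json_extract_scalar / json_extract_array
select `row`,
  json_value(el, '$.label') label,
  json_value(el, '$.key') key
from your_table, unnest(json_query_array(fields)) el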

No other way than dropping the duplicates for "ValueError: Index contains duplicate entries, cannot reshape"?

Hi everyone, this is my first question.
I'm working on a dataset from patients who underwent urine analysis.
Every row refers to a single Patient ID, and every Request ID can refer to different types of urine analysis (aspect, colour, number of erythrocytes, bacteria, and so on).
I've added an image to show my dataset.
I'd like to reshape it so that one request = one row, with all the tests done in the same request on the same row.
After that I want to merge it with another df that I reshaped by Request ID (because the first one was missing a "long result" column, which I downloaded from another software in use in our hospital).
I've tried:
df_pivot = df.pivot(index='Id Richiesta', columns='Nome Analisi Elementare', values='Risultato')
df_pivot.reset_index(inplace=True)
After that I want to do: df_merge = pd.merge(df_pivot, df, how='left', on='Id Richiesta')
I tried this once with another dataset where I had to drop_duplicates for another purpose, and it worked.
But this time I have to analyse all the features.
What can I do? Is there no other way than dropping the duplicates?
Thank you for any help! :)
I've studied my data further and discovered 1 duplicate of bacteria for the same request ID (1 in almost 8 million entries...). I found it with:
df[df[['Id Richiesta', 'Id Analisi Elementare', 'Risultato']].duplicated()]
Then I visualized all the rows referring to that 'Id Richiesta' and kept the last (they were the same).
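For reference, a minimal sketch of this dedupe-then-pivot flow, assuming the column names from the question:
import pandas as pd

# Show the duplicated (request, test, result) rows before dropping anything.
print(df[df[['Id Richiesta', 'Id Analisi Elementare', 'Risultato']].duplicated(keep=False)])

# Keep the last occurrence per (request, test) pair, then pivot as before.
df = df.drop_duplicates(subset=['Id Richiesta', 'Nome Analisi Elementare'], keep='last')
df_pivot = df.pivot(index='Id Richiesta', columns='Nome Analisi Elementare', values='Risultato')
df_pivot.reset_index(inplace=True)
Alternatively, df.pivot_table(index='Id Richiesta', columns='Nome Analisi Elementare', values='Risultato', aggfunc='last') collapses duplicates in a single step, without an explicit drop_duplicates.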
Thank you, and sorry.
Please tell me if I should delete this question.

Array formula is not working for subsequent entries - filling the rest of the entries with respect to the first entry only

I'm using an array formula in a Google Sheet as a response to a Google Form, which has an F field with the first name of a person (say, thomas), a G field with the last name of the person (say, mathew), and an E field with a custom email domain, say "test.org". The expected result is "thomasm_#test.org".
I'm applying this array formula to one of my header fields, which has the header name "UserID".
ArrayFormula(IFS(ROW(A:A)=1, "UserID", LEN(A:A)=0, IFERROR(1/0), LEN(A:A)>0,LOWER(CONCATENATE(SUBSTITUTE(F2," ",""),LEFT(G2,1),"_",E2))))
When I apply this formula, every row gets filled based on the entries I gave in the second row; subsequent entries are not reflected. Please help.
Hi @richtom, welcome to this community!
A few reasons:
The CONCATENATE function will not work inside ARRAYFORMULA the way you're expecting, so avoid it.
Also, in SUBSTITUTE(F2," ",""),LEFT(G2,1),"_",E2 you must use the same range size as at the beginning, i.e. A:A... F:F... G:G. Using F2 will apply F2's value to all rows.
So, I recommend the following:
=arrayformula(if(row(F:F)=row(),"UserID",if(F:F="","",if(G:G="","",if(E:E="","",lower(substitute(F:F & left(G:G,1) & "_#" & E:E," ","")))))))

Snowflake Flatten Query for array

A Snowflake table has one VARIANT column loaded with 3 JSON records. The JSON records are as follows.
{"address":{"City":"Lexington","Address1":"316 Tarrar Springs Rd","Address2":null} {"address":{"City":"Hartford","Address1":"318 Springs Rd","Address2":"319 Springs Rd"} {"address":{"City":"Avon","Address1":"38 Springs Rd","Address2":[{"txtvalue":null},{"txtvalue":"Line 1"},{"Line1":"Line 1"}]}
If you look at the Address2 field in the JSON, the first one holds null, the second a string, and the third an array.
When I execute the flatten query for Address2, since only one record holds an array, I get only the third record exploded. How do I get all the records with their values exploded in a single query?
select data:address:City::string, data:address:Address1::string, value:txtvalue::string
from add1, lateral flatten(input => data:address:Address2);
When I execute the flatten query for Address2, since only one record holds an array, I get only the third record exploded
The default behaviour of the FLATTEN table function in Snowflake is to skip any input rows that cannot be expanded, and the OUTER argument controls this behaviour. Quoting the relevant portion from the documentation link above (emphasis mine):
OUTER => TRUE | FALSE
If FALSE, any input rows that cannot be expanded, either because they cannot be accessed in the path or because they have zero fields or entries, are completely omitted from the output.
If TRUE, exactly one row is generated for zero-row expansions (with NULL in the KEY, INDEX, and VALUE columns).
Default: FALSE
Since your VARIANT data is oddly formed, you'll need to leverage conditional expressions and data type predicates to check if the column in the expanded row is of an ARRAY type, a VARCHAR, or something else, and use the result to emit the right value.
A sample query illustrating all of the above:
SELECT
  t.v:address.City AS city
  , t.v:address.Address1 AS address1
  , CASE
      WHEN IS_ARRAY(t.v:address.Address2) THEN f.value:txtvalue::string
      ELSE t.v:address.Address2::string
    END AS address2
FROM
  add1 t
  , LATERAL FLATTEN(INPUT => t.v:address.Address2, OUTER => TRUE) f;
P.S. Consider standardizing your input at ingest or at the source to reduce your query complexity.
Note: Your data example is inconsistent (the array of objects does not have homogeneous keys), but going by your example query I've assumed that all keys of objects in the array will be named txtvalue.
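If the source cannot be changed, one way to normalize the shape query-side instead (a sketch, assuming every scalar Address2 should be treated as a single txtvalue entry):
SELECT
  t.v:address.City::string AS city
  , t.v:address.Address1::string AS address1
  , f.value:txtvalue::string AS address2
FROM
  add1 t
  , LATERAL FLATTEN(
      -- Wrap non-array values in a one-element array of {"txtvalue": ...}
      -- so every row flattens the same way.
      INPUT => IFF(IS_ARRAY(t.v:address.Address2),
                   t.v:address.Address2,
                   TO_ARRAY(OBJECT_CONSTRUCT('txtvalue', t.v:address.Address2))),
      OUTER => TRUE) f;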

Show UniData SELECT results that are not record keys

I'm looking over some UniData fields for distinct values, but I'm hoping to find a simpler way of doing it. The values aren't keys to anything, so right now I'm selecting the records I'm interested in and extracting the data I need with SAVING UNIQUE. The problem is, in order to see what I have, all I know to do is save it out to a savedlist and then read through the savedlist file I created.
Is there a way to see the contents of a select without running it against a file?
If you just want to visually look over the data, use LIST instead of SELECT.
The general syntax of the command is something like:
LIST filename WITH [criteria] [sort] [attributes | ALL]
So let's say you have a table called questions and want to look over all the authors of questions that use the tag unidata. Your query might look something like:
LIST questions WITH tag = "unidata" BY author author
Note: The second author isn't a mistake; it's the start of the list of attributes you want displayed - in this case just author, but you might want the record id as well, so you could do #ID author instead. Or just use ALL to display everything in each record.
I did BY author here as it makes spotting uniques easier, but you can also use other query features like BREAK.ON to help here as well.
I don't know why I didn't think of it at the time, but I basically needed something like SQL's DISTINCT statement, since I just needed to view the unique values. Replicating DISTINCT in UniData is explained here: https://forum.precisonline.com/index.php?topic=318.0.
The trick is to sort on the values using BY, get a single unique value of each using BREAK-ON, and then suppress everything except those unique values using DET-SUP.
LIST BUILDINGS BY CITY BREAK-ON CITY DET-SUP
CITY.............
Albuquerque
Arlington
Ashland
Clinton
Franklin
Greenville
Madison
Milton
Springfield
Washington

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns: one has the ID and the other the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and >18,000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO, etc.
I am interested in a script that will scan each abstract in table 1 and identify the acronyms present in it that are also present in table 2.
Finally, the program will create a separate table where the first column contains the content of the first column of table 1 (i.e., the ID) together with the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl, or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo-SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, its running time is roughly cubic, as acronyms.index performs a linear search (of our largest array, no less) for every word of every abstract. We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]

index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))

joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
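If you do go the database route, here is a minimal sketch using Python's built-in sqlite3 (the table and column names mirror the pseudo-SQL above, and the data is made up); note the LIKE match is cruder about word boundaries than the split()-based loops:
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE acronym (id INTEGER PRIMARY KEY, value TEXT)")
con.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, abstract TEXT)")
con.executemany("INSERT INTO acronym (value) VALUES (?)", [('ABC',), ('NGF',)])
con.execute("INSERT INTO document (abstract) VALUES (?)",
            ("Document zero discusses the value of ABC in the context of NGF.",))

# Pad the abstract with spaces so '% ACRONYM %' approximates whole-word
# matching (an acronym followed by punctuation will still be missed).
rows = con.execute("""
    SELECT document.id, acronym.id
    FROM acronym JOIN document
      ON ' ' || document.abstract || ' ' LIKE '% ' || acronym.value || ' %'
""").fetchall()
print(rows)  # [(1, 1)] -> document 1 contains acronym 1 ('ABC')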
Thanks a lot for the quick response.
I assume the pseudo-SQL solution is for MySQL etc.; however, it did not work in Microsoft Access.
The second and the third are Python, I assume. Can I feed the acronyms and documents as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
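Regarding the input-file question: yes, both inputs can come from files. A minimal sketch, assuming one acronym per line in acronyms.txt and tab-separated id/abstract pairs in documents.tsv (both file names and formats are assumptions):
# Build the acronym hash index from a file, one acronym per line.
with open('acronyms.txt') as f:
    index = {line.strip(): idx for idx, line in enumerate(f) if line.strip()}

joins = []
# Each line of documents.tsv looks like: "<id>\t<abstract text>"
with open('documents.tsv') as f:
    for line in f:
        doc_id, abstract = line.rstrip('\n').split('\t', 1)
        for word in abstract.split():
            if word in index:
                joins.append((doc_id, index[word]))
print(joins)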
