BigQuery ARRAY_TO_STRING based on condition in non-array field - arrays

I have a table that I query like this...
select *
from table
where productId = 'abc123'
Which returns 2 rows (even though the productId is unique) because one of the columns (orderName) is an Array...
**productId, productName, created, featureCount, orderName**
abc123, someProductName, 2020-01-01, 12, someOrderName
, , , , someOtherOrderName
I'm not sure whether the missing values in the 2nd row are empty strings or nulls because of the way the orderName array expands my search results but I want to now run a query like this...
select productName, ARRAY_TO_STRING(orderName,'-')
from table
where productId = 'abc123'
and ifnull(featureCount,0) > 0
But this query returns...
someProductName, someOrderName-someOtherOrderName
i.e. both array values came back even though I specified a condition of featureCount>0.
I'm sure I'm missing something very basic about how Arrays function in BigQuery but from Google's ARRAY_TO_STRING documentation I don't see any way to add a condition to the extracting of ARRAY values. Appreciate any thoughts on the best way to go about this.

As far as I understand, this happens because you are querying a single row of data in which one column is an ARRAY<STRING>. ARRAY_TO_STRING takes that ARRAY<STRING> value and joins all of its elements into a single cell. The condition on featureCount applies to the row as a whole, not to the individual array elements.
So when you run your query, the output does match your criteria; the UI simply displays the array values on additional lines for visibility.
In the UI, the result should look like what you describe in your question:
Row | productId | productName | created | featureCount | orderName
1 | abc123 | someProductName | 2020-01-01 | 12 | someOrderName
 | | | | | someOtherOrderName
Note: In BigQuery this additional line is grayed out. It is part of row 1 and is shown on a separate line only for visibility, so this output contains just 1 row.
As JSON, the same result looks like this:
[
  {
    "productId": "abc123",
    "productName": "someProductName",
    "created": "2020-01-01",
    "featureCount": "12",
    "orderName": [
      "someOrderName",
      "someOtherOrderName"
    ]
  }
]
I don't think there is specific documentation on how arrays are displayed in the UI, but I can share the docs that cover working with arrays and flattening them:
Working with Arrays
Flattening Arrays
I used the following to replicate your issue:
CREATE OR REPLACE TABLE `project-id.dataset.working_table` (
productId STRING,
productName STRING,
created STRING,
featureCount STRING,
orderName ARRAY<STRING>
);
insert into `project-id.dataset.working_table` (productId,productName,created,featureCount,orderName)
values ('abc123','someProductName','2020-01-01','12',['someOrderName','someOtherOrderName']);
insert into `project-id.dataset.working_table` (productId,productName,created,featureCount,orderName)
values ('abc123X','someProductNameX','2020-01-02','15',['someOrderName','someOtherOrderName','someData']);
Output:
Row | productId | productName | created | featureCount | orderName
1 | abc123 | someProductName | 2020-01-01 | 12 | someOrderName
 | | | | | someOtherOrderName
2 | abc123X | someProductNameX | 2020-01-02 | 15 | someOrderName
 | | | | | someOtherOrderName
 | | | | | someData
Note: Table contains 2 rows.
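To make the row-versus-element distinction concrete, here is a minimal sketch in plain Python (a model of the table above, not BigQuery code). The helper names rows_matching and joined_orders are made up for illustration, and featureCount is modelled as an integer. A row-level condition keeps or drops the whole row, array included; to drop individual array values, the condition has to be applied per element, which in BigQuery would mean filtering over UNNEST(orderName) in a subquery before calling ARRAY_TO_STRING.

```python
# Plain-Python model of the two-row table above; not BigQuery code.
# Assumption: featureCount is stored here as an int for simplicity.
table = [
    {"productId": "abc123", "productName": "someProductName",
     "featureCount": 12,
     "orderName": ["someOrderName", "someOtherOrderName"]},
    {"productId": "abc123X", "productName": "someProductNameX",
     "featureCount": 15,
     "orderName": ["someOrderName", "someOtherOrderName", "someData"]},
]


def rows_matching(rows, product_id):
    """Row-level filter: productId matches and featureCount (None -> 0) > 0."""
    return [r for r in rows
            if r["productId"] == product_id and (r["featureCount"] or 0) > 0]


def joined_orders(row, keep=lambda name: True, sep="-"):
    """Join the array values of one row, optionally filtering per element."""
    return sep.join(name for name in row["orderName"] if keep(name))


matched = rows_matching(table, "abc123")
print(len(matched))               # number of rows that passed the row filter
print(joined_orders(matched[0]))  # every array value of that row is joined
print(joined_orders(matched[0], keep=lambda n: n != "someOtherOrderName"))
```

The last call shows the element-level filter: only the array values that pass the keep predicate end up in the joined string.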

Related

Google Sheets - Flattening a table into two columns, but returning results based on varying conditions

My table has data similar to the following:
Raw Data
I am flattening the data into two columns using the following:
=INDEX(QUERY(SPLIT(FLATTEN(IF(SheetName!B1:D=TRUE, SheetName!B1:D1&"×"&SheetName!A1:A,)), "×"),"where Col2 is not null order by Col1 asc"))
The result is depicted here:
Flattened Data
However, I also need to return within the "flattened data" columns the following:
1@email.com | 1@email.com
2@email.com | 2@email.com
3@email.com | 3@email.com
In other words, I need to return SheetName!A1:A&"×"&SheetName!A1:A for each email address contained in the email column (Column A), in addition to the other data that is being flattened into the column. I have tried variations using IF/IFS statements, wildcards (not permitted within IFs), etc. However, I am now seeking some help after striking out many times. Thanks for any help you can offer!
Update: example spreadsheet with sample data and current vs. desired results (formula in cell H2):
https://docs.google.com/spreadsheets/d/1wq9kR4UqYeWsqSCbnkGGHbR-TIHTfaBb_ixGk41X2cs/edit?usp=sharing
perhaps:
=INDEX(QUERY(SPLIT(FLATTEN(IF(SheetName!B1:D=TRUE,
SheetName!B1:D1&"×"&SheetName!A1:A&"×"&SheetName!A1:A,)), "×"),
"where Col3 is not null order by Col1 asc"))
update:
=INDEX({QUERY(SPLIT(FLATTEN(
IF(Sheet1!B1:F=TRUE, Sheet1!B1:F1&"×"&Sheet1!A1:A,)), "×"),
"where Col2 is not null order by Col1");
TEXT(UNIQUE(FILTER(A2:A, A2:A<>"")), {"@", "@"})})

SSRS multiple single cells

I am trying to figure out best way to add multiple fields to a SSRS report.
Report has some plots and tablix which are populated from queries but now I have been asked to add a table with ~20 values. The problem is that I need to have them in a specific order/layout (that I cannot obtain by sorting) and they might need to have a description added above which will be static text (not from the DB).
I would like to avoid a situation where I keep 20 copies of the same query, each returning a single cell, where the only difference would be in:
WHERE myTable.partID = xxxx
Any chance I could keep a single query which takes that string like a parameter which I could specify somehow via expression or by any other means?
Not a classical SSRS parameter as I need a different one for each cell...
Or will I need to create 20 queries to fetch all those single values and then put them as separate textfields on the report?
When I've done this in the past, I build a single query that gets all the data I need with some kind of key.
For example I might have a list of captions and values, one per row, that I need to display as part of a report page. The dataset query might look something like ...
-- [Key] is bracketed because KEY is a reserved word in T-SQL
DECLARE @t TABLE([Key] varchar(20), Amount float, Caption varchar(100))
INSERT INTO @t
SELECT 'TotalSales', SUM(Amount), NULL AS Caption FROM myTable WHERE CountryID = @CountryID
UNION
SELECT 'Currency', NULL, CurrencyCode FROM myCurrencyTable WHERE CountryID = @CountryID
UNION
SELECT 'Population', Population, NULL FROM myPopualtionTable WHERE CountryID = @CountryID
-- one row per key; each report cell picks its value by key
SELECT * FROM @t
The resulting dataset would look like this.
Key Amount Caption
'TotalSales' 12345 NULL
'Currency' NULL 'GBP'
'Population' 62.3 NULL
Let's say we call this dataset dsStuff; then the expression in each cell/textbox would simply be something like:
=LOOKUP("Population", Fields!Key.Value, Fields!Amount.Value, "dsStuff")
or
=LOOKUP("Currency", Fields!Key.Value, Fields!Caption.Value, "dsStuff")
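The keyed-dataset idea can be sketched outside SSRS as well. The Python below is only an analogy: the lookup helper is hypothetical and simply mimics what the Lookup expression does (return the result field of the first row whose key matches). It shows one result set, one row per key, with each cell picking its value by key.

```python
# One result set with a key per row, mirroring the dsStuff example above.
ds_stuff = [
    {"Key": "TotalSales", "Amount": 12345, "Caption": None},
    {"Key": "Currency", "Amount": None, "Caption": "GBP"},
    {"Key": "Population", "Amount": 62.3, "Caption": None},
]


def lookup(key, dataset, result_field):
    """Return result_field from the first row whose Key matches, else None."""
    for row in dataset:
        if row["Key"] == key:
            return row[result_field]
    return None


# Each "cell" on the report asks for exactly one keyed value.
print(lookup("Population", ds_stuff, "Amount"))
print(lookup("Currency", ds_stuff, "Caption"))
```

The point of the design is that the layout lives in the report (each textbox names its key), while the data comes from a single query.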

How to log or notify when a column is truncated using a LEFT()

As part of our OLAP modeling workflow, we are often truncating fields as upstream data sources have no restrictions or defined data types. A field which should be a 10 character string can sometimes be 50 or 100 characters long if it is a free form user input. I've been told this can cause problems with downstream processes which involve uploads to external sources.
I've been asked to find a way to identify instances in which one or more of these fields is truncated.
How we handle these fields now is something like this:
SELECT
LEFT(FreeResponseField, 10) AS Comment
INTO
dbo.ModeledTable
FROM
dbo.SourceTable
Essentially if the field is greater than 10 characters, who cares, we only take the first 10.
If dbo.SourceTable.FreeResponseField has a length greater than 10, now we want to know somehow (be it a warning/error message or insertion into a log table). We have a lot of tables with a lot of fields, so the above example is a simplification. Identifying just the field in which this occurs and/or the tuple in the table would be helpful to see where these issues are occurring.
Is something like this possible? You can't just compare data types of the source table with the target table, as the source table sets everything to essentially VARCHAR(MAX). The naive approach is to check the length of every single value of every tuple against the defined length in the target table.
The original specifications weren't descriptive, but I've figured out a solution and thought I'd share in case anyone stumbles across this for some reason.
Imagine we have a SourceTable which we are pulling into our model. We have defined zip codes as being of length 5 and addresses as being of length 25. Say we have the following two records:
CustomerID | ZipCode | Address
1 | 90210 | 123 Fake Street
2 | 902106 | 546 Fake Street
Based on our model definitions, there is an error with ZipCode for the record where CustomerID equals 2. We would like to identify both ZipCode as being the problem field and the record where CustomerID equals 2. The following query with a CROSS APPLY does that:
WITH CTE AS (
SELECT
CustomerID,
ZipCodeFlag = IIF(LEN(ZipCode) > 5, 1, 0),
AddressFlag = IIF(Len(Address) > 25, 1, 0),
ZipCode,
Address
FROM
SourceTable
)
SELECT
CustomerID,
TruncatedField,
RawValue
FROM
CTE
CROSS APPLY (
VALUES ('ZipCode', ZipCodeFlag, ZipCode),
('Address', AddressFlag, Address)
) CA(TruncatedField, TruncatedFlag, RawValue)
WHERE
TruncatedFlag = 1
ORDER BY
CustomerID
With the following output:
CustomerID | TruncatedField | RawValue
2 | ZipCode | 902106
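Since there are many tables and fields in practice, the same check can be driven from a list of column limits. Here is a rough sketch using Python's built-in sqlite3 module as a stand-in for SQL Server; SQLite has no CROSS APPLY, so one SELECT per column is combined with UNION ALL instead. The find_truncations helper and the LIMITS mapping are assumptions for illustration, not part of the original solution.

```python
import sqlite3

# Hypothetical target lengths per column, taken from the example above.
LIMITS = {"ZipCode": 5, "Address": 25}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SourceTable (CustomerID INTEGER, ZipCode TEXT, Address TEXT)"
)
conn.executemany(
    "INSERT INTO SourceTable VALUES (?, ?, ?)",
    [(1, "90210", "123 Fake Street"), (2, "902106", "546 Fake Street")],
)


def find_truncations(conn, table, key, limits):
    """Return (key value, field name, raw value) for every over-long value."""
    # Identifiers come from trusted configuration, not user input;
    # the length limits themselves are passed as bound parameters.
    branches = [
        f"SELECT {key}, '{col}' AS TruncatedField, {col} AS RawValue "
        f"FROM {table} WHERE LENGTH({col}) > ?"
        for col in limits
    ]
    sql = " UNION ALL ".join(branches) + f" ORDER BY {key}"
    return conn.execute(sql, list(limits.values())).fetchall()


print(find_truncations(conn, "SourceTable", "CustomerID", LIMITS))
```

The returned rows could just as well be inserted into a log table instead of printed.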

Filter Columns based the condition of a Row

I have the following Matrix in SSRS 2008 R2:
What I would like to do is only show columns where the "FTE" row has a value. (Marked with red circles). I've tried the filter and show/hide options based on expressions in the Column Properties, but I can't figure out how to only reference the rows when "category"="FTE". My data looks like this:
tblPhysicians
employee_id last_name ...
102341145 Smith
123252252 Jones
tblPhysiciansMetrics
id fy period division_name category_name employee_id
123 2014 1 Allergy Overhead 123456
124 2014 1 Allergy Salary 125223
125 2014 1 Allergy FTE 1.0
query
SELECT * FROM
tblPhysicians
INNER JOIN tblPhysicianMetrics
ON tblPhysicians.employee_id = tblPhysicianMetrics.employee_id
WHERE
tblPhysicianMetrics.division_name = @division_name
AND tblPhysicianMetrics.fy = @fy
AND tblPhysicianMetrics.period = @period
Notice that the rows in my Matrix are just the category_name values, so I can't just hide when category_name = "FTE"; that's not really what I want. What I really need is some way of saying, "For rows where category_name = "FTE", if the value is not set, don't show that column". Is this possible?
An alternative would be to not even get those in the query, but as with the filtering of the matrix, if I simply add "AND tblPhysiciansMetrix.category_name = 'FTE'" to the WHERE clause, my entire data set is reduced to only those records where category_name is FTE.
Any help is much appreciated!
Update: added definition of Matrix to help:
You need to set the column visibility with an expression that checks all underlying data in the column for the FTE category.
I have a simple dataset:
And a simple matrix based on this:
Which looks exactly as you expect:
So in the above example, we want the Brown column to be hidden. To do this, use an expression like this for the Column Visibility:
=IIf(Sum(IIf(Fields!category_name.Value = "FTE", 1, 0), "last_name") > 0
, False
, True)
This is effectively counting all the fields in the column (determined by the last_name scope parameter in the Sum expression) - if this is > 0 show the column. Works as required:
CAVEAT: I'm not 100% sure how you are grouping this data in the matrix; it isn't entirely clear from your description. If this doesn't work, let me know why and I'll update my answer accordingly.
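The logic of that visibility expression can be sketched outside SSRS. This is only an illustration in Python with made-up rows (the original dataset is not reproduced here) and a hypothetical hidden_columns helper: count the FTE rows per last_name group and hide the column when the count is zero.

```python
# Hypothetical rows: (last_name, category_name, value).
rows = [
    ("Smith", "FTE", 1.0), ("Smith", "Salary", 1000),
    ("Jones", "FTE", 0.5), ("Jones", "Overhead", 200),
    ("Brown", "Salary", 900), ("Brown", "Overhead", 150),
]


def hidden_columns(rows):
    """Map each last_name to True (hide) when the group has no FTE row."""
    fte_counts = {}
    for last_name, category, _value in rows:
        fte_counts.setdefault(last_name, 0)
        if category == "FTE":
            fte_counts[last_name] += 1
    # Same shape as the expression: a count above 0 means Hidden = False.
    return {name: count == 0 for name, count in fte_counts.items()}


print(hidden_columns(rows))
```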
I've replaced the column employee_id with the name value for my answer, to keep my explanation simple.
Add this nested IIF() to the visibility property of your Matrix's column.
=IIF(Fields!category_Name.Value="FTE" ,IIF( Fields!value.Value >= 0, True,False),false)
It checks both the category ('FTE') and the value in that row's cell.

MS Access row number, specify an index

Is there a way in MS access to return a dataset between a specific index?
So lets say my dataset is:
rank | first_name | age
1 Max 23
2 Bob 40
3 Sid 25
4 Billy 18
5 Sally 19
But I only want to return those records between 'rank' 2 and 4, so my result set is Bob, Sid and Billy. However, rank is not part of the table; it should be generated when the query is run. Why don't I use an autogenerated number? Because if a record is deleted, the numbering will be inconsistent, and what if I wanted the results in reverse?
This is obviously very simple, and the reason I ask is that I am working on a product catalogue and am looking for a more efficient way of paging through the returned dataset. If I only return one page's worth of data from the database, this is obviously going to be quicker than returning a complete set of 3000 records and then having to subselect from that set!
Thanks R.
Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after comment, that rank is not existing in structure:
Select top 100 * from table;
And if you want to choose subsequent results, you can choose the ID of the last record from the first query, say it was ID 101, and use a WHERE clause to get the next 100;
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.
How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.
I think you need to use a correlated subquery to calculate the rank on the fly e.g. I'm guessing the rank is based on name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
The bad news is that the Access data engine is poorly optimized for this kind of query; in my experience, performance will start to degrade noticeably beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows to an ADO recordset then finding the page (then perhaps fabricate smaller ADO recordset consisting of just that page's worth of rows) than it is to perform a correlated subquery to only fetch the number of rows for the page.
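For anyone who wants to try the correlated-subquery ranking without Access, here is a small sketch using Python's built-in sqlite3 module as a stand-in (SQLite is not the Access engine, so timings will not carry over). The page_by_rank helper is hypothetical, and the rank is assumed to be based on first_name, as guessed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (first_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("Max", 23), ("Bob", 40), ("Sid", 25), ("Billy", 18), ("Sally", 19)],
)


def page_by_rank(conn, first, last):
    """Rank rows by first_name via a correlated subquery, then slice by rank."""
    sql = """
        SELECT first_name, age, rnk
        FROM (
            SELECT T1.first_name, T1.age,
                   (SELECT COUNT(*) + 1
                    FROM MyTable AS T2
                    WHERE T1.first_name > T2.first_name) AS rnk
            FROM MyTable AS T1
        )
        WHERE rnk BETWEEN ? AND ?
        ORDER BY rnk
    """
    return conn.execute(sql, (first, last)).fetchall()


# Alphabetical ranks here are Billy=1, Bob=2, Max=3, Sally=4, Sid=5.
print(page_by_rank(conn, 2, 4))
```

Because the rank is computed when the query runs, deleting a row simply shifts the ranks of the rows after it.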
Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for a multi-page presentation. If you can write an order key into the results table, then you can use it something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder in
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASC
If I understand you correctly, there are only first_name and age columns in your table. If this is the case, then there is no way to return Bob, Sid, and Billy with a single query, unless you do something like
SELECT * FROM Table
WHERE FirstName = 'Bob'
OR FirstName = 'Sid'
OR FirstName = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee as to the order that the data will come out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table, and make it an auto-incrementing integer (see the Access documentation), so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you run an update query, or which rows are being deleted when you run a delete query. For example, if you had 2 people named Max, and they were both 23, how would you delete 1 row without deleting the other? If you had an auto-incrementing unique column in there, you could specify the unique ID in your query to delete only one.
[ADDITION]
Upon reading your comment, If you add an autoincrement field, and want to read 3 rows, and you know the ID of the first row you want to read, then you can use "TOP" to read 3 rows.
Assuming your data looks like this
ID | first_name | age
1 Max 23
2 Bob 40
6 Sid 25
8 Billy 18
15 Sally 19
You can query Bob, Sid and Billy with the following query.
SELECT TOP 3 first_name, age
From Table
WHERE ID >= 2
ORDER BY ID
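The same paging pattern can be tried out with Python's built-in sqlite3 module. This is only a sketch: SQLite is used as a stand-in for Access, LIMIT plays the role of TOP, and the People table name and fetch_page helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People (ID INTEGER PRIMARY KEY, first_name TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?)",
    [(1, "Max", 23), (2, "Bob", 40), (6, "Sid", 25),
     (8, "Billy", 18), (15, "Sally", 19)],
)


def fetch_page(conn, start_id, page_size):
    """Return up to page_size rows with ID >= start_id, in ID order."""
    # LIMIT stands in for Access's TOP here.
    return conn.execute(
        "SELECT ID, first_name, age FROM People "
        "WHERE ID >= ? ORDER BY ID LIMIT ?",
        (start_id, page_size),
    ).fetchall()


page = fetch_page(conn, 2, 3)
next_start = page[-1][0] + 1  # the next page starts after the last ID seen
print(page)
print(fetch_page(conn, next_start, 3))
```

Gaps in the ID sequence do not matter, because each page is defined by where the previous one ended rather than by row position.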
