Indexing JSON arrays of integers using MariaDB?

I'm storing arrays of integers as JSON in a MariaDB (10.3.23) table:
SELECT
tag_list
FROM
dw.final_document
LIMIT 1 ;
Result:
[903, 1258, 1261, 393]
To retrieve entries matching a specific id, this works:
SELECT SQL_NO_CACHE
count(*)
FROM
dw.final_document
where
JSON_CONTAINS(tag_list, 684 ) ;
Result:
+----------+
| count(*) |
+----------+
|     9696 |
+----------+
1 row in set (2.084 sec)
However, performance is not good without an index (2 seconds on a 1M-row table).
The possibility of indexing a specific field of the JSON is well documented (see the Percona post on the subject).
You can add an index on a generated column like this :
ALTER TABLE test_features
ADD COLUMN street VARCHAR(30)
GENERATED ALWAYS AS (json_unquote(json_extract(`feature`,'$.properties.STREET'))) VIRTUAL;
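To make that useful for lookups, you would then index the generated column; a minimal sketch continuing the Percona example (the index name and the 'MARKET' value are mine, for illustration):

ALTER TABLE test_features
    ADD INDEX idx_street (street);

-- equality lookups on the extracted field can now use the index:
SELECT count(*) FROM test_features WHERE street = 'MARKET';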
Is there a similar way to index arrays of integers?
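As far as I know, MariaDB 10.3 has no multi-valued indexes, and a generated column can only extract scalar positions (e.g. json_extract(tag_list, '$[0]')), not "any element". A common workaround — my assumption, not something stated in this thread — is to normalize the tags into an indexed child table:

-- hypothetical companion table; all names are made up
CREATE TABLE final_document_tag (
    document_id BIGINT NOT NULL,
    tag_id      INT    NOT NULL,
    PRIMARY KEY (document_id, tag_id),
    KEY idx_tag (tag_id)
);

-- membership checks then become indexed lookups:
SELECT count(*) FROM final_document_tag WHERE tag_id = 684;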

Related

Use the TQuery.Locate() function to find other than the first match

Locate moves the cursor to the first row matching a specified set of search criteria.
Let's say that q is a TQuery component connected to a database table with two columns, TAG and TAGTEXT. With the following code I get the letter a, but I would like to use the Locate() function to get the letter d.
If q.Locate('TAG', '1', [loPartialKey]) Then
begin
  tag60 := q.FieldByName('TAGTEXT').AsString; // .AsString needed to read the field's value
end;
For example, if I have a table like this:
+-----+---------+
| TAG | TAGTEXT |
+-----+---------+
|  1  |    a    |
|  2  |    b    |
|  3  |    c    |
|  1  |    d    |
|  4  |    e    |
|  1  |    f    |
+-----+---------+
Is it possible to locate the second time the number one occurs in the table?
EDIT
My job is to find a given occurrence of TAG value 1 (which occurrence I need depends on a parameter I receive). I need to iterate through the table and collect the values from all the TAGTEXT fields until the value in the TAG field is 1 again. The number 1 in this case marks the start of a new segment, and everything between two 1s belongs to one segment. Segments do not have to contain the same number of rows. Also, I am not allowed to make any changes to the table.
What I thought I could do is create a counter variable that is increased by one every time a TAG with value 1 comes up. When the counter equals the parameter that represents the occurrence, I know I am in the right segment, and I can iterate through it and get the values I need.
But this might be a slow solution, and I wanted to know if there is a faster one.
You need to be a bit wary of using Locate for a purpose like this, because some TDataSet descendants' implementations of Locate (or the underlying db-access layer) construct a temporary index on the dataset, which may be discarded immediately afterwards, so repeatedly calling Locate to iterate the rows of a given segment may be a lot less efficient than one might expect.
Also, TClientDataSet constructs, uses and then discards an expression parser for each invocation of Locate (in its internal call to LocateRecord), which is a lot of overhead for repeated calls, especially when they are entirely avoidable.
In any case, the best way to do this is to ensure that your table records which segment a given row belongs to, adding a column like the SegmentID below if your table does not already have one:
+-----+---------+-----------+
| TAG | TAGTEXT | SegmentID |
+-----+---------+-----------+
|  1  |    a    |     1     |
|  2  |    b    |     1     |
|  3  |    c    |     1     |
|  1  |    d    |     2     |  // btw, what happened to the 2 missing rows after this one?
|  4  |    e    |     2     |
|  1  |    f    |     3     |
+-----+---------+-----------+
Then, you could use code like this to iterate the rows of a segment:
procedure IterateSegment(Query : TSomeTypeOfQueryComponent; SegmentID : Integer);
var
  Sql : String;
begin
  Sql := Format('select * from mytable where SegmentID = %d order by Tag', [SegmentID]);
  if Query.Active then
    Query.Close;
  Query.Sql.Text := Sql;
  Query.Open;
  Query.DisableControls;
  try
    while not Query.Eof do begin
      // process row here
      Query.Next;
    end;
  finally
    Query.EnableControls;
  end;
end;
Once you have the SegmentID column in the table, if you don't want to open a new query to iterate a block, you can set up a local index (by SegmentID then Tag), assuming your dataset type supports it, set a filter on the dataset to restrict it to a given SegmentID, and then iterate over it.
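Since the question says the table itself cannot be modified, the segment number may also be derivable on the fly in SQL rather than stored. A sketch under assumptions of mine: the backend supports window functions, and some column (rec_no below, a made-up name) fixes the physical row order:

-- a running count of TAG = 1 rows assigns each row its segment number
SELECT rec_no, TAG, TAGTEXT,
       SUM(CASE WHEN TAG = 1 THEN 1 ELSE 0 END)
           OVER (ORDER BY rec_no) AS SegmentID
FROM mytable
ORDER BY rec_no;

Filtering that result on SegmentID = 2, for instance, yields the second segment without touching the stored table.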
You have many options for doing this.
If your component doesn't provide a LocateNext, you can write your own: compare the value and call Next until you find the next match.
You can also have the SQL return the rows with an ORDER BY, use Locate for the first value, and then test whether the next value still matches the comparison.
If you use a TClientDataSet, you can filter with the component's Filter property, or set IndexFieldNames to order the values instead of using ORDER BY in the SQL as in the previous suggestion.
You can filter in the SQL WHERE clause, too.

Hive Array<Struct<>> Insertion shows null

I created a table temp that has an array of structs:
create table temp (regionkey smallint, name string, comment string, nations array<struct<n_nationkey:smallint,n_name:string,n_comment:string>>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ',';
Then I loaded the data into the table
LOAD DATA LOCAL INPATH '/Data Sets/region.csv' INTO TABLE temp;
The desired output of
select * from temp;
is:
4 EUROPE Low sale Business Region [{"n_nationkey":22,"n_name":"Ryan","n_comment":"Reference the site"}]
But the actual output is:
4 EUROPE Low sale Business Region [{"n_nationkey":22,"n_name":null,"n_comment":null},{"n_nationkey":null,"n_name":null,"n_comment":null},{"n_nationkey":null,"n_name":null,"n_comment":null}]
DATA FILE
4|EUROPE|Low sale Business Region for Training4Exam.com|7,Bulgaria,Reference
4|EUROPE|Low sale Business Region for HadoopExam.com|19,Belgium,Reference site
4|EUROPE|Low sale Business Region for Training4Exam.com|22,Ryan,Reference site
This is my first experiment with arrays and structs, and I am drawing a blank.
Any help would be highly appreciated.
Thanks
The fix is to declare map keys terminated by ',' instead of collection items terminated by ',': in an array<struct<...>>, the collection-items delimiter separates the array elements, and the struct fields inside each element are split on the next separator in the hierarchy, which is the map-keys delimiter. With the comma at that level, each input line parses as a one-element array whose struct fields come out correctly:
create external table temp
(
regionkey smallint
,name string
,comment string
,nations array<struct<n_nationkey:smallint,n_name:string,n_comment:string>>
)
row format delimited
fields terminated by '|'
map keys terminated by ','
;
select * from temp
;
+-----------+--------+------------------------------------------------+-----------------------------------------------------------------------+
| regionkey | name | comment | nations |
+-----------+--------+------------------------------------------------+-----------------------------------------------------------------------+
| 4 | EUROPE | Low sale Business Region for Training4Exam.com | [{"n_nationkey":7,"n_name":"Bulgaria","n_comment":"Reference "}] |
| 4 | EUROPE | Low sale Business Region for HadoopExam.com | [{"n_nationkey":19,"n_name":"Belgium","n_comment":"Reference site "}] |
| 4 | EUROPE | Low sale Business Region for Training4Exam.com | [{"n_nationkey":22,"n_name":"Ryan","n_comment":"Reference site"}] |
+-----------+--------+------------------------------------------------+-----------------------------------------------------------------------+
FYI, from the comments in Hive's LazySerDeParameters (linked below):
To be backward-compatible, initialize the first 3 separators to be the given values from the table properties.
The default number of separators is 8;
if only hive.serialization.extend.nesting.levels is set, the number of separators is extended to 24;
if hive.serialization.extend.additional.nesting.levels is set, the number of separators is extended to 154.
@param tableProperties table properties to extract the user provided separators
https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySerDeParameters.java
David's answer is very efficient and I liked it very much, but I cannot understand why collection items must be replaced by map keys (based on the description he quoted, there seems to be a bug in Hive; I am not a pro at coding).
However, this is the long version:
create table regiontemp(str string);
load data inpath '/user/cloudera/MohsenFiles/first_first.csv' into table regiontemp;
create external table region (r_regionkey smallint,
r_name string,
r_comment string,
r_nations array<struct<n_nationkey:smallint,n_name:string,n_comment:string>>)
row format delimited
fields terminated by '|'
collection items terminated by ',';
insert overwrite table region
select split(str,'\\|')[0] r_regionkey,
split(str,'\\|')[1] r_name,
split(str,'\\|')[2] r_comment,
array(named_struct("n_nationkey",cast(split(split(str,'\\|')[3],",")[0] as smallint),
"n_name",split(split(str,'\\|')[3],",")[1] ,
"n_comment",split(split(str,'\\|')[3],",")[2] ))
from regiontemp ;
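A quick sanity check that the structs parsed correctly (a hypothetical verification query, not from the original post):

-- pull individual struct fields back out of the array
select r_regionkey, r_nations[0].n_name, r_nations[0].n_comment
from region
limit 3;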
Now, in Impala:
INVALIDATE METADATA;
or, in Hive, see "Aggregation On Struct columns Hive" (again based on David's answer to another question).

Find valid combinations based on matrix

I have in Calc the following matrix: the first row (1) contains employee numbers, the first column (A) contains product codes.
Everywhere there is an X, that product item was sold by the corresponding employee above:
     | 0302 | 0303 | 0304 | 0402 |
1625 |  X   |      |  X   |  X   |
1643 |      |  X   |  X   |      |
...
We see that product 1643 was sold by employees 0303 and 0304.
What I would like to see is a list of which product was sold by which employees, formatted like this:
1625 | 0302, 0304, 0402 |
1643 | 0303, 0304 |
The reason for this is that we ultimately need this matrix imported into a SQL Server table. We have no access to the origins of this matrix. It contains about 50 employees and 9000+ products.
Thanx for thinking with us!
Try something like this:
;with data as
(
    -- sample rows standing in for the imported matrix
    SELECT *
    FROM ( VALUES (1625, 'X', NULL, 'X', 'X'),
                  (1643, NULL, 'X', 'X', NULL))
         cs (col1, [0302], [0303], [0304], [0402])
), cte
AS (
    -- unpivot: one row per (product, employee) where an X is present
    SELECT col1,
           col
    FROM data
    CROSS apply (VALUES ('0302', [0302]),
                        ('0303', [0303]),
                        ('0304', [0304]),
                        ('0402', [0402])) cs (col, val)
    WHERE val IS NOT NULL
)
SELECT col1,
       LEFT(cs.col, Len(cs.col) - 1) AS col   -- trim the trailing comma
FROM cte a
CROSS APPLY (SELECT col + ','
             FROM cte B
             WHERE a.col1 = b.col1
             FOR XML PATH('')) cs (col)       -- concatenate the employees per product
GROUP BY col1,
         LEFT(cs.col, Len(cs.col) - 1)
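On SQL Server 2017 or newer, the final SELECT above can be replaced with STRING_AGG, which expresses the concatenation directly (a sketch reusing the data and cte definitions above; not available on older versions):

SELECT col1, STRING_AGG(col, ', ') AS employees
FROM cte
GROUP BY col1;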
I think there are two problems to solve:
get the product codes for the X marks;
concatenate them into a single, comma-separated string.
I can't offer a solution for both issues in one step, but you may handle both issues separately.
1.
To replace the X marks by the respective product codes, you could use an array function to create a second table (matrix). To do so, create a new sheet, copy the first column / first row, and enter the following formula in cell B2:
=IF($B2:$E3="X";$B$1:$E$1;"")
You'll have to adapt the formula, so it covers your complete input data (If your last data cell is Z9999, it would be =IF($B2:$Z9999="X";$B$1:$Z$1;"")). My example just covers two rows and four columns.
After modifying it, confirm with CTRL+SHIFT+ENTER to apply it as array formula.
2.
Now, you'll have to concatenate the product codes. LO Calc lacks a feature to concatenate an array, but you could use a simple user-defined function. For such a string-join function, see this answer. Just create a new macro with the StarBasic code provided there and save it. Now, you have a STRJOIN() function at hand that accepts an array and concatenates its values, leaving empty values out.
You could add that function using a helper column on the second sheet and apply it by dragging it down. Finally, to get rid of the cells with the single product IDs, copy the complete second sheet, paste special into a third sheet, pasting only the values. Now, you can remove all columns except the first one (employee IDs) and the last one (with the concatenated product ids).
I created a table in sql for holding the data:
CREATE TABLE [dbo].[mydata](
[prod_code] [nvarchar](8) NULL,
[0100] [nvarchar](10) NULL,
[0101] [nvarchar](10) NULL,
[and so on...]
I created the list of columns in Calc by copying and pasting them transposed. After that, I used the CONCATENATE function to create the column list + data type for the CREATE TABLE statement.
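For instance, a formula along these lines (my guess at the exact shape; adjust to your sheet layout, where A1 holds a column name):

=CONCATENATE("[";A1;"] [nvarchar](10) NULL,")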
I cleaned up the worksheet and imported it into this table using SQL Server's import wizard. Cleaning meant removing unnecessary rows/columns. Since the column names were identical, mapping was done correctly for 99%.
Now I had the data in SQL Server.
I adapted the code MM93 suggested a bit:
;with data as
(
SELECT *
FROM dbo.mydata <-- here I simply referenced the whole table
),cte
and in the next part I used the same 'worksheet' trick to list and format all the column names, and pasted them in:
),cte
AS (SELECT prod_code, <-- had to replace col1 with 'prod_code'
col
FROM data
CROSS apply (VALUES ('0100',[0100]),
('0101', [0101] ),
(and so on... ),
The result of this query was inserted into a new table, and my colleagues and I are querying our hearts out :)
PS: removing the FOR XML clause resulted in a table with two columns:
prodcode | employee
which contains all the unique combinations of prodcode + employee number, and is a lot faster and much more practical to query.
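For reference, that two-column variant is just the unpivot step on its own; a sketch against the dbo.mydata table above (only the two column names shown earlier are spelled out):

-- one (prod_code, employee) row per X mark, no concatenation
SELECT d.prod_code, v.employee
FROM dbo.mydata d
CROSS APPLY (VALUES ('0100', d.[0100]),
                    ('0101', d.[0101])) v (employee, val)
WHERE v.val IS NOT NULL;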

How to retrieve multiple values when more than one string matches with TSQL?

Consider a table that contains:
ReturnValueID | ReturnValue  | TriggerValue
--------------+--------------+--------------
1             | returnValue1 | testvalue
2             | returnValue2 | testing...
3             | returnValue3 | value3
And given a string: HERE IS THE TEXT testing... AND MORE TEXT testvalue MORE TEXT
I have written a CTE using SQL Server 2008 that uses a FindInString() function I wrote to indicate where the matched text is found (0 = not found):
1 | returnValue1 | 43
2 | returnValue2 | 18
3 | returnValue3 | 0
What I need to do now, is iterate through this result set in a loop where I will perform some additional logic based on each row.
I have seen a few examples of looping, but I would rather not use a cursor.
What is the best way to approach this?
Thanks.
-- UPDATE --
Once a match is made, the ID of the matched row is added to a table (if it isn't already recorded), and the return value is appended to a VARCHAR value (if it isn't already present in the dynamic string):
IF NOT EXISTS -- check if this value is already recorded
(
    SELECT *
    FROM RecordedReturnValue
    WHERE ReturnValueID = @ReturnValueID
)
BEGIN
    -- add the visitor/external tag ID to the historical table
    INSERT INTO RecordedReturnValue (...)
    VALUES (...)
    -- function checks if string is already present
    SET @DynamicString = dbo.AppendDynamicOutput(@ReturnValue, @DynamicString)
END
This must be performed for each matched TriggerValue from the CTE.
Ended up using a CTE, added the values to a temp table, then iterated through the results and performed some logic.
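A minimal sketch of that cursor-free pattern, with hypothetical names (the CTE's matches are materialized into a temp table first, then consumed row by row):

-- materialize the matched rows; in practice the INSERT would select from the CTE
CREATE TABLE #Matches (ReturnValueID INT PRIMARY KEY, ReturnValue VARCHAR(100));

DECLARE @ReturnValueID INT, @ReturnValue VARCHAR(100);

WHILE EXISTS (SELECT 1 FROM #Matches)
BEGIN
    SELECT TOP (1) @ReturnValueID = ReturnValueID, @ReturnValue = ReturnValue
    FROM #Matches
    ORDER BY ReturnValueID;

    -- the per-row logic from the question goes here

    DELETE FROM #Matches WHERE ReturnValueID = @ReturnValueID;
END

DROP TABLE #Matches;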

A single MySQL query for 'bouncing' table selects

So, say for the sake of simplicity, I have a master table containing two fields: the first is an attribute and the second is the attribute's value. If the second field references a value in another table, that is denoted in parentheses.
Example:
MASTER_TABLE:
Attr_ID | Attr_Val
--------+-----------
1 | 23(table1) --> 23rd value from `table1`
2 | ...
1 | 42 --> the number 42
1 | 72(table2) --> 72nd value from `table2`
3 | ...
1 | txt --> string "txt"
2 | ...
4 | ...
TABLE 1:
Val_Id | Value
--------+-----------
1 | some_content
2 | ...
. | ...
. | ...
. | ...
23 | some_content
. | ...
Is it possible to perform a single query in SQL (without parsing the results inside the application and re-querying the db) that would iterate through master_table and, for a given <attr_id>, get only the attributes that reference other tables (e.g. 23(table1), 72(table2), ...), then parse the table names from the parentheses (e.g. table1, table2, ...) and fetch the (23rd, 72nd, ...) value (e.g. some_content) from the referenced table?
Here is something I've done; it parses Attr_Val for the table name, but I don't know how to assign it to a string and then run a query with that string.
PREPARE pstmt FROM
    "SELECT * FROM information_schema.tables
     WHERE TABLE_SCHEMA = '<my_db_name>' AND TABLE_NAME = ?";
SET @str_tablename =
    (SELECT t.tablename FROM
        (SELECT @string := (SELECT <string_column> FROM <table> WHERE ID = <attr_id>) AS str,
                @loc1 := length(@string) - locate("(", reverse(@string)) + 2 AS `from`,
                @loc2 := length(@string) - locate(")", reverse(@string)) + 1 - @loc1 AS `to`,
                substr(@string, @loc1, @loc2) AS tablename
        ) t
    ); <-- this returns 1 row, which is OK
EXECUTE pstmt USING @str_tablename; <-- this then returns 0 rows
Any thoughts?
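One note (my addition, not from the thread): a prepared statement can bind values, but not a table name, so the final lookup has to be built as a string with CONCAT and prepared dynamically. A sketch using the question's naming, with the sample value hard-coded:

-- parse '23(table1)' into its id and table-name parts
SET @attr_val = '23(table1)';
SET @tbl = SUBSTRING_INDEX(SUBSTRING_INDEX(@attr_val, '(', -1), ')', 1);
SET @id  = SUBSTRING_INDEX(@attr_val, '(', 1);

-- build the statement text, then prepare and run it
SET @sql = CONCAT('SELECT Value FROM `', @tbl, '` WHERE Val_Id = ', @id);
PREPARE pstmt FROM @sql;
EXECUTE pstmt;
DEALLOCATE PREPARE pstmt;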
I love the purity of this approach, if pulled off. But I'm thinking you're creating a maintenance bomb. With a cure like this, who needs to be sick?
No one has ever said of a web site "Man, their data sure is pure!" They compliment what is being done with the data. I don't recommend you keep your hands tied behind your back on this one. I guarantee your competitors aren't.
