BigQuery | bq load with inline schema with STRING REPEATED

I am trying to load a bq table with the definition below, where one of the columns (ref_list) is STRING REPEATED.
[
  {
    "name": "emp",
    "type": "STRING"
  },
  {
    "mode": "REPEATED",
    "name": "ref_list",
    "type": "STRING"
  },
  {
    "name": "update_date",
    "type": "DATE"
  }
]
Below is my input data:
{"emp":"Adam","ref_list":["Roger","Calvin","Andrew","Kohl"],"update_date":"1999-01-01"}
{"emp":"AntiP27","ref_list":["John","Patrick","Nick","Chris"],"update_date":"2020-01-01"}
I am able to load the table by pointing to the .schema file on my local machine, but the same load fails when I provide the inline schema.
Here is my bq load command with the inline schema option. I am not quite sure how I can specify mode = REPEATED:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON emp_stage.emp_dtl gs://1324-global-delivery/emp_dtl.json emp:STRING,ref_list:STRING,update_date:DATE

According to the documentation, it's not possible to specify a RECORD type or a column's mode (NULLABLE, REPEATED) with an inline schema:
When you specify the schema on the command line, you cannot include a
RECORD (STRUCT) type, you cannot include a column description, and you
cannot specify the column's mode. All modes default to NULLABLE. To
include descriptions, modes, and RECORD types, supply a JSON schema
file instead.
bq_manually_specifying_schemas
If you need these features, you have to specify them in a JSON schema file, as you did in your example.
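For reference, a minimal sketch of the file-based approach (the local schema file name is an assumption): save the JSON schema from the question as, say, emp_dtl_schema.json, then pass its path in place of the inline schema as the last argument:
bq load --replace \
  --source_format=NEWLINE_DELIMITED_JSON \
  emp_stage.emp_dtl \
  gs://1324-global-delivery/emp_dtl.json \
  ./emp_dtl_schema.json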

Related

How can I generate a DB that fits my Room schema?

I have a database with quite a lot of entities and I want to preload data from a file on first creation of the database. For that, the Room schema needs to fit the schema of the database file. Since converting the JSON schema by hand to SQLite statements is very error-prone (I would need to copy-paste every single statement and exchange the variable names), I am looking for a way to automatically generate a database from the schema, which I then just need to fill with the data.
However, there is apparently no information out there on whether that is possible, or how to do it. It's my first time working with SQLite (normally I use MySQL) and also the first time I've seen a database schema in JSON. (Standard MariaDB export options always just export the CREATE TABLE statements.)
Is there a way? Or does Room provide any way to actually get the create table statements as proper text, not split up into tons of JSON arrays?
I followed the guide in the Android Developer Guidelines to get the JSON schema, so I already have that file. For those who do not know its structure, it looks like this:
{
  "formatVersion": 1,
  "database": {
    "version": 1,
    "identityHash": "someAwesomeHash",
    "entities": [
      {
        "tableName": "Articles",
        "createSql": "CREATE TABLE IF NOT EXISTS `${TABLE_NAME}` (`id` INTEGER NOT NULL, `germanArticle` TEXT NOT NULL, `frenchArticle` TEXT, PRIMARY KEY(`id`))",
        "fields": [
          {
            "fieldPath": "id",
            "columnName": "id",
            "affinity": "INTEGER",
            "notNull": true
          },
          {
            "fieldPath": "germanArticle",
            "columnName": "germanArticle",
            "affinity": "TEXT",
            "notNull": true
          },
          {
            "fieldPath": "frenchArticle",
            "columnName": "frenchArticle",
            "affinity": "TEXT",
            "notNull": false
          }
        ],
        "primaryKey": {
          "columnNames": [
            "id"
          ],
          "autoGenerate": false
        },
        "indices": [
          {
            "name": "index_Articles_germanArticle",
            "unique": true,
            "columnNames": [
              "germanArticle"
            ],
            "createSql": "CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_germanArticle` ON `${TABLE_NAME}` (`germanArticle`)"
          },
          {
            "name": "index_Articles_frenchArticle",
            "unique": true,
            "columnNames": [
              "frenchArticle"
            ],
            "createSql": "CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_frenchArticle` ON `${TABLE_NAME}` (`frenchArticle`)"
          }
        ],
        "foreignKeys": []
      },
      ...
Note: My question was not how to create the Room DB out of the schema; to obtain the schema, I already had to create all the entities and the database. It was how to get the structure Room creates, as SQL, to prepopulate my database. However, I think the answer below is a really nice explanation, and in fact I found the SQL statements I was searching for in the generated Java file, which was an awesome hint. ;)
Is there a way? Or does Room provide any way to actually get the create table statements as proper text, not split up into tons of JSON arrays?
You cannot simply provide the CREATE SQL to Room; what you need to do is generate the Java/Kotlin classes (entities) from the JSON and then add those classes to the project.
Native SQLite (i.e. not using Room) would be a different matter, as it could be done at runtime.
The way Room works is that the database is generated from the classes annotated with @Entity (at compile time).
The entity classes have to exist for the compile to correctly generate the code that it generates.
Furthermore, the entity(ies) have to be incorporated/included into a class for the database, that being annotated with @Database (this class is typically abstract).
Yet furthermore, to access the database tables you have abstract classes or interfaces for the SQL, each annotated with @Dao; again these require the entity classes, as the SQL is checked at compile time.
e.g. the JSON you provided would equate to something like :-
@Entity(
        indices = {
                @Index(value = "germanArticle", name = "index_Articles_germanArticle", unique = true),
                @Index(value = "frenchArticle", name = "index_Articles_frenchArticle", unique = true)
        },
        primaryKeys = {"id"}
)
public class Articles {
    //@PrimaryKey // Could use this as an alternative
    long id;
    @NonNull
    String germanArticle;
    String frenchArticle;
}
so your process would have to convert the JSON to create the above, which could then be copied into the project.
You would then need a class for the database, which could be, for example :-
@Database(entities = {Articles.class}, version = 1)
abstract class MyDatabase extends RoomDatabase {
}
Note that Dao classes would be added to the body of the above along the lines of :-
abstract MyDaoClass getDao();
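For illustration, a minimal Dao along those lines might be (the class name MyDaoClass comes from the line above; the methods and query are assumptions, not from the original answer):
@Dao
public interface MyDaoClass {
    // insert a single Articles row, returning the new rowid
    @Insert
    long insert(Articles article);

    // read the whole Articles table
    @Query("SELECT * FROM Articles")
    List<Articles> getAllArticles();
}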
Or does Room provide any way to actually get the create table statements as proper text, not split up into tons of JSON arrays?
Yes it does ....
At this stage, if you compile, it generates Java (MyDatabase_Impl for the above, i.e. the name of the database class suffixed with _Impl). However, as there are no Dao classes/interfaces yet, the database would be unusable from a Room perspective (and thus wouldn't even get created).
part of the code generated would be :-
@Override
public void createAllTables(SupportSQLiteDatabase _db) {
    _db.execSQL("CREATE TABLE IF NOT EXISTS `Articles` (`id` INTEGER NOT NULL, `germanArticle` TEXT NOT NULL, `frenchArticle` TEXT, PRIMARY KEY(`id`))");
    _db.execSQL("CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_germanArticle` ON `Articles` (`germanArticle`)");
    _db.execSQL("CREATE UNIQUE INDEX IF NOT EXISTS `index_Articles_frenchArticle` ON `Articles` (`frenchArticle`)");
    _db.execSQL("CREATE TABLE IF NOT EXISTS room_master_table (id INTEGER PRIMARY KEY,identity_hash TEXT)");
    _db.execSQL("INSERT OR REPLACE INTO room_master_table (id,identity_hash) VALUES(42, 'f7294cddfc3c1bc56a99e772f0c5b9bb')");
}
As you can see, the Articles table and the two indices are created; the room_master_table is used for validation checking.

explode() of pyspark.sql is not working as expected even if only one of the records does not follow the schema

I have a JSON file which I would like to convert (say, to CSV) by expanding one of the fields into columns.
I used explode() for this, but it gives an error if even one of the many records does not have the exact schema.
Input File:
{"place": "KA",
"id": "200",
"swversion": "v.002",
"events":[ {"time": "2020-05-23T22:34:32.770Z", "eid": 24, "app": "testing", "state": 0} ]}
{"place": "AP",
"id": "100",
"swversion": "v.001",
"events":[[]]}
In the above, I want to expand the fields of "events" so they become columns.
Ideally, "events" is an array of struct type.
Expected output file columns:
place, id, swversion, time, eid, app, state
For this, I have used explode() from pyspark.sql, but because the second record in the input file does not follow the schema where "events" is an array of struct type, explode() fails here with an error.
The code I have used to explode:
from pyspark.sql.functions import explode

df = spark.read.json("InputFile")
ndf = df.withColumn("event", explode("events")).drop("events")
ndf.select("place", "id", "swversion", "event.*")
The last line fails because of the second record in my input file.
It should not be too difficult for explode() to handle this, I believe.
Can you suggest how to avoid
Cannot expand star types
If I change "events":[[]] to "events":[{}], explode() works fine, since it is again an array of StructType, but since I have no control over the input data, I need to handle this.
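One possible workaround (a sketch, not from the original thread): supply the schema explicitly instead of relying on inference, so that "events" is always read as an array of structs (a record such as "events":[[]] then parses with null fields under the default PERMISSIVE mode), and use explode_outer() so null or empty arrays still yield a row:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode_outer
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Declare "events" as array<struct<...>> up front so Spark never infers
# an array-of-arrays type from a malformed record.
event_type = StructType([
    StructField("time", StringType()),
    StructField("eid", LongType()),
    StructField("app", StringType()),
    StructField("state", LongType()),
])
schema = StructType([
    StructField("place", StringType()),
    StructField("id", StringType()),
    StructField("swversion", StringType()),
    StructField("events", ArrayType(event_type)),
])

df = spark.read.schema(schema).json("InputFile")

# explode_outer keeps a row (with a null event) where explode would drop it
ndf = df.withColumn("event", explode_outer("events")).drop("events")
ndf.select("place", "id", "swversion", "event.*").show()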

Postgres JSONB Query and Index on Nested String Array

I have some trouble wrapping my head around how to formulate queries and provide proper indices for the following situation. I have customer entities represented in JSON like this (only relevant properties are retained):
{
  "id": "50000",
  "address": [
    {
      "line": [
        "2nd Main Street",
        "123 Harris Plaza"
      ],
      "city": "Boston",
      "state": "Massachusetts",
      "country": "US"
    },
    {
      "line": [
        "1st Av."
      ],
      "city": "Jamestown",
      "state": "Massachusetts",
      "country": "US"
    }
  ]
}
The customers are stored in the following customer table:
CREATE TABLE Customer (
id BIGSERIAL PRIMARY KEY,
resource JSONB
);
I manage to do simple queries on the resource column, e.g. a projection query like this works (retrieve all lower-case address lines for cities starting with "bo"):
SELECT LOWER(jsonb_array_elements_text(jsonb_array_elements(c.resource #> '{address}') #> '{line}'))
FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a
WHERE LOWER(a->>'city') LIKE 'bo%';
I have trouble doing the following: my goal is to query all customers that have at least one address line beginning with "12". Case insensitivity is a requirement for my use case. The example customer would match my query, as the first address object has an address line starting with "12". Please note that "line" is an Array of JSON Strings, not complex objects. So far the closest thing I could come up with is this:
SELECT c.resource FROM Customer c, jsonb_array_elements(c.resource #> '{address}') a WHERE a->'line' ?| array['123 Harris Plaza'];
Obviously this is not a case-insensitive LIKE query. Any help/pointers on how to formulate both query and accompanying GIN index are greatly appreciated. My first query already selects all address lines as text, so maybe this could be used in a GIN index?
I'm using Postgres 9.5, but am happy to upgrade if this can only be achieved in more recent Postgres versions.
While GIN indexes have machinery to support prefix matching, this machinery is only hooked up for tsvectors. array_ops does not have it hooked up, nor does jsonb_ops or jsonb_path_ops. So unless you want to create new operator classes/families (or normalize your data into separate tables), you will have to shoe-horn your data into a tsvector.
Here is a crude way to do that, which doesn't account for the possibility that an address line might contain literal single quotes or perhaps other meaningful characters:
create function addressline_tsvector(jsonb) returns tsvector immutable language SQL as $$
  -- quote each lower-cased address line so it becomes a single tsvector lexeme
  select string_agg('''' || lower(value) || '''', ' ')::tsvector
  from jsonb_array_elements($1->'address') a(a),
       jsonb_array_elements_text(a->'line')
$$;
create index on customer using gin (addressline_tsvector(resource));
select * from customer where addressline_tsvector(resource) @@ lower('''2nd Main'':*')::tsquery;
Given that your example table only has one row, the index will probably not actually be used unless you set enable_seqscan = off first.
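For the original requirement (address lines beginning with "12"), the same prefix machinery should apply; a sketch using the function and index defined above:
SELECT c.resource
FROM customer c
WHERE addressline_tsvector(c.resource) @@ lower('''12'':*')::tsquery;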

Load JSON file with arrays/structs and flexible schema into Hive table

Need some help loading a JSON file into a table. Here is an example of some of the JSON objects within the file:
{"asin": "0002000202", "title": "Black Berry, Sweet Juice: On Being Black and White in Canada", "price": 13.88, "imUrl": "http://ecx.images-amazon.com/images/I/51PQAYJ9EDL.jpg", "related": {"also_bought": ["0393333094"], "buy_after_viewing": ["0393333094", "1554685087"]}, "salesRank": {"Books": 3013713}, "categories": [["Books"]]}
{"asin": "0000041696", "title": "Arithmetic 2 A Beka Abeka 1994 Student Book (Traditional Arithmentic Series)", "price": 6.53, "imUrl": "http://ecx.images-amazon.com/images/I/41cGaan-BrL._SL500_.jpg", "related": {"also_viewed": ["B000KOYDUY", "B004GE1B7W", "B008SXBO88", "B001EH7Y02", "B000W7PN62", "B004H3G1X6", "B004WOEIXA", "B000AXWEEM", "0789478722", "B000MN2C56", "1402709269", "B001HHOKG0", "B000Y9TO1S", "1402711441", "0756609356", "0142400106", "1556616465", "0545021383", "B004LDD18A", "B000HZH18C", "1557996563", "B00CZTVUKI", "B001CXK8Y2", "B000QX6KY6"], "buy_after_viewing": ["B000KOYDUY", "B004GE1B7W", "B000LBXGRC", "0439827655"]}, "salesRank": {"Books": 2554321}, "categories": [["Books"]]}
As you can see, the schema varies among objects; some attributes are not present in all objects. There are also structs and arrays.
Here is my create table statement:
create table amazon.products_test
(asin string,
title string,
description string,
brand string,
price float,
salesRank struct<category:string, rank:int> ,
imUrl string,
categories array<string>,
related struct<also_bought:string, also_viewed:string, buy_after_viewing:string, bought_together:string>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
My load statement:
load data inpath '/user/amazon/products_test.json'
overwrite into table amazon.products_test;
Here I try to query:
hive> select * FROM products_test;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Field name expected
Do I have the right datatypes?
Is there a better SerDe?
Do I need to add TBLPROPERTIES or SERDEPROPERTIES?
I found the answer. As suspected, I needed to use a different SerDe:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
I saw some forums suggesting that I might need to use this SerDe, but I didn't know how to implement it and add the jars from:
https://github.com/rcongiu/Hive-JSON-Serde
Also, I needed to use a map type, not a struct, for salesRank.
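Putting the two fixes together, the revised DDL would presumably look something like this (a sketch: the jar path is a placeholder, and the element types for categories and the related arrays are inferred from the sample records, not stated in the original answer):
ADD JAR /path/to/json-serde-with-dependencies.jar;  -- placeholder path to the openx SerDe jar

create table amazon.products_test
(asin string,
 title string,
 description string,
 brand string,
 price float,
 salesRank map<string,int>,
 imUrl string,
 categories array<array<string>>,
 related struct<also_bought:array<string>, also_viewed:array<string>,
                buy_after_viewing:array<string>, bought_together:array<string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';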

Generalized way to extract JSON from a relational database?

OK, maybe this is too broad for Stack Overflow, but is there a good, generalized way to assemble data in relational tables into hierarchical JSON?
For example, let's say we have a "customers" table and an "orders" table. I want the output to look like this:
{
  "customers": [
    {
      "customerId": 123,
      "name": "Bob",
      "orders": [
        {
          "orderId": 456,
          "product": "chair",
          "price": 100
        },
        {
          "orderId": 789,
          "product": "desk",
          "price": 200
        }
      ]
    },
    {
      "customerId": 999,
      "name": "Fred",
      "orders": []
    }
  ]
}
I'd rather not have to write a lot of procedural code to loop through the main table and fetch orders a few at a time and attach them. It'll be painfully slow.
The database I'm using is MS SQL Server, but I'll need to do the same thing with MySQL soon. I'm using Java and JDBC for access. If either of these databases had some magic way of assembling these records server-side it would be ideal.
How do people migrate from relational databases to JSON databases like MongoDB?
Here is a useful set of functions for converting relational data to JSON and XML and from JSON back to tables: https://www.simple-talk.com/sql/t-sql-programming/consuming-json-strings-in-sql-server/
SQL Server 2016 is finally catching up and adding support for JSON.
The JSON support still does not match other products such as PostgreSQL, e.g. no JSON-specific data type is included. However, several useful T-SQL language elements were added that make working with JSON a breeze.
E.g. in the following Transact-SQL code a text variable containing a JSON string is defined:
DECLARE @json NVARCHAR(4000)
SET @json =
N'{
    "info":{
        "type":1,
        "address":{
            "town":"Bristol",
            "county":"Avon",
            "country":"England"
        },
        "tags":["Sport", "Water polo"]
    },
    "type":"Basic"
}'
and then, you can extract values and objects from JSON text using the JSON_VALUE and JSON_QUERY functions:
SELECT
    JSON_VALUE(@json, '$.type') as type,
    JSON_VALUE(@json, '$.info.address.town') as town,
    JSON_QUERY(@json, '$.info.tags') as tags
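For the sample document above, that query should return a single row along these lines:
type   town     tags
-----  -------  -----------------------
Basic  Bristol  ["Sport", "Water polo"]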
Furthermore, the OPENJSON function returns the elements of a referenced JSON array as rows:
SELECT value
FROM OPENJSON(@json, '$.info.tags')
Last but not least, there is a FOR JSON clause that can format a SQL result set as JSON text:
SELECT object_id, name
FROM sys.tables
FOR JSON PATH
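Applied to the customers/orders example from the question, a correlated subquery with FOR JSON PATH can produce the nested shape directly (a sketch; the table and column names are assumed from the question's sample output):
SELECT c.customerId, c.name,
       (SELECT o.orderId, o.product, o.price
        FROM orders o
        WHERE o.customerId = c.customerId
        FOR JSON PATH) AS orders
FROM customers c
FOR JSON PATH, ROOT('customers');
One caveat: a customer with no orders is emitted without an "orders" property rather than with an empty array; wrapping the subquery in JSON_QUERY(ISNULL(..., '[]')) is a common fix.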
Some references:
https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server
https://learn.microsoft.com/en-us/sql/relational-databases/json/convert-json-data-to-rows-and-columns-with-openjson-sql-server
https://blogs.technet.microsoft.com/dataplatforminsider/2016/01/05/json-in-sql-server-2016-part-1-of-4/
https://www.red-gate.com/simple-talk/sql/learn-sql-server/json-support-in-sql-server-2016/
I think one 'generalized' solution would be as follows:
Create a 'select' query which joins all the required tables to fetch results in a two-dimensional array (like CSV / a temporary table, etc.).
If each row of this join is unique, and the MongoDB schema and the columns have a one-to-one mapping, then it's all about importing this CSV/table using the mongoimport command with the required parameters.
But a case like the above, where a given customer ID can have an array of 'orders', needs some computation before mongoimport.
You will have to write a program which can 'vertically merge' the orders for a given customer ID. For a small set of data, a simple Java program will work. But for larger sets, parallel programming using Spark can do this job.
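For the simple one-to-one case, the import step might look like this (the database, collection, and file names are placeholders):
mongoimport --db mydb --collection customers --type csv --headerline --file customers.csv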
SQL Server 2016 now supports reading JSON in much the same way as it has supported XML for many years. Using OPENJSON to query directly and JSON datatype to store.
There is no generalized way because SQL Server doesn’t support JSON as its datatype. You’ll have to create your own “generalized way” for this.
Check out this article; it has good examples of how to convert SQL Server data to JSON format.
https://www.simple-talk.com/blogs/2013/03/26/sql-server-json-to-table-and-table-to-json/
