Can't find uuid in org.apache.spark.sql.types.DataTypes

We have a PostgreSQL table which has UUID as one of its columns. How do we send a UUID field in a Spark Dataset (using Java) to the PostgreSQL DB?
We are not able to find a uuid type in org.apache.spark.sql.types.DataTypes.
Please advise.

As already pointed out, despite these resolved issues (10186, 5753) there is still no supported uuid Postgres data type as of Spark 2.3.0.
However, there is a workaround: use Spark's SaveMode.Append and set the Postgres JDBC property stringtype to "unspecified", so the server can infer the column type from the strings it receives. In short, it works like this:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

val props = Map(
  JDBCOptions.JDBC_DRIVER_CLASS -> "org.postgresql.Driver",
  "url" -> url,
  "user" -> user,
  "stringtype" -> "unspecified"  // send strings untyped so Postgres can cast them to uuid
)

yourData.write.mode(SaveMode.Append)
  .format("jdbc")
  .options(props)
  .option("dbtable", tableName)
  .save()
The table should already exist, with the uuid column defined as type uuid. If you try to have Spark 2.3.0 create this table for you, though, you will again hit a wall:
yourData.write.mode(SaveMode.Overwrite)
  .format("jdbc")
  .options(props)
  .option("dbtable", tableName)
  .option("createTableColumnTypes", "some_uuid_column_name uuid")
  .save()
Result:
DataType uuid is not supported.(line 1, pos 21)
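If you want to script that pre-creation step, here is a minimal sketch (assuming the same url, user and tableName values as above, plus a password variable and a placeholder payload column that do not appear in the original code): run the DDL over plain JDBC first, then use the Append write shown earlier.

import java.sql.DriverManager

// Create the target table up front so the uuid column gets the proper Postgres type;
// Spark then only appends rows into it. `password` and `payload` are placeholders.
val conn = DriverManager.getConnection(url, user, password)
try {
  conn.createStatement().executeUpdate(
    s"""CREATE TABLE IF NOT EXISTS $tableName (
       |  some_uuid_column_name uuid PRIMARY KEY,
       |  payload text
       |)""".stripMargin)
} finally {
  conn.close()
}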

Yes, you are right: there is no UUID data type in Spark SQL. Treating the values as Strings should work, because the connector will convert the String to a UUID.
I haven't tried it with PostgreSQL, but when I used Cassandra (and Scala) it worked perfectly.
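For the PostgreSQL case, a minimal sketch of that treat-it-as-a-String idea (reusing props and tableName from the first answer; the column name is just a placeholder) might look like this:

import java.util.UUID
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.udf

// Spark SQL has no UUID type, so generate the value as a plain String column.
val makeUuid = udf(() => UUID.randomUUID().toString)
val withId = yourData.withColumn("some_uuid_column_name", makeUuid())

// Append into the pre-created table; "stringtype" -> "unspecified" lets
// Postgres cast the string to uuid on insert.
withId.write.mode(SaveMode.Append)
  .format("jdbc")
  .options(props)
  .option("dbtable", tableName)
  .save()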

Related

Unsupported Scan, storing driver.Value type []uint8 into type *guid.GUID

I work with Golang and SQL Server.
My struct in Golang:
type Role struct {
    Id          guid.GUID        `gorm:"primaryKey;column:Id;type:uniqueidentifier" json:"id"`
    RoleName    string           `gorm:"column:RoleName;not null;unique" json:"roleName"`
    IsEnable    bool             `gorm:"column:IsEnable" json:"isEnable"`
    Permissions []RolePermission
}
I use GORM to query the data but receive this error:
unsupported Scan, storing driver.Value type []uint8 into type *guid.GUID.
I used uuid before, but the id data was wrong when queried (guid to uuid). Is there any way to store and work with a GUID using Golang and SQL Server?
Early versions of go-gorm (v0.2) included UUID/GUID support in SQLTag, with isUUID() testing the type name ("uuid" or "guid").
That code is no longer present in the current go-gorm v2.0.
You might need to implement a custom data type that satisfies the Scanner/Valuer interfaces, or use an existing one such as google/uuid:
import (
    "github.com/google/uuid"
    "github.com/lib/pq"
)

type Post struct {
    ID    uuid.UUID      `gorm:"type:uuid;default:uuid_generate_v4()"`
    Title string
    Tags  pq.StringArray `gorm:"type:text[]"`
}

EF Core Projecting optional property in jsonb column

I need to project some fields from a jsonb column, where a few of them are optional.
I'm using EF Core 3.1 and Npgsql, and so far I have this:
var shipments = _dbContext.Shipments.Select(x => new
{
    ShipmentNo = x.ShipmentNumber,
    ReportNum = x.ShipmentData.RootElement.GetProperty("reportNumber"),
    ShipmentValue = x.ShipmentData.RootElement.GetProperty("shipmentMetadata").GetProperty("value").GetString(),
});
However, value is optional and this throws an exception. I see the .TryGetProperty(...) method, but it requires an out variable and, I presume, cannot be evaluated server side. I wonder if there is a way to handle this so the query runs completely in Postgres.
You've forgotten to add GetInt32() (or whatever the type is) for reportNumber, just as you have GetString() at the end of the shipmentMetadata chain. For this to be translatable to SQL, you need to tell the provider which type you expect to come out of the JSON element.

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have some simple Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, which is slow. Unfortunately, I can't force it to use "spark".
I tried using SQLContext instead, replacing the above with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that Spark SQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database is supported in later Spark versions: https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the USE statement and the query in two separate spark.sql calls, like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The Hive context gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the DataFrame from your SQL query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the processing against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you're removing the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
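As a small sketch of what staying on the HiveContext looks like in Spark 1.3 (same sc, database and table names as in the question), both forms below work, and in either case the query is planned and executed by Spark rather than by Hive's execution engine:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Option 1: select the database first, then query.
hc.sql("use myDatabase")
val rdd1 = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd

// Option 2: qualify the table name directly, no "use" needed.
val rdd2 = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account").rdd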
I have not been able to get the use database command to work, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

HiveQL to HBase

I am using Hive 0.14 and HBase 0.98.8.
I would like to use HiveQL to access an HBase "table".
I created a table with a complex composite rowkey:
CREATE EXTERNAL TABLE db.hive_hbase (rowkey struct<p1:string, p2:string, p3:string>, column1 string, column2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ';'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:c1,cf:c2")
TBLPROPERTIES("hbase.table.name"="hbase_table");
The table is created successfully, but the HiveQL query takes forever:
SELECT * from db.hive_hbase WHERE rowkey.p1 = 'xyz';
Queries that don't use the rowkey are fine, and using the HBase shell with filters also works.
I don't find anything in the logs, but I assume there could be an issue with complex composite keys and performance.
Did anybody face the same issue? Any hints to solve it? Other ideas about what I could try?
Thank you
Update 16.07.15:
I changed the log4j properties to 'DEBUG' and found some interesting information. The log says:
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(823)) - Pushdown Predicates of FIL For Alias : hive_hbase
2015-07-15 15:56:41,232 INFO ppd.OpProcFactory (OpProcFactory.java:logExpr(826)) - (rowkey.p1 = 'xyz')
But some lines later:
2015-07-15 15:56:41,430 DEBUG ppd.OpProcFactory (OpProcFactory.java:pushFilterToStorageHandler(1051)) - No pushdown possible for predicate: (rowkey.p1 = 'xyz')
So my guess is: for this predicate, Hive over HBase does not push the filter down into HBase but instead starts a MapReduce job that scans the table.
Could there be a bug in the predicate pushdown?
I tried a similar situation using Hive 0.13 and it works fine; I got the result. What version of Hive are you working on?

Prevent Django from updating identity column in MSSQL

I'm working with a legacy DB in MSSQL. We have a table that has two columns that are causing me problems:
class Emp(models.Model):
    empid = models.IntegerField(_("Unique ID"), unique=True, db_column=u'EMPID')
    ssn = models.CharField(_("Social security number"), max_length=10, primary_key=True, db_column=u'SSN')  # Field name made lowercase.
So the table has the ssn column as primary key, and the relevant part of the UPDATE statement generated by Django is this:
UPDATE [EMP] SET [EMPID] = 399,
.........
WHERE [EMP].[SSN] = 2509882579
The problem is that EMP.EMPID is an identity column in MSSQL, and thus pyodbc throws this error whenever I try to save changes to an existing employee:
ProgrammingError: ('42000', "[42000] [Microsoft][SQL Native Client][SQL Server]C
annot update identity column 'EMPID'. (8102) (SQLExecDirectW); [42000] [Microsof
t][SQL Native Client][SQL Server]Statement(s) could not be prepared. (8180)")
Having EMP.EMPID as an identity column is not crucial to anything in the program, so dropping the identity property by creating a temporary column, copying, deleting and renaming seems like the logical thing to do. However, this creates one extra step in transferring old customers into Django, so my question is: is there any way to prevent Django from generating the '[EMPID] = XXX' part whenever I'm doing an update on this table?
EDIT
I've patched my model up like this:
def save(self, *args, **kwargs):
    if self.empid:
        self._meta.local_fields = [f for f in self._meta.local_fields if f.name != 'empid']
    super().save(*args, **kwargs)
This works, taking advantage of the way Django builds its SQL statement in django/db/models/base.py (525). If anyone has a better way, or can explain why this is bad practice, I'd be happy to hear it!
This question is old and Sindri found a workable solution, but I wanted to provide a solution that I've been using in production for a few years that doesn't require mucking around in _meta.
I had to write a web application that integrated with an existing business database containing many computed fields. These fields, usually computing the status of the record, are used with almost every object access across the entire application and Django had to be able to work with them.
These types of fields can be handled with a model manager that adds the required fields onto the query with an extra(select=...).
ComputedFieldsManager code snippet: https://gist.github.com/manfre/8284698
class Emp(models.Model):
    ssn = models.CharField(_("Social security number"), max_length=10, primary_key=True, db_column=u'SSN')  # Field name made lowercase.

    objects = ComputedFieldsManager(computed_fields=['empid'])

# the empid is added on to the model instance
Emp.objects.all()[0].empid

# you can also search on the computed field
Emp.objects.all().computed_field_in('empid', [1234])
