I've created some Hive tables using JDBC in a Python notebook on Databricks. This was in the Data Science and Engineering UI. I'm able to query the tables in a Databricks notebook and use SQL directly with the %sql magic command.
When switching to the Databricks SQL UI, I can still see the tables in the Hive metastore explorer. However, I'm not able to read the data: a very clear message says that only CSV, Parquet, and so on are supported.
I find this surprising: since I can use the data in the Data Science and Engineering UI, why isn't that the case in Databricks SQL? Is there any solution to overcome this?
Yes, it's a known limitation that Databricks SQL currently supports only file-based formats. As I remember, it's related to the security model, plus the fact that DBSQL uses Photon under the hood, where JDBC integration might not be very performant. You can reach out to your solution architect or customer success engineer to find out whether it will be supported in the future.
The current workaround is to have a job that periodically reads all the data from the database via JDBC and dumps it into a Delta table - querying the Delta table could even be more performant than going through JDBC; the only issue is the freshness of the data.
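A rough sketch of that workaround as SQL you could run from a scheduled notebook or job (the connection details and table names below are made up):
-- All names and connection details are hypothetical.
-- Expose the remote table over JDBC for this session.
CREATE TEMPORARY VIEW jdbc_orders
USING JDBC
OPTIONS (
  url 'jdbc:postgresql://db-host:5432/shop',
  dbtable 'public.orders',
  user 'reader',
  password '<secret>'
);

-- Materialize it as a Delta table that Databricks SQL can query.
CREATE OR REPLACE TABLE orders_delta
USING DELTA
AS SELECT * FROM jdbc_orders;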
You can import a Hive table from cloud storage into Databricks using an external table and query it using Databricks SQL.
Step 1: Show the CREATE TABLE statement
Issue a SHOW CREATE TABLE <tablename> command on your Hive command line to see the statement that created the table.
For example:
hive> SHOW CREATE TABLE wikicc;
OK
CREATE TABLE `wikicc`(
`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/wikicc'
TBLPROPERTIES (
'totalSize'='2335',
'numRows'='240',
'rawDataSize'='2095',
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'transient_lastDdlTime'='1418173653')
Step 2: Issue a CREATE EXTERNAL TABLE statement
If the statement that is returned uses a CREATE TABLE command, copy the statement and replace CREATE TABLE with CREATE EXTERNAL TABLE.
EXTERNAL ensures that Spark SQL does not delete your data if you drop the table.
You can omit the TBLPROPERTIES field.
DROP TABLE wikicc
CREATE EXTERNAL TABLE `wikicc`(
`country` string,
`count` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/hive/warehouse/wikicc'
Step 3: Issue SQL commands on your data
SELECT * FROM wikicc
Source: https://docs.databricks.com/data/data-sources/hive-tables.html
Here's the basic idea of what I want to do in SSIS:
I have a large query against a production Oracle database, and I need the following WHERE clause, which brings in a long list of IDs from SQL Server. From there, the results are sent elsewhere.
select ...
from Oracle_table(s) --multi-join
where id in ([select distinct id from SQL_SERVER_table])
Alternatively, I could write the query this way:
select ...
from Oracle_table(s) --multi-join
...
join SQL_SERVER_table sst on sst.ID = Oracle_table.ID
Here are my limitations:
The Oracle query is large and cannot be run without the where id in (...) clause
This means I cannot run the Oracle query and then join it against the IDs in another step. I tried this, and the DBAs killed the temp table after it became 3 TB in size.
I have 160K IDs
This means it is not practical to iterate through the IDs one by one. In the past, I have run against ~1,000 IDs using a comma-separated list. It runs relatively fast - a few minutes.
The main query is in Oracle, but the ids are in SQL Server
I do not have the ability to write to Oracle
I've found many questions like this.
None of the answers I have found have a solution to my limitations.
Similar question:
Query a database based on result of query from another database
To prevent loading all rows from the Oracle table, the only way is to apply the filter in the Oracle database engine. I don't think this can be achieved using SSIS alone, since you have more than 160,000 IDs in the SQL Server table, which cannot be efficiently loaded and passed to the Oracle SQL command:
Using Lookups or a Merge Join will require loading all data from the Oracle database.
Retrieving the data from SQL Server, building a comma-separated string, and passing it to the Oracle SQL command cannot be done with that many IDs (160K).
The same issue applies when using a Script Task.
Creating a Linked Server in SQL Server and joining both tables will load all data from the Oracle database.
To solve your problem, you should search for a way to create a link to the SQL Server database from the Oracle engine.
Oracle Heterogeneous Services
I don't have much experience with Oracle databases. Still, after a bit of research, I found that Oracle has an equivalent to SQL Server's "Linked Servers" called "heterogeneous connectivity".
The query syntax should look like this:
select *
from Oracle_table
where id in (select distinct id from SQL_SERVER_table#sqlserverdsn)
You can refer to the following step-by-step guides to read more on how to connect to SQL Server tables from Oracle:
What is Oracle equivalent for Linked Server and can you join with SQL Server?
Making a Connection from Oracle to SQL Server - 1
Making a Connection from Oracle to SQL Server - 2
Heterogeneous Database connections - Oracle to SQL Server
Importing Data from SQL Server to a staging table in Oracle
Another approach is to use a Data Flow Task that imports IDs from SQL Server to a staging table in Oracle. Then use the staging table in your Oracle query. It would be better to create an index on the staging table. (If you do not have permission to write to the Oracle database, try to get permission to a separate staging database.)
Example of exporting data from SQL Server to Oracle:
Export SQL Server Data to Oracle using SSIS
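As a rough sketch of that approach (the staging table, index, and column names below are hypothetical):
-- On the Oracle side, or in a separate staging schema you are allowed to write to:
CREATE TABLE stg_filter_ids (id NUMBER NOT NULL);
CREATE INDEX ix_stg_filter_ids ON stg_filter_ids (id);

-- After the SSIS Data Flow Task has loaded the 160K IDs into the staging table:
SELECT t.*
FROM   oracle_table t
JOIN   stg_filter_ids s ON s.id = t.id;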
Minimizing the data load from the Oracle table
If none of the solutions above solves your issue, you can try to minimize the data loaded from the Oracle database as much as possible.
As an example, you can get the minimum and maximum IDs from the SQL Server table and store both values in two variables. Then, use both variables in the SQL command that loads the data from the Oracle table, like the following:
SELECT * FROM Oracle_Table WHERE ID > #MinID and ID < #MaxID
This will remove a bunch of useless data from your operation. In case your ID column is a string, you can use other measures to filter the data, such as the string length or the first character.
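For example, an Execute SQL Task against SQL Server could populate the two variables (the table and column names are hypothetical):
-- Map the single-row result set to the MinID and MaxID package variables.
SELECT MIN(id) AS MinID, MAX(id) AS MaxID
FROM   dbo.sql_server_table;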
I am working to migrate a SQLite database to SQL Server, and I need to use IntelliJ IDEA to import all the data from the SQLite tables into the MSSQL database.
I have exported the data to CSV format, but when I import into SQL Server, I need to maintain the existing ID columns (as foreign keys refer to it).
Normally, I can do this by executing SET IDENTITY_INSERT xxx ON; prior to my INSERT statements.
However, I do not know how to do this when importing CSV using IntelliJ.
The only other option I see is to export the data as a series of SQL INSERT statements, but that is very time consuming as the schemas between the two databases are slightly different (not to mention the SQL syntax).
Is there another way to import this data?
I don't know how to perform an Identity Insert ON in an IntelliJ query, but I do know how to work around this problem. Import your data into a temporary table destination, then execute a query within SQL Server that
Sets Identity Insert ON
Inserts the data from the temporary table into the final destination
Sets Identity Insert OFF
What this really does is prevent you from having to spend (potentially) hours finding out how to implement an Identity Insert ON in IntelliJ when you may never need to do this again. It is straightforward and simple to code as well.
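Here is a minimal T-SQL sketch of that workaround (the table and column names are hypothetical):
-- Assume the CSV was imported into dbo.Users_Staging via IntelliJ.
SET IDENTITY_INSERT dbo.Users ON;

INSERT INTO dbo.Users (Id, Name, Email)
SELECT Id, Name, Email
FROM dbo.Users_Staging;

SET IDENTITY_INSERT dbo.Users OFF;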
However, if you want to find out whether there is a way to do this in IntelliJ, go for it. That would be the cleaner method.
We are looking for a solution to implement "always encrypted" columns in a database, where we are using at the same time many SQL temporary tables (#tmp).
We explored the alternative path of no longer using #temp tables, but this would have a high impact on our app in terms of time/cost.
Did anyone find a way to write queries like "insert into #tmp select from my_table", where my_table contains AE columns?
I tried applying the same CMK and CEK to the tempdb database, so that I could create the #tmp table with the same structure as my_table.
This doesn't solve the problem though - having the tables in 2 different databases seems to prevent the data transfer.
I'm looking for an SQL solution, and not for a solution which involves a client app (C#, vb, etc.) which has access to all the encryption keys.
Insert operations in the manner you are describing are not supported for encrypted columns.
"insert into #tmp select from my_table"
You will have to write a client app to achieve a similar result. If you want to explore that path, please leave a comment and I can guide you.
You should be able to achieve something similar in C# as follows.
Do a select * from encryptedTable to load the data into a SqlDataReader, then use SqlBulkCopy to load it into the temp table via the SqlBulkCopy.WriteToServer(IDataReader) method.
If you have the encrypted table and the plaintext table on the same SQL Server instance, be aware that you might be leaking information to the SQL Server admin, because they can examine the plaintext data and the corresponding ciphertext.
I need a bit of advice on how to solve the following task:
I have a source system based on IBM DB2 (IBMDA400) with a lot of tables whose structure changes rapidly, on a daily basis. I must load specified tables from DB2 into an MSSQL 2008 R2 server, so I thought SSIS would be the best choice.
My first attempt was just to add both data sources, drop all tables in MSSQL, and recreate them with a "Select * Into #Table From #Table". But I was not able to get this working because I could not connect the two OLEDB connections. I also tried this with an OPENROWSET statement, but the SQL Server does not allow that for security reasons and I am not allowed to change that.
My second try was to manually read the tables from the source, drop and recreate the tables with a Foreach Loop, and then load the data via the Data Flow Task. But I got stuck on getting the metadata from the Execute SQL Task... so I don't get the column names and types.
I cannot believe that this is so hard to achieve. Why is there no "create table if not exists" checkbox on the Data Flow Task?
Of course, I searched for this problem here before but could not find a solution.
Thanks in advance,
Pad
This is the solution I ended up with:
Create a file/table that is used to select the source tables.
Important: create a linked server on your SQL instance, or a working connection string for OPENROWSET (I was not able to do the latter, so I chose the linked server).
Query the source file/table.
Loop through the result set.
Use variables and a Script Task to build your query.
Drop the destination table.
Build another query string with INSERT INTO TABLE FROM OPENROWSET (or OPENQUERY if you used a linked server) - see the sketch after this list.
Execute this statement.
Done.
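A sketch of the statement built in the query step above, assuming a linked server (the linked server and table names are hypothetical; in the real package the string is assembled from variables):
-- Drop and recreate the destination (SELECT ... INTO recreates it),
-- then pull everything through the linked server.
IF OBJECT_ID('dbo.MyDb2Table', 'U') IS NOT NULL
    DROP TABLE dbo.MyDb2Table;

SELECT *
INTO   dbo.MyDb2Table
FROM   OPENQUERY(DB2_LINKED_SERVER, 'SELECT * FROM MYLIB.MYTABLE');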
As I said above, I am not quite happy with this, but for now it should be OK. I will update this if I find another solution.
I have loaded a huge table from SQL Server into Hive. The mistake I made is that I created it as an internal table in Hive. Can anyone suggest a hack so that I can alter the table structure without dropping the data?
The data is huge and I can't afford to export it from the source again.
The problem right now is that since the column order doesn't match the SQL Server table, a lot of columns display NULL.
Any help will be highly appreciated.
I do not see any problem with using ALTER TABLE on an internal table (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column).
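For example, if a column simply ended up in the wrong position, Hive's CHANGE COLUMN clause can reorder it in the metadata (the table, column names, and type below are hypothetical):
-- Moves customer_id so the declared column order matches the underlying data files;
-- this only changes Hive's metadata, not the data itself.
ALTER TABLE my_table CHANGE COLUMN customer_id customer_id BIGINT AFTER order_id;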
Another - but not recommended - option would be to open your Hive metastore (HCatalog) and apply the changes there. Hive reads the schema information from a relational database (configured during the Hadoop setup, often MySQL). In that database you can try to change the metadata directly. However, this is not recommended, as one mistake can break your whole Hive database.
The safest way is to create a new table and use the existing one as the source:
create table new_table
as
select
[...]
from existing_table