loading a csv file to existing HIVE tale through Spark - sql-server

Below is the code that I have written to connect to a RDBMS, then create temp table , execute SQL query on that temp table, saving the SQL query output to a .csv format through databricks module.
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://<server>:<port>").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxxx").option("user","xxxxx").option("password","xxxxx").load()
df.registerTempTable("test")
df1= sqlContext.sql("select * from test where xxx= 6")
df1.write.format("com.databricks.spark.csv").save("/xxxx/xxx/ami_saidulu")
df1.write.option("path", "/xxxx/xxx/ami_saidulu").saveAsTable("HIVE_DB.HIVE_TBL",format= 'csv',mode= 'Append')
Where HIVE.DB is an existing HIVE DATABASE
HIVE.TBL is an existing HIVE TABLE
after I execute the code, I am getting below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.saveAsTable.
: java.lang.RuntimeException: Append mode is not supported by com.databricks.spark.csv.DefaultSource15
Does that mean, the databricks module doesn't support "saveAsTable" function?
If yes, then please point out the mistakes in my code.
If no, then what is the solution/work around/industry standards ?
Spark 1.6.1

I can suggest you one another solution.
You can use Insert functionality to insert in the table.
sqlContext.sql("INSERT INTO/OVERWRITE TABLE HIVE_DB.HIVE_TBL select * from test where xxx= 6")
I hope this solution will help you and you can directly write into table, why do you want to write in csv and then writing into the table?
Even if you want text delimited file #table path. Just define table as TextFile table with the required delimiter. Your files #table path would be the delimited one after insert.
Thanks

Assuming your table is managed:
Just do df.write.saveAsTable('HIVE_DB.HIVE_TBL',write_mode='Append')‌, no need to go through an intermediate csv-File.
What this error means is that the databricks module for csv does not support Append mode. There is an issue on github here. So the solution is not to use csv with append mode.

Related

How do I put a csv into an external table in SNowflake?

I have a staged file and I am trying to query the first line/row of it because it contains the column headers of the file. Is there a way I can create an external table using this file so that I can query the first line?
I am able to query the staged file using
SELECT a.$1
FROM #my_stage (FILE_FORMAT=>'my_file_format',PATTERN=>'my_file_path') a
and then to create the table I tried doing
CREATE EXTERNAL TABLE MY_FILE_TABLE
WITH
LOCATION='my_file_path'
FILE_FORMAT = my_file_format;
Reading Headers from CSV is not supported however this answer from StackOverflow gives a workaround.

SQL Server Excel Worksheets Import

I am working on an import script, the idea being to import multiple workbooks into one table. I have made progress, so I am able to import one workbook successfully into my table. What I want to do is create a query that will loop a folder read the file names and import the data into my database in Microsoft SQL Server Management Studio.
--Creating the TABLE--
CREATE TABLE BrewinDolphinHoldings
(
recordID INT AUTO_NUMBER
FUNDNAME VARCHAR(25),
SEDOL VARCHAR(7),
ISIN VARCHAR(11),
NAME VARCHAR(20),
WEIGHT INTEGER(3)
)
constraint[pk_recordID]PRIMARYKEY
(
[recordID] ASC
)
INSERT INTO BrewinDolphinHoldings
VALUES
("HoldingsData', GB125451241, DavidsHoldings, 22)
--SELECTING THE SHEET--
SELECT/UPDATE? *
FROM OPENROWSET('Microsoft.JET.OLEDB.4.0',
'Excel 8.0;Database=X:\CC\sql\DEMO\SpreadsheetName + '.xlsx',
'SELECT * FROM [Sheet1$]') AS HoldingsData
So essentially my question is, I want to create a loop a loop that will read the file name in a directory, and the import will read that name every time it loops and import the relevant spreadsheets? so,for example:
DECLARE SpreadsheetName as STRING
DECLARE FileExtension as '.xlsx'
FOR EACH ITEM IN DIRECTORY
X=1
Y=MAX
FILENAME THAT LOOP READS = SpreadsheetName
SELECT * FROM
OPENROWSET('Microsoft.JET.OLEDB.12.0',
'Excel 8.0;Database=X:\CC\sql\DEMO\SpreadsheetName + fileExtension.xls
END LOOP
So, I'm thinking maybe something like this? Although I don't know if the loop will overwrite my database? maybe instead of UPDATE I should use INSERT?
I don't want to use SSIS, preferably a query, although if anyone can recommend anything I could look into, or, help me with this loop It would greatly help
I'm open to new ideas from you guys, so if anyone can try and fix my code, or give me a few examples of imports for multiple excel sheets, would be greatly appreciated!
I'm new to SQL Server, I do have some previous programming experience!
Thanks!
You can use bcp to do what you are talking about to import any type of delimited text file, such as csv or text tab delimited. If it is possible generate/save the spreadsheets as csv and use this method. See these links.
Import Multiple CSV Files to SQL Server from a Folder
http://www.databasejournal.com/features/mssql/article.php/3325701/Import-multiple-Files-to-SQL-Server-using-T-SQL.htm
If it has to be excel, then you can't use bcp, but these should still help you with the logic for the loops on the file folders. I have never used the excel openrowset before, but if you have it working like you said, it should be able to insert in just the same. You can still use the xp_cmdshell/xp_dirtree to look at the files and generate the path even though you can't import it with bcp.
How to list files inside a folder with SQL Server
I would then say it would be easiest to do a insert from a select statement from the openrowset to put it into the table.
http://www.w3schools.com/sql/sql_insert_into_select.asp
Make sure xp_cmdshell is enabled on your sql server instance as well.
https://msdn.microsoft.com/en-us/library/ms190693(v=sql.110).aspx

Loading data of one table into another residing on different databases - Netezza

I have a big file which I have loaded in a table in a netezza database using an ETL tool, lets call this database Staging_DB. Now, post some verifications, the content of this table needs to be inserted into similar structured table residing in another netezza DB, lets call this one PROD_DB. What is the fastest way to transfer data from staging_DB to PROD_DB?
Should I be using the ETL tool to load the data into PROD_DB? Or,
Should the transfer be done using external tables concept?
If there is no transformation need to be done, then better way to transfer is cross database data transfer. As described in Netezza documentation that Netezza support cross database support where the user has object level permission on both databases.
You can check permission with following command -
dbname.schemaname(loggenin_username)=> \dpu username
Please find below working example -
INSERT INTO Staging_DB..TBL1 SELECT * FROM PROD_DB..TBL1
If you want to do some transformation and than after you need to insert in another database then you can write UDT procedures (also called as resultset procedures).
Hope this will help.
One way you could move the data is by using Transient External Tables. Start by creating a flat file from your source table/db. Because you are moving from Netezza to Netezza you can save time and space by turning on compression and using internal formatting.
CREATE EXTERNAL TABLE 'C:\FileName.dat'
USING (
delim 167
datestyle 'MDY'
datedelim '/'
maxerrors 2
encoding 'internal'
Compress True
REMOTESOURCE 'ODBC'
logDir 'c:\' ) AS
SELECT * FROM source_table;
Then create the table in your target database using the same DDL in the source and just load it up.
INSERT INTO target SELECT * FROM external 'C:\FileName.dat'
USING (
delim 167
datestyle 'MDY'
datedelim '/'
maxerrors 2
encoding 'internal'
Compress True
REMOTESOURCE 'ODBC'
logDir 'c:\' );
I would write a SP on production db and do a CTAS from stage to production database. The beauty of SP is you can add transformations as well.
One other option is NZ migrate utility provided by Netezza and that is the fastest route I believe.
A simple SQL query like
INSERT INTO Staging_DB..TBL1 SELECT * FROM PROD_DB..TBL1
works great if you just need to do that.
Just be aware that you have to be connected to the destination database when executing the query, otherwise you will get an error code
HY0000: "Cross Database Access not supported for this type of command"
even if you have read/write access to both databases and tables.
In most cases you can simply change the catalog using a "Set Catalog" command
https://www-304.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_set_catalog.html
set catalog='database_name';
insert into target_db.target_schema.target_table select source_db.source_schema.source_table;

How to import csv files

How can I import CSV file data into SQL Server 2000 table? I need to insert data from CSV file to table twice a day. Table has more then 20 fields but I only need to insert value into 6 fields.
i face same problem before i can suggest start reading here. The author covers:"This is very common request recently – How to import CSV file into SQL Server? How to load CSV file into SQL Server Database Table? How to load comma delimited file into SQL Server? Let us see the solution in quick steps."
I need to insert data from CSV file to table twice a day.
Use DTS to perform the import, then schedule it.
For SQL 2000, I would use DTS. You can then shedule this as a job when your happy with it.
Below is a good Microsoft link explaining how to use it.
Data Transformation Services (DTS)
You describe two distinct problems:
the CSV import, and
the extraction of data into only those 6 fields.
So break your solution down into two steps:
import the CSV into a raw staging table, and
then insert into your six 'live' fields from that staging table.
There is a function for the first part, called BULK INSERT, the syntax looks like this:
BULK INSERT target_staging_table_in_database
FROM 'C:\Path_to\CSV_file.csv'
WITH
(
DATAFILETYPE = 'CHAR'
,FIRSTROW = 2
,FIELDTERMINATOR = ','
,ROWTERMINATOR = '\n'
);
Adjust to taste, and consult the docs for more options. You might also want to TRUNCATE or DELETE FROM your staging table before doing the bulk insert so you don't have any old data in there.
Once you get the information into the database, doing an UPDATE or INSERT into those six fields should be straightforward.
You can make of use SQL Server Integration services(SSIS). It's jusy one time task to create the Package. Next time onwards just run that package.
You can also try Bulk Insert as daniel explained.
You can also try Import export wizard in SQL Server 2000.

Can't import as null value SQL Server 2008 TSV file

I import data from a TSV file with SQL Server 2008.
null is replaced by 0 when I confirm a table after import with integer column.
How to import as null, please Help me!!
Using bcp, -k switch
Using BULK INSERT, use KEEPNULLS
After comment:
Using SSIS "Bulk insert" task, options page, "Keep nulls" = true
This is what the import wizard uses: but you'll have to save and edit it first because I see no option in my SSMS 2005 wizard.
This can be set in the OLE DB Destination editor....there is a 'Keep nulls' option.
Alternative for those using the Import and Export Wizard on SQL Server Express, or anyone who finds themselves too lazy to modify the SSIS package:
Using text editing software before you run the wizard, replace NULLs with a valid value that you know doesn't appear in your dataset (eg. 987654; be sure to do a search first!) and then run the Import Export Wizard normally. If your data contains every single value (maybe bits or tinyints), you'll have some data massaging ahead of you, but it's still possible by using a temporary table with datatypes that can store a greater number of values. Once it's in SQL, use commands like
UPDATE TempTable
SET Column1 = NULL
WHERE Column1 = 987654
to get those NULLs where they belong. If you've used a temporary table, use INSERT INTO or MERGE to get your data into your end table.

Resources