Convert spark dataframe to DeltaLake in Databricks

Convert spark dataframe to DeltaLake in Databricks - database

I have a Spark dataframe which is actually a huge parquet file read from the container instance in Azure. And I want to make delta lake format out of it. By every time I try to do that it throws an error without any message attached.
I want to save it to the Databricks itself or to the container instance (if possible).
I tried already df.write.format("delta").save("f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/my_data)
and
df.write.format("delta").saveAsTable("f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/my_data)
and
CREATE DELTA TABLE lifetime_delta
USING parquet
OPTIONS (f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/")
I do think I need to create a table somehow and I heard that since parquet is native for Delta Lake it's already existing in Delta Lake context but for some reason it's not quite true.
None of it worked for me. Thank you in advance.

There are two main ways to convert Parquet files to a Delta Lake:
Read the Parquet files into a Spark DataFrame and write out the data as Delta files. Looks like this is what you're trying to do. Here's your code:
df.write.format("delta").save("f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/my_data)
You may need to change it as follows in accordance with the Python f-string syntax:
df.write.format("delta").save(f"abfss://{container}#{storage_account_name}.dfs.core.windows.net/my_data")
You can also convert from Parquet to Delta Lake in place using the following code:
from delta import *
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`tmp/lake2`")
Here's an example notebook with code snippets to perform this operation that you may find useful.

Related

CSV vs. Parquet Files for Athena tables

We would like to run an experiment to determine whether our target/curated product should be stored in csv or parquet format through a series of queries (joins and aggregations). Other than just checking the execution time in Athena, are there other stats we can check in Athena?
I found the Explain button, but I am not familiar with database explain so I'm unsure what I should be looking for...
Any advice would be appreciated. Thank you.

You should use Parquet or ORC, and make sure it is compressed. It will be both faster and cheaper, no question about it.
Follow these examples and you'll see for yourself: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Basically:
Amazon Athena charges based on data read from disk. Compressed data will reduce the amount of data read from disk. Using a columnar file format will also greatly reduce the amount of disk access required.
Columnar data formats are faster to query because it is 'intelligent' and allows data to be passed-over and never read from disk
You can convert to Snappy-compressed Parquet format using a CREATE TABLE AS command -- see Examples of CTAS queries - Amazon Athena:
CREATE TABLE new_table
WITH (
format = 'Parquet',
write_compression = 'SNAPPY')
AS SELECT *
FROM old_table;

Does snowflake always use a staging area?

I am new to Snowflake, just want to understand the data loading in Snowflake.
Let's say I have some files in Azure ADLS or Amazon S3. I use python to read those files, do some transformation and load the data to a Snowflake table using Pandas 'to_sql' function. Does Snowflake use any staging area implicitly to load the data first and then move the data to the table? or does it directly load the data to the table?

The easiest way to figure out what actual query is actulay executed is checking the QUERY_HISTORY
Writing Data from a Pandas DataFrame to a Snowflake Database
To write data from a Pandas DataFrame to a Snowflake database, do one of the following:
Call the write_pandas() function.
Call the pandas.DataFrame.to_sql() method (see the Pandas documentation), and specify pd_writer() as the method to use to insert the data into the database.
pd_writer(parameters...):
Purpose: pd_writer is an insertion method for inserting data into a Snowflake database.
When calling pandas.DataFrame.to_sql (see the Pandas documentation), pass in method=pd_writer to specify that you want to use pd_writer as the method for inserting data. (You do not need to call pd_writer from your own code. The to_sql method calls pd_writer and supplies the input parameters needed.)
The pd_writer function uses the write_pandas() function to write the data in the DataFrame to the Snowflake database.
and finally write_pandas(parameters...):
Writes a Pandas DataFrame to a table in a Snowflake database.
To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO command to copy the data from the files to the table. You can use some of the function parameters to control how the PUT and COPY INTO statements are executed.

How to modify the projection of a dataset in a ADF Dataflow

I want to optimize my dataflow reading just data I really need.
I created a dataset that maps a view on my database. This dataset is used by different dataflow so I need a generic projection.
Now I am creating a new dataflow and I want to read just a subset of the dataset.
Here how I created the dataset:
And that is the generic projection:
Here how I created the data flow. That is the source settings:
But now I want just a subset of my dataset:
It works but I think I am doing wrong:
I wanto to read data from my dataset (as you can see from source settings tab), but when I modify the projection I read from the underlying table (as you can see from source option). It seems an inconsistence. Which is the correct way to manage this kind of customization?
Thank you
EDIT
The solution proposed does not solve my problem. If I go in monitor and I analyze the exections that is what I saw...
Before I had applyed the solution proposed and with the solution I wrote above I got this:
As you can see I had read just 8 columns from database.
With the solution proposed, I get this:
And just then:
Just to be clear, the purpose of my question is:
How ca n I read only the data I really need instead of read all data and filter them in second moment?
I found a way (explained in my question) but there is an inconsistency with the configuration of the dataflow (I set a dataflow as input but in the option I write a query that read from db).

First import data as a Source.
You can use Select transformation in DataFlow activity to select CustomerID from imported dataset.
Here you can remove unwanted columns.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-select

How to step through excel(xlsx) file uploaded in a BLOB field mssql

I need to get values from a certain column in a xlsx spreadsheet that was uploaded to my database in a image(blob) field. I would like to step through the rows and get values from say column 4 and insert the values into another table by using sqlserver. I can to it with CSV files by casting the image field to varbinary and then cast it again to varhar and search for ','s.
Can openrowset work on a blob field?

I doubt that this can work. Even though the data in the XLSX is stored in Microsofts Office Open-XML format (http://en.wikipedia.org/wiki/Office_Open_XML) the XML is then zipped which means that your XLSX file is a binary file. So if you want to access data in the xlsx (can't you use csv instead?) I think you need to do so programmatically. Depending on the programming logic of your choice there are various open-source projects allowing you to access xlsx file.
Java: Apache POI http://poi.apache.org/spreadsheet/
C++: http://sourceforge.net/projects/xlslib/?source=directory
...

SQL2008 Integration Services - Loading CSV files with varying file schema

I'm using SQL2008 to load sensor data in a table with Integration Services. I have to deal with hundreds of files. The problem is that the CSV files all have slightly different schemas. Each file can have a maximum of 20 data fields. All data files have these fields in common. Some files have all the fields others have some of the fields. In addition, the order of the fields can vary.
Here’s and example of what the file schemas look like.
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,RD_1,SH_1,CL_2
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,WS_1,WD_1,WSM_1,WDM_1,SH_1
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,RS_1,RI_1,PR_1,RD_1,WS_1,WD_1,WSM_1,WDM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,PW_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,WS_1,WD_1,WSM_1
I’m using a Data Flow Script Task to process the data via CreateNewOutputRows() and MyOutputBuffer.AddRow(). I have a working package to load the data however it’s not reliable and robust because as I had more files the package fails because the file schema has not been defined in CreateNewOutputRows().
I'm looking for a dynamic solution that can cope with the variation in the file schema. Doeas anyone have any ideas?

Who controls the data model for the output of the sensors? If it's not you, do they know what they are doing? If they create new and inconsistent models every time they invent a new sensor, you are pretty much up the creek.
If you can influence or control the evolution of the schemas for CSV files, try to come up with a top level data architecture. In the bad old days before there were databases, files made up of records often had, as the first field of each record, a "record type". CSV files could be organized the same way. The first field of every record could indicate what type of record you are dealing with. When you get an unknown type, put it in the "bad input file" until you can maintain your software.
If that isn't dynamic enough for you, you may have to consider artificial intelligence, or looking for a different job.

Maybe the cmd command is good. in the cmd, you can use sqlserver import csv.

If the CSV files that all have identical formats use the same file name convention or if they can be separated out in some fashion you can use the ForEach Loop Container for each file schema type.
Possible way to separate out the CSV files is run a Script (in VB) in SSIS that reads the first row of the CSV file and checks for the differing types (if the column names are in the first row) and then moves the files to the appropriate folder for use in the ForEach Loop Container.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Convert spark dataframe to DeltaLake in Databricks - database

Related

CSV vs. Parquet Files for Athena tables

Does snowflake always use a staging area?

How to modify the projection of a dataset in a ADF Dataflow

How to step through excel(xlsx) file uploaded in a BLOB field mssql

SQL2008 Integration Services - Loading CSV files with varying file schema

Categories

Resources