Read a list of CSV files from a data lake and union them into a single PySpark DataFrame

I am trying to read a list of CSV files from Azure Data Lake one by one and, after some checking, union them all into a single DataFrame.
fileList = dbutils.fs.ls(file_input_path)
for i in fileList:
    try:
        file_path = i.path
        print(file_path)
    except Exception as e:
        raise Exception(str(e))
In this case, I want to read each CSV from file_path with a custom schema and union all of them into a single DataFrame.
I could only read one CSV, as below. How can I read each and every CSV and union them all into one single DataFrame?
df = spark.read.csv(file_path, header=True, schema=custom_schema)
How can I achieve this efficiently? Thanks.

I managed to read and union them as below.
fileList = dbutils.fs.ls(file_input_path)
output_df = spark.createDataFrame([], schema=custom_schema)
for i in fileList:
    try:
        file_path = i.path
        df = spark.read.csv(file_path, header=True, schema=custom_schema)
        output_df = output_df.union(df)
    except Exception as e:
        raise Exception(str(e))
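As an aside, unioning inside a loop grows the query plan one file at a time. A shorter variant is to let Spark read the whole list of paths in one call, or to fold the per-file DataFrames with functools.reduce. A minimal sketch, assuming the same spark session, dbutils, file_input_path and custom_schema as above; the .csv filter is only a placeholder for your own per-file checks:
from functools import reduce
from pyspark.sql import DataFrame

# Placeholder filter standing in for the per-file checking described above.
csv_paths = [f.path for f in dbutils.fs.ls(file_input_path) if f.path.endswith(".csv")]

# Option 1: one read over all paths with the custom schema.
output_df = spark.read.csv(csv_paths, header=True, schema=custom_schema)

# Option 2: read per file (e.g. to validate each one) and fold with unionByName.
dfs = [spark.read.csv(p, header=True, schema=custom_schema) for p in csv_paths]
output_df = reduce(DataFrame.unionByName, dfs)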

Related

Converting a table in one form to another using Snowflake

I am trying to load a CSV file into Snowflake. The sample format of the input CSV table in the S3 location is as follows (with 2 columns: ID, Location_count):
Input csv table
I need to transform it into the format below (with 3 columns: ID, Location, Count):
Output csv table
However, when I try to load the input file using the following statements, after creating the database, external stage, and file format, it returns LOAD_FAILED:
create or replace table table_name
(
    id integer,
    Location_count variant
);
select parse_json(Location_count) as c;
list @stage_name;
copy into table_name from @stage_name file_format = 'fileformatname' on_error = 'continue';
You will probably need to parse_json that 2nd column as part of a copy transformation. For example:
create file format myformat
    type = csv field_delimiter = ','
    FIELD_OPTIONALLY_ENCLOSED_BY = '"';
create or replace stage csv_stage file_format = (format_name = myformat);
copy into @csv_stage from
    (select '1',
            '{"SHS-TRN":654738,"PRN-UTN":78956,"NCT-JHN":96767}');
create or replace table blah (id integer, something variant);
copy into blah from (select $1, parse_json($2) from @csv_stage);

Error while retrieving data from SQL Server using pyodbc python

My table has 5 columns and 5288 rows. I am trying to read that data and write it to a CSV file, adding column names. The code for that looks like this:
cursor = conn.cursor()
cursor.execute('Select * FROM classic.dbo.sample3')
rows = cursor.fetchall()
print("The data has been fetched")
dataframe = pd.DataFrame(rows, columns=['s_name', 't_tid', 'b_id', 'name', 'summary'])
dataframe.to_csv('data.csv', index=None)
The data looks like this
s_sname t_tid b_id name summary
---------------------------------------------------------------------------
db1 001 100 careie hello this is john speaking blah blah blah
It looks like above but has 5288 such rows.
When I try to execute the code above, it throws an error saying:
ValueError: Shape of passed values is (5288, 1), indices imply (5288, 5)
I do not understand what I am doing wrong.
Use this instead:
dataframe = pd.read_sql('Select * FROM classic.dbo.sample3',con=conn)
dataframe.to_csv('data.csv', index = None)
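If you want to keep the explicit cursor instead, note that fetchall() returns pyodbc.Row objects rather than plain tuples, and some pandas versions then treat the whole list as a single column, which matches the shape error above. A minimal sketch of that workaround, assuming the same conn and column names as in the question:
import pandas as pd

cursor = conn.cursor()
cursor.execute('Select * FROM classic.dbo.sample3')
# Convert each pyodbc.Row to a plain tuple so pandas sees 5 columns, not 1.
rows = [tuple(row) for row in cursor.fetchall()]
dataframe = pd.DataFrame(rows, columns=['s_name', 't_tid', 'b_id', 'name', 'summary'])
dataframe.to_csv('data.csv', index=None)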

CSV file to a Lua table and access the lines as a new table or function()

Currently my code has simple tables containing the data needed for each object, like this:
infantry = {class = "army", type = "human", power = 2}
cavalry = {class = "panzer", type = "motorized", power = 12}
battleship = {class = "navy", type = "motorized", power = 256}
I use the table names as identifiers in various functions and have their values processed one by one, simply by calling a function that has access to the values.
Now I want to have this data stored in a spreadsheet (CSV file) instead, which looks something like this:
Name class type power
Infantry army human 2
Cavalry panzer motorized 12
Battleship navy motorized 256
The spreadsheet will not have more than 50 lines and I want to be able to increase columns in the future.
I tried a couple of approaches from similar situations I found here, but due to my lacking skills I failed to access any values from the nested table. I think this is because I don't fully understand what the table structure looks like after reading each line of the CSV file into the table, and therefore I fail to print any values at all.
If there is a way to get the name, class, type and power from the table and use that line just like my old simple tables, I would appreciate an educational example. Another approach could be to declare new tables from the CSV that behave exactly like my old simple tables, line by line from the CSV file. I don't know if this is doable.
Using Lua 5.1
You can read the CSV file in as a string. I will use a multi-line string here to represent the CSV.
gmatch with the pattern [^\n]+ will return each row of the CSV.
gmatch with the pattern [^,]+ will return the value of each column from a given row.
If more rows or columns are added, or if the columns are moved around, we will still reliably convert the information, as long as the first row has the header information.
The only column that cannot move is the first one, the Name column; if that is moved, it will change the key used to store the row in the table.
Using gmatch and 2 patterns, [^,]+ and [^\n]+, you can separate the string into each row and column of the csv. Comments in the following code:
local csv = [[
Name,class,type,power
Infantry,army,human,2
Cavalry,panzer,motorized,12
Battleship,navy,motorized,256
]]
local items = {} -- Store our values here
local headers = {} --
local first = true
for line in csv:gmatch("[^\n]+") do
    if first then -- this is to handle the first line and capture our headers
        local count = 1
        for header in line:gmatch("[^,]+") do
            headers[count] = header
            count = count + 1
        end
        first = false -- set first to false to switch off the header block
    else
        local name
        local i = 2 -- we start at 2 because we don't increment for the header
        for field in line:gmatch("[^,]+") do
            name = name or field -- check if we know the name of our row
            if items[name] then -- if the name is already in the items table then this is a field
                items[name][headers[i]] = field -- assign our value at the header in the table with the given name
                i = i + 1
            else -- if the name is not in the table we create a new index for it
                items[name] = {}
            end
        end
    end
end
Here is how you can load a csv using the I/O library:
-- Example of how to load the csv.
path = "some\\path\\to\\file.csv"
local f = assert(io.open(path))
local csv = f:read("*all")
f:close()
Alternatively, you can use io.lines(path), which would take the place of csv:gmatch("[^\n]+") in the for loop section as well.
Here is an example of using the resulting table:
-- print table out
print("items = {")
for name, item in pairs(items) do
    print(" " .. name .. " = { ")
    for field, value in pairs(item) do
        print(" " .. field .. " = " .. value .. ",")
    end
    print(" },")
end
print("}")
The output:
items = {
Infantry = {
type = human,
class = army,
power = 2,
},
Battleship = {
type = motorized,
class = navy,
power = 256,
},
Cavalry = {
type = motorized,
class = panzer,
power = 12,
},
}

Snowflake - trying to load a row of csv data into Variant - "Error parsing JSON:"

I'm trying to load the entirety of each row in a CSV file into a variant column.
My copy into statement fails with the error below:
Error parsing JSON:
which is really odd, as my data isn't JSON and I've never told it to validate it as JSON.
create or replace file format NeilTest
RECORD_DELIMITER = '0x0A'
field_delimiter = NONE
TYPE = CSV
VALIDATE_UTF8 = FALSE;
with
create table Stage_Neil_Test
(
    Data VARIANT,
    File_Name string
);
copy into Stage_Neil_Test(Data, File_Name)
from (select
    s.$1, METADATA$FILENAME
    from @Neil_Test_stage s)
How do I stop snowflake from thinking it is JSON?
You need to explicitly cast the text into a VARIANT type, since Snowflake cannot auto-interpret it as it would if the data were JSON.
Simply:
copy into Stage_Neil_Test(Data, File_Name)
from (select
    s.$1::VARIANT, METADATA$FILENAME
    from @Neil_Test_stage s)

Oracle to SQL Server large-scale Migration

I am trying to migrate a very large number of rows and columns dynamically from Oracle to SQL Server, and I am trying to automate this for about 20 tables daily.
My process is as follows:
Run Oracle configuration script: runs a query which returns the Oracle table data, but the data has duplicates and bad data.
Oracle format script: calls the config script and then fixes the data to be accurate (deletes duplicates and bad data). Returns the data to a .txt file with '~' separators.
.txt to .csv: (this is my workaround/hack) using Excel, I open all of the .txt files and use AutoFormat to change them into perfectly formatted .csv's (with the manual delimiter option, because when I use the auto-delimiter '~', it comes up with random one-letter columns). A programmatic sketch of this conversion is shown at the end of this question.
R: using the newly generated .csv's, I create table structures based on these headers in SQL Server.
R: using the data collected from running the format script for each table, I have ~20 .txt files. I load them into RStudio and then attempt to insert the new data using the following script:
Initialization:
RScript which loads the table structure of one .CSV
library(RODBC)
#set table name
tname <- "tablename"
#connect to MSSQL
conn <- odbcDriverConnect("driver={SQL Server};
Server=serv; Database=dbname;
Uid=username; Pwd=pwd;trusted_connection=yes")
#get df headers from .csv
updatedtables <- read.csv(file = paste("<PATH>", tname, ".csv", sep=""), sep = ",")
#save to MSSQL
save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = F, rownames = F, verbose = T, safer = T, fast = F)
#add update record in log file
write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
odbcClose(conn)
return(0)
I run this code for each .csv to initialize the table structures
Dynamic things:
RScript which attempts to load data from .txt to existing table structures:
library(RODBC)
##~~~~~~~~~~~~~~ Oracle to MSSQL Automation Script ~~~~~~~~~~
#.bat script first runs which updates .txt dumps containing Oracle database info using configuration and format .sql queries for each table.
#function that writes to a table tname using data from tname.txt
writeTable <- function(tname){
    # connect to MSSQL
    conn <- odbcDriverConnect("driver={SQL Server};
        Server=serv; Database=db;
        Uid=uid; Pwd=pw;trusted_connection=yes")
    # remove all data entries for the table
    res <- sqlQuery(conn, paste("TRUNCATE TABLE dbo.", tname, sep = ""))
    # load updated data from the .txt files
    # this skips three lines because of blank lines and to skip the headers
    updatedtables <- read.csv(file = paste("<PATH>", tname, ".txt", sep=""), skip=3, sep = "~")
    # **ERROR**
    # save to MSSQL
    save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F)
    # add update record in log file
    write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
    # close connection
    odbcClose(conn)
    return(0)
}
#gets all file names and dynamically inputs the data for every file in the path directory
update <- function(){
    # insert line separator for log organization which represents a new script run
    write("-UPDATE--------------------", file = "<PATH>", append = T)
    path = "<PATH>"
    # save all .txt file names in path to a vector
    file.names <- dir(path, pattern =".txt")
    # go through the path's files
    for(i in 1:length(file.names)){
        # get the proper table name of file.names[i]
        temp <- gsub(".txt", "", file.names[i])
        # call the helper function
        writeTable(temp)
    }
}
#run
update()
This file contains the error. Everything works and loads, except the line save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F), which returns the error
Error in `colnames<-`(`*tmp*`, value = c("QUOTENUMBER", "ENTITYPRODUCTITEMID", :
length of 'dimnames' [2] not equal to array extent
Any suggestions on what to do?
This is an example of the way the text dumps look (without spaces and with dummy data)
http://pastebin.com/kF8yR1R0
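As an aside on the .txt-to-.csv step above, the '~'-separated dumps (like the pastebin example) could also be converted programmatically instead of through Excel. A minimal sketch, assuming pandas is available; the file names are hypothetical:
import pandas as pd

# Hypothetical file names; the real dumps are '~'-separated, as described above.
# skiprows may be needed if a dump starts with blank lines, as in the R script.
dump = pd.read_csv("tablename.txt", sep="~")
dump.to_csv("tablename.csv", index=False)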
