I want to create a query loop that iterates from one instance to the next. After fetching the right data I will do some calculations and write the new information back into the database with specific relations.
Example setup
First loop = Aggregation Level (1, 2, 3)
Second loop = Product (A, B, C)
Time = (2001, 2002, 2003)
1. First loop
Aggregation level "1", Product "A", Fetch all years (2001, 2002, 2003)
Output = [
(2001,'10 pc','20 €')
(2002,'8 pc','18 €')
(2003,'82 pc','5000 €')]
2. Second loop
Aggregation level "1", Product "B", Fetch all years (2001, 2002, 2003)
Output = [
(2001,'15 pc','35 €')
(2002,'20 pc','100 €')
(2003,'25 pc','5000 €')]
I take each returned array (row) and do some calculations on it. For that I transform the data into a specific structure, and afterwards I get new figures.
At the end I have posted my code. Instead of the variables [X, Y] in the SELECT statement, should I use the index of the imported Excel files as the loop parameter, or is it better to search inside the database?
The information is already stored in the database.
The variables [X1, Y1] are my new values, which must be written back into the database with specific relations. How should I do that to get the right relationship? A primary key, or is there a better way?
Wishes
123GuteLaune
import pyodbc as py
import openpyxl as ol
# Connection string to MS Access (raw string so the backslashes in the paths are not treated as escapes)
cnxn = py.connect(r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};UID=admin;UserCommitSync=Yes#;Threads=3;SafeTransactions=0;PageTimeout=5;MaxScanRows=8;MaxBufferSize=2048;FIL={MS Access};DriverId=25;DefaultDir=C:\xxx\Desktop;DBQ=C:\xxx\Desktop\DatabaseMA_Test.mdb")
# Path to the Excel workbook (raw string for the same reason)
wb = ol.load_workbook(r'C:\xxx\Desktop\Beisspieldatensatz.xlsx')
# Keep only the cell values from column A of each sheet
X = [row[0].value for row in wb.get_sheet_by_name("Hierarchie")['A2':'A72']]  # aggregation levels
Y = [row[0].value for row in wb.get_sheet_by_name("Mat_GR")['A2':'A51']]      # products (materials)
# Output of the loaded Excel data
print(X, Y)
# Load data out of MS Access
# (note: ORDER BY has to come after the WHERE clause)
cursor = cnxn.cursor()
cursor.execute(
    """
    SELECT Mat_ID, Monat, Jahr, Umsatzmenge, Umsatz
    FROM [TW-DS]
    WHERE Hier_ID = ?
      AND Mat_ID = ?
    ORDER BY Jahr ASC
    """, [X, Y])  # X and Y are whole lists here - see the question above about the loop parameters
rows = cursor.fetchall()
print(rows)
# Mathematical model (unfilled) - produces the new values X1 and Y1
# Write the data back into MS Access using parameter placeholders
cursor.execute(
    """INSERT INTO [TW-DS]
       (Outlier,
        [Outlier Value])
       VALUES
       (?,
        ?)
    """, [X1, Y1])
cnxn.commit()
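For reference, here is a minimal sketch of the nested loop described above, using parameterized queries and writing the results back via the Hier_ID/Mat_ID/Jahr combination of each row; calculate() is a hypothetical placeholder for the unfilled mathematical model, not part of the original code:
# Sketch only: loop over the IDs read from Excel, fetch the matching rows,
# and write the calculated values back keyed on Hier_ID, Mat_ID and Jahr.
for hier_id in X:       # aggregation levels from the "Hierarchie" sheet
    for mat_id in Y:    # products from the "Mat_GR" sheet
        cursor.execute(
            """SELECT Mat_ID, Monat, Jahr, Umsatzmenge, Umsatz
               FROM [TW-DS]
               WHERE Hier_ID = ? AND Mat_ID = ?
               ORDER BY Jahr ASC""",
            [hier_id, mat_id])
        for row in cursor.fetchall():
            x1, y1 = calculate(row)  # hypothetical: returns Outlier and Outlier Value
            cursor.execute(
                """UPDATE [TW-DS]
                   SET Outlier = ?, [Outlier Value] = ?
                   WHERE Hier_ID = ? AND Mat_ID = ? AND Jahr = ?""",
                [x1, y1, hier_id, mat_id, row.Jahr])
cnxn.commit()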
Related
Currently my code has simple tables containing the data needed for each object, like this:
infantry = {class = "army", type = "human", power = 2}
cavalry = {class = "panzer", type = "motorized", power = 12}
battleship = {class = "navy", type = "motorized", power = 256}
I use the table names as identifiers in various functions so that their values can be processed one by one; a function is simply called and has access to the values.
Now I want to have this data stored in a spreadsheet (CSV file) instead, which looks something like this:
Name class type power
Infantry army human 2
Cavalry panzer motorized 12
Battleship navy motorized 256
The spreadsheet will not have more than 50 lines, and I want to be able to add more columns in the future.
I tried a couple of approaches from similar situations I found here, but due to my lacking skills I failed to access any values from the nested table. I think this is because I don't fully understand what the table structure looks like after reading each line of the CSV file into the table, and therefore I fail to print any values at all.
If there is a way to get the name, class, type and power from the table and use that line just like my old simple tables, I would appreciate an educational example. Another approach could be to declare new tables from the CSV, line by line, that behave exactly like my old simple tables. I don't know if this is doable.
Using Lua 5.1
You can read the CSV file in as a string. I will use a multi-line string here to represent the CSV.
gmatch with the pattern [^\n]+ will return each row of the CSV.
gmatch with the pattern [^,]+ will return the value of each column from a given row.
If more rows or columns are added, or if the columns are moved around, we will still reliably convert the information as long as the first row has the header information.
The only column that cannot move is the first one, the Name column; if that is moved, it will change the key used to store the row in the table.
Using gmatch and 2 patterns, [^,]+ and [^\n]+, you can separate the string into each row and column of the csv. Comments in the following code:
local csv = [[
Name,class,type,power
Infantry,army,human,2
Cavalry,panzer,motorized,12
Battleship,navy,motorized,256
]]
local items = {}   -- Store our values here
local headers = {} -- Store the header names here
local first = true
for line in csv:gmatch("[^\n]+") do
  if first then -- this handles the first line and captures our headers
    local count = 1
    for header in line:gmatch("[^,]+") do
      headers[count] = header
      count = count + 1
    end
    first = false -- switch off the header block
  else
    local name
    local i = 2 -- start at 2 because we do not increment i for the name field
    for field in line:gmatch("[^,]+") do
      name = name or field -- remember the name of this row
      if items[name] then -- if the name is already in the items table then this is a field
        items[name][headers[i]] = field -- assign the value to the header key in this row's table
        i = i + 1
      else -- if the name is not in the table we create a new entry for it
        items[name] = {}
      end
    end
  end
end
Here is how you can load a csv using the I/O library:
-- Example of how to load the csv.
path = "some\\path\\to\\file.csv"
local f = assert(io.open(path))
local csv = f:read("*all")
f:close()
Alternatively, you can use io.lines(path), which would take the place of csv:gmatch("[^\n]+") in the for loop above as well.
Here is an example of using the resulting table:
-- print table out
print("items = {")
for name, item in pairs(items) do
  print("  " .. name .. " = {")
  for field, value in pairs(item) do
    print("    " .. field .. " = " .. value .. ",")
  end
  print("  },")
end
print("}")
The output:
items = {
  Infantry = {
    type = human,
    class = army,
    power = 2,
  },
  Battleship = {
    type = motorized,
    class = navy,
    power = 256,
  },
  Cavalry = {
    type = motorized,
    class = panzer,
    power = 12,
  },
}
I am trying to migrate a very large number of rows and columns dynamically from Oracle to SQL Server, and I am trying to automate this for about 20 tables daily.
My process is as follows:
1. Run Oracle configuration script: runs a query which returns the Oracle table data, but the data has duplicates and bad data.
2. Oracle format script: calls the config script and then fixes the data to be accurate (deletes duplicates and bad data). Writes the data to a .txt file with '~' separators.
3. .txt to .csv (this is my workaround/hack): using Excel, I open all of the .txt files and use AutoFormat to change them into perfectly formatted .csv's. (I use the manual delimiter option, because when I use the auto-delimiter '~' it comes up with random one-letter columns.)
4. R: using the newly generated .csv's, I create the table structures based on their headers in SQL Server.
5. R: using the data collected from running the format script for each table, I have ~20 .txt files. I load them into RStudio and then attempt to insert the new data using the following script:
Initialization:
RScript which loads the table structure of one .CSV
library(RODBC)
#set table name
tname <- "tablename"
#connect to MSSQL
conn <- odbcDriverConnect("driver={SQL Server};
Server=serv; Database=dbname;
Uid=username; Pwd=pwd;trusted_connection=yes")
#get df headers from .csv
updatedtables <- read.csv(file = paste("<PATH>", tname, ".csv", sep=""), sep = ",")
#save to MSSQL
save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = F, rownames = F, verbose = T, safer = T, fast = F)
#add update record in log file
write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
odbcClose(conn)
return(0)
I run this code for each .csv to initialize the table structures
Dynamic things:
RScript which attempts to load data from .txt to existing table structures:
library(RODBC)
##~~~~~~~~~~~~~~ Oracle to MSSQL Automation Script ~~~~~~~~~~
#.bat script first runs which updates .txt dumps containing Oracle database info using configuration and format .sql queries for each table.
#function that writes to a table tname using data from tname.txt
writeTable <- function (tname){
  # connect to MSSQL
  conn <- odbcDriverConnect("driver={SQL Server};
                            Server=serv; Database=db;
                            Uid=uid; Pwd=pw;trusted_connection=yes")
  # remove all data entries for the table
  res <- sqlQuery(conn, paste("TRUNCATE TABLE dbo.", tname, sep = ""))
  # load updated data from the .txt files
  # skip = 3 because of blank lines and the header row
  updatedtables <- read.csv(file = paste("<PATH>", tname, ".txt", sep=""), skip = 3, sep = "~")
  # **ERROR** occurs at the following sqlSave call
  # save to MSSQL
  save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F)
  # add an update record to the log file
  write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
  # close the connection
  odbcClose(conn)
  return(0)
}
#gets all file names and dynamically inputs the data for every file in the path directory
update <- function(){
  # insert a line separator in the log which represents a new script run
  write("-UPDATE--------------------", file = "<PATH>", append = T)
  path = "<PATH>"
  # save all .txt file names in path to a vector
  file.names <- dir(path, pattern = ".txt")
  # go through the path's files
  for(i in 1:length(file.names)){
    # get the proper table name of file.names[i]
    temp <- gsub(".txt", "", file.names[i])
    # call the helper function
    writeTable(temp)
  }
}
#run
update()
This script produces the error. Everything works and loads, except the line save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F), which returns the error:
Error in `colnames<-`(`*tmp*`, value = c("QUOTENUMBER", "ENTITYPRODUCTITEMID", :
length of 'dimnames' [2] not equal to array extent
Any suggestions on what to do?
This is an example of the way the text dumps look (without spaces and with dummy data)
http://pastebin.com/kF8yR1R0
I have a JDBC connection from Apache Spark to PostgreSQL, and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need is unique numbers you can use zipWithUniqueId and recreate DataFrame. First some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType
row = Row("foo", "bar")
row_with_index = Row(*["id"] + df.columns)
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

df_with_pk = (df.rdd
    .zipWithUniqueId()
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive numbers you can replace zipWithUniqueId with zipWithIndex, but it is a little bit more expensive.
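For example, in the Python snippet above the only change needed is the zip call (same helpers and imports as before):
df_with_pk = (df.rdd
    .zipWithIndex()  # consecutive ids instead of merely unique ones
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))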
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I had missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
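If the data does have a natural partition column, a partitioned window avoids moving everything to one partition. A minimal sketch, assuming a column named "foo" can serve as the partition key and using row_number (the later name of rowNumber); note that the numbers then restart at 1 within each partition:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

w = Window.partitionBy("foo").orderBy("bar")
df.withColumn("id", row_number().over(w)).show()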
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the second argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desiring consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
from pyspark.sql.types import StructType, StructField, IntegerType

# read the initial dataframe without an index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer.
# First create a new schema with a uuid field appended.
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)
# zip with the index, map each (row, id) pair to a dictionary which includes the new field
# (indexing the pair explicitly keeps this valid in Python 3, which has no tuple-unpacking lambdas)
df = dfNoIndex.rdd.zipWithIndex()\
    .map(lambda pair: {k: v
                       for k, v
                       in list(pair[0].asDict().items()) + [("uuid", pair[1])]})\
    .toDF(newSchema)
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf
unique_id_sub_cols = ["a", "b", "c"]
df = df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
                    sf.col(sub_col)
                )
                for sub_col in unique_id_sub_cols
            ]
        )
    ),
)
I have a SQL column named "details" and it contains the following data:
<changes><RoundID><new>8394</new></RoundID><RoundLeg><new>JAYS CLOSE AL6 Odds(1 - 5)</new></RoundLeg><SortType><new>1</new></SortType><SortOrder><new>230</new></SortOrder><StartDate><new>01/01/2009</new></StartDate><EndDate><new>01/01/2021</new></EndDate><RoundLegTypeID><new>1</new></RoundLegTypeID></changes>
<changes><RoundID><new>8404</new></RoundID><RoundLeg><new>HOLLY AREA AL6 (1 - 9)</new></RoundLeg><SortType><new>1</new></SortType><SortOrder><new>730</new></SortOrder><StartDate><new>01/01/2009</new></StartDate><EndDate><new>01/01/2021</new></EndDate><RoundLegTypeID><new>1</new></RoundLegTypeID></changes>
<changes><RoundID><new>8379</new></RoundID><RoundLeg><new>PRI PARK AL6 (1 - 42)</new></RoundLeg><SortType><new>1</new></SortType><SortOrder><new>300</new></SortOrder><StartDate><new>01/01/2009</new></StartDate><EndDate><new>01/01/2021</new></EndDate><RoundLegTypeID><new>1</new></RoundLegTypeID></changes>
What is the easiest way to separate this data out into individual columns? (It is all in one column.)
Try this:
SELECT DATA.query('/changes/RoundID/new/text()') AS RoundID
,DATA.query('/changes/RoundLeg/new/text()') AS RoundLeg
,DATA.query('/changes/SortType/new/text()') AS SortType
-- And so on and so forth
FROM (SELECT CONVERT(XML, Details) AS DATA
FROM YourTable) AS T
Once you get your result set from the SQL (MySQL or whatever) you will probably have an array of strings. As I understand your question, you want to know how to extract each of the XML nodes contained in the string stored in the column in question. You could loop through the results from the SQL query and extract the data that you want. In PHP it would look like this:
// Set a counter variable for the first dimension of the array, this will
// number the result sets. So for each row in the table you will have a
// number identifier in the corresponding array.
$i = 0;
$output = array();
foreach ($results as $result) {
    $xml = simplexml_load_string($result);
    // Here use SimpleXML to extract the node data, just by using the names of the
    // XML nodes, and give each one the same name in the array's second dimension.
    $output[$i]['RoundID'] = $xml->RoundID->new;
    $output[$i]['RoundLeg'] = $xml->RoundLeg->new;
    // Simply create more array items here for each of the elements you want
    $i++;
}

foreach ($output as $out) {
    // Step through the created array and do what you like with it.
    echo $out['RoundID']."\n";
    var_dump($out);
}
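If you prefer to post-process the fetched strings outside PHP, the same extraction can be sketched in Python with the standard library; the details string below is just the first sample row from the question:
import xml.etree.ElementTree as ET

details = ("<changes><RoundID><new>8394</new></RoundID>"
           "<RoundLeg><new>JAYS CLOSE AL6 Odds(1 - 5)</new></RoundLeg>"
           "<SortType><new>1</new></SortType></changes>")

root = ET.fromstring(details)
# findtext follows the path element/new and returns its text content
row = {
    "RoundID":  root.findtext("RoundID/new"),
    "RoundLeg": root.findtext("RoundLeg/new"),
    "SortType": root.findtext("SortType/new"),
}
print(row)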
With RODBC, there were functions like sqlUpdate(channel, dat, ...) that allowed you to pass dat = data.frame(...) instead of having to construct your own SQL string.
However, with R's DBI, all I see are functions like dbSendQuery(conn, statement, ...) which only take a string statement and gives no opportunity to specify a data.frame directly.
So how to UPDATE using a data.frame with DBI?
Really late with my answer, but maybe it is still helpful...
There is no single function (that I know of) in the DBI/odbc packages, but you can replicate the update behavior using a prepared update statement (which should work faster than RODBC's sqlUpdate since it sends the parameter values as a batch to the SQL server):
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(), driver="{SQL Server Native Client 11.0}", server="dbserver.domain.com\\default,1234", Trusted_Connection = "yes", database = "test") # assumes Microsoft SQL Server
dbWriteTable(con, "iris", iris, row.names = TRUE) # create and populate a table (adding the row names as a separate columns used as row ID)
update <- dbSendQuery(con, 'update iris set "Sepal.Length"=?, "Sepal.Width"=?, "Petal.Length"=?, "Petal.Width"=?, "Species"=? WHERE row_names=?')
# create a modified version of `iris`
iris2 <- iris
iris2$Sepal.Length <- 5
iris2$Petal.Width[2] <- 1
iris2$row_names <- rownames(iris) # use the row names as unique row ID
dbBind(update, iris2) # send the updated data
dbClearResult(update) # release the prepared statement
# now read the modified data - you will see the updates did work
data1 <- dbReadTable(con, "iris")
dbDisconnect(con)
This works only if you have a primary key, which I created in the above example by using the row names (a unique number, increased by one for each row)...
For more information about the odbc package I have used in the DBI dbConnect statement see: https://github.com/rstats-db/odbc
Building on R Yoda's answer, I made myself the helper function below. This allows using a dataframe to specify update conditions.
While I built this to run transaction updates (i.e. single rows), it can in theory update multiple rows by passing a condition. However, that is not the same as updating multiple rows using an input dataframe. Maybe somebody else can build on this...
dbUpdateCustom = function(x, key_cols, con, schema_name, table_name) {

  if (nrow(x) != 1) stop("Input dataframe must be exactly 1 row")
  if (!all(key_cols %in% colnames(x))) stop("All columns specified in 'key_cols' must be present in 'x'")

  # Build the update string --------------------------------------------------
  df_key <- dplyr::select(x, one_of(key_cols))
  df_upt <- dplyr::select(x, -one_of(key_cols))

  set_str <- purrr::map_chr(colnames(df_upt), ~glue::glue_sql('{`.x`} = {x[[.x]]}', .con = con))
  set_str <- paste(set_str, collapse = ", ")

  where_str <- purrr::map_chr(colnames(df_key), ~glue::glue_sql("{`.x`} = {x[[.x]]}", .con = con))
  where_str <- paste(where_str, collapse = " AND ")

  update_str <- glue::glue('UPDATE {schema_name}.{table_name} SET {set_str} WHERE {where_str}')

  # Execute ------------------------------------------------------------------
  query_res <- DBI::dbSendQuery(con, update_str)
  DBI::dbClearResult(query_res)

  return(invisible(TRUE))
}
Where
x: 1-row dataframe that contains 1+ key columns, and 1+ update columns.
key_cols: character vector, of 1 or more column names that are the keys (i.e. used in the WHERE clause)
Here is a little helper function I put together using REPLACE INTO to update a table using DBI, replacing old duplicate entries with the new values. It's basic and for my own needs, but should be easy to modify. All you need to pass to the function is the connection, table name, and dataframe. Note that the table must have a PRIMARY KEY column. I've also included a simple working example.
row_to_list <- function(Y) suppressWarnings(split(Y, f = row(Y)))
sql_val <- function(y){
  if(!is.numeric(y)){
    return(paste0("'", y, "'"))
  }else{
    if(is.na(y)){
      return("NULL")
    }else{
      return(as.character(y))
    }
  }
}

to_sql_row <- function(x) paste0("(", paste(do.call("c", lapply(x, FUN = sql_val)), collapse = ", "), ")")

bracket <- function(x) paste0("`", x, "`")

to_sql_string <- function(x) paste0("(", paste(sapply(x, FUN = bracket), collapse = ", "), ")")

replace_into_table <- function(con, table_name, new_data){
  #new_data <- data.table(new_data)
  cols <- to_sql_string(names(new_data))
  vals <- paste(lapply(row_to_list(new_data), FUN = to_sql_row), collapse = ", ")
  query <- paste("REPLACE INTO", table_name, cols, "VALUES", vals)
  rs <- dbExecute(con, query)
  return(rs)
}
tb <- data.frame("id" = letters[1:20], "A" = 1:20, "B" = seq(.1,2,.1)) # sample data
dbWriteTable(con, "test_table", tb) # create table
dbExecute(con, "ALTER TABLE test_table ADD PRIMARY KEY (id)") # set primary key
new_data <- data.frame("id" = letters[19:23], "A" = 1:5, "B" = seq(101,105)) # new data
new_data[4,2] <- NA # add some NA values
new_data[5,3] <- NA
table_name <- "test_table"
replace_into_table(con, "test_table", new_data)
result <- dbReadTable(con, "test_table")