Issue with merging (or union) multiple "copy column" transformations - sql-server

I have a legacy database that I am doing some ETL work on. I have columns in the old table that are conditionally mapped to columns in my new table. The conditions are based on an associated column (a column in the same table that represents the shape of an object, we can call that column SHAPE). For example:
Column dB4D is mapped to column:
B4 if SHAPE=5
B3 if SHAPE=1
X if SHAPE=10
or else Y
I am using a Conditional Split to split the table based on SHAPE, then I am using 10-15 "Copy Column" transformations to take the old column (dB4D) and map it to the new column (B4, B3, X, etc.).
Some of these columns "overlap". For example, I have multiple legacy columns (dB4D, dB3D, dB2D, dB1D, dC1D, dC2D, etc) and multiple new columns (A, B, C, D, etc). In one of the "copy columns" (which are broken up by SHAPE) I could have something like:
If SHAPE=10
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | B |
+--------------+--------------+
If SHAPE=5
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | C |
+--------------+--------------+
I now need to bring these all together into one final staging table (or "destination"). No two rows will have the same size, so there is no conflict. But I need to map dB4D (and other columns) to different new columns based on a value in another column. I have tried to merge them, but you can't merge multiple data sources. I have tried to join them, but not all columns (or output aliases) would show up in the destination. Can anyone recommend how to resolve this issue?
Here is the current design that may help:

As inputs to your data flow, you have a set of columns dB4D, dB3D, dB2D, etc.
Your destination will only have column names that do not exist in your source data.
Based on the Shape column, you'll project the dB columns into different mappings for your target table.
If the Conditional Split logic makes sense as you have it, don't try to Union All it back together. Instead, just wire up 8 OLE DB Destinations. You'll probably have to change them from the "fast load" option to the table name option. This means it will perform singleton inserts, so hopefully the data volumes won't be an issue. If they are, then create 8 staging tables that do use the "fast load" option, and have a successor task to your Data Flow perform set-based inserts into the final table.
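For that successor step, the set-based inserts could be a plain Execute SQL Task along these lines. This is only a sketch: the staging table names, the FinalTable name, and the RowKey column are hypothetical placeholders, and your real column lists will be wider.

-- one INSERT per SHAPE branch / staging table (illustrative names)
INSERT INTO dbo.FinalTable (RowKey, B)
SELECT RowKey, dB4D
FROM dbo.Staging_Shape10;   -- SHAPE = 10: dB4D lands in B

INSERT INTO dbo.FinalTable (RowKey, C)
SELECT RowKey, dB4D
FROM dbo.Staging_Shape5;    -- SHAPE = 5: dB4D lands in C

-- ...repeat for the remaining staging tables / SHAPE values.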
The challenge you'll run into with the Union All component is that if you make any changes to the source, the Union All rarely picks up on the change (the column changed from varchar to int, sorry!).

Related

Condensing a bunch of columns into one array column mapped to a key

I'm doing a project that analyzes covid data and I'm trying to clean the data of null values and the like, but first need to make it usable. It currently has an individual column for every date and the amount of new cases that day. The Combined_Key column is unique so that was what I was going to try to map the dates and cases to.
Also, every column is of type String, so I imagine I'll need to insert the data into a dataframe that's set up with the correct types, but I don't know how to do that without typing 450 date columns separately. Even more exciting, there isn't an inherent date type in spark/scala, so I'm not sure how to handle that either.
UID,iso2,iso3,code3,FIPS,Admin2,Province_State,Country_Region,Lat,Long_,Combined_Key,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,1/31/20,2/1/20,2/2/20,2/3/20,2/4/20,2/5/20,2/6/20,2/7/20,2/8/20,2/9/20,2/10/20,2/11/20,2/12/20,2/13/20,2/14/20,2/15/20,2/16/20,2/17/20,2/18/20,2/19/20,2/20/20,2/21/20,2/22/20,2/23/20,2/24/20,2/25/20
84001001,US,USA,840,1001.0,Autauga,Alabama,US,32.53952745,-86.64408227,"Autauga, Alabama, US",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,6,6,6,6,8,8,10,12,12,12,12,12,12,12,17,18,19,19,19,23,24,24,24,25,26,28,30,32,33,36,36,36,37,39,41,42,43,47,51,54,54,56,58,62,63,72,81
That's part of the top 2 rows of the data; a whole lot of date columns have been left out. I'm working in the spark shell. I've tried something like the following after turning the data into a table, but I get either "error: 5 more arguments than can be applied to method ->: (y: B)(String, B)" or "error: type mismatch;" respectively.
var covidMap = scala.collection.mutable.Map[String, ArrayBuffer[Int]]()
table.foreach{x => covidMap += (x(10)).toString -> (x(11),x(20),x(30),x(40),x(50),x(60))}
table.foreach{x => covidMap += (x(10)).toString -> (x(11))}
Honestly I don't know if these are even close to what I need to be doing, I've been coding for 5 weeks in a training program and it's incredibly difficult for me thus far, so, I'm here. Any help is appreciated!
Starting with an example DataFrame (taking your first two example date columns and adding today's date to show it'll work in the future):
val df = List(
  (84001001,"US","USA",840,1001.0,"Autauga","Alabama","US",32.53952745,-86.64408227,"Autauga, Alabama, US",0,0,50)
)
  .toDF("UID","iso2","iso3","code3","FIPS","Admin2","Province_State","Country_Region","Lat","Long_","Combined_Key","1/22/20","1/23/20","4/2/22")

// keep the DataFrame in df and call show() separately,
// otherwise df would just be the Unit returned by show()
df.show()
gives:
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
| UID|iso2|iso3|code3| FIPS| Admin2|Province_State|Country_Region| Lat| Long_| Combined_Key|1/22/20|1/23/20|4/2/22|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
|84001001| US| USA| 840|1001.0|Autauga| Alabama| US|32.53952745|-86.64408227|Autauga, Alabama, US| 0| 0| 50|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+-------+-------+------+
We can then create a new column, which I've called dates, but you can easily rename it. Here the array function is used to combine all of the values of the date columns into a new column, which is an array:
import org.apache.spark.sql.functions.{array, col}
val dateRegex = "\\d+/\\d+/\\d+" // matches all columns in x/y/z format
val dateColumns = df.columns.filter(_.matches(dateRegex))
df
// select all date columns and combine into a new column: `dates`
.withColumn("dates", array(dateColumns.map(df(_)): _*))
// drop the original date columns, keeping `dates`
.drop(dateColumns: _*)
.show(false)
gives:
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
|UID |iso2|iso3|code3|FIPS |Admin2 |Province_State|Country_Region|Lat |Long_ |Combined_Key |dates |
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
|84001001|US |USA |840 |1001.0|Autauga|Alabama |US |32.53952745|-86.64408227|Autauga, Alabama, US|[0, 0, 50]|
+--------+----+----+-----+------+-------+--------------+--------------+-----------+------------+--------------------+----------+
A downside to this is that the output DataFrame doesn't retain the original date column names, so you can no longer tell which date each array element corresponds to.
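If you do need to keep track of which date each value came from, one variation (a sketch only, assuming Spark 2.4+ for map_from_arrays) is to build a map column of date -> count instead of a plain array:

import org.apache.spark.sql.functions.{array, lit, map_from_arrays}

// keys are the original date column names, values are the counts in those columns
val withDates = df
  .withColumn("dates", map_from_arrays(
    array(dateColumns.map(lit(_)): _*),
    array(dateColumns.map(df(_)): _*)
  ))
  .drop(dateColumns: _*)

withDates.show(false)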

Left-Joining multiple (40) files into a single table

As a total newcomer to database management, I am currently running PostgreSQL 9.3 through pgAdmin. My goal is to condense 40 files into one table, where my setup is as follows:
A table that contains a standalone Master Key Column with ~400k unique integer observations.
|Master Key|
20 files, three columns each. First column contains an integer key that is guaranteed to match an observation on the "master" column. Second and third columns contain integer values.
|Master Key-like Value| IntValue1 | IntValue 2|
20 files with multiple columns containing text details, where first column contains an integer key in the same fashion.
|Master Key-Like Value| Multiple Data |
I am currently thinking about importing all of the files into a corresponding table each and left joining them, where the final output would be:
Master Key | File 1 IntValue 1 | File 1 IntValue 2 | File 2 IntValue 1 ... | File 20 IntValue 2 | Multiple Data |
Placing null values if no corresponding value is found. (This is a very possible scenario, where the int values are organized in an implicit date-like fashion for every file in the sequence)
Will a left join get me such output? Is there a more efficient way to build such a table?
Importing each file in a separate table and doing a big join is a good approach.
Database engines are optimized for just this kind of computation.
You could achieve something similar with the unix join command, but it can only process two files at a time and would likely take more time.
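For illustration, the join would look roughly like this. The table and column names here (master, file01, file02, master_key, file_key, intvalue1, intvalue2) are hypothetical; substitute whatever you name your imported tables.

SELECT m.master_key,
       f01.intvalue1 AS file01_intvalue1,
       f01.intvalue2 AS file01_intvalue2,
       f02.intvalue1 AS file02_intvalue1,
       f02.intvalue2 AS file02_intvalue2
       -- ...and so on for the remaining files and the text-detail tables
FROM master m
LEFT JOIN file01 f01 ON f01.file_key = m.master_key
LEFT JOIN file02 f02 ON f02.file_key = m.master_key;
-- rows with no match in a given file simply get NULLs for that file's columns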

BIRT report: How to hide the "in" and "out" values of a table in a BIRT report

I have the following employee table values:
name | cost
john | 1000
john | -1000
john | 5000
When we add up the cost column, the total is 5000.
I need to print only the 3rd row in the BIRT report, since the 1st and 2nd rows cancel each other out.
I'm stuck at filtering the table for the above scenario.
Couldn't you just solve this using SQL like
select name, sum(cost)
from employee
group by name
order by name
?
Or do you just want to exclude two rows if they have exactly the same cost, but with different signs? Note that this is actually something different; take, for example, the three rows [ john|1, john|2, john|-3 ]. In this case, a pure SQL solution can be achieved using the SQL analytic functions (at least if you are using Oracle).
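As a sketch of that pairing idea (the employee table name is assumed from the question; the syntax works in Oracle and most databases with window functions): rows with the same name and absolute cost are numbered per sign, and a row is hidden when an opposite-signed partner with the same number exists.

WITH numbered AS (
  SELECT name, cost,
         ROW_NUMBER() OVER (PARTITION BY name, ABS(cost), SIGN(cost)
                            ORDER BY cost) AS pair_no
  FROM employee
)
SELECT p.name, p.cost
FROM numbered p
WHERE NOT EXISTS (
  SELECT 1
  FROM numbered n
  WHERE n.name    = p.name
    AND n.cost    = -p.cost      -- the opposite-signed partner
    AND n.pair_no = p.pair_no
);
-- note: zero-cost rows would need extra handling, since 0 = -0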
Elaborate your question. It's not clear if these are columns or rows.
If These are columns:
Create a computed column in your dataset
In Expression builder of that column add/sub values using dataSetRow['col1'] and dataSetRow['col2']
Add only that computed column to your table.
If these are rows
Select rows you don't want to print
Go to properties tab
Find Visibility property and click on it
Check Hide Element option

How to merge two Excel sheets

I have an Excel document with 10000 rows of data in two sheets; the thing is, one of these sheets has the product costs, and the other has category and other information. These two are imported automatically from SQL Server, so I don't want to move them to Access, but I still want to link the product codes so that when I merge the product tables (product name and cost on the same table), I can be sure that I'm getting the right information.
For example:
Code | name | category
------------------------------
1 | mouse | OEM
4 | keyboard | OEM
2 | monitor | screen
Code | cost |
------------------------------
1 | 123 |
4 | 1234 |
2 | 1232 |
7 | 587 |
Let's say my two sheets have tables like these. As you can see, the second one has a code that doesn't exist in the first; I put it there because in reality one sheet has a few more entries, preventing a perfect match. Therefore I couldn't just sort both tables A-Z and copy the costs over that way: as I said, there are more than 10000 products in that database, and I wouldn't want to risk a slight shift of costs (from those extra entries on the other table) that would ruin the whole table.
So what would be a good solution to get the entry from the other sheet and insert it into the right row when merging? Linking the two tables by a field name? Checking a field and trying to match it against the other sheet? Anything at all.
Note: When I use Access I would make relationships, and when I ran a query it would match them automatically... I was wondering if there's a way to do that in Excel too.
Why not use a VLOOKUP? If there is a match, it will list the cost. Assuming the top is Sheet1 and the other is Sheet2, and they both start in cell A1, you just need this in cell D2:
=VLOOKUP(A2,Sheet2!A:B,2,0)
You can then drag it down. The easiest way to fill all your 10000 rows is to hover over the bottom right corner of the cell with your cursor. It will turn from a white plus sign into a thin black one. Then simply double click.
Just use VLOOKUP - you can add a column to your first sheet, and find the cost based on the code in the other sheet.

How to represent a 2-D data matrix in a database

I have a data set which consists of an ID and a matrix (n x n) of data related to that ID.
Both the column names (A,B,C,D) and the row names (1,2,3) are also important and need to be held for each individual ID, as well as the data (a1,b1,c1,d1,...).
for example:
ID | A | B | C | D |
1 | a1 | b1 | c1 | d1 |
2 | ... | ... | ... | ... |
3 | ... | ... | ... | ... |
I am trying to determine the best way of modelling this data set in a database; however, it seems like something that is difficult given the flat nature of an RDBMS.
Am I better off holding the ID and an XML blob representing the data matrix, or am I overlooking a simpler solution here?
Thanks.
RDBMSes aren't flat. The R part sees to that. What you need is:
Table Entity
------------
ID
Table EntityData
----------------
EntityID
MatrixRow (1, 2, 3...)
MatrixColumn (A, B, C, D...)
Value
Entity:EntityData is a one-to-many relationship; each cell in the matrix has an EntityData row.
Now you have a schema that can be analyzed at the SQL level, instead of just being a data dump where you have to pull and extract everything at the application level in order to find out anything about it.
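A minimal DDL sketch of that schema (the column types are assumptions; adjust them to your data):

CREATE TABLE Entity (
    ID int PRIMARY KEY
);

CREATE TABLE EntityData (
    EntityID     int         NOT NULL REFERENCES Entity (ID),
    MatrixRow    int         NOT NULL,   -- 1, 2, 3...
    MatrixColumn varchar(10) NOT NULL,   -- 'A', 'B', 'C', 'D'...
    Value        varchar(100),
    PRIMARY KEY (EntityID, MatrixRow, MatrixColumn)
);

-- e.g. the whole matrix for one entity:
-- SELECT MatrixRow, MatrixColumn, Value FROM EntityData WHERE EntityID = 1;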
This is one of the reasons why PostgreSQL supports arrays as a data type. See
http://www.postgresql.org/docs/8.4/static/functions-array.html
and
http://www.postgresql.org/docs/8.4/static/arrays.html
where it shows you can use syntax like ARRAY[[1,2,3],[4,5,6],[7,8,9]] to define the values of a 3x3 matrix, or val integer[3][3] to declare a column type to be a 3x3 matrix.
Of course this is not at all standard SQL and is PostgreSQL specific. Other databases may have similar-but-slightly-different implementations.
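As a PostgreSQL-only sketch of what that looks like (the table and column names here are illustrative):

CREATE TABLE entity_matrix (
    id  integer PRIMARY KEY,
    val integer[3][3]          -- the 3x3 matrix for this id
);

INSERT INTO entity_matrix (id, val)
VALUES (1, ARRAY[[1,2,3],[4,5,6],[7,8,9]]);

-- individual cells are addressed by subscript, e.g. row 2, column 3:
SELECT val[2][3] FROM entity_matrix WHERE id = 1;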
If you want a truly relational solution:
Matrix
------
id
Matrix_Cell
-----------
matrix_id
row
col
value
But constraints to make sure you had valid data would be hideous.
I would consider a matrix as a single value as far as the DB is concerned and store it as
csv:
Matrix
------
id
cols
data
Which is somewhat lighter than XML.
I'd probably implement it like this:
Table MatrixData
----------------
id
rowName
columnName
datapoint
If all you're looking for is storing the data, this structure will hold any size matrix and allow you to reconstitute any matrix from the ID. You will need some post-processing to present it in "matrix format", but that's what the front-end code is for.
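If you do want the database itself to hand back something matrix-shaped, a conditional-aggregation sketch (assuming the MatrixData table above, one id per matrix, and column names A-D) would be:

SELECT rowName,
       MAX(CASE WHEN columnName = 'A' THEN datapoint END) AS A,
       MAX(CASE WHEN columnName = 'B' THEN datapoint END) AS B,
       MAX(CASE WHEN columnName = 'C' THEN datapoint END) AS C,
       MAX(CASE WHEN columnName = 'D' THEN datapoint END) AS D
FROM MatrixData
WHERE id = 1              -- the matrix you want to reconstitute
GROUP BY rowName
ORDER BY rowName;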
Can the data be thought of as "row data"? If so, then maybe you could store each row as an Object (or XML blob) with data A, B, C, D and then, in your "representation", use something like a LinkedHashMap (assuming Java) to get the objects by an ID key.
Also, it seems that by its very basic nature, a typical database table already does what you need, doesn't it?
Or, even better, you can create a logical array-like structure.
Say you want to store an m x n array:
Create m attributes in the table.
In each attribute, store n elements separated by delimiters.
When retrieving the data, simply do the reverse parsing to get the data back.
