Left-Joining multiple (40) files into a single table - database

As a total newcomer to database management, I am currently running PostgreSQL 9.3 through pgAdmin. My goal is to condense 40 files into one table. My setup is as follows:
A table that contains a standalone Master Key Column with ~400k unique integer observations.
|Master Key|
20 files, three columns each. The first column contains an integer key that is guaranteed to match an observation in the master column; the second and third columns contain integer values.
|Master Key-like Value| IntValue1 | IntValue2 |
20 files with multiple columns containing text details, where the first column contains an integer key in the same fashion.
|Master Key-like Value| Multiple Data |
I am currently thinking of importing each file into its own table and left-joining them all, so that the final output would be:
Master Key | File 1 IntValue1 | File 1 IntValue2 | File 2 IntValue1 ... | File 20 IntValue2 | Multiple Data |
Null values would be placed where no corresponding value is found. (This is a very likely scenario, since the integer values are organized in an implicit date-like fashion for each file in the sequence.)
Will a left join get me such output? Is there a more efficient way to build such a table?

Importing each file into a separate table and doing a big join is a good approach.
Database engines are optimized for just this kind of computation.
You could achieve something similar with the unix join command, but it can only process two files at a time and would likely take more time.
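For concreteness, the final query could look roughly like the sketch below. All table and column names (master, file01, file02, intvalue1, ...) are placeholders for whatever names the imported tables end up with; the pattern is simply one LEFT JOIN per file, all keyed on the master column, so every one of the ~400k master keys is kept and missing values come back as NULL.

SELECT m.master_key,
       f01.intvalue1 AS file01_intvalue1,
       f01.intvalue2 AS file01_intvalue2,
       f02.intvalue1 AS file02_intvalue1,
       f02.intvalue2 AS file02_intvalue2,
       -- ... repeat for the remaining integer-value files ...
       t01.details   AS file21_details
       -- ... repeat for the remaining text-detail files ...
FROM   master m
LEFT JOIN file01 f01 ON f01.master_key = m.master_key
LEFT JOIN file02 f02 ON f02.master_key = m.master_key
-- ... one LEFT JOIN per imported file ...
LEFT JOIN file21 t01 ON t01.master_key = m.master_key;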

Related

Creating a Postgres sequence for each foreign key as a default parameter?

I am trying to build a journal that keeps track of accounts. It's append-only, and each account should have its own sequence. For example:
sequence_nbr | account_id
1 | act_1
2 | act_1
1 | act_2
1 | act_3
2 | act_2
3 | act_1
I'd like sequence_nbr to be a permanent column in my journal table, and I'd like it to be automatically incremented: when I do an insert, I shouldn't have to specify a value for it; Postgres should automatically compute the correct sequence number for me.
I have tried two different ways:
Creating a sequence, but I couldn't get it to depend on the value of account_id
Creating a function as in Postgres Dynamically Create Sequences, but I can't figure out how to pass the argument to the function to create a default on the column definition for the journal table.
Is there a way to accomplish what I want in Postgres?
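One common pattern for this (sketched below under assumptions, not taken from this thread) is a BEFORE INSERT trigger that fills in max(sequence_nbr) + 1 for the row's account. The journal, sequence_nbr and account_id names come from the question; everything else is illustrative. Note that two concurrent inserts for the same account can compute the same number, so the composite primary key is there to reject the loser, which then has to retry.

-- Hypothetical sketch: per-account sequence via a BEFORE INSERT trigger.
CREATE TABLE journal (
    account_id   text   NOT NULL,
    sequence_nbr bigint NOT NULL,
    entry        text,                        -- placeholder payload column
    PRIMARY KEY (account_id, sequence_nbr)    -- also guards against races
);

CREATE OR REPLACE FUNCTION journal_next_seq() RETURNS trigger AS $$
BEGIN
    SELECT coalesce(max(sequence_nbr), 0) + 1
      INTO NEW.sequence_nbr
      FROM journal
     WHERE account_id = NEW.account_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER journal_seq_trg
BEFORE INSERT ON journal
FOR EACH ROW EXECUTE PROCEDURE journal_next_seq();

-- Usage: sequence_nbr is filled in automatically.
INSERT INTO journal (account_id, entry) VALUES ('act_1', 'first entry');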

Read data from two CSV files using a Join and store in a table using SSIS

I am new to SSIS (I am at the learning stage but have been given a task). I have two CSV files. Both files have 3 columns. One file has FYON, Family, Number.
Another file has Family, Number, Description.
The Family and Number columns are the relating columns between the two files.
I want to read the values from those files and store the data in a SQL Server table with the columns below:
+----+------+--------+--------+-------------+
| ID | Fyon | Family | Number | Description |
+----+------+--------+--------+-------------+
| 1  | 50   | AP     | 01     | SV32        |
+----+------+--------+--------+-------------+
Also, I want to store the error in an error table if the data is null or duplicated.
I don't know how I can achieve this.
A simple Merge Join transformation in SSIS would do this for you. You need to read data from both files, sort them, then MERGE JOIN them (joining on just the Number field should be sufficient), and then push all the required columns from both sources to the SQL table (OLE DB Destination/SQL Server Destination).
Look at the second example here: http://www.learnmsbitutorials.net/ssis-merge-and-mergejoin-example.php
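If it helps to see the data logic outside the designer, the Merge Join step is roughly equivalent to the T-SQL below, assuming the two CSVs were staged into tables named stg_file1 (FYON, Family, Number) and stg_file2 (Family, Number, Description); all object names here are placeholders, and ID is assumed to be an identity column on the target.

-- Rough T-SQL equivalent of the SSIS flow (illustrative names only).
INSERT INTO dbo.TargetTable (Fyon, Family, Number, Description)
SELECT f1.FYON, f1.Family, f1.Number, f2.Description
FROM   stg_file1 AS f1
JOIN   stg_file2 AS f2
       ON  f1.Family = f2.Family
       AND f1.Number = f2.Number;

-- Rows with NULL keys (and, by a similar query, duplicates) could be
-- diverted to an error table in the same set-based way.
INSERT INTO dbo.ErrorTable (Fyon, Family, Number)
SELECT f1.FYON, f1.Family, f1.Number
FROM   stg_file1 AS f1
WHERE  f1.Family IS NULL OR f1.Number IS NULL;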
I have done it myself with the help of @Deepak's answer. The final flow is:

Issue with merging (or union) multiple "copy column" transformations

I have a legacy database that I am doing some ETL work on. I have columns in the old table that are conditionally mapped to columns in my new table. The conditions are based on an associated column (a column in the same table that represents the shape of an object; call that column SHAPE). For example:
Column dB4D is mapped to column:
B4 if SHAPE=5
B3 if SHAPE=1
X if SHAPE=10
or else Y
I am using a Conditional Split to split the table based on SHAPE, then I am using 10-15 "Copy Column" transformations to take the old column (dB4D) and map it to the new column (B4, B3, X, etc.).
Some of these columns "overlap". For example, I have multiple legacy columns (dB4D, dB3D, dB2D, dB1D, dC1D, dC2D, etc) and multiple new columns (A, B, C, D, etc). In one of the "copy columns" (which are broken up by SHAPE) I could have something like:
If SHAPE=10
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | B |
+--------------+--------------+
If SHAPE=5
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | C |
+--------------+--------------+
I now need to bring these all together into one final staging table (or "destination"). No two rows will have the same size, so there is no conflict. But I need to map dB4D (and other columns) to different new columns based on a value in another column. I have tried to merge them, but I can't merge multiple data sources. I have tried to join them, but not all columns (or output aliases) would show up in the destination. Can anyone recommend how to resolve this issue?
Here is the current design that may help:
As inputs to your data flow, you have a set of columns dB4D, dB3D, dB2D, etc.
Your destination will only have column names that do not exist in your source data.
Based on the Shape column, you'll project the dB columns into different mappings for your target table.
If the Conditional Split logic makes sense as you have it, don't try to Union All it back together. Instead, just wire up 8 OLE DB Destinations. You'll probably have to change them from the "fast load" option to the table name option. This means it will perform singleton inserts, so hopefully the data volumes won't be an issue. If they are, then create 8 staging tables that you do use the "fast load" option for, and then have a successor task to your Data Flow perform set-based inserts into the final table.
The challenge you'll run into with the Union All component is that if you make any changes to the source, the Union All rarely picks up on the change (the column changed from varchar to int, sorry!).
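If you do go the staging-table route, the successor task can be a plain Execute SQL Task performing one set-based insert per SHAPE, roughly like the sketch below (table and column names are placeholders based on the example above):

-- Illustrative set-based inserts after the Data Flow has fast-loaded the
-- per-SHAPE staging tables; all object names are placeholders.
INSERT INTO dbo.FinalTable (SourceKey, B)
SELECT SourceKey, dB4D FROM dbo.Staging_Shape10;   -- SHAPE = 10: dB4D -> B

INSERT INTO dbo.FinalTable (SourceKey, C)
SELECT SourceKey, dB4D FROM dbo.Staging_Shape5;    -- SHAPE = 5:  dB4D -> C

-- ... one INSERT ... SELECT per remaining staging table ...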

Retrieving data from 2 tables that have a 1 to many relationship - more efficient with 1 query or 2?

I need to selectively retrieve data from two tables that have a 1 to many relationship. A simplified example follows.
Table A is a list of events:
Id | TimeStamp | EventTypeId
--------------------------------
1 | 10:26... | 12
2 | 11:31... | 13
3 | 14:56... | 12
Table B is a list of properties for the events. Different event types have different numbers of properties. Some event types have no properties at all:
EventId | Property | Value
------------------------------
1 | 1 | dog
1 | 2 | cat
3 | 1 | mazda
3 | 2 | honda
3 | 3 | toyota
There are a number of conditions that I will apply when I retrieve the data, however they all revolve around table A. For instance, I may want only events on a certain day, or only events of a certain type.
I believe I have two options for retrieving the data:
Option 1
Perform two queries: first query table A (with a WHERE clause) and store the data somewhere, then query table B (joining on table A in order to use the same WHERE clause) and "fill in the blanks" in the data that I retrieved from table A.
This option requires SQL Server to perform 2 searches through table A, however the resulting 2 data sets contain no duplicate data.
Option 2
Perform a single query, joining table A to table B with a LEFT JOIN.
This option only requires one search of table A but the resulting data set will contain many duplicated values.
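Concretely, the two options look something like the sketches below (TableA/TableB and the filter value are placeholders following the simplified example above):

-- Option 1: two queries; the application stitches the result sets together.
SELECT Id, TimeStamp, EventTypeId
FROM   TableA
WHERE  EventTypeId = 12;

SELECT b.EventId, b.Property, b.Value
FROM   TableB b
JOIN   TableA a ON a.Id = b.EventId
WHERE  a.EventTypeId = 12;

-- Option 2: one query; each event's columns repeat once per property row,
-- and events with no properties still appear (with NULL Property/Value).
SELECT a.Id, a.TimeStamp, a.EventTypeId, b.Property, b.Value
FROM   TableA a
LEFT JOIN TableB b ON b.EventId = a.Id
WHERE  a.EventTypeId = 12;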
Conclusion
Is there a "correct" way to do this or do I need to try both ways and see which one is quicker?
For example, consider a join like this:
Select E.Id, E.Name from Employee E join Dept D on E.DeptId = D.Id
and a subquery something like this:
Select E.Id, E.Name from Employee E where E.DeptId in (Select Id from Dept)
When I consider performance, which of the two queries would be faster, and why?
I would expect the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR ...).
As with all things SQL, though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...), among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
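In SQL Server, a minimal version of that measurement looks like the sketch below (the DBCC calls clear cached plans and buffers, so only do this on a test server):

-- Measure both variants with IO/time statistics, clearing caches between runs.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DBCC FREEPROCCACHE;      -- drop cached plans
DBCC DROPCLEANBUFFERS;   -- drop clean pages from the buffer pool
SELECT E.Id, E.Name FROM Employee E JOIN Dept D ON E.DeptId = D.Id;

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
SELECT E.Id, E.Name FROM Employee E WHERE E.DeptId IN (SELECT Id FROM Dept);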

How to represent a 2-D data matrix in a database

I have a data set which consists of an ID and a matrix (n x n) of data related to that ID.
Both the column names (A, B, C, D) and the row names (1, 2, 3) are important and need to be held for each individual ID, as well as the data (a1, b1, c1, d1, ...).
For example:
ID | A | B | C | D |
1 | a1 | b1 | c1 | d1 |
2 | ... | ... | ... | ... |
3 | ... | ... | ... | ... |
I am trying to determine the best way of modelling this data set in a database; however, it seems difficult given the flat nature of an RDBMS.
Am I better off holding the ID and an XML blob representing the data matrix, or am I overlooking a simpler solution here?
Thanks.
RDBMSes aren't flat. The R part sees to that. What you need is:
Table Entity
------------
ID
Table EntityData
----------------
EntityID
MatrixRow (1, 2, 3...)
MatrixColumn (A, B, C, D...)
Value
Entity:EntityData is a one-to-many relationship; each cell in the matrix has an EntityData row.
Now you have a schema that can be analyzed at the SQL level, instead of just being a data dump where you have to pull and extract everything at the application level in order to find out anything about it.
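A rough DDL sketch of that schema (the data types are assumptions; adjust Value to whatever the matrix cells actually hold):

CREATE TABLE Entity (
    ID integer PRIMARY KEY
);

CREATE TABLE EntityData (
    EntityID     integer NOT NULL REFERENCES Entity (ID),
    MatrixRow    integer NOT NULL,           -- 1, 2, 3, ...
    MatrixColumn char(1) NOT NULL,           -- 'A', 'B', 'C', 'D', ...
    Value        text,
    PRIMARY KEY (EntityID, MatrixRow, MatrixColumn)   -- one row per cell
);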
This is one of the reasons why PostgreSQL supports arrays as a data type. See
http://www.postgresql.org/docs/8.4/static/functions-array.html
and
http://www.postgresql.org/docs/8.4/static/arrays.html
These show that you can use syntax like ARRAY[[1,2,3],[4,5,6],[7,8,9]] to define the values of a 3x3 matrix, or val integer[3][3] to declare a column's type to be a 3x3 matrix.
Of course this is not at all standard SQL and is PostgreSQL specific. Other databases may have similar-but-slightly-different implementations.
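A small usage sketch (note that the declared dimensions in integer[3][3] are effectively documentation; PostgreSQL does not enforce them):

-- Illustrative table storing each ID's matrix as a single array value.
CREATE TABLE matrix_store (
    id  integer PRIMARY KEY,
    val integer[3][3]                 -- declared size is not enforced
);

INSERT INTO matrix_store (id, val)
VALUES (1, ARRAY[[1,2,3],[4,5,6],[7,8,9]]);

-- Cells are addressed by 1-based subscripts: row 2, column 3 returns 6.
SELECT val[2][3] FROM matrix_store WHERE id = 1;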
If you want a truly relational solution:
Matrix
------
id
Matrix_Cell
-----------
matrix_id
row
col
value
But constraints to make sure you had valid data would be hideous.
I would consider a matrix to be a single value as far as the DB is concerned and store it as CSV:
Matrix
------
id
cols
data
Which is somewhat lighter than XML.
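In PostgreSQL, for example, the stored CSV can be split back out with the built-in string_to_array function (a sketch using the layout above; regrouping the flat list into rows of width cols is then just index arithmetic):

CREATE TABLE Matrix (
    id   integer PRIMARY KEY,
    cols integer NOT NULL,            -- number of columns per matrix row
    data text    NOT NULL             -- e.g. 'a1,b1,c1,d1,a2,b2,...'
);

SELECT id, cols, string_to_array(data, ',') AS cells
FROM   Matrix
WHERE  id = 1;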
I'd probably implement it like this:
Table MatrixData
----------------
id
rowName
columnName
datapoint
If all you're looking for is storing the data, this structure will hold any size matrix and allow you to reconstitute any matrix from the ID. You will need some post-processing to present it in "matrix format", but that's what the front-end code is for.
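For a small, fixed set of column names, that post-processing can even be done in SQL with conditional aggregation; here is a sketch against the MatrixData layout above, assuming the columns are the A, B, C, D of the question:

-- Pivot MatrixData back into matrix shape for one ID.
SELECT rowName,
       MAX(CASE WHEN columnName = 'A' THEN datapoint END) AS "A",
       MAX(CASE WHEN columnName = 'B' THEN datapoint END) AS "B",
       MAX(CASE WHEN columnName = 'C' THEN datapoint END) AS "C",
       MAX(CASE WHEN columnName = 'D' THEN datapoint END) AS "D"
FROM   MatrixData
WHERE  id = 1
GROUP BY rowName
ORDER BY rowName;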
Can the data be thought of as "row data"? If so, then maybe you could store each row as an object (or XML blob) with data A, B, C, D and then, in your "representation", use something like a LinkedHashMap (assuming Java) to get the objects by an ID key.
Also, it seems that, by its very basic nature, a typical database table already does what you need, doesn't it?
Or, even better, what you can do is create a logical array-like structure.
Say you want to store an m x n array.
Create m attributes in the table.
In each attribute, store n elements separated by delimiters.
When retrieving the data, simply do the reverse parsing to easily get the data back.
