Importing CSV to database (duplicate entries) - database

My job requires me to look up information in a long spreadsheet that is updated and sent to me once or twice a week. Sometimes the newest spreadsheet leaves out information that was in the previous one, so I have to search through several spreadsheets to find what I need. I recently discovered that I could convert the spreadsheet to a CSV file and upload it to a database table. With a few lines of script, all I have to do is type in what I'm looking for and voilà! Now I've just received the newest spreadsheet, and I'm wondering whether I can simply import it on top of the old one. Each row has a unique number that I have set as the primary key in the database. If I import the new file on top of the current data, will it just skip the rows where the primary key would be duplicated, or will it mess up my database?
Thought I'd ask the experts before I tried it. Thanks for your input!
Details:
The spreadsheet lists our clients. Each row contains the client's name, a unique ID number, their address and contact info. I can set the column containing the unique ID as the primary key, then upload the file. My concern is that there is nothing to signify a new row in a CSV file (I think). When I upload it, it gives me the option to skip duplicates, but will it skip the entire row, or just that cell, causing my data to be placed in the wrong rows? It's an Apache server; I don't know which version of MySQL. I'm using 000webhost for this.

Higgs,
In database/ETL terminology, this issue is called a deduplication strategy.
There is no template answer for this, but I suggest these helpful readings:
Academic paper - Joint Deduplication of Multiple Record Types in Relational Data
Deduplication article
Some open source tools:
Duke tool
Data cleaner

When you click on Import, there's a little checkbox near the bottom that says 'ignore duplicates' or something like that. Simpler than I thought.
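For anyone wondering what "ignore duplicates" means at the row level, here is a minimal sketch. It uses Python's built-in SQLite rather than MySQL, and the table name, columns, and sample data are all made up for illustration. SQLite's `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`: a row whose primary key already exists is skipped as a whole, so cells never shift into other rows.

```python
import csv
import io
import sqlite3

# Hypothetical table and data; SQLite stands in for the asker's MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")


def import_csv(conn, csv_text):
    """Insert every CSV row; a row whose id already exists is skipped whole."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    rows = [(int(r[0]), r[1], r[2]) for r in reader]
    conn.executemany(
        "INSERT OR IGNORE INTO clients (id, name, phone) VALUES (?, ?, ?)", rows
    )
    conn.commit()


old_sheet = "id,name,phone\n1,Alice,555-0100\n2,Bob,555-0101\n"
new_sheet = "id,name,phone\n2,Bob,555-9999\n3,Carol,555-0102\n"

import_csv(conn, old_sheet)
import_csv(conn, new_sheet)  # id 2 is a duplicate, id 3 is new

rows = conn.execute("SELECT id, name, phone FROM clients ORDER BY id").fetchall()
for row in rows:
    print(row)
```

One thing to be aware of: ignoring duplicates keeps the old row untouched, so Bob's changed phone number in the newer sheet is not picked up. If the newer sheet should win, MySQL's `INSERT ... ON DUPLICATE KEY UPDATE` is the kind of statement to look at instead.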

Related

Performance issue, flat file vs database storage

I am working on a project that requires me to develop a system that provides JSON output. The flow is as follows:
1a) Some tables will be updated via the administration panel (my company side)
1b) Some related tables will be updated via the administration panel (partner side)
-- Let's say SuperHeroes & Males were updated in 1a), and Studios & Years were updated in 1b)
2) A client browses our site and requests an information set which:
Has an enabled and not deleted row in SuperHeroes (Ant-man)
Has an enabled and not deleted row in Males linked to SuperHeroes (Scott Lang)
Joins the above records, then checks whether they are linked to an existing row in Studios (Marvel)
Is linked to an existing row in Years (2015)
3) A very small amount of data will be output as a JSON string like the following: { id:1,type:marvel },{ id:1,type:dc }
Any row in the above 4 tables can be updated or deleted at any time without notification [and there are no foreign keys either]
I am thinking of updating the information in a flat file every time 1a is performed (we can change the system on my company's side but not on the partner's side, and the partner has refused to save extra information into a flat file, so we have no easy way of knowing whether the Studios or Years tables have been modified)
The JSON request would then first load the information from the flat file (all output data will be stored in this file), then use a simple SQL statement to check whether a linked record exists in Studios & Years
I have done my research and am getting confused. I concluded that a flat file works well when the amount of data is small, but that I should beware of the file growing larger and larger (the flat file we are talking about will never hold more than 50 rows at a time, and it should not be modified frequently)
Some answers said a database is good at querying data (I think so too, and the requirement will involve an SQL check anyway)
So I don't know whether a flat file is a good idea when my data is small but I still need some communication with the database.
I appreciate your time and help. All ideas & hints are welcome, thanks!
Your conclusion regarding the amount of data is absolutely correct, and a file should handle those 50 rows, but...
Using a database as storage should give you more options in the future, e.g.:
you'll be able to produce any output thanks to the separation of data and representation (today it's JSON, but what if you have to produce XML at some point? Will you add files that store XML next to the JSON files? And then CSV or any other format?)
transactions will guarantee ACID properties
scalability and performance - if your dataset gets bigger (you never know :)), many databases offer options such as table partitioning, partitioning-based clustering, or replication
Wrong choices regarding architecture and technology made at the beginning of the project will always backfire.
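To make the "separation of data and representation" point concrete, here is a rough sketch using Python's built-in SQLite. The column names (`studio_id`, `year_id`, `hero_id`, `enabled`, `deleted`) and the sample rows are assumptions, since the question doesn't show the real schema. One query does the filtering; JSON and CSV are just two renderings of the same result.

```python
import csv
import io
import json
import sqlite3

# Hypothetical schema: the real column names are not given in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE superheroes (id INTEGER, name TEXT, studio_id INTEGER,
                          year_id INTEGER, enabled INTEGER, deleted INTEGER);
CREATE TABLE males (id INTEGER, hero_id INTEGER, name TEXT,
                    enabled INTEGER, deleted INTEGER);
CREATE TABLE studios (id INTEGER, type TEXT);
CREATE TABLE years (id INTEGER, year INTEGER);
INSERT INTO superheroes VALUES (1, 'Ant-Man', 10, 20, 1, 0),
                               (2, 'Retired Hero', 10, 20, 1, 1);
INSERT INTO males VALUES (1, 1, 'Scott Lang', 1, 0), (2, 2, 'Nobody', 1, 0);
INSERT INTO studios VALUES (10, 'marvel');
INSERT INTO years VALUES (20, 2015);
""")

# Inner joins drop any hero whose studio or year row has disappeared,
# even though no foreign keys are declared.
QUERY = """
SELECT h.id AS id, s.type AS type
FROM superheroes h
JOIN males m ON m.hero_id = h.id AND m.enabled = 1 AND m.deleted = 0
JOIN studios s ON s.id = h.studio_id
JOIN years y ON y.id = h.year_id
WHERE h.enabled = 1 AND h.deleted = 0
"""


def fetch_records(conn):
    """Run the query once and return plain dicts, independent of output format."""
    cur = conn.execute(QUERY)
    cols = [c[0] for c in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def to_csv(records):
    """Render the same records as CSV instead of JSON."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "type"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = fetch_records(conn)
as_json = json.dumps(records)
as_csv = to_csv(records)
print(as_json)
print(as_csv)
```

Because the query runs against the live tables on every request, a studio or year row that the partner deletes without telling you simply stops matching; there is no cached file to go stale.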

Do nothing with (no) match output in the lookup transformation (SSIS)

I'm new to SSIS as well as Stack Overflow.
Here's my situation.
I'm building a database and an archive database which need to be synced daily. The records in the database need to be copied to the archive. I use SSIS and daily jobs to do this. Obviously, I don't want SSIS to load all the data every time, only the new records (those that aren't in the archive yet). I want to use the Lookup transformation to achieve this. I'm testing it and it works: it only copies the new data to the "no match output" (which is my archive). But I have linked the "match output" to a new destination, and as there are many columns and records, it would be way too much to keep all that redundant data (of course I can purge the data, but I don't want those extra columns in the first place!). I actually don't want the "match output" to be sent anywhere! How can I do this? Or is there a solution that is more efficient than what I'm doing now (sending the matched output to a new destination and deleting those columns or records later on)?
P.S.
I already found a similar question on Stack Overflow (except the other way around: the asker wants to do nothing with the "no match output"): Sending no match output rows to nowhere
But I don't want to download or use the "Trash Destination"; I'd rather use what is already built into SSIS itself. And I don't understand how the Derived Column transformation could solve the problem.
There are no other answers on that question, so that's why I made a new thread.
Can anyone help me with this? (and excuse my English, it's not my native language)
Just don't map the match output. If that gives an error, map it to a Row Count transformation; that way you can keep track of the number of duplicate rows being handled.
It would be even better to filter this in the source component, though, for performance reasons.
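As a rough illustration of "filter this in the source": the idea is to let the source query return only rows that are not in the archive yet, so there is no match output to get rid of. The sketch below uses Python's built-in SQLite with made-up `live` and `archive` tables. It assumes both tables are reachable from one connection, which may not hold when the archive sits on another server.

```python
import sqlite3

# Hypothetical tables standing in for the live database and its archive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE live (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE archive (id INTEGER PRIMARY KEY, payload TEXT);
INSERT INTO live VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO archive VALUES (1, 'a');
""")


def sync_new_rows(conn):
    """Copy only the rows missing from the archive and return how many were copied."""
    cur = conn.execute("""
        INSERT INTO archive (id, payload)
        SELECT l.id, l.payload
        FROM live AS l
        WHERE NOT EXISTS (SELECT 1 FROM archive AS a WHERE a.id = l.id)
    """)
    conn.commit()
    return cur.rowcount


copied_first_run = sync_new_rows(conn)
copied_second_run = sync_new_rows(conn)  # nothing new, so nothing is copied
print(copied_first_run, copied_second_run)
```

The `NOT EXISTS` subquery does the same job as the Lookup's no match output, but the matched rows never leave the database, which is where the performance gain would come from.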

Import data from .xls to table by removing unwanted columns? [duplicate]

I need to import sheets which look like the following:
March Orders
***Empty Row
Week Order # Date Cust #
3.1 271356 3/3/10 010572
3.1 280353 3/5/10 022114
3.1 290822 3/5/10 010275
3.1 291436 3/2/10 010155
3.1 291627 3/5/10 011840
The column headers are actually in row 3. I can use an Excel Source to import them, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
Have a look:
The links have more details, but I've included some text from the pages (in case the links go dead).
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading the text file to SQL Server via SSIS, we have the
provision to skip any number of leading rows from the source and load
the data to SQL server. Is there any provision to do the same for
Excel file.
The source Excel file for me has some description in the leading 5
rows, I want to skip it and start the data load from the row 6. Please
provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in
SQL Server) and then use a conditional split to filter out everything
where the number <=5
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible, when importing data from Excel into a DB table, to skip the first 6 rows, for example?
Also, the Excel data is divided into sections with headers. Is it possible, for example, to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were
2 fields in the upper 5 rows that had some data that I think prevented
the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel source object, I used 'SQL Command' as the 'Data Access
Mode' (it's a drop-down when you double-click the Excel Source object).
From there I was able to build a query ('Build Query' button) that
only grabbed records I needed. Something like this: SELECT F4,
F5, F6 FROM [Spreadsheet$] WHERE (F4 IS NOT NULL) AND (F4
<> 'TheHeaderFieldName')
Note: I initially tried an ISNUMERIC instead of 'IS NOT NULL', but
that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't
NULL (and fortunately F4 didn't contain any junk in the first 5
rows). I could skip the whole header row (row 6) with the 2nd WHERE
clause.
So that cleaned up my data source perfectly. All I needed to do now
was add a Data Conversion object in between the source and destination
(everything needed to be converted from unicode in the spreadsheet),
and it worked.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, not repeat the headers multiple times and, most important of all, have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, as you will get the file in a different format every time depending on the mood of the person who maintains the Excel spreadsheet. Incidentally, we push really hard to never receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate). They also must know that any changes they make to the spreadsheet format will result in a change to the import package and that they will be charged for those development changes (assuming that these are outside clients and not internal ones). These changes must be communicated in advance and developer time scheduled; otherwise, a file with the wrong format will fail and be returned to them to fix.
If that doesn't work, may I suggest that you open the file, delete the first two rows, and save it as a text file in one data flow. Then write a second data flow that processes the text file. SSIS did a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.
"My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time."
Not entirely correct.
SSIS forces you to use the format, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit
You can just use the OpenRowset property you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
Regards.
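Outside SSIS, the "conditional split on header logic" idea quoted above boils down to a simple row filter. Here is a minimal Python sketch of it, using the sample sheet from the question as a list of rows. How the rows get read out of the .xls file is left out, and the function name `data_rows` is made up. Rows before the first header row, repeated header rows, and blank rows are dropped.

```python
HEADER = ["Week", "Order #", "Date", "Cust #"]


def data_rows(rows, header=HEADER):
    """Yield one dict per data row, ignoring titles, blank rows, and repeated headers."""
    seen_header = False
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        if cells == header:
            seen_header = True  # the first header, or a repeated section header
            continue
        if not seen_header or not any(cells):
            continue  # title rows above the header, or empty rows
        yield dict(zip(header, cells))


# The sheet from the question, already read into memory as lists of strings.
sheet = [
    ["March Orders", "", "", ""],
    ["", "", "", ""],
    ["Week", "Order #", "Date", "Cust #"],
    ["3.1", "271356", "3/3/10", "010572"],
    ["3.1", "280353", "3/5/10", "022114"],
    ["Week", "Order #", "Date", "Cust #"],
    ["3.1", "290822", "3/5/10", "010275"],
]

orders = list(data_rows(sheet))
for order in orders:
    print(order)
```

Matching on the header contents rather than on a fixed row number means the filter still works if next month's file has a different number of lines before the data starts.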

SSIS Package that removes or ignores multiple rows in flat file

I'm learning how to develop SSIS packages for ETL systems this week. One of my first objectives is to discover different ways to import flat files into a database. As this is pretty straightforward for the most part, I've been playing around with different flat files that contain a variety of data.
One issue I ran into today was with an Excel document that contained data in the first row, the header information in the second row and footer information in the last couple of rows. What I want to import into the database is the header and all the rows leading up to the footer. I do not want the first row and I do not want the footer.
My current solution is to create a Data Flow task and, in the advanced settings, set OpenRowset to "Sheet1$A2:I20000". This allows me to open the sheet I want, start at the second row (where my header resides) and then select all the other rows between A2 and I20000.
This solution also allows me to read the header information (which I want) and all the rows that follow for import. Unfortunately, it also selects the footer rows, and it doesn't seem optimized for performance, as the package has to scan a massive range of rows regardless of whether there is data in those rows or not.
The screenshot below contains the Excel sheet that I'm trying to import, based on the MS SQL sample database. The rows I want to remove or ignore are marked with the red boxes. Everything else is what I want to import.
Any thoughts on how I can ignore the first row, read the second row for my header information, read the rows that follow the header for my data set and then ignore the last couple of rows that I'm deeming as the footer?
Additional Information About This File
The first row will never change.
The header row will never change.
The data set after the header will change values, not data types.
The first column of footer will never change.
The second column of footer will change values, not data types.
The rest of the footer columns will never change.
I figured out the solution to my own question.
I used the Conditional Split as shown in my diagram to filter out the rows I didn't need. For example, I put in a condition that checks whether the first column of data (member_no) is < (less than) a number. If TRUE, the row goes to my OLE DB destination. If FALSE, it goes nowhere. This prevented the "SUM TOTAL" row from being passed to the database.
I also changed my range to 'Sheet1$A2:I' as opposed to 'Sheet1$A2:I20000'. That way the package scans until there are no more records to scan and then stops (I assume).
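For readers who want to see the split logic outside SSIS, here is a small Python sketch of the same idea. It is a variation rather than the exact condition above: instead of comparing member_no to a number, it keeps rows whose first column is purely numeric. The sample rows and the `split_rows` name are made up for illustration.

```python
def split_rows(rows, header_index=1):
    """Return (header, kept, dropped), skipping everything above the header row."""
    header = rows[header_index]
    kept, dropped = [], []
    for row in rows[header_index + 1:]:
        # A numeric first column marks a data row; anything else is footer.
        if str(row[0]).strip().isdigit():
            kept.append(row)
        else:
            dropped.append(row)
    return header, kept, dropped


# Hypothetical sheet: a title row, the header, two data rows, and a footer.
sheet = [
    ["Member report", ""],
    ["member_no", "lastname"],
    ["1", "Smith"],
    ["2", "Jones"],
    ["SUM TOTAL", "2"],
]

header, kept, dropped = split_rows(sheet)
print(header)
print(kept)
print(dropped)
```

Starting at `header_index` corresponds to beginning the range at A2, and the numeric check plays the part of the Conditional Split that keeps "SUM TOTAL" out of the destination.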

Manually Entered Data in Excel MS Query Is Misaligned After Refresh

I have set up an MS SQL query in Excel.
I have added extra columns to the Excel sheet in which I want to enter data manually.
When I refresh the data, these manually entered columns become misaligned with the imported data they refer to.
Is there any way around this?
I have tried to link the imported data sheet to a manual data sheet via VLOOKUP, but this isn't working as there are no unique fields to link on.
Please help!
Thanks
Excel version is 2010.
MS SQL version is 2005.
There is no unique data.
This is because Excel initially looks like this:
When we enter a new order into the database, Excel looks like this:
Try this: in the External Data Range Properties, select "Insert entire rows for new data".
Not sure, but worth a try. And keep us updated on the result!
Edit: And make sure you provide a consistent sort order.
There is no relationship between the spreadsheet's external data and the columns you are entering. When refreshing, the data is typically cleared and updated, though there are other options in the external data refresh menu. You could play around with those options to see whether changing what happens with new data helps.
If you want your manually entered data to link to the data in the embedded dataset, you have to establish the lookup with a VLOOKUP or some other formula that finds the row's info and shows it.
Basically, you are thinking of the SQL data on the spreadsheet as static, but it isn't unless you never refresh it or you disconnect it from the database.
Note that Marcel Beug has given a full solution to this problem in a more recent post in this forum: Inserting text manually in a custom column and should be visible on refresh of the report
He has even taken the time to record an example in a video: https://www.youtube.com/watch?v=duNYHfvP_8U&feature=youtu.be

Resources