I have a pipe delimited file that is too large to open in Excel. I'm trying to import this file into MSSQL using the import wizard in SSMS.
Normally when I do this, I open the file in Excel and use an array formula, =MAX(LEN(An:Annnn)), to get the max length of each column. Then I use that to specify the size of each field in my table.
This file is too large to open in Excel, and the import wizard doesn't check all of the data to give an accurate suggestion (I think it samples something crazy small, like 200 records).
Does anyone have a solution for this? (I'm not opposed to doing something in Linux, especially if it's free.)
Thanks in advance for any help.
When I import text data into a database, typically I first read the data into a staging table where all the columns are long-enough character fields (say varchar(8000)).
Then, I load from the staging table into the final table:
create table RealTable (
RealTableId int identity(1, 1) primary key,
Column1 int,
Column2 datetime,
Column3 varchar(12),
. . .
);
insert into RealTable(<all columns but id>)
select (case when column1 not like '%[^0-9]%' then cast(column1 as int) end),
       (case when isdate(column2) = 1 then cast(column2 as datetime) end),
       . . .
from StagingTable;
I find it much easier to debug type issues inside the database rather than when inserting into the database.
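A side benefit, as a sketch: once the data is staged, you can reproduce the Excel =MAX(LEN(...)) check in SQL to size the real table's columns (the staging table and column names are assumptions):
-- Max character length of each staged column, used to pick sizes for the real table.
SELECT MAX(LEN(column1)) AS MaxLenColumn1,
       MAX(LEN(column2)) AS MaxLenColumn2,
       MAX(LEN(column3)) AS MaxLenColumn3
FROM StagingTable;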
I query some data from table A (source) based on certain conditions and insert it into a temp table (destination) before upserting into CRM.
If the data already exists in CRM, I don't want to query it from table A and insert it into the temp table (I want that table to stay empty) unless the data has been updated or new data was created. So basically I want to query from table A only data that is new, or data that already exists in CRM but has since been modified. At the moment my data flow is like this:
clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one possible approach, SSIS DataFlow - copy only changed and new records, but it's not really clear on how to do it.
What is the best and simplest way to achieve this?
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process significantly quicker by partially selecting from table A would be to add a "last updated" timestamp to table A and enforce that it is always kept up to date.
One way to do this is with a trigger; see here for an example.
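For example, a minimal sketch of such a trigger (the table name, key column, and timestamp column are assumptions):
CREATE TRIGGER trg_TableA_SetLastUpdated
ON dbo.TableA
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Stamp every updated row with the current time.
    UPDATE a
    SET a.LastUpdated = GETDATE()
    FROM dbo.TableA a
    JOIN inserted i ON i.Id = a.Id;
END;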
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column in your final destination table.
You can then build a dynamic query for your data flow source in an SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + #[User::MaxModifiedOnDate] + "')"
@[User::MaxModifiedOnDate] (string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID in an SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries in the history table, so you'd need to insert a dummy record, or find some other way around it.
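One way around it, as a sketch: seed a dummy "successful" load dated far enough back to cover everything already in table A:
-- Placeholder first entry so MAX(DataLoadStart) always returns a value.
INSERT INTO DataLoadHistory (DataLoadStart, DataLoadEnd, Success)
VALUES ('19000101', '19000101', 1);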
THREE
I've seen it done a lot where, say, if your data load runs every day, you just stage the last 7 days, or something like that: a margin of safety that you're pretty sure will never be exceeded (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.
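As a sketch, the staging query for this fixed-margin approach could be as simple as the following (table and column names follow the earlier examples):
-- Stage everything touched in the last 7 days; the window is the margin of safety.
SELECT *
FROM [table A]
WHERE modifiedOn >= DATEADD(DAY, -7, CAST(GETDATE() AS date));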
We are importing external Excel files into our SQL database, but we need to do some integrity checks using a SQL script.
Here is my sample data:
Row_no  Student_area  Student_subject   Code
1       Science       Science and Tech  ABC
2       Science       Science and Teck  ABC
3       Arts          Pschycolgy        DEF
4       Arts          Pscycology        DEF
I need to identify the anomalies.
How do I do that?
Cheers
Oracle SQL has many neat features, but it does not exhibit human-style intelligence (yet), so it cannot identify "anomalies" in data. We must declare the rules for correctness.
In your case you need to define a set of correct values, preferably as reference data tables:
create table student_area (student_area varchar2(30));
insert into student_area values ('Science');
insert into student_area values ('Arts');
create table student_subject (student_area varchar2(30),
student_subject varchar2(128),
subject_code varchar2(3));
insert into student_subject values ('Science', 'Science and Tech', 'ABC');
insert into student_subject values ('Arts', 'Psychology', 'DEF');
Now you are ready to evaluate the contents of your file. The easiest way to do this is to convert the Excel file to CSV and build an external table over it. This is a special type of table where the data resides in an external OS file rather than in the database. Find out more.
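As a sketch, a minimal external table definition over that CSV might look like this (the directory path, file name, and delimiter are assumptions):
-- Directory object pointing at the folder that holds the CSV export.
create or replace directory ext_dir as '/data/imports';
create table external_table (
    row_no          number,
    student_area    varchar2(30),
    student_subject varchar2(128),
    code            varchar2(3)
)
organization external (
    type oracle_loader
    default directory ext_dir
    access parameters (
        records delimited by newline
        skip 1
        fields terminated by ','
        missing field values are null
    )
    location ('students.csv')
)
reject limit unlimited;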
If you create an external table with a column mapped to every column in the spreadsheet you can identify the anomalies like this:
select * from external_table ext
where ext.student_area not in ( select s.student_area
from student_area s )
/
select * from external_table ext
where (ext.student_area, ext.student_subject) not in
( select s.student_area, s.student_subject
from student_subject s )
/
I need your help with this, as I am stuck.
I have 2 tables, and the collation of some fields (address, client name) in both is Arabic.
Table one (uploaded_data) fields -----> client_name, client_address
I upload an Excel file into table 1 (uploaded_data) and it works 100% successfully; the address and name come through in Arabic,
meaning there are no problems in this table.
I added an insert trigger on table 1 to save the data into table 2 using:
select * from inserted
Table two (client_Files) fields ------> client_name, client_address
The problem is that when the trigger fires and the data is saved to table 2, the records don't show recognizable characters; they come out as garbage, because I use a parameter to save the data.
If I use the field directly, without the parameter, it works fine.
Can anyone advise? Note that all the name and address fields have Arabic collation.
Please advise.
Check if your column type is NVARCHAR instead of VARCHAR, because NVARCHAR supports Unicode.
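For example, a minimal sketch (table names, column names, and lengths are assumptions):
-- Arabic text needs Unicode-aware types end to end: the destination columns,
-- any variables or parameters used inside the trigger, and string literals (N'...').
ALTER TABLE client_Files ALTER COLUMN client_name nvarchar(200);
ALTER TABLE client_Files ALTER COLUMN client_address nvarchar(500);
-- Inside the trigger, declare the variables as nvarchar too
-- (this assumes one row per insert, as in the original trigger):
DECLARE @client_name nvarchar(200), @client_address nvarchar(500);
SELECT @client_name = client_name, @client_address = client_address FROM inserted;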
Is there any way to use the Bulk Insert statement and disable MAXERRORS?
I would like to allow for an infinite number of errors, as the number of errors can be high in the file I'm bulk inserting (I don't have control of this file, and am currently working with the vendor to fix their issues on certain rows).
If there isn't a way to disable it, what is the maximum number that MAXERRORS can handle? Is it 2147483647?
Normally, when I import data from external sources, I am very wary of problems in the data. SQL Server offers several solutions. Many people use SSIS. I avoid SSIS. One of the reasons is getting it to open an Excel file that is already opened by a user. It has a few other shortcomings as well.
In any case, once the data is in a text file, I create a staging table that has all the columns of the original table with the data type varchar(8000). This tends to be larger than necessary, but should be sufficient.
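A minimal sketch of that staging load (the file path, terminators, and column names are assumptions); because every staging column is a wide varchar, conversion errors are largely avoided, which sidesteps the MAXERRORS limit from the question:
CREATE TABLE StagingTable (
    CharColumn varchar(8000),
    IntColumn varchar(8000),
    FloatColumn varchar(8000),
    DateColumn varchar(8000)
);
BULK INSERT StagingTable
FROM 'C:\load\vendor_file.txt'
WITH (
    FIELDTERMINATOR = '|',      -- assumed delimiter
    ROWTERMINATOR = '\n',
    MAXERRORS = 2147483647      -- 2^31-1, the largest int value; effectively "don't stop on errors"
);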
Then, I create a table with the proper columns, and populate it using something like:
insert into RealTable (CharColumn, IntColumn, FloatColumn, DateTimeColumn)
select CharColumn,
       (case when isnumeric(IntColumn) = 1 and IntColumn not like '%.%' then cast(IntColumn as int) end),
       (case when isnumeric(FloatColumn) = 1 then cast(FloatColumn as float) end),
       (case when isdate(DateColumn) = 1 then cast(DateColumn as datetime) end)
from StagingTable st;
That is, I do the type checks in SQL code, using a case statement to avoid errors. The result is NULL values where the types don't match. I can then investigate the values in the database, using the StagingTable, to understand any issues.
Also, in the RealTable, I always have the following columns:
RealTableId int identity(1,1),
CreatedBy varchar(255) default system_user,
CreatedAt datetime default getdate()
These provide tracking information about the data that often comes in useful.
I have a table like this:
CREATE TABLE [Mytable](
[Name] [varchar](10),
[number] [nvarchar](100) )
I want to find [number]s that include alphabet characters.
The data should be formatted like this:
Name | number
---------------
Jack | 2131546
Ali | 2132132154
But sometimes the number is inserted malformed, and there are alphabet characters and other non-numeric characters in it, like this:
Name | number
---------------
Jack | 2[[[131546ddfd
Ali | 2132*&^1ASEF32154
I want to find these malformed rows.
I can't use LIKE, because LIKE makes my query very slow.
Updated to find all non-numeric characters:
select * from Mytable where number like '%[^0-9]%'
Regarding the comments on performance: maybe using CLR and regex would speed things up slightly, but the bulk of the cost for this query is going to be the number of logical reads.
A bit outside the box, but you could do something like:
bulk copy the data out of your table into a flat file
create a table that has the same structure as your original table but with a proper numeric type (e.g. int) for the [number] column.
bulk copy your data into this new table, making sure to specify a batch size of 1 and an error file (where rows that won't fit the schema will go)
rows that end up in the error file are the rows that have non-numerics in the [number] column
Of course, you could do the same thing with a cursor and a temp table or two...
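As a rough sketch of steps 2 to 4, using BULK INSERT for the load back in (the export in step 1 would be done with bcp or similar); the file path, terminators, and table name are assumptions:
CREATE TABLE MytableCheck (
    [Name] varchar(10),
    [number] int                -- proper numeric type, so non-numeric values are rejected
);
BULK INSERT MytableCheck
FROM 'C:\load\mytable.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 1,                              -- commit row by row, as suggested above
    MAXERRORS = 999999,                         -- keep loading past bad rows
    ERRORFILE = 'C:\load\mytable_errors.txt'    -- rejected (non-numeric) rows land here
);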