I have two data sets (190 million rows and 100 million rows). I need to find the most effective way to merge the two together with no duplicate rows. I can do this via T-SQL commands or an SSIS process. Does anyone have any input or experience as to which is the most effective way to complete this process?
Both tables have the same structure:
CREATE TABLE [dbo].[Table01](
[StudentId] [char](10) NOT NULL,
[CollegeId] [char](6) NOT NULL,
[TermId] [char](3) NOT NULL,
[CourseId] [char](12) NULL,
[Title] [char](68) NULL,
[SectionId] [char](6) NULL,
[UnitsEarned] [decimal](5, 2) NULL,
[Grade] [char](3) NULL,
[CreditFlag] [char](1) NULL,
[UnitsAttempted] [decimal](5, 2) NULL,
[TopCode] [char](6) NULL,
[TransferStatus] [char](1) NULL,
[UnitsMax] [decimal](5, 2) NULL,
[BSStatus] [char](1) NULL,
[SamCode] [char](1) NULL,
[ClassCode] [char](1) NULL,
[CollegeLevel] [char](1) NULL,
[NCRCategory] [char](1) NULL,
[CCLongTermId] [char](5) NULL,
[batch_id] [int] NULL
)
These are the fields that need to be distinct to eliminate duplicates:
[StudentId]
[CollegeId]
[TermId]
[CourseId]
The server that will be running this process has 8 cores, 32GB RAM, and SQL Server 2012.
Create a composite clustered index on all 4 columns on both tables. Create the destination table with a standard identity field primary key clustered index, but with a composite nonclustered index on those 4 columns. Insert into destination table from table01, using a derived table that also exposes ROW_NUMBER() OVER (PARTITION BY StudentId, CollegeId, TermId, CourseId), and filter out all rows except for those with ROW_NUMBER of 1. That will dedupe table01. Then do the same thing with table02, but also using a NOT EXISTS to check against the destination table to make sure that a row does not already exist.
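A minimal sketch of that approach follows; the destination table name dbo.MergedEnrollments, its identity column, and the ORDER BY used to break ties inside ROW_NUMBER() are assumptions, and the non-key columns are elided for brevity:
-- Destination: identity clustered PK plus a nonclustered index on the dedupe key
CREATE TABLE dbo.MergedEnrollments (
    [RowId] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    [StudentId] [char](10) NOT NULL,
    [CollegeId] [char](6) NOT NULL,
    [TermId] [char](3) NOT NULL,
    [CourseId] [char](12) NULL
    -- ...remaining columns from Table01...
);

CREATE NONCLUSTERED INDEX IX_MergedEnrollments_Key
    ON dbo.MergedEnrollments (StudentId, CollegeId, TermId, CourseId);

-- Pass 1: dedupe Table01 into the destination
INSERT INTO dbo.MergedEnrollments (StudentId, CollegeId, TermId, CourseId /* , ... */)
SELECT StudentId, CollegeId, TermId, CourseId /* , ... */
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY StudentId, CollegeId, TermId, CourseId
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.Table01
) AS t
WHERE t.rn = 1;

-- Pass 2: dedupe Table02 and skip keys already loaded from Table01
-- (rows with a NULL CourseId will not match on '=' and may need separate handling)
INSERT INTO dbo.MergedEnrollments (StudentId, CollegeId, TermId, CourseId /* , ... */)
SELECT StudentId, CollegeId, TermId, CourseId /* , ... */
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY StudentId, CollegeId, TermId, CourseId
                              ORDER BY (SELECT NULL)) AS rn
    FROM dbo.Table02
) AS t
WHERE t.rn = 1
  AND NOT EXISTS (
        SELECT 1
        FROM dbo.MergedEnrollments AS d
        WHERE d.StudentId = t.StudentId
          AND d.CollegeId = t.CollegeId
          AND d.TermId    = t.TermId
          AND d.CourseId  = t.CourseId
      );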
For a school project I have made a .csv import via C# and imported all the data from the file into a table containing only strings. We have to do some validation on the imported data using SQL Server, which I have already done. The table I have imported my data into looks like this:
CREATE TABLE [dbo].[StoreData]
(
[StoreName] [nvarchar](max) NULL,
[Street] [nvarchar](max) NULL,
[StreetNumber] [nvarchar](max) NULL,
[City] [nvarchar](max) NULL,
[ZipCode] [nvarchar](max) NULL,
[TelephoneNumber] [nvarchar](max) NULL,
[Country] [nvarchar](max) NULL
)
With this table filled, I have to insert this data into the [Stores] table:
CREATE TABLE [dbo].[Stores]
(
[Id] [nvarchar](450) NOT NULL, -- GUID
[Name] [nvarchar](85) NOT NULL,
[CountryCode] [nvarchar](max) NOT NULL,
[AddressId] [nvarchar](450) NULL -- FK to the [Addresses] table
)
And here is my problem: [Stores] contains an FK to the [Addresses] table:
CREATE TABLE [dbo].[Addresses]
(
[Id] [nvarchar](450) NOT NULL, -- GUID
[Street] [nvarchar](100) NOT NULL,
[HouseNumber] [nvarchar](4) NOT NULL,
[Addition] [nvarchar](10) NULL,
[ZipCode] [nvarchar](6) NOT NULL,
[City] [nvarchar](85) NOT NULL,
[SeriesIndicationStart] [int] NOT NULL,
[SeriesIndicationEnd] [int] NOT NULL,
CONSTRAINT [PK_Addresses] PRIMARY KEY CLUSTERED ([Id] ASC)
)
So now I have [StoreData] that contains the data I have to put into [Addresses] and [Stores], and I have to keep in mind that the FK has to be set in [Stores]. This is our first database semester, I am clueless, and tomorrow is the deadline.
I hope someone can help me out. Thanks in advance!
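One possible approach, purely as a sketch and not a definitive answer: generate the Address GUID per StoreData row first, then reuse it for the FK in Stores. The #StoreDataWithIds temp table and the 0 defaults for the SeriesIndication columns are assumptions, since StoreData has no matching fields:
-- Stage one generated Address GUID per source row so it can be reused for the FK
SELECT NEWID() AS AddressGuid, sd.*
INTO #StoreDataWithIds
FROM dbo.StoreData AS sd;

-- Insert the addresses first (StreetNumber must fit nvarchar(4); validate/truncate as needed)
INSERT INTO dbo.Addresses (Id, Street, HouseNumber, ZipCode, City, SeriesIndicationStart, SeriesIndicationEnd)
SELECT CONVERT(nvarchar(450), AddressGuid), Street, StreetNumber, ZipCode, City, 0, 0
FROM #StoreDataWithIds;

-- Then the stores, pointing each one at the address created from the same source row
INSERT INTO dbo.Stores (Id, Name, CountryCode, AddressId)
SELECT CONVERT(nvarchar(450), NEWID()), StoreName, Country, CONVERT(nvarchar(450), AddressGuid)
FROM #StoreDataWithIds;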
I am using SQL Server version 2012. I have a table which has more than 10 million rows. I have to count records using a SQL filter.
My query is this:
select count(*)
from reconcil
where tenantid = 101
which takes more than 5 minutes for 5 million records.
Is there a faster way to count records?
Reconcil table structure is
CREATE TABLE [dbo].[RECONCIL]
(
[AckCode] [nvarchar](50) NULL,
[AckExpireTime] [int] NULL,
[AckFileName] [nvarchar](255) NULL,
[AckKey] [int] NULL,
[AckState] [int] NULL,
[AppMsgKey] [nvarchar](30) NULL,
[CurWrkActID] [nvarchar](50) NULL,
[Date_Time] [datetime] NULL,
[Direction] [nvarchar](1) NULL,
[ErrorCode] [nvarchar](50) NULL,
[FGLOGKEY] [int] NOT NULL,
[FolderID] [int] NULL,
[FuncGCtrlNo] [nvarchar](14) NULL,
[INLOGKEY] [int] NULL,
[InputFileName] [nvarchar](255) NULL,
[IntCtrlNo] [nvarchar](14) NULL,
[IsAssoDataPresent] [nvarchar](1) NULL,
[JobState] [int] NULL,
[LOGDATA] [nvarchar](max) NULL,
[MessageID] [nvarchar](25) NULL,
[MessageState] [int] NULL,
[MessageType] [int] NULL,
[NextWrkActID] [nvarchar](50) NULL,
[NextWrkHint] [nvarchar](20) NULL,
[NONFAERRORLOG] [nvarchar](max) NULL,
[NumberOfBytes] [int] NULL,
[NumberOfSegments] [int] NULL,
[OutputFileName] [nvarchar](255) NULL,
[Priority] [nvarchar](1) NULL,
[ReceiverID] [nvarchar](30) NULL,
[RecNo] [int] NULL,
[RecordID] [int] IDENTITY(1,1) NOT NULL,
[RelationKey] [int] NULL,
[SEGLOG] [nvarchar](max) NULL,
[SenderID] [nvarchar](30) NULL,
[ServerID] [nvarchar](255) NULL,
[Standard] [int] NULL,
[TenantID] [int] NULL,
[TPAgreementKey] [int] NULL,
[TSetCtrlNo] [nvarchar](35) NULL,
[UserKey1] [nvarchar](255) NULL,
[UserKey2] [nvarchar](255) NULL,
[UserKey3] [nvarchar](255) NULL,
CONSTRAINT [RECONCIL_PK]
PRIMARY KEY CLUSTERED ([RecordID] ASC)
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Unless you materialize the count, a nonclustered index on TenantID will provide better performance because it is narrower than the clustered primary key index and only the matching rows need to be scanned:
CREATE INDEX idx ON [dbo].[RECONCIL](TenantID);
If performance of the aggregate query with this index isn't acceptable, you could create an indexed view with the count. The indexed view will provide the fastest performance for this query but will incur additional costs for storage and index maintenance for inserts and deletes. Also, queries that modify the table must have required SET options for indexed views. Those costs may be justified if the count query is executed often.
SQL Server can use the indexed view automatically in Enterprise (or Developer) editions even if not directly referenced in the query as long as the optimizer can match the semantics of the query using the view. In lesser editions, you'll need to query the indexed view directly and specify the NOEXPAND hint.
CREATE VIEW dbo.VW_RECONCIL_COUNT
WITH SCHEMABINDING
AS
SELECT
TenantID
, COUNT_BIG(*) AS TenentRowCount
FROM [dbo].[RECONCIL]
GROUP BY TenantID;
GO
CREATE UNIQUE CLUSTERED INDEX cdx ON dbo.VW_RECONCIL_COUNT(TenantID);
GO
--Enterprise Edition can use the view index automatically
SELECT COUNT_BIG(*) AS TenentRowCount
FROM [dbo].[RECONCIL]
WHERE TenantID = 101
GROUP BY TenantID;
GO
--other editions require the view to be specified plus the NOEXPAND hint
SELECT TenentRowCount
FROM dbo.VW_RECONCIL_COUNT WITH (NOEXPAND)
WHERE TenantID = 101;
GO
As has been suggested, create an index, or even partition your table by TenantID if you have that many rows. Each partition can then be placed on its own filegroup, which can improve performance.
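If you go the partitioning route, a minimal sketch might look like this (the boundary values and the filegroup placement are assumptions):
CREATE PARTITION FUNCTION pf_TenantRange (int)
    AS RANGE LEFT FOR VALUES (100, 200, 300);   -- assumed tenant boundaries

CREATE PARTITION SCHEME ps_TenantRange
    AS PARTITION pf_TenantRange ALL TO ([PRIMARY]);

-- The table (or its indexes) would then be created ON ps_TenantRange(TenantID).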
I'm not sure, but try using this:
select count(tenantid)
from reconcil
where tenantid = 101
group by tenantid;
I have a database that can have data updated from two external parties.
Each of those parties sends a pipe delimited text file that is BULK INSERTED into the staging table.
I now want to change the schema for one of the parties by adding a few columns, but unfortunately this breaks the BULK INSERT for the other party, even though the new columns are all added as NULLable.
Is there any obvious solution to this?
TABLE SCHEMA:
CREATE TABLE [dbo].[CUSTOMER_ENTRY_LOAD](
[CARD_NUMBER] [varchar](12) NULL,
[TITLE] [varchar](6) NULL,
[LAST_NAME] [varchar](34) NULL,
[FIRST_NAME] [varchar](40) NULL,
[MIDDLE_NAME] [varchar](40) NULL,
[NAME_ON_CARD] [varchar](26) NULL,
[H_ADDRESS_PREFIX] [varchar](50) NULL,
[H_FLAT_NUMBER] [varchar](5) NULL,
[H_STREET_NUMBER] [varchar](10) NULL,
[H_STREET_NUMBER_SUFFIX] [varchar](5) NULL,
[H_STREET] [varchar](50) NULL,
[H_SUBURB] [varchar](50) NULL,
[H_CITY] [varchar](50) NULL,
[H_POSTCODE] [varchar](4) NULL,
[P_ADDRESS_PREFIX] [varchar](50) NULL,
[P_FLAT_NUMBER] [varchar](5) NULL,
[P_STREET_NUMBER] [varchar](10) NULL,
[P_STREET_NUMBER_SUFFIX] [varchar](5) NULL,
[P_STREET] [varchar](50) NULL,
[P_SUBURB] [varchar](50) NULL,
[P_CITY] [varchar](50) NULL,
[P_POSTCODE] [varchar](4) NULL,
[H_STD] [varchar](3) NULL,
[H_PHONE] [varchar](7) NULL,
[C_STD] [varchar](3) NULL,
[C_PHONE] [varchar](10) NULL,
[W_STD] [varchar](3) NULL,
[W_PHONE] [varchar](7) NULL,
[W_EXTN] [varchar](5) NULL,
[DOB] [smalldatetime] NULL,
[EMAIL] [varchar](50) NULL,
[DNS_STATUS] [bit] NULL,
[DNS_EMAIL] [bit] NULL,
[CREDITCARD] [char](1) NULL,
[PRIMVISACUSTID] [int] NULL,
[PREFERREDNAME] [varchar](100) NULL,
[STAFF_NUMBER] [varchar](50) NULL,
[CUSTOMER_ID] [int] NULL,
[IS_ADDRESS_VALIDATED] [varchar](50) NULL
) ON [PRIMARY]
BULK INSERT STATEMENT:
SET @string_temp = 'BULK INSERT customer_entry_load FROM '+char(39)+@inpath
+@current_file+'.txt'+char(39)+' WITH (FIELDTERMINATOR = '+char(39)+'|'+char(39)
+', MAXERRORS=1000, ROWTERMINATOR = '+char(39)+'\n'+char(39)+')'
SET DATEFORMAT dmy
EXEC(@string_temp)
The documentation describes how to use a format file to handle the scenario where the target table has more columns than the source file. An alternative that can sometimes be easier is to create a view on the table and BULK INSERT into the view instead of the table; this possibility is described in the same documentation.
And please always mention your SQL Server version.
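As a sketch of the view idea described above (the view name and file path here are assumptions): a view that exposes only the original columns lets the unchanged party's file load exactly as before, while the new columns in the base table simply remain NULL:
CREATE VIEW dbo.CUSTOMER_ENTRY_LOAD_OLD
AS
SELECT CARD_NUMBER, TITLE, LAST_NAME, FIRST_NAME, MIDDLE_NAME, NAME_ON_CARD
       -- ...and the rest of the original columns, in the order of the old file layout
FROM dbo.CUSTOMER_ENTRY_LOAD;
GO

BULK INSERT dbo.CUSTOMER_ENTRY_LOAD_OLD
FROM 'C:\import\old_party_file.txt'   -- assumed path
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', MAXERRORS = 1000);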
Using OPENROWSET with BULK allows you to use your file in a query. You can use that to format the data and select only the columns you need.
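For example, something along these lines (a sketch; the file and format-file paths are assumptions, and a format file describing the pipe-delimited layout is still required):
INSERT INTO dbo.CUSTOMER_ENTRY_LOAD (CARD_NUMBER, TITLE, LAST_NAME, FIRST_NAME)  -- list only the columns present in the file
SELECT src.CARD_NUMBER, src.TITLE, src.LAST_NAME, src.FIRST_NAME
FROM OPENROWSET(
         BULK 'C:\import\party_a_file.txt',      -- assumed path
         FORMATFILE = 'C:\import\party_a.fmt'    -- assumed format file
     ) AS src;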
In the end I have handled the two different cases with two different BULK INSERT statements (depending on which file is being processed). It seems like there isn't a way to do what I was trying to do with one statement.
You could use the format file idea suggested by @Pondlife.
Adapt your insert dynamically based on the input file name (provided there are unique differences between the external parties' file names). Using a CASE expression, simply select the correct format file based on the unique identifier in the file name.
DECLARE @formatFile varchar(max);
SET @formatFile =
CASE
WHEN @current_file LIKE '%uniqueIdentifier%'
THEN 'file1'
ELSE 'file2'
END
SET @string_temp = 'BULK INSERT customer_entry_load FROM '+char(39)+@inpath
+@current_file+'.txt'+char(39)+' WITH (FORMATFILE = '+char(39)+@formatFile+char(39)
+')'
SET DATEFORMAT dmy
EXEC(@string_temp)
Hope that helps!
I have been playing around with a quite complex SQL Statement for a few days, and have gotten most of it working correctly.
I am having trouble with one last part, and was wondering if anyone could shed some light on the issue, as I have no idea why it isn't working:
INSERT INTO ExistingClientsAccounts_IMPORT
SELECT DISTINCT
cca.AccountID, cca.SKBranch, cca.SKAccount, cca.SKName, cca.SKBase,
cca.SyncStatus, cca.SKCCY, cca.ClientType, cca.GFCID, cca.GFPID, cca.SyncInput,
cca.SyncUpdate, cca.LastUpdatedBy, cca.Deleted, cca.Branch_Account, cca.AccountTypeID
FROM ClientsAccounts AS cca
INNER JOIN
(SELECT DISTINCT ClientAccount, SKAccount, SKDesc,
SKBase, SKBranch, ClientType, SKStatus, GFCID,
GFPID, Account_Open_Date, Account_Update
FROM ClientsAccounts_IMPORT) AS ccai
ON cca.Branch_Account = ccai.ClientAccount
Table definitions follow:
CREATE TABLE [dbo].[ExistingClientsAccounts_IMPORT](
[AccountID] [int] NOT NULL,
[SKBranch] [varchar](2) NOT NULL,
[SKAccount] [varchar](12) NOT NULL,
[SKName] [varchar](255) NULL,
[SKBase] [varchar](16) NULL,
[SyncStatus] [varchar](50) NULL,
[SKCCY] [varchar](5) NULL,
[ClientType] [varchar](50) NULL,
[GFCID] [varchar](10) NULL,
[GFPID] [varchar](10) NULL,
[SyncInput] [smalldatetime] NULL,
[SyncUpdate] [smalldatetime] NULL,
[LastUpdatedBy] [varchar](50) NOT NULL,
[Deleted] [tinyint] NOT NULL,
[Branch_Account] [varchar](16) NOT NULL,
[AccountTypeID] [int] NOT NULL
) ON [PRIMARY]
CREATE TABLE [dbo].[ClientsAccounts_IMPORT](
[NEWClientIndex] [bigint] NOT NULL,
[ClientGroup] [varchar](255) NOT NULL,
[ClientAccount] [varchar](255) NOT NULL,
[SKAccount] [varchar](255) NOT NULL,
[SKDesc] [varchar](255) NOT NULL,
[SKBase] [varchar](10) NULL,
[SKBranch] [varchar](2) NOT NULL,
[ClientType] [varchar](255) NOT NULL,
[SKStatus] [varchar](255) NOT NULL,
[GFCID] [varchar](255) NULL,
[GFPID] [varchar](255) NULL,
[Account_Open_Date] [smalldatetime] NULL,
[Account_Update] [smalldatetime] NULL,
[SKType] [varchar](255) NOT NULL
) ON [PRIMARY]
The error message I get is:
Msg 8152, Level 16, State 14, Line 1
String or binary data would be truncated.
The statement has been terminated.
The error occurs because you are trying to insert data into a column in ExistingClientsAccounts_IMPORT whose size is smaller than the length of the data being inserted into it.
e.g.
SKAccount column is VARCHAR(12) in the ExistingClientsAccounts_IMPORT table but is VARCHAR(255) in ClientsAccounts_IMPORT.
So if ClientsAccounts_IMPORT contains any rows where that field is longer than 12 characters, you will get this error: 100 characters, for example, will not fit into a 12-character field.
You need to make sure all the columns in the table you are inserting into are big enough - make sure each column definition matches the source table.
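To find the offending rows before the insert, something like this can help (a sketch; adjust the column and the length threshold to match the destination definition):
-- Source rows whose SKAccount would overflow the varchar(12) destination column
SELECT cca.AccountID, cca.SKAccount, LEN(cca.SKAccount) AS ActualLength
FROM ClientsAccounts AS cca
WHERE LEN(cca.SKAccount) > 12;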
The third column of your SELECT column list means that ExistingClientsAccounts_IMPORT.SKAccount is populated from ClientsAccounts.SKAccount - however, the source is up to 255 characters, while the destination has a capacity of 12. If there's any data that wouldn't fit, you'll get this message.
I haven't gone through all the other columns.
You are trying to insert values which are greater than the max length specified for a column. Use a profiler to check the data being passed to this query and verify the length of the data against the permissible length for all columns.
There is a clear mismatch in the column lengths of the common columns of these two tables.
ClientsAccounts_IMPORT.SKBase is 10, whereas it is 16 in the other table.
Check the field definitions. You can see some are smaller than the original ones. Now run a query on the old data - you will find that some of the larger fields were used, so the insert is not possible without losing data.
Example: SKAccount goes from a length of 255 down to 12.
To estimate what my database size would be in the production environment, I populated my tables with 1.5 million rows of nearly identical data (except for the primary key). It currently shows 261 MB.
Can I rely on this figure, or, since the data is almost the same in all the other columns, has SQL Server compressed it? That is, will the size be different if the values in each row are different?
Also, do columns with NULL values contribute to the size of the database?
Thanks for your time.
Edit: Here is my schema. I have also made some indexes.
CREATE TABLE [dbo].[Trn_Tickets](
[ObjectID] [bigint] IDENTITY(1,1) NOT NULL,
[TicketSeqNo] [bigint] NULL,
[BookSeqNo] [bigint] NULL,
[MatchID] [int] NULL,
[TicketNumber] [varchar](20) NULL,
[BarCodeNumber] [varchar](20) NULL,
[GateNo] [varchar](5) NULL,
[EntryFrom] [varchar](10) NULL,
[MRP] [decimal](9, 2) NULL,
[Commission] [decimal](9, 2) NULL,
[Discount] [decimal](9, 2) NULL,
[CashPrice] [decimal](9, 2) NULL,
[CashReceived] [decimal](9, 2) NULL,
[BalanceDue] [decimal](9, 2) NULL,
[CollectibleFrom] [char](1) NULL,
[PlaceOfIssue] [varchar](20) NULL,
[DateOfIssue] [datetime] NULL,
[PlaceOfSale] [varchar](20) NULL,
[AgentID] [int] NULL,
[BuyerID] [int] NULL,
[SaleTypeID] [tinyint] NULL,
[SaleDate] [smalldatetime] NULL,
[ApprovedBy] [varchar](15) NULL,
[ApprovedDate] [smalldatetime] NULL,
[InvoiceStatus] [char](1) NULL,
[InvoiceRefNo] [varchar](15) NULL,
[InvoiceDate] [smalldatetime] NULL,
[BookPosition] [char](2) NULL,
[TicketStatus] [char](2) NULL,
[RecordStatus] [char](1) NULL,
[ClosingStatus] [char](2) NULL,
[ClosingDate] [datetime] NULL,
[UpdatedDate] [datetime] NULL,
[UpdatedUser] [varchar](10) NULL,
CONSTRAINT [PK_Trn_Tickets] PRIMARY KEY CLUSTERED
(
[ObjectID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Hope this helps
SQL Server 2005 and 2008 Express will not compress your data. SQL Server 2008 can use page compression, but only in Enterprise Edition. A NULL column still takes a bit in the row's NULL bitmap, and fixed-length columns (char, int, decimal, datetime) reserve their full width even when NULL.
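For reference, on an edition that supports it, page compression is enabled with a rebuild (a sketch against the table from the question):
ALTER TABLE dbo.Trn_Tickets REBUILD WITH (DATA_COMPRESSION = PAGE);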
From the description of your data, this sounds more like a problem of ordinary normalization. Separate the repeated values into a lookup table, store only distinct combinations, and join against the lookup table. This saves space by schema design and will work on all DB platforms, all versions, all SKUs.
Replace ApprovedBy etc. (varchar) with lookups to other tables (see the sketch after this list)
Do you need datetime?
Do you expect more than 4 billion rows? Why are the first 3 columns bigint?
Save a few bytes here and there = a big difference. Higher page density (e.g. more rows per 8 KB page) = less space + smaller indexes.
Compress when you have 1.5 billion rows.
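As a sketch of the lookup-table idea for a column such as ApprovedBy (the table and column names here are assumptions):
CREATE TABLE dbo.Lkp_ApprovedBy (
    ApprovedByID smallint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ApprovedBy   varchar(15) NOT NULL UNIQUE
);

-- Trn_Tickets would then store the 2-byte ApprovedByID instead of repeating the varchar(15) value:
-- [ApprovedByID] [smallint] NULL REFERENCES dbo.Lkp_ApprovedBy (ApprovedByID)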