CSV to SQL Server

In the CSV files that I want to bulk insert into SQL Server, there is a piece of text (a serial number) that sits outside the normal column layout, and I would LOVE to use it as a primary key.
Example:
Data from Engine SQL03423,
version 21.04,
time, speed, temp
june 3 1:00, 90, 200
june 3 1:01, 69, 392
I want to use SQL03423 as a primary key in my database.
However, I get reports from this particular engine daily, so if I use the serial number as the primary key I'm sure I'll run into duplicate-key errors the next time I insert new data.
How do I get around this?
I need the serial number in the table regardless, even if it's not the primary key.
Also, if I can't use it as a primary key, how do I create a "dummy" primary key in the target table that will auto-increment, even though that column obviously isn't in the CSV files I'm importing? Is this even possible?
I am aware of stored procedures, views, etc. in SQL. I have basic knowledge, if that helps.

I'd suggest using an "artificial" primary key most of the time, which for your question means you should add an extra column instead of using the machine's serial number. Preferably this should be a UUID (GUID) value rather than a simple INT. Why? Have a look at this great post from Jeff Atwood explaining it: https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Just add a line like this to your table declaration:
THE_IDCOL_NAME UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY
Find more about NewID() here: https://learn.microsoft.com/en-us/sql/t-sql/functions/newid-transact-sql
One of the great advantages of using UUIDs is that you do not have to worry about overlapping number ranges when generating IDs. Example scenario from my experience: a customer rents machines worldwide and the software logs usage information. Each site uses an independent database that is periodically consolidated into the HQ database. If any of the sites had used overlapping number ranges as primary keys, they would have been in trouble. Using GUIDs solved this.
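To make this concrete, here is a minimal sketch under assumed names (EngineReadings, SerialNumber and the staging table are illustrative, not from your setup): the GUID primary key fills itself in via the DEFAULT, and the serial number is kept as an ordinary column you can index and filter on.
-- Illustrative names only; the PK column never appears in the CSV,
-- its DEFAULT fills it in on every insert.
CREATE TABLE dbo.EngineReadings
(
    ReadingID    UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_EngineReadings_ID DEFAULT NEWID()
        CONSTRAINT PK_EngineReadings PRIMARY KEY,
    SerialNumber VARCHAR(20)  NOT NULL,   -- e.g. 'SQL03423', repeated on every row
    ReadingTime  DATETIME     NOT NULL,
    Speed        INT          NOT NULL,
    Temp         INT          NOT NULL
);

-- One way to load it: BULK INSERT the CSV into a staging table that matches
-- the file's columns, then copy it over and supply the serial number yourself.
INSERT INTO dbo.EngineReadings (SerialNumber, ReadingTime, Speed, Temp)
SELECT 'SQL03423', ReadingTime, Speed, Temp
FROM dbo.EngineReadings_Staging;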

Related

Find foreign keys based on data

I am looking at a database which has almost no foreign keys defined.
Is there a tool that can perform some data analysis/heuristics and "guess" the relations based on the data? I am looking for some kind of report that could be used as a manual guide/checklist.
I had a similar problem - every table had an Object_ID column... but also had secondary IDs.
All were of a weird GUID-ish form.
I ended up writing a brute-force scanner (using dynamic SQL driven by information_schema.columns).
Of course this approach relied on the values being globally unique... If you have a bunch of INT IDENTITY columns and no way to connect the tables, then you are in a bit of trouble!
Perhaps there is a timestamp column, or a DATETIME defaulting to GETDATE() - you could use this to identify records in different tables that were created at approximately the same time.
A lot depends on your schema...
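For what it's worth, a stripped-down sketch of that brute-force idea (all names are generic, each pair shows up twice, and the generated query can be very slow on a large schema): build one comparison per pair of uniqueidentifier columns from INFORMATION_SCHEMA.COLUMNS and look for overlapping values.
-- Generate a cross-table comparison for every pair of uniqueidentifier columns.
-- A high count of matching values hints at an undeclared relationship.
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'';

SELECT @sql = @sql + N'
SELECT ''' + c1.TABLE_NAME + '.' + c1.COLUMN_NAME + ''' AS col_a,
       ''' + c2.TABLE_NAME + '.' + c2.COLUMN_NAME + ''' AS col_b,
       COUNT(*) AS matching_values
FROM ' + QUOTENAME(c1.TABLE_SCHEMA) + '.' + QUOTENAME(c1.TABLE_NAME) + ' a
JOIN ' + QUOTENAME(c2.TABLE_SCHEMA) + '.' + QUOTENAME(c2.TABLE_NAME) + ' b
  ON a.' + QUOTENAME(c1.COLUMN_NAME) + ' = b.' + QUOTENAME(c2.COLUMN_NAME) + '
HAVING COUNT(*) > 0;'
FROM INFORMATION_SCHEMA.COLUMNS c1
JOIN INFORMATION_SCHEMA.COLUMNS c2
  ON  c2.DATA_TYPE = c1.DATA_TYPE
  AND (c2.TABLE_NAME <> c1.TABLE_NAME OR c2.COLUMN_NAME <> c1.COLUMN_NAME)
WHERE c1.DATA_TYPE = 'uniqueidentifier';

EXEC sp_executesql @sql;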

How SQL Server creates uniqueidentifier using NEWID()

I have a user_info table in five different locations. Now I need to integrate all the rows into a central user_info table. To do this I need a unique ID for each row in the source tables, because when all the rows come into the central table, each user must still have a unique ID.
Now my questions are:
If I use uniqueidentifier NEWID() in each source table, will it be globally unique for its lifetime, or is there any chance of a duplicate?
How does SQL Server create the NEWID() value? I need to know the key-generation structure.
Yes, it will be unique - there is effectively no chance of a duplicate between machines.
NEWID() is based on a combination of a pseudo-random number (derived from the clock) and the MAC address of the primary NIC.
However, inserting random values like this as the clustered key on a table is terrible for performance. Consider either NEWSEQUENTIALID() or a COMB-style function for generating GUIDs: they still offer the collision-avoidance benefits of NEWID() while maintaining acceptable INSERT performance.
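A small illustrative sketch (table and column names are made up): NEWSEQUENTIALID() can only be used in a DEFAULT constraint, not called directly the way NEWID() can.
-- Sequential GUIDs avoid the page splits that random NEWID() values cause
-- on the clustered index.
CREATE TABLE dbo.user_info_central
(
    user_uid    UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_user_info_uid DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_user_info_central PRIMARY KEY CLUSTERED,
    source_site VARCHAR(50)   NOT NULL,   -- which of the five locations the row came from
    user_name   NVARCHAR(200) NOT NULL
);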

data distribution based on primary key

Currently, one of my projects supports 32k entities, but it's reaching its performance limits, so we're thinking of distributing them across different databases based on their integer primary keys. E.g. the first 35k go to one DB, the next 35k to the next DB, and so on (based on (primary key % #db) logic).
However, this presents a problem when we're inserting an entity: since we don't know its primary key value beforehand, how do we figure out which DB to insert it into?
One possibility is maintaining a global ID table in only one DB. We insert into it first, get the primary key value, and then use it to choose a DB for the detailed insert. But this solution is not uniform, and hence difficult to maintain and extend. Any other thoughts on how to go about it?
Found this nice article that talks about how Flickr solved this problem:
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
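The core trick from the article, roughly sketched in T-SQL (Flickr used MySQL; the names here are illustrative): a tiny dedicated table hands out IDs atomically, and the application then routes the row to whichever shard (id % number_of_databases) selects.
-- One-time setup for the "ticket" table.
CREATE TABLE dbo.Tickets (CurrentId BIGINT NOT NULL);
INSERT INTO dbo.Tickets (CurrentId) VALUES (0);

-- Getting the next ID is a single atomic read-and-increment.
DECLARE @NewId BIGINT;
UPDATE dbo.Tickets
SET @NewId = CurrentId = CurrentId + 1;
SELECT @NewId AS NewId;   -- insert the entity into database (NewId % number_of_databases)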

SQL Server (2005) - "Deleted On" DATETIME and Indexing

I have a question related to database design. The database I'm working with requires that data is never physically deleted. We started going down the path of adding a "DeleteDateTime" column to some tables that is NULL by default but, once stamped, marks a record as deleted.
This gives us the ability to archive our data easily, but I still feel in the dark in a few areas: specifically, whether this would be considered in line with best practices, and how to go about indexing these tables efficiently.
I'll give you an example: we have a table called "Courses" with a composite primary key made up of the columns "SiteID" and "CourseID". This table also has a column called "DeleteDateTime" that is used as described above.
I can't use the SQL Server 2008 filtered index feature because we have to stay SQL Server 2005 compatible. Should I include "DeleteDateTime" in the clustered index for this table? If so, should it be the first column in the index (i.e. "DeleteDateTime, SiteID, CourseID")?
Does anyone have any reasons why I should or shouldn't follow this approach?
Thanks!
Is there a chance you could transfer those "dead" records into a separate table? E.g. for your Courses table, have a Courses_deleted table (or something like that) with an identical structure.
When you "delete" a record, you basically just move it to the "dead" table. That way, the index on your actual, current data stays small and zippy.
If you need an aggregate view, you can always define a Courses_View which unions the two tables together.
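A minimal sketch of that union view, assuming Courses and Courses_deleted share the same structure (column list abbreviated, names illustrative):
CREATE VIEW dbo.Courses_View
AS
    SELECT SiteID, CourseID, DeleteDateTime   -- plus the remaining columns
    FROM dbo.Courses
    UNION ALL
    SELECT SiteID, CourseID, DeleteDateTime
    FROM dbo.Courses_deleted;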
Your clustered index on your real table should be as small, static, and constant as possible, so I would definitely NOT recommend putting such a datetime column into it. Not a good idea.
For excellent info on how to choose a good clustering key, and what it takes, check out Kimberly Tripp's blog entries:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
What are your requirements on data retention? Have you looked into an audit log instead of keeping all non-current data in the database?
I think you have it right on the head with the composite indexes including your "DeleteDateTime" column.
I would create a view that is basically:
select {list all columns except DeleteDateTime}
from mytable
where DeleteDateTime is null
This is what I would use for all my queries on the table. The reason is to prevent people from forgetting to consider the deleted flag. SQL Server 2005 can easily handle this kind of view, and it is necessary if you are going to use this design for deleting records. I would have a separate index on the deleted column; I likely would not make it part of the clustered index.
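For the separate index mentioned above, something as simple as this works on SQL Server 2005 (filtered indexes would need 2008+); the composite clustered key (SiteID, CourseID) is carried into it automatically as the row locator:
CREATE NONCLUSTERED INDEX IX_Courses_DeleteDateTime
    ON dbo.Courses (DeleteDateTime);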

Choice of primary key type

I have a table that will potentially have a high number of inserts per second, and I'm trying to choose the type of primary key to use. For illustrative purposes let's say it's a users table. I am trying to choose between GUID and BIGINT as the primary key, and ultimately as the UserID across the app. If I use GUID, I save a trip to the database to generate a new ID, but a GUID is not "user-friendly" and it's not possible to partition the table by this ID (which I'm planning to do). Using BIGINT is much more convenient, but generating it is a problem - I can't use IDENTITY (there is a reason for that), so my only choice is to have a helper table that contains the last used ID, and then I call this stored proc:
create proc GetNewID @ID BIGINT OUTPUT
as
begin
    update HelperIDTable set @ID = id, id = id + 1
end
to get the new ID. But then this helper table is an obvious bottleneck, and I'm concerned with how many updates per second it can handle.
I really like the idea of using BIGINT as the PK, but the bottleneck problem concerns me - is there a way to roughly estimate how many IDs it could produce per second? I realize it highly depends on hardware, but are there any physical limitations, and what order of magnitude are we looking at? 100s/sec? 1000s/sec?
Any ideas on how to approach the problem are highly appreciated! This problem hasn't let me sleep for many nights now!
Thanks!
Andrey
GUIDs seem to be a natural choice - and if you really must, you could probably argue to use one for the PRIMARY KEY of the table - the single value that uniquely identifies the row in the database.
What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry in each and every non-clustered index on your table as well - so you really want to make sure it's as small as possible. Typically, an INT - good for 2+ billion rows - should be sufficient for the vast majority of tables, and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
So to sum it up: unless you have a really good reason, I would always recommend an INT IDENTITY field as the primary / clustered key on your table.
Marc
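Setting aside the asker's constraint on IDENTITY, a hedged sketch of what this recommendation looks like (names made up): a narrow, ever-increasing INT IDENTITY as the clustered primary key, with a GUID kept as a separate non-clustered unique column if the application still wants one it can generate client-side.
CREATE TABLE dbo.Users
(
    UserID   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Users PRIMARY KEY CLUSTERED,    -- small, static, ever-increasing
    UserGuid UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Users_Guid DEFAULT NEWID()
        CONSTRAINT UQ_Users_Guid UNIQUE NONCLUSTERED, -- lookup key, not the clustering key
    UserName NVARCHAR(200) NOT NULL
);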
I try to use GUID PKs for all tables except small lookup tables. The GUID concept ensures that the identity of the object can safely be created in memory, without a round trip to the database, and saved later without changing that identity.
When you need a "human readable" ID, you can assign an auto-incrementing INT when the row is saved. For partitioning, you could also create the BIGINTs later, via a scheduled database job, for many users in one shot.
Do you want a primary key, for business reasons, or a clustered key, for storage concerns?
See stackoverflow.com/questions/1151625/int-vs-unique-identifier-for-id-field-in-database for a more elaborate post on the topic of PK vs. clustered key.
You really have to explain why you can't use IDENTITY. Generating the IDs manually, and especially on the server with an extra round trip and an update just to generate each ID for the insert, won't scale. You'd be lucky to reach the lower 100s per second. The problem is not just the round-trip and update time, but primarily the interaction of the ID-generation update with insert batching: the insert-batching transaction will serialize ID generation. The workaround is to separate the ID generation into its own session so it can auto-commit, but then the insert batching is pointless because the ID generation is not batched: it has to wait for a log flush after each ID generated in order to commit. Compared to this, UUIDs will run circles around your manual ID generation. But UUIDs are a horrible choice for the clustered key because of fragmentation.
Try to hit your DB with a script, perhaps using JMeter to simulate concurrent hits. Perhaps you can then measure for yourself how much load you can handle. Your DB could also be the bottleneck - which one is it? I would prefer PostgreSQL for heavy load, like Yahoo and Skype also do.
An idea that requires serious testing: try creating (inserting) new rows in batches - say 1,000 (10,000? 1M?) at a time. You could have a master (aka bottleneck) table listing the next one to use, or you might have a query that does something like:
select min(id) from users where name = ''
Generate a fresh batch of empty rows in the morning, every hour, or whenever you're down to a certain number of free ones. This only addresses the issue of generating new IDs, but if that's the main bottleneck it might help.
A table-partitioning option: assuming a BIGINT ID column, how are you defining the partitions? If you are allowing for 1G rows per day, you could set up a new partition in the evening (day 1 = 1,000,000,000 through 1,999,999,999; day 2 = 2,000,000,000 through 2,999,999,999; etc.) and then swap it in when it's ready. You are of course limited to 1,000 partitions, so with BIGINTs you'll run out of partitions before you run out of IDs.
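A hedged sketch of the "reserve in batches" idea applied to the HelperIDTable from the question (block size and variable names are illustrative): grab a whole range of IDs in one round trip and hand them out in application memory, which cuts the number of updates against the bottleneck table by the block size.
-- Helper table as described in the question, but incremented in blocks.
DECLARE @BlockSize BIGINT, @FirstId BIGINT;
SET @BlockSize = 1000;

UPDATE HelperIDTable
SET @FirstId = id, id = id + @BlockSize;   -- reserves @FirstId .. @FirstId + @BlockSize - 1

SELECT @FirstId AS FirstId, @FirstId + @BlockSize - 1 AS LastId;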
