I have an MS SQL 2008 database which stores data for creating a weighted, undirected graph. The data is stored in tables with the following structure:
[id1] [int] NOT NULL,
[id2] [int] NOT NULL,
[weight] [float] NOT NULL
where [id1] and [id2] represent the two connected nodes and [weight] the weight of the edge that connects these nodes.
There are several different algorithms that create the graph from some basic data. For each algorithm, I want to store the graph data in a separate table. Those tables all have the same structure (as shown above) and use a specific prefix (similarityALB, similaritybyArticle, similaritybyCategory, ...) so I can identify them as graph tables.
The client program can select which table (i.e. by which algorithm the graph is created) to use for further operations.
Access to the data is done by stored procedures. As I have different tables, I would need to use a variable table name, e.g.:
SELECT id1, id2, weight FROM @tableName
This doesn't work because SQL doesn't support variable table names in the statement. I have searched the web, and all solutions to this problem use the dynamic SQL EXEC() statement, e.g.:
EXEC('SELECT id1, id2, weight FROM ' + @tableName)
As most of them mention, this makes the statement prone to SQL injection, which I'd like to avoid. A simple redesign idea would be to put all the different graphs in one table and add a column to identify each graph:
[graphId] [int] NOT NULL,
[id1] [int] NOT NULL,
[id2] [int] NOT NULL,
[weight] [float] NOT NULL
My problem with this solution is that the graphs can be very large depending on the algorithm used (up to 500 million entries). I need to index the table over (id1, id2) and (id2, id1). Putting them all in one big table would make the table even larger (and queries slower). Adding a new graph would result in bad performance because of the active indexes, and deleting a graph could no longer be done with TRUNCATE; I would need to use
DELETE FROM myTable WHERE graphId = @Id
which performs very badly on large tables and creates a very large log file (which would exceed my disk space when the graph is big enough). So I'd like to keep independent tables for each graph.
Any suggestions on how to solve this, either by finding a way to parameterize the table name or by redesigning the database structure while avoiding the aforementioned problems?
SQL injection can easily be avoided in this case by comparing @tableName to the names of the existing tables. If it isn't one of them, it's bad input. (Obligatory xkcd reference: That is, unless you have a table called "bobby'; drop table students;")
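For example, a minimal sketch of such a whitelist check inside a stored procedure (assuming the graph tables live in the dbo schema and share the 'similarity' prefix from the question):
CREATE PROCEDURE dbo.GetGraph (@tableName sysname)
AS
BEGIN
-- reject anything that is not an existing user table with the expected prefix
IF NOT EXISTS (SELECT 1 FROM sys.tables WHERE name = @tableName AND name LIKE 'similarity%')
BEGIN
RAISERROR('Unknown graph table.', 16, 1);
RETURN;
END
-- QUOTENAME brackets the validated name, so it cannot break out of the identifier
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT id1, id2, weight FROM dbo.' + QUOTENAME(@tableName) + N';';
EXEC sp_executesql @sql;
END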
Anyway, regarding your performance problems: with partitioned tables (available since SQL Server 2005), you can get the same advantages as having several tables, but without the need for dynamic SQL.
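A rough sketch of what that could look like (names are illustrative; partitioning requires Enterprise Edition on SQL Server 2008, and the staging table used for the switch must match the partitioned table's structure, indexes, and filegroup):
CREATE PARTITION FUNCTION pfGraphId (int)
AS RANGE LEFT FOR VALUES (1, 2, 3); -- one boundary value per graph

CREATE PARTITION SCHEME psGraphId
AS PARTITION pfGraphId ALL TO ([PRIMARY]);

CREATE TABLE dbo.GraphEdges
(
graphId int NOT NULL,
id1 int NOT NULL,
id2 int NOT NULL,
weight float NOT NULL
) ON psGraphId (graphId);

-- deleting one graph: switch its partition out to an empty staging table,
-- then truncate the staging table; a metadata operation with minimal logging
ALTER TABLE dbo.GraphEdges SWITCH PARTITION 2 TO dbo.GraphEdges_Staging;
TRUNCATE TABLE dbo.GraphEdges_Staging;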
Maybe I did not understand everything, but:
CREATE PROCEDURE dbo.GetMyData (
@TableName AS varchar(50)
)
AS
BEGIN
IF @TableName = 'Table_1'
BEGIN
SELECT id1
,id2
,[weight]
FROM dbo.Table_1
END
IF @TableName = 'Table_2'
BEGIN
SELECT id1
,id2
,[weight]
FROM dbo.Table_2
END
END
and then:
EXEC dbo.GetMyData @TableName = 'Table_1'
A different technique involves using synonyms dynamically, for example:
DECLARE @TableName varchar(50)
SET @TableName = 'Table_1'
-- drop synonym if it exists
IF object_id('dbo.MyCurrentTable', 'SN') IS NOT NULL
DROP SYNONYM MyCurrentTable ;
-- create synonym for the current table
IF @TableName = 'Table_1'
CREATE SYNONYM dbo.MyCurrentTable FOR dbo.Table_1 ;
IF @TableName = 'Table_2'
CREATE SYNONYM dbo.MyCurrentTable FOR dbo.Table_2 ;
-- use synonym
SELECT id1, id2, [weight]
FROM dbo.MyCurrentTable
Partitioned tables may be the answer to your problem. I've got another idea that's "the other way around":
each graph has its own table (so you can still TRUNCATE the table)
define a view (with the structure you mentioned for your redesigned table) as a UNION ALL over all graph tables
I have no idea of the performance of a SELECT on this view and so on, but it may give you what you are looking for. I'd be interested in the results if you try this out.
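A minimal sketch of such a view, using three of the graph tables named in the question and a literal graphId per branch:
CREATE VIEW dbo.AllGraphs
AS
SELECT 1 AS graphId, id1, id2, weight FROM dbo.similarityALB
UNION ALL
SELECT 2 AS graphId, id1, id2, weight FROM dbo.similaritybyArticle
UNION ALL
SELECT 3 AS graphId, id1, id2, weight FROM dbo.similaritybyCategory;
A filter like WHERE graphId = 2 should let the optimizer prune the non-matching branches at compile time, since their constant comparisons are contradictions.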
I am creating a procedure for creating temporary tables, but I want to make it more general instead of covering just one original table's temporary copy.
ALTER PROCEDURE [dbo].[CREATE_TEMP_TABLES]
AS
IF NOT EXISTS (SELECT * FROM SYS.TABLES
WHERE NAME=N'TEMP_FactAdditional' AND type ='U')
BEGIN
CREATE TABLE dbo.TEMP_FactAdditional
(
[ProductKey] [int] NOT NULL,
[CultureName] [nvarchar](50) NOT NULL,
[ProductDescription] [nvarchar](max) NOT NULL
)
END
GO
So in IF NOT EXISTS (SELECT * FROM SYS.TABLES WHERE NAME=N'TEMP_FactAdditional' AND type ='U') I see it only creates the TEMP_FactAdditional table, but I have 8 or 9 original tables, each of them different. For CREATE TABLE dbo.TEMP_FactAdditional I am thinking of adding some function or method to make it more general, but I don't have much knowledge about that. I am designing a warehouse, and my first step is to make transferring data from DB to DB more efficient, as shown in the photo. Can someone help me make the procedure more efficient?
Well, I am not going to comment on the overall approach, but if you want a general way to create tables to move data into, you could do something like this....
SELECT *
INTO my_new_table_name
FROM my_old_table_name
WHERE 1 = 0
This will create the new table with the same structure and no data. Then you could theoretically write the procedure to use dynamic SQL to run this code for a list of tables, thus "creating" a number of tables in an abstract case. You could pass in two parameters, the two table names.
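For instance, a rough sketch of the generalized procedure (the procedure and parameter names and the TEMP_ prefix are assumptions based on the question):
CREATE PROCEDURE dbo.CREATE_TEMP_TABLE
@sourceTable sysname
AS
BEGIN
DECLARE @tempTable sysname;
SET @tempTable = N'TEMP_' + @sourceTable;

-- only create the empty copy if it does not exist yet
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = @tempTable AND type = 'U')
BEGIN
DECLARE @sql nvarchar(max);
SET @sql = N'SELECT * INTO dbo.' + QUOTENAME(@tempTable)
+ N' FROM dbo.' + QUOTENAME(@sourceTable)
+ N' WHERE 1 = 0;';
EXEC sp_executesql @sql;
END
END
GO
-- one call per original table, e.g.:
EXEC dbo.CREATE_TEMP_TABLE @sourceTable = N'FactAdditional';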
This is my scenario: I have a table like this:
CREATE TABLE [MyTable]
(
[Id] BIGINT PRIMARY KEY,
[Value] NVARCHAR(100) NOT NULL,
[IndexColumnA] NVARCHAR(100) NOT NULL,
[IndexColumnB] NVARCHAR(100) NOT NULL
)
CREATE INDEX [IX_A] ON [MyTable] ([IndexColumnA])
CREATE INDEX [IX_B] ON [MyTable] ([IndexColumnB])
And I have two use cases with two different update commands:
UPDATE [MyTable] SET [Value] = '...' WHERE [IndexColumnA] = '...'
and
UPDATE [MyTable] SET [Value] = '...' WHERE [IndexColumnB] = '...'
Both update commands may update multiple rows and these commands caused a deadlock when executed concurrently.
My speculation is that the two update commands use different indexes when scanning the rows, so the order in which locks are placed on rows differs. As a result, one update command may try to place a U lock on a row which already has an X lock placed by the other update command. (I am not a database expert; correct me if I am wrong.)
One possible solution should be forcing the database to place locks in the same order. According to https://dba.stackexchange.com/questions/257217/why-am-i-getting-a-deadlock-for-a-single-update-query, it seems we can do this with SELECT ... ORDER BY ... FOR UPDATE in PostgreSQL.
Can we (and should we) do this in SQL Server? If not, is handling the deadlock in application code the only solution?
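For reference, a rough T-SQL sketch of the idea from the linked answer (take the row locks up front in a fixed key order before updating); this is only an illustration of the pattern, not a confirmed fix for this particular deadlock:
BEGIN TRANSACTION;

-- lock the target rows in a deterministic order (by Id) before updating,
-- so concurrent writers acquire their row locks in the same order
DECLARE @Ids TABLE ([Id] BIGINT PRIMARY KEY);

INSERT INTO @Ids ([Id])
SELECT [Id]
FROM [MyTable] WITH (UPDLOCK, ROWLOCK)
WHERE [IndexColumnA] = '...'
ORDER BY [Id];

UPDATE [MyTable]
SET [Value] = '...'
WHERE [Id] IN (SELECT [Id] FROM @Ids);

COMMIT TRANSACTION;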
I am using SQL Server 2016 and I have created a user-defined table type as below:
CREATE TYPE [dbo].[UDTT_Items] AS TABLE(
[ItemId] int identity(1, 1),
[ItemCode] [varchar](10) NULL,
[ItemName] [varchar](255) NULL,
[StockQty] decimal(18,3 ) NULL,
PRIMARY KEY CLUSTERED
(
[ItemId] ASC
)WITH (IGNORE_DUP_KEY = OFF)
)
GO
In my stored procedure I can create a table variable like this:
declare @tblItems UDTT_Items
I can insert data into this table variable and run select queries.
select * from @tblItems
The problem I face is when I need to put this table in dynamic SQL. For example, if I try to run the above select statement from an EXECUTE clause:
EXECUTE SP_EXECUTESQL N'select * from @tblItems'
It gives me error message:
Must declare the table variable "@tblItems".
I tried to use a temporary table (with #) inside dynamic SQL, and it works fine, but I don't know if I can create a temporary table from an already-defined user-defined table type. I need something like this:
create #tblItems UDTT_Items
But it also does not work.
Can anybody suggest a workaround for this issue, either by using a table variable in dynamic SQL or by creating a temp table from a user-defined table type?
I can think of the following workarounds to solve this using your UDTT:
1. Declare the UDTT variable within your dynamic script; you can then retrieve results from there as well:
EXECUTE SP_EXECUTESQL
N'
DECLARE @dynvariable [UDTT];
insert @dynvariable values (1);
select * from @dynvariable';
2. Pass the UDTT variable to the SP_EXECUTESQL, but then it is readonly, meaning you can only select within the dynamic script:
DECLARE @variable [UDTT];
insert @variable values (1);
EXECUTE SP_EXECUTESQL
N'select * from @dynvariable',
N'@dynvariable [UDTT] READONLY',
@dynvariable=@variable;
3. I think it's not possible to 'create a temp table from UDTT' so your approach would be to dynamically create the temp table using system information for your UDTT (columns, types, etc.).
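For example, a rough sketch of that approach for the type from the question, dbo.UDTT_Items (identity, defaults, and constraints are ignored for brevity; note that a #temp table created inside sp_executesql disappears when that dynamic batch ends, so the generated statement would normally be prepended to the rest of the dynamic script):
DECLARE @sql nvarchar(max);

SELECT @sql = N'CREATE TABLE #tblItems (' + STUFF((
SELECT N', ' + QUOTENAME(c.name) + N' ' + t.name
+ CASE WHEN t.name IN (N'varchar', N'char', N'nvarchar', N'nchar')
THEN N'(' + CASE WHEN c.max_length = -1 THEN N'max'
ELSE CAST(c.max_length / CASE WHEN t.name IN (N'nvarchar', N'nchar') THEN 2 ELSE 1 END AS nvarchar(10)) END + N')'
WHEN t.name IN (N'decimal', N'numeric')
THEN N'(' + CAST(c.precision AS nvarchar(10)) + N',' + CAST(c.scale AS nvarchar(10)) + N')'
ELSE N'' END
FROM sys.table_types tt
JOIN sys.columns c ON c.object_id = tt.type_table_object_id
JOIN sys.types t ON t.user_type_id = c.system_type_id
WHERE tt.name = N'UDTT_Items'
ORDER BY c.column_id
FOR XML PATH(N'')), 1, 2, N'') + N');';

PRINT @sql; -- e.g. CREATE TABLE #tblItems ([ItemId] int, [ItemCode] varchar(10), ...)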
4. Reading that you want to have "dynamic" pivot code, the most appropriate approach would be to dynamically generate the pivot statement based on the column info and values of the target table.
I am a bit rusty with my SQL since I have not worked with it beyond basic querying of existing databases that were already setup.
I am trying to create an event logging database, and want to take an "extreme" approach to normalization. I would have a main table comprised of mostly 'smallint' fields that point to child tables which contain strings.
Example:
I have an external system that I would like to enable some logging for via SQL. The user fills in some key parameters, which build an insert/update statement that gets pushed to the logging tables, so the values can be viewed at a later time if someone needs to know what XYZ value was at runtime, or sometime in the past.
I have a main table which consists of:
SELECT [log_id] - bigint (auto-increment) PK
,[date_time] - smalldatetime
,[cust_id] - smallint FK
,[recloc] - char(8)
,[alert_level] - smallint FK
,[header] - varchar(100)
,[body] - varchar(1000)
,[process_id] - smallint FK
,[routine_id] - smallint FK
,[workflow_id] - smallint FK
FROM [EventLogs].[dbo].[eventLogs]
All of the 'smallint' fields point to a child table which contains the expanded data:
Example:
SELECT [routine_id] PK/FK
,[routine_name]
,[description]
FROM [EventLogs].[dbo].[cpRoutine]
SELECT [process_id] PK/FK
,[process_name]
,[description]
FROM [EventLogs].[dbo].[cpProcess]
My goal here is to have the external system do an update/insert statement that reaches all these tables. I currently have all the 'smallint' fields linked up as FKs.
How do I go about crafting the update/insert statements that touch all these tables? If a child table already contains a key-value pair, I do not want to touch it. The idea of the child tables is to house repetitive data there and assign it a key in the main logging table to keep size down. Do I need to check for the existence of records in the child tables, save the index numbers, then build my insert statement for the main table? Trying to be as efficient as possible here.
Example:
I want to log the following from the external system:
- date_time - GETDATE()
- customer_number - '0123456789'
- recloc - 'ABC123'
- alert_level - 'info'
- header - 'this is a header'
- body - 'this is a body'
- process_name - 'the process'
- routine_name - 'the routine'
- workflow_name - 'the workflow'
Do I need to create my insert statement for the main table (eventLogs) but check each child table first and add missing values, then save the id for my insert statement in the main table?
Select process_id, process_name From cpProcess where process_name = 'the process'
If no values returned, do an insert statement with the process_name
Now query the table again to get the ID so I can build the "main insert statement" that feeds the master log table
Repeat for all other child tables
final insert statement looks something like:
INSERT INTO eventLogs (date_time, cust_id, recloc, alert_level, header, body, process_id, routine_id, workflow_id)
VALUES('2017-12-31', '1', 'ABC123', '3', 'this is a header', 'this is a body', '13', '19', '12')
It just seems like I am doing too much back and forth with the server, checking for values in the child tables, just to do my insert.
The end goal here is to create a friendly view that pulls in all the data assigned to the 'smallint' keys.
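For context, the friendly view I have in mind would look roughly like this (cpCustomer, cpAlertLevel, and cpWorkflow are placeholder names for the other child tables, as are their descriptive columns):
SELECT e.log_id
,e.date_time
,c.customer_number
,e.recloc
,a.alert_level_name
,e.header
,e.body
,p.process_name
,r.routine_name
,w.workflow_name
FROM dbo.eventLogs e
JOIN dbo.cpCustomer c ON c.cust_id = e.cust_id
JOIN dbo.cpAlertLevel a ON a.alert_level = e.alert_level
JOIN dbo.cpProcess p ON p.process_id = e.process_id
JOIN dbo.cpRoutine r ON r.routine_id = e.routine_id
JOIN dbo.cpWorkflow w ON w.workflow_id = e.workflow_id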
You're close:
Select process_id from cpProcess where process_name = 'the process'
If no value is returned, do an insert statement with the process_name and get the ID through IDENT_CURRENT, SCOPE_IDENTITY, or @@IDENTITY (or use a subordinate "load" procedure and get the ID from an output parameter).
Repeat for each child table until you get the values required to do your final insert into [eventLogs].
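A minimal sketch of such a subordinate load procedure for cpProcess (the procedure name is made up, process_id is assumed to be an IDENTITY column, and concurrency handling is kept simple):
CREATE PROCEDURE dbo.GetOrCreateProcess
@process_name varchar(100)
, @process_id smallint OUTPUT
AS
BEGIN
SET NOCOUNT ON;

SELECT @process_id = process_id
FROM dbo.cpProcess
WHERE process_name = @process_name;

-- insert only if the name is not there yet, then capture the new key
IF @process_id IS NULL
BEGIN
INSERT INTO dbo.cpProcess (process_name)
VALUES (@process_name);

SET @process_id = SCOPE_IDENTITY();
END
END
Call one such procedure per child table, then use the collected IDs in the single INSERT into [eventLogs].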
This works fine if it is a relatively low speed process. As you increase the speed you can have issues, but if you are doing INSERT only, as you should, it still isn't terrible. I've used SQL Server Service Broker in the past to decouple processes such as these to improve performance, but that obviously adds complexity.
Depending on the load you might also decide to build aggregate tables in a fact/dimension star so that the INSERT OLTP process is segregated from the SELECT OLAP process.
What you're seeing is the complexity involved in building a normalized data structure. Your approach of taking an "extreme" approach to normalization is often bypassed because it's "too hard". That doesn't mean you shouldn't do it, but you should weigh the ROI. In the past, where there were only going to be perhaps less than ten thousand records at any given time, I have made the decision to just dump everything into a log table such as the one below. You just have to look at the requirements and make the best choice.
CREATE TABLE [log].[data]
(
[id] INT IDENTITY(1, 1)
, [timestamp] DATETIME DEFAULT sysdatetime()
, [entry] XML NOT NULL
);
One option that I frequently use during the build out phase of a design is to build placeholders behind adapters as shown below. Use the getter and setter methods ALWAYS and later, when you need better performance or data storage, you can refactor the underlying data structure as required, modify the adapters to the new data structures, and you've saved yourself some time. Otherwise you can end up chasing a lot of rabbits down holes early in the project. Often you'll find that your design for the underlying structures changes based on requirements as the project moves forward and you'd have spent a lot of time on changes. Using this approach you get a working mechanism in place immediately.
Later on if you need to collapse this structure to provide better performance it will be trivial compared to constantly changing the structure during design (in my opinion).
Oh, and yes, you could use a standard relational table. I use a lot of XML in applications and event logging because it allows ad hoc structured data. The concept is the same. You could use your top level table, just with the [process_name], etc. columns directly in the table and no child columns for now.
Just remember you should NOT allow access to the underlying tables directly! One way to prevent this is to actually put them in a dedicated schema such as [log_secure], and secure that schema to all but admin and the accessor/mutator methods.
IF schema_id(N'log') IS NULL
EXECUTE (N'CREATE SCHEMA log');
go
IF object_id(N'[log].[data]', N'U') IS NOT NULL
DROP TABLE [log].[data];
go
CREATE TABLE [log].[data]
(
[id] BIGINT IDENTITY(1, 1)
, [timestamp] DATETIMEOFFSET NOT NULL -- DATETIME if timezone isn't needed
CONSTRAINT [log__data__timestamp__df] DEFAULT sysdatetimeoffset()
, [entry] XML NOT NULL,
CONSTRAINT [log__data__id__pk] PRIMARY KEY CLUSTERED ([id])
);
IF object_id(N'[log].[get_entry]', N'P') IS NOT NULL
DROP PROCEDURE [log].[get_entry];
go
CREATE PROCEDURE [log].[get_entry] @id BIGINT
, @entry XML output
, @begin DATETIMEOFFSET
, @end DATETIMEOFFSET
AS
BEGIN
SELECT @entry = [entry]
FROM [log].[data]
WHERE [id] = @id;
END;
go
IF object_id(N'[log].[set_entry]', N'P') IS NOT NULL
DROP PROCEDURE [log].[set_entry];
go
CREATE PROCEDURE [log].[set_entry] @entry XML
, @timestamp DATETIMEOFFSET = NULL
, @id BIGINT output
AS
BEGIN
INSERT INTO [log].[data]
([timestamp]
, [entry])
VALUES (COALESCE(@timestamp, sysdatetimeoffset()), @entry);
SET @id = SCOPE_IDENTITY();
END;
go
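To keep callers out of the tables themselves, as mentioned above, something along these lines could be added (the role name is illustrative; ownership chaining lets the procedures still reach the table):
CREATE ROLE [log_writer];

-- callers may only go through the accessor/mutator procedures
GRANT EXECUTE ON [log].[get_entry] TO [log_writer];
GRANT EXECUTE ON [log].[set_entry] TO [log_writer];
DENY SELECT, INSERT, UPDATE, DELETE ON SCHEMA::[log] TO [log_writer];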
UPDATE: This issue is not related to the XML. I duplicated the table using an nvarchar(MAX) column instead and still have the same issue. I will repost as a new topic.
I have a table with about a million records; the table has an XML field. The query runs extremely slowly, even when selecting just an ID. Is there anything I can do to increase the speed of this? I have tried setting 'text in row' on, but SQL Server will not allow me to; I receive the error "Cannot switch to in row text in table".
I would appreciate any help in a fix or knowledge that I seem to be missing.
Thanks
TABLE
/****** Object: Table [dbo].[Audit] Script Date: 08/14/2009 09:49:01 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Audit](
[ID] [int] IDENTITY(1,1) NOT NULL,
[ParoleeID] [int] NOT NULL,
[Page] [int] NOT NULL,
[ObjectID] [int] NOT NULL,
[Data] [xml] NOT NULL,
[Created] [datetime] NULL,
CONSTRAINT [PK_Audit] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
QUERY
DECLARE @ID int
SET @ID = NULL
DECLARE @ParoleeID int
SET @ParoleeID = 158
DECLARE @Page int
SET @Page = 2
DECLARE @ObjectID int
SET @ObjectID = 93
DECLARE @Created datetime
SET @Created = NULL
SET NOCOUNT ON;
Select TOP 1 [Audit].* from [Audit]
where
(@ID IS NULL OR Audit.ID = @ID) AND
(@ParoleeID IS NULL OR Audit.ParoleeID = @ParoleeID) AND
(@Page IS NULL OR Audit.Page = @Page) AND
(@ObjectID IS NULL OR Audit.ObjectID = @ObjectID) AND
(@Created IS NULL OR (Audit.Created > @Created AND Audit.Created < DATEADD(d, 1, @Created)))
You need to create a primary XML index on the column. Above anything else, having this will assist ALL your queries.
Once you have this, you can create secondary XML indexes on the XML data.
From experience though, if you can store some information in relational columns, SQL Server is much better at searching and indexing those than XML. I.e., any key columns and commonly searched data should be stored relationally where possible.
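For reference, a minimal sketch of creating the primary XML index (and an optional secondary one) on that column:
-- the primary XML index requires the clustered primary key that [Audit] already has
CREATE PRIMARY XML INDEX [PXML_Audit_Data]
ON [dbo].[Audit] ([Data]);

-- optional secondary index, useful for path-based queries
CREATE XML INDEX [SXML_Audit_Data_Path]
ON [dbo].[Audit] ([Data])
USING XML INDEX [PXML_Audit_Data]
FOR PATH;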
Sql Server 2005 – Twelve Tips For Optimizing Query Performance by Tony Wright
Turn on the execution plan, and statistics
Use Clustered Indexes
Use Indexed Views
Use Covering Indexes
Keep your clustered index small.
Avoid cursors
Archive old data
Partition your data correctly
Remove user-defined inline scalar functions
Use APPLY
Use computed columns
Use the correct transaction isolation level
http://tonesdotnetblog.wordpress.com/2008/05/26/twelve-tips-for-optimising-sql-server-2005-queries/
I had the very same scenario - and the solution in our case is computed columns.
For those bits of information that you need frequently from your XML, we created a computed column on the "hosting" table, which basically reaches into the XML and pulls out the necessary value from the XML using XPath. In most cases, we're even able to persist this computed column, so that it becomes part of the table and can be queried and even indexed and query speed is absolutely no problem anymore (on those columns).
We also tried XML indices in the beginning, but their disadvantage is the fact that they're absolutely HUGE on disk - this may or may not be a problem. Since we needed to ship back and forth the whole database frequently (as a SQL backup), we eventually gave up on them.
OK, to set up a computed column to retrieve bits of information from your XML, you first need to create a stored function, which will take the XML as a parameter, extract whatever information you need, and then pass that back - something like this:
CREATE FUNCTION dbo.GetShopOrderID(@ShopOrder XML)
RETURNS VARCHAR(100)
AS BEGIN
DECLARE @ShopOrderID VARCHAR(100)
SELECT
@ShopOrderID = @ShopOrder.value('(ActivateOrderRequest/ActivateOrder/OrderHead/OrderNumber)[1]', 'varchar(100)')
RETURN @ShopOrderID
END
Then, you'll need to add a computed column to your table and connect it to this stored function:
ALTER TABLE dbo.YourTable
ADD ShopOrderID AS dbo.GetShopOrderID(ShopOrderXML) PERSISTED
Now, you can easily select data from your table using this new column, as if it were a normal column:
SELECT (fields) FROM dbo.YourTable
WHERE ShopOrderID LIKE 'OSA%'
Best of all - whenever you update your XML, all the computed columns are updated as well - they're always in sync, no triggers or other black magic needed!
Marc
Some information like the query you run, the table structure, the XML content etc would definitely help. A lot...
Without any info, I will guess: the query is running slowly when selecting just an ID because you don't have an index on ID.
Updated
There are at least a few serious problems with your query.
Unless an ID is provided, the table can only be scanned end-to-end because there are no indexes
Even if an ID is provided, the condition (@ID IS NULL OR ID = @ID) is not guaranteed to be SARGable, so it may still result in a table scan.
And most importantly: the query will generate a plan 'optimized' for the first set of parameters it sees. It will reuse this plan for any combination of parameters, no matter which are NULL or not. That would make a difference if there were some variation in the access path to choose from (i.e. indexes), but as it is now, the query can only choose between using a scan or a seek if @ID is present. Due to the way it is constructed, it will pretty much always choose a scan because of the OR.
With this table design, your query will run slow today, slower tomorrow, and impossibly slow next week as the size increases. You must look back at your requirements, decide which fields are important to query on, index them, and provide separate queries for them. OR-ing together all possible filters like this is not going to work.
The XML you're trying to retrieve has absolutely nothing to do with the performance problem. You are simply brute-forcing a table scan and expecting SQL Server to magically find the records you want.
So if you want to retrieve a specific ParoleeID, Page, and ObjectID, you index the fields you search on and run a query for those and only those:
CREATE INDEX idx_Audit_ParoleeID ON Audit(ParoleeID);
CREATE INDEX idx_Audit_Page ON Audit(Page);
CREATE INDEX idx_Audit_ObjectID ON Audit(ObjectID);
GO
DECLARE @ParoleeID int
SET @ParoleeID = 158
DECLARE @Page int
SET @Page = 2
DECLARE @ObjectID int
SET @ObjectID = 93
SET NOCOUNT ON;
Select TOP 1 [Audit].* from [Audit]
where Audit.ParoleeID = @ParoleeID
AND Audit.Page = @Page
AND Audit.ObjectID = @ObjectID;