We are importing external Excel files into our SQL database, but we need to do some integrity checks using a SQL script.
Here is my sample data
Row_no  Student_area  Student_subject   Code
1       Science       Science and Tech  ABC
2       Science       Science and Teck  ABC
3       Arts          Pschycolgy        DEF
4       Arts          Pscycology        DEF
I need to identify the anomalies.
How do I do that?
Cheers
Oracle SQL has many neat features, but it does not exhibit human-style intelligence (yet), so it cannot identify "anomalies" in data by itself. We must declare the rules for correctness.
In your case you need to define a set of correct values, preferably as reference data tables:
create table student_area (student_area varchar2(30));
insert into student_area values ('Science');
insert into student_area values ('Arts');
create table student_subject (student_area    varchar2(30),
                              student_subject varchar2(128),
                              subject_code    varchar2(3));
insert into student_subject values ('Science', 'Science and Tech', 'ABC');
insert into student_subject values ('Arts', 'Psychology', 'DEF');
Now you are ready to evaluate the contents of your file. The easiest way to do this is to convert the Excel file to CSV and build an external table over it. This is a special type of table where the data resides in an external OS file rather than in the database; the Oracle documentation covers external tables in detail.
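A minimal sketch of such an external table, assuming the CSV keeps the four columns shown above and sits in a directory you have registered with the database (the directory path and file name are placeholders):
create directory ext_data_dir as '/path/to/csv/files';  -- hypothetical path

create table external_table (
    row_no          number,
    student_area    varchar2(30),
    student_subject varchar2(128),
    code            varchar2(3)
)
organization external (
    type oracle_loader
    default directory ext_data_dir
    access parameters (
        records delimited by newline
        skip 1
        fields terminated by ','
    )
    location ('students.csv')
);
The skip 1 clause jumps over the header row of the CSV.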
If you create an external table with a column mapped to every column in the spreadsheet you can identify the anomalies like this:
select * from external_table ext
where ext.student_area not in ( select s.student_area
from student_area s )
/
select * from external_table ext
where (ext.student_area, ext.student_subject) not in
( select s.student_area, s.student_subject
from student_subject s )
/
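Run against the sample data, the first query should return nothing (both areas are valid), while the second should flag rows 2, 3 and 4: 'Science and Teck', 'Pschycolgy' and 'Pscycology' have no match in student_subject.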
In CitusDB, I can create an empty table with:
CREATE TABLE table1 (col1 text, col2 text);
I can tell table1 how to partition the data, which will later be loaded into the table, by running this:
SELECT create_distributed_table('table1', 'col1');
At this point, I know how my table is distributed across the CitusDB nodes.
However, if I come across a new table that I didn't create, but I know it is distributed, how do I know what column the table is distributed on?
You want to use the Citus column_to_column_name function described in the Citus docs: http://docs.citusdata.com/en/v9.3/develop/api_udf.html
SELECT column_to_column_name(logicalrelid, partkey) AS dist_col_name
FROM pg_dist_partition
WHERE logicalrelid='<table>'::regclass;
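For example, for the table1 created above (distributed on col1), this should give:
SELECT column_to_column_name(logicalrelid, partkey) AS dist_col_name
FROM pg_dist_partition
WHERE logicalrelid='table1'::regclass;

 dist_col_name
---------------
 col1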
I have a very large DataTable object which I need to import from a client into an MS SQL Server database via ODBC.
The original DataTable has two columns:
* First column is the Office Location (quite a long string)
* Second column is a booking value (integer)
Now I am looking for the most efficient way to insert this data into the external SQL Server. My goal is to replace each office location automatically with an index instead of the full string, because each location occurs VERY often in the initial table.
Is this possible via a trigger or via a view on the SQL Server?
In the end I want to insert the data without transforming it in my script, because that is very slow for this amount of data, and leave the optimization to SQL Server.
I expect that if I INSERT the data including the office location, SQL Server looks up the index for an already imported location and then uses just that index. And if the location does not exist yet in the index table / view, it should create a new entry there and then use the new index.
Here a sample of the data I need to import via ODBC into the SQL-Server:
OfficeLocation | BookingValue
EU-Germany-Hamburg-Ostend1 | 12
EU-Germany-Hamburg-Ostend1 | 23
EU-Germany-Hamburg-Ostend1 | 34
EU-France-Paris-Eifeltower | 42
EU-France-Paris-Eifeltower | 53
EU-France-Paris-Eifeltower | 12
What I need on the SQL Server as a result is something like these 2 tables:
OId | BookingValue        OfficeLocation             | OId
  1 | 12                  EU-Germany-Hamburg-Ostend1 |   1
  1 | 23                  EU-France-Paris-Eifeltower |   2
  1 | 34
  2 | 42
  2 | 53
  2 | 12
My initial idea was to write the data into a temp table and have something like an intelligent TRIGGER (or a VIEW?) react to any INSERT into this table, creating the 2 desired (optimized) tables.
Any hints are more than welcome!
Yes, you can create a view with an INSERT trigger to handle this. Something like:
CREATE TABLE dbo.Locations (
OId int IDENTITY(1,1) not null PRIMARY KEY,
OfficeLocation varchar(500) not null UNIQUE
)
GO
CREATE TABLE dbo.Bookings (
OId int not null,
BookingValue int not null
)
GO
CREATE VIEW dbo.CombinedBookings
WITH SCHEMABINDING
AS
SELECT
OfficeLocation,
BookingValue
FROM
dbo.Bookings b
INNER JOIN
dbo.Locations l
ON
b.OId = l.OId
GO
CREATE TRIGGER CombinedBookings_Insert
ON dbo.CombinedBookings
INSTEAD OF INSERT
AS
    -- First add any locations that are not yet in the lookup table
    INSERT INTO Locations (OfficeLocation)
    SELECT OfficeLocation
    FROM inserted
    WHERE OfficeLocation NOT IN (SELECT OfficeLocation FROM Locations)

    -- Then store the bookings, translating each location to its OId
    INSERT INTO Bookings (OId, BookingValue)
    SELECT l.OId, i.BookingValue
    FROM inserted i
    INNER JOIN Locations l
        ON i.OfficeLocation = l.OfficeLocation
As you can see, we first add to the locations table any missing locations and then populate the bookings table.
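With that in place, the client can insert the original two-column rows straight into the view, for example:
INSERT INTO dbo.CombinedBookings (OfficeLocation, BookingValue)
VALUES ('EU-Germany-Hamburg-Ostend1', 12),
       ('EU-France-Paris-Eifeltower', 42);
The trigger creates the two location rows, and the bookings land in dbo.Bookings with the corresponding ids.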
A similar trigger can cope with Updates. I'd generally let the Locations table just grow and not attempt to clean it up (for no longer referenced locations) with triggers. If growth is a concern, a periodic job will usually be good enough.
Be aware that some tools (such as bulk inserts) may not invoke triggers, so those will not be usable with the above view.
Is it possible to write a script in HANA that creates a temporary table based on an existing table (without having to hard-code the columns and column types)? Instead of:
create local temporary table #mytemp (id integer, name varchar(20));
can I create a temporary table with the same column definitions that contains the same data? If so, I'll be glad to get some examples.
I have been searching the internet for 2 days and couldn't find anything useful.
Thanks
Creating local temporary tables based on dynamic structure definition is not supported in SQLScript.
The question would be: what do you want to use it for?
Instead of a local temporary table, you can use a table variable in most cases.
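A minimal sketch of the table-variable approach (the source table name is a placeholder, and anonymous blocks via DO BEGIN need a reasonably recent HANA); the variable takes its structure from the query result, so nothing has to be declared:
DO BEGIN
    -- the variable's columns and types are inferred from the SELECT
    tab_var = SELECT * FROM "MYSOURCETABLE";
    SELECT COUNT(*) FROM :tab_var;
END;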
By querying the sys.table_columns view, you can get the list and properties of the source table's columns, build a dynamic CREATE script, and then execute it to create the table.
You can find SQL codes for a sample case at Create Table Dynamically on HANA Database
For the table's columns, read:
select * from sys.table_columns where table_name = 'TABLENAME';
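A rough sketch of that idea (the table names are placeholders, and STRING_AGG and dynamic SQL via EXEC are assumed to be available in your HANA version; the length handling only covers character types):
DO BEGIN
    DECLARE lv_ddl NVARCHAR(5000);
    -- assemble a CREATE statement from the source table's column metadata
    SELECT 'CREATE LOCAL TEMPORARY TABLE #MYTEMP ('
           || STRING_AGG(column_name || ' ' || data_type_name
                         || CASE WHEN data_type_name IN ('VARCHAR', 'NVARCHAR')
                                 THEN '(' || TO_VARCHAR(length) || ')'
                                 ELSE '' END, ', ' ORDER BY position)
           || ')'
      INTO lv_ddl
      FROM sys.table_columns
     WHERE table_name = 'TABLENAME';
    EXEC :lv_ddl;
END;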
This seems to work in the HANA version I have (I'm not sure how to find out which version that is).
PROCEDURE "xxx.yyy.zzz::MY_TEST"(
    OUT "OUT_COL" NVARCHAR(200)
)
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER
AS
BEGIN
    CREATE LOCAL TEMPORARY TABLE #LOCALTEMPTABLE AS
    (
        SELECT DISTINCT 'Cola' AS out_col
        FROM "SYNONYMS1"
    );
    SELECT * FROM #LOCALTEMPTABLE;
    DROP TABLE #LOCALTEMPTABLE;
END
The newer HANA version (HANA 2 SPS 04 Patch 5, Build 4.4.17) supports your request:
create local temporary table #tempTableName like "tableTypeName";
This should inherit the data types and all exact values from whatever query is in the parentheses:
CREATE LOCAL TEMPORARY COLUMN TABLE #mytemp AS (
SELECT
"COLUMN1",
"COLUMN2",
"COLUMN3"
FROM MyTable
);
-- Now you can add the rest of your query here as such:
SELECT * FROM #mytemp
I suppose you can just write:
create local temporary column table #MyTempTable as ( select * from MySourceTable);
BR,
I am very new to SQL and SQL server, would appreciate any help with the following problem.
I am trying to update a share price table with new prices.
The table has three columns: share code, date, price.
The share code + date = PK
As you can imagine, if you have thousands of share codes and 10 years' worth of data for each, the table can get very big. So I have created a separate table, called a share ID table, and use a share ID instead in the first table (I was reliably informed this would speed up the query, as searching by an integer is faster than by a string).
So, to summarise, I have two tables as follows:
Table 1 = Share_code_ID (int), Date, Price
Table 2 = Share_code_ID (int), Share_name (string)
So let's say I want to update the table/s with today's price for share ZZZ. I need to:
* Look for the Share_code_ID corresponding to 'ZZZ' in table 2
* If it is found, update table 1 with the new price for that date, using the Share_code_ID I just found
* If the Share_code_ID is not found, update both tables
Let's ignore for now how the Share_code_ID is generated for a new code, I'll worry about that later.
I'm trying to use a merge query loosely based on the following structure, but have no idea what I am doing:
MERGE INTO [Table 1]
USING (VALUES (1,23-May-2013,1000)) AS SOURCE (Share_code_ID,Date,Price)
{ SEEMS LIKE THERE SHOULD BE AN INNER JOIN HERE OR SOMETHING }
ON Table 2 = 'ZZZ'
WHEN MATCHED THEN UPDATE SET Table 1.Price = 1000
WHEN NOT MATCHED THEN INSERT { TO BOTH TABLES }
Any help would be appreciated.
http://msdn.microsoft.com/library/bb510625(v=sql.100).aspx
You use Table1 as the target table and Table2 as the source table.
You want to take action when a given ID is not found in Table2, the source table.
In the documentation that you have already read, that corresponds to the clause
WHEN NOT MATCHED BY SOURCE ... THEN <merge_matched>
and the latter corresponds to
<merge_matched>::=
{ UPDATE SET <set_clause> | DELETE }
Ergo, you cannot insert into the source table there.
You could use triggers for auto-insertion when you insert something into Table1, but that will not be able to insert the proper Share_name - the trigger just won't know it.
So you have two options, I guess.
1) Make a T-SQL code block - look into stored procedures. I think there is also a construct for executing an anonymous code block in MS SQL, like the EXECUTE BLOCK command in Firebird SQL Server, but I don't know that for sure. (A sketch of this approach follows after this list.)
2) Create an updatable SQL VIEW joining Table1 and Table2 to show the most current data, so that when you insert a row into this view, the view's on-insert trigger actually inserts rows into both tables, and when you update data in the view, the on-update trigger modifies the data.
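For option 1, here is a minimal sketch of such a code block. It assumes the question's table names, and it assumes Table 2's Share_code_ID is an IDENTITY column - the question deliberately left ID generation open, so treat that part as a placeholder:
DECLARE @code  varchar(10) = 'ZZZ',
        @date  date        = '2013-05-23',
        @price money       = 1000,
        @id    int;

-- Step 1: look for the share code in the lookup table
SELECT @id = Share_code_ID FROM [Table 2] WHERE Share_name = @code;

-- Step 2: if it is missing, create it (assumes an IDENTITY column)
IF @id IS NULL
BEGIN
    INSERT INTO [Table 2] (Share_name) VALUES (@code);
    SET @id = SCOPE_IDENTITY();
END;

-- Step 3: upsert the price row for that share and date
MERGE INTO [Table 1] AS t
USING (VALUES (@id, @date, @price)) AS s (Share_code_ID, [Date], Price)
    ON t.Share_code_ID = s.Share_code_ID AND t.[Date] = s.[Date]
WHEN MATCHED THEN
    UPDATE SET t.Price = s.Price
WHEN NOT MATCHED THEN
    INSERT (Share_code_ID, [Date], Price)
    VALUES (s.Share_code_ID, s.[Date], s.Price);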
I have a database infrastructure where we regularly (at least once a day) replicate the full content of tables from a source database to approximately 20 target databases. Due to the replication code in use (we have to use regular Oracle queries, with no control over or direct access to the source database), this results in 20 full-table sorts of the source table.
Is there any way to optimize for this in the query? I'm looking for something that would basically tell Oracle "I'm going to be repeatedly sorting this entire table". MySQL had an option with myisamchk where you could tell it to sort a table and keep it in sorted order, but obviously that wouldn't apply here, for multiple reasons.
Currently, there are also some intermediate tables involved (sync from A to B, then from B to C.) We do have control over the intermediate tables, so if there are tuning options there, that would be useful as well.
Generally, the queries are almost all of the very simplistic form:
select a, b, c, d, e, ... z from tbl1 order by a, b, c, d, e, ... z;
I'm aware of streams, but as described above, the primary source tables are outside of our control, so we won't be able to use streams there. (Additionally, those source tables are rebuilt completely from a snapshot daily, so streams wouldn't really work anyway.)
You could look into the multi-table INSERT feature. It should perform a single FULL SCAN and insert into multiple tables. Consider (10gR2):
SQL> CREATE TABLE t1 (ID NUMBER);
Table created
SQL> CREATE TABLE t2 (ID NUMBER);
Table created
SQL> INSERT ALL
2 INTO t1 VALUES (d_id)
3 INTO t2 VALUES (d_id)
4 /* your select goes here */
5 SELECT ROWNUM d_id FROM dual d CONNECT BY LEVEL <= 5;
10 rows inserted
SQL> SELECT COUNT(*) FROM t1;
COUNT(*)
----------
5
SQL> SELECT COUNT(*) FROM t2;
COUNT(*)
----------
5
You will have to check if it works over database links.
One thing that would help the sorting issue is to have indexes on the columns that you are sorting on (and also joining the tables on, if they are not there already). You could also create materialized views which are already sorted, and Oracle would keep the sorted results cached.
You don't say exactly how the replication is done or the data volumes involved (or why you are sorting the data).
If the aim is to minimise the impact on the source database, your best bet may be to extract into an intermediate file and load the file into the destination databases. The sort could be done on the intermediate file (if plain text), or as part of either the export or import into the destination databases.
In the source database:
create table export_emp_info
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
) as select emp_id, emp_name, dept_id from emp order by dept_id
/
Copy the file over, then import it in the destination database:
create table import_emp_info
(EMP_ID NUMBER(12),
EMP_NAME VARCHAR2(100),
DEPT_ID NUMBER)
organization external
( type oracle_datapump
default directory DATA_PUMP_DIR
location ('emp.dmp')
)
/
insert into emp_info select * from import_emp_info;
If you don't want or can't have the external table on the source db, you can use a straight expdp of the emp table (possibly using NETWORK_LINK if you have limited access to the source database directory structure) and QUERY to do the ordering.
You could load data from source table A to an intermediate table B and then do a partition exchange between B and destination table C. Exact replication, no sorting involved.
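A minimal sketch of that exchange, assuming destination table C is partitioned and intermediate table B matches the partition's structure (all names here are placeholders):
alter table c_dest_table
    exchange partition p_current with table b_intermediate
    including indexes without validation;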
This I/U/D form of replication is what the MERGE command is there for. It's very doubtful that an expensive sort-merge would be required, and I'd expect to see hash joins instead. As long as the hash table can be stored in memory the hash join is barely more expensive than scanning the tables.
A handy optimisation is to store a hash value based on the non-key attributes, so that you can join between source and target tables on the key column(s) and compare small hash values instead of the full set of columns - change detection made easy.
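A sketch of that idea with made-up table and column names, using ORA_HASH over the concatenated non-key columns (the delimiter guards against values running together) and a stored row_hash column on the target:
merge into target_tbl t
using (select key_col, col_a, col_b,
              ora_hash(col_a || '|' || col_b) as row_hash
       from   source_tbl) s
on (t.key_col = s.key_col)
when matched then
    -- only touch rows whose non-key attributes actually changed
    update set t.col_a = s.col_a, t.col_b = s.col_b, t.row_hash = s.row_hash
    where t.row_hash <> s.row_hash
when not matched then
    insert (key_col, col_a, col_b, row_hash)
    values (s.key_col, s.col_a, s.col_b, s.row_hash);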