Oracle Data Masking using random names from a temp table

We need to mask some Personally Identifiable Information in our Oracle 10g database. The process I'm using is based on another masking script that we are using for Sybase (which works fine), but since the information in the Oracle and Sybase databases is quite different, I've hit a bit of a roadblock.
The process is to select all data out of the PERSON table, into a PERSON_TRANSFER table. We then use a random number to select a random name from the PERSON_TRANSFER table, and then update the PERSON table with that random name. This works fine in Sybase because there is only one row per person in the PERSON table.
The issue I've encountered is that in the Oracle DB, there are multiple rows per PERSON, and the name may or may not be different for each row, e.g.
PERSON
| PERSON_ID | SURNAME |
|-----------|---------|
| 1         | Purple  |
| 1         | Purple  |
| 1         | Pink    | <--
| 2         | Gray    |
| 2         | Blue    | <--
| 3         | Black   |
| 3         | Black   |
PERSON_TRANSFER is a copy of this table. The real table runs to millions of rows, so I'm just giving a very basic example here :)
The logic I'm currently using would just update all rows to be the same for that PERSON_ID, e.g.
PERSON
| PERSON_ID | SURNAME |
|-----------|---------|
| 1         | Brown   |
| 1         | Brown   |
| 1         | Brown   | <--
| 2         | White   |
| 2         | White   | <--
| 3         | Red     |
| 3         | Red     |
But this is incorrect as the name that is different for that PERSON_ID needs to be masked differently, e.g.
PERSON
| PERSON_ID | SURNAME |
|-----------|---------|
| 1         | Brown   |
| 1         | Brown   |
| 1         | Yellow  | <--
| 2         | White   |
| 2         | Green   | <--
| 3         | Red     |
| 3         | Red     |
How do I get the script to update the distinct names separately, rather than just update them all based on the PERSON_ID? My script currently looks like this:
DECLARE
    v_SURNAME VARCHAR2(30);
BEGIN
    select pt.SURNAME
      into v_SURNAME
      from PERSON_TRANSFER pt
     where pt.PERSON_ID = (SELECT PERSON_ID
                             FROM (SELECT PERSON_ID
                                     FROM PERSON_TRANSFER
                                    ORDER BY dbms_random.value)
                            WHERE rownum = 1);
END;
This fails with a TOO_MANY_ROWS error, because the randomly chosen PERSON_ID matches multiple rows in PERSON_TRANSFER, so the SELECT ... INTO returns more than one row.
1) Is there a more efficient way to update the PERSON table so that names are randomly assigned?
2) How do I ensure that the PERSON table is masked correctly, in that the various surnames are kept distinct (or the same, if they are all the same) for any single PERSON_ID?
I'm hoping this is enough information. I've simplified it a fair bit (the table has a lot more columns, such as First Name, DOB, TFN, etc.) in the hope that it makes the explanation easier.
Any input/advice/help would be greatly appreciated :)
Thanks.

One of the complications is that the same surname may appear under different person_ids in the PERSON table. You may be better off using a separate, auxiliary table holding surnames that are distinct (for example, you can populate it by selecting distinct surnames from PERSONS).
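For illustration, a minimal sketch of seeding such a helper table (here seeded from PERSONS itself, per the suggestion above; in practice you would load it from an external list of fake surnames):
create table mask_names_seed as
    select distinct surname
    from persons;
In the examples below I use a small, hand-populated mask_names table instead.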
Setup:
create table persons (person_id, surname) as (
select 1, 'Purple' from dual union all
select 1, 'Purple' from dual union all
select 1, 'Pink' from dual union all
select 2, 'Gray' from dual union all
select 2, 'Blue' from dual union all
select 3, 'Black' from dual union all
select 3, 'Black' from dual
);
create table mask_names (person_id, surname) as (
select 1, 'Apple' from dual union all
select 2, 'Banana' from dual union all
select 3, 'Grape' from dual union all
select 4, 'Orange' from dual union all
select 5, 'Pear' from dual union all
select 6, 'Plum' from dual
);
commit;
CTAS to create PERSON_TRANSFER:
create table person_transfer (person_id, surname) as (
    select ranked.person_id, rand.surname
    from ( -- rank the original surnames; equal surnames share a rank
           select person_id, surname,
                  dense_rank() over (order by surname) as rk
           from persons
         ) ranked
         inner join
         ( -- number the mask names in random order
           select surname, row_number() over (order by dbms_random.value()) as rnd
           from mask_names
         ) rand
         on ranked.rk = rand.rnd
);
commit;
Outcome:
SQL> select * from person_transfer order by person_id, surname;
PERSON_ID SURNAME
---------- -------
1 Pear
1 Pear
1 Plum
2 Banana
2 Grape
3 Apple
3 Apple
Added at the OP's request: the scope has been extended - the requirement now is to update SURNAME in the original table (PERSONS). This is best done with a MERGE statement and the join (sub)query demonstrated earlier. It works best when the PERSONS table has a PK, and indeed the OP said the real-life PERSONS table has one, made up of the PERSON_ID column and an additional column, DATE_FROM. In the script below, I drop PERSONS and recreate it to include this additional column. Then I show the query and the result.
Note - a mask_names table is still needed. A tempting alternative would be to just shuffle the surnames already present in persons, so there would be no need for a "helper" table. Alas, that won't work. Consider a trivial case where persons has only one row: to obfuscate its surname, one MUST come up with a surname not in the original table. More interestingly, suppose every person_id has exactly two rows with distinct surnames, but those surnames in every case are 'John' and 'Mary'. It doesn't help to just shuffle those two names. One does need a "helper" table like mask_names.
New setup:
drop table persons;
create table persons (person_id, date_from, surname) as (
select 1, date '2016-01-04', 'Purple' from dual union all
select 1, date '2016-01-20', 'Purple' from dual union all
select 1, date '2016-03-20', 'Pink' from dual union all
select 2, date '2016-01-24', 'Gray' from dual union all
select 2, date '2016-03-21', 'Blue' from dual union all
select 3, date '2016-04-02', 'Black' from dual union all
select 3, date '2016-02-13', 'Black' from dual
);
commit;
select * from persons;
PERSON_ID DATE_FROM SURNAME
---------- ---------- -------
1 2016-01-04 Purple
1 2016-01-20 Purple
1 2016-03-20 Pink
2 2016-01-24 Gray
2 2016-03-21 Blue
3 2016-04-02 Black
3 2016-02-13 Black
7 rows selected.
New query and result:
merge into persons p
using (
select ranked.person_id, ranked.date_from, rand.surname
from (
select person_id, date_from, surname,
dense_rank() over (order by surname) as rk
from persons
) ranked
inner join (
select surname, row_number() over (order by dbms_random.value()) as rnd
from mask_names
) rand
on ranked.rk = rand.rnd
) t
on (p.person_id = t.person_id and p.date_from = t.date_from)
when matched then update
set p.surname = t.surname;
commit;
select * from persons;
PERSON_ID DATE_FROM SURNAME
---------- ---------- -------
1 2016-01-04 Apple
1 2016-01-20 Apple
1 2016-03-20 Orange
2 2016-01-24 Plum
2 2016-03-21 Grape
3 2016-04-02 Banana
3 2016-02-13 Banana
7 rows selected.
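As a quick sanity check (my sketch, not part of the original answer), you can confirm that masking preserved per-person distinctness: the count of distinct surnames per PERSON_ID should be the same before and after the MERGE.
select person_id, count(distinct surname) as distinct_surnames
from persons
group by person_id
order by person_id;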

Related

Using STRING_SPLIT for 2 columns in a single table

I've started from a table like this:
| ID | City                                | Sales     |
|----|-------------------------------------|-----------|
| 1  | London,New York,Paris,Berlin,Madrid | 20,30,,50 |
| 2  | Istanbul,Tokyo,Brussels             | 4,5,6     |
There can be an unlimited amount of cities and/or sales.
I need to get each city and its sales amount into its own record. So my result should look something like this:
| ID | City     | Sales |
|----|----------|-------|
| 1  | London   | 20    |
| 1  | New York | 30    |
| 1  | Paris    |       |
| 1  | Berlin   | 50    |
| 1  | Madrid   |       |
| 2  | Istanbul | 4     |
| 2  | Tokyo    | 5     |
| 2  | Brussels | 6     |
What I got so far is
SELECT ID, splitC.Value, splitS.Value
FROM Table
CROSS APPLY STRING_SPLIT(Table.City,',') splitC
CROSS APPLY STRING_SPLIT(Table.Sales,',') splitS
With one cross apply, this works perfectly. But when executing the query with a second one, it starts to multiply the number of records a lot (which makes sense I think, because it's trying to split the sales for each city again).
What would be an option to solve this issue? STRING_SPLIT is not necessary, it's just how I started on it.
STRING_SPLIT() is not an option, because (as mentioned in the documentation) the output rows may be in any order; the order is not guaranteed to match the order of the substrings in the input string.
But you may try a JSON-based approach, using OPENJSON() and a string transformation (the comma-separated values are turned into a valid JSON array - London,New York,Paris,Berlin,Madrid becomes ["London","New York","Paris","Berlin","Madrid"]). The result of OPENJSON() with the default schema is a table with columns key, value and type, where the key column holds the 0-based index of each item in the array:
Table:
CREATE TABLE Data (
ID int,
City varchar(1000),
Sales varchar(1000)
)
INSERT INTO Data
(ID, City, Sales)
VALUES
(1, 'London,New York,Paris,Berlin,Madrid', '20,30,,50'),
(2, 'Istanbul,Tokyo,Brussels', '4,5,6')
Statement:
SELECT d.ID, a.City, a.Sales
FROM Data d
CROSS APPLY (
SELECT c.[value] AS City, s.[value] AS Sales
FROM OPENJSON(CONCAT('["', REPLACE(d.City, ',', '","'), '"]')) c
LEFT OUTER JOIN OPENJSON(CONCAT('["', REPLACE(d.Sales, ',', '","'), '"]')) s
ON c.[key] = s.[key]
) a
Result:
ID  City      Sales
1   London    20
1   New York  30
1   Paris
1   Berlin    50
1   Madrid    NULL
2   Istanbul  4
2   Tokyo     5
2   Brussels  6
Note that Paris gets an empty string (the double comma in '20,30,,50' becomes an empty array element), while Madrid falls past the end of the four-element Sales array, so the LEFT OUTER JOIN returns NULL.
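One caveat (my addition, not from the original answer): if City or Sales values could themselves contain double quotes or backslashes, the plain REPLACE() would produce invalid JSON. On SQL Server 2016+ you can escape the string first with STRING_ESCAPE(), e.g.:
SELECT d.ID,
       CONCAT('["', REPLACE(STRING_ESCAPE(d.City, 'json'), ',', '","'), '"]') AS CityJson
FROM Data d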
STRING_SPLIT has no concept of ordinal positions. In fact, the documentation specifically states that it doesn't care about them:
The order of the output may vary as the order is not guaranteed to match the order of the substrings in the input string.
As a result, you need to use something that is aware of such basic things, such as DelimitedSplit8k_LEAD.
Then you can do something like this:
WITH Cities AS(
    SELECT ID,
           DSc.Item,
           DSc.ItemNumber
    FROM dbo.YourTable YT
    CROSS APPLY dbo.DelimitedSplit8k_LEAD(YT.City,',') DSc),
Sales AS(
    SELECT ID,
           DSs.Item,
           DSs.ItemNumber
    FROM dbo.YourTable YT
    CROSS APPLY dbo.DelimitedSplit8k_LEAD(YT.Sales,',') DSs)
SELECT ISNULL(C.ID,S.ID) AS ID,
       C.Item AS City,
       S.Item AS Sale
FROM Cities C
     FULL OUTER JOIN Sales S ON C.ID = S.ID
                            AND C.ItemNumber = S.ItemNumber;
Of course, the real solution is to fix your design. This type of design will only cause you hundreds of problems in the future. Fix it now, not later; the sooner you do it, the more rewards you'll reap.
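As a sketch of what a normalized design could look like (table and column names here are illustrative, not from the original post): one row per city/amount pair instead of two parallel CSV columns.
CREATE TABLE CitySales (
    ID int NOT NULL,        -- references the parent row
    Ordinal int NOT NULL,   -- preserves the position in the original list
    City varchar(100) NOT NULL,
    Sales int NULL,         -- NULL where the amount is missing
    PRIMARY KEY (ID, Ordinal)
);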

Split two columns in one row

I have a table with two columns: Salary and Department_id
| Salary | Department_id |
|--------|---------------|
| 1000   | 10            |
| 2000   | 90            |
| 3000   | 10            |
| 4000   | 90            |
Now I need to pivot these columns into one row, calculating the sum of salary for every department.
Output:
| Dep10 | Dep90 |
|-------|-------|
| 4000  | 6000  |
NOTE: "Dep10" and "Dep90" are aliases.
I tried to use DECODE or CASE:
SELECT DECODE(department_id, 10, SUM(salary), NULL) AS "Dep10",
       DECODE(department_id, 90, SUM(salary), NULL) AS "Dep90"
FROM employees
GROUP BY department_id
but I obtain one row per department, with NULL in the other column, instead of a single combined row.

Use conditional aggregation - without a GROUP BY, the whole table collapses into one row:
select
    sum(case when Department_id = 10 then Salary end) as Dep10,
    sum(case when Department_id = 90 then Salary end) as Dep90
from employees
Use PIVOT:
Oracle Setup:
CREATE TABLE test_data ( Salary, Department_id ) AS
SELECT 1000, 10 FROM DUAL UNION ALL
SELECT 2000, 90 FROM DUAL UNION ALL
SELECT 3000, 10 FROM DUAL UNION ALL
SELECT 4000, 90 FROM DUAL
Query:
SELECT *
FROM test_data
PIVOT ( SUM( salary ) FOR Department_id IN ( 10 AS Dep10, 90 AS Dep90 ) )
Output:
DEP10 | DEP90
----: | ----:
4000 | 6000
I think you should:
1 - use a GROUP BY clause on your first table;
2 - use the PIVOT feature - in a few words, it lets you transpose rows and columns.
Good luck!
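A brief sketch of those two steps combined (my example, reusing the employees table from above): aggregate first, then pivot the aggregated rows.
SELECT *
FROM (
    SELECT Department_id, SUM(Salary) AS dept_total
    FROM employees
    GROUP BY Department_id
)
PIVOT ( MAX(dept_total) FOR Department_id IN ( 10 AS Dep10, 90 AS Dep90 ) )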

Which aggregate function to use in the following pivot clause?

I have a table Players which has two columns: Name and Sport_Played.
Sample data would be like:
| Name    | Sport_Played |
|---------|--------------|
| Ravi    | Cricket      |
| Raju    | Cricket      |
| Ronaldo | Football     |
| Messi   | Football     |
| Anand   | Chess        |
I want to pivot the table so that the columns are the sports played, and each column contains the names of the players of that sport, sorted ascending:
| Cricket | Football | Chess |
|---------|----------|-------|
| Raju    | Messi    | Anand |
| Ravi    | Ronaldo  | Null  |
The problem is that PIVOT requires an aggregate function. Which aggregate function should I use to display the names of the players under each sport's column? Thanks.
Without an example of how you want your output, it is difficult to know what you are intending, but:
having columns as sport played and the columns should contain the names of players sorted ascendingly
You do not need to use PIVOT, you can use LISTAGG:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE players ( Name, Sport_played ) AS
SELECT 'Ravi', 'Cricket' FROM DUAL UNION ALL
SELECT 'Raju', 'Cricket' FROM DUAL UNION ALL
SELECT 'Ronaldo', 'Football' FROM DUAL UNION ALL
SELECT 'Messi', 'Football' FROM DUAL UNION ALL
SELECT 'Anand', 'Chess' FROM DUAL;
Query 1:
SELECT sport_played,
LISTAGG( name, ',' ) WITHIN GROUP ( ORDER BY name ) As names
FROM players
GROUP BY sport_played
Results:
| SPORT_PLAYED | NAMES |
|--------------|---------------|
| Chess | Anand |
| Cricket | Raju,Ravi |
| Football | Messi,Ronaldo |
Update:
SQL Fiddle
Oracle 11g R2 Schema Setup: same as above.
Query 1:
SELECT *
FROM ( SELECT p.*,
ROW_NUMBER() OVER ( PARTITION BY Sport_played
ORDER BY name ) AS rn
FROM players p )
PIVOT (
MAX( Name )
FOR Sport_Played IN (
'Cricket' As Cricket,
'Football' As Football,
'Chess' AS Chess
)
)
Results:
| RN | CRICKET | FOOTBALL | CHESS |
|----|---------|----------|--------|
| 1 | Raju | Messi | Anand |
| 2 | Ravi | Ronaldo | (null) |
You can use any (string) aggregation function in the PIVOT, including MAX(name), MIN(name) or even LISTAGG( name, ',' ) WITHIN GROUP ( ORDER BY Name ). The ROW_NUMBER() analytic function generates a unique number per sport, so each aggregation group only ever contains a single value, and it therefore does not matter which aggregation function is used.
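For example, a minimal variant of the pivot above using LISTAGG instead of MAX (my sketch; since each rn/sport group holds exactly one name, the output is identical):
SELECT *
FROM ( SELECT p.*,
              ROW_NUMBER() OVER ( PARTITION BY Sport_played
                                  ORDER BY name ) AS rn
       FROM players p )
PIVOT (
  LISTAGG( Name, ',' ) WITHIN GROUP ( ORDER BY Name )
  FOR Sport_Played IN (
    'Cricket' As Cricket,
    'Football' As Football,
    'Chess' AS Chess
  )
)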

Using an INSERT query where one column comes from another table

With these tables:
Table1
| Date | Barcode | Sold | AmountSold |
|------|---------|------|------------|

Table2
| Barcode | Description | RetailPrice |
|---------|-------------|-------------|
| 00001   | Item1       | 1.00        |
| 00002   | Item2       | 2.00        |
| 00003   | Item3       | 3.00        |
| 00004   | Item4       | 4.00        |
| 00005   | Item5       | 5.00        |
Is there a way to use an INSERT into Table1, like this:
INSERT INTO dbo.Table1
VALUES ('07/11/2017', '00003', 5, (? * 5))
with the ? being the RetailPrice (which is 3.00) of barcode 00003 from Table2, multiplied by Sold (which is 5)?
I have stumbled upon INSERT INTO ... SELECT, but I understood it to require that every inserted column take a matching value from the SELECT, which is not what I need.
Note: the first three values will come from an external source, so the 4th value is the only one that needs to come from another table.
I can of course run another query first to get the RetailPrice before inserting, but I'm avoiding that to reduce loading time.
I believe that you are after something like this:
INSERT INTO dbo.Table1 (Date, Barcode, Sold, AmountSold)
SELECT '07/11/2017', '00003', 5, 5 * RetailPrice
FROM Table2
WHERE Barcode = '00003' -- without this filter, one row is inserted per Table2 row
Alternatively, a scalar subquery keeps the insert to a single row:
INSERT INTO dbo.table1
VALUES ('07/11/2017', '00003', 5,
        ((SELECT RetailPrice
          FROM dbo.table2
          WHERE dbo.table2.Barcode = '00003') * 5))

How do I perform the following multi-layered pivot with TSQL in Access 2010?

I have looked at the following relevant posts:
How to create a PivotTable in Transact/SQL?
SQL Server query - Selecting COUNT(*) with DISTINCT
SQL query to get field value distribution
Desire: To have the data change from State #1 to State #2.
Data: This data is a collection of the year(s) in which a person (identified by their PersonID) has been recorded performing a certain activity, at a certain place.
My data currently looks as follows:
State #1
| Row | Year | PlaceID | ActivityID | PersonID |
|-----|------|---------|------------|----------|
| 001 | 2011 | Park    | Read       | 201a     |
| 002 | 2011 | Library | Read       | 202b     |
| 003 | 2012 | Library | Read       | 202b     |
| 004 | 2013 | Library | Read       | 202b     |
| 005 | 2013 | Museum  | Read       | 202b     |
| 006 | 2011 | Park    | Read       | 203c     |
| 006 | 2010 | Library | Read       | 203c     |
| 007 | 2012 | Library | Read       | 204d     |
| 008 | 2014 | Library | Read       | 204d     |
Edit (4/2/2014): I decided that I want State #2 to just be distinct counts.
What I want my data to look like:
State #2
| Row | PlaceID | Column1 | Column2 | Column3 |
|-----|---------|---------|---------|---------|
| 001 | Park    | 2       |         |         |
| 002 | Library | 1       | 1       | 1       |
| 003 | Museum  | 1       |         |         |
Where:
Column1: The count of the number of people that attended the PlaceID to read on only one year.
Column2: The count of the number of people that attended the PlaceID to read on two different years.
Column3: The count of the number of people that attended the PlaceID to read on three different years.
In the State #2 schema, a person cannot be counted in more than one column for each row (place). If a person reads at a particular place for 2010, 2011, 2012, they appear in Row 001, Column3 only. However, that person can appear in other rows, but once again, in only one column of that row.
My methodology (please correct me if I am doing this wrong):
I believe the first step is to extract, for each person and place, the distinct count of years in which that person attended the place to perform the activity of interest.
As such, this is where I am with the T-SQL:
SELECT
PlaceID
,PersonID
,[ActivityID]
,COUNT(DISTINCT [Year]) AS UNIQUE_YEAR_COUNT
FROM (
SELECT
Year
,PlaceID
,ActivityID
,PersonID
FROM [my].[linkeddatabasetable]
WHERE ActivityID = 'Read') t1
GROUP BY
PlaceID
,PersonID
,[ActivityID]
ORDER BY 1,2
Unfortunately, I do not know where to take it from here.
I think you have two options.
A traditional pivot:
select placeID
     , Column1 = [1]
     , Column2 = [2]
     , Column3 = [3]
from
(
    SELECT
         PlaceID
        ,PersonID
        ,COUNT(DISTINCT [Yearvalue]) AS UNIQUE_YEAR_COUNT
    FROM (
        SELECT
             yearValue
            ,PlaceID
            ,ActivityID
            ,PersonID
        FROM #SO
        WHERE ActivityID = 'Read') t1
    GROUP BY
         PlaceID
        ,PersonID
        ,[ActivityID]) up
pivot (count(PersonID) for UNIQUE_YEAR_COUNT in ([1],[2],[3]) ) as pvt
Or as a case/when-style pivot:
select I.PlaceID
, Column1 = count(case when UNIQUE_YEAR_COUNT = 1 then PersonID else null end)
, Column2 = count(case when UNIQUE_YEAR_COUNT = 2 then PersonID else null end)
, Column3 = count(case when UNIQUE_YEAR_COUNT = 3 then PersonID else null end)
from (
SELECT
PlaceID
, PersonID
,COUNT(DISTINCT [Yearvalue]) AS UNIQUE_YEAR_COUNT
FROM (
SELECT
yearValue
,PlaceID
,ActivityID
,PersonID
FROM #SO
WHERE ActivityID = 'Read') t1
GROUP BY
PlaceID
,PersonID
,[ActivityID]) I
group by I.PlaceID
Since you are in Access, I would think the domain aggregate functions would do the work.
Try DCOUNT() to begin with: http://office.microsoft.com/en-us/access-help/dcount-function-HA001228817.aspx
Replace your COUNT() with DCount("year", "linkeddatabasetable", "placeid=" & [placeid]).
