Convert SQLite column data into rows in related tables - database

I have a SQLite database that is currently just one table, imported from a CSV file. Two of the columns are semicolon-separated lists of either categories or tags, imported as TEXT fields. A typical row might look like this:
1 | Article Title | photography;my work | tips;lenses;gear | In this article I'll talk about...
How can I extract the category and tags columns, uniquely insert them into their own respective tables, and then create a relational table to tie them all together? So the end result would be something like:
Content
1 | Article Title | photography;my work | tips;lenses;gear | In this article I will talk about...
Categories
1 | photography
2 | my work
ContentCategories
1 | 1 | 1
2 | 2 | 1
This would effectively convert my one table database into a truly relational database.
I'm hoping this can be done both efficiently and quickly as there is a very large number of rows this solution would be used on.
This solution needs to be compatible with SQLite version 3.36 or later.

I believe that the following demonstrates how this can be done. However it is a 2 stage process and just for the categories. Similar two stage processes could be used for other columns.
Table/column names may differ.
/* Create Demo Environment */
DROP TABLE IF EXISTS contentcategories;
DROP TABLE IF EXISTS content;
DROP TABLE IF EXISTS category;
CREATE TABLE IF NOT EXISTS content (content_id INTEGER PRIMARY KEY,title TEXT, categories TEXT);
INSERT INTO content (title,categories) VALUES
('Article1','photography;my work;something;another;blah'),
('Article2','photography;thier work;not something;not another;not blah'),
('Article3','A;B;C;D;E;F;G;;');
CREATE TABLE IF NOT EXISTS category (category_id INTEGER PRIMARY KEY,category_name TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS contentcategories (content_id_map,category_id_map, PRIMARY KEY (content_id_map,category_id_map));
/* Stage 1 populate the category table */
WITH
sep AS (SELECT ';'), /* The value separator */
justincase AS (SELECT 100), /* safety cap on the total number of rows the recursive CTE may produce; raise it for large tables */
splt(value,rest) AS
(
SELECT
substr(categories,1,instr(categories,(SELECT * FROM sep))-1),
substr(categories,instr(categories,(SELECT * FROM sep))+1)||(SELECT * FROM sep)
FROM content
UNION ALL SELECT
substr(rest,1,instr(rest,(SELECT * FROM sep))-1),
substr(rest,instr(rest,(SELECT * FROM sep))+1)
FROM splt
WHERE length(rest) > 0
LIMIT (SELECT * FROM justincase) /* safety cap on the total rows produced, across all content rows */
)
INSERT OR IGNORE INTO category (category_name) SELECT value FROM splt WHERE length(value) > 0;
/* Show the resultant category table */
SELECT * FROM category;
/* Stage 2 populate the contentcategories mapping table */
WITH
sep AS (SELECT ';'), /* The value separator */
justincase AS (SELECT 100), /* safety cap on the total number of rows the recursive CTE may produce; raise it for large tables */
splt(value,rest,contentid,categoryid) AS
(
SELECT
substr(categories,1,instr(categories,(SELECT * FROM sep))-1),
substr(categories,instr(categories,(SELECT * FROM sep))+1)||(SELECT * FROM sep),
content_id,
(SELECT category_id FROM category WHERE category_name = substr(categories,1,instr(categories,(SELECT * FROM sep))-1))
FROM content
UNION ALL SELECT
substr(rest,1,instr(rest,(SELECT * FROM sep))-1),
substr(rest,instr(rest,(SELECT * FROM sep))+1),
contentid,
(SELECT category_id FROM category WHERE category_name = substr(rest,1,instr(rest,(SELECT * FROM sep))-1))
FROM splt
WHERE length(rest) > 0
LIMIT (SELECT * FROM justincase) /* safety cap on the total rows produced, across all content rows */
)
INSERT OR IGNORE INTO contentcategories SELECT contentid,categoryid FROM splt WHERE length(value) > 0;
/* Show the result of content joined via the mapping table with the category table */
SELECT content.*,category.*
FROM content
JOIN contentcategories ON content_id = content_id_map JOIN category ON category_id_map = category_id
;
/* Cleanup Demo Environment */
DROP TABLE IF EXISTS contentcategories;
DROP TABLE IF EXISTS content;
DROP TABLE IF EXISTS category;
So the content table has three rows each with a varying number of categories.
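The first stage uses recursion to split the values, dropping the separators. The separator is coded as a CTE just once, so it could be passed in; likewise the value that limits the recursion is a CTE and could also be passed in. Note that this LIMIT caps the total number of rows the recursive CTE produces across all content rows, not the number of values per row, so for a large table it must be raised (or removed) or values will be silently dropped.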
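The effect of the LIMIT can be checked with a small experiment. This is a minimal sketch, assuming Python's standard sqlite3 module and a cut-down content table; the 50 identical rows are made-up test data, and a negative LIMIT is used to mean "no cap".

```python
import sqlite3

# Same splitting CTE as stage 1, with the cap passed in as a parameter.
SPLIT_SQL = """
WITH splt(value, rest) AS (
    SELECT substr(categories, 1, instr(categories, ';') - 1),
           substr(categories, instr(categories, ';') + 1) || ';'
    FROM content
    UNION ALL
    SELECT substr(rest, 1, instr(rest, ';') - 1),
           substr(rest, instr(rest, ';') + 1)
    FROM splt
    WHERE length(rest) > 0
    LIMIT ?
)
SELECT count(*) FROM splt WHERE length(value) > 0
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (content_id INTEGER PRIMARY KEY, categories TEXT)")
# 50 rows with 3 values each: 150 values in total (made-up data).
conn.executemany("INSERT INTO content (categories) VALUES (?)", [("a;b;c",)] * 50)

capped = conn.execute(SPLIT_SQL, (100,)).fetchone()[0]    # cap of 100 rows in total
uncapped = conn.execute(SPLIT_SQL, (-1,)).fetchone()[0]   # negative LIMIT: no cap
print(capped, uncapped)
```

With the cap at 100 only 100 of the 150 values are returned, which is why the cap should be sized for the whole table rather than for a single row.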
The resulting CTE (splt) is then used for a SELECT INSERT to load the new category table with the extracted/split categories (OR IGNORE used to ignore any duplicates such as photography).
The second stage then splits the values again this time getting the id of the category from the new category table so that the mapping table contentcategories can be loaded.
After each stage a SELECT is used to show the result of the stage (these are included just to demonstrate).
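So when the above is run:
The first result (after loading the category table) lists each distinct category once, with its generated category_id.
The second result lists each content row joined, via the mapping table, to each of its categories.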
i.e. everything is extracted via the joins as expected (not thoroughly checked though).
Note that the erroneous ;; (i.e. no value between the separators) is discarded by WHERE length(value) > 0.
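Where the split does not have to happen inside SQLite, the same two stages can also be driven from application code. The following is a minimal sketch, assuming Python's standard sqlite3 module and the demo table names used above; the function name normalise_categories and the two sample rows are made up for illustration, and this is an alternative to the recursive CTE rather than a replacement for it.

```python
import sqlite3

def normalise_categories(conn, sep=";"):
    """Split content.categories into category and contentcategories rows."""
    rows = conn.execute("SELECT content_id, categories FROM content").fetchall()
    pairs = [
        (content_id, name)
        for content_id, categories in rows
        for name in (categories or "").split(sep)
        if name  # skip empty values such as those produced by ';;'
    ]
    with conn:  # both stages in one transaction
        # Stage 1: unique category names (duplicates ignored via the UNIQUE constraint).
        conn.executemany(
            "INSERT OR IGNORE INTO category (category_name) VALUES (?)",
            [(name,) for _, name in pairs],
        )
        # Stage 2: map each content row to the id of each of its categories.
        conn.executemany(
            "INSERT OR IGNORE INTO contentcategories (content_id_map, category_id_map) "
            "SELECT ?, category_id FROM category WHERE category_name = ?",
            pairs,
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content (content_id INTEGER PRIMARY KEY, title TEXT, categories TEXT);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT UNIQUE);
CREATE TABLE contentcategories (content_id_map, category_id_map,
    PRIMARY KEY (content_id_map, category_id_map));
INSERT INTO content (title, categories) VALUES
    ('Article1', 'photography;my work'),
    ('Article2', 'photography;A;;');
""")
normalise_categories(conn)
category_names = [r[0] for r in conn.execute(
    "SELECT category_name FROM category ORDER BY category_id")]
mapping_count = conn.execute("SELECT count(*) FROM contentcategories").fetchone()[0]
print(category_names, mapping_count)
```

Wrapping both stages in a single transaction tends to matter for speed on large tables, since SQLite otherwise commits after every statement.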

Related

Select a large volume of data with LIKE in SQL Server

I have a table with ID column
ID column is like this : IDxxxxyyy
x will be 0 to 9
I have to select the rows whose ID matches ID0xxx% to ID3xxx%; that is around 4000 wildcard patterns, from ID0000% to ID3999%.
It is like combining LIKE with IN
Select * from TABLE where ID in (ID0000%,ID0001%,...,ID3999%)
I cannot figure out how to select with this condition.
If you have any idea, please help.
Thank you so much!
You can use pattern matching with LIKE. e.g.
WHERE ID LIKE 'ID[0-3][0-9][0-9][0-9]%'
This will match any string that:
Starts with ID (ID)
Then has a third character that is a number between 0 and 3 [0-3]
Then has 3 further numbers ([0-9][0-9][0-9])
This is not likely to perform well at all. If it is not too late to alter your table design, I would separate out the components of your Identifier and store them separately, then use a computed column to store your full id e.g.
CREATE TABLE T
(
NumericID INT NOT NULL,
YYY CHAR(3) NOT NULL, -- Or whatever type makes up yyy in your ID
FullID AS CONCAT('ID', FORMAT(NumericID, '0000'), YYY),
CONSTRAINT PK_T__NumericID_YYY PRIMARY KEY (NumericID, YYY)
);
Then your query is as simple as:
SELECT FullID
FROM T
WHERE NumericID >= 0
AND NumericID < 4000;
This is significantly easier to read and write, and will be significantly faster too.
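Both ideas can be tried out locally. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, which is an assumption of this example: SQLite's GLOB with * plays the role of the bracketed T-SQL LIKE pattern, and the sample IDs are made up. It also shows that, when every ID is well formed and zero-padded, a plain range comparison on the string selects the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO t (id) VALUES (?)",
    [("ID0000abc",), ("ID1234xyz",), ("ID3999zzz",), ("ID4000aaa",), ("ID9999abc",)],
)

# Character-class pattern (GLOB here; LIKE 'ID[0-3][0-9][0-9][0-9]%' in T-SQL).
glob_ids = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE id GLOB 'ID[0-3][0-9][0-9][0-9]*' ORDER BY id")]

# Range predicate on the fixed-width prefix; it can use the index on id.
range_ids = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE id >= 'ID0000' AND id < 'ID4000' ORDER BY id")]

print(glob_ids)
print(range_ids)
```

The range form only matches the pattern form if no malformed IDs (for example a short value such as 'ID1') exist in the table.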
This should do that; it will get all the IDs that start with IDx, with x going from 0 to 3:
Select * from TABLE where ID LIKE 'ID[0-3]%'
You can try :
Select * from TABLE where id like 'ID[0-3][0-9]%[a-zA-Z]';

How to select all fields from one table that contain a substring from any row in another column

I'm trying to export a dictionary of words in sqlite made up only of words that start with, contain, or end with specific filters.
If one filter was 'ment' and could be found anywhere in the word; it would include words such as 'moment', 'mentioned' and 'implemented'.
If another was 'under' and could only be a prefix; it would match words such as 'underachieve' and 'undercharged' but not 'plunder'.
I've found a few similar questions around - however I haven't been able to get any to work, or they are for full versions of sql and contain functions not in sqlite. Mostly my issue is with the fact that it's not just 'match every substring' - there's prefixes, suffixes and phrases(matches anywhere in word)
Already Tried:
* Select rows from a table that contain any word from a long list of words in another table
* Search SQL Server string for values from another table
* SQL select rows where field contains word from another table's fields
* https://social.msdn.microsoft.com/Forums/sqlserver/en-US/b9bb1003-80f2-4e61-ad58-f6856666bf85/how-to-select-rows-that-contain-substrings-from-another-table?forum=transactsql
My database looks like this:
dictionary_full
------------------
word
------------------
abacuses
abalone
afterthought
auctioneer
before
biologist
crafter
...
------------------
filters
------------------
name | type_id
------------------
after | 1
super | 1
tion | 2
ses | 3
logist | 3
...
type
------------------
name
------------------
prefix
phrase
suffix
I can select all phrases from the db by using this query:
SELECT name FROM filters WHERE type_id = (SELECT ROWID FROM type WHERE name='phrase');
however I haven't been able to work that successfully into the solutions I've found. It will either return no results, or duplicate results.
e.g.
Duplicates:
SELECT d.word FROM dictionary_full d
JOIN filters f ON instr(d.word, (
SELECT name FROM filters WHERE type_id = (SELECT ROWID FROM type WHERE name='phrase')
)) > 0
Expected Results:
A combination of all words that:
- start with the prefixes 'after' / 'super'
- OR contain anywhere the phrase 'tion'
- OR end with the suffix 'ses' / 'logist'
------------------
word
------------------
abacuses
afterthought
auctioneer
biologist
Sounds like you want LIKE.
After creating some sample data (skipping mapping filter type names to integers for the sake of brevity and clarity):
CREATE TABLE words(word TEXT PRIMARY KEY) WITHOUT ROWID;
INSERT INTO words(word) VALUES ('abacuses'), ('abalone'), ('afterthought'),
('auctioneer'), ('before'), ('biologist'), ('crafter');
CREATE TABLE filters(name TEXT, type TEXT, PRIMARY KEY(name, type)) WITHOUT ROWID;
INSERT INTO filters(name, type) VALUES ('after', 'prefix'), ('super', 'prefix'),
('tion', 'phrase'), ('ses', 'suffix'), ('logist', 'suffix');
This query
SELECT *
FROM words AS w
JOIN filters AS f ON (CASE f.type
WHEN 'prefix' THEN w.word LIKE f.name || '%'
WHEN 'suffix' THEN w.word LIKE '%' || f.name
WHEN 'phrase' THEN w.word LIKE '%' || f.name || '%'
END)
GROUP BY w.word -- eliminate duplicate matches
ORDER BY w.word;
results in
word name type
------------ ---------- ----------
abacuses ses suffix
afterthought after prefix
auctioneer tion phrase
biologist logist suffix
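The same idea can be applied to the original three-table layout, where filters.type_id refers to the ROWID of the type table. The following is a minimal sketch, assuming Python's standard sqlite3 module and the sample data from the question; it swaps the GROUP BY for an EXISTS subquery, so each word is returned once without needing to eliminate duplicates afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dictionary_full (word TEXT);
CREATE TABLE type (name TEXT);
CREATE TABLE filters (name TEXT, type_id INTEGER);
INSERT INTO dictionary_full (word) VALUES ('abacuses'), ('abalone'), ('afterthought'),
    ('auctioneer'), ('before'), ('biologist'), ('crafter');
INSERT INTO type (name) VALUES ('prefix'), ('phrase'), ('suffix');  -- rowids 1, 2, 3
INSERT INTO filters (name, type_id) VALUES ('after', 1), ('super', 1),
    ('tion', 2), ('ses', 3), ('logist', 3);
""")

words = [r[0] for r in conn.execute("""
SELECT d.word
FROM dictionary_full AS d
WHERE EXISTS (
    SELECT 1
    FROM filters AS f
    JOIN type AS t ON t.rowid = f.type_id
    WHERE CASE t.name
              WHEN 'prefix' THEN d.word LIKE f.name || '%'
              WHEN 'suffix' THEN d.word LIKE '%' || f.name
              WHEN 'phrase' THEN d.word LIKE '%' || f.name || '%'
          END
)
ORDER BY d.word
""")]
print(words)
```

Note that 'crafter' is not returned even though it contains 'after', because that filter is a prefix only.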

postgresql: Insert two values in table b if both values are not in table a

I'm doing an assignment where I am to make an SQL database of tournament results. Players can be added by their name, and when the database has at least two players who have not already been assigned to a match, two players should be matched against each other.
For instance, if the tables currently are empty I add Joe as a player. I then also add James and since the table then has two players, who also are not in the matches-table, a new row in the matches-table is created with their p_id set to left_player_P_id and right_player_P_id.
I thought it would be a good idea to create a function and a trigger so that every time a row is added to the player-table, the sql-code would run and create the row in the matches as needed. I am open to other ways of doing this.
I've tried multiple different approaches including SQL - Insert if the number of rows is greater than and Using IF ELSE statement based on Count to execute different Insert statements but I am now at a loss.
Problematic code:
This approach returns a syntax error.
IF ((select count(*) from players_not_in_any_matches) >= 2)
begin
insert into matches values (
(select p_id from players_not_in_any_matches limit 1),
(select p_id from players_not_in_any_matches limit 1 offset 1)
)
end;
Alternative approach (still problematic code):
This approach seems more promising (but less readable). However, it inserts even if there are no rows returned inside the where not exists.
insert into matches (left_player_p_id, right_player_p_id)
select
(select p_id from players_not_in_any_matches limit 1),
(select p_id from players_not_in_any_matches limit 1 offset 1)
where not exists (
select * from players_not_in_any_matches offset 2
);
Tables
CREATE TABLE players (
p_id serial PRIMARY KEY,
full_name text
);
CREATE TABLE matches(
left_player_P_id integer REFERENCES players,
right_player_P_id integer REFERENCES players,
winner integer REFERENCES players
);
Views
-- view for getting all players not currently assigned to a match
create view players_not_in_any_matches as
select * from players
where p_id not in (
select left_player_p_id from matches
) and
p_id not in (
select right_player_p_id from matches
);
Try:
insert into matches (left_player_p_id, right_player_p_id)
select p1.p_id, p2.p_id
from players p1
join players p2
on p1.p_id <> p2.p_id
and not exists(
select 1 from matches m
where p1.p_id in (m.left_player_p_id, m.right_player_p_id)
)
and not exists(
select 1 from matches m
where p2.p_id in (m.left_player_p_id, m.right_player_p_id)
)
limit 1
Anti joins (not-exists operators) in the above query could be further simplified a bit using LEFT JOINs:
insert into matches (left_player_p_id, right_player_p_id)
select p1.p_id, p2.p_id
from players p1
join players p2
on p1.p_id <> p2.p_id
left join matches m1
on p1.p_id in (m1.left_player_p_id, m1.right_player_p_id)
left join matches m2
on p2.p_id in (m2.left_player_p_id, m2.right_player_p_id)
where m1.left_player_p_id is null
and m2.left_player_p_id is null
limit 1
but in my opinion the former query is more readable, while the latter one looks tricky.
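The behaviour of the NOT EXISTS version can be checked by running it after every new player. This is a minimal sketch, assuming Python's standard sqlite3 module with SQLite standing in for PostgreSQL (so serial becomes INTEGER PRIMARY KEY); the helper add_player and the player names beyond Joe and James are made up, and in PostgreSQL the same statement could be called from a trigger function instead.

```python
import sqlite3

# Pair up two players that do not yet appear in any match (at most one pair per run).
PAIR_SQL = """
INSERT INTO matches (left_player_p_id, right_player_p_id)
SELECT p1.p_id, p2.p_id
FROM players p1
JOIN players p2
  ON p1.p_id <> p2.p_id
 AND NOT EXISTS (SELECT 1 FROM matches m
                 WHERE p1.p_id IN (m.left_player_p_id, m.right_player_p_id))
 AND NOT EXISTS (SELECT 1 FROM matches m
                 WHERE p2.p_id IN (m.left_player_p_id, m.right_player_p_id))
LIMIT 1
"""

def add_player(conn, full_name):
    """Insert a player, then try to create a match from unassigned players."""
    with conn:
        conn.execute("INSERT INTO players (full_name) VALUES (?)", (full_name,))
        conn.execute(PAIR_SQL)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (p_id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE matches (
    left_player_p_id INTEGER REFERENCES players,
    right_player_p_id INTEGER REFERENCES players,
    winner INTEGER REFERENCES players);
""")

match_counts = []
for name in ["Joe", "James", "Jill", "Jane"]:
    add_player(conn, name)
    match_counts.append(conn.execute("SELECT count(*) FROM matches").fetchone()[0])
print(match_counts)
```

A match appears only after the second and fourth player are added; a single unassigned player produces no row because the self-join finds no partner.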

ON CONFLICT DO UPDATE/DO NOTHING not working on FOREIGN TABLE

The ON CONFLICT DO UPDATE/DO NOTHING feature was introduced in PostgreSQL 9.5.
Foreign servers and foreign tables have been available since earlier releases (CREATE FOREIGN TABLE since PostgreSQL 9.1).
When I use ON CONFLICT DO UPDATE on a foreign table it does not work,
but the same query works on a normal table. The queries are given below.
-- For normal table
INSERT INTO app
(app_id,app_name,app_date)
SELECT
p.app_id,
p.app_name,
p.app_date FROM app p
WHERE p.app_id=2422
ON CONFLICT (app_id) DO
UPDATE SET app_date = excluded.app_date ;
O/P : Query returned successfully: one row affected, 5 msec execution time.
-- For foreign table
-- foreign_app is the foreign table and app is the normal table
INSERT INTO foreign_app
(app_id,app_name,app_date)
SELECT
p.app_id,
p.app_name,
p.app_date FROM app p
WHERE p.app_id=2422
ON CONFLICT (app_id) DO
UPDATE SET app_date = excluded.app_date ;
O/P : ERROR: there is no unique or exclusion constraint matching the ON CONFLICT specification
Can anyone explain why this is happening?
There are no constraints on foreign tables, because PostgreSQL cannot enforce data integrity on the foreign server – that is done by constraints defined on the foreign server.
To achieve what you want to do, you'll have to stick with the “traditional” way of doing this (e.g. this code sample).
I know this is an old question, but in some cases there is a way to do it with ROW_NUMBER() OVER (PARTITION BY ...). In my case, my first take was to try ON CONFLICT...DO UPDATE, but that doesn't work on foreign tables (as stated above; hence my finding this question). My problem was very specific, in that I had a foreign table (f_zips) to be populated with the best zip code (postal code) information possible. I also had a local table, postcodes, with very good data and another local table, zips, with lower-quality zip code information but much more of it. For every record in postcodes, there is a corresponding record in zips but the postal codes may not match. I wanted f_zips to hold the best data.
I solved this with a union, with a value of ind = 0 as the indicator that a record came from the better data set. A value of ind = 1 indicates lesser-quality data. Then I used row_number() over a partition to get the answer (where get_valid_zip5() is a local function to return either a five-digit zip code or a null value):
insert into f_zips (recnum, postcode)
select s2.recnum, s2.zip5 from (
select s1.recnum, s1.zip5, s1.ind, row_number()
over (partition by recnum order by s1.ind) as rn from (
select recnum, get_valid_zip5(postcode) as zip5, 0 as ind
from postcodes
where get_valid_zip5(postcode) is not null
union
select recnum, get_valid_zip5(zip9) as zip5, 1 as ind
from zips
where get_valid_zip5(zip9) is not null
order by 1, 3) s1
) s2 where s2.rn = 1
;
I haven't run any performance tests, but for me this runs in cron and doesn't directly affect the users.
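The ranking idea can be reproduced on a tiny data set. The following is a minimal sketch, assuming Python's standard sqlite3 module (with an SQLite build recent enough for window functions) as a stand-in for PostgreSQL and plain tables in place of the foreign table; the Python get_valid_zip5 below is a hypothetical stand-in for the local function mentioned above, and the rows are made up.

```python
import sqlite3

def get_valid_zip5(value):
    """Hypothetical stand-in: return a five-digit code, or None if invalid."""
    if value is None:
        return None
    digits = str(value).strip()[:5]
    return digits if len(digits) == 5 and digits.isdigit() else None

conn = sqlite3.connect(":memory:")
conn.create_function("get_valid_zip5", 1, get_valid_zip5)
conn.executescript("""
CREATE TABLE postcodes (recnum INTEGER, postcode TEXT);
CREATE TABLE zips (recnum INTEGER, zip9 TEXT);
CREATE TABLE f_zips (recnum INTEGER, postcode TEXT);
INSERT INTO postcodes VALUES (1, '98409'), (3, NULL);
INSERT INTO zips VALUES (1, '98984'), (2, '98370'), (3, '0');
""")

# ind = 0 marks the preferred source; row_number() keeps the best row per recnum.
conn.execute("""
INSERT INTO f_zips (recnum, postcode)
SELECT recnum, zip5 FROM (
    SELECT recnum, zip5,
           row_number() OVER (PARTITION BY recnum ORDER BY ind) AS rn
    FROM (
        SELECT recnum, get_valid_zip5(postcode) AS zip5, 0 AS ind FROM postcodes
        UNION
        SELECT recnum, get_valid_zip5(zip9) AS zip5, 1 AS ind FROM zips
    )
    WHERE zip5 IS NOT NULL
)
WHERE rn = 1
""")
best = conn.execute("SELECT recnum, postcode FROM f_zips ORDER BY recnum").fetchall()
print(best)
```

Record 1 takes the value from postcodes, record 2 falls back to zips, and record 3 is skipped because neither source holds a valid code.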
Verified on more than 900,000 records (SQL formatting omitted for brevity) :
/* yes, the preferred data was entered when it existed in both tables */
select t1.recnum, t1.postcode, t2.zip9 from postcodes t1 join zips t2 on t1.recnum = t2.recnum where t1.postcode is not null and t2.zip9 is not null and t2.zip9 not in ('0') and length(t1.postcode)=5 and length(t2.zip9)=5 and t1.postcode <> t2.zip9 order by 1 limit 5;
recnum | postcode | zip9
----------+----------+-------
12022783 | 98409 | 98984
12022965 | 98226 | 98225
12023113 | 98023 | 98003
select * from f_zips where recnum in (12022783, 12022965, 12023113) order by 1;
recnum | postcode
----------+----------
12022783 | 98409
12022965 | 98226
12023113 | 98023
/* yes, entries came from the less-preferred dataset when they didn't exist in the better one */
select t1.recnum, t1.postcode, t2.zip9 from postcodes t1 right join zips t2 on t1.recnum = t2.recnum where t1.postcode is null and t2.zip9 is not null and t2.zip9 not in ('0') and length(t2.zip9)= 5 order by 1 limit 3;
recnum | postcode | zip9
----------+----------+-------
12021451 | | 98370
12022341 | | 98501
12022695 | | 98597
select * from f_zips where recnum in (12021451, 12022341, 12022695) order by 1;
recnum | postcode
----------+----------
12021451 | 98370
12022341 | 98501
12022695 | 98597
/* yes, entries came from the preferred dataset when the less-preferred one had invalid values */
select t1.recnum, t1.postcode, t2.zip9 from postcodes t1 left join zips t2 on t1.recnum = t2.recnum where t1.postcode is not null and t2.zip9 is null order by 1 limit 3;
recnum | postcode | zip9
----------+----------+------
12393585 | 98118 |
12393757 | 98101 |
12393835 | 98101 |
select * from f_zips where recnum in (12393585, 12393757, 12393835) order by 1;
recnum | postcode
----------+----------
12393585 | 98118
12393757 | 98101
12393835 | 98101

Split Single Column into multiple and Load it to a Table or a View

I'm using SQL Server 2008. I have a source table with a few columns (A, B) containing string data to split into multiple columns. I already have a function written that does the split.
The data from the Source table (the source table format cannot be modified) is used in a View being created. But I need to have my View have already split data for Column A and B from the Source table. So, my view will have extra columns that are not in the Source table.
Then the View populated with the Source table is used to Merge with the Other Table.
There are two questions here:
Can I split column A and B from the Source table when creating a View, but do not change the Source Table?
How to use my existing User Defined Function in the View "Select" statement to accomplish this task?
Idea in short:
String to split is also shown in the example in the commented out section. Pretty much have Destination table, vStandardizedData View, SP that uses the View data to Merge to tblStandardizedData table. So, in my Source column I have column A and B that I need to split before loading to tblStandardizedData table.
There are five objects that I'm working on:
Source File
Destination Table
vStandardizedData View
tblStandardizedData table
Stored procedure that does merge
(Update and Insert) from the vStandardizedData View.
Note: all 5 objects are listed in the order they are supposed to be created and loaded.
Separately from this there is an existing UDFunction that can split the string which I was told to use
Example of the string in column A (column B has the same format data) to be split:
6667 Mission Street, 4567 7rd Street, 65 Sully Pond Park
Desired result:
User-defined function returns a table variable:
CREATE FUNCTION [Schema].[udfStringDelimeterfromTable]
(
@sInputList VARCHAR(MAX) -- List of delimited items
, @Delimiter CHAR(1) = ',' -- delimiter that separates items
)
RETURNS @List TABLE (Item VARCHAR(MAX)) WITH SCHEMABINDING
/*
* Returns a table of strings that have been split by a delimiter.
* Similar to the Visual Basic (or VBA) SPLIT function. The
* strings are trimmed before being returned. Null items are not
* returned so if there are multiple separators between items,
* only the non-null items are returned.
* Space is not a valid delimiter.
*
* Example:
SELECT * FROM [Schema].[udfStringDelimeterfromTable]('abcd,123, 456, efh,,hi', ',')
*
* Test:
DECLARE @Count INT, @Delim CHAR(10), @Input VARCHAR(128)
SELECT @Count = Count(*)
FROM [Schema].[udfStringDelimeterfromTable]('abcd,123, 456', ',')
PRINT 'TEST 1 3 lines:' + CASE WHEN @Count=3
THEN 'Worked' ELSE 'ERROR' END
SELECT @Delim=CHAR(10)
, @Input = 'Line 1' + @Delim + 'line 2' + @Delim
SELECT @Count = Count(*)
FROM [Schema].[udfStringDelimeterfromTable](@Input, @Delim)
PRINT 'TEST 2 LF :' + CASE WHEN @Count=2
THEN 'Worked' ELSE 'ERROR' END
What I'd ask you, is to read this: How to create a Minimal, Complete, and Verifiable example.
In general: if you use your UDF, you'll get the data back as a table. It would be best if your UDF returned each item together with a running number. Otherwise you'll first need to use ROW_NUMBER() OVER(...) to create a part number in order to build your target column names via string concatenation. Then use PIVOT to get the columns side by side.
An easier approach could be a string split via XML like in this answer
A quick proof of concept to show the principles:
DECLARE @tbl TABLE(ID INT,YourValues VARCHAR(100));
INSERT INTO @tbl VALUES
(1,'6667 Mission Street, 4567 7rd Street, 65 Sully Pond Park')
,(2,'Other addr1, one more addr, and another one, and even one more');
WITH Casted AS
(
SELECT *
,CAST('<x>' + REPLACE(YourValues,',','</x><x>') + '</x>' AS XML) AS AsXml
FROM @tbl
)
SELECT *
,LTRIM(RTRIM(AsXml.value('/x[1]','nvarchar(max)'))) AS Address1
,LTRIM(RTRIM(AsXml.value('/x[2]','nvarchar(max)'))) AS Address2
,LTRIM(RTRIM(AsXml.value('/x[3]','nvarchar(max)'))) AS Address3
,LTRIM(RTRIM(AsXml.value('/x[4]','nvarchar(max)'))) AS Address4
,LTRIM(RTRIM(AsXml.value('/x[5]','nvarchar(max)'))) AS Address5
FROM Casted
If your values might include forbidden characters (especially <,> and &) you can find an approach to deal with this in the linked answer.
The result
+----+---------------------+-----------------+--------------------+-------------------+----------+
| ID | Address1 | Address2 | Address3 | Address4 | Address5 |
+----+---------------------+-----------------+--------------------+-------------------+----------+
| 1 | 6667 Mission Street | 4567 7rd Street | 65 Sully Pond Park | NULL | NULL |
+----+---------------------+-----------------+--------------------+-------------------+----------+
| 2 | Other addr1 | one more addr | and another one | and even one more | NULL |
+----+---------------------+-----------------+--------------------+-------------------+----------+
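The shape of that result (a fixed number of columns, padded with NULL) can be illustrated outside the database as well. This is a minimal sketch in plain Python rather than T-SQL; the function name split_to_columns and the default width of 5 are assumptions for illustration, and it follows the UDF's documented behaviour of trimming items and dropping empty ones.

```python
def split_to_columns(value, delimiter=",", width=5):
    """Split a delimited string into exactly `width` columns, padding with None."""
    parts = [part.strip() for part in value.split(delimiter)]
    parts = [part for part in parts if part]  # drop empty items between separators
    return (parts + [None] * width)[:width]

row1 = split_to_columns("6667 Mission Street, 4567 7rd Street, 65 Sully Pond Park")
row2 = split_to_columns("Other addr1, one more addr, and another one, and even one more")
print(row1)
print(row2)
```

Items beyond the chosen width are silently cut off, just as the XML query above only reads positions 1 to 5.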

Resources