Selective PostgreSQL database querying - database

Is it possible to have selective queries in PostgreSQL which select different tables/columns based on values of rows already selected?
Basically, I've got a table in which each row contains a sequence of two to five characters (tbl_roots), optionally with a length field which specifies how many characters the sequence is supposed to contain (it's meant to be made redundant once I figure out a better way, i.e. by counting the length of the sequences).
There are four tables containing patterns (tbl_patterns_biliteral, tbl_patterns_triliteral, ...etc), each of which corresponds to a root_length, and a fifth table (tbl_patterns) which is used to synchronise the pattern tables by providing an identifier for each row—so row #2 in tbl_patterns_biliteral corresponds to the same row in tbl_patterns_triliteral. The six pattern tables are restricted such that no row in tbl_patterns_(bi|tri|quadri|quinqui)literal can have a pattern_id that doesn't exist in tbl_patterns.
Each pattern table has nine other columns which corresponds to an identifier (root_form).
The last table in the database (tbl_words), contains a column for each of the major tables (word_id, root_id, pattern_id, root_form, word). Each word is defined as being a root of a particular length and form, spliced into a particular pattern. The splicing is relatively simple: translate(pattern, '12345', array_to_string(root, '')) as word_combined does the job.
Now, what I want to do is select the appropriate pattern table based on the length of the sequence in tbl_roots, and select the appropriate column in the pattern table based on the value of root_form.
How could this be done? Can it be combined into a simple query, or will I need to make multiple passes? Once I've built up this query, I'll then be able to code it into a PHP script which can search my database.
EDIT
Here's some sample data (it's actually the data I'm using at the moment) and some more explanations as to how the system works: https://gist.github.com/823609
It's conceptually simpler than it appears at first, especially if you think of it as a coordinate system.

I think you're going to have to change the structure of your tables to have any hope. Here's a first draft for you to think about. I'm not sure what the significance of the "i", "ii", and "iii" are in your column names. In my ignorance, I'm assuming they're meaningful to you, so I've preserved them in the table below. (I preserved their information as integers. Easy to change that to lowercase roman numerals if it matters.)
create table patterns_bilateral (
pattern_id integer not null,
root_num integer not null,
pattern varchar(15) not null,
primary key (pattern_id, root_num)
);
insert into patterns_bilateral values
(1,1, 'ya1u2a'),
(1,2, 'ya1u22a'),
(1,3, 'ya12u2a'),
(1,4, 'me11u2a'),
(1,5, 'te1u22a'),
(1,6, 'ina12u2a'),
(1,7, 'i1u22a'),
(1,8, 'ya1u22a'),
(1,9, 'e1u2a');
I'm pretty sure a structure like this will be much easier to query, but you know your field better than I do. (On the other hand, database design is my field . . . )
Expanding on my earlier answer and our comments, take a look at this query. (The test table isn't even in 3NF, but the table's not important right now.)
create table test (
root_id integer,
root_substitution varchar[],
length integer,
form integer,
pattern varchar(15),
primary key (root_id, length, form, pattern));
insert into test values
(4,'{s,ş,m}', 3, 1, '1o2i3');
This is the important part.
select root_id
, root_substitution
, length
, form
, pattern
, translate(pattern, '12345', array_to_string(root_substitution, ''))
from test;
That query returns, among other things, the translation soşim.
Are we heading in the right direction?

Well, that's certainly a bizarre set of requirements! Here's my best guess, but obviously I haven't tried it. I used UNION ALL to combine the patterns of different sizes and then filtered them based on length. You might need to move the length condition inside each of the subqueries for speed reasons, I don't know. Then I chose the column using the CASE expression.
select word,
translate(
case root_form
when 1 then patinfo.pattern1
when 2 then patinfo.pattern2
... up to pattern9
end,
'12345',
array_to_string(root.root, '')) as word_combined
from tbl_words word
join tbl_root root
on word.root_id = root.root_id
join tbl_patterns pat
on word.pattern_id = pat.pattern_id
join (
select 2 as pattern_length, pattern_id, pattern1, ..., pattern9
from tbl_patterns_biliteral bi
union all
select 3, pattern_id, pattern1, pattern2, ..., pattern9
from tbl_patterns_biliteral tri
union all
...same for quad and quin...
) patinfo
on
patinfo.pattern_id = pat.pattern_id
and length(root.root) = patinfo.pattern_length
Consider combining all the different patterns into one pattern_details table with a root_length field to filter on. I think that would be easier than combining them all together with UNION ALL. It might be even easier if you had multiple rows in the pattern_details table and filtered based on root_form. Maybe the best would be to lay out pattern_details with fields for pattern_id, root_length, root_form, and pattern. Then you just join from the word table through the pattern table to the pattern detail that matches all the right criteria.
Of course, maybe I've completely misunderstood what you're looking for. If so, it would be clearer if you posted some example data and an example result.

Related

MSSQL select query with prioritized OR

I need to build one MSSQL query that selects one row that is the best match.
Ideally, we have a match on street, zip code and house number.
Only if that does not deliver any results, a match on just street and zip code is sufficient
I have this query so far:
SELECT TOP 1 * FROM realestates
WHERE
(Address_Street = '[Street]'
AND Address_ZipCode = '1200'
AND Address_Number = '160')
OR
(Address_Street = '[Street]'
AND Address_ZipCode = '1200')
MSSQL currently gives me the result where the Address_Number is NOT 160, so it seems like the 2nd clause (where only street and zipcode have to match) is taking precedence over the 1st. If I switch around the two OR clauses, same result :)
How could I prioritize the first OR clause, so that MSSQL stops looking for other results if we found a match where the three fields are present?
The problem here isn't the WHERE (though it is a "problem"), it's the lack of an ORDER BY. You have a TOP (1), but you have nothing that tells the data engine which row is the "top" row, so an arbitrary row is returned. You need to provide logic, in the ORDER BY to tell the data engine which is the "first" row. With the rudimentary logic you have in your question, this would like be:
SELECT TOP (1)
{Explicit Column List}
realestates
WHERE Address_Street = '[Street]'
AND Address_ZipCode = '1200'
ORDER BY CASE Address_Number WHEN '160' THEN 1 ELSE 2 END;
You can't prioritize anything in the WHERE clause. It always results in ALL the matching rows. What you can do is use TOP or FETCH to limit how many results you will see.
However, in order for this to be effective, you MUST have an ORDER BY clause. SQL tables are unordered sets by definition. This means without an ORDER BY clause the database is free to return rows in any order it finds convenient. Mostly this will be the order of the primary key, but there are plenty of things that can change this.

Need help understanding alternatives to scd in SSIS

I am working on a data warehouse project that will involve integrating data from multiple source systems. I have set up an SSIS package that populates the customer dimension and uses the slowly changing dimension tool to keep track of updates to the customer.
I'm running into some issues. Take this example:
Source system A might have a record like that looks like this:
First Name, Last Name, Zipcode
Jane, Doe, 14222
Source system B might have a record for the same client that looks like this:
First Name, Last Name, Zipcode
Jane, Doe, Unknown
If I first import the record from system A, I'll have the first name, last name, and ethnicity. Great. Now, if I import the client record from system B, I can do fuzzy matching to recognize that this is the same person and use the slowly changing dimension tool to update the information. But in this case, I'm going to lose the zipcode because the 'unknown' will overwrite the valid data.
I am wondering if I am approaching this problem in the wrong way. The SCD tool doesn't seem to offer any way of selectively updating attributes based on whether the new data is valid or not. Would a merge statement work better? Am I making some kind of fundamental design mistake that I'm not seeing?
Thanks for any advice!
In my experience the built-in SCD tool is not flexible enough to handle this requirement.
Either a couple of MERGE statements, or a series of UPDATE and INSERT statements will probably give you most flexibility with logic, and performance.
There are probably models out there for MERGE statement for SCD Type 2 but here is the pattern I use:
Merge Target
Using Source
On Target.Key = Source.Key
When Matched And
Target.NonKeyAttribute <> Source.NonKeyAttribute
Or IsNull(Target.NonKeyNullableAttribute, '') <> IsNull(Source.NonKeyNullableAttribute, '')
Then Update Set SCDEndDate = GetDate(), IsCurrent = 0
When Not Matched By Target Then
Insert (Key, ... , SCDStartDate, IsCurrent)
Values (Source.Key, ..., GetDate(), 1)
When Not Matched By Source Then
Update Set SCDEndDate = GetDate(), IsCurrent = 0;
Merge Target
Using Source
On Target.Key = Source.Key
-- These will be the changing rows that were expired in first statement.
When Not Matched By Target Then
Insert (Key, ... , SCDStartDate, IsCurrent)
Values (Source.Key, ... , GetDate(), 1);

one function or many for doing two-way lookups on multi-column junction table?

In SQL Server, we have a junction table associating our property IDs with IDs (text or numeric) of the same properties in several external databases (with sentinel value -1 or '-1' for no link):
SEPID SEPCODE AEID MRIID ABSID PEDID
2087 '140800' 26077 '140800' '-1' 3162
2088 '140900' -1 '140900' '-1' 3167
2089 'F21610' 25744 '-1' '-1' 3184
2090 '15402' -1 '-1' '-1' 3185
2094 '141200' 26085 '141200' '83296' 3198
As databases and business needs grow, we write new specific functions to translate one set of IDs to another, taking a comma-separated string list of input IDs and returning either a table of output IDs or another comma-separated list of output IDs, with different use cases for the table-valued (TVF) vs scalar-valued (SVF) functions. For example (using hard-coded rather than the normal programmatically-managed ID lists):
SELECT * FROM dbo.tvf_SepIdsToAbsIds('2088,2090,2094')
SELECT * FROM dbo.tvf_MriInfo(dbo.svf_AeIds2MriIds('26077,25744'))
would return:
ABSID
'-1'
'-1'
'83296'
and
MRIID NAME ETC...
'140800' 'Foo' ...
I'm wondering if there is a sensible, practical, reasonably performant way to replace these proliferating specific functions with a single generic TVF and SVF that would take two column names and a list of IDs:
SELECT * FROM dbo.tvf_Ids('SEPID', '2088,2090,2094', 'ABSID')
SELECT * FROM dbo.MriInfo(dbo.svf_Ids('AEID', '26077,25744', 'MRIID'))
Is it possible? Does it require dynamic SQL? Do the sometimes-text/sometimes-numeric external IDs complicate things greatly? Is it worth the struggle? Dynamic SQL always feels a bit dodgy and in this case I'm not even sure how to write it.
If I understand your question you have a number of databases each mastering a set of IDs. In you chosen database you relate some\all of these with local data. You then want a way of translating between the local\remote IDS.
You could use Synonyms to point to the database and then a view to consume your synonym.
So you would have
Select *
from MyNewView
and in your MyNewView
Select *
from MyNewSynonym s
inner join Mytable.ABSID = s.ABSID
Your local data just looks at the view and doesn't know anything about the different databases or any particular conversion issues. Its all hidden in the Synonym and view.
Not really sure why you are using TVFs but they are better than using SVFs which you definitely want to avoid for performance reasons. Even better if you can hide it in a view as it may give you indexing advantages.
As for splitting strings and converting them to a rowset, you can use a number of methods to do this. Including a TVF, CLR function or some horrid looping. There a number of examples to assist out there. You then just join the result in with the results from your view above.

Multi-valued Database (UniVerse) -- SM (MV) vs SM (VS) and ASSOC()

I have a question taken from pg 16 of IBM's Nested Relational Database White Paper, I'm confused why in the below CREATE command they use MV/MS/MS rather than MV/MV/MS, when both ORDER_#, and PART_# are one-to-many relationships.. I don't understand what value, vs sub-value means in non-1nf database design. I'd also like to know to know more about the ASSOC () clause.
Pg 16 of IBM's Nested Relational Database White Paper (slight whitespace modifications)
CREATE TABLE NESTED_TABLE (
CUST# CHAR (9) DISP ("Customer #),
CUST_NAME CHAR (40) DISP ("Customer Name"),
ORDER_# NUMBER (6) DISP ("Order #") SM ("MV") ASSOC ("ORDERS"),
PART_# NUMBER (6) DISP (Part #") SM ("MS") ASSOC ("ORDERS"),
QTY NUMBER (3) DISP ("Qty.") SM ("MS") ASSOC ("ORDERS")
);
The IBM nested relational databases implement nested tables as repeating attributes and
repeating groups of attributes that are associated. The SM clauses specify that the attribute is either repeating (multivalued--"MV") or a repeating group (multi-subvalued--"MS"). The ASSOC clause associates the attributes within a nested table. If desired, the IBM nested relational databases can support several nested tables within a base table. The following standard SQL statement would be required to process the 1NF tables of Figure 5 to produce the report shown in Figure 6:
SELECT CUSTOMER_TABLE.CUST#, CUST_NAME, ORDER_TABLE.ORDER_#, PART_#, QTY
FROM CUSTOMER_TABLE, ORDER_TABLE, ORDER_CUST
WHERE CUSTOMER_TABLE.CUST_# = ORDER_CUST.CUST_# AND ORDER_CUST.ORDER_# =
ORDER _TABLE.ORDER_#;
Nested Table
Customer # Customer Name Order # Part # Qty.
AA2340987 Zedco, Inc. 93-1123 037617 81
053135 36
93-1154 063364 32
087905 39
GV1203948 Alphabravo 93-2321 006776 72
055622 81
067587 29
MT1238979 Trisoar 93-2342 005449 33
036893 52
06525 29
93-4596 090643 33
I'll go ahead and answer my own question, while pursuing IBM's UniVerse SQL Administration for DBAs I came across code for CREATE TABLE on pg 55.
ACT_NO INTEGER FORMAT '5R' PRIMARY KEY
BADGE_NO INTEGER FORMAT '5R' PRIMARY KEY
ANIMAL_ID INTEGER FORMAT '5L' PRIMARY KEY
(see distracting side note below) This amused me at first, but essentially I believe this to be a column directive the same as a table directive like PRIMARY ( ACT_NO, BADGE_NO, ANIMAL_ID )
Later on page 5-19, I saw this
ALTER TABLE LIVESTOCK.T ADD ASSOC VAC_ASSOC (
VAC_TYPE KEY, VAC_DATE, VAC_NEXT, VAC_CERT
);
Which leads me to believe that tacking on ASSOC (VAC_ASSOC) to a column would be the same... like this
CREATE TABLE LIVESTOCK.T (
VAC_TYPE ... ASSOC ("VAC_ASSOC")
VAC_DATE ... ASSOC ("VAC_ASSOC")
VAC_NEXT ... ASSOC ("VAC_ASSOC")
VAC_cERT ... ASSOC ("VAC_ASSOC")
);
Anyway, I'm not 100% sure I'm right, but I'm guessing the order doesn't matter, and that rather than these being an intransitive association they're just a order-insensitive grouping.
Onward! With the second part of the question pertaining to MS and MV, I for the life of me can not figure out where the hell IBM got this syntax from. I believe it to be imaginary. I don't have access to a dev machine I can play on to test this out, but I can't find it (the term MV) in the old 10.1 or the new UniVerse 10.3 SQL Reference
side note for those not used to UniVerse the 5R and 5L mean 5 characters right or left justified. That's right a display feature built into the table meta data... Google for UniVerse FORMAT (or FMT) for more info.
Just so you know, Attribute, Multivalue and Sub-Multivalue comes from the way they structure their data.
Essentially, all data is stored in a tree of sorts.
UniVerse is a Multivalue Database. Generally, it does not work in the say way as Relational DBs of the SQL work function.
Each record can have multiple attributes.
Each attribute can have multiple multivalues.
Each multivalue can have multiple sub-multivalues.
So, if I have a record called FRED
Then, FRED<1,2,3> refers to the 1st attribute, 2 multivalue position and 3 subvalue position.
To read more about it, you need to learn more about how UniVerse works. The SQL section is just a side part of it. I suggest you read the other manuals to understand what you are working with.
EDIT
Essentially, the code above is telling you that:
There may be multiple orders per client. These are stored at an MV level in the 'table'
There may be multiple parts per order. These are stored at the MS level in the 'table'
There may be multiple qtys per order. These are stored at the MS level in the 'table'. since they are at the same level, although they are 1-n for orders, they are 1-1 in regards to parts.

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns, one has the ID and the other the abstracts of a document about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract of the table 1 and identify one or more of the acronyms present in it, which are also present in table 2.
Finally the program will create a separate table where the first column contains the content of the first column of the table 1 (i.e. ID) and the acronyms found in the document associated with that ID.
Can some one with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract. ie (pseudo SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)
Given the desired semantics you can use the most straight forward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
for word in abstract.split():
try:
index = acronyms.index(word)
joins.append((id, index))
except ValueError:
pass # word not an acronym
This is a straightforward implementation; however, it has n cubed running time as acronyms.index performs a linear search (of our largest array, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumberate(acronyms))
joins = []
for id, abstract in documents:
for word in abstract.split():
try
joins.append((id, index[word]))
except KeyError:
pass # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
Thanks a lot for the quick response.
I assume the pseudo SQL solution is for MYSQL etc. However it did not work in Microsoft ACCESS.
the second and the third are for Python I assume. Can I feed acronym and document as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])

Resources