Scored similarity searches in Snowflake

Goal: Implement a SQL query in a Snowflake database that, given an address-like string (user input), does a fuzzy/approximate search against a single field, returning results with a similarity score, ordered by that score.
I see that Snowflake offers a few tools that seem related to this problem, such as APPROXIMATE_SIMILARITY and MINHASH, but it's not clear to me which of these I need or how to put them together. The documentation is good but lacks a straightforward example, and it seems to focus on the similarity of two tables rather than on comparing an arbitrary string to the values in a column.
Given user_input and the field locations.FullAddress, I'm looking for something like this pseudo-query:
SELECT "score", field1, field2 from locations
WHERE FullAddress LIKE user_input;
I know there's more to it than that, but I just can't quite see how to integrate the functions provided by Snowflake to make it work.
Here is a sample of the locations table - note that the complete address is in a single field, and can be rendered in inconsistent ways.
| Somefield | FullAddress | OtherField |
|-----------|--------------------------------------------------|------------|
| foo | 123 SW Marble Street, Brainerd MA 55555 | yellow |
| bar | 98 Main, San Diego CA 99999 | green |
| beep | 123 SW Marble St, Brainerd 55555-2222 | orange |
| baz | 456 Somewhere Blvd, Apt 23, Boise ID, 44444-1234 | blue |
A user might search for 123 SW Marble Street, Brainerd MA 55555 (a perfect match). I would hope to return rows 1 and 3, with row 1 getting the highest score. Or a user might search for 123 Marble Street (an imperfect match), and I would still want rows 1 and 3 returned, ranked by the similarity algorithm.

There are three built-in fuzzy matching functions in Snowflake: JAROWINKLER_SIMILARITY (mentioned by NickW), EDITDISTANCE, and SOUNDEX. It's a simple matter to extend this set using Java, Python, or JavaScript code in a UDF.
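As an example of such an extension, a JavaScript UDF can add a token-based score that ignores word order, which helps with addresses whose components move around. This is only a sketch; the function name and the scoring formula are mine, not Snowflake built-ins (JavaScript UDF arguments appear in the body as uppercase globals):

create or replace function TOKEN_OVERLAP(a varchar, b varchar)
returns float
language javascript
as
$$
// Share of distinct alphanumeric tokens that A and B have in common.
if (A == null || B == null) return 0;
function toks(s) {
  var out = {};
  var parts = String(s).toUpperCase().split(/[^A-Z0-9]+/);
  for (var i = 0; i < parts.length; i++) {
    if (parts[i]) out[parts[i]] = true;
  }
  return out;
}
var ta = toks(A), tb = toks(B);
var na = 0, nb = 0, shared = 0;
for (var k in ta) { na++; if (tb.hasOwnProperty(k)) shared++; }
for (var k2 in tb) nb++;
var denom = Math.max(na, nb);
return denom === 0 ? 0 : shared / denom;
$$;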
Here is an example of the three built-in functions compared to a given address:
set comparison = '123 SW Marble Street, Brainerd MA 55555';

select *
      ,JAROWINKLER_SIMILARITY($comparison, full_address) as jw_score
      ,EDITDISTANCE($comparison, full_address)           as edit_distance
      ,SOUNDEX($comparison) = SOUNDEX(full_address)      as soundex_match
from T1;
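To get the scored, ordered search the question asks for, compute the similarity once per row, filter on a threshold, and sort descending. A minimal sketch against the locations table from the question; the cutoff of 80 is an arbitrary starting point to tune (JAROWINKLER_SIMILARITY returns an integer from 0 to 100):

set user_input = '123 Marble Street';

select JAROWINKLER_SIMILARITY($user_input, FullAddress) as score
      ,Somefield
      ,FullAddress
      ,OtherField
from locations
where JAROWINKLER_SIMILARITY($user_input, FullAddress) > 80  -- tune this cutoff
order by score desc;

EDITDISTANCE can be used the same way, except that lower is better, so you would order ascending and perhaps normalize by string length.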

Related

Database Design - Need help scaling query

I'm trying to find the best data-structure/data store solution (highest performance) for the following request:
I have a list of attributes that I need to store for every individual in the US, for example:
+------------+-------+-------------+
| Attribute | Value | SSN |
+------------+-------+-------------+
| hair color | black | 123-45-6789 |
| eye color | brown | 123-45-6789 |
| height | 175 | 123-45-6789 |
| sex | M | 123-45-6789 |
| shoe size  | 42    | 123-45-6789 |
+------------+-------+-------------+
As you can guess, across the general population, no single attribute is unique or identifying on its own.
However, let's assume that a combination of 3 or 4 attributes would let me uniquely identify a person (find their SSN).
Now here's the difficulty: the set of combinations that can uniquely identify a person will evolve and be adjusted over time.
What would be my best bet for storing and querying the data in the scenario above, so that it remains highly performant (<100 ms) at scale?
Current attempt, combining two attributes:
SELECT * FROM (SELECT * FROM people WHERE hair='black') p1
JOIN (SELECT * FROM people WHERE height=175) p2
ON p1.SSN = p2.SSN
But with a database of millions of rows, as you can guess... NOT performant.
Thank you!
If the data store is not a constraint, I would use a document database, something like MongoDB, CosmosDB, or even Elasticsearch.
With Mongo, for example, you could leverage its schemaless nature and have a collection of People with one property per "attribute":
{
  "SSN": "123-45-6789",
  "eyeColor": "brown",
  "hairColor": "blond",
  "sex": "M"
}
Documents in this collection might have different properties, but that's not an issue. All you have to do then is put an index on each property and run your queries.
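If the data store does have to stay relational, the document-per-person shape can be flattened into one row per (attribute, value, SSN) with a composite key, and the multi-attribute lookup becomes an intersection. A minimal sketch; the table and column names are illustrative, not from the question:

-- One row per attribute per person; the primary key doubles as the index
-- that turns each attribute/value predicate into a seek.
CREATE TABLE person_attributes (
    attribute VARCHAR(50)  NOT NULL,
    value     VARCHAR(100) NOT NULL,
    ssn       CHAR(11)     NOT NULL,
    PRIMARY KEY (attribute, value, ssn)
);

-- Find SSNs that match every listed attribute/value pair.
SELECT ssn
FROM person_attributes
WHERE (attribute = 'hair color' AND value = 'black')
   OR (attribute = 'height'     AND value = '175')
GROUP BY ssn
HAVING COUNT(*) = 2;  -- must equal the number of predicates

Because the HAVING count simply equals the number of predicates, the query adapts as the identifying combinations evolve, with no schema change.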

SQL Server 2005 & Up - Concatenated Value based on Value in Joined Table

Prelude: The design of this database is truly horrible - this isn't the first "crooked" question I've asked, and it won't be the last. The question is what it is, and I'm only asking because A) I only have a couple of years of experience with SQL Server, and B) I've already been pounding my face on my keyboard for a couple of days trying to find a viable solution.
Having said all of that...
We have a database with two tables relevant to this issue. The schema is ridiculous, so I'm going to paraphrase so that it can be understood:
T_Customer & T_Task
T_Task holds data about various work that has been performed on behalf of a customer in T_Customer.
In T_Customer, there is a field called "Sort_Type" (again, paraphrasing...). In this field, there is a concatenated string of various fields from T_Task which, in the order specified, determine how the customer's report is produced in the client program. There are a total of 73 possible fields in T_Task that can be chosen as Sort_Type in T_Customer, and the user can choose up to 5 of them in any given order. For example:
T_Customer
Customer_ID | Sort_Type
------------|-------------------------------
1 | 'Task_Date,Task_Type,Task_ID'
2 | 'Task_Type,Destination'
3 | 'Route,Task_Type,Task_ID'
T_Task
Task_ID | Customer_ID | Task_Type | Task_Date | Route | Destination
--------|-------------|-----------|-----------|---------|-------------
12345 | 1 | 1 | 01/01/2017| '1 to 2'| '2'
12346 | 1 | 1 | 01/02/2017| '3 to 4'| '4'
12347 | 2 | 2 | 12/31/2016| '6 to 2'| '2'
12348 | 3 | 3 | 01/01/2017| '4 to 1'| '1'
In this example, Customer #1's report would be sorted/totaled by the Task_Date, then by Task_Type, then by Task_ID; but not simply by doing an ORDER BY. The client program requires one single value which can be ordered as a whole, single unit. As such...
Up until today, a field existed in T_Task called (paraphrasing....) 'MySort'. This field contained a concatenated string of fixed-width values filled in with zeroes and created according to the order and content of the values in T_Customer.Sort_Type. In this case:
Task_ID | Customer_ID | Task_Type | Task_Date | Route | Destination | MySort
--------|-------------|-----------|-----------|-------|-------------|-------
12345 | 1 | 1 | 01/01/2017| 1 to 2| 2 |'002017010100000000010000012345'
12346 | 1 | 1 | 01/02/2017| 3 to 4| 4 |'002017010200000000010000012346'
12347 | 2 | 2 | 12/31/2016| 6 to 2| 2 |'000000000000000000020000000002'
12348 | 3 | 3 | 01/01/2017| 4 to 1| 1 |'000040to0100000000030000012348'
During the printing phase of every single report, the program would search for the customer, find the values in T_Customer.Sort_Type, split them by commas, and then run an update on all of the tasks of that customer to update the value of MySort accordingly...
Can you guess what the problem is with this? Performance (not to mention chronic insanity)
I have been tasked with finding a more efficient way of performing this same task server-side, within SQL Server 2005 if possible, using whatever means will eventually allow me to return a result set including all of the details of the tasks requested, together with a concatenated string similar to the one used in the past (which the client program relies upon in order to sort and subtotal the report).
I've tried Views, UDFs in computed columns, and parameterized queries, but I know my limitations. I'm too inexperienced to know all of my options.
Question: Aside from quitting (not an option) or going berserk (considering it...), what methods might you use to solve this problem?
EDIT: Having received two questions about the MySort column already,
I'll explain a bit better.
T_Task.MySort =
    REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value1]))
  + CAST(T_Customer.Sort_Type[Value1] AS VARCHAR(10))
  + REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value2]))
  + CAST(T_Customer.Sort_Type[Value2] AS VARCHAR(10))
  + REPLICATE('0', 10 - LEN(T_Customer.Sort_Type[Value3]))
  + CAST(T_Customer.Sort_Type[Value3] AS VARCHAR(10))
  WHERE T_Customer.Customer_ID = T_Task.Customer_ID
...Up to T_Customer.Sort_Type[Value5].
Reminder: Those values are not constants at all, so the value of the
field MySort had to constantly be updated before printing a report.
The idea is to somehow remove the need to constantly update the field,
and instead return the string as part of the result set.
The resulting string should always be 50 characters in length. I
didn't do that here simply to save a bit of space and time - I chose
only 3 for the example. The real string would simply have another
twenty zeroes leading the value:
'00000000000000000000002017010100000000010000012345'
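One direction that removes the constant UPDATE is to compute the key at read time, with one CASE expression per sort slot. A sketch, assuming the comma-separated Sort_Type has already been split into hypothetical columns Sort1 through Sort5 on T_Customer (via a split UDF or a redesigned child table), and showing only three of the 73 candidate fields:

-- Each segment pads the chosen field's value with zeroes to a fixed
-- width of 10; five segments concatenate to the 50-character key.
SELECT t.Task_ID,
       RIGHT(REPLICATE('0', 10) +
             CASE c.Sort1
                 WHEN 'Task_Date' THEN CONVERT(VARCHAR(10), t.Task_Date, 112)
                 WHEN 'Task_Type' THEN CONVERT(VARCHAR(10), t.Task_Type)
                 WHEN 'Task_ID'   THEN CONVERT(VARCHAR(10), t.Task_ID)
                 ELSE ''
             END, 10)
     + RIGHT(REPLICATE('0', 10) +
             CASE c.Sort2
                 WHEN 'Task_Date' THEN CONVERT(VARCHAR(10), t.Task_Date, 112)
                 WHEN 'Task_Type' THEN CONVERT(VARCHAR(10), t.Task_Type)
                 WHEN 'Task_ID'   THEN CONVERT(VARCHAR(10), t.Task_ID)
                 ELSE ''
             END, 10)
       -- segments 3 through 5 follow the same pattern
       AS MySort
FROM T_Task AS t
JOIN T_Customer AS c
    ON c.Customer_ID = t.Customer_ID;

Everything here (CASE, CONVERT, REPLICATE) is available on SQL Server 2005, and the expression can live in a view so the client program keeps receiving MySort in the result set without any UPDATE step.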

Database Design - how to store quantities that are measured in different ways

I would like to know if the database design I have in mind for an online food store is good according to commonly followed standards and conventions.
Basically, my confusion is about how to store items whose quantity is measured in different ways.
For example, some items are measured in kilograms, while others are measured in number of packets.
Rice, say, is measured in kilograms, while something like noodles would be measured in number of packets.
So the tables are planned to have the fields below:
Items table with the fields: category, name, company, variant, and a boolean field named measured_in_packets.
For items where measured_in_packets is set to true, an entry in another table will hold the available packet sizes:
packet_sizes table with item_id and packet_size.
So if one product is available in multiple packet sizes (250 g, 500 g, etc.), a row would be created for each available size against the item id.
Does this sound like a good database design?
In a nutshell, you have items which have a quantity value, but that quantity can be measured in different measurement types. You gave examples such as kilograms and packets, and we can perhaps add others, such as litres for liquids.
One of the problems with the current solution is that it doesn't allow for any easy alteration or expansion. It also relies on checking a boolean field in order to make decisions (such as which table to join, I believe, based on your description).
Instead, a better approach would be to create a table containing the possible measurement types, such as kilograms or packets. Your items then simply have a foreign key to this table, and that tells you how the item is measured. This allows you to expand the types in the future, and no need to maintain a boolean flag, or do any other manual work.
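In place of the diagram from the original answer, a minimal DDL sketch of the same structure (the column types are assumptions):

-- Lookup table of how a quantity can be measured.
CREATE TABLE measurement_types (
    id                 INT PRIMARY KEY,
    name               VARCHAR(50) NOT NULL,
    measurement_symbol VARCHAR(20) NOT NULL
);

-- Each item points at exactly one measurement type.
CREATE TABLE items (
    id                   INT PRIMARY KEY,
    name                 VARCHAR(100) NOT NULL,
    quantity             DECIMAL(10, 2) NOT NULL,
    measurement_types_id INT NOT NULL REFERENCES measurement_types (id)
);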
So if the data in these tables looked like this:
items
+----+---------+----------+----------------------+
| id | name | quantity | measurement_types_id |
+----+---------+----------+----------------------+
| 1 | Rice | 50 | 1 |
| 2 | Noodles | 75 | 2 |
+----+---------+----------+----------------------+
measurement_types
+----+-----------+--------------------+
| id | name | measurement_symbol |
+----+-----------+--------------------+
| 1 | Kilograms | kg |
| 2 | Packets | packets |
+----+-----------+--------------------+
A practical example of this data using the following query:
SELECT items.name, items.quantity, measurement_types.measurement_symbol
FROM items
INNER JOIN measurement_types
ON measurement_types.id = items.measurement_types_id;
would yield this result:
+---------+----------+--------------------+
| name | quantity | measurement_symbol |
+---------+----------+--------------------+
| Rice | 50 | kg |
| Noodles | 75 | packets |
+---------+----------+--------------------+

Friendship Website Database Design

I'm trying to create a database for a friendship website I'm building. I want to store multiple attributes about the user, such as gender, education, pets, etc.
Solution #1 - User table:
id | age | birth day | City | Gender | Education | fav Pet | fav Hobby ...
--------------------------------------------------------------------------
0 | 38 | 1985 | New York | Female | University | Dog | Ping Pong
The problem I'm having is that the list of attributes goes on and on, and right now my user table has 20-something columns.
I feel I could normalize this by creating another table for each attribute (see below). However, this would create many joins, and I'm still left with a lot of columns in the user table.
Solution #2 - User table:
id | age | birth day | City | Gender | Education | fav Pet | fav hobbies
--------------------------------------------------------------------------
0 | 38 | 1985 | New York | 0 | 0 | 0 | 0
Pets table:
id | Pet Type
---------------
0 | Dog
Anyone have any ideas how to approach this problem? It feels like both solutions are wrong. What is the proper table design for this database?
There is more to this than meets the eye. First of all: if you have tons of attributes, many of which will likely be null for any specific row, and a very dynamic selection of attributes (i.e. new attributes will appear quite frequently during the code's lifecycle), you might want to ask yourself whether an RDBMS is the best way to materialize this essentially schema-less data. Maybe a document store would be a better fit?
If you do want to stay in the RDBMS world, the canonical answer is a property table (either a single one, or one per datatype) plus a table of property types:
Users.id | .name | .birthdate | .Gender | .someotherfixedattribute
------------------------------------------------------------------
1743     | Me.   | 01/01/1970 | M       | indeed

PropertyTypes.id | .name
------------------------
234              | pet
235              | hobby

Properties.uid | .pid | .content
--------------------------------
1743           | 234  | Husky dog
You have a comment and an answer that recommend (or at least suggest) an Entity-Attribute-Value (EAV) model.
There is nothing wrong with using EAV if your attributes need to be dynamic, and your system needs to allow adding new attributes post-deployment.
That said, if your columns and relationships are all known up front, and they don't need to be dynamic, you are much better off creating an explicit model. It will (generally) perform better and will be much easier to maintain.
Instead of a wide table with a field per attribute, or many attribute tables, you could make a skinny table with many rows, something like:
Attributes (id,user_id,attribute_type,attribute_value)
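Reading the attributes back out of a skinny table like that usually means pivoting. A sketch against the Attributes table above, assuming a Users table with id and name columns and 'pet'/'hobby' as sample attribute_type values:

SELECT u.id,
       u.name,
       MAX(CASE WHEN a.attribute_type = 'pet'   THEN a.attribute_value END) AS fav_pet,
       MAX(CASE WHEN a.attribute_type = 'hobby' THEN a.attribute_value END) AS fav_hobby
FROM Users AS u
LEFT JOIN Attributes AS a
    ON a.user_id = u.id
GROUP BY u.id, u.name;

Adding a new attribute type is just a new row value here, not a schema change, which is the usual trade-off of EAV designs.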
Ultimately the best solution depends greatly on how the data will be used. People can only have one DOB, but maybe you want to allow for multiple addresses (billing/mailing/etc.), so addresses might deserve a separate table.

Which of these definitions explains 1NF?

I found it vague when trying to look up the definition of 1NF on Google.
Some of the sites, like this one, say the table is in first normal form when it doesn't have any repeating set of columns.
Some others (most of them) say there shouldn't be multiple values of the same domain in the same column.
And some of them say all tables should have a primary key, while others don't talk about primary keys at all!
Can someone explain this for me?
A relation is in first normal form if it has the property that none of its domains has elements which are themselves sets.
From E. F. Codd (Oct 1972), "Further Normalization of the Data Base Relational Model".
This really gets down to what it's about, by the guy who invented the relational database model.
When something is in the first normal form, there are no columns which themselves contain sets of data.
The Wikipedia article on first normal form demonstrates this with a denormalized table:
Example1:
Customer
Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659, 555-776-4100
789 | Maria | Fernandez | 555-808-9633
This table is denormalized because Jane has a telephone number that is a set of values. Writing the table as follows is still in violation of 1NF:
Example2:
Customer
Customer ID | First Name | Surname | Telephone Number
123 | Robert | Ingram | 555-861-2025
456 | Jane | Wright | 555-403-1659
456 | Jane | Wright | 555-776-4100
789 | Maria | Fernandez | 555-808-9633
The proper way to normalize the table is to break it out into two tables.
Example3:
Customer
Customer ID | First Name | Surname
123 | Robert | Ingram
456 | Jane | Wright
789 | Maria | Fernandez
Phone
Customer ID | Telephone Number
123 | 555-861-2025
456 | 555-403-1659
456 | 555-776-4100
789 | 555-808-9633
Another way of looking at 1NF is as defined by Chris Date (from Wikipedia):
There's no top-to-bottom ordering to the rows.
There's no left-to-right ordering to the columns.
There are no duplicate rows.
Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else).
All columns are regular [i.e. rows have no hidden components such as row IDs, object IDs, or hidden timestamps].
Example2 lacks a unique key, which violates rule 3. Example1 violates rule 4 in that the telephone number column contains multiple values.
Only Example3 meets all of those requirements.
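Expressed as DDL, the Example3 split makes those requirements explicit; a sketch with assumed column types, where the primary keys are what rule out duplicate rows (rule 3) and each column holds a single value from its domain (rule 4):

CREATE TABLE Customer (
    Customer_ID INT PRIMARY KEY,
    First_Name  VARCHAR(50) NOT NULL,
    Surname     VARCHAR(50) NOT NULL
);

CREATE TABLE Phone (
    Customer_ID      INT NOT NULL REFERENCES Customer (Customer_ID),
    Telephone_Number VARCHAR(20) NOT NULL,
    PRIMARY KEY (Customer_ID, Telephone_Number)
);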
Further reading:
Simple Guide to Five Normal Forms in Relational Database Theory
The simplest explanation I have found is this modified definition copied from here:
1st Normal Form Definition
A database is in first normal form if it satisfies the following conditions:
1) Contains only atomic values
2) There are no repeating groups
