I have several different tables in my database and I'm trying to use Sphinx to do fast full-text searches. For ease of discussion, let's say the main records of interest are packing slips, one of which is included when an order ships. How do I use Sphinx to execute complex queries across all of these tables without completely denormalizing the database?
Each packing slip lists the order number, shipper, recipient, and the tracking number of each box included with the shipment. A separate table contains information about the order items. An additional table contains the customer address information. So, orders contain boxes and boxes contain items. (Example schema listed at the bottom of this question).
I would like to be able to query Sphinx for answers to questions like:
How many people who live on a street named "Maple" ordered an item with "large" in the description?
Which orders include the word "blue" in either the box description or the order items' description?
To answer these types of questions, I need to refer to several tables. Since Sphinx doesn't have JOINs, one option is to denormalize the database. Denormalizing with a view, so that each row represents an order item plus all of the data of its parent box and order, would result in billions of very wide rows. So I've been creating a separate index for each table instead. But that doesn't allow me to query across tables as a SQL JOIN would. Is there another solution?
Example database
CREATE TABLE orders (
id integer PRIMARY KEY,
date_ordered date,
customer_po varchar
);
INSERT INTO orders VALUES (1, '2012-12-13', NULL);
INSERT INTO orders VALUES (2, '2012-12-14', 'DF312442');
CREATE TABLE parties (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
party_type varchar,
company varchar,
city varchar,
state char(2)
);
INSERT INTO parties VALUES (1, 1, 'shipper', 'ACME, Inc.', 'New York', 'NY');
INSERT INTO parties VALUES (2, 1, 'recipient', 'Wylie Coyote Corp.', 'Flagstaff', 'AZ');
INSERT INTO parties VALUES (3, 2, 'shipper', 'Cyberdyne', 'Las Vegas', 'NV');
-- Please disregard the fact that this design permits multiple shippers and multiple recipients
-- per order. This is a vastly simplified version of the system I'm working on.
CREATE TABLE boxes (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
tracking_num varchar NOT NULL,
description varchar NOT NULL
);
INSERT INTO boxes VALUES (1, 1, '1234567890', 'household goods');
INSERT INTO boxes VALUES (2, 1, '0987654321', 'kitchen appliances');
INSERT INTO boxes VALUES (3, 2, 'ABCDE12345', 'audio equipment');
CREATE TABLE box_contents (
id integer PRIMARY KEY,
order_id integer NOT NULL REFERENCES orders(id),
box integer NOT NULL REFERENCES boxes(id),
qty_units integer,
description varchar
);
INSERT INTO box_contents VALUES (1, 1, 1, 4, 'cookbook');
INSERT INTO box_contents VALUES (2, 1, 1, 2, 'baby bottle');
INSERT INTO box_contents VALUES (3, 1, 2, 1, 'television');
INSERT INTO box_contents VALUES (4, 2, 3, 2, 'lamp');
You put the JOIN in the sql_query that builds the index. The tables remain normalized, but you denormalize when building the index.
It's only a basic example, but your query would be something like:
sql_query = SELECT o.id,customer_po,UNIX_TIMESTAMP(date_ordered) AS date_ordered, \
GROUP_CONCAT(DISTINCT party_type) AS party_type, \
GROUP_CONCAT(DISTINCT company) AS company, \
GROUP_CONCAT(DISTINCT city) AS city, \
GROUP_CONCAT(DISTINCT description) AS description \
FROM orders o \
INNER JOIN parties p ON (o.id = p.order_id) \
INNER JOIN box_contents b ON (o.id = b.order_id) \
GROUP BY o.id \
ORDER BY NULL
Update: alternatively, you can use sql_joined_field to do the same thing while avoiding actual joins in sql_query. Sphinx then does the join process for you, as sketched below.
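A minimal sketch of what that might look like for the example schema, purely as an illustration (the joined field names are assumptions); note that each joined query must return document IDs in ascending order:
# main query only touches the orders table
sql_query        = SELECT id, customer_po, UNIX_TIMESTAMP(date_ordered) AS date_ordered FROM orders
# Sphinx fetches these separately and merges them into full-text fields by document ID
sql_joined_field = contents from query; \
    SELECT order_id, description FROM box_contents ORDER BY order_id ASC
sql_joined_field = parties from query; \
    SELECT order_id, CONCAT_WS(' ', party_type, company, city) FROM parties ORDER BY order_id ASC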
Related
I'm not very experienced with SQLite, and I want to perform some operations based on a local database. My data consists of many entries for one ID, and there are some additional IDs that specify the data. Based on multiple IDs, there should be a unique combination for each entry in the table.
I would like to select (if it exists) a row based on a combination of IDs, and either insert or update some columns in that row.
I haven't really tried anything because I can't find where to start.
But to illustrate what I mean, I would think of something like this:
UPDATE OR REPLACE INTO my_table (ID,Part_ID,Location_ID,Torque) VALUES (2,6,4,100) WHERE (ID,Part_ID,Location_ID) = (2,6,4)
First, you need a unique constraint for the combination of the columns ID, Part_ID and Location_ID, which you can define with a unique index:
CREATE UNIQUE INDEX idx_my_table ON my_table (ID, Part_ID, Location_ID);
Then use UPSERT:
INSERT INTO my_table (ID, Part_ID, Location_ID, Torque) VALUES (2, 6, 4, 100)
ON CONFLICT(ID, Part_ID, Location_ID) DO UPDATE
SET Torque = EXCLUDED.Torque;
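As a quick, hypothetical illustration of the behavior (assuming the table and unique index above, and SQLite 3.24 or newer): the first statement inserts the row, and re-running it with a different Torque updates the existing row instead of inserting a duplicate.
INSERT INTO my_table (ID, Part_ID, Location_ID, Torque) VALUES (2, 6, 4, 100)
ON CONFLICT(ID, Part_ID, Location_ID) DO UPDATE SET Torque = EXCLUDED.Torque;
-- Same key, new Torque: the conflict clause fires and updates the existing row
INSERT INTO my_table (ID, Part_ID, Location_ID, Torque) VALUES (2, 6, 4, 120)
ON CONFLICT(ID, Part_ID, Location_ID) DO UPDATE SET Torque = EXCLUDED.Torque;
-- The row now holds Torque = 120
SELECT ID, Part_ID, Location_ID, Torque FROM my_table WHERE ID = 2 AND Part_ID = 6 AND Location_ID = 4;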
I am building functionality to estimate inventory for my ad-serving platform. The fields I am trying to estimate on, with their cardinalities, are listed below:
FIELD: CARDINALITY
location: 10000 (bengaluru, chennai etc..)
n/w speed : 6 (w, 4G, 3G, 2G, G, NA)
priceRange : 10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
users: contains number of users falling under any of the above combination.
Ex. {'location':'bengaluru', 'n/w':'4G', priceRange:8, users: 1000}
means 1000 users are from bengaluru having 4G and priceRange = 8
So the total number of combinations can be 10000 * 6 * 10 = 600,000, and in the future more fields may be added, up to around 29 (currently there are 3: location, n/w, priceRange), so the total number of combinations can reach the order of 10 million. Now I want to estimate how many users fall under any given combination.
The queries I will need are as follows:
1) find all users who are from location:bengaluru , n/w:3G, priceRange: 6
2) find all users from bengaluru
3) Find all users falling under n/w: 3G and priceRange: 8
What is the best possible way to approach this?
Which database is best suited for this requirement? What indexes do I need to build? Will a compound index help? If yes, then how? Any help is appreciated.
Here's my final answer:
Create table Attribute(
ID int,
Name varchar(50));
Create table AttributeValue(
ID int,
AttributeID int,
Value varchar(50));
Create table UserAttributeValue(
userID int,
AttributeID int,      -- references Attribute.ID
AttributeValue int);  -- references AttributeValue.ID
Create table User(
ID int);
Insert into user (ID) values (1),(2),(3),(4),(5);
Insert into Attribute (ID,Name) Values (1,'Location'),(2,'nwSpeed'),(3,'PriceRange');
Insert into AttributeValue values
(1,1,'bengaluru'),(2,1,'chennai'),
(3,2, 'w'), (4, 2,'4G'), (5,2,'3G'), (6,2,'2G'), (7,2,'G'), (8,2,'NA'),
(9,3,'1'), (10,3,'2'), (11,3,'3'), (12,3,'4'), (13,3,'5'), (14,3,'6'), (15,3,'7'), (16,3,'8'), (17,3,'9'), (18,3,'10');
Insert into UserAttributeValue (userID, AttributeID, AttributeValue) values
(1,1,1),
(1,2,5),
(1,3,9),
(2,1,1),
(2,2,4),
(3,2,6),
(2,3,13),
(4,1,1),
(4,2,4),
(4,3,13),
(5,1,1),
(5,2,5),
(5,3,13);
Select USERID
from UserAttributeValue
where (AttributeID,AttributeValue) in ((1,1),(2,4))
GROUP BY USERID
having count(distinct concat(AttributeID,AttributeValue))=2
Now, if you need a count, wrap userID in COUNT and divide by the number of attributes passed in: each user has one record per matched attribute, so dividing by the number of attributes gives the "count of users".
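As an alternative sketch (not the division approach just described), you can also wrap the matching query in an outer COUNT; this assumes the same sample data:
-- Count the users matching location value 1 (bengaluru) and n/w value 4 (4G)
Select count(*) as user_count
from (
      Select USERID
      from UserAttributeValue
      where (AttributeID,AttributeValue) in ((1,1),(2,4))
      GROUP BY USERID
      having count(distinct concat(AttributeID,AttributeValue))=2
     ) matched_users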
This allows the attributes and attribute values per user to grow to any number without changes to the UI or database, provided the UI is designed correctly.
By treating each data point as an attribute and storing them in one place, we can enforce database integrity.
The Attribute and AttributeValue tables become lookups for UserAttributeValue, so you can translate the IDs back to the attribute name and value.
This also means we only have four tables: User, Attribute, AttributeValue, and UserAttributeValue.
Technically you don't have to store AttributeID on UserAttributeValue, but for performance reasons on later joins/reporting I think you'll find it beneficial.
You need to add proper primary keys, foreign keys, and indexes to the tables. They should be fairly self-explanatory. On UserAttributeValue I would have a few composite indexes, each with a different column order of the unique key (see the sketch below). It just depends on the type of reporting/analysis you'll be doing, but adding keys as performance tuning becomes necessary is commonplace.
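A minimal sketch of what those keys and composite indexes might look like (the index names and column orders are assumptions; tune them to your reporting patterns):
-- Primary keys on the lookup tables
Alter table Attribute add primary key (ID);
Alter table AttributeValue add primary key (ID);
-- One value per attribute per user, plus a foreign key back to the attribute lookup
Alter table UserAttributeValue add primary key (userID, AttributeID);
Alter table UserAttributeValue add foreign key (AttributeID) references Attribute(ID);
-- Composite indexes with different leading columns for different filter patterns
Create index IX_UAV_Attr_Value_User on UserAttributeValue (AttributeID, AttributeValue, userID);
Create index IX_UAV_Value_Attr_User on UserAttributeValue (AttributeValue, AttributeID, userID);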
Assumptions:
You're OK with all data values being varchar data in all cases.
If needed, you could add a datatype, precision, and scale on the Attribute table and let the UI cast the attribute value as needed. But since the values all live in the same field in the database, they all have to be the same datatype and of the same precision/scale.
Pivot tables will likely be needed to display the data across columns, and you know how to handle those (and your engine supports them!).
Gotta say I loved the mental exercise, but I would still appreciate feedback from others on SO. I've used this approach in one system I've developed, and it's been in two I've supported. There are some challenges, but it does follow third normal form DB design (except for the replicated AttributeID in UserAttributeValue, which is there for a performance gain in reporting/filtering).
Let's say that we have to store information of different types of product in a database. However, these products have different specifications. For example:
Phone: cpu, ram, storage...
TV: size, resolution...
We want to store each specification in a column of a table, and all the products (whatever the type) must have a different ID.
To comply with that, now I have one general table named Products (with an auto increment ID) and one subordinate table for each type of product (ProductsPhones, ProductsTV...) with the specifications and linked with the principal with a Foreign Key.
I find this solution inefficient since the table Products has only one column (the auto incremented ID).
I would like to know if there is a better approach to solve this problem using relational databases.
The short answer is no. The relational model is a first-order logical model, meaning predicates can vary over entities but not over other predicates. That means dependent types and EAV models aren't supported.
EAV models are possible in SQL databases, but they don't qualify as relational since the domain of the value field in an EAV row depends on the value of the attribute field (and sometimes on the value of the entity field as well). Practically, EAV models tend to be inefficient to query and maintain.
PostgreSQL supports shared sequences, which allow you to ensure unique auto-incremented IDs without a common supertype table. However, the supertype table may still be a good idea for FK constraints.
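For illustration, a minimal PostgreSQL sketch of the shared-sequence idea (the table and column names are hypothetical):
CREATE SEQUENCE product_id_seq;
CREATE TABLE products_phones (
    id         integer PRIMARY KEY DEFAULT nextval('product_id_seq'),
    cpu        text,
    ram_gb     integer,
    storage_gb integer
);
CREATE TABLE products_tvs (
    id          integer PRIMARY KEY DEFAULT nextval('product_id_seq'),
    size_inches integer,
    resolution  text
);
-- Both tables draw IDs from the same sequence, so no phone and TV ever share an ID.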
You may find some use for your Products table later to hold common attributes like Type, Serial number, Cost, Warranty duration, Number in stock, Warehouse, Supplier, etc...
Having a Products table is fine. You can put all the columns common across all types there, like product name, description, cost, and price, just to name some. So it's not just an auto-increment ID. Having an internal ID of type int or long int as the primary key is recommended. You may also add another field, "code" or whatever you want to call it, for a user-entered or user-friendly identifier, which is common with product management systems. Make sure you index it if it is used in searching or query criteria; a rough sketch follows.
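A minimal sketch of such a Products table (column names and sizes are illustrative assumptions):
CREATE TABLE Products (
    ID          integer PRIMARY KEY,   -- internal surrogate key (auto-increment or sequence)
    Code        varchar(30) NOT NULL,  -- user-entered / user-friendly identifier
    Name        varchar(100) NOT NULL,
    Description varchar(500),
    Cost        numeric(10,2),
    Price       numeric(10,2)
);
-- Index the user-facing code since it will be used in searches
CREATE UNIQUE INDEX idx_products_code ON Products (Code);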
HTH
While this can't be done completely relationally, you can still normalize your tables some and make it a little easier to code around.
You can have these tables:
-- what are the products?
Products (Id, ProductTypeId, Name)
-- what kind of product is it?
ProductTypes (Id, Name)
-- what attributes can a product have?
Attributes (Id, Name, ValueType)
-- what are the attributes that come with a specific product type?
ProductTypeAttributes (Id, ProductTypeId, AttributeId)
-- what are the values of the attributes for each product?
ProductAttributes (ProductId, ProductTypeAttributeId, Value)
So for a Phone and TV:
ProductTypes (1, Phone) -- a phone type of product
ProductTypes (2, TV) -- a tv type of product
Attributes (1, ScreenSize, integer) -- how big is the screen
Attributes (2, Has4G, boolean) -- does it get 4g?
Attributes (3, HasCoaxInput, boolean) -- does it have an input for coaxial cable?
ProductTypeAttributes (1, 1, 1) -- a phone has a screen size
ProductTypeAttributes (2, 1, 2) -- a phone can have 4g
-- a phone does not have coaxial input
ProductTypeAttributes (3, 2, 1) -- a tv has a screen size
ProductTypeAttributes (4, 2, 3) -- a tv can have coaxial input
-- a tv does not have 4g (simple example)
Products (1, 1, CoolPhone) -- product 1 is a phone called coolphone
Products (2, 1, AwesomePhone) -- prod 2 is a phone called awesomephone
Products (3, 2, CoolTV) -- prod 3 is a tv called cooltv
Products (4, 2, AwesomeTV) -- prod 4 is a tv called awesometv
ProductAttributes (1, 1, 6) -- coolphone has a 6 inch screen
ProductAttributes (1, 2, True) -- coolphone has 4g
ProductAttributes (2, 1, 4) -- awesomephone has a 4 inch screen
ProductAttributes (2, 2, False) -- awesomephone has NO 4g
ProductAttributes (3, 3, 70) -- cooltv has a 70 inch screen
ProductAttributes (3, 4, True) -- cooltv has coax input
ProductAttributes (4, 3, 19) -- awesometv has a 19 inch screen
ProductAttributes (4, 4, False) -- awesometv has NO coax input
The reason this is not fully relational is that you'll still need to evaluate the value type (bool, int, etc) of the attribute before you can use it in a meaningful way in your code.
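For example, a hedged sketch of a query against this schema, assuming the sample rows above are stored in real tables: find every TV with a screen larger than 40 inches. The CAST is where that value-type evaluation happens.
SELECT p.Id, p.Name
FROM Products p
JOIN ProductTypes pt           ON pt.Id = p.ProductTypeId
JOIN ProductAttributes pa      ON pa.ProductId = p.Id
JOIN ProductTypeAttributes pta ON pta.Id = pa.ProductTypeAttributeId
JOIN Attributes a              ON a.Id = pta.AttributeId
WHERE pt.Name = 'TV'
  AND a.Name = 'ScreenSize'
  AND CAST(pa.Value AS integer) > 40;   -- Value is stored as text, so cast before comparing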
OK, SQL Server fans, I have an issue with a legacy stored procedure that sits inside a SQL Server 2008 R2 instance that I have inherited, along with the PROD data, which is, to say the least, horrible. Also, I can NOT make any changes to the data nor the table structures.
So here is my problem: the stored procedure in question runs daily and is used to update the employee table. As you can see from my example, the incoming data (#New_Employees) contains the updated data, and I need to use it to update the employee data stored in the #Existing_Employees table. Throughout the years, different formats of the EMP_ID value have been used and must be maintained as is (I fought and lost that battle). Thankfully, I have been successful in changing the format of the EMP_ID column in the #New_Employees table (yeah!), and any new records will use this format.
So now you may see my problem: I need to update the ID column in the #New_Employees table with the corresponding ID from the #Existing_Employees table by matching (that's right, you guessed it) on the EMP_ID columns. I came up with an extremely hacky way to handle the disparate formats of the EMP_ID columns, but it is very slow considering the number of rows that I need to process (1M+).
I thought of creating a staging table where I could simply cast the EMP_ID columns to an INT and then back to an NVARCHAR in each table to remove the leading zeros, and I am sort of leaning that way, but I wanted to see if there was another way to handle this dysfunctional data. Any constructive comments are welcome.
IF OBJECT_ID(N'TempDB..#NEW_EMPLOYEES') IS NOT NULL
DROP TABLE #NEW_EMPLOYEES
CREATE TABLE #NEW_EMPLOYEES(
ID INT
,EMP_ID NVARCHAR(50)
,NAME NVARCHAR(50))
GO
IF OBJECT_ID(N'TempDB..#EXISTING_EMPLOYEES') IS NOT NULL
DROP TABLE #EXISTING_EMPLOYEES
CREATE TABLE #EXISTING_EMPLOYEES(
ID INT PRIMARY KEY
,EMP_ID NVARCHAR(50)
,NAME NVARCHAR(50))
GO
INSERT INTO #NEW_EMPLOYEES
VALUES(NULL, '00123', 'Adam Arkin')
,(NULL, '00345', 'Bob Baker')
,(NULL, '00526', 'Charles Nelson O''Reilly')
,(NULL, '04321', 'David Numberman')
,(NULL, '44321', 'Ida Falcone')
INSERT INTO #EXISTING_EMPLOYEES
VALUES(1, '123', 'Adam Arkin')
,(2, '000345', 'Bob Baker')
,(3, '526', 'Charles Nelson O''Reilly')
,(4, '0004321', 'Ed Sullivan')
,(5, '02143', 'Frank Sinatra')
,(6, '5567', 'George Thorogood')
,(7, '0000123-1', 'Adam Arkin')
,(8, '7', 'Harry Hamilton')
-- First Method - Not Successful
UPDATE NE
SET ID = EE.ID
FROM
#NEW_EMPLOYEES NE
LEFT OUTER JOIN #EXISTING_EMPLOYEES EE
ON EE.EMP_ID = NE.EMP_ID
SELECT * FROM #NEW_EMPLOYEES
-- Second Method - Successful but Slow
UPDATE NE
SET ID = EE.ID
FROM
dbo.#NEW_EMPLOYEES NE
LEFT OUTER JOIN dbo.#EXISTING_EMPLOYEES EE
ON CAST(CASE WHEN NE.EMP_ID LIKE N'%[^0-9]%'
THEN NE.EMP_ID
ELSE LTRIM(STR(CAST(NE.EMP_ID AS INT))) END AS NVARCHAR(50)) =
CAST(CASE WHEN EE.EMP_ID LIKE N'%[^0-9]%'
THEN EE.EMP_ID
ELSE LTRIM(STR(CAST(EE.EMP_ID AS INT))) END AS NVARCHAR(50))
SELECT * FROM #NEW_EMPLOYEES
the number of rows that I need to process (1M+).
A million employees? Per day?
I think I would add a 3rd table:
create table #ids ( id INT NOT NULL PRIMARY KEY
                  , emp_id NVARCHAR(50) NOT NULL UNIQUE );
Populate that table using your LTRIM(STR(CAST(...))) algorithm (ahem), and update the employees directly from a join of those three tables.
I recommend using an ANSI update, not Microsoft's nonstandard UPDATE ... FROM, because the ANSI version prevents nondeterministic results in cases where the FROM produces more than one matching row. A rough sketch of the idea follows.
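A hedged sketch of one way to read that suggestion, reusing the temp tables and the digit-check CASE expression from the question (the exact normalization rule is an assumption):
-- Stage the existing employees with a normalized EMP_ID
INSERT INTO #ids (id, emp_id)
SELECT EE.ID,
       CAST(CASE WHEN EE.EMP_ID LIKE N'%[^0-9]%' THEN EE.EMP_ID
                 ELSE LTRIM(STR(CAST(EE.EMP_ID AS INT))) END AS NVARCHAR(50))
FROM #EXISTING_EMPLOYEES EE;
-- ANSI-style update: a correlated subquery instead of UPDATE ... FROM
UPDATE #NEW_EMPLOYEES
SET ID = (SELECT i.id
          FROM #ids i
          WHERE i.emp_id = CAST(CASE WHEN #NEW_EMPLOYEES.EMP_ID LIKE N'%[^0-9]%' THEN #NEW_EMPLOYEES.EMP_ID
                                     ELSE LTRIM(STR(CAST(#NEW_EMPLOYEES.EMP_ID AS INT))) END AS NVARCHAR(50)));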
This question was asked in an interview. Design an organizational structure where an employee can have direct reports and indirect reports (that is, reportees of reportees). The design should be such that a single query can retrieve either direct reportees, indirect reportees, or both.
I suggested,
Employee
----------
id
name
Reportee
------
emp_id FK
reportee_id FK
isDirect
The interviewer said the optimal solution is
Employee
-------
id
name
reporting_path like (a>b>c)
Adding an additional table takes more space, but the query will execute faster. I said that, because of the string matching, the path-based approach is bad and yields poor performance.
So which approach is better?
The interviewer's approach is dumb because it does not use referential integrity.
For a purely hierarchical model (an employee cannot report to more than one boss), this is the best approach:
create table employees (
employee_id int primary key,
name varchar(whatever) not null,
supervisor_id int null references employees(employee_id)
);
insert into employees (employee_id, name, supervisor_id) values
(1, 'Big Boss Bill', null),
(2, 'Vice President Victor', 1),
(3, 'Underling Ulysses', 2),
(4, 'Subordinate Sam', 2);
You can then use Recursive Common Table Expressions to query reports.
Some example queries here:
http://blog.databasepatterns.com/2014/02/trees-paths-recursive-cte-postgresql.html
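For instance, a minimal sketch of such a recursive CTE against the employees table above, listing every direct and indirect report of employee 1 ('Big Boss Bill'):
WITH RECURSIVE reports AS (
    -- anchor: the boss's direct reports
    SELECT employee_id, name, supervisor_id
    FROM employees
    WHERE supervisor_id = 1
    UNION ALL
    -- recursive step: reports of people already found (indirect reports)
    SELECT e.employee_id, e.name, e.supervisor_id
    FROM employees e
    JOIN reports r ON e.supervisor_id = r.employee_id
)
SELECT * FROM reports;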