How to make a select query more efficient?

I have a table Customers with millions of records and 701 attributes (columns). I receive a CSV file that has one row and 700 columns. On the basis of these 700 column values I have to extract the ids from the Customers table.
One obvious way is to fire a single select query with all 700 values in the where clause.
My question: if I first fetch a smaller result set using only one attribute in the where clause, then filter that again on the second attribute in the where clause ... and repeat this process for all attributes, would it be any faster?
Or can you suggest any other method that could make it faster?

Try to understand the logic of those 700 attributes. There might be dependencies between them that can help reduce the number of attributes to something more "realistic".
I would then use the same technique to see if I can run smaller queries which would benefit from indexes on the main table. Each time I would store the result in a temporary table (reducing the number of rows in the temp table), index the temp table for the next step, and do it again until I have the final result.
Example: if you have date attributes, try to isolate all records for the year, then the day, etc.
Try to keep the complex predicates for the end, as they will be running against smaller temp tables.
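A minimal sketch of that stepwise approach, in SQL Server syntax; the attribute names and literal values below are placeholders, not from the original question:

-- Hypothetical sketch: filter on the most selective attribute first,
-- store the result in a temp table, index it, then keep narrowing.
SELECT id, country, plan_type              -- carry only what later steps need
INTO #step1
FROM Customers
WHERE signup_date = '2023-06-01';          -- most selective attribute first

CREATE INDEX ix_step1_country ON #step1 (country);

SELECT id, plan_type
INTO #step2
FROM #step1
WHERE country = 'DE';                      -- next filter runs against a much smaller table

-- ...repeat for the remaining attributes, keeping the most complex
-- predicates for the end, when the temp table is smallest.
SELECT id FROM #step2 WHERE plan_type = 'premium';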


Performance strategies for selecting large amounts of data from child tables

I have a fairly standard database structure for storing invoices and their line items. We have an AccountingDocuments table for header details and an AccountingDocumentItems child table. The child table has a foreign key of AccountingDocumentId with an associated non-clustered index.
Most of our Sales reports need to select almost all the columns in the child table, based on a passed date range and branch. So the basic query would be something like:
SELECT
    adi.TotalInclusive,
    adi.Description, etc.
FROM
    AccountingDocumentItems adi
    LEFT JOIN AccountingDocuments ad ON ad.AccountingDocumentId = adi.AccountingDocumentId
WHERE
    ad.DocumentDate > '1 Jan 2022' AND ad.DocumentDate < '5 May 2022'
    AND ad.SiteId = 'x'
This results in an index seek on the AccountingDocuments table, but due to the select requirements on AccountingDocumentItems we're seeing a full index scan here. I'm not certain how to index the child table given the filtering is taking place on the parent table.
At present there are only around 2m rows in the line items table, and the query performs quite well. My concern is how this will work once the table is significantly larger and what strategies there are to ensure good performance on these reports going forward.
I'm aware the answer may simply be to keep adding hardware, but I want to investigate alternative strategies first.
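For reference, a rough sketch of the kind of index change that is commonly tried in this situation: widening the child table's existing non-clustered foreign-key index with an INCLUDE list, so the join from AccountingDocuments can seek into the child table and still cover the reported columns. The index name and INCLUDE columns below are illustrative, not from the original post.

CREATE NONCLUSTERED INDEX IX_AccountingDocumentItems_AccountingDocumentId
    ON AccountingDocumentItems (AccountingDocumentId)
    INCLUDE (TotalInclusive, Description)      -- mirror the columns the reports actually select
    WITH (DROP_EXISTING = ON);                 -- only valid if replacing an index of the same name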

Large table issue in Snowflake

I have large tables with about 2.2 GB of data. When I use SELECT * to select a row from the tables, it takes about 14 minutes to run. Is there a method to speed up this query?
Here is some other information that might be helpful:
~ 2 million rows
~ 25k columns
data type: Varchar
Warehouse:
Size: Computer_WH
Clusters: min:1, max:2
Auto Suspension: 10 minutes
Owner: ACCOUNTADMIN
2 GB is not that large, and really should not be taking 14 minutes on an X-SMALL warehouse.
First rule of Snowflake: don't SELECT * FROM x, for two reasons.
First, query compilation has to wait for the metadata of all partitions to be loaded before the plan can start being built, as some partitions might have more data than the first partitions; the output shape cannot be planned until all of it is known.
Second, when you "select all columns", all columns are loaded from disk, and if your data is unstructured JSON it has to rebuild all that data, which is "relatively expensive". You should name the columns you want, and only the columns you want.
If you want to join to another table to do some filtering, select only the columns needed for the filter and the join, get the set of keys you want, and then re-join those results to the base table (sometimes as a second query) so pruning can happen.
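As a sketch of that pattern (the table and column names are invented, not from the question):

-- Resolve the filter/join to a small set of keys first...
CREATE TEMPORARY TABLE matched_keys AS
SELECT t.id
FROM   big_table t
JOIN   lookup_table l ON l.id = t.id
WHERE  l.region = 'EMEA';

-- ...then go back to the base table with just those keys and only the
-- columns actually needed, so partition pruning can kick in.
SELECT b.id, b.col_a, b.col_b
FROM   big_table b
JOIN   matched_keys k ON k.id = b.id;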
Sigh, I have just looked at your stats a little harder: 25K columns... sigh. This is not a database, this is something very painful.
As a strong opinion, you cannot have a row of data that meaningfully has 25K related columns. You should have a table with a primary key, and something like 25K rows of subtype data, one per attribute. Yes, it means you have to explode the data back out via a PIVOT or the like, but it's more honest about the relations present in the data, and about how to process this volume of data.
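A minimal sketch of that tall layout, assuming a hypothetical attribute table and pivoting only the handful of attributes a given query actually needs:

-- One row per (entity, attribute) instead of 25K physical columns.
CREATE TABLE customer_attributes (
    customer_id     NUMBER,
    attribute_name  VARCHAR,
    attribute_value VARCHAR
);

-- Explode a few attributes back out side by side only when needed.
SELECT *
FROM customer_attributes
    PIVOT (MAX(attribute_value) FOR attribute_name IN ('age', 'city', 'segment')) AS p;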
With columnar databases, each column in a table has its own file. Previously (in older DBMSs) each table was one file. If you have 25,000 columns you'd be selecting 25,000 files.
Some of these files are big and some are small; that depends on the data type and the number of distinct values.
If you found a column that, say, had 100 distinct values and selected just that column from your table, I'd guess you'd get sub-second response times.
So back to your problem ... instead of choosing all the columns (*) why not just choose some interesting ones?

Storing processed results of connection in RDBMS

A CSV file contains the following two columns: admission_number, project_name.
The relationship between the two entities is many-to-many: a specific admission_number can work on multiple projects, and a specific project may have multiple admission_numbers.
The data looks as follows. Initially there are 1,000 million rows, and the data keeps updating on a daily basis, so this table will grow to 1,300 million rows.
admission_number,project_name
1234567890,ABC1234567
1234567890,ABC1234568
1234567891,ABC1234569
1234567892,ABC1234569
1234567893,ABC1234570
1234567894,ABC1234567
1234567895,ABC1234567
For a specific admission number (let's say 1234567890), I want to know all the admission_numbers that are working on the same projects (ABC1234567, ABC1234568). The output will be
1234567894,1234567895.
Explanation: for admission number '1234567890' the project names are 'ABC1234567' and 'ABC1234568', and the other admission_numbers working on these two projects are '1234567894' and '1234567895'.
I came up with two solutions. To store the data, an RDBMS will be used.
Approach 1: use two retrieval queries: the first query returns all the project_names for a specific 'admission_number', and the second query returns all the admission_numbers for those project_names.
select admission_number from table where project_name IN (select project_name from table where admission_number = '1234567890');
Approach 2: in this approach, before loading, I preprocess the results and store them directly in the database. I store only the connected 'admission_numbers'.
E.g. for project_name 'ABC1234567', these 3 admission_numbers are working: '1234567890', '1234567894', '1234567895'. I want to store all connected admission_numbers in a table with two columns (number, connected_number), like ('1234567890','1234567894'), ('1234567890','1234567895'), ('1234567894','1234567895'), and the query will work on both columns (number and connected_number).
But in this approach there will be many rows: if a specific project_name 'p' has n admission_numbers, then the total number of rows for it will be n(n-1)/2.
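For what it's worth, a sketch of how Approach 2's pair table would be populated (table and column names are illustrative); the self-join below is exactly what produces the n(n-1)/2 rows per project:

INSERT INTO connected_numbers (number, connected_number)
SELECT a.admission_number, b.admission_number
FROM   project_members a
JOIN   project_members b
       ON  a.project_name = b.project_name
       AND a.admission_number < b.admission_number;   -- each unordered pair stored once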
How can I store all the connected admission_numbers in an RDBMS? Loading the data can be slow, but retrieval should be fast.
Do not optimize the data structure. It would only cause problems.
Create a simple table with two columns, one for each ID, and create an index on each column.
The RDBMS will build and maintain an index of the column values, which will enable fast lookups for a specific record.
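A minimal sketch of that suggestion, with the Approach 1 retrieval query on top of it (table, index, and type choices are illustrative):

CREATE TABLE admission_project (
    admission_number BIGINT      NOT NULL,
    project_name     VARCHAR(20) NOT NULL
);

CREATE INDEX ix_ap_admission ON admission_project (admission_number);
CREATE INDEX ix_ap_project   ON admission_project (project_name);

-- All colleagues of 1234567890: both steps can use an index.
SELECT DISTINCT p2.admission_number
FROM   admission_project p1
JOIN   admission_project p2
       ON  p2.project_name = p1.project_name
       AND p2.admission_number <> p1.admission_number
WHERE  p1.admission_number = 1234567890;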

Performance tuning on PATINDEX with JOIN

I have a table called tbl_WHO with 90 million records and a temp table #EDU with just 5 records.
I want to do pattern matching on the name field between the two tables (tbl_WHO and #EDU).
Query: the following query took 00:02:13 to execute.
SELECT Tbl.PName, Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
    ON (ISNULL(PATINDEX(Tbl.PName, Tmp.FirstName), '0') > 0)
Sometimes I have to do pattern matching on more than one columns like:
SELECT Tbl.PName, Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
    ON (
            ISNULL(PATINDEX(Tbl.PName, Tmp.FirstName), '0') > 0
        AND ISNULL(PATINDEX('%' + Tbl.PAddress + '%', Tmp.Addres), '0') > 0
        OR  ISNULL(PATINDEX('%' + Tbl.PZipCode, Tmp.ZCode), '0') > 0
       )
Note: there is an index created on the columns that appear in the conditions.
Is there any other way to tune the query performance?
Searches starting with % are not sargable, so even with an index on the given column you are not going to be able to use it effectively.
Are you sure you need to search with PATINDEX each time? A table with 90 million records is not huge, but having many columns and not applying normalization correctly can certainly decrease performance.
I would advise revising the table and checking whether the data can be normalized further. This can lead to better performance in particular cases and decreased table storage as well.
For example, the zip code can be moved to a separate table, and instead of using the zip code string you can join on an integer column. Try to normalize the address further: do you have city, street or block, and street or block number? The names: if you need to search by first and last names, just split the names into separate columns.
For string values, the data can be sanitized: for example, trim whitespace at the beginning and end. With such data, you can create hash indexes and get extremely fast equality searches.
What I want to say is that if you normalize your data and add some rules (at the database and application level) to ensure the input data is correct, you are going to get very nice performance. It is the long way, but it is easier to do now than later (and you are already late).
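As a rough illustration of the zip-code part of that advice (everything beyond tbl_WHO and #EDU is hypothetical, including the assumption that #EDU carries the resolved ZipCodeId):

-- Zip codes move to their own table; the main table keeps an integer key,
-- so the lookup becomes an equality on an int instead of a '%...' pattern.
CREATE TABLE ZipCode (
    ZipCodeId INT IDENTITY PRIMARY KEY,
    ZipCode   VARCHAR(10) NOT NULL UNIQUE
);

ALTER TABLE tbl_WHO ADD ZipCodeId INT NULL;   -- populated from the old string column

-- With split-out, trimmed columns the join is sargable:
SELECT Tbl.PName, Tbl.PStatus
FROM   tbl_WHO Tbl
INNER JOIN #EDU Tmp ON Tmp.ZipCodeId = Tbl.ZipCodeId;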

Dynamic PIVOT with varchar columns

I'm trying to pivot rows into columns. I basically have lots of rows where every N rows corresponds to one row of the table I'd like to produce as a result set. I'll give a short example:
I have a table structure like this:
Keep in mind that I removed lots of rows to simplify this example. Every 6 rows means 1 row in the result set, which I would like to look like this:
All columns are varchar types (that's why I couldn't get it done with PIVOT)
The number of columns is dynamic; it equals the number of rows in the source table
Logically, the number of rows (table rows in the result set) is equally dynamic
(Not really an answer, but it's what I've got.)
This is a name/value pair table, right? Your query will require something that identifies which "set" of rows is associated with one another. Without something like this, I don't see how the query can be written. The key factor is that you must never assume that data will be returned from SQL (Server, at least) in any particular order. How the data is stored internally generally, but not always, determines how it is returned when order is not specified.
Another consideration: what if (when?) a row is missing -- say, Product 4 has no Price B value? That would break a simple "every six rows" rule. "Start fresh with every new Code row" would have its own problems if a Code is missing or when (not if) data is not returned in the anticipated order.
If you have some means of grouping items, let us know in an updated question, but otherwise I don't think this one is particularly solvable.
I actually did it.
I wrote a SQL WHILE loop based on the number of columns registered for the result set. That way I could build a dynamic SQL statement for N columns based on the values read. In the end I just inserted the result set into a temp table, and voilà.
Thanks anyways!
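For the record, a rough sketch of what that dynamic approach can look like in T-SQL. The self-answer used a WHILE loop; the version below builds the column list by string concatenation instead, and all table/column names (SourceRows, RowGroupId, AttributeName, AttributeValue) are invented for illustration:

DECLARE @cols NVARCHAR(MAX) = N'';

-- Build the dynamic column list from the distinct attribute names.
SELECT @cols = @cols + CASE WHEN @cols = N'' THEN N'' ELSE N', ' END
                     + QUOTENAME(AttributeName)
FROM (SELECT DISTINCT AttributeName FROM SourceRows) AS a;

DECLARE @sql NVARCHAR(MAX) = N'
    SELECT RowGroupId, ' + @cols + N'
    INTO ##PivotResult
    FROM SourceRows
    PIVOT (MAX(AttributeValue) FOR AttributeName IN (' + @cols + N')) AS p;';

EXEC sp_executesql @sql;    -- the pivoted result set lands in ##PivotResult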
