Clustering using lower case - snowflake-cloud-data-platform

I tried to cluster my events table by lower(event_name), but when I checked the number of partitions scanned for my query after clustering, it seems like it doesn't work: the query scans the same number of partitions as before the clustering.
Any ideas why and what can I do?

It's very hard to help you if you don't share your queries and how you created the clustered table. Nevertheless, let me run a few queries to show how clustering could help you (or not):
Unclustered table: 64 of 64 partitions scanned.
select *
from wikimedia.public.wikidata_modern_people
where name like 'Alex%';
Now let me cluster that table by name:
create or replace temp table people_name_cluster
cluster by (name)
as
select *
from wikimedia.public.wikidata_modern_people;
Now this query scans only 1 of 16 partitions:
select *
from people_name_cluster
where name like 'Alex%';
That was expected, but what if we cluster by lower(name):
create or replace temp table people_name_cluster_lower
cluster by (lower(name))
as
select *
from wikimedia.public.wikidata_modern_people;
Then this query scans 2 of 16 partitions:
select *
from people_name_cluster_lower
where name like 'Alex%';
But when I try to play with lower(name) or regular expressions, nothing gets pruned. The following queries scan 16 of 16 partitions:
select *
from people_name_cluster_lower
where lower(name) like 'alex%';
select *
from people_name_cluster_lower
where name regexp('^Alex.*');
select *
from people_name_cluster_lower
where regexp_like(name, 'Alex.*');
And then this query scans 2 of 16:
select *
from people_name_cluster_lower
where regexp_like(name, 'Alex.*')
and name between 'A' and 'B';
Lessons learned from these experiments: clustering by lower(string) might not make things much better, and trying to prune with regular expressions might not be optimized either.

Adding a clustering key does not make the data instantly clustered. It can take hours for a table to be reordered.
If you want to test whether the reorder will have an impact, the fastest method is to create a new (temporary) table with an ORDER BY on the cluster key terms and check whether that order improves the query pruning.
Given that the credits it can cost to incrementally reshape/cluster the data are more than the compute to do it in a single operation, and that the "rewrite" will use more disk, the truly cheapest way to test is to do it in one go.
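For example, a minimal sketch of that test, reusing the table from the answer above (the new table name is made up for illustration):
-- build an ordered copy in a single operation, then compare partition pruning
create or replace temp table people_name_ordered as
select *
from wikimedia.public.wikidata_modern_people
order by name;

-- compare the partitions scanned with the unclustered table
select *
from people_name_ordered
where name like 'Alex%';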

Related

Performance tuning on PATINDEX with JOIN

I have a table called tbl_WHO with 90 million records and a temp table #EDU with just 5 records.
I want to do pattern matching on the name field between the two tables (tbl_WHO and #EDU).
Query: the following query took 00:02:13 to execute.
SELECT Tbl.PName,Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
ON
(
(ISNULL(PATINDEX(Tbl.PName,Tmp.FirstName),'0')) > 0
)
Sometimes I have to do pattern matching on more than one column, like:
SELECT Tbl.PName,Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN #EDU Tmp
ON
(
(ISNULL(PATINDEX(Tbl.PName,Tmp.FirstName),'0')) > 0 AND
(ISNULL(PATINDEX('%'+Tbl.PAddress+'%',Tmp.Addres),'0')) > 0 OR
(ISNULL(PATINDEX('%'+Tbl.PZipCode,Tmp.ZCode),'0')) > 0
)
Note: There are indexes created on the columns used in the conditions.
Is there any other way to tune the query performance?
Searches starting with % are not sargable, so even with an index on the given column you are not going to be able to use it effectively.
Are you sure you need to search with PATINDEX each time? A table with 90 million records is not huge, but having many columns and not applying normalization correctly can certainly decrease performance.
I would advise revising the table and checking whether the data can be normalized further. This can lead to better performance in particular cases and reduce the table's storage as well.
For example, the zip code can be moved to a separate table, and instead of using the zip code string you can join on an integer column. Try to normalize the address further: do you have city, street or block, street or block number? The same for names: if you need to search by first and last names, split the names into separate columns.
For string values, the data can be sanitized, for example by removing whitespace at the beginning and end (trim). With data like that, you can create hash indexes and get extremely fast equality searches.
What I want to say is that if you normalize your data and add some rules (at database and application level) to ensure the input data is correct, you are going to get very nice performance. It is the long way, but it's easier to do it now than later.
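For example, a rough sketch of the zip code normalization mentioned above (the lookup table and the ZipCodeID column are hypothetical additions, not your actual schema):
-- hypothetical lookup table: each zip code string stored exactly once
CREATE TABLE dbo.ZipCode (
    ZipCodeID int IDENTITY(1,1) PRIMARY KEY,
    ZipCode   varchar(10) NOT NULL UNIQUE
);

-- tbl_WHO would then store ZipCodeID instead of the raw string,
-- so an equality join on the integer key replaces the PATINDEX search
SELECT Tbl.PName, Tbl.PStatus
FROM tbl_WHO Tbl
INNER JOIN dbo.ZipCode z ON z.ZipCodeID = Tbl.ZipCodeID
WHERE z.ZipCode = '90210';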

SQL Server: one table fulltext index vs multiple table fulltext index

I have a search procedure which has to search in five tables for the same string. I wonder which is better regarding read performance?
To have all tables combined in one table, then add a full text index on it
To create full text indexes in all those tables and issue a query on all of them, then to combine the results
I think something to consider when working with performance is that reading data once is almost always faster than reading it, moving it, and then reading it again.
So, from your question, if you are combining all the tables into a single table (say, a temporary table or table variable), this will most definitely be slow, because it has to query all that data and move it (depending on how much data you are working with, this may or may not make much of a difference). Also, regarding your table structure, indexing on strings only really becomes efficient when the same string shows up a number of times throughout the tables.
I.e. if you are searching on months (Jan, Feb, Mar, etc.) an index will be great, because the value can only be 1 of 12 options, and the more times a value is repeated / the fewer options there are to choose from, the better an index is. Whereas if you are searching on user-entered values ("I am writing my life story....") and you are searching on a word inside that string, an index may not make any difference.
So assuming your query is searching on, in my example months, then you could do something like this:
SELECT value
FROM (
SELECT value FROM table1
UNION
SELECT value FROM table2
UNION
SELECT value FROM table3
UNION
SELECT value FROM table4
UNION
SELECT value FROM table5
) t
WHERE t.value = 'Jan'
This will combine your results into a single result set without moving data around, and the query optimizer will be able to find the most efficient way to query each table using the index provided on each table.
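If the search really is on words inside longer strings, per-table full-text indexes can be combined the same way. A sketch only; the catalog name, the key index names and the searched column are assumptions:
-- each table needs its own full-text index (KEY INDEX must name a unique, non-nullable index on that table)
CREATE FULLTEXT CATALOG search_catalog;
CREATE FULLTEXT INDEX ON table1 (value) KEY INDEX pk_table1 ON search_catalog;
-- ...repeat for table2 through table5...

-- each CONTAINS predicate then uses its own table's full-text index
SELECT value FROM table1 WHERE CONTAINS(value, '"Jan*"')
UNION
SELECT value FROM table2 WHERE CONTAINS(value, '"Jan*"');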

Can we create Bitmap Index on a table column in Oracle 11 which reloads daily using a Job

We have a table which stores information about clients and gets loaded from the data warehouse using a scheduled job on a daily basis. There are more than 1 million records in that table.
I wanted to define a bitmap index on the Country column, as there would be a limited number of distinct values.
Does it have any impact on the indexes if we delete and reload the data into the table on a daily basis? Do we need to explicitly rebuild the index after every load?
A bitmap index is dangerous when the table (more precisely, the indexed column) is frequently updated, because DML on a single row can lock many rows in the table. That's why it is more a data warehouse tool than an OLTP tool. Also, the true power of bitmap indexes comes from combining several of them using logical operations and translating the result into ROWIDs (and then accessing the rows or aggregating them). In Oracle there are, in general, not many reasons to rebuild an index: when frequently modified it will always adapt via 50/50 block splits, and it doesn't make sense to try to compact it into the smallest possible space. One million rows today is nothing, unless each row contains a big amount of data.
Also be aware that bitmap indexes require an Enterprise Edition license.
The rationale for defining a bitmap index is not having few values in a column, but having a query (or queries) that can profit from it when accessing the table rows.
For example, if you have say 4 countries, equally populated, Oracle will not use the index, as a FULL TABLE SCAN comes cheaper.
If you have some "exotic" countries (very few records), a bitmap index could be used, but you will most probably spot no difference compared to a conventional index.
I wanted to define a bitmap index on the Country column as there would be a limited number of values.
Just because a column is low cardinality does not mean it is a candidate for a bitmap index. It might be, it might not be.
Good explanation by Tom Kyte here.
Bitmap indexes are extremely useful in environments where you have lots of ad hoc queries, especially queries that reference many columns in an ad hoc fashion or produce aggregations such as COUNT. For example, suppose you have a large table with three columns: GENDER, LOCATION, and AGE_GROUP. In this table, GENDER has a value of M or F, LOCATION can take on the values 1 through 50, and AGE_GROUP is a code representing 18 and under, 19-25, 26-30, 31-40, and 41 and over.
For example, you have to support a large number of ad hoc queries that take the following form:
select count(*)
from T
where gender = 'M'
and location in ( 1, 10, 30 )
and age_group = '41 and over';
select *
from t
where ( ( gender = 'M' and location = 20 )
or ( gender = 'F' and location = 22 ))
and age_group = '18 and under';
select count(*) from t where location in (11,20,30);
select count(*) from t where age_group = '41 and over' and gender = 'F';
You would find that a conventional B*Tree indexing scheme would fail you. If you wanted to use an index to get the answer, you would need at least three and up to six combinations of possible B*Tree indexes to access the data via the index. Since any of the three columns or any subset of the three columns may appear, you would need large concatenated B*Tree indexes on:
GENDER, LOCATION, AGE_GROUP: for queries that used all three, GENDER with LOCATION, or GENDER alone
LOCATION, AGE_GROUP: for queries that used LOCATION and AGE_GROUP, or LOCATION alone
AGE_GROUP, GENDER: for queries that used AGE_GROUP with GENDER, or AGE_GROUP alone
Having only a single bitmap index on a table is useless most of the time. The benefit of bitmap indexes comes when you have several of them on a table and your query combines them.
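As a sketch of that point, using Tom Kyte's example table t (the index names are made up):
create bitmap index t_gender_bix    on t (gender);
create bitmap index t_location_bix  on t (location);
create bitmap index t_age_group_bix on t (age_group);

-- Oracle can combine the three bitmaps (BITMAP AND/OR) before touching the table rows
select count(*)
from t
where gender = 'M'
and location in (1, 10, 30)
and age_group = '41 and over';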
Maybe a List-Partition is more suitable in your case.

One large table or many small ones in database?

Say I want to create a typical todo web app using a database like PostgreSQL. A user should be able to create todo-lists, and on these lists he should be able to make the actual todo-entries.
I regard the todo-list as an object which has different properties like owner, name, etc., and of course the actual todo-entries, which have their own properties like content, priority, date, etc.
My idea was to create a table for all the todo-lists of all the users. In this table I would store all the attributes of each list. But the question which arises is how to store the todo-entries themselves? Of course in an additional table, but should I rather:
1. Create one big table for all the entries and have a field storing the id of the todo-list they belong to, like so:
todo-list: id, owner, ...
todo-entries: list.id, content, ...
which would give 2 tables in total. The todo-entries table could get very large. Although we know that entries expire, hence the table only grows with more usage but not over time. Then we would write something like SELECT * FROM todo-entries WHERE todo-list-id=id where id is the id of the list we are trying to retrieve.
OR
2. Create a todo-entries table on a per user basis.
todo-list: id, owner, ...
todo-entries-owner: list.id, content, ...
The number of entries tables depends on the number of users in the system. Something like SELECT * FROM todo-entries-owner. Mid-sized tables, depending on the number of entries the users make in total.
OR
3. Create one todo-entries table for each todo-list and then store the generated table name in a field of the list. For instance we could use the todo-list's unique id in the table name, like:
todo-list: id, owner, entries-list-name, ...
todo-entries-id: content, ... //the id part is the id from the todo-list id field.
In the third case we could potentially have quite a large number of tables. A user might create many 'short' todo-lists. To retrieve a list we would then simply go along the lines of SELECT * FROM todo-entries-id, where todo-entries-id should be either a field in the todo-list or could be derived implicitly by concatenating 'todo-entries' with the todo-list's unique id. Btw.: how do I do that, should this be done in JS or can it be done in PostgreSQL directly? And very related to this: in the SELECT * FROM <tablename> statement, is it possible to use the value of some field of some other table as <tablename>? Like SELECT * FROM todo-list(id).entries-list-name or so.
The three possibilities go from few large tables to many small tables. My personal feeling is that the second or third solution is better. I think they might scale better, but I'm not quite sure of that, and I would like to know what the 'typical' approach is.
I could go more in depth of what I think of each of the approaches, but to get to the point of my question:
Which of the three possibilities should I go for? (or anything else, has this to do with normalization?)
Follow up:
What would the (PostgreSQL) statements then look like?
The only viable option is the first. It is far easier to manage and will very likely be faster than the other options.
Imagine you have 1 million users, with an average of 3 to-do lists each, with an average of 5 entries per list.
Scenario 1
In the first scenario you have three tables:
todo_users: 1 million records
todo_lists: 3 million records
todo_entries: 15 million records
Such table sizes are no problem for PostgreSQL, and with the right indexes you will be able to retrieve any data in less than a second. That is for simple queries; if your queries become more complex (like: get me the todo_entries for the longest todo_list of the top 15% of todo_users that have made less than 3 todo_lists in the 3-month period with the highest todo_entries entered) they will obviously be slower, as in the other scenarios. The queries are very straightforward:
-- Find user data based on username entered in the web site
-- An index on 'username' is essential here
SELECT * FROM todo_users WHERE username = ?;
-- Find to-do lists from a user whose userid has been retrieved with previous query
SELECT * FROM todo_lists WHERE userid = ?;
-- Find entries for a to-do list based on its todoid
SELECT * FROM todo_entries WHERE listid = ?;
You can also combine the three queries into one:
SELECT u.*, l.*, e.* -- or select appropriate columns from the three tables
FROM todo_users u
LEFT JOIN todo_lists l ON l.userid = u.id
LEFT JOIN todo_entries e ON e.listid = l.id
WHERE u.username = ?;
Use of the LEFT JOINs means that you will also get data for users without lists or lists without entries (but column values will be NULL).
Inserting, updating and deleting records can be done with very similar statements and similarly fast.
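For reference, a minimal sketch of what the scenario 1 schema could look like (the column names are taken from the queries above; everything else is an assumption):
-- users, lists and entries; the UNIQUE constraint on username provides the index for the login lookup
CREATE TABLE todo_users (
    id       serial PRIMARY KEY,
    username text NOT NULL UNIQUE
);

CREATE TABLE todo_lists (
    id     serial PRIMARY KEY,
    userid integer NOT NULL REFERENCES todo_users (id),
    name   text
);
CREATE INDEX ON todo_lists (userid);

CREATE TABLE todo_entries (
    id      serial PRIMARY KEY,
    listid  integer NOT NULL REFERENCES todo_lists (id),
    content text
);
CREATE INDEX ON todo_entries (listid);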
PostgreSQL stores data in "pages" (typically 8 kB in size) and most pages will be filled, which is a good thing because reading and writing a page are very slow compared to other operations.
Scenario 2
In this scenario you need only two tables per user (todo_lists and todo_entries) but you need some mechanism to identify which tables to query.
1 million todo_lists tables with a few records each
1 million todo_entries tables with a few dozen records each
The only practical solution to that is to construct the full table names from a "basename" related to the username or some other persistent authentication data from your web site. So something like this:
username = 'Jerry';
todo_list = username + '_lists';
todo_entries = username + '_entries';
And then you query with those table names. More likely you will need a todo_users table anyway to store personal data, usernames and passwords of your 1 million users.
In most cases the tables will be very small and PostgreSQL will not use any indexes (nor does it have to). It will have more trouble finding the appropriate tables, though, and you will most likely build your queries in code and then feed them to PostgreSQL, meaning that it cannot optimize a query plan. A bigger problem is creating the tables for new users (todo_list and todo_entries) or deleting obsolete lists or users. This typically requires behind-the-scenes housekeeping that you avoid with the previous scenario. And the biggest performance penalty will be that most pages have only little content, so you waste disk space and lots of time reading and writing those partially filled pages.
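To illustrate the "build your queries in code" part (a sketch only; the function and its use of per-user table names are hypothetical): plain SQL cannot take a table name from a column value, so you end up with dynamic SQL, for example PL/pgSQL EXECUTE with format():
CREATE OR REPLACE FUNCTION get_entries(p_username text)
RETURNS SETOF record AS $$
BEGIN
  -- %I quotes the identifier, so 'Jerry' targets the table "Jerry_entries"
  RETURN QUERY EXECUTE format('SELECT * FROM %I', p_username || '_entries');
END;
$$ LANGUAGE plpgsql;
-- callers must even supply a column definition list, e.g.
-- select * from get_entries('Jerry') as t(listid int, content text);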
Scenario 3
This scenario is even worse than scenario 2. Don't do it, it's madness.
3 million tables todo_entries with a few records each
So...
Stick with option 1. It is your only real option.

How do I return the database rows in a custom order?

I have a database with two columns: a number and a name. The table is as follows:
I am expecting the result of select * from table where num=6 or num=3 to be:
But what I get is:
How do I order the results?
Note: I assume that your actual data is not just A, B, C, ..., F. Otherwise, you don't need a database for that, and can do it directly in your language (example in C#).
You should understand the difference between filtering and ordering.
The data in the database is stored in some internal order. When a specific order is important, it is specified in the query; when it is not, some databases might return the rows in a nearly-nondeterministic order.
When you use where, you are simply filtering the results. Your query tells the database:
“Please, give me every column of the table table, given that I'm only interested in the rows containing num 6 or 3.”
While you say what should be returned, you don't specify in which order.
The order by is used exactly for that:
select * from table where num=3 or num=6 order by num desc
will return the rows in the order where the highest num value will appear first.
You are not limited to asc and desc, by the way. If you need, for instance, to have an order such as 6, 7, 2, 3, you can do that with case in Microsoft SQL Server (and similar constructs in other databases). Example:
select [name] from table
where num in (2, 3, 6, 7)
order by case when [num] = 6 then 1
when [num] = 7 then 2
when [num] = 2 then 3
else [num] end asc
While this will do what you want, it's not a good solution, since you'll need to build your SQL query dynamically from the user's input, which is something you should avoid at all costs (because, aside from poor performance, you'll end up letting SQL injection through).
Instead:
Create a temporary table with two columns: an auto-incremented primary key column and a column containing numbers.
Insert your values (6, 7, 2, 3) in the second column.
Join two tables (the table table and the temporary table).
Filter on the primary key of the temporary table.
Remove the temporary table.
This has the benefit that you don't have to create your queries dynamically, the drawback being that the solution is slightly more difficult, especially if multiple users can select the data at the same time, which means that you have to choose the name of your temporary table wisely.
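A sketch of those steps in SQL Server syntax (the temporary table and column names are made up):
-- steps 1-2: temporary table whose auto-incremented key holds the desired order
CREATE TABLE #sort_order (
    position int IDENTITY(1,1) PRIMARY KEY,
    num      int NOT NULL
);
INSERT INTO #sort_order (num) VALUES (6), (7), (2), (3);

-- steps 3-4: join and order by the temporary table's key
SELECT t.[name]
FROM [table] t
INNER JOIN #sort_order s ON s.num = t.num
ORDER BY s.position;

-- step 5: clean up
DROP TABLE #sort_order;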
The easiest solution is to just do the ordering in your programming language instead of the database. Load the required rows (don't forget the where) and sort them later. Since, according to your comments, the user is specifying the order, I'm pretty sure you are dealing with only a few rows, not thousands, which makes it an ideal solution. If you had to deal with hundreds of thousands of rows that you need to process as they arrive from the database without keeping them all in memory, then the previous solution with a temporary table (and a bulk insert) would be more appropriate.
Finally, if the number of rows in the overall table is low, you don't even have to filter the data. Just load it into memory and keep it there as a map.
Notes:
Don't use select *. There are practically no cases where you really need it in your application. Using select * instead of explicitly specifying the columns has at least two drawbacks:
If later, a column is added, the data set you get will be different, and it might take you time to notice that and debug the related issues. Also, if a column is removed or renamed, the bugs won't be necessarily easy to find, since the errors will occur not at database level, with a clear, explicit error message, but somewhere within your application.
It has a performance impact, since the database needs to figure out which columns should be selected. If you do a select * on a table containing a few dozens of columns, including some blobs or long strings while all you need is a few columns, the performance will be quite terrible.
If your database supports in, use it:
select * from table where num in (3, 6) order by num desc
A common construct I have seen used in numerous databases for metadata-style lists is to add a column to the table with the boring name of "column_order", a number.
Then your select is
select list, list_value
from your_table
where enabled = 1
order by column_order;
This works great in allowing any arbitrary order you want but does not work well if you have an application in multiple languages where the order might be different by language.
