Partitioning table based on first letter of a varchar field

I have a massive table (over 1B records) that has a specific requirement for table partitioning:
(1) Is it possible to partition a table in Postgres based on the first character of a varchar field?
For example:
For the following 3 records:
a-blah
a-blah2
b-blah
a-blah and a-blah2 would go in the "A" partition, b-blah would go into the "B" partition.
(2) If the above is not possible with Postgres, what is a good way to evenly partition a large growing table? (without partitioning by create date -- since that is not something these records have).

You can use an expression in the partition by clause, e.g.:
create table my_table(name text)
partition by list (left(name, 1));
create table my_table_a
partition of my_table
for values in ('a');
create table my_table_b
partition of my_table
for values in ('b');
Results:
insert into my_table
values
('abba'), ('alfa'), ('beta');
select 'a' as partition, name from my_table_a
union all
select 'b' as partition, name from my_table_b;
 partition | name
-----------+------
 a         | abba
 a         | alfa
 b         | beta
(3 rows)
If the partitioning should be case-insensitive, you might use
create table my_table(name text)
partition by list (lower(left(name, 1)));
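If names can start with characters outside the partitions you've listed, a DEFAULT partition (available since PostgreSQL 11) can catch the rest; a minimal sketch:
create table my_table_other
partition of my_table
default;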
Read in the documentation:
Table Partitioning
CREATE TABLE
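As for question (2): hash partitioning (also PostgreSQL 11+) distributes rows roughly evenly without relying on a date column; a minimal sketch:
create table my_table2(name text)
partition by hash (name);
create table my_table2_p0
partition of my_table2
for values with (modulus 4, remainder 0);
create table my_table2_p1
partition of my_table2
for values with (modulus 4, remainder 1);
-- ... and so on for remainders 2 and 3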

Related

SQL Server index on optional columns

In my scenario I have a table with a lot of optional columns (20 columns in total, say col00 to col19; every column contains a non-nullable integer).
When a column contains a 0 it's considered empty; any other value has a meaning.
Any subset of those 20 columns could be queried, so I might query for col01 = int1 and col17 = int2.
I need to improve the performance of such queries, but I don't know how to create a representative index.
Surely I could monitor the table for a while and see which column subsets are searched most often, but that is not a satisfactory solution for me (the table is regenerated every few months, and the "tags" encoded this way may change).
I think the best you'll be able to do is to index every column by itself, then use the set operator INTERSECT in a subquery of your WHERE clause.
INTERSECT returns the distinct rows that are output by both the left and right input queries. So if you select the primary key of the table in the INTERSECT, you have a good subquery that can be used in a WHERE clause. This will require you to rewrite your queries, however.
Example:
SELECT *
FROM tablename
WHERE primary_key IN (
    SELECT primary_key FROM tablename WHERE col01 = int1
    INTERSECT
    SELECT primary_key FROM tablename WHERE col17 = int2
)
That should be sargable, provided col01 and col17 each have their own index.
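For completeness, a sketch of the single-column indexes this approach relies on (tablename, col01 and col17 are the question's placeholders):
CREATE INDEX ix_tablename_col01 ON tablename (col01);
CREATE INDEX ix_tablename_col17 ON tablename (col17);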

Is there a way to cast a group of columns to the same NUMBER(p,s) type so that they can be UNPIVOT in Snowflake SQL?

I have a table with several numeric columns, all with different NUMBER(p,s) types. The table was created with a CREATE TABLE xx as (select date, SUM(x), SUM(y) from xxx GROUP BY date). It seems that Snowflake decided the minimum NUMBER(precision, scale) required to store each resulting column, which resulted in a different type for each column.
Now I want to UNPIVOT those columns, and Snowflake complains: SQL compilation error: The type of column 'xxxxx' conflicts with the type of other columns in the UNPIVOT list.
I created this little minimal table to exemplify the problem:
create or replace temporary table temp1(id number, sales number(10,0), n_orders number(20,0)) as (
select * from (values
(1, 1, 2 )
,(2, 3, 4)
,(3, 5, 6)
)
); -- imagine that temp1 was created via a select AGG1, AGG2 FROM XXX GROUP BY YYY
describe table temp1;
--
name      type          kind    null?  default  primary key  unique key  check  expression  comment
ID        NUMBER(38,0)  COLUMN  Y               N            N
SALES     NUMBER(10,0)  COLUMN  Y               N            N
N_ORDERS  NUMBER(20,0)  COLUMN  Y               N            N
select *
from temp1 UNPIVOT(measure_value for measure_name in (sales, n_orders)); -- won't work because SALES is NUMBER(10,0) and N_ORDERS is NUMBER(20,0)
Right now my workaround is to cast each column with an explicit TO_NUMBER(x, 38, 0) as x, like so:
with t1 as (
select
id
,TO_NUMBER(sales,38,0) as sales
,TO_NUMBER(n_orders, 38,0) as n_orders
from temp1
)
select * from t1 UNPIVOT(measure_value for measure_name in (sales, n_orders));
This is less than optimal because there are many columns in the actual table I'm using.
I don't want to recreate the table (the aggregations take long to compute), so what are my options?
Is there any other syntax I can use to cast a list of columns in bulk?
Your best option is to modify the already created table (without having to rerun the costly aggregation), like so:
alter table temp1 modify
sales set data type number(38,0)
,n_orders set data type number(38,0)
;
This way has two advantages:
you avoid typing the column name twice for each column: column_name set data type number(38,0) instead of TO_NUMBER(column_name, 38, 0) as column_name
it runs just once, instead of having to run as a CTE before each UNPIVOT query.
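After the ALTER, the original UNPIVOT query runs unchanged:
select *
from temp1 UNPIVOT(measure_value for measure_name in (sales, n_orders));
-- now compiles: SALES and N_ORDERS are both NUMBER(38,0)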

SQL unique PK for grouped data in SP

I am trying to build a temp table with grouped data from multiple tables (in a stored procedure). I am successful in building the data set; however, I have a requirement that each grouped row have a unique ID. I know there are ways to generate unique IDs for each row, but the problem is that I need the ID for a given row to be the same on each run, regardless of the number of rows returned.
Example:
1st run:
ID | Column A | Column B
1  | apple    | 15
2  | orange   | 10
3  | grape    | 11
2nd run:
ID | Column A | Column B
3  | grape    | 11
The reason I want this is that I am sending this data up to Solr, and when I do a delta import I need the same ID back for the same row as it re-indexes.
Any way I can do this?
Not sure if this will help, not entirely confident of your wider picture, but ...
As your new data is assembled, log each [column a] value in a table of your own.
Give that table an IDENTITY column to do the numbering for you.
Now you can join any new data sets to your lookup table and you'll have a persistent number for each column A.
You just need to ensure that each time you query new data, you add new values to the lookup table.
create table dbo.myRef(
idx int identity(1,1)
,[A] nvarchar(100)
)
General draft as below ...
--- just simulating some input data here
with cte as (
select 'apple' as [A], 15 as [B]
UNION
select 'orange' as [A], 10 as [B]
UNION
select 'banana' as [A], 4 as [B]
)
select * into #temp from cte;
-- Put any new values into the lookup table
-- and they will be assigned a new index number by the identity column
insert into dbo.myRef([A])
select distinct [A]
from #temp where [A] not in (select [A] from dbo.myRef)
-- now pull your original data for output, joining to the lookup table to get a ref number.
select T.*, R.idx
from #temp T
inner join dbo.myRef R
    on T.[A] = R.[A]
Sorry for the late reply, I was stuck with something else; however, I solved my own issue.
I built two temp tables: one with all the data from the various tables (#master), and another (#final) to house all the grouped data, with an empty column for the ID.
Next I did a concat(column1, '-', column2, '-', column3) on 3 columns from #master and updated the #final table based on the type.
This helped me get the same concatenated IDs on each run.
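A sketch of that approach; the column names and the join key are hypothetical stand-ins, since the originals aren't shown:
-- GroupKey stands in for whatever identifies a grouped row in both tables
update f
set f.ID = concat(m.column1, '-', m.column2, '-', m.column3)
from #final f
inner join #master m
    on f.GroupKey = m.GroupKey;
Since the concatenated value depends only on the row's own data, it stays stable across runs no matter how many rows are returned.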

Delete duplicates from large dataset (>100Mio rows)

I know that this topic has come up many times before, but none of the suggested solutions worked for my dataset, because my laptop stopped calculating due to memory issues or full storage.
My table looks like the following and has 108 million rows:
Col1        | Col2 | Col3            | Col4 | SICComb  | NameComb
Case New    | 3523 | Alexander       | 6799 | 67993523 | AlexanderCase New
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
Undisclosed | 6799 | Case New        | 3523 | 67993523 | Case NewUndisclosed
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
SmartCard   | 3674 | Virtual NetComm | 7373 | 73733674 | SmartCardVirtual NetComm
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
The unique columns are SICComb and NameComb. I tried to add a primary key with:
ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)
but the integers filled up more than 30 GB of my storage in just a few minutes.
Which would be the fastest and most efficient method to delete the duplicates from the table?
If you're using SQL Server, you can delete from a common table expression:
with cte as (
    select row_number() over (partition by SICComb, NameComb
                              order by Col1) as row_num
    from Table1
)
delete from cte
where row_num > 1
Here all rows will be numbered; you get a separate sequence for each unique combination of SICComb + NameComb. You can choose which rows survive by changing the ORDER BY inside the OVER clause.
In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.
Here is the idea, using SQL Server syntax:
select distinct t.*
into #temptable
from t;
truncate table t;
insert into t
select tt.*
from #temptable tt;
Of course, this depends to a large extent on how fast the first step is. And, you need to have the space to store two copies of the same table.
Note that the syntax for creating the temporary table differs among databases. Some use the syntax of create table as rather than select into.
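For example, a sketch of the equivalent first step in a database that uses create table as (e.g. Postgres or MySQL):
create temporary table temptable as
select distinct t.*
from t;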
EDIT:
Your identity insert error is troublesome. I think you need to remove the identity column from the column list for the distinct. Or do:
select min(<identity col>), <all other columns>
from t
group by <all other columns>
If you have an identity column, then there are no duplicates (by definition).
In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, then just leave the identity column out of the column list for the insert:
insert into t(<all other columns>)
select <all other columns>
from #temptable;
If you need the old identity value (and the minimum will do), enable identity inserts (SET IDENTITY_INSERT t ON) and do:
insert into t(<all columns including identity>)
select <all columns including identity>
from #temptable;
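Putting the pieces together for this table, a sketch assuming SQL Server, the question's dbo.test, and that the kept row per duplicate group should retain the minimum ID:
-- collect one row per group, keeping the lowest identity value
select min(ID) as ID, Col1, Col2, Col3, Col4, SICComb, NameComb
into #dedup
from dbo.test
group by Col1, Col2, Col3, Col4, SICComb, NameComb;

truncate table dbo.test;

-- reload with the original identity values intact
set identity_insert dbo.test on;
insert into dbo.test (ID, Col1, Col2, Col3, Col4, SICComb, NameComb)
select ID, Col1, Col2, Col3, Col4, SICComb, NameComb
from #dedup;
set identity_insert dbo.test off;
If duplicates are defined by SICComb and NameComb alone, group by just those two columns and pick representative values (e.g. min) for the others.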

TSQL: getting next available ID

Using SQL Server 2008, I have three tables: table A, table B and table C.
All have an ID column, but for tables A and B the ID column is an identity integer; for table C the ID column is a varchar type.
Currently a stored procedure takes a name param and, following certain logic, inserts into table A or table B, gets the identity, prefixes it with 'A' or 'B', then inserts into table C.
The problem is that table C's ID column potentially has duplicated values, i.e. if the identity from table A is 2, 'A2', 'A3', 'A5' might already exist in table C's ID column. How do I write a T-SQL query to identify the next available value in table C and then ensure table A/B is updated accordingly?
[Update]
this is the current flow:
1. depending on the input parameter, insert into table A or table B
2. initialize seed value = @@IDENTITY
3. calculate the ID value to insert into table C by prefixing the seed value with 'A' or 'B'
4. look for a matching record in table C by the ID value from step 3; if no record is found, insert it, else increase the seed value by 1 and repeat step 3
The issue is that in certain value ranges there can be a huge block of existing values in table C's ID column, e.g. A3000 to A500000 already exist, and the query is extremely slow following the existing logic. I need to figure out a way to efficiently get the minimum available number (without the prefix).
It is hard to describe; I hope this makes more sense. I truly appreciate any help on this. Thanks in advance!
This should do the trick. It's a simple self-contained example that will work in SSMS; I even put the data out of order just in case. You would just change @Data to your table and replace 'Id' with your identifier field.
declare @Data table ( Id varchar(3) );
insert into @Data values ('A5'),('A2'),('B1'),('A3'),('B2'),('A4'),('A1'),('A6');
with a as
(
    select
        Id
        , cast(right(Id, len(Id) - 1) as int) as Pos
        , left(Id, 1) as TableFrom
    from @Data
)
select
    TableFrom
    , max(Pos) + 1 as NextNumberUp
from a
group by TableFrom
EDIT: If you want to avoid re-reading production data, you could add this last part, amending what I wrote:
select
    TableFrom
    , max(Pos) as LastPos
into #Temp
from a
group by TableFrom

select TableFrom, LastPos + 1
from #Temp
Regardless, if this is a production environment you are going to have to hit part of it at some point to get the data. If the dataset is not too large (values of varchar(256) or less and only 5 million rows or fewer), you could dump that entire column from tableC into a temp table. Honestly, query performance versus imports varies vastly from system to system.
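If you truly need the smallest unused number rather than max + 1 (the question mentions huge blocks of existing values), a gap-finding query is one option; a sketch against the same @Data sample:
with parsed as (
    select
        cast(right(Id, len(Id) - 1) as int) as Pos
        , left(Id, 1) as TableFrom
    from @Data
)
select p.TableFrom, min(p.Pos) + 1 as FirstGap
from parsed p
where not exists (
    select 1 from parsed p2
    where p2.TableFrom = p.TableFrom
      and p2.Pos = p.Pos + 1
)
group by p.TableFrom
When there are no gaps, this degenerates to max + 1.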
Following your design there shouldn't be any duplicates in Table C considering that A and B are unique.
A | B | C
1 | 1 | A1
2 | 2 | A2
  |   | B1
  |   | B2
