I have 2 tables: one with 1 million rows (Table 1) and another with 99 million rows (Table 2). They are in separate schemas.
They have similar structures, so no problem there.
My question would be this:
I need a table containing both tables' data in the schema containing Table 1.
Would it be faster to run code to transfer all 99 million rows of Table 2 into Table 1
OR
Would it be faster to run code to transfer all 1 million rows into Table 2, and then run code to alter Table 2's schema to Table 1's schema?
OR
Would everything actually be instantaneous?
My understanding is that you want to insert all records from Table 2 into Table 1. If that is the case, I would suggest dropping the indexes on Table 1, running the insert, and then rebuilding them. Alternatively you could leave the indexes on, but that would slow things WAY down.

Another solution, and my preferred one, is to create a Table 3, insert both tables into it, build the indexes, then rename Table 1 and Table 2 to TableName_Backup and rename Table 3 to whatever you want. This second solution should give you optimal results while keeping both original tables in their original state while you validate the data. Once you are satisfied, either move the two original tables to an archive location or drop them, depending on your company policy.
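A minimal sketch of that third-table approach, assuming SQL Server (the question does not name the product) and illustrative schema, table, and column names:

-- Create Table 3 in Table 1's schema and load the big table first (names are illustrative)
SELECT *
INTO Schema1.Table3
FROM Schema2.Table2;

-- Add the smaller table
INSERT INTO Schema1.Table3
SELECT * FROM Schema1.Table1;

-- Build the index only after all data is loaded
CREATE INDEX IX_Table3_SomeColumn ON Schema1.Table3 (SomeColumn);

-- Keep the originals as backups and swap the new table into place
EXEC sp_rename 'Schema1.Table1', 'Table1_Backup';
EXEC sp_rename 'Schema2.Table2', 'Table2_Backup';
EXEC sp_rename 'Schema1.Table3', 'Table1';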
Related
Database: MongoDB.
In my dataset, every record will have an attribute that can take on 1 of 3 values.
In terms of query speed, would it be faster to have one table with a column that stores this attribute, or 3 separate tables, one for each value?
I have a database with one critical table that is widely used and queried. The lifetime of the data in this table is divided into two stages, Table1 and Table1_Hist: once a user has finished working with a record, it is moved from Table1 to Table1_Hist for consultation and reports.
The structure of Table1 is
ID (Long) KEY,
Priority (INT),
val1 (VARBINARY(XXXX)),
val2 (VARBINARY(XXXX))
This table is filled with millions of records per month, and the records are passed to Table1_Hist in no specific order. This means that User1 may insert a record in Table1 today and finish working with it today too, while User2 may insert a similar record in Table1 and finish working with it next week, next month, or in 3 months.
The issue arises when Table1_Hist grows to the point of affecting the performance of queries against it. I had in mind splitting this table into Table1_Hist_1, Table1_Hist_2, ... Table1_Hist_n and creating an ID map table where I register the range of IDs stored in each table. For example, the map might say that Table1_Hist_1 stores IDs 1 to 10M, and so on. But as I said before, there is no chronological order in which records reach these historical tables, so I may have a map entry saying 1 to 10M points to Table1_Hist_1 and still find a record with ID=3 in Table1_Hist_2, just because ID=3 was finished 2 months later and stored in the following table.
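For reference, this is roughly what I mean by the ID map (a minimal sketch; names and types are illustrative):

CREATE TABLE Table1_Hist_Map (
    RangeStart BIGINT NOT NULL,        -- first ID expected in this history table
    RangeEnd   BIGINT NOT NULL,        -- last ID expected in this history table
    HistTable  VARCHAR(128) NOT NULL,  -- e.g. 'Table1_Hist_1'
    PRIMARY KEY (RangeStart, RangeEnd)
);

-- Example: IDs 1 to 10,000,000 are supposed to live in Table1_Hist_1
INSERT INTO Table1_Hist_Map (RangeStart, RangeEnd, HistTable)
VALUES (1, 10000000, 'Table1_Hist_1');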
So, does anyone know of an efficient approach to split a table into multiple tables with their respective mapping?
My question is about the performance of SQL Server tables.
Assume I have a table that has many columns, for example 30 columns, with 1 column indexed. This table has approximately 30,000 rows.
If I perform a select that selects the indexed column, and one more, for example this:
SELECT IndexedColumn, column1
FROM table
Will this be slower than performing the same select on a table that only has those 2 columns, i.e. doing a SELECT * on it?
So basically, will the existence of the extra columns slow down the select query even if I am not retrieving the data from the extra columns?
There will be a minor difference at the very end of the process, since you don't have to pass the rest of the information to the end client (either SSMS or another app).
When performing a read based on the clustered index, all of the columns (excluding BLOBs) are stored in the same set of pages, so to read the data you have to access the same pages anyway.
You would see a performance increase if you had a nonclustered index covering the columns you are after, as those are stored in their own structure of data pages (so there would be less to read).
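For example, a covering nonclustered index for the query above might look like this (a sketch; the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_table_IndexedColumn_covering
ON dbo.[table] (IndexedColumn)
INCLUDE (column1);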
Assuming that you are using the default clustered index created by SQL Server when defining the primary key on the table in both scenarios, then no, there shouldn't be any performance difference between these two scenarios. It may be worth generating an actual execution plan to see for yourself. -- Actually, I am not sure the above is true: since this is a rowstore, the first (wide) table won't be able to fit as many rows onto each page, so it will suffer more I/O overhead when reading the data.
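If you want to measure rather than guess, one quick check (a sketch; table names are illustrative) is to compare the logical reads reported by SET STATISTICS IO:

SET STATISTICS IO ON;

SELECT IndexedColumn, column1 FROM dbo.WideTable;    -- the 30-column table
SELECT * FROM dbo.NarrowTable;                       -- the 2-column table

SET STATISTICS IO OFF;
-- Compare the "logical reads" values in the Messages tab: more pages read means more I/O.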
Background
This is a simplified version of the postgres database I am managing:
TableA: id,name
TableB: id,id_a,prop1,prop2
This database has a peculiarity: when I select data, I only consider rows of TableB that have the same id_a. So I am never interested in selecting data from TableB with mixed values of id_a. Therefore, queries are always of this kind:
SELECT something FROM TableB INNER JOIN TableA ON TableA.id=id_a
Some time ago, the number of rows in TableA grew to 20,000 and TableB to 10^7 rows.
To speed up queries, I first added a B-tree index on one of TableB's properties. Something like the following:
"my_index" btree (prop1)
The problem
Now I have to insert new data, and the database will grow to more than double its current size. Inserting data into TableB has become too slow.
I understand that the slowness comes from updating my_index.
When I add a new row to TableB, the database has to update the my_index structure.
I feel like this would be sped up if my_index did not cover all rows.
But I do not need a new row with a given id_a value to be ordered together with rows having a different id_a value.
The question
How can I create an index on a table where the elements are ordered only within rows that share a common property (e.g. a column called id_a)?
You can't.
The question that I would immediately ask you if you want such an index is: Yes, but for what values of id_a do you want the index? And your answer would be “for all of them”.
If you actually would want an index only for some values, you could use a partial index:
CREATE INDEX partidx ON tableb(prop1) WHERE id_a = 42;
But really you want an index for the whole table.
Besides, an INSERT would be just as slow, unless the inserted row does not satisfy the WHERE condition of your index (in which case it is not added to the index at all).
There are three things you can do to speed up INSERT:
Run as many INSERT statements as possible in a single transaction, ideally all of them.
Then you don't have to pay the price of COMMIT after every single INSERT, and COMMITs are quite expensive: they require data to be written to the disk hardware (not the cache), and that is incredibly slow (1 ms is a decent time).
You can speed this up even more if you use prepared statements. That way the INSERT doesn't have to be parsed and prepared every time.
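A minimal sketch combining both ideas, assuming PostgreSQL; the column types are guesses:

BEGIN;

-- Parse and plan the INSERT once (column types are assumptions)
PREPARE ins_b (bigint, bigint, integer, integer) AS
    INSERT INTO tableb (id, id_a, prop1, prop2) VALUES ($1, $2, $3, $4);

EXECUTE ins_b (100001, 42, 1, 2);
EXECUTE ins_b (100002, 42, 3, 4);
-- ... many more rows, then a single COMMIT at the end
COMMIT;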
Use the SQL command COPY to insert many rows. COPY is specifically designed for bulk data import and will be faster than INSERT.
If COPY is too slow, usually because you need to insert a lot of data, the fastest way is to drop all indexes, insert the data with COPY, and then recreate the indexes. That can speed up the process by an order of magnitude, but of course the database is not fully usable while the indexes are dropped.
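A sketch of that last variant, assuming the new rows are available as a CSV file (the file path is illustrative):

-- Drop the index so COPY does not have to maintain it during the load
DROP INDEX my_index;

-- Bulk-load the new rows (use \copy from psql if the file lives on the client)
COPY tableb (id, id_a, prop1, prop2)
FROM '/path/to/new_rows.csv'
WITH (FORMAT csv);

-- Recreate the index once all the data is in place
CREATE INDEX my_index ON tableb USING btree (prop1);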
I have an SQL query that joins 3 tables. I need to synchronize the query result with one table every 24 hours.
The idea is to run the query every 24 hours, compare the result with the target table, and then delete rows from the target table, insert new ones, or update existing rows.
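For illustration, here is a hedged sketch of that compare/delete/insert/update step as a single MERGE statement, assuming SQL Server (the WHEN NOT MATCHED BY SOURCE clause is SQL Server specific); the names are illustrative and SourceView stands for the 3-table query:

MERGE TargetTable AS t
USING SourceView AS s              -- hypothetical view wrapping the 3-table query
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET col1 = s.col1, col2 = s.col2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, col1, col2) VALUES (s.id, s.col1, s.col2)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;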
I'm asking for "best-practices" to deal with this kind of situation.
Thank you.