How to truncate TABLE(Kind) in datastore - google-app-engine

I want to delete all entities of a certain Table (Kind).
(That Table has a lot of entities.)
But delete()/deleteMulti() is slow and each call is billed as delete operations.
Is there a truncate operation in Datastore?

No, there's no truncate operation. How many entities do you want to delete? The price is $0.02 per 100K entity deletes (https://cloud.google.com/datastore/pricing). You can delete 20K per day and stay within the free limit :)

Related

SQL Server table: create partitions?

I have a SQL database where, among other things, I have a prices table with one price per product per store.
There are 50 stores and over 500,000 products, so this table will easily have 25 to 30 million records.
This table is fed daily overnight with price updates, and has heavy read activity during the day. Reads are made with readonly intent.
All queries contain storeid as part of identifying the record to update or read.
I'm not yet able to determine how this will behave, since I'm still expecting the external supply of prices, but I'm expecting performance issues at least on read operations, even though indexes are in place for now...
My question is whether I should consider partitioning the table by store, since it is always part of the queries. But then I have indexes where storeid is not the only column that is part of the index.
Based on this scenario, would you recommend partitioning? The alternative I see is having 50 tables, one per store, but that seems painful and I'd rather avoid it if possible.
if I should consider table partition by store since it is always part of queries
Yes. That sounds promising.
But then I have indexes where storeid is not the only column that is part of the index.
That's fine. So long as the partitioning column is one of the clustered index columns, you can partition by it. In fact with partitioning, you can get partition elimination for a trailing column of the clustered index, then a clustered index seek within the target partition.
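For illustration, partitioning on the store column could look roughly like the sketch below. The object names (Prices, StoreId, ProductId, pfStore, psStore) are placeholders rather than the poster's actual schema, and the boundary list would need one value per store.
CREATE PARTITION FUNCTION pfStore (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3 /* ...one boundary per store, up to 50 */);
CREATE PARTITION SCHEME psStore
    AS PARTITION pfStore ALL TO ([PRIMARY]);   -- all partitions on one filegroup to keep it simple
CREATE TABLE dbo.Prices
(
    StoreId   int            NOT NULL,
    ProductId int            NOT NULL,
    Price     decimal(10, 2) NOT NULL,
    CONSTRAINT PK_Prices PRIMARY KEY CLUSTERED (StoreId, ProductId)
) ON psStore (StoreId);   -- queries filtering on StoreId get partition elimination, then a seek within the partition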
Hi all, and thank you for your replies.
I was able to gather significant information in a contained environment, where I confirmed that I can achieve excellent performance indicators by using only the appropriate indexes.
So for now we will keep it "as is" and have the partitioning strategy on hand just in case.
Thanks again, nice tips guys.

Partitioning table that is typically only inserted to in SQL Server 2012

I have a table that gets approximately 450,000 records per month inserted into it. It is an audit table of sorts that tracks changes to other tables in the database, that is, inserts, updates and deletes of records. Typically this table is not queried (perhaps only 2-3 times per month, to examine how data in other tables changed, and only under very specific circumstances).
It has been put to me that we should consider partitioning this table to help improve database performance. If the table is only being inserted to 99.9% of the time and rarely queried, would there be any tangible benefit to partitioning it?
Thanks.
If the table is only being inserted to 99.9% of the time and rarely queried, would there be any tangible benefit to partitioning this table?
Partitioning is mostly a manageability feature. I would expect no difference in insert performance with or without table partitioning. For SELECT queries, partitioning may improve performance of large scans if partitions can be eliminated (i.e. the partitioning column is specified in the WHERE clause), but indexing and query tuning are usually the key to performance.
Partitioning can improve performance of purge operations. For example, you could use a monthly sliding window to purge an entire month of data at once rather than deleting individual rows. I don't know if that's worth the trouble with only 450K rows/month, though.
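A rough sketch of such a sliding-window purge, assuming a table already partitioned by month; the names (AuditTrail, AuditTrail_Purge, pfAuditMonth) and the boundary value are made up for illustration:
-- Switch the oldest partition out to an empty staging table with an identical
-- structure on the same filegroup; this is a metadata-only operation.
ALTER TABLE dbo.AuditTrail SWITCH PARTITION 1 TO dbo.AuditTrail_Purge;
-- Throw the switched-out month away instantly instead of deleting ~450K rows.
TRUNCATE TABLE dbo.AuditTrail_Purge;
-- Remove the now-empty boundary so the window keeps sliding.
ALTER PARTITION FUNCTION pfAuditMonth() MERGE RANGE ('20120101');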
I think you want fast access to your recent data.
Add a date column as the first column of the clustered primary key instead of partitioning.

Difference between select from a whole partition or from table with small dataset?

I am new to partitioning.
Would there be a difference in performance between
select * from my_partitionedData where date = '20110523'
and
select * from my_Data where date = '20110523'
where my_partitionedData is a table partitioned by date with 1-day partitions, and my_Data is a table which has only the data for '20110523', and both tables have the same structure?
The other question - would there be a difference in performance in running these selects if all the partitions of my_partitionedData are in the same filegroup? (Note - the select is always for 1 day.)
Like everything else in SQL, you will need to test to be sure.
That being said, I think you should get identical performance.
Behind the scenes, a partitioned table is basically a lot of smaller tables logically unioned together. If you are partitioning by day in your partitioned table, and your non-partitioned table has only one day of data, the execution plan and performance should be pretty much identical.
If they return the same data set, a partitioned and a non-partitioned table will return the data with the same IO. If the partitioned table has less fragmentation, there would be a reduction in the IO delay from random seeks of the disk heads to retrieve the pages, but all in all 100k of data is 100k of data.
You did not mention whether you were considering partitioning the index. Partitioning an index is an excellent way to reduce the number of levels which must be traversed to find the location of a data row. Partitioning indexes and tables with the same function is the optimal solution.
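As a sketch of that last point, a daily partition function could be shared by the table and its indexes so they stay aligned; everything except the my_partitionedData name and the date column is an assumption made for illustration:
CREATE PARTITION FUNCTION pfDay (smalldatetime)
    AS RANGE RIGHT FOR VALUES ('20110522', '20110523', '20110524');
CREATE PARTITION SCHEME psDay
    AS PARTITION pfDay ALL TO ([PRIMARY]);   -- single filegroup, as in the second question
CREATE TABLE dbo.my_partitionedData
(
    [date]  smalldatetime NOT NULL,
    ItemId  int           NOT NULL,
    Amount  money         NOT NULL
) ON psDay ([date]);
-- An index created on the same scheme is "aligned": a one-day query such as
-- WHERE date = '20110523' touches exactly one partition of both table and index.
CREATE NONCLUSTERED INDEX IX_my_partitionedData_Item
    ON dbo.my_partitionedData (ItemId, [date])
    ON psDay ([date]);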
where my_partitionedData is a table partitioned by date by 1 day and my_Data is a table which has only data for '20110523' and both tables have same structure?
The latter one will have less access time.
The other question - would there be a difference in performance in running these selects if all the partitions of the my_partitionedData are in the same file group? (note - the select is always for 1 day)
The access time will be more in this case despite there being only 1 day of data.
Partitioning is required to improve the scalability and manageability of large tables and tables that have varying access patterns.
You created two tables to store information about each day's records. A single table for each day's data is the easiest to design and understand, but such tables are not necessarily optimized for performance, scalability, and manageability, particularly as the table grows larger.

What's your approach for optimizing large tables (+1M rows) on SQL Server?

I'm importing Brazilian stock market data to a SQL Server database. Right now I have a table with price information from three kinds of assets: stocks, options and forwards. I'm still on 2006 data and the table already has over half a million records. I have 12 more years of data to import, so the table will exceed a million records for sure.
Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:
[Stock] [int] NOT NULL
[Date] [smalldatetime] NOT NULL
[Open] [smallmoney] NOT NULL
[High] [smallmoney] NOT NULL
[Low] [smallmoney] NOT NULL
[Close] [smallmoney] NOT NULL
[Trades] [int] NOT NULL
[Quantity] [bigint] NOT NULL
[Volume] [money] NOT NULL
Now, the second approach for optimization was to make a clustered index. Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quotes for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second piece of information true?
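For reference, the schema described above would look roughly like this; the column list is taken from the question, while the table name Quotes is an assumption:
CREATE TABLE dbo.Quotes
(
    [Stock]    int           NOT NULL,
    [Date]     smalldatetime NOT NULL,
    [Open]     smallmoney    NOT NULL,
    [High]     smallmoney    NOT NULL,
    [Low]      smallmoney    NOT NULL,
    [Close]    smallmoney    NOT NULL,
    [Trades]   int           NOT NULL,
    [Quantity] bigint        NOT NULL,
    [Volume]   money         NOT NULL,
    -- compound clustered primary key: rows for one stock are stored together, ordered by date
    CONSTRAINT PK_Quotes PRIMARY KEY CLUSTERED ([Stock], [Date])
);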
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Now for a third approach I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut each table's size to about a third. Now, will this approach help, or does it not matter too much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
Another approach would be using the partition feature of SQL Server. I don't know much about it, but I think it's normally used when tables are large and you can span them across multiple disks to reduce I/O latency, am I right? Would partitioning be any help in this case? I believe I can put the newest values (latest years) and the oldest values in different partitions. The probability of seeking the newest data is higher, and with a small partition it will probably be faster, right?
What would be other good approaches to make this as fast as possible? The main select usage of the table will be for seeking a specific range of records for a specific asset, like the latest 3 months of asset X. There will be other usages, but this will be the most common, possibly executed by more than 3k users concurrently.
At 1 million records, I wouldn't consider this a particularly large table needing unusual optimization techniques such as splitting the table up, denormalizing, etc. But those decisions will come when you've tried all the normal means that don't affect your ability to use standard query techniques.
Now, the second approach for optimization was to make a clustered index. Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quotes for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second piece of information true?
It's logically true - the clustered index defines the logical ordering of the records on the disk, which is all you should be concerned about. SQL Server may forego the overhead of sorting within a physical block, but it will still behave as if it did, so it's not significant. Querying for one stock will probably be 1 or 2 page reads in any case; and the optimizer doesn't benefit much from unordered data within a page read.
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Not necessarily significantly. There isn't a linear relationship between table size and query speed. There are usually a lot more considerations that are more important. I wouldn't worry about it in the range you describe. Is that the reason you're concerned? 200 ms would seem to me to be great, enough to get you to the point where your tables are loaded and you can start doing realistic testing, and get a much better idea of real-life performance.
Now for a third approach I'm thinking of maybe splitting the table into three tables, each for a specific market (stocks, options and forwards). This will probably cut each table's size to about a third. Now, will this approach help, or does it not matter too much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
No! This kind of optimization is so premature it's probably stillborn.
Another approach would be using the partition feature of SQL Server.
Same comment. You will be able to stick for a long time to strictly logical, fully normalized schema design.
What would be other good approaches to make this as fast as possible?
The best first step is clustering on stock. Insertion speed is of no consequence at all until you are looking at multiple records inserted per second - I don't see anything anywhere near that activity here. This should get you close to maximum efficiency because it will efficiently read every record associated with a stock, and that seems to be your most common index. Any further optimization needs to be accomplished based on testing.
A million records really isn't that big. It does sound like it's taking too long to search though - is the column you're searching against indexed?
As ever, the first port of call should be the SQL profiler and query plan evaluator. Ask SQL Server what it's going to do with the queries you're interested in. I believe you can even ask it to suggest changes such as extra indexes.
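One way to ask for that information from a query window; the query shape is the "latest 3 months of asset X" pattern from the question, and the Quotes table name and stock id 42 are placeholders:
SET STATISTICS IO ON;     -- report logical reads per table
SET STATISTICS TIME ON;   -- report compile and execution times
SELECT [Date], [Open], [High], [Low], [Close]
FROM dbo.Quotes
WHERE [Stock] = 42
  AND [Date] >= DATEADD(month, -3, GETDATE());
-- In Management Studio, "Include Actual Execution Plan" (Ctrl+M) shows whether the
-- clustered index on (Stock, Date) is used as a seek or whether a missing index is suggested.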
I wouldn't start getting into partitioning etc just yet - as you say, it should all comfortably sit in memory at the moment, so I suspect your problem is more likely to be a missing index.
Check your execution plan on that query first. Make sure your indexes are being used. I've found that a million records is not a lot. To give some perspective, we had an inventory table with 30 million rows in it, and our entire query, which joined tons of tables and did lots of calculations, could run in under 200 ms. We found that on a quad-proc 64-bit server we could have significantly more records, so we never bothered partitioning.
You can use SQL Profiler to see the execution plan, or just run the query from SQL Management Studio or Query Analyzer.
Reevaluate the indexes... that's the most important part; the size of the data doesn't really matter. Well, it does, but not entirely for speed purposes.
My recommendation is to rebuild the indexes for that table and make a composite one for the columns you'll need the most. Now that you have only a few records, play with the different indexes; otherwise it'll get quite annoying to try new things once you have all the historical data in the table.
After you do that, review your query; make the query plan evaluator your friend, and check if the engine is using the right index.
I just read your last post; there's one thing I don't get: you are querying the table while you insert data? At the same time? What for? By inserting, do you mean one record or hundreds of thousands? How are you inserting? One by one?
But again, the key to this is the indexes; don't mess with partitioning and such yet... especially with a million records, that's nothing. I have tables with 150 million records, and returning 40k specific records takes the engine about 1500 ms...
I work for a school district and we have to track attendance for each student. It's how we make our money. My table that holds the daily attendance mark for each student is currently 38.9 million records large. I can pull up a single student's attendance very quickly from this. We keep 4 indexes (including the primary key) on this table. Our clustered index is student/date, which keeps all the student's records ordered by that. We've taken a hit on inserts to this table in the event that an old record for a student is inserted, but it is a worthwhile risk for our purposes.
With regards to select speed, I would certainly take advantage of caching in your circumstance.
You've mentioned that your primary key is a compound on (Stock, Date), and clustered. This means the table is organised by Stock and then by Date. Whenever you insert a new row, it has to insert it into the middle of the table, and this can cause the other rows to be pushed out to other pages (page splits).
I would recommend trying to reverse the primary key to (Date, Stock), and adding a non-clustered index on Stock to facilitate quick lookups for a specific Stock. This will allow inserts to always happen at the end of the table (assuming you're inserting in order of date), and won't affect the rest of the table, and lesser chance of page splits.
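A sketch of that change, assuming the quotes table from earlier; PK_Quotes and IX_Quotes_Stock are placeholder names:
-- Re-key the clustered primary key so new rows (loaded in date order) append at the end.
ALTER TABLE dbo.Quotes DROP CONSTRAINT PK_Quotes;
ALTER TABLE dbo.Quotes
    ADD CONSTRAINT PK_Quotes PRIMARY KEY CLUSTERED ([Date], [Stock]);
-- Nonclustered index to keep per-stock lookups fast after the key reversal.
CREATE NONCLUSTERED INDEX IX_Quotes_Stock
    ON dbo.Quotes ([Stock], [Date]);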
The execution plan shows it's using the clustered index just fine, but I forgot an extremely important fact: I'm still inserting data! The inserts are probably locking the table too often. Is there a way we can see this bottleneck?
The execution plan doesn't seem to show anything about lock issues.
Right now this data is only historical; when the importing process is finished the inserts will stop and become much less frequent. But I will soon have a larger table for real-time data that will suffer from this constant-insert problem and will be bigger than this table. So any approach to optimizing this kind of situation is very welcome.
Another solution would be to create a historical table for each year, put all those tables in a historical database, fill them in, and then create the appropriate indexes for them. Once you are done with this you won't have to touch them ever again. Why would you have to keep on inserting data? To query all those tables you just "union all" them :p
The current-year table should be very different from these historical tables. From what I understood you are planning to insert records on the go? I'd plan something different, like doing a bulk insert or something similar every now and then during the day. Of course all this depends on what you want to do.
The problem here seems to be in the design. I'd go for a new design. The one you have now, from what I understand, is not suitable.
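The "union all" idea above could be sketched like this; the per-year table names and the abbreviated column list are placeholders:
CREATE TABLE dbo.Quotes2006 ([Stock] int NOT NULL, [Date] smalldatetime NOT NULL, [Close] smallmoney NOT NULL);
CREATE TABLE dbo.Quotes2007 ([Stock] int NOT NULL, [Date] smalldatetime NOT NULL, [Close] smallmoney NOT NULL);
GO
-- A view hides the per-year split from readers; adding CHECK constraints on [Date]
-- would make this a classic partitioned view, letting the optimizer skip irrelevant years.
CREATE VIEW dbo.QuotesHistory AS
    SELECT [Stock], [Date], [Close] FROM dbo.Quotes2006
    UNION ALL
    SELECT [Stock], [Date], [Close] FROM dbo.Quotes2007;
GO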
Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique; I can't have two quotes for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second piece of information true?
Indexes in SQL Server are always sorted by column order in the index. So an index on [stock, date] will first sort on stock, then within stock on date. An index on [date, stock] will first sort on date, then within date on stock.
When doing a query, you should always include the first column(s) of an index in the WHERE part, else the index cannot be efficiently used.
For your specific problem: if date range queries for stocks are the most common usage, then put the primary key on [date, stock], so the data will be stored sequentially by date on disk and you should get the fastest access. Build up other indexes as needed. Do an index rebuild/statistics update after inserting lots of new data.
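The post-load maintenance mentioned at the end could look like this, with the Quotes table name assumed as before:
-- After a large historical import: remove fragmentation caused by out-of-order inserts
-- and give the optimizer fresh row estimates.
ALTER INDEX ALL ON dbo.Quotes REBUILD;
UPDATE STATISTICS dbo.Quotes WITH FULLSCAN;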

Fastest way to delete all the data in a large table

I had to delete all the rows from a log table that contained about 5 million rows. My initial try was to issue the following command in query analyzer:
delete from client_log
which took a very long time.
Check out TRUNCATE TABLE, which is a lot faster.
I discovered the TRUNCATE TABLE in the msdn transact-SQL reference. For all interested here are the remarks:
TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement.
You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger.
TRUNCATE TABLE may not be used on tables participating in an indexed view.
There is a common myth that TRUNCATE somehow skips the transaction log.
This is a misunderstanding, and it is clearly explained in MSDN.
This myth is invoked in several comments here. Let's eradicate it together ;)
For reference, TRUNCATE TABLE also works on MySQL.
I use the following method to zero out tables, with the added bonus that it leaves me with an archive copy of the table.
CREATE TABLE `new_table` LIKE `table`;                         -- create an empty copy with the same structure
RENAME TABLE `table` TO `old_table`, `new_table` TO `table`;   -- atomically swap in the empty table, keeping the old data as an archive
Forget truncate and delete. Keep your table definition (in case you want to recreate it) and just use DROP TABLE.
truncate table client_log
is your best bet; truncate kills all content in the table and indexes and resets any identity seeds you've got too.
TRUNCATE TABLE is not SQL-platform independent. If you suspect that you might ever change database providers, be wary of using it.
On SQL Server you can use the Truncate Table command which is faster than a regular delete and also uses less resources. It will reset any identity fields back to the seed value as well.
The drawbacks of truncate are that it can't be used on tables that are referenced by foreign keys and it won't fire any triggers. Also, you won't be able to roll back the data if anything goes wrong.
Note that TRUNCATE will also reset any auto incrementing keys, if you are using those.
If you do not wish to lose your auto incrementing keys, you can speed up the delete by deleting in sets (e.g., DELETE FROM table WHERE id > 1 AND id < 10000). It will speed it up significantly and in some cases prevent data from being locked up.
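A sketch of that batched approach on SQL Server 2005 or later, using DELETE TOP instead of explicit id ranges; the 10,000 batch size is an arbitrary choice:
DECLARE @rows int;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    -- Each batch commits on its own, keeping the transaction log growth and lock duration small.
    DELETE TOP (10000) FROM dbo.client_log;
    SET @rows = @@ROWCOUNT;
END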
Yes, well, deleting 5 million rows is probably going to take a long time. The only potentially faster way I can think of would be to drop the table, and re-create it. That only works, of course, if you want to delete ALL data in the table.
The suggestion of "Drop and recreate the table" is probably not a good one because that goofs up your foreign keys.
You ARE using foreign keys, right?
If you cannot use TRUNCATE TABLE because of foreign keys and/or triggers, you can consider to:
drop all indexes;
do the usual DELETE;
re-create all indexes.
This may speed up DELETE somewhat.
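Roughly, that sequence could look like the following; the index name and column are assumptions about the log table:
-- 1) Drop nonclustered indexes so the DELETE does not have to maintain them.
DROP INDEX IX_client_log_date ON dbo.client_log;
-- 2) The usual DELETE, now with less index maintenance overhead.
DELETE FROM dbo.client_log;
-- 3) Re-create the indexes afterwards.
CREATE NONCLUSTERED INDEX IX_client_log_date ON dbo.client_log (log_date);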
I am revising my earlier statement:
You should understand that by using TRUNCATE the data will be cleared but nothing will be logged to the transaction log. Writing to the log is why DELETE will take forever on 5 million rows. I use TRUNCATE often during development, but you should be wary about using it on a production database because you will not be able to roll back your changes. You should immediately make a full database backup after doing a TRUNCATE to establish a new basis for restoration.
The above statement was intended to prompt you to be sure that you understand there is a difference between the two. Unfortunately, it is poorly written and makes unsupported statements, as I have not actually done any testing myself between the two. It is based on statements that I have heard from others.
From MSDN:
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
I just wanted to say that there is a fundamental difference between the two and because there is a difference, there will be applications where one or the other may be inappropriate.
DELETE FROM table_name;
Premature optimization may be dangerous. Optimizing may mean doing something weird, but if it works you may want to take advantage of it.
SELECT DbVendor_SuperFastDeleteAllFunction(tablename, BOZO_BIT) FROM dummy;
For speed I think it depends on...
The underlying database: Oracle, Microsoft, MySQL, PostgreSQL, others, custom...
The table, its content, and related tables:
There may be deletion rules. Is there an existing procedure to delete all content in the table? Can this be optimized for the specific underlying database engine? How much do we care about breaking things / related data? Performing a DELETE may be the 'safest' way assuming that other related tables do not depend on this table. Are there other tables and queries that are related / depend on the data within this table? If we don't care much about this table being around, using DROP might be a fast method, again depending on the underlying database.
DROP TABLE table_name;
How many rows are being deleted? Is there other information that is quickly gleaned that will optimize the deletion? For example, can we tell if the table is already empty? Can we tell if there are hundreds, thousands, millions, billions of rows?
