I know Oracle automatically keeps frequently accessed data in memory. I'm curious: is there any way to keep a table in memory manually, for better performance?
Yes, you can certainly do that. You need to pin the table in the KEEP buffer pool of the database buffer cache.
For example,
ALTER TABLE table_name STORAGE (buffer_pool KEEP);
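Note that the KEEP pool has no memory behind it until you size it. A minimal sketch, assuming an SPFILE (the 256M figure is just a placeholder):
ALTER SYSTEM SET db_keep_cache_size = 256M SCOPE = BOTH;  -- allocate memory for the KEEP buffer pool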
By the way, from Oracle 11g onwards you can also have a look at the RESULT CACHE. It is quite useful.
Have a look at this AskTom link https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:253415112676
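For instance, a minimal sketch of the result cache in action (table and column names are placeholders):
SELECT /*+ RESULT_CACHE */ region, SUM(amount)
FROM sales
GROUP BY region;  -- the result set is cached and reused until the underlying data changes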
The short answer is no, and you don't want to.
If you need that high a level of retrieval performance, then consider using an in-memory DB like TimesTen.
Think about what you are asking the DB to do. You are asking the DB to dedicate n amount of cache memory to a single table and hold it there indefinitely. In a busy DB this will simply kill performance to the point of the DB being useless. Let's say you have a DB with a few hundred tables in it, some of them small, some large and some very large, and throw in a few PKs and indexes.
A query comes along that asks for, say, 100K rows of data that are 1 KB each, where the index key is a 100-byte string. The DB will allocate sufficient memory to load up the index, and then start grabbing 8K chunks of data off the disk and putting them into the cache.
If you ask it to store a few gigabytes of data in RAM, permanently, you will run out of memory in a big hurry unless you have a VERY expensive machine with 512 GB of RAM in it; then you will start hitting the swap file, and at that point your performance is toast.
If you are having performance issues on queries, run EXPLAIN PLAN and learn how to use it to discover the bottlenecks. I have a 24-core machine with 48 GB of RAM, but I have tables with billions of rows of data, and I keep a close eye on my cache hits and execution plans.
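A minimal sketch of that workflow (table, column and literal are placeholders):
EXPLAIN PLAN FOR
  SELECT * FROM my_big_table WHERE some_col = 42;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);  -- shows the chosen plan, access paths and estimated rows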
Also consider materialized views.
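For example, a rough sketch of a pre-aggregated materialized view (names and the refresh policy are placeholders; fast refresh on commit would additionally need materialized view logs):
CREATE MATERIALIZED VIEW sales_by_region_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
  SELECT region, SUM(amount) AS total_amount
  FROM sales
  GROUP BY region;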
If I am loading data in batches, and, as a part of the loading pipeline, I am joining to a very large reference table to grab some values, is there an advantage to creating a temp table out of that large reference table and using that during the batched data load?
I am wondering whether during each batch the reference table has to be pulled off of disk, and whether by loading it into a temp table I can keep it in memory and bypass that expensive disk operation. Or, what is best practice for using a very large reference table during batched loads? BTW the table is about 3 GB, so big, but not too big to keep in memory.
is there an advantage to creating a temp table out of that large reference table and using that during the batched data load
No.
The data in SQL Server is organised into 8kb chunks, called "pages". SQL Server will always satisfy a query by performing operations against pages in RAM. If the pages it needs are not in RAM, it will pull them from the disk to RAM, and then perform the operations in RAM. Those pages will then stay in RAM.
"Forever".
Except... what if we don't have enough RAM to store all of the data?
If SQL Server needs a page that is not currently in RAM, but there is no more RAM available for it to use, it has to clear out some other page to make room for this new page. It has an algorithm for deciding which page should be cleared out, roughly based on what has been "least recently used" (technically LRU2, I believe).
The end result is this: If you have just finished reading a bunch of rows from VeryLargeReferenceTable, and all of the pages needed to satisfy that query were able to fit in RAM, then as long as SQL isn't forced to flush your pages out of RAM because of other queries being run, all of that data for VeryLargeReferenceTable is still in RAM when you run your next batch.
Now, what if you create the temp table as a copy of VeryLargeReferenceTable (henceforth VLRT)?
SQL Server clearly has to read the data out of VLRT to do that - which means getting the VLRT pages into RAM! That's what it would have had to do if we just joined to VLRT directly!
But it also has to write all the data from VLRT to the temp table. So it also has to allocate new pages for the temp table in RAM! And these are going to be "dirty" pages (they've been modified), so if we start getting memory pressure we might have to write them to disk! Yikes.
In addition to this, you probably (hopefully!) have a useful index on VLRT for this query, yes? We want to create that index on the temp table as well, right? More page allocations, more page IO, more CPU time, to build that index.
So by using the temp table we did everything we had to do without the temp table, plus a whole lot more.
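If you want to verify this yourself, here is a rough sketch (the table name is a placeholder) of how to check how much of a table is currently sitting in the buffer pool; run it before and after a batch and you can watch the VLRT pages arrive in RAM and stay there:
SELECT COUNT(*)            AS cached_pages,
       COUNT(*) * 8 / 1024 AS cached_mb        -- pages are 8 KB each
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p        ON p.hobt_id = au.container_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID('dbo.VeryLargeReferenceTable');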
Can someone please help me understand which layer Snowflake is fetching data from in the plan below? I understand Snowflake uses one of three sources (besides serving results from metadata for queries like SELECT COUNT(*)): the result cache, the warehouse cache, or remote disk I/O. In the plan below it is not the result cache (the plan would then say 'query result reuse'), it is not showing any remote disk I/O, and the cache usage is 0%.
So it's not very clear how the data is being read here. Any thoughts or pointers would be helpful.
The picture says that 0.44MB were scanned.
The picture says that 0% of those 0.44MB came from the local cache.
Hence 0.44MB were read from the main storage layer.
The data is read from the storage layer. I will assume AWS, so from the S3 storage where your table lives. There are three primary reasons for a remote read:
It is the first time this warehouse has used this data. This is the same thing that happens if you stop/start the warehouse.
The data has changed (which can mean anywhere from 0% to 100% of the partitions changed); given that in your example there is only one partition, any insertion happening in the background will cause 100% cache invalidation.
The data was flushed from the local cache by more active data. If you read this table once every 30 minutes but in between read gigabytes of other tables, then, as in any cache, the low-usage data gets dropped.
The result cache can be used, but it can also be turned off for a session; the local disk cache still applies either way. Your WHERE 20 = 20 might in theory bust the result cache, but as it's a meaningless predicate it might not; given your results it seems that, at this point in time, it is enough to defeat the result cache. Which implies: if you do not want to bypass the result cache, stop changing the number, and if you do want to bypass it, this approach seems to work.
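If you want to bypass the result cache deterministically rather than relying on a throwaway predicate, you can simply turn it off for the session, e.g.:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;  -- subsequent queries skip the result cache; the warehouse cache still applies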
I see you have highlighted the two spilling counters. Those appear when working-state data is too large for memory (it spills to local disk) and then too large for local disk (it is sent to remote storage, i.e. S3). The former is a sign your warehouse is undersized, and both are a hint that something in your query is rather bloated. Maybe that is what you want/need, but it slows things down very much. To know whether there is perhaps "another way": if in the profile there is some step that goes 100M rows -> 100B rows -> 42 rows, that implies a giant mess was made and then some filter smashed nearly all of it away, which in turn implies the work could be done differently to avoid that large explosion and filtering.
Theoretical SQL Server 2008 question:
If a table-scan is performed on SQL Server with a significant amount of 'free' memory, will the results of that table scan be held in memory, thereby negating the efficiencies that may be introduced by an index on the table?
Update 1: The tables in question contain reference data with approx. 100 - 200 records per table (I do not know the average size of each row), so we are not talking about massive tables here.
I have spoken to the client about introducing a memcached / AppFabric Cache solution for this reference data, however that is out of scope at the moment and they are looking for a 'quick win' that is minimal risk.
Every page read in the scan will be read into the buffer pool and only released under memory pressure as per the cache eviction policy.
Not sure why you think that would negate the efficiencies that may be introduced by an index on the table though.
An index likely means that far fewer pages need to be read. Even if all pages are already in cache, so that no physical reads are required, reducing the number of logical reads is still a good thing. Logical reads are not free; they still have overhead for locking and reading the pages.
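A quick way to see the difference (a sketch; table, column and value are placeholders) is to compare logical reads for a seek versus a scan:
SET STATISTICS IO ON;
SELECT * FROM dbo.RefData WHERE RefCode = 'ABC';  -- index seek if RefCode is indexed: a handful of logical reads
SELECT * FROM dbo.RefData;                        -- full scan: a logical read for every page in the table
SET STATISTICS IO OFF;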
Besides the performance problem (even when all pages are in memory a scan is still going to be many many times slower than an index seek on any table of significant size) there is an additional issue: contention.
The problem with scans is that the operation has to visit every row. This means that any SELECT will block behind any INSERT/UPDATE/DELETE (since it is guaranteed to visit the rows locked by those operations). The effect is basically serialization of operations, and it adds huge latency, as SELECTs now have to wait for DML to commit every time. Even under mild concurrency the result is an overall sluggish and slow-to-respond table. With indexes present, operations only look at rows in the ranges of interest, and this, by virtue of simple probabilities, reduces the chances of conflict. The result is a much livelier, more responsive, low-latency system.
Full Table Scans also are not scalable as the data grows. It’s very simple. As more data is added to a table, full table scans must process more data to complete and therefore they will take longer. Also, they will produce more Disk and Memory requests, further putting strain on your equipment.
Consider a 1,000,000-row table that a full table scan is performed on. SQL Server reads data in the form of 8K data pages. Although the amount of data stored within each page can vary, let's assume that on average 50 rows of data fit in each of these 8K pages for our example. To perform a full scan and read every row, 20,000 page reads are required (1,000,000 rows / 50 rows per page). That equates to roughly 156MB of data that has to be processed, just for this one query. Unless you have a really fast disk subsystem, it might take a while to retrieve and process all of that data. Now, let's assume that this table doubles in size each year. Next year, the same query must read 312MB of data just to complete.
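By contrast, a hedged sketch (table and column names are placeholders) of the kind of covering index that would let such a query touch a few dozen pages instead of all 20,000:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);  -- a seek on CustomerId reads only the matching index pages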
Please refer to this link - http://www.datasprings.com/resources/articles-information/key-sql-performance-situations-full-table-scan
I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data but for several reasons it became very hard to manage.
I believe NoSQL databases are good solutions for us. What we are going to store is generally documents (usually around 100KB, but occasionally much larger or smaller) annotated with some metadata. Query performance is not a top priority. The priority is writing in a way that I/O becomes as small a hassle as possible. The rate of data generation is about 1Gbps, but we might move to 10Gbps (or even more) in the future.
My other requirement is the availability of a (preferably well documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?
OK, so just to clarify, your data rate is roughly a gigabyte per second? So you are filling a 1TB hard drive every 20 minutes or so?
MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low data-to-RAM ratio. You want to keep at least the primary indexes in memory along with some data.
In my experience, you want about 1GB of RAM for every 5-10GB of data. Beyond that ratio, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow, as the index stops fitting in RAM.
The big key here is:
What queries are you planning to run and how does MongoDB make running these queries easier?
Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
Additionally, MongoDB does not support compression. So you will be using lots of disk space.
If not, what other database system can I use?
Have you considered compressed flat files? Or possibly a big data Map/Reduce system like Hadoop (I know Hadoop is written in Java)
If C is key requirement, maybe you want to look at Tokyo/Kyoto Cabinet?
EDIT: more details
MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.
Large indices defeat the purpose of using an index.
According to your numbers, you are writing 10M documents / 20 mins or about 30M / hour. Each document needs about 16+ bytes for an index entry. 12 bytes for ObjectID + 4 bytes for pointer into the 2GB file + 1 byte for pointer to file + some amount of padding.
Let's say that every index entry needs about 20 bytes, then your index is growing at 600MB / hour or 14.4GB / day. And that's just the default _id index.
After 4 days, your main index will no longer fit into RAM and your performance will start to drop off dramatically. (this is well-documented under MongoDB)
So it's going to be really important to figure out which queries you want to run.
Have a look at Cassandra. It executes writes much faster than reads. That is probably what you're looking for.
I'm creating a database, prototyping and benchmarking first. I am using H2, an open-source, commercially free, embeddable, relational Java database. I am not currently indexing on any column.
After the database grew to about 5GB, its batch write time doubled (the rate of writing slowed to half the original rate). I was writing roughly 25 rows per millisecond with a fresh, clean database, and now at 7GB I'm writing roughly 7 rows/ms. My rows consist of a short, an int, a float, and a byte[5].
I do not know much about database internals or even how H2 was programmed. I would also like to note I'm not badmouthing H2, since this is a problem with other DBMSs I've tested.
What factors might slow down the database like this if there's no indexing overhead? Does it mainly have something to do with the file system structure? From my results, I assume the way Windows XP and NTFS handle files makes it slower to append data to the end of a file as the file grows.
One factor that can complicate inserts as a database grows is the number of indexes on the table, and the depth of those indexes if they are B-trees or similar. There's simply more work to do, and it may be that you're causing index nodes to split, or you may simply have moved from, say, a 5-level B-tree to a 6-level one (or, more generally, from N to N+1 levels).
Another factor could be disk space usage -- if you are using cooked files (that's the normal kind for most people most of the time; some DBMS use 'raw files' on Unix, but it is unlikely that your embedded system would do so, and you'd know if it did because you'd have to tell it to do so), it could be that your bigger tables are now fragmented across the disk, leading to worse performance.
If the problem was on SELECT performance, there could be many other factors also affecting your system's performance.
This sounds about right. Database performance usually drops significantly as the data can no longer be held in memory and operations become disk bound. If you are using normal insert operations, and want a significant performance improvement, I suggest using some sort of a bulk load API if H2 supports it (like Oracle sqlldr, Sybase BCP, Mysql 'load data infile'). This type of API writes data directly to the data-file bypassing many of the database subsystems.
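If you stay on H2, one commonly suggested pattern (a sketch; the file and table names are placeholders) is to bulk-import from CSV with its CSVREAD table function instead of issuing row-by-row INSERTs:
INSERT INTO measurements SELECT * FROM CSVREAD('data/batch_001.csv');  -- one statement loads the whole file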
This is most likely caused by variable width fields. I don't know if H2 allows this, but in MySQL, you have to create your table with all fixed width fields, then explicitly declare it as a fixed width field table. This allows MySQL to calculate exactly where it needs to go in the database file to do the insert. If you aren't using a fixed width table, then it has to read through the table to find the end of the last row.
Appending data (if done right) is an O(n) operation, where n is the length of the data to be written. It doesn't depend on the file length; there are seek operations to skip over the existing data easily.
For most databases, appending to a database file is definitely slower than pre-growing the file and then adding rows. See if H2 supports pre-growing the file.
Another cause is whether the entire database is held in memory or if the OS has to do a lot of disk swapping to find the location to store the record.
I would blame it on I/O, especially if you're running your database on a normal PC with a normal hard disk (by that I mean not on a server with super-fast hard drives, etc.).
Many database engines create an implicit integer primary key for each table, so even if you haven't declared any indexes, your table is still indexed. This may be a factor.
Using H2 for a 7GB data file is the wrong choice from a technological point of view. As you said, it is embeddable. What kind of "embedded" application do you have if you need to store so much data?
Are you performing incremental commits? Since H2 is an ACID-compliant database, if you are not performing incremental commits, then there is some type of redo/undo log so that, in the case of an accidental failure (say, a power outage) or a rollback, the changes can be rolled back.
In that case, your redo log may be growing large, overflowing memory buffers and needing to be written out to disk along with your actual data, adding to your I/O overhead.
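A minimal sketch of batching commits so that log stays bounded (the batch size is a placeholder; in H2 this can also be done from JDBC via setAutoCommit(false)):
SET AUTOCOMMIT OFF;
-- ... execute a batch of, say, 10,000 INSERT statements ...
COMMIT;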