When experimenting with (embedded) Apache Derby DB, I noticed that a fresh database, with no tables in it, takes about 1.7 MB of disk space. That's quite a bit more than I would have expected.
Why is that? Are there significant differences in this between various database engines? Can this be controlled with some "block size"-like settings?
There will be differences between different database engines.
Generally, there will be all the metadata tables that are needed to track the real tables/views/whatever else can appear in the database when they're created, possibly some pre-allocated space ready for when tables are added or when transactions start occurring.
E.g. the model database for SQL Server (2000) occupies ~1.25 MB of space, of which 0.5 MB is empty. This DB is the basis for all other databases in SQL Server.
Why does an empty folder occupy 4 KB on disk, e.g. in Windows?
I have this wild guess out of nowhere...
You said that it's embedded... So since it's embedded, the database itself has to contain all the information it needs to manage itself properly, maybe user account information and so on, which in most server/network databases is usually handled by built-in system databases. It's EMBEDDED! Just a thought!
Related
How can I monitor which data is being accessed and which frequency?
I need to migrate several (very) small SQL Server instances, each with several small databases. The current configuration is based on a lot of equally small servers with local storage. The new configuration is based on a single server with a single NAS.
So far, the SQL Server memory and CPU sizing is OK. So are the DB sizes and total IOPS. But there's no existing documentation of what data set is actually being accessed. So, basically, I don't have a clue what the real storage requirements are: the total IOPS may hit only a couple of tables (so it would work like a charm with just a couple of SSDs), or the whole set of databases may be scanned all the time and I'll need several dozen disks.
So, back to the question: How can I "profile" and get statistics of what data is being accessed? Either at SQL or Windows level?
The best way to see how much a table or groups of tables are being used is to use SQL Server Audit. It has very little impact on SQL Server's performance and can be easily set up to monitor selects (unlike triggers) in addition to inserts/updates/deletes.
We have a Delphi application which can connect to either Oracle or SQL Server. We use Devart components to connect to the databases, and everything is very generic when it comes to database access, i.e. we use the lowest common denominator. Ultimately we use the databases as data stores and do not use any of the more "advanced" features which may be specific to the database.
However, we have a serious performance issue with Oracle. It has to do with inserting data. I know that inserting data by running off a load of insert statements is not great for performance, but due to some business logic that needs to be done on the raw data before it gets uploaded to the database, we are a little restricted to multiple inserts. To give an idea of the performance difference, a recent test we did inserted 1000 items into our database and took 5 minutes in SQL Server (acceptable) but 44 minutes in Oracle.
Is there anything we could do to improve performance? The inserting of data needs to be done by the user and NOT an Oracle DBA, so absolutely no Oracle skills is one of the pre-requisites for any solution. Basically, the users need to press a button and everything is done.
Edit: Business logic happens before the insert (although there is a little going on during the actual insert), so a more realistic number would be 2 minutes for SQL Server and 40 or so minutes for Oracle. Bear in mind we are inserting a few large blobs per record, so perhaps that explains the slowish performance, but not why there is such a difference. The 1000 items are part of a transaction.
Oracle supports array DML, which can speed up performance. Also, if BLOBs are involved, performance may depend on caching settings and on how the BLOBs are set up in the destination table. Tuning some db client parameters may also help to increase network speed.
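To make the row-by-row versus array-style distinction concrete, here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in: it is not Oracle, it says nothing about the Devart components in the question, and the table `items` with its two columns is made up for the example. The point is only the shape of the client code: one statement call per row versus one call that hands the driver the whole batch. Whether a given Oracle driver turns the batched call into real array binding is an assumption you would need to check for your driver.

```python
import sqlite3

INSERT_SQL = "INSERT INTO items (id, payload) VALUES (?, ?)"


def make_db():
    """Create an in-memory stand-in database with a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload BLOB)")
    return conn


def insert_row_by_row(conn, rows):
    """One execute() call per row, all inside a single transaction."""
    cur = conn.cursor()
    for row in rows:
        cur.execute(INSERT_SQL, row)
    conn.commit()


def insert_batched(conn, rows):
    """Hand the whole batch to the driver in one executemany() call."""
    cur = conn.cursor()
    cur.executemany(INSERT_SQL, rows)
    conn.commit()


def count_rows(conn):
    """Return the number of rows currently in the items table."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Both functions leave the table in the same state; any speed difference comes from how many client/server round trips the driver needs, which is why it matters much more over a network than in this in-memory stand-in.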
Anyway, without knowing which version of Oracle you're using, how it is configured, your table definition(s) (and their tablespaces), how large the BLOBs are, and the SQL actually used (did you trace it?), it's very difficult to diagnose the real problem.
Oracle has some powerful diagnostic tools to identify bottlenecks, but they may not be easy to use and require knowing enough about how Oracle works. From the Enterprise Manager Console you can access some of them in a more readable format - did you check it?
Update: because I can't comment on other answers: Oracle supports different types of LOB storage:
LOBs stored in the database (under transaction management)
BFILEs, LOBs held in the external file system yet still accessed through Oracle (LOB data not under transaction control)
SecureFiles (11g onwards; a re-engineered in-database LOB storage with transaction support and other features such as compression, deduplication and encryption)
Oracle is designed for and can manage large LOBs - it just needs to be configured properly. Parameters that will affect LOB performance:
ENABLE/DISABLE STORAGE IN ROW
CACHE/NOCACHE/CACHE READS
LOGGING/NOLOGGING
CHUNK
PCTVERSION/RETENTION (especially for updates and deletes)
TABLESPACE (usually, a dedicated tablespace for lobs is advisable)
These parameters need to be set taking into account the average LOB size, how LOBs are accessed, and how often they are modified. There's no "one size fits all".
But there is also the client side: OCI can buffer LOBs client side, so small read/write operations are cached, minimizing the number of network roundtrips and LOB versioning - that's up to the OCI wrapper you're using.
Array DML (only available with FireDAC, ODAC, DOA and our SynDBOracle unit afaik) won't change much if your problem is about blob transfer.
First idea is to compress the data before transmission.
Try several access libraries. Our open source SynDBOracle directly accesses the oci.dll client, so it may be slightly faster.
But perhaps the problem is on the server side. Oracle does not like transactions with huge data, since they tend to overflow its write-ahead log files (the redo logs, in Oracle terms). Try tuning the redo log configuration.
IMHO an RDBMS is not the best option for storing huge blobs. Plain files, indexed via an RDBMS for metadata, are usually better. Or switch to a NoSQL storage, like key/value stores or the MongoDB blob API.
Remember that both Oracle and MSSQL charge money in proportion to the data size....
We are using SQL Server 2005. Recently SQL Server 2005 crashed in our production environment because tempdb had grown too large.
1) what could be reason for large tempdb size?
2) Is there any way to look what data is there in tempdb?
2) Is there any way to look what data is there in tempdb?
No, because it is not kept there. Tempdb has very special treatment, like being dropped on every server restart.
1) what could be reason for large tempdb size?
Inefficient SQL, maintenance jobs, or just the data at hand. Obviously an 800 GB or 6000 GB database may require more tempdb space than a 4 GB online CRM attempt. You don't really specify ANY size in absolute terms. What IS large? I have tempdb databases hardcoded at 64 GB on my smaller servers.
Typical SQL that goes into Tempdb are:
Sorts that are not solvable as part of the query (you need to store keys SOMEWHERE)
DISTINCT. Needs all returned data in tempdb to find duplicates.
Certain operations, possibly during joins.
Explicit tempdb usage (temporary tables). I just mention them because I often keep some hundred megabytes worth of data in them during loads and scrubbing.
In general you can find those queries by their huge IO stats in the query log, or simply by their being slow.
That said, maintenance plans also go in there, but with reason. In the end, your "large" is possibly my "not even worth mentioning, tiny". It really depends on what you do. Use the query trace tool to find out what takes long.
Physically, tempdb is very special in treatment: SQL Server does NOT write to the file if it does not have to (i.e. it keeps things in memory). Writes to the disk are a sign of memory overflowing. This is different from normal db write behavior. Tempdb, IF it overflows, is best put onto a decently fast SSD... which won't normally be SO expensive because it will still be relatively small.
Use the query here to find other queries for tempdb. Basically you are fishing in dirty water here and need to try things out until you find the culprit.
The usual way a SQL Server database grows (any database, not just tempdb) is by having its data and log files set to autogrow (especially the log files). SQL Server is perfectly happy to grow the log and data files until they consume all the disk space available to them.
Best practice, IMHO, is to allow limited autogrowth on the data files (put an upper bound on how big it can grow) and fix the size of the log files. You might need to do some analysis to figure out how big the log files need to be. For tempdb, especially, the recovery model should be set to simple, too.
OK, tempdb is a kinda special database. Any temporary objects you use in procedures etc. are created here. So if your application uses a lot of temp tables in queries, they will all reside here, but they should clean themselves up after the connection (spid) is reset.
The other thing that can grow a tempdb is database maintenance tasks, however they will take a larger toll on the database log files.
Tempdb is also cleared every time you restart the SQL Service. It basically drops the database and re-creates it. I agree with #Nic about leaving tempdb as it is; don't muck around with it. Any issue with space in tempdb usually indicates another, larger problem somewhere else. More space will mask the problem, but only for so long. How much free space is there on the drive that holds tempdb?
Another thing: if you haven't already, try to put tempdb on its own drive, and, if possible, put the data and the log files on their own separate drives.
So, if you don't restart your SQL Server/Service, your drive will run out of space pretty soon.
-- size is reported in 8 KB pages, so size * 8 gives kilobytes
USE tempdb;
SELECT name, (size * 8) AS FileSizeKB FROM sys.database_files;
Which is the better practice for storing files? Store the file directly in the database, or just the location of the file?
Avoid storing files in your database. Most don't deal with them well.
It depends. You need to consider several things.
If you have a mickey mouse freeware database, meaning one that does not handle blobs appropriately (reads the blobs on every SELECT; does not store the blobs in a separate physical structure from the row; very slow with blobs; etc):
keep the files outside, store only the location
manually deal with the syncing of row.location vs the file system
If you have an enterprise SQL Platform, it is no problem at all to keep the blobs inside the database. In fact, retrieval is faster. These do not read the blobs on every SELECT, they are stored in a separate physical structure to the rows. The one extra read to get the blob if the SELECT requests it, is not a "performance problem".
The PAGESIZE in genuine SQL databases can be set as 2k; 4k; 8k; or 16k.
2k is perfect for OLTP (small rows, small Transactions: you do not want to move 8K on every IO operation)
larger sizes are relevant based on how much OLAP you cater for
in your case, the average size of the files
there will be some waste in the unused portion of the last page, per row/blob.
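The waste in the last page is easy to estimate. The sketch below is a simplification under stated assumptions: it treats every page as fully usable for blob data and ignores page headers and any engine-specific overhead, so real numbers will be somewhat worse. The function names are made up for the example.

```python
def pages_needed(blob_bytes, page_bytes):
    """Number of whole pages required to hold a blob (ceiling division)."""
    if blob_bytes < 0 or page_bytes <= 0:
        raise ValueError("blob_bytes must be >= 0 and page_bytes > 0")
    return (blob_bytes + page_bytes - 1) // page_bytes


def wasted_bytes(blob_bytes, page_bytes):
    """Unused bytes in the last page, ignoring per-page overhead."""
    return pages_needed(blob_bytes, page_bytes) * page_bytes - blob_bytes
```

For a 10,000-byte blob this gives 5 pages and 240 wasted bytes at a 2 KB page size, versus 2 pages and 6,384 wasted bytes at 8 KB, which is why the average file size matters when choosing the page size.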
The disadvantage of keeping the blobs in the database is, your database backups will be significantly larger.
Some enterprise databases (e.g. SAP/Sybase) recognise that a page has not changed, and exclude it from the incremental backups
others have no incremental database backups.
The advantage of keeping the blobs in the database is:
data and referential integrity. You will not have the problem of having the rows that are out of synch with the blobs
the blobs are included in the backup: otherwise, upon a restore, the task of syncing the restored database with the restored files is a major problem.
I completed an assignment last year, where the customer had 130GB of data in the db, and 700GB of documents stored outside the db. After ten years of problems, they bit the bullet, and moved the documents into the db.
Guess what, what was supposed to be a simple job (long but simple, because the references were supposed to be absolutely correct), ended up being massive, because there were so many (a) duplicates, and (b) invalid references.
The resulting database was 630GB, there were 100GB of dupes. 2K pagesize.
Responses to Comments
Slash or Backslash
Easy.
In the database, store slash only.
You need a way of identifying the target system, and an IsWindoze indicator. It should be higher up in the table hierarchy, not at the level where the Filename is located.
Whenever you report or display the Filename column, if IsWindoze, change the slashes to backslashes.
You will have a similar problem with the DriveLetter and colon D:, which Unix does not have. Allow it only if IsWindoze.
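The display rule above can be sketched in a few lines of Python. The function name, the `is_windows` flag (standing in for the IsWindoze indicator) and the optional `drive_letter` argument are all assumptions made for the example; how the indicator is actually looked up from the table hierarchy is left out.

```python
def display_filename(stored_path, is_windows, drive_letter=None):
    """Render a path that is stored with forward slashes only.

    Backslashes and drive letters appear only at display time,
    and only for Windows targets.
    """
    if "\\" in stored_path:
        raise ValueError("stored paths must use forward slashes only")
    if not is_windows:
        if drive_letter is not None:
            raise ValueError("drive letters are only allowed for Windows targets")
        return stored_path
    shown = stored_path.replace("/", "\\")
    if drive_letter is not None:
        return f"{drive_letter}:{shown}"
    return shown
```

For example, `display_filename("/docs/2010/report.pdf", True, "D")` returns `D:\docs\2010\report.pdf`, while the same stored value is shown unchanged for a Unix target.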
Late answer: it depends on your engine.
A page size of 2k hasn't been used since the 1990s for SQL Server. Oracle defaults to 8K, SQL Server is 8K. Only Sybase AFAIK is still in the last century.
SQL Server now offers FILESTREAM which combines the best of both worlds, as Oracle has done for longer with BFILE
SQL Server and Oracle offer on disk and backup compression
I'm sure PostgreSQL at least offers similar features.
Note: this is mainly to offer alternatives to PerformanceDBA's FUD
The preferred method is to store the file in the filesystem and store the location of the file in the database. The reasoning for this has to do with how databases physically allocate space on disk (usually in 8k or 16k chunks). Dropping large files in there causes your database to use different mechanisms to store the files (SQL Server calls this row overflow data). Typically these kinds of pages are located outside the normal table, so every logical read for a row results in two physical reads on disk. Needless to say, this isn't good for performance.
I have a design decision to make regarding documents uploaded to my web site: I can either store them on my file server somewhere, or I can store them as a blob in my database (MSSQL 2005). If it makes any difference to the design decision, these documents are confidential and must have a certain degree of protection.
The considerations I've thought of are:
Storing on the file server makes for HUUUUUUUGE numbers of files all dumped in a single directory, and therefore slower access, unless I can work out a reasonable semantic definition for a directory tree structure
OTOH, I'm guessing that the file server can handle compression somewhat better than the DB... or am I wrong?
My instincts tell me that the DB's security is stronger than the file server's, but I'm not sure if that's necessarily true.
Don't know how having terabytes of blobs in my DB will affect performance.
I'd very much appreciate some recommendations here. Thanks!
In SQL Server 2005, your only choices are to use VARBINARY(MAX) to store the files inside the database table, or to keep them outside the database.
The obvious drawback of leaving them outside the database is that the database can't really control what happens to them; they could be moved, renamed, deleted.....
SQL Server 2008 introduces the FILESTREAM attribute on VARBINARY(MAX) types, which allows you to leave the files outside the database table, but still under transactional control of the database - e.g. you cannot just delete the files from the disk; the files are an integral part of the database and thus get copied and backed up with it. Great if you need it, but it could make for some huge backups! :-)
The SQL Server 2008 launch presented some "best practices" as to when to store stuff in the database directly, and when to use FILESTREAM. These are:
if the files are typically less than 256 KB in size, the database table is the best option
if the files are typically over 1 MB in size, or could be more than 2 GB in size, then FILESTREAM (or in your case: plain old filesystem) is your best choice
no recommendation for files between those two margins
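Those size thresholds translate directly into a small helper. This is only a sketch of the rule of thumb as listed above; the function name and the returned labels are made up, and it assumes KB and MB mean 1024-based units.

```python
KB = 1024
MB = 1024 * KB


def suggest_storage(typical_size_bytes):
    """Apply the size-based rule of thumb listed above to a typical file size."""
    if typical_size_bytes < 0:
        raise ValueError("size must not be negative")
    if typical_size_bytes < 256 * KB:
        return "database table"
    if typical_size_bytes > 1 * MB:
        return "FILESTREAM or filesystem"
    # Between 256 KB and 1 MB there is no recommendation either way.
    return "no recommendation"
```

Feeding it the typical (not the maximum) file size of your document store gives a first answer; for the middle band you would have to measure with your own workload.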
Also, in order not to negatively impact the performance of your queries, it's often a good idea to put the large files into a separate table altogether. Don't make the huge blobs part of the regular tables you query all the time; instead, create a separate table which you only query when you really need the megabytes of documents or images.
So that might give you an idea of where to start out from!
I strongly suggest you consider the filesystem solution. The reasons are:
you have better access to the files (precious in case of debugging), meaning that you can use regular console-based tools
you can quickly and easily take advantage of the OS to distribute the load, for example using a distributed filesystem, add redundancy via a hardware RAID etc.
you can take advantage of the OS access control lists to enforce permissions.
you don't clog your database
If you are worried about large numbers of entries in your directories, you can always create a branching scheme. For example:
filename : hello.txt
filename md5: 2e54144ba487ae25d03a3caba233da71
final filesystem position: /path/2e/54/hello.txt
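The branching scheme above can be sketched in Python as follows. The function name and the choice to hash the UTF-8 bytes of the file name (rather than the file contents) are assumptions for the example; MD5 is used here only to spread files evenly across directories, not for security.

```python
import hashlib
from pathlib import PurePosixPath


def branched_path(base_dir, filename):
    """Place a file under two directory levels taken from its name's MD5 hex digest."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # First two hex characters pick the first level, the next two the second.
    return PurePosixPath(base_dir) / digest[:2] / digest[2:4] / filename
```

With two levels of two hex characters each there are 256 * 256 = 65,536 leaf directories, so even several million files end up with only a modest number of entries per directory.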
There's a LOT of "it depends" behind this popular subject. Since you say the documents are sensitive and confidential, off the cuff I'd go with storing in the database. Here are a few reasons:
Potentially better security. It is often easier to hack a file system than a database.
Better volume control. Thousands of files in one folder can strain an OS, where a database can take millions of rows in one table without blinking.
Better searching and scanning. Add categorizing columns when you load the data, or try out full text indexing to scan the actual documents.
Backups may be more efficient -- just add another database to your backup plan, and you're covered (once you work out space details, of course). And those backup files are another layer of obfuscation on anyone trying to get at your sensitive documents.
SQL Server 2008 has data compression options that may help here. That, or have the application do it? (More security through obfuscation, perhaps)
SQL Server 2008 also has the filestream data type, which may help here, but I'm not familiar enough with it to give a recommendation for your situation.