I wish to have a large number (e.g. a million) of log files on a system, but the OS has a limit on the number of open files, and it is not efficient to create a million files in a single folder.
Is there a ready-made solution, framework, or database that will create log files and append data to them efficiently?
I can imagine various techniques to optimize the management of a large number of log files, but there might be something that does this out of the box.
For example, I would like each log file to be re-created every day or when it reaches 50 MB. Old log files must be kept, e.g. uploaded to Amazon S3.
I can imagine a log database that writes all logs to a single file and later processes it, appending the records to millions of files.
Maybe there is a special file system that is well suited to such a task. I can't find anything, but I am sure a solution might exist.
PS: I want to run logging on a single server. I say 1 million because it is more than the default limit on open files. 1 million files of 1 MB each is 1 TB, which could be stored on a regular hard drive.
I am looking for an existing solution before I write my own. I am sure there might be a set of logging servers out there; I just do not know how to search for them.
I would start by considering Cassandra or Hadoop as a store for the log data; then, if you eventually want the data in the form of files, write a procedure that selects from one of these databases and writes the results into formatted files.
I just wanted to ask whether it is safe to design a FileTable that will in the future hold about 5-15 million files of 0.5-10 MB max each?
Will NTFS handle it?
I once had a problem on an old Windows Server 2008 R2 where, with more than 2.5 million files in a folder, creating a new file in that folder took about 30 seconds and getting the file list took about 5 minutes. Is that an NTFS problem?
Could that be a problem here? Or will FILESTREAM/FileTable create subfolders itself to handle that many files?
Or is disabling the 8.3 naming convention enough for it to work fine?
Thanks and regards
Will NTFS handle it?
Yes. Just do not open File Explorer. THAT - the program, not the operating system - cannot handle that many files. The command line, or server code that does not try to load all the files into a list, works fine.
In my experience, in short, yes, NTFS can handle it, but avoid browsing the FILESTREAM directories (Explorer can't handle this volume of files and will crash). Some white papers recommend using FILESTREAM when files are 256 KB or larger, but the performance benefit is really evident with files larger than 1 MB.
Here are some recommended best practices:
Disable the indexing service (disable indexing on the NTFS volumes where FILESTREAM data is stored).
Use multiple data files for the FILESTREAM filegroup, placed on separate volumes (see the sketch after this list).
Configure the correct NTFS cluster size (64 KB recommended).
Configure the antivirus so it does not delete FILESTREAM files (if it does, your database will be corrupted).
Disable the Last Access Time attribute.
Defragment the disk regularly.
Disable short file names (8dot3).
Keep the FILESTREAM data containers on a disk volume separate from the mdf, ndf, and log files.
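For the point about multiple data files in the FILESTREAM filegroup, here is a minimal T-SQL sketch; the database name and the E:\ / F:\ container paths are hypothetical, and FILESTREAM is assumed to be already enabled on the instance:

    -- Hypothetical database and volume paths; FILESTREAM must already be enabled.
    ALTER DATABASE FileUploadDB
        ADD FILEGROUP fs_data CONTAINS FILESTREAM;

    -- Spread the FILESTREAM containers across separate volumes.
    ALTER DATABASE FileUploadDB
        ADD FILE (NAME = fs_data_1, FILENAME = 'E:\FSData1') TO FILEGROUP fs_data;
    ALTER DATABASE FileUploadDB
        ADD FILE (NAME = fs_data_2, FILENAME = 'F:\FSData2') TO FILEGROUP fs_data;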
Right now, we're running tests to migrate our FileUpload database (8 TB and growing, with 25 million records) from varbinary(max) to FileTable. Our approach is to split the very large database into one database per year.
I would like to know whether you are currently working with this in a production environment, and to hear about your experience.
You can find more info in a free ebook: Art of FileStream
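For reference, a rough T-SQL sketch of the FileTable side of such a migration; the database, directory, and table names here are hypothetical, and non-transacted FILESTREAM access must be enabled on the database first:

    -- Hypothetical names; enable non-transacted access so the FileTable share is usable.
    ALTER DATABASE FileUploadDB
        SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'FileUploadDB');

    -- One FileTable per year, matching the database-per-year split.
    CREATE TABLE dbo.FileUpload2024 AS FILETABLE
    WITH (
        FILETABLE_DIRECTORY = 'FileUpload2024',
        FILETABLE_COLLATE_FILENAME = database_default
    );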
I know the function sys.fn_get_audit_file can read multiple .sqlaudit files at once, but I could not find any information about the maximum number of files this function can read. Any idea about the limit on the number of files?
I also need to monitor certain databases that contain sensitive information. I have three SQL Server instances with three databases on each. Is it okay to place all of the audit log files (3 * 3 = 9) into one common network folder and use sys.fn_get_audit_file to read the logs at once? The logs are going to be really big. I intend to take this approach and write a C# console app to dump the logs into one SQL table. Any suggestions?
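As a sketch of the read side (the UNC path and target table are hypothetical), sys.fn_get_audit_file accepts a wildcard pattern, so all nine audit files in the shared folder can be read in a single pass:

    -- Hypothetical network share and target table.
    SELECT event_time, server_instance_name, database_name,
           server_principal_name, statement
    INTO   dbo.AuditLogDump
    FROM   sys.fn_get_audit_file(N'\\fileserver\AuditLogs\*.sqlaudit', DEFAULT, DEFAULT);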
Hypothetical scenario: I have a 10-node Greenplum cluster with 100 TB of data in 1,000 tables that needs to be offloaded to S3 for reasons. Ideally, the final result is one .csv file corresponding to each table in the database.
I have three possible approaches, each with positives and negatives.
COPY - There is a question that already answers the how, but the problem with psql COPY in a distributed architecture is that everything has to go through the master, creating a bottleneck for moving 100 TB of data.
gpcrondump - This would create 10 files per table in tab-delimited format, which would require some post-gpcrondump ETL to unite the files into a single .csv, but it takes full advantage of the distributed architecture and automatically logs successful/failed transfers.
EWT (external writable tables) - Takes advantage of the distributed architecture and writes each table to a single file without holding it in local memory until the full file is built, but it will probably be the most complicated script to write because the ETL has to be implemented as part of the dump; it can't be done separately afterwards.
All of the options are going to have different issues with table locks as we move through the database, and with figuring out which tables failed so we can re-run them for a complete data transfer.
Which approach would you use and why?
I suggest you use the S3 protocol.
http://www.pivotalguru.com/?p=1503
http://gpdb.docs.pivotal.io/43160/admin_guide/load/topics/g-s3-protocol.html
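For example, with the s3 protocol you can create a writable external table per source table and fill it with a plain INSERT ... SELECT, which writes out through the segments rather than the master. This is only a sketch: the bucket, endpoint, schema/table names, and config file path are all hypothetical, and it assumes a Greenplum version where the s3 protocol supports writable external tables and has been configured on the cluster.

    -- Hypothetical bucket, endpoint, and table names; s3.conf holds the S3 credentials.
    CREATE WRITABLE EXTERNAL TABLE ext_s3_my_table (LIKE my_schema.my_table)
    LOCATION ('s3://s3-us-east-1.amazonaws.com/my-bucket/offload/my_table/ config=/home/gpadmin/s3.conf')
    FORMAT 'CSV';

    INSERT INTO ext_s3_my_table SELECT * FROM my_schema.my_table;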
What is the standard way to store uploaded files on a server: in the database as binary, or on the hard disk with the path stored in the database?
Microsoft did research on this topic: https://www.microsoft.com/en-us/research/publication/to-blob-or-not-to-blob-large-object-storage-in-a-database-or-a-filesystem/
Storing very small files gives the best performance in the database; storing larger files gives the best performance on the hard drive. I researched this for the company I work for: file system performance beats the database once files reach about 512 kB, and database performance drops rapidly beyond that point.
Storing files in the database has the advantage that you can keep everything in sync; for example, you can configure a file BLOB to be removed when its file record is removed. However, storing large files this way gives very poor performance, and creating backups can take a very long time.
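As a sketch of the two options in a single (hypothetical) table definition, here in SQL Server syntax: either the bytes live in the Content column, or only a path lives in DiskPath and the file itself stays on disk.

    -- Hypothetical table; use either Content (file in the database) or DiskPath (file on disk).
    CREATE TABLE dbo.UploadedFile (
        FileId    INT IDENTITY(1,1) PRIMARY KEY,
        FileName  NVARCHAR(260)  NOT NULL,
        Content   VARBINARY(MAX) NULL,   -- option 1: store the bytes in the database
        DiskPath  NVARCHAR(400)  NULL    -- option 2: store only the path, file stays on disk
    );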
A relative newcomer to SQL and Postgres here, so this is a fairly open question about backing up daily data from a stream. Specific commands/scripts would be appreciated if it's simple; otherwise I'm happy to be directed to more specific articles/tutorials on how to implement what needs to be done.
Situation
I'm logging various data streams from some external servers, amounting to a few GB per day. I want to be able to store this data on larger hard drives, which will later be used to pull information from for analysis.
Hardware
1x SSD (128 GB): OS + application
2x HDD (4 TB each): storage, second drive for redundancy
What needs to be done
The current plan is to have the SSD hold a temporary database containing the daily logged data. When server load is low (early morning), dump the entire temporary database onto two separate backups, one on each of the two storage disks. The motivation for the temporary database is to reduce the load on the hard drives. Furthermore, the daily data is small enough that it can be copied over to the storage drives before server load picks up.
Questions
Is this an acceptable method?
Is it better/safer to just push data directly to one of the storage drives, treat that as the primary database, and automate a scheduled backup from that drive to the second storage drive?
What specific commands would be required to do this while ensuring data integrity (i.e. new data will still be logged while a backup is in progress)?
At a later date, when the budget allows, the hardware will be upgraded, but the above is what's in place for now.
thanks!
The first rule when building a backup system: do the simplest thing that works for you.
Running pg_dump will ensure data integrity. You will want to note what the last item backed up is, so you don't delete anything newer than that. After deleting the data you may well want to run CLUSTER or VACUUM FULL on various tables, if you can afford the logging.
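A minimal sketch of that delete-after-backup step, assuming a hypothetical stream_log table with a logged_at timestamp column; the cutoff should be the timestamp of the last row known to be in the pg_dump output:

    -- Hypothetical table/column; delete only rows already captured by pg_dump.
    BEGIN;
    DELETE FROM stream_log
    WHERE  logged_at <= '2024-01-01 03:00:00';
    COMMIT;

    -- Reclaim the space afterwards (takes an exclusive lock on the table).
    VACUUM FULL stream_log;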
Another option would be to have an empty template database and do something like the following (a SQL sketch follows the list):
Halt application + disconnect
Rename database from "current_db" to "old_db"
CREATE DATABASE current_db TEMPLATE my_template_db
Copy over any other bits you need (sequence numbers etc)
Reconnect the application
Dump old_db + copy backups to other disks.
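In SQL, steps 2 and 3 look roughly like this (database names taken from the list above; the rename requires that no sessions are connected to current_db):

    -- No active connections to current_db may exist for the rename to succeed.
    ALTER DATABASE current_db RENAME TO old_db;
    CREATE DATABASE current_db TEMPLATE my_template_db;

    -- Step 4 (copying over sequence positions etc.) is a manual step, e.g. read the
    -- last value from old_db and apply it in current_db with setval() on each sequence.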
If what you actually want is two separate live databases, a small quick one and a larger one for long-running queries, then investigate tablespaces. Create two tablespaces: the default on the big disks and a "small" one on your SSD. Put your small DB on the SSD. Then you can just copy from one table to another with foreign data wrappers (FDW), dump/restore, etc.
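A minimal sketch of that layout (the mount-point paths and database name are hypothetical; the directories must already exist and be owned by the postgres OS user):

    -- Hypothetical paths: big disks for bulk storage, SSD for the small fast database.
    CREATE TABLESPACE bulk_store LOCATION '/mnt/hdd1/pgdata';
    CREATE TABLESPACE fast_ssd   LOCATION '/mnt/ssd/pgdata';

    -- Small, quick database lives on the SSD.
    CREATE DATABASE daily_ingest TABLESPACE fast_ssd;

    -- Long-term tables can be placed (or moved) onto the big disks:
    -- ALTER TABLE archive.stream_log SET TABLESPACE bulk_store;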