Tokyo Cabinet: .tcb.wal file created along with .tcb file; DB size does not decrease when deleting records - tokyo-cabinet

I am using Tokyo Cabinet's B+ tree API to create a lookup database. On Linux, I see a .tcb.wal file created alongside the actual .tcb database file. Its size is 0. Is it a lock file created to help with synchronization? Also, when I delete records from the database, the size of the database file does not decrease. Why does it behave like that?

The extension .wal stands for Write Ahead Logging file. This file is only relevant if you use any transaction functions; most applications do not use these. (For details, search for "ahead" in the documentation.)
For efficiency reasons, the file is not shrunk on every deletion. Similarly, if you create an empty database, it will reserve space up front for faster insertions.

Related

Is using fs::rename to move files between different file systems atomic?

On Linux, if we move (rename) a file from one file system to another like this:
fs::rename("/src/a", "/dest/a")?;
is it possible for the file /dest/a to become visible to other potential readers (for example, processes that scan /dest/) before all of its contents have been copied to the destination file system?
From the docs:
This will not work if the new name is on a different mount point.
So you shouldn't be using fs::rename to move between mount points. It's a simple filesystem rename and is not actually moving any data (which is why you can fs::rename a 2-terabyte file fairly quickly), so it only works if the source and destination are on the same filesystem.
If the source and destination are on the same mount point, then the answer is no. It's not possible for someone to read it before it's fully available, since, again, no data is actually transferred: it's just a single operation of "this pointer now points here and this one doesn't exist anymore".
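When the source and destination are on different filesystems, a common way to get the same all-or-nothing visibility is to copy the data to a temporary name on the destination filesystem and then rename it into place. The following is a minimal sketch in Python rather than the Rust of the question. The function name `move_across_filesystems` and the `.incoming-` prefix are invented for illustration, and it assumes readers of the destination directory ignore dot-prefixed names.

```python
import os
import shutil
import tempfile


def move_across_filesystems(src: str, dest: str) -> None:
    """Copy src next to dest under a temporary name, then rename it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest))
    # The temporary file is created in the destination directory so that the
    # final rename stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src, "rb") as source:
            shutil.copyfileobj(source, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before publishing it
        # Rename within one filesystem: readers see either no dest or all of it.
        os.replace(tmp_path, dest)
    except BaseException:
        # Remove the partial copy if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Only drop the source once the destination is fully in place.
    os.unlink(src)
```

If readers cannot be taught to skip the temporary name, a separate staging directory on the same filesystem as the destination serves the same purpose.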

How to make atomic operation with both file system and database in Postgres?

I think the following should be a pretty common pattern:
A database is used to store file paths
The files themselves are stored in the file system
Issues may occur when, say, we want to modify a file path: we need to both modify the file path in the database and move the file in the filesystem. It is important that this is done "atomically". Indeed, while we are doing the modification, another process may read the file path from the database and then try to access the file in the file system. We should make sure that the tuple
("file path", "actual file location")
remains consistent at all times.
Is there a canonical/simple way to achieve this with Postgres/Linux?
One of the major features of a database is that each process sees it in a consistent state. That also means that different clients may see different states of the database at the same time.
So when you correct a file path in the database and commit the change, any transaction that started before the commit can still see the old path for some time after the commit.
To make sure nobody tries to read the old file path, you therefore have to wait until all transactions that started before the commit have ended. That can take milliseconds or, in extreme situations, days. If you have a
I'd try to implement the following scheme (pseudocode):
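import os
import random

# sql() stands for "run this statement on one open connection" (pseudocode helper).
sql("begin")
# Hard link: old and new path both work from here on (same filesystem only).
os.link(old_path, new_path)
sql("update files set path=? where path=?", new_path, old_path)
sql("insert into files_to_clean values (?, txid_current())", old_path)
sql("commit")

# Occasionally remove old paths that no running transaction can still see.
if random.random() < CLEANUP_PROBABILITY:
    sql("begin")
    for delete_path in sql("""
        delete from files_to_clean
        where path in (
            select path from files_to_clean
            where txid < txid_snapshot_xmin(txid_current_snapshot())
            for update skip locked
        )
        returning path
    """):
        os.remove(delete_path)
    sql("commit")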

Best way to store 4.7 million binary files

I have parsed the whole English Wikipedia and saved each parsed article in a separate protocol buffer file. Each file has a unique id (wikiid). I now have 4.7 million parsed articles with a total size of 180 GB. I know ext4 can handle this number of files, but is it good practice? Or should I use a database? I will not need to update it frequently.
Keep it as files; a database is relatively more expensive to scale and maintain.
You may want to be careful about how you name and store them, though. Instead of one directory holding all 4.7M files, use a directory structure that goes, say, 4 levels deep, and preprocess the 4.7M files to store them in it. Say the id of a file is D1D2D3D4fewmorechars.txt; then store this file in /D1/D2/D3/D4/D1D2D3D4fewmorechars.txt.
The other option is to use a file system such as XFS or ext3/4 that uses directory indexing techniques such as hashed directories.
Check this link - https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory
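The directory layout described in the answer can be sketched as a small path helper. This is only an illustration: the function name `sharded_path` and the example root are assumptions, and it takes the first characters of the id as directory names exactly as the answer describes. For sequential numeric ids, a hash of the id may spread files more evenly.

```python
from pathlib import Path


def sharded_path(root: str, file_id: str, levels: int = 4) -> Path:
    """Return root/D1/D2/D3/D4/<file_id>, using the first `levels` characters."""
    if len(file_id) < levels:
        raise ValueError(f"id {file_id!r} is shorter than {levels} characters")
    # Each of the leading characters becomes one directory level.
    return Path(root).joinpath(*file_id[:levels], file_id)


def store(root: str, file_id: str, payload: bytes) -> Path:
    """Write payload to its sharded location, creating directories as needed."""
    target = sharded_path(root, file_id)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target
```

With four levels of single-character directories, each leaf directory holds only a small fraction of the 4.7M files, provided the leading characters of the ids are reasonably well distributed.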

Managing log file size

I have a program which logs its activity.
I want to implement a log file mechanism to keep the log file under a certain size, let's say 10 MB.
The log file itself just holds commands the program executed; those commands are variable length.
Right now, the program runs in a Windows environment, but I'm likely to port it to UNIX soon.
I've come up with two methods for managing the log files:
1. Keep multiple smaller files, and if the new command would exceed the current file's size limit, truncate the oldest file to zero size and start writing there.
2. Keep a header in the file, which holds metadata about the first command in the file and the next place to write to in the file. I think each command would also have to hold metadata about its length this way.
My questions are as follows:
In terms of efficiency, which of these methods would you use, and why?
Is there a UNIX command / function to do this easily?
Thanks a lot for your help,
Nihil.
On UNIX/Linux platforms there's a logrotate program that manages logfiles. Details can be found for example here:
http://linuxcommand.org/man_pages/logrotate8.html
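If rotation has to happen inside the program itself, which also works on Windows, method 1 from the question is close to what Python's standard `logging.handlers.RotatingFileHandler` does. This is a sketch in Python, not the asker's language. The helper name `make_command_logger`, the logger name, and the backup count of 4 are assumptions for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler

MAX_BYTES = 10 * 1024 * 1024  # the 10 MB limit from the question, per file
BACKUP_COUNT = 4              # assumption: keep four older files


def make_command_logger(path: str, max_bytes: int = MAX_BYTES,
                        backup_count: int = BACKUP_COUNT,
                        name: str = "commands") -> logging.Logger:
    """Return a logger that rolls over to path.1, path.2, ... by size."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep command records out of the root logger
    if not logger.handlers:
        # The handler starts a new file before a record would push the
        # current one to max_bytes, and deletes the oldest backup.
        handler = RotatingFileHandler(path, maxBytes=max_bytes,
                                      backupCount=backup_count)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Note that the limit applies per file, so the total on disk can reach roughly `max_bytes * (backup_count + 1)`; to cap the total at 10 MB, divide the limit accordingly.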

Following multiple log files efficiently

I'm intending to create a programme that can permanently follow a large dynamic set of log files to copy their entries over to a database for easier near-realtime statistics. The log files are written by diverse daemons and applications, but the format of them is known so they can be parsed. Some of the daemons write logs into one file per day, like Apache's cronolog that creates files like access.20100928. Those files appear with each new day and may disappear when they're gzipped away the next day.
The target platform is an Ubuntu Server, 64 bit.
What would be the best approach to efficiently reading those log files?
I could think of scripting languages like PHP that either open the files themselves and read new data or use system tools like tail -f to follow the logs, or other runtimes like Mono. Bash shell scripts probably aren't so well suited for parsing the log lines and inserting them into a database server (MySQL), not to mention easy configuration of my app.
If my programme reads the log files itself, I'd think it should stat() each file about once a second to get its size and open the file when it has grown. After reading the file (which should hopefully only return complete lines) it could call tell() to get the current position and next time directly seek() to the saved position to continue reading. (These are C function names, but actually I wouldn't want to do that in C. And Mono/.NET or PHP offer similar functions as well.)
Is that constant stat()ing of the files and subsequent opening and closing a problem? How would tail -f do that? Can I keep the files open and be notified about new data with something like select()? Or does it always return at the end of the file?
If I'm blocked in some kind of select() or external tail, I'd need to interrupt that every minute or two to scan for new files that should be followed and deleted files that should no longer be followed. Resuming with tail -f then is probably not very reliable. That should work better with my own saved file positions.
Could I use some kind of inotify (file system notification) for that?
If you want to know how tail -f works, why not look at the source? In a nutshell, you don't need to periodically interrupt or constantly stat() to scan for changes to files or directories. That's what inotify does.
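For reference, the saved-position bookkeeping described in the question (stat, seek to the stored offset, read, remember the new offset) can be sketched as below. This is an illustrative Python sketch, not how tail is implemented. The class name `LogFollower` is invented, and the caller decides when to call `read_new_lines`: from a timer when polling, or from an inotify event through a third-party binding, since the Python standard library has none.

```python
import os
from typing import Dict, List


class LogFollower:
    """Tracks a read offset per file and returns only complete new lines."""

    def __init__(self) -> None:
        self.offsets: Dict[str, int] = {}

    def read_new_lines(self, path: str) -> List[str]:
        offset = self.offsets.get(path, 0)
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            # The file was rotated or gzipped away; forget its position.
            self.offsets.pop(path, None)
            return []
        if size < offset:
            offset = 0  # file was truncated or replaced; start over
        if size == offset:
            return []
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read()
        # Consume only up to the last newline, so a partially written line
        # is picked up on a later call once it is complete.
        end = data.rfind(b"\n") + 1
        self.offsets[path] = offset + end
        return data[:end].decode("utf-8", errors="replace").split("\n")[:-1]
```

Storing offsets per path also covers the case in the question where new daily files appear and old ones disappear: unknown paths start at offset 0 and missing paths are dropped.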
