Does the Google File System provide read consistency? I am confused because I know that the primary maintains write consistency in GFS. If a system provides write consistency, doesn't it follow that it also provides read consistency?
Thanks
Manjit
A read operation won't mutate the data, so there is no consistency concern on the read path itself.
In regards to DBMS integrity, how is an operating system's buffered I/O a threat? I have read multiple articles on why DBMSs use their own local cache rather than the OS's buffered I/O (a good number of them right here on Stack Overflow), but I haven't seen any indication that buffered I/O might pose an integrity threat to a DBMS.
I believe I have found the answer I needed. It relates to transfer errors within the database affecting integrity: "...when a piece of data is present in the destination table, but not in the source table of a relational database," as per the Talend.com article "What is Data Integrity and Why Is It Important".
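For context on why DBMSs avoid relying on the OS page cache, here is a minimal Python sketch (my own illustration, not taken from any particular DBMS) showing that a plain buffered write is not durable until it is flushed and fsynced; a crash before that point can lose the update even though the application believes it succeeded, which is exactly the integrity concern:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data and force it to stable storage before returning.

    A plain f.write() only copies data into the OS page cache; if the
    machine crashes before the kernel writes it back, the update is lost
    even though the application thinks it succeeded. DBMSs therefore
    control flushing themselves (fsync, O_DIRECT, write-ahead logs).
    """
    with open(path, "wb") as f:
        f.write(data)          # lands in user-space and OS buffers only
        f.flush()              # push Python's user-space buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to persist it to the device

durable_write("/tmp/example.dat", b"committed record")
```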
I have a basic question with regard to filesystem usage.
I want to use an embedded, persistent key-value store that is very write-oriented. Say my value size is
a) 10 KB
b) 1 MB
and reads and updates are equal in number.
Can't I simply create files containing the values, with their file names acting as keys?
Won't that be as fast as using a key-value store such as LevelDB or RocksDB?
Can anybody please help me understand?
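To make the idea concrete, here is a minimal sketch of the file-per-key approach the question describes (the class and path are my own illustration; keys are assumed to be safe file names with no path separators):

```python
import os

class FileKVStore:
    """Naive key-value store: one file per key, file contents are the value."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key: str, value: bytes) -> None:
        # The file name is the key, the file body is the value.
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(value)

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

store = FileKVStore("/tmp/filekv")
store.put("user-42", b"x" * 10_240)   # roughly a 10 KB value
print(len(store.get("user-42")))
```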
In principle, yes, a filesystem can be used as a key-value store. The differences only come in when you look at individual use cases and limitations in the implementations.
Without going into too much detail here, there are some things likely to be very different:
A filesystem splits data into fixed size blocks. Two files can't typically occupy parts of the same block. Common block sizes are 4-16 KiB; you can calculate how much overhead your 10 KiB example would cause. Key/value stores tend to account for smaller-sized pieces of data.
Directory indexes in filesystems are often not capable of efficiently iterating over the filenames/keys in sort order. You can efficiently look up a specific key, but you can't retrieve ranges without reading pretty much all of the directory entries. Some key/value stores, including LevelDB, support efficient ordered iteration (see the sketch after this list).
Some key/value stores, including LevelDB, are transactional. This means you can bundle several updates together, and LevelDB will make sure that either all of these updates make it through, or none of them do. This is very important to prevent your data from becoming inconsistent. Filesystems make this much harder to implement, especially when multiple files are involved.
Key/value stores usually try to keep data contiguous on disk (so data can be retrieved with less seeking), whereas modern filesystems deliberately do not do this across files. This can impact performance rather severely when reading many records. It's not an issue on solid-state disks, though.
While some filesystems do offer compression features, they are usually either per-file or per-block. As far as I can see, LevelDB compresses entire chunks of records, potentially yielding better compression (though its compression strategy is biased towards speed over compression ratio).
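To illustrate the ordered-iteration and transaction points above, here is a small sketch using the plyvel Python binding for LevelDB (assuming plyvel is installed; the path and keys are made up for illustration):

```python
import plyvel

db = plyvel.DB("/tmp/leveldb-demo", create_if_missing=True)

# Atomic batch: either both puts become visible, or neither does.
with db.write_batch() as batch:
    batch.put(b"user:0001", b"alice")
    batch.put(b"user:0002", b"bob")

# Efficient ordered range scan over keys: something a plain directory of
# files cannot do without listing and sorting every entry.
for key, value in db.iterator(start=b"user:0000", stop=b"user:9999"):
    print(key, value)

db.close()
```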
Let's try to build a minimal NoSQL DB server using Linux and a modern filesystem in 2022, just for fun, not for a serious environment.
DO NOT TRY THIS IN PRODUCTION
—————————————————————————————————————————————
POSIX file API for read/write.
POSIX ACL for native user accounts and group permission management.
POSIX filename as key: (root db folder)/(table folder)/(partition folder)/(64-bit key). Per DB and table we can define read/write permissions using POSIX ACLs. The 64-bit key is generated in the compute function.
Mount Btrfs/OpenZFS/F2FS as the filesystem to provide compression (LZ4/zstd) and encryption (fscrypt) natively. F2FS is more suitable as it uses a log-structured design, which many NoSQL DBs use (as LSM trees) in their low-level architecture.
Metadata is handled by the filesystem, so there is no need to implement it.
Use Linux and/or the filesystem to configure the page, file, or disk block cache according to the read/write patterns of the business logic written in the compute function or DB procedure.
Use RAID and sshfs for remote replication to provide master/slave high availability and/or backups.
The compute function or DB procedure holding the write logic could be a Node.js file, a Go binary, or whatever, together with a standard HTTP/TCP/WebSocket server module that reads and writes contents to the DB.
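As a rough illustration of the filename-as-key layout described above (the key derivation, paths, and function names are my own assumptions, not a fixed design), the write/read path of such a compute function might look like this in Python:

```python
import hashlib
import os

DB_ROOT = "/srv/minidb"          # (root db folder); assumed mounted on F2FS/Btrfs/ZFS

def derive_key(raw_key: str) -> str:
    """Compute a 64-bit key from the user-supplied key (one possible choice)."""
    digest = hashlib.blake2b(raw_key.encode(), digest_size=8)  # 8 bytes = 64 bits
    return digest.hexdigest()

def record_path(table: str, raw_key: str) -> str:
    key64 = derive_key(raw_key)
    partition = key64[:2]        # simple partitioning by key prefix
    return os.path.join(DB_ROOT, table, partition, key64)

def put(table: str, raw_key: str, value: bytes) -> None:
    path = record_path(table, raw_key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:  # filesystem handles metadata, ACLs, compression
        f.write(value)
        f.flush()
        os.fsync(f.fileno())     # make the write durable before acknowledging
        
def get(table: str, raw_key: str) -> bytes:
    with open(record_path(table, raw_key), "rb") as f:
        return f.read()
```

The HTTP/TCP/WebSocket server module would simply call put/get, while permissions, compression, and encryption are delegated to the filesystem as described above.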
I would like my program (written in Python) to monitor a given file system's hierarchy, record it into persistent data storage, and be able to update it when the file system changes. It might be read into volatile memory for quick access.
I've found some posts that suggested "the best persistent storage method to use with Python" here and here, as well as another post that answered "how to represent a filesystem in a relational database" here.
From the above links, it appears that SQLite is a good choice for persistence, as it is quick. However, I couldn't find much opinion on whether a database is a good way to store and represent a filesystem hierarchy.
My considerations for the method implemented are:
Performance/Speed
Scalability: I need to monitor, and keep up to date, a hierarchy of potentially hundreds of thousands of files
Ease of use: when I read the file system hierarchy into memory
Any other suggested considerations?
Is it a good idea to use an RDBMS to represent a filesystem hierarchy? What are the pros and cons of this method? Do you have other suggested methods, and what are the pros and cons of those?
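For reference, here is a minimal sketch of one common way to map a filesystem hierarchy onto SQLite, an adjacency list of nodes (the table and column names are my own choices, not a standard schema):

```python
import sqlite3

conn = sqlite3.connect("fs_index.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),   -- NULL for the root
    name      TEXT NOT NULL,
    is_dir    INTEGER NOT NULL,              -- 1 = directory, 0 = file
    size      INTEGER,
    mtime     REAL,
    UNIQUE (parent_id, name)                 -- one name per directory
);
CREATE INDEX IF NOT EXISTS idx_node_parent ON node(parent_id);
""")

# Insert a root directory and one child file.
cur = conn.execute(
    "INSERT INTO node (parent_id, name, is_dir) VALUES (NULL, ?, 1)", ("/",))
root_id = cur.lastrowid
conn.execute(
    "INSERT INTO node (parent_id, name, is_dir, size) VALUES (?, ?, 0, ?)",
    (root_id, "notes.txt", 1234))
conn.commit()

# List the children of the root.
for row in conn.execute(
        "SELECT name, is_dir FROM node WHERE parent_id = ?", (root_id,)):
    print(row)
```

Lookups by parent are a single indexed query, while reconstructing a full path requires walking parent_id links (or storing a materialized path column as a trade-off).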
Basically, I was wondering what advantage write-anywhere file systems provide over the other kinds of filesystems out there, and how the write-anywhere model manages to do this (in a broad sense)?
Thanks
There are three popular file systems out there that follow, in a very broad sense, the write-anywhere approach: the original WAFL used by NetApp (old technical report), ZFS, and Btrfs.
The key properties of these file systems are
that there are no pre-assigned parts of the underlying block storage for data and metadata (hence the "write anywhere"), and
that data is never overwritten in place, but redirected to a different location on the block storage. The latter property is shared with flash translation layers and special flash file systems, but usually they don't have property 1.
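As a toy illustration of the never-overwrite-in-place idea (entirely my own sketch, not how WAFL, ZFS, or Btrfs are actually implemented), an update allocates a fresh block and repoints the metadata instead of rewriting the old block:

```python
class RedirectOnWriteStore:
    """Toy redirect-on-write block store: updates never overwrite old blocks."""

    def __init__(self) -> None:
        self.blocks: list[bytes] = []    # append-only pool of blocks
        self.index: dict[str, int] = {}  # file name -> current block number

    def write(self, name: str, data: bytes) -> None:
        self.blocks.append(data)                  # always goes to a new location
        self.index[name] = len(self.blocks) - 1   # repoint the metadata atomically

    def read(self, name: str) -> bytes:
        return self.blocks[self.index[name]]

store = RedirectOnWriteStore()
store.write("a.txt", b"v1")
store.write("a.txt", b"v2")    # the old block still exists untouched
print(store.read("a.txt"))     # b"v2"
```

Because old blocks survive, a snapshot is essentially just a remembered old version of the index, which is why features like snapshots and fast crash recovery fall out of this design cheaply.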
They have a few nice advantages (as a short summary):
It is easier and more straightforward to implement advanced file system features like snapshots, CDP, and data deduplication.
Consistency is easier. Recovery after a crash is faster. In theory, a file system check should never be necessary.
RAID writes can be optimized. Multiple unrelated writes can be combined into a single RAID stripe write, so that the number of I/Os needed for the writes is reduced.
This question is from a decomposition of What are good programming practices to prevent malware in standalone applications?
The question has to do with malware dynamically getting into a program by infecting data files which the program reads/writes.
Is it safer to require that data be stored in a database and accessed only through service calls, with no direct file operations, when a program accesses its data? Let's say your program loads many images, numeric data tables, or text information as it runs. Assume this is after the program is loaded and initialized to where it can make service calls.
Is it easier to infect a file or a database?
It is easier to infect a user-space API than a kernel-space API.
In other words, the point is moot if you can't trust the services you're using to read the data.
I would say it is a function of how you define security (read prevention, write prevention, etc.), who potentially has access, and how large the risk is.
An entity you control may generally be 'safer' than handing off control to an external entity - but not necessarily.
Nothing is easy to specify with regard to security, as it is always a risk vs. cost trade-off.