Does the Google File System allow listing directory contents? - database

In the GFS paper, Section 4.1 describes how GFS is able to make concurrent mutations within a directory while only requiring a read lock on the directory for each client - there's no actual inode in GFS, so clients are free to create, remove, or mutate /x/y/somefile while only requiring a read lock on /x/ and /x/y/.
If there are no inodes, then is it still possible to maintain an explicit tree structure? The only way I can see this working is if the master maintains a flattened, 1-dimensional mapping from directory or file names to their metadata, allowing for fast file creation and manipulation.
Suppose that some client of GFS wanted to scan the names of all files in a directory - for instance, ls. Without an iteration over all metadata nodes, how is this possible?
It might be possible for a client to maintain their own version of what they think the directory tree looks like in the GFS, but this will only work if each client keeps to their own directory.

A master lookup table offers access to a single conceptual tree of nodes. It does this by listing all paths of names to nodes. Some nodes are directories. Only non-directory leaf nodes own data. Eg these paths:
/a/b/foo
/a/b/c/bar
/a/baz/
describe this tree:
\
a/--b/--foo
|    \
|     c/--bar
baz/
Every path identifies a node. The children of a node are the nodes whose paths are one name longer in the lookup table. To list a node's children is to list all the paths in the lookup table that are one name longer than its path. What the paper means by metadata is info like whether and how a node is locked and, for a non-directory leaf node, where its (unshared) data is.
One doesn't navigate by visiting directory nodes whose data lists child and parent node names and whether they are directories, as in Unix/Linux. Copying a leaf means copying its data into another leaf's data, like Unix/Linux cat, not cp. I presume one can copy a subtree, which would add new paths to the lookup table and copy data for non-directory leaves.
One cannot use technical terms like "file" or "directory" as if they mean the same thing in both systems. What one can do is consider GFS and Unix/Linux to both manage the same kind of tree of paths of names through directory nodes to directory leaves and non-directory data-owning leaves. But after that, the other parts of the file system state (metadata and data) and their operators differ. In your mind, put "GFS-" and "Unix/Linux-" in front of every technical term other than those referring to trees of named nodes.
EDIT: Examples.
1.
Suppose that some client of GFS wanted to scan the names of all files
in a directory - for instance, ls. Without an iteration over all
metadata nodes, how is this possible?
A directory listing would return the paths in the lookup table that extend the given directory's path. GFS will offer file server commands to do such things or that support doing such things, hiding its implementation. It would be enough (but slow) to be able to iterate through the lookup table. Eg ls /a/b:
/a/b/foo
/a/b/c/bar
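As a toy illustration (my own sketch in Python, not anything from GFS or the paper): the flat lookup table can be thought of as a map from full path names to metadata; the children of a directory are the entries one name longer, and a listing like the ls example above is a prefix scan.

lookup = {
    "/a/":        {"dir": True},
    "/a/b/":      {"dir": True},
    "/a/b/foo":   {"dir": False, "data": "chunk locations ..."},
    "/a/b/c/":    {"dir": True},
    "/a/b/c/bar": {"dir": False, "data": "chunk locations ..."},
    "/a/baz/":    {"dir": True},
}

def list_extending(dir_path):
    # every path that extends dir_path, as in the ls example above
    return [p for p in lookup if p.startswith(dir_path) and p != dir_path]

def list_children(dir_path):
    # only the paths exactly one name longer than dir_path
    return [p for p in list_extending(dir_path)
            if "/" not in p[len(dir_path):].rstrip("/")]

print(list_extending("/a/b/"))   # ['/a/b/foo', '/a/b/c/', '/a/b/c/bar']
print(list_children("/a/b/"))    # ['/a/b/foo', '/a/b/c/']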
2.
To copy source node children to be target node children: for each path that extends the source's path, add to the lookup table the path obtained by replacing that prefix with the target path. Presumably the copy command creating the new nodes copies associated data for non-directories (a toy sketch follows the tree below). Eg copy children of /a/ to /a/b/c/ adds:
/a/b/c/b/foo
/a/b/c/b/c/bar
/a/b/c/baz/
giving:
\
a/--b/--foo
|    \
|     c/--bar
|     |--b/--foo
|     |   \
|     |    c/--bar
|     baz/
baz/
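Continuing the toy sketch above (again my own illustration, not GFS code): copying the children of one directory to another is a prefix rewrite over the same flat table, with data copied separately for non-directory leaves.

def copy_children(lookup, src, dst):
    # for each path extending src, add the same path with the src prefix
    # replaced by dst; a real system would also copy the data of leaf nodes
    for path in list(lookup):
        if path.startswith(src) and path != src:
            lookup[dst + path[len(src):]] = dict(lookup[path])

copy_children(lookup, "/a/", "/a/b/c/")
# adds /a/b/c/b/foo, /a/b/c/b/c/bar and /a/b/c/baz/ (plus the intermediate
# directory paths /a/b/c/b/ and /a/b/c/b/c/), matching the tree above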

Related

clone some relationships according to a condition

I exported two tables named Keys and Acc as CSV files from SQL Server and imported them successfully into Neo4j using the commands below.
CREATE INDEX ON :Keys(IdKey)
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///C:/Keys.txt' AS line
MERGE (k:Keys { IdKey: line[0] })
SET k.KeyNam=line[1], k.KeyLib=line[2], k.KeyTyp=line[3], k.KeySubTyp=line[4]
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///C:/Acc.txt' AS line
MERGE (callerObject:Keys { IdKey : line[0] })
MERGE (calledObject:Keys { IdKey : line[1] })
MERGE (callerObject)-[rc:CALLS]->(calledObject)
SET rc.AccKnd=line[2], rc.Prop=line[3]
Keys stands for the source code objects, Acc stands for the relations among them. I imported these two tables three times for three different application projects. To keep the IdKey property unique across the three applications, I concatenated a five-character prefix to IdKey to identify the object's application while exporting from SQL Server, because, as I learned from the manuals, we cannot create an index based on multiple fields. Now my aim is to construct the relations among applications. For example:
Node1 is a source code object of Application1
Node2 is another source code object of Application1
Node3 is a source code object of Application2
There is already a CALLS relation from Node1 to Node2 because of the record already imported from Acc.
The name of Node2 is equal to the name of Node3, so we can say that Node2 and Node3 are in fact the same source code, and therefore we should create a relation from Node1 to Node3. To realize this, I wrote the command below, but I want to be sure that it is correct, because I do not know how long it will take to execute.
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys),(calledNew:Keys)
WHERE calledNew.KeyNam = called.KeyNam
and calledNew.IdKey <> called.IdKey
CREATE (caller)-[:CALLS]->(calledNew)
The following query should be efficient, assuming you also create an index on :Keys(KeyNam).
MATCH (caller:Keys)-[rel:CALLS]->(called:Keys)
WITH caller, COLLECT(called.KeyNam) AS names
MATCH (calledNew:Keys)
WHERE calledNew.KeyNam IN names AND NOT (caller)-[:CALLS]->(calledNew)
CREATE (caller)-[:CALLS]->(calledNew)
Cypher will not use an index when doing comparisons directly between property values. So this query puts all the called names for each caller into a names collection, and then does a comparison between calledNew.KeyNam and the items in that collection. This causes the index to be used, and will speed up the identification of potential duplicate called nodes.
This query also does a NOT (caller)-[:CALLS]->(calledNew) check, to avoid creating duplicate relationships between the same nodes.

How to determine newly added elements into my private branch

In a major development effort, I have added multiple files to source control in my private branch. There were also existing files that were modified and checked into my private branch. Now, as we approach merging the changes into our project branch, I would like to validate all the elements I have newly added to my private branch, to ascertain whether their locations are correct (e.g. they should perhaps have been placed in another location with a symlink added).
I listed all the elements in my private branch, but could not figure out which of these elements were newly added.
Is there a reliable way to do so?
You can do a query finding all elements in a given branch since a certain date for a certain user:
cleartool find . -type f -branch "brtype(abranch)" -element "{created_since(10-Jan)}" -user aloginname -print
(this would search only files, as mentioned in "how to find files in a given branch", and also in "how can I list a certain user's activity in a branch")
The other approach is to create a dedicated (simple base ClearCase) view to display those elements, as in "Get all versions from a specific time" or in "how to find out all the activities happend in a branch in the last month?".
But generally, the first query is enough.

MEL: Traverse through hierarchy

I'm writing a MEL script that will rename all of the joints in a joint hierarchy to a known format. The idea is that you would select the Hip joint and the script would rename the Hips, and go through every other joint and rename it based on its position in the hierarchy.
How can you traverse through a joint hierarchy in MEL?
If you assign to $stat_element the name of your top joint in the hierarchy and run the following code, it will add the prefix "myPrefix_" to all child elements of that joint.
string $stat_element = "joint1";
select -r $stat_element;
string $nodes[] = `ls -sl -dag`;
for ($node in $nodes) {
    rename -ignoreShape $node ("myPrefix_" + $node);
}
Hope this helps
If you need to make detailed decisions as you go along, instead of bulk-renaming, traversing the hierarchy is pretty simple. The command is listRelatives; with the -c flag it returns the children of a node and with the -p flag it returns the parent. (Note that -p returns a single object, while -c returns an array.)
Joint1
    Joint2
        Joint3
        Joint4
listRelatives -p Joint2
// Result: Joint1 //
listRelatives -c Joint2
// Result: Joint3, Joint4
The tricky bit is the renaming, since Maya will not always give you the name you expect (it won't allow duplicate names at the same level of the hierarchy). You'll need to keep track of the renamed objects, or you won't be able to find them after they are renamed in case the new names don't match your expectations.
If you need to keep track of them, you can create a set with the set command before renaming; no matter what becomes of the names, all of the objects will still be in the set. Alternatively, you can traverse the hierarchy by selecting objects and renaming the current selection -- this won't record the changes but you won't have problems with objects changing names in the middle of your operation and messing up your commands.
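A small illustration of the set idea, in Python via maya.cmds rather than MEL (the set name here is made up):

import maya.cmds as cmds

# put the current selection and its DAG descendants into a set before renaming;
# whatever the objects end up being called, the set still holds them
tracked = cmds.sets(cmds.ls(selection=True, dag=True), name="renameTracking")
# ... perform the renames ...
members = cmds.sets(tracked, query=True)   # recover the objects afterwards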
It can be messy to do this in MEL if you have non-unique names, because the handle you have for the object is the name itself. Once you rename the parent of a node with a non-unique name, the child's name is different. If you store the list of all names before starting to rename, you will get errors, as the rename command will attempt to rename nodes that no longer exist. There are two solutions I know of using MEL. But first, here's the PyMel solution, which is much easier; I recommend you use it.
PyMel Solution:
import pymel.core as pm
objects = pm.ls(selection=True, dag=True, type="joint")
pfx = 'my_prefix_'
for o in objects:
    o.rename(pfx + o.name().split('|')[-1])
As pm.ls returns a list of real objects, and not just the names, you can safely rename a parent node and still have a valid handle to its children.
If you really want to do it in MEL, you need to either rename from the bottom up, or recurse so that you don't ask for the names of children before dealing with the parent.
The first MEL solution is to get a list of long object names and sort them based on their depth, deepest first. In this way you are guaranteed to never rename a parent before its children. The sorting bit is too convoluted to be bothered with here, and the recursive solution is better anyway.
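For illustration only, here is a rough sketch of that depth-sort idea, written in Python with maya.cmds rather than MEL (the prefix variable is made up): with long names, the depth is just the number of '|' separators, and renaming deepest-first means a parent's rename never invalidates a stored child name.

import maya.cmds as cmds

prefix = "my_prefix_"
# long names look like |joint1|joint2|joint3, so depth = number of '|' characters
nodes = cmds.ls(selection=True, dag=True, long=True, type="joint")
for node in sorted(nodes, key=lambda n: n.count("|"), reverse=True):
    cmds.rename(node, prefix + node.rsplit("|", 1)[-1])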
Recursive MEL solution:
global proc renameHi(string $o, string $prefix) {
    string $children[] = `listRelatives -f -c -type "joint" $o`;
    for ($c in $children) {
        renameHi($c, $prefix);
    }
    string $buff[];
    int $numToks = tokenize($o, "|", $buff);
    string $newName = $buff[($numToks - 1)];
    $newName = ($prefix + $newName);
    rename($o, $newName);
}
string $prefix = "my_prefix_";
string $sel[] = `ls -sl -type "joint"`;
for ($o in $sel) {
    renameHi($o, $prefix);
}
This recursive solution drills down to the leaf nodes and renames them before renaming their parents.

What DBMS should I use to store openstreetmap as a graph?

Background:
I need to store the following data in a database:
osm nodes with tags;
osm edges with weights (that is, an edge between two nodes extracted from a 'way' in an .osm file).
Nodes that form edges belonging to the same 'way' should get the same tags as that way, i.e. every node in a 'way' that is a highway should have a 'highway' tag.
I need this structure to easily generate a graph based on various filters, e.g. a graph consisting only of nodes and edges which are highways, or a 'foot paths' graph, etc.
Problem:
I had not heard about spatial indexes before, so I just parsed an .osm file into a MySQL database:
all nodes to a 'nodes' table (with respective coordinate columns) - OK, about 9,000,000 rows in my case:
INSERT INTO nodes VALUES (node_id, lat, lon); (pseudocode)
all ways to an 'edges' table (usually one way creates a few edges) - OK, about 9,000,000 rows as well:
INSERT INTO edges VALUES (edge_id, from_node_id, to_node_id); (pseudocode)
add tags to nodes, calculate weights for edges - Problem:
Here is the problematic php script:
$query = mysql_query('SELECT * FROM edges');
$i = 0;
while ($res = mysql_fetch_object($query)) {
    $i++;
    echo "$i\n";
    $node1 = mysql_query('SELECT * FROM nodes WHERE id='.$res->from);
    $node1 = mysql_fetch_object($node1);
    $tag1 = $node1->tags;
    $node2 = mysql_query('SELECT * FROM nodes WHERE id='.$res->to);
    $node2 = mysql_fetch_object($node2);
    $tag2 = $node2->tags;
    mysql_query('UPDATE nodes SET tags="'.$tag1.$res->tags.'" WHERE nodes.id='.$res->from);
    mysql_query('UPDATE nodes SET tags="'.$tag2.$res->tags.'" WHERE nodes.id='.$res->to);
}
Nohup shows the output of 'echo "$i\n"' every 55-60 seconds (which would take more than 17 years to finish, since the 'edges' table has more than 9,000,000 rows in my case).
Htop shows a /usr/bin/mysqld process which takes 40-60% of CPU.
The same problem exists for the script which tries to calculate the weight (the distance) of an edge (select all edges, take an edge, then select the two nodes of this edge from the 'nodes' table, then calculate the distance, then update the edges table).
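For reference, the distance mentioned here is typically the great-circle (haversine) distance between the two nodes' coordinates; a small Python sketch (the function name and radius constant are mine, not from the post):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in kilometres
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))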
Question:
How can I make these SQL updates faster? Should I tweak any of MySQL's config settings? Or should I use PostgreSQL with the PostGIS extension? Should I use another structure for my data? Or should I somehow utilize a spatial index?
If I understand you right, there are two things to discuss.
First, your idea of putting the highway tag on the start and stop nodes. A node can have more than one edge connected; where do you put the tag from the second edge? Or the third or fourth if it is a crossing? The reason the highway tag is put in the edges table in the first place is that, from a relational point of view, that is where it belongs.
Second, fetching the whole table and processing it outside the database is not the right way. Taking care of this whole process is what a relational database is really good at.
I have not worked with MySQL, and I fully agree that you will probably have a lot more fun if you migrate to PostGIS, since PostGIS has much better spatial capabilities (even if you don't need any spatial capabilities for this particular task), from what I have heard.
So, if we ignore the first problem and, just to show the concept, say that there are only two edges connected to one node and that each node has two tag fields, tag1 and tag2, then it could look something like this in PostGIS:
UPDATE nodes set tag1=edges.tags from edges where nodes.id=edges.from;
UPDATE nodes set tag2=edges.tags from edges where nodes.id=edges.to;
If you disable the indexes, that should be very fast.
Again,
if I have understood you right.
PostgreSQL
Openstreetmap itself uses PostgreSQL, so I guess that's recommended.
See: http://wiki.openstreetmap.org/wiki/PostgreSQL
You can see OSM's database schema at: http://wiki.openstreetmap.org/wiki/Database_Schema
So you can use the same fields, field types, and indexes that OSM uses, for maximum compatibility.
MySQL
If you want to import .osm files into a MySQL database, have a look at:
http://wiki.openstreetmap.org/wiki/OsmDB.pm
Here you will find Perl code that will create MySQL tables, parse an OSM file, and import it into your MySQL database.
Making it faster
If you are updating in bulk, you don't need to update the indexes after every update.
You can just disable the indexes, do all your updates and re-enable the index.
I'm guessing that should be a whole lot faster.
Good luck

How to store images in your filesystem

Currently, I've got images (max. 6MB) stored as BLOBs in an InnoDB table.
As the size of the data grows, the nightly backup is getting slower and slower, hindering normal performance.
So, the binary data needs to go to the file system. (pointers to the files will be kept in the DB.)
The data has a tree-like relation:
- main site
    - user_0
        - album_0
        - album_1
        - album_n
    - user_1
    - user_n
etc...
Now I want the data to be distributed evenly through the directory structure. How should I accomplish this?
I guess I could try MD5('userId, albumId, imageId'); and slice up the resulting string to get my directory path:
/var/imageStorage/f/347e/013b/c042/51cf/985f7ad0daa987d.jpeg
This would allow me to map the first character to a server and evenly distribute the directory structure over multiple servers.
This would, however, not keep images organised per user, and would likely spread the images of one album over multiple servers.
My question is:
What is the best way to store the image data in the file system in a balanced way, while keeping user/album data together?
Am I thinking in the right direction? or is this the wrong way of doing things altogether?
Update:
I will go for the md5(user_id) string slicing for the split at the highest level, and then put all of a user's data in that same bucket. This will ensure an even distribution of data while keeping each user's data stored close together.
/var
    - imageStorage
        - f/347e/013b
            - f347e013bc04251cf985f7ad0daa987d
                - 0
                    - album1_10
                        - picture_1.jpeg
                - 1
                    - album1_1
                        - picture_2.jpeg
                        - picture_3.jpeg
                    - album1_11
                        - picture_n.jpeg
                - n
                    - album1_n
I think I will use the albumId split up from behind (I like that idea!) so as to keep the number of albums per directory smaller (although it won't be necessary for most users).
Thanks!
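A rough Python sketch of this layout (my own reading of the tree above; the slice widths and the album directory naming are assumptions, not from the post):

import hashlib

def image_path(user_id, album_id, filename):
    h = hashlib.md5(str(user_id).encode()).hexdigest()
    user_part = "/".join([h[0], h[1:5], h[5:9], h])   # e.g. f/347e/013b/f347e013b...
    bucket = str(album_id)[-1]                        # album id "split from behind"
    return "/var/imageStorage/%s/%s/album_%s/%s" % (user_part, bucket, album_id, filename)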
Just split your userid from behind. e.g.
UserID = 6435624
Path = /images/24/56/6435624
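A tiny Python sketch of that scheme (hypothetical function name, digit pairs taken from the end of the id as in the example):

def user_dir(user_id):
    s = str(user_id)
    return "/images/%s/%s/%s" % (s[-2:], s[-4:-2], s)

user_dir(6435624)   # '/images/24/56/6435624'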
As for the backup, you could use MySQL replication and back up the slave database to avoid problems (e.g. locks) while backing up.
One thing about distributing the filenames into different directories: if you consider splitting your md5 filenames into different subdirectories (which is generally a good idea), I would suggest keeping the complete hash as the filename and duplicating the first few characters as directory names. This way you will make it easier to identify files, e.g. when you have to move directories.
e.g.
abcdefgh.jpg -> a/ab/abc/abcdefgh.jpg
If your filenames are not evenly distributed (not a hash), try to choose a splitting method that gives an even distribution, e.g. the last characters if it is an incrementing user id.
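A quick Python sketch of that layout (the function name and the three prefix levels are just taken from the example above):

def hash_path(hashed_name):
    stem = hashed_name.split(".", 1)[0]
    return "/".join([stem[:1], stem[:2], stem[:3], hashed_name])

hash_path("abcdefgh.jpg")   # 'a/ab/abc/abcdefgh.jpg'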
I'm using this strategy, given a unique picture ID:
reverse the string
zero-fill it with a leading zero if there's an odd number of digits
chunk the string into two-digit substrings
build the path as below
17 >> 71 >> /71.jpg
163 >> 0361 >> /03/61.jpg
6978 >> 8796 >> /87/96.jpg
1687941 >> 01497861 >> /01/49/78/61.jpg
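A small Python sketch of this strategy (hypothetical function name; the .jpg extension is hard-coded as in the examples):

def picture_path(picture_id):
    s = str(picture_id)[::-1]                  # reverse the string
    if len(s) % 2:                             # zero-fill to an even number of digits
        s = "0" + s
    chunks = [s[i:i + 2] for i in range(0, len(s), 2)]
    return "/" + "/".join(chunks[:-1] + [chunks[-1] + ".jpg"])

picture_path(1687941)   # '/01/49/78/61.jpg'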
This method ensures that each folder contains up to 100 pictures and 100 sub-folders and the load is evenly distributed between the left-most folders.
Moreover, you just need the ID of the picture to reach the file; there is no need to read a picture table containing other metadata.
User data is indeed not stored close together, and the ID-to-path relation is predictable; whether that matters depends on your needs.

Resources