Multiple processes providing DBus objects - dbus

I'm writing a program which works on a single document. If you want to open multiple documents, you simply open a process per document (yes, the process isolation is important in this case). Let’s call these processes the servers.
Each server will provide a single object representing the document, and I’d like for client applications to be able to discover these objects. Ideally, the client interface wouldn’t be able to tell whether the documents were owned by different processes or not.
My vague solution would be to have all processes share a well-known connection name (org.example.MyApplication), and provide objects with their PID in them to avoid duplicates (/org/example/MyApplication/).
However, processes can’t share the same well-known connection name, so that’s not going to work.
I imagine I could get the client application to monitor new connections, and scan them to see if the expected object path exists, but that seems like a bad idea.
Any ideas how I can do this?

One approach, used by KDE, is to use well-known names suffixed with PIDs, like org.kde.StatusNotifierItem-2055-1. The client can call org.freedesktop.DBus.ListNames and filter the list.
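The client-side filtering could look roughly like the sketch below. It does not talk to a real bus: `bus_names` stands in for whatever list org.freedesktop.DBus.ListNames returned, and the helper name `find_instance_names` and the exact `<base>-<pid>-<n>` pattern are assumptions made for illustration.

```python
import re


def find_instance_names(bus_names, base_name):
    """Return the names shaped like '<base_name>-<pid>-<n>', ordered by PID, then n."""
    pattern = re.compile(re.escape(base_name) + r"-(\d+)-(\d+)")
    matches = []
    for name in bus_names:
        m = pattern.fullmatch(name)
        if m:
            matches.append((int(m.group(1)), int(m.group(2)), name))
    return [name for _, _, name in sorted(matches)]


# Example input standing in for the reply of org.freedesktop.DBus.ListNames.
names = [
    "org.freedesktop.DBus",
    ":1.42",
    "org.kde.StatusNotifierItem-2055-1",
    "org.kde.StatusNotifierItem-310-1",
    "org.kde.StatusNotifierItemWatcher",
]
print(find_instance_names(names, "org.kde.StatusNotifierItem"))
# ['org.kde.StatusNotifierItem-310-1', 'org.kde.StatusNotifierItem-2055-1']
```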
Or the first server could grab the well-known name, and the subsequent ones would call it to register their documents so that clients can discover them:
src name = :0.42
src path = /org/example/MyApplication/2
dest name = org.example.MyApplication
dest path = /org/example/MyApplication/Documents
dest method = Publish(:0.42, /org/example/MyApplication/Documents/2)
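As a rough sketch of the bookkeeping behind such a Publish call, here is an in-memory registry. It is plain Python with no D-Bus binding; the class `DocumentRegistry` and its method names are hypothetical, and only the connection name and object path come from the example above.

```python
class DocumentRegistry:
    """In-memory stand-in for the object at /org/example/MyApplication/Documents."""

    def __init__(self):
        self._documents = {}  # object path -> unique connection name of its owner

    def publish(self, owner, path):
        """Record that `owner` (e.g. ':0.42') serves the document at `path`."""
        current = self._documents.get(path)
        if current is not None and current != owner:
            raise ValueError(f"{path} is already published by {current}")
        self._documents[path] = owner

    def drop_owner(self, owner):
        """Forget every document of a server that has left the bus."""
        self._documents = {p: o for p, o in self._documents.items() if o != owner}

    def list_documents(self):
        """Return (owner, path) pairs so a client can call each server directly."""
        return sorted((o, p) for p, o in self._documents.items())


registry = DocumentRegistry()
registry.publish(":0.42", "/org/example/MyApplication/Documents/2")
print(registry.list_documents())
# [(':0.42', '/org/example/MyApplication/Documents/2')]
```

On a real bus the registry would presumably watch the NameOwnerChanged signal to learn when a server disappears and call something like `drop_owner`. It would also need a plan for the case where the first server itself exits.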


What type of database for storing ML experiments

I'm thinking of writing a small piece of software that runs ML experiments on a cluster (or an arbitrary abstracted executor) and saves the results so that I can view them efficiently in real time. The executor software will have write access to the database and will push metrics live. I have not worked much with databases, so I'm not sure what the correct approach is. Here is a description of what the system should store:
Each experiment will consist of a single piece of code, or an archive of code, that can be executed on the remote machine. For now we will assume all dependencies etc. are installed there. The code will accept command line arguments. The experiment will also include a YAML schema defining the command line arguments. The code itself will specify what is logged (e.g. I will provide a library in the language for registering channels). In terms of logging, you can log numerical values, arrays, text, etc., so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns, first an int iteration, second a float error). The code will also provide a special copy of the parameters at the end of the experiment.
When one submits an experiment, one needs to provide its unique group name plus the parameters for execution. This will launch the experiment and log everything.
For me, the easiest way to implement this is with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and a folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be a delimited CSV file, with a special schema file describing what types of values are stored there so I can load them back. The final parameters can also be copied into the folder.
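To make the flat-file idea concrete, here is a minimal sketch of one channel stored as a CSV file plus a JSON schema file. The file naming (`<channel>.csv`, `<channel>.schema.json`), the helper names and the three supported type names are all assumptions for illustration, not a fixed design.

```python
import csv
import json
import tempfile
from pathlib import Path

CASTS = {"int": int, "float": float, "str": str}  # type names allowed in a schema


def create_channel(experiment_dir, name, columns):
    """Create '<name>.csv' and '<name>.schema.json' for (column, type) pairs."""
    experiment_dir = Path(experiment_dir)
    experiment_dir.mkdir(parents=True, exist_ok=True)
    schema = [{"name": n, "type": t} for n, t in columns]
    (experiment_dir / f"{name}.schema.json").write_text(json.dumps(schema))
    with open(experiment_dir / f"{name}.csv", "w", newline="") as f:
        csv.writer(f).writerow([n for n, _ in columns])  # header row


def log_row(experiment_dir, name, row):
    """Append one row of values to the channel file."""
    with open(Path(experiment_dir) / f"{name}.csv", "a", newline="") as f:
        csv.writer(f).writerow(row)


def read_channel(experiment_dir, name):
    """Load a channel, converting every cell according to the schema file."""
    experiment_dir = Path(experiment_dir)
    schema = json.loads((experiment_dir / f"{name}.schema.json").read_text())
    with open(experiment_dir / f"{name}.csv", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [tuple(CASTS[c["type"]](v) for c, v in zip(schema, r)) for r in reader]


# One folder per experiment inside a project folder.
exp = Path(tempfile.mkdtemp()) / "my_project" / "0001"
create_channel(exp, "train_error", [("iteration", "int"), ("error", "float")])
log_row(exp, "train_error", [1, 0.5])
log_row(exp, "train_error", [2, 0.25])
print(read_channel(exp, "train_error"))  # [(1, 0.5), (2, 0.25)]
```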
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea whether this is possible in any database system. I may also be overlooking something very obvious. If you have any experience with this, any suggestions or advice are most welcome. The main goal in the end is to be able to serve this to a web interface. Maybe NoSQL could accommodate this, maybe not (I don't know exactly how those systems work)?
The data for ML would primarily be unstructured. That kind of data does not fit naturally into an RDBMS. A document database like MongoDB is far better suited to such cases.

Couchbase, two user registering with same username but different datacenters?

Let's say I have two users, Alice in North America and Bob in Europe. Both want to register a new account with the same username, at the same time, on different datacenters. The datacenters are configured to replicate between each other using eventual consistency.
How can I make sure only one of them succeeds at registering the username? Keep in mind that the connection between the datacenters might even be offline at the time (worst case, but a daily occurrence on Spotify's Cassandra setup).
EDIT:
I do realize the key uniqueness is the big problem here. The thing is that I need all usernames to be unique. Imagine using twitter if you couldn't tag a specific person, but had to tag everyone with the same username.
With any eventual consistency system, and particularly in the presence of a network partition, you essentially have two choices:
Accept collisions, and pick a winner later.
Ensure you never have a collision.
In the case of Couchbase:
For (1) that means letting two users register with the same username in both NA and EU, and then later picking one as the "winner" (when the network link is present). That is not a very desirable outcome for something like a user account. A slight variation on this would be something like @Robert's suggestion: put them in a staging area (which means the account cannot be made "active" until the partition is resolved), then tell the "winning" user they have successfully registered, and the "loser" that the name is taken and to try again.
For (2) this means making the users unique, even though they pick the same username - for example adding a NA:: / EU:: prefix to their username document. When they login the application would need some logic to try looking up both document variations - likely trying the prefix for the local region first. (This is essentially the same idea as "realms" or "servers" that many MMO games use).
There are variations of both of these, but ultimately given an AP-type system (which Couchbase across XDCR is) you've essentially chosen Availability & Partition-Tolerance over Consistency, and hence need to reconcile that at the application layer.
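Option (2) can be sketched with a plain dictionary standing in for the replicated bucket. This is not Couchbase SDK code; the helper names and the `REGION::username` key layout simply follow the prefix idea described above and are otherwise assumptions.

```python
def user_key(region, username):
    """Build the region-prefixed document key, e.g. 'NA::sam'."""
    return f"{region}::{username}"


def register(store, region, username, profile):
    """Create the user under the local region's key; fail if that key exists."""
    key = user_key(region, username)
    if key in store:
        return False
    store[key] = profile
    return True


def find_user(store, username, local_region, all_regions):
    """Look up the local region's key first, then the other regions."""
    ordered = [local_region] + [r for r in all_regions if r != local_region]
    for region in ordered:
        doc = store.get(user_key(region, username))
        if doc is not None:
            return region, doc
    return None


store = {}  # stand-in for the bucket after both sides have replicated
regions = ["NA", "EU"]
print(register(store, "NA", "sam", {"owner": "Alice"}))  # True
print(register(store, "EU", "sam", {"owner": "Bob"}))    # True, different key
print(find_user(store, "sam", "EU", regions))            # ('EU', {'owner': 'Bob'})
```

Both registrations succeed because the keys never collide, which is exactly the trade-off: the name "sam" is only unique within a region.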
Put the user name registrations into a staging table until you can perform a replication to determine if the name already exists in one of the other data centers.
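A minimal sketch of the reconciliation step, once the staged registrations from every datacenter are visible in one place. The record layout and the policy (earliest request wins, ties broken by datacenter name) are assumptions; the timestamps are also assumed to be comparable across datacenters, which clock skew can undermine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingRegistration:
    username: str
    user_id: str
    datacenter: str
    requested_at: float  # assumed comparable across datacenters


def resolve(pending, taken):
    """Split staged registrations into winners (by username) and losers."""
    winners, losers = {}, []
    ordered = sorted(pending, key=lambda r: (r.requested_at, r.datacenter, r.user_id))
    for reg in ordered:
        if reg.username in taken or reg.username in winners:
            losers.append(reg)  # name already active, or claimed earlier
        else:
            winners[reg.username] = reg
    return winners, losers


staged = [
    PendingRegistration("sam", "bob", "EU", 100.2),
    PendingRegistration("sam", "alice", "NA", 100.1),
    PendingRegistration("root", "eve", "EU", 99.0),
]
winners, losers = resolve(staged, taken={"root"})
print(winners["sam"].user_id)        # alice
print([r.user_id for r in losers])   # ['eve', 'bob']
```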
You tagged Couchbase, so I will answer about that.
As long as the key for each object is different, you should be fine with Couchbase. It is the keys that would be unique and work great with XDCR. Another solution would be to have a concatenated key made up of the username and other values (company name, etc) if that suits your use case, again giving you a unique key for the object. Yet another would be to have a key/value in a JSON document that is the username.
It's not clear to me whether you're using Cassandra or Couchbase.
As far as Cassandra is concerned, since version 2.0 you can use Lightweight Transactions, which were created for exactly this purpose. A serial consistency level was introduced to achieve what you need. In the above link you can read the following:
For example, suppose that I have an application that allows users to
register new accounts. Without linearizable consistency, I have no way
to make sure I allow exactly one user to claim a given account — I
have a race condition analogous to two threads attempting to insert
into a [non-concurrent] Map: even if I check for existence before
performing the insert in one thread, I can’t guarantee that no other
thread inserts it after the check but before I do.
As for the missing connection between two or more clusters, it's your choice how to handle it. If you can't guarantee uniqueness at insert time, you can either refuse the registration, or accept it and apologize later.
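The race described in the quote, and the "check and insert as one step" fix, can be illustrated inside a single process. This is only an analogy in plain Python: the lock plays the role that a conditional insert (`INSERT ... IF NOT EXISTS` in CQL) plays in Cassandra, and the class and method names are made up for the example.

```python
import threading


class UsernameRegistry:
    def __init__(self):
        self._owners = {}
        self._lock = threading.Lock()

    def claim(self, username, user_id):
        """Check and insert as one atomic step; True only for the first caller."""
        with self._lock:
            if username in self._owners:
                return False
            self._owners[username] = user_id
            return True


registry = UsernameRegistry()
results = []


def attempt(user_id):
    results.append((user_id, registry.claim("sam", user_id)))


threads = [threading.Thread(target=attempt, args=(u,)) for u in ("alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one of the two concurrent attempts wins, whichever ran first.
print(sum(1 for _, ok in results if ok))  # 1
```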
HTH, Carlo

Pick random number from list of array values captured during correlation in Loadrunner

I have correlated the values from my script and captured them into a parameter array using Ord=all. Now I want to pick values from it at random and write them to a file in a certain format.
Can someone help me understand how the random function is used in LoadRunner?
script:
web_reg_save_param("param", "rb=\\", "lb=\\", "Ord=all", LAST);
values:
param_1 = blah-blah
param_2 = blah-blah
and so on n on....
... pass it to a file, ...
Greater than 99% of the time why people want to do this is because they intend to take a value as output generated by one virtual user type and pass it as input to another virtual user type. In general this does not work for the following reasons:
All parameter files are loaded into RAM at the beginning of the test, so a new value written to the tail end of a file will only show up in the next test, not the current test
In a properly designed test virtual user types are distributed to different load generators. This would mean that you would need to write the file to a common location for all of the virtual users to access, such as a shared network drive. You would now be adding two extra finite resource calls to your virtual users, a network request and a disk write request. This will slow your virtual users down, possibly introducing a bottleneck into your entire test design
Let's be blunt: very few LoadRunner users have the skills to manage tens, hundreds or thousands of users all reading, writing (and potentially deleting) records from the same file. This is a non-trivial programming operation. By asking how to write the information to a file you have placed yourself in a skills arena where you don't have the programming maturity to be up to the task. In all likelihood you will introduce all sorts of delays due to locking as all of the users try to access the same file at the same time.
HP includes a service to allow users to pass data from one user to another via a broker. This is the Virtual Table Server (VTS). VTS would then manage the locks and all of the reads, writes and deletes to its internal data files, which simplifies the act of passing data from one user to another immensely. VTS is a "use once" queue for passing data, so there is no reason why you could not also use a queue solution such as RabbitMQ or a queue table in your database provider to accomplish the same task. Just be sure not to use a queuing solution running on the same infrastructure as your application under test.
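The "use once" hand-off can be illustrated outside LoadRunner with Python's standard `queue.Queue` acting as the broker. This is not the VTS API; it only shows the semantics: each value written by a producer is handed to exactly one consumer and then is gone.

```python
import queue

handoff = queue.Queue()  # stand-in for the broker between user types


def produce(values):
    """Producer side: publish each captured value once."""
    for value in values:
        handoff.put(value)


def consume():
    """Consumer side: take one value, or None when nothing is waiting."""
    try:
        return handoff.get_nowait()
    except queue.Empty:
        return None


produce(["order-1001", "order-1002"])
print(consume())  # order-1001
print(consume())  # order-1002
print(consume())  # None, every value was handed out exactly once
```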

Keeping my database and file system in sync

I'm working on a piece of software that stores files in a file system, as well as references to those files in a database. Querying the uploaded files can thus be done in the database without having to access the file system. From what I've read in other posts, most people say it's better to use a file system for file storage rather than storing binary data directly in a database as a BLOB.
So now I'm trying to understand the best way to set this up so that the database and the file system stay in sync and I don't end up with references to files that don't exist, or files taking up space in the file system that aren't referenced. Here are a couple of options that I'm considering.
Option 1: Add File Reference First
//Adds a reference to a file in the database
database.AddFileRef("newfile.txt");
//Stores the file in the file system
fileStorage.SaveFile("newfile.txt",dataStream);
This option would be problematic because the reference to the file is added before the actual file, so another user may end up trying to download a file before it is actually stored in the system. Although, since the reference to the file is created beforehand, the primary key value could be used when storing the file.
Option 2: Store File First
//Stores the file
fileStorage.SaveFile("newfile.txt",dataStream);
//Adds a reference to the file in the database
//fails if the referenced file does not exist in the file system
database.AddFileRef("newfile.txt");
This option is better, but would make it possible for someone to upload a file to the system that is never referenced. Although this could be remedied with a "Purge" or "CleanUpFileSystem" function that deletes any unreferenced files. This option also wouldn't allow the file to be stored using the primary key value from the database.
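A minimal sketch of option 2 together with the clean-up pass. A Python `set` stands in for the database table of references, and the function names and the `min_age_seconds` safety margin are assumptions added for illustration.

```python
import tempfile
import time
from pathlib import Path


def save_then_reference(storage_dir, refs, name, data):
    """Option 2: write the file first, then add the reference."""
    path = Path(storage_dir) / name
    path.write_bytes(data)  # raises on failure, so no reference is added
    refs.add(name)          # `refs` stands in for the database table


def purge_unreferenced(storage_dir, refs, min_age_seconds=0.0):
    """Delete files that no reference points to; return their names."""
    now = time.time()
    removed = []
    for path in Path(storage_dir).iterdir():
        if not path.is_file() or path.name in refs:
            continue
        # Skip young files: an upload may sit between the two steps above.
        if min_age_seconds and now - path.stat().st_mtime < min_age_seconds:
            continue
        path.unlink()
        removed.append(path.name)
    return sorted(removed)


storage = tempfile.mkdtemp()
refs = set()
save_then_reference(storage, refs, "newfile.txt", b"hello")
(Path(storage) / "orphan.txt").write_bytes(b"never referenced")
print(purge_unreferenced(storage, refs))  # ['orphan.txt']
```

In a real deployment the purge would probably run with a generous `min_age_seconds`, so that it cannot delete a file whose reference is about to be inserted.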
Option 3: Pending Status
//Adds a pending file reference to database
//pending files would be ignored by others
database.AddFileRef("newfile.txt");
//Stores the file, fails if there is no
//matching pending file reference in the database
fileStorage.SaveFile("newfile.txt",dataStream);
//marks the file reference as committed after file is uploaded
database.CommitFileRef("newfile.txt");
This option allows the primary key to be created before the file is uploaded, but also prevents other users from obtaining a reference to a file before it is uploaded. Although, it would be possible for a file to never be uploaded, and a file reference to be stuck pending. Yet, it would also be fairly trivial to purge pending references from the database.
I'm leaning toward option 2, because it's simple, and I don't have to worry about users trying to request files before they are uploaded. Storage is cheap, so it's not the end of the world if I end up with some unreferenced files taking up space. But this also seems like a common problem, and I'd like to hear how others have solved it or other considerations I should be making.
I want to propose another option. Make the filename always equal to the hash of its contents. Then you can safely write any content at all times provided that you do it before you add a reference to it elsewhere.
As contents never change there is never a synchronization problem.
This gives you deduplication for free. Deletes become harder though. I recommend a nightly garbage collection process.
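A small sketch of content-addressed storage along these lines. The answer does not name a hash function, so SHA-256 is an assumption, as are the helper name and the temporary-file naming. A concurrent writer storing the same content would need a unique temporary name, which is left out here.

```python
import hashlib
import tempfile
from pathlib import Path


def store_by_hash(storage_dir, data):
    """Write `data` under its SHA-256 hex digest and return that digest."""
    digest = hashlib.sha256(data).hexdigest()
    path = Path(storage_dir) / digest
    if not path.exists():  # identical content is stored only once
        tmp = path.with_name(digest + ".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)  # rename into place once the file is complete
    return digest


storage = Path(tempfile.mkdtemp())
first = store_by_hash(storage, b"same bytes")
second = store_by_hash(storage, b"same bytes")   # deduplicated
third = store_by_hash(storage, b"other bytes")
print(first == second, first == third)           # True False
print(len(list(storage.iterdir())))              # 2
```

The returned digest is what the database reference would hold, and it is only inserted after `store_by_hash` has returned.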
What is the real use of the database? If it's just a list of files, I don't think you need it at all, and not having it saves you the hassle of synchronising.
If you are convinced you need it, then options 1 and 2 are completely identical from a technical point of view: the two resources can be out of sync and you need a regular process to consolidate them again. So here you should choose the option that suits the application best.
Option 3 has no advantage whatsoever, but uses more resources.
Note that using hashes, as suggested by usr, bears a theoretical risk of collision. And you'd also need a periodical consolidation process, as for options 1 and 2.
Another question is how you deal with partial uploads and uploads in progress. Here option 2 could be of use, but you could also use a second "flag" file that is created before the upload starts, and deleted when the upload is done. This would help you determine which uploads have been aborted.
To remedy the drawback you mentioned of option 1 I use something like fileStorage.FileExists("newfile.txt"); and filter out the result for which it returns a negative.
In Python lingo:
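import os

# Keep only the references whose file is actually present on disk.
# `database.AllRefs()` and `ref.path()` stand for the application's own database API.
existing_refs = [ref for ref in database.AllRefs() if os.path.exists(ref.path())]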

Is this a functional syncing algorithm?

I'm working on a basic syncing algorithm for a user's notes. I've got most of it figured out, but before I start programming it, I want to run it by here to see if it makes sense. Usually I end up not realizing one huge important thing that someone else easily saw that I couldn't. Here's how it works:
I have a table in my database where I insert objects called SyncOperation. A SyncOperation is a sort of metadata on the nature of what every device needs to perform to be up to date. Say a user has 2 registered devices, firstDevice and secondDevice. firstDevice creates a new note and pushes it to the server. Now, a SyncOperation is created with the note's Id, operation type, and processedDeviceList. I create a SyncOperation with type "NewNote", and I add the originating device ID to that SyncOperation's processedDeviceList. So now secondDevice checks in to the server to see if it needs to make any updates. It makes a query to get all SyncOperations where secondDeviceId is not in the processedDeviceList. It finds out its type is NewNote, so it gets the new note and adds itself to the processedDeviceList. Now this device is in sync.
When I delete a note, I find the already created SyncOperation in the table with type "NewNote". I change the type to Delete and remove all devices from processedDeviceList except for the device that deleted the note. So now when new devices call in to see what they need to update, since their deviceId is not in the processedDeviceList, they'll have to process that SyncOperation, which tells their device to delete that respective note.
And that's generally how it'd work. Is my solution too complicated? Can it be simplified? Can anyone think of a situation where this wouldn't work? Will this be inefficient on a large scale?
Sounds very complicated - the central database shouldn't be responsible for determining which devices have received which updates. Here's how I'd do it:
The database keeps a table of SyncOperations for each change. Each SyncOperation has a change_id numbered in ascending order (that is, change_id INTEGER PRIMARY KEY AUTOINCREMENT).
Each device keeps a current_change_id number representing what change it last saw.
When a device wants to update, it does SELECT * FROM SyncOperations WHERE change_id > current_change_id. This gets it the list of all changes it needs to be up-to-date. Apply each of them in chronological order.
This has the charming feature that, if you wanted to, you could initialise a new device simply by creating a new client with current_change_id = 0. Then it would pull in all updates.
Note that this won't really work if two users can be doing concurrent edits (which edit "wins"?). You can try and merge edits automatically, or you can raise a notification to the user. If you want some inspiration, look at the operation of the git version control system (or Mercurial, or CVS...) for conflicting edits.
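The change_id scheme can be tried out with an in-memory SQLite database. The columns besides change_id (`note_id`, `op_type`) and the helper names are assumptions for the sketch. It also does not address concurrent edits to the same note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE SyncOperations (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        note_id   TEXT NOT NULL,
        op_type   TEXT NOT NULL
    )
    """
)


def record_change(note_id, op_type):
    """Server side: append one operation and return its change_id."""
    cur = conn.execute(
        "INSERT INTO SyncOperations (note_id, op_type) VALUES (?, ?)",
        (note_id, op_type),
    )
    conn.commit()
    return cur.lastrowid


def pull_changes(current_change_id):
    """Device side: fetch unseen operations in order, plus the new cursor value."""
    rows = conn.execute(
        "SELECT change_id, note_id, op_type FROM SyncOperations "
        "WHERE change_id > ? ORDER BY change_id",
        (current_change_id,),
    ).fetchall()
    new_cursor = rows[-1][0] if rows else current_change_id
    return rows, new_cursor


record_change("note-1", "NewNote")
record_change("note-2", "NewNote")
record_change("note-1", "Delete")

changes, cursor = pull_changes(0)  # a brand-new device starts at 0
print(changes)
# [(1, 'note-1', 'NewNote'), (2, 'note-2', 'NewNote'), (3, 'note-1', 'Delete')]
print(pull_changes(cursor))        # ([], 3) once the device is up to date
```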
You may want to take a look at SyncML for ideas on how to handle sync operations (http://www.openmobilealliance.org/tech/affiliates/syncml/syncml_sync_protocol_v11_20020215.pdf). SyncML has been around for a while, and as a public standard, has had a fair amount of scrutiny and review. There are also open source implementations (Funambol comes to mind) that can also provide some coding clues. You don't have to use the whole spec, but reading it may give you a few "ahah" moments about syncing data - I know it helped to think through what needs to be done.
Mark
P.S. A later version of the protocol - http://www.openmobilealliance.org/technical/release_program/docs/DS/V1_2_1-20070810-A/OMA-TS-DS_Protocol-V1_2_1-20070810-A.pdf
I have seen the basic idea of keeping track of operations in a database elsewhere, so I dare say it can be made to work. You may wish to think about what should happen if different devices are in use at much the same time, and end up submitting conflicting changes - e.g. two different attempts to edit the same note. This may surface as a change to the user interface, to allow them to intervene to resolve such conflicts manually.
