Related
So I'm thinking to write some small piece of software, which to run/execute ML experiments on a cluster or arbitrary abstracted executor and then save them such that I can view them in real time efficiently. The executor software will have access for writing to the database and will push metrics live. Now, I have not worked too much with databases, thus I'm not sure what is the correct approach for this. Here is a description of what the system should store:
Each experiment will consist of a single piece of code/archive of code such that it can be executed on the remote machine. For now we will assume allow dependencies and etc are installed there. The code will accept command line arguments. The experiment also will consists of a YAML scheme defining the command line arguments. In the code byitself will specify what will be logged in (e.g. I will provide a library in the language for registering channels). Now in terms of logging, you can log numerical values, arrays, text, etc so quite a few types. Each channel will be allowed a single specification (e.g. 2 columns, first int iteration, second float error). The code will also provide special copy of parameters at the end of the experiments.
When one submit an experiments, it will need to provide its unique group name + parameters for execution. This will launch the experiment and log everything.
Implementing this for me is easiest to do with a flat file system. Each project will have a unique name. Each new experiment gets a unique id and folder inside the project. I can store the code there. Each channel gets a file, which for simplicity can be an csv delimeter, with a special schema file describing what type of values are stored there so I can load them there. The final parameters can also be copied in the folder.
However, because of the variety of ways I can do this, and the fact that this might require a separate "table" for each experiment, I have no idea if this is possible in any database systems? Additionally, maybe I'm overseeing something very obvious or maybe not, if you had any experience with this any suggestions/advices are most welcome. The main goal is at the end to be able to serve this to a web interface. Maybe noSQL could accommodate this maybe not (I don't know exactly how those work)?
The data for ML primarily would be unstructured data. That kind of data will not naturally fit into a RDBMS. Essentially a document database like mongodb is far better suited....for such cases.
What is the best one way permutation function I could use to digest an e-mail so I can use it as a primary key without storing personal data?
I'm getting my first F2P game ready: a simple yet (hopefully) addictive 2D casual puzzler based on aiming mechanics. It's made with Unity and will be released on Android very soon.
In order for the player to keep the same data across different devices, I have an SQL table with the device e-mail as the primary key, then another string as the savegame data.
But I don't want to store the user e-mail for privacy reasons.
So I thought of digesting it with some function that would use the original e-mail to generate a new string that:
is unique (will never collide with another string generated from a different e-mail address)
is not decypherable (there should be no way to obtain the original e-mail from the digested string - or at least it should be hard enough)
This way I could still use the Android device e-mail to retrieve the savegame data, without storing personal data from the player.
As far as I've researched, the solution seems to be called a one way permutation function. The problem is that I can't seem to find an appropriate function on the internet; instead, all answers seem to be plagued with solutions for password hashing, which is very interesting (salting, MD5, SHAXXX...) but don't meet my first requirement of no collision.
Thank you in advance for any answer on this topic.
What you need is a cryptographic hash function such as SHA-256. Such functions are designed to be collision resistant, Git uses an older version SHA-1. Most languages/systems have support of this, just Google "Android SHA-256" along with your language of choice.
One option is to append a creation timestamp.
Update: Since SHA-256 does not provide sufficient collision resistance consider s GUID, from RFC 4122: "A UUID is 128 bits long, and can guarantee uniqueness across space and time.". Of course you need to find a good implementation.
Which version of the UUID should you use? I saw a lot of threads explaining what each version entails, but I am having trouble figuring out what's best for what applications.
There are two different ways of generating a UUID.
If you just need a unique ID, you want a version 1 or version 4.
Version 1: This generates a unique ID based on a network card MAC address and current time. If any of these things is sensitive in any way, don't use this. The advantage of this version is that, while looking at a list of UUIDs generated by machines you trust, you can easily know whether many UUIDs got generated by the same machine, or infer some time relationship between them.
Version 4: These are generated from random (or pseudo-random) numbers. If you just need to generate a UUID, this is probably what you want. The advantage of this version is that when you're debugging and looking at a long list of information matched with UUIDs, it's quicker to spot matches.
If you need to generate reproducible UUIDs from given names, you want a version 3 or version 5. If you are interacting with other systems, this choice was already made and you should check with version and namespaces they use.
Version 3: This generates a unique ID from an MD5 hash of a namespace and name. If are dealing with very strict resource requirements (e.g. a very busy Arduino board), use this.
Version 5: This generates a unique ID from an SHA-1 hash of a namespace and name. This is the more secure and generally recommended version.
If you want a random number, use a random number library. If you want a unique identifier with effectively 0.00...many more 0s here...001% chance of collision, you should use UUIDv1. See Nick's post for UUIDv3 and v5.
UUIDv1 is NOT secure. It isn't meant to be. It is meant to be UNIQUE, not un-guessable. UUIDv1 uses the current timestamp, plus a machine identifier, plus some random-ish stuff to make a number that will never be generated by that algorithm again. This is appropriate for a transaction ID (even if everyone is doing millions of transactions/s).
To be honest, I don't understand why UUIDv4 exists... from reading RFC4122, it looks like that version does NOT eliminate possibility of collisions. It is just a random number generator. If that is true, than you have a very GOOD chance of two machines in the world eventually creating the same "UUID"v4 (quotes because there isn't a mechanism for guaranteeing U.niversal U.niqueness). In that situation, I don't think that algorithm belongs in a RFC describing methods for generating unique values. It would belong in a RFC about generating randomness. For a set of random numbers:
chance_of_collision = 1 - (set_size! / (set_size - tries)!) / (set_size ^ tries)
That's a very general question. One answer is: "it depends what kind of UUID you wish to generate". But a better one is this: "Well, before I answer, can you tell us why you need to code up your own UUID generation algorithm instead of calling the UUID generation functionality that most modern operating systems provide?"
Doing that is easier and safer, and since you probably don't need to generate your own, why bother coding up an implementation? In that case, the answer becomes use whatever your O/S, programming language or framework provides. For example, in Windows, there is CoCreateGuid or UuidCreate or one of the various wrappers available from the numerous frameworks in use. In Linux there is uuid_generate.
If you, for some reason, absolutely need to generate your own, then at least have the good sense to stay away from generating v1 and v2 UUIDs. It's tricky to get those right. Stick, instead, to v3, v4 or v5 UUIDs.
Update:
In a comment, you mention that you are using Python and link to this. Looking through the interface provided, the easiest option for you would be to generate a v4 UUID (that is, one created from random data) by calling uuid.uuid4().
If you have some data that you need to (or can) hash to generate a UUID from, then you can use either v3 (which relies on MD5) or v5 (which relies on SHA1). Generating a v3 or v5 UUID is simple: first pick the UUID type you want to generate (you should probably choose v5) and then pick the appropriate namespace and call the function with the data you want to use to generate the UUID from. For example, if you are hashing a URL you would use NAMESPACE_URL:
uuid.uuid3(uuid.NAMESPACE_URL, 'https://ripple.com')
Please note that this UUID will be different than the v5 UUID for the same URL, which is generated like this:
uuid.uuid5(uuid.NAMESPACE_URL, 'https://ripple.com')
A nice property of v3 and v5 URLs is that they should be interoperable between implementations. In other words, if two different systems are using an implementation that complies with RFC4122, they will (or at least should) both generate the same UUID if all other things are equal (i.e. generating the same version UUID, with the same namespace and the same data). This property can be very helpful in some situations (especially in content-addressible storage scenarios), but perhaps not in your particular case.
Postgres documentation describes the differences between UUIDs. A couple of them:
V3:
uuid_generate_v3(namespace uuid, name text) - This function generates a version 3 UUID in the given namespace using the specified input name.
V4:
uuid_generate_v4 - This function generates a version 4 UUID, which is derived entirely from random numbers.
Since it's not mentioned yet: you can use uuidv1 if you want to be able to sort your entities by creation time without a separate, explicit timestamp. While that's not 100 % precise and in many cases not the best way to go (due to the lack of explicity), it comes handy in some scenarios, e.g. when you're working with a Cassanda database.
Version 1: UUIDs using a timestamp and monotonic counter.
Version 3: UUIDs based on the MD5 hash of some data.
Version 4: UUIDs with random data.
Version 5: UUIDs based on the SHA1 hash of some data.
Version 6: UUIDs using a timestamp and monotonic counter.
Version 7: UUIDs using a Unix timestamp.
Version 8: UUIDs using user-defined data.
Read more at Rust documentation.
We are in evaluating technologies that we'll use to store data that we gather during the analysis of C/C++ code. In the case of C++, the amount of data can be relatively large, ~20Mb per TU.
After reading the following SO answer it made me consider that HDF5 might be a suitable technology for us to use. I was wondering if people here could help me answer a few initial questions that I have:
Performance. The general usage for the data will be write once and read "several" times, similar to the lifetime of a '.o' file generated by a compiler. How does HDF5 compare against using something like an SQLite DB? Is that even a reasonable comparison to make?
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format. After reading the user guide I understand that HDF5 is similar to XML or a DB, in that information is associated with a tag/column and so a tool built to read an older structure will just ignore the fields that it is not concerned with? Is my understanding on this correct?
A significant chunk of the information that we wish to write out will be a tree type of structure: scope hierarchy, type hierarchy etc. Ideally we would model scopes as having parents, children etc. Is it possible to have one HDF5 object "point" to another? If not, is there a standard technique to solve this problem using HDF5? Or, as is required in a DB, do we need a unique key that would "link" one object to another with appropriate lookups when searching for the data?
Many thanks!
How does HDF5 compare against using something like an SQLite DB?
Is that even a reasonable comparison to make?
Sort of similar but not really. They're both structured files. SQLite has features to support database queries using SQL. HDF5 has features to support large scientific datasets.
They're both meant to be high performance.
Over time we will add to the information that we are storing, but will not necessarily want to re-distribute a completely new set of "readers" to support a new format.
If you store data in structured form, the data types of those structures are also stored in the HDF5 file. I'm a bit rusty as to how this works (e.g. if it includes innate backwards compatibility), but I do know that if you design your "reader" correctly it should be able to handle types that are changed in the future.
Is it possible to have one HDF5 object "point" to another?
Absolutely! You'll want to use attributes. Each object has one or more strings describing the path to reach that object. HDF5 groups are analogous to folders/directories, except that folders/directories are hierarchical = a unique path describes each one's location (in filesystems w/o hard links at least), whereas groups form a directed graph which can include cycles. I'm not sure whether you can store a "pointer" to an object directly as an attribute, but you can always store an absolute/relative path as a string attribute. (or anywhere else as a string; you could have lookup tables galore if you wanted.)
We produce HDF5 data on my project, but I don't directly deal with it usually. I can take a stab at the first two questions:
We use a write once, read many times model and the format seems to handle this well. I know a project that used to write both to an Oracle database and HDF5. Eventually they removed the Oracle output since performance suffered and no one was using it. Obviously, SQLite is not Oracle, but the HDF5 format was better suited for the task. Based on that one data point, a RDBMS may be better tuned for multiple inserts and updates.
The readers our customers use are robust when we add new data types. Some of the changes are anticipated, but we don't have to worry about breaking thing when adding more data fields. Our DBA recently wrote a Python program to read HDF5 data and populate KMZ files for visualization in Google Earth. Since it was a project he used to learn Python, I'd say it's not hard to build readers.
On the third question, I'll bow to Jason S's superior knowledge.
I'd say HDF5 is a completely reasonable choice, especially if you are already interested in it or plan to produce something for the scientific community.
I've always preferred to use long integers as primary keys in databases, for simplicity and (assumed) speed. But when using a REST or Rails-like URL scheme for object instances, I'd then end up with URLs like this:
http://example.com/user/783
And then the assumption is that there are also users with IDs of 782, 781, ..., 2, and 1. Assuming that the web app in question is secure enough to prevent people entering other numbers to view other users without authorization, a simple sequentially-assigned surrogate key also "leaks" the total number of instances (older than this one), in this case users, which might be privileged information. (For instance, I am user #726 in stackoverflow.)
Would a UUID/GUID be a better solution? Then I could set up URLs like this:
http://example.com/user/035a46e0-6550-11dd-ad8b-0800200c9a66
Not exactly succinct, but there's less implied information about users on display. Sure, it smacks of "security through obscurity" which is no substitute for proper security, but it seems at least a little more secure.
Is that benefit worth the cost and complexity of implementing UUIDs for web-addressable object instances? I think that I'd still want to use integer columns as database PKs just to speed up joins.
There's also the question of in-database representation of UUIDs. I know MySQL stores them as 36-character strings. Postgres seems to have a more efficient internal representation (128 bits?) but I haven't tried it myself. Anyone have any experience with this?
Update: for those who asked about just using the user name in the URL (e.g., http://example.com/user/yukondude), that works fine for object instances with names that are unique, but what about the zillions of web app objects that can really only be identified by number? Orders, transactions, invoices, duplicate image names, stackoverflow questions, ...
I can't say about the web side of your question. But uuids are great for n-tier applications. PK generation can be decentralized: each client generates it's own pk without risk of collision.
And the speed difference is generally small.
Make sure your database supports an efficient storage datatype (16 bytes, 128 bits).
At the very least you can encode the uuid string in base64 and use char(22).
I've used them extensively with Firebird and do recommend.
For what it's worth, I've seen a long running stored procedure (9+ seconds) drop to just a few hundred milliseconds of run time simply by switching from GUID primary keys to integers. That's not to say displaying a GUID is a bad idea, but as others have pointed out, joining on them, and indexing them, by definition, is not going to be anywhere near as fast as with integers.
I can answer you that in SQL server if you use a uniqueidentifier (GUID) datatype and use the NEWID() function to create values you will get horrible fragmentation because of page splits. The reason is that when using NEWID() the value generated is not sequential. SQL 2005 added the NEWSEQUANTIAL() function to remedy that
One way to still use GUID and int is to have a guid and an int in a table so that the guid maps to the int. the guid is used externally but the int internally in the DB
for example
457180FB-C2EA-48DF-8BEF-458573DA1C10 1
9A70FF3C-B7DA-4593-93AE-4A8945943C8A 2
1 and 2 will be used in joins and the guids in the web app. This table will be pretty narrow and should be pretty fast to query
Why couple your primary key with your URI?
Why not have your URI key be human readable (or unguessable, depending on your needs), and your primary index integer based, that way you get the best of both worlds. A lot of blog software does that, where the exposed id of the entry is identified by a 'slug', and the numeric id is hidden away inside of the system.
The added benefit here is that you now have a really nice URL structure, which is good for SEO. Obviously for a transaction this is not a good thing, but for something like stackoverflow, it is important (see URL up top...). Getting uniqueness isn't that difficult. If you are really concerned, store a hash of the slug inside a table somewhere, and do a lookup before insertion.
edit: Stackoverflow doesn't quite use the system I describe, see Guy's comment below.
Rather than URLs like this:
http://example.com/user/783
Why not have:
http://example.com/user/yukondude
Which is friendlier to humans and doesn't leak that tiny bit of information?
You could use an integer which is related to the row number but is not sequential. For example, you could take the 32 bits of the sequential ID and rearrange them with a fixed scheme (for example, bit 1 becomes bit 6, bit 2 becomes bit 15, etc..).
This will be a bidirectional encryption, and you will be sure that two different IDs will always have different encryptions.
It would obviously be easy to decode, if one takes the time to generate enough IDs and get the schema, but, if I understand correctly your problem, you just want to not give away information too easily.
We use GUIDs as primary keys for all our tables as it doubles as the RowGUID for MS SQL Server Replication. Makes it very easy when the client suddenly opens an office in another part of the world...
I don't think a GUID gives you many benefits. Users hate long, incomprehensible URLs.
Create a shorter ID that you can map to the URL, or enforce a unique user name convention (http://example.com/user/brianly). The guys at 37Signals would probably mock you for worrying about something like this when it comes to a web app.
Incidentally you can force your database to start creating integer IDs from a base value.
It also depends on what you care about for your application. For n-tier apps GUIDs/UUIDs are simpler to implement and are easier to port between different databases. To produce Integer keys some database support a sequence object natively and some require custom construction of a sequence table.
Integer keys probably (I don't have numbers) provide an advantage for query and indexing performance as well as space usage. Direct DB querying is also much easier using numeric keys, less copy/paste as they are easier to remember.
I work with a student management system which uses UUID's in the form of an integer. They have a table which hold the next unique ID.
Although this is probably a good idea for an architectural point of view, it makes working with on a daily basis difficult. Sometimes there is a need to do bulk inserts and having a UUID makes this very difficult, usually requiring writing a cursor instead of a simple SELECT INTO statement.
I've tried both in real web apps.
My opinion is that it is preferable to use integers and have short, comprehensible URLs.
As a developer, it feels a little bit awful seeing sequential integers and knowing that some information about total record count is leaking out, but honestly - most people probably don't care, and that information has never really been critical to my businesses.
Having long ugly UUID URLs seems to me like much more of a turn off to normal users.
I think that this is one of these issues that cause quasi-religious debates, and its almost futile to talk about. I would just say use what you prefer. In 99% of systems it will no matter which type of key you use, so the benefits (stated in the other posts) of using one sort over the other will never be an issue.
I think using a GUID would be the better choice in your situation. It takes up more space but it's more secure.
YouTube uses 11 characters with base64 encoding which offers 11^64 possibilities, and they are usually pretty manageable to write. I wonder if that would offer better performance than a full on UUID. UUID converted to base 64 would be double the size I believe.
More information can be found here: https://www.youtube.com/watch?v=gocwRvLhDf8
Pros and Cons of UUID
Note: uuid_v7 is time based uuid instead of random. So you can
use it to order by creation date and solve some performance issues
with db inserts if you do really many of them.
Pros:
can be generated on api level (good for distributed systems)
hides count information about entity
doesn't have limit 2,147,483,647 as 32-bit int
removes layer of errors related to passing one entity id userId: 25 to get another bookId: 25 accidently
more friendly graphql usage as ID key
Cons:
128-bit instead 32-bit int (slightly bigger size in db and ~40% bigger index, around ~30MB for 1 million rows), should be a minor concern
can't be sorted by creation (can be solved with uuid_v7)
non-time-ordered UUID versions such as UUIDv4 have poor database index locality (can be solved with uuid_v7)
URL usage
Depending on app you may care or not care about url. If you don't care, just use uuid as is, it's fine.
If you care, then you will need to decide on url format.
Best case scenario is a use of unique slug if you ok with never changing it:
http://example.com/sale/super-duper-phone
If your url is generated from title and you want to change slug on title change there is a few options. Use it as is and query by uuid (slug is just decoration):
http://example.com/book/035a46e0-6550-11dd-ad8b-0800200c9a66/new-title
Convert it to base64url:
you can get uuid back from AYEWXcsicACGA6PT7v_h3A
AYEWXcsicACGA6PT7v_h3A - 22 characters
035a46e0-6550-11dd-ad8b-0800200c9a66 - 36 characters
http://example.com/book/AYEWXcsicACGA6PT7v_h3A/new-title
Generate a unique short 11 chars length string just for slug usage:
http://example.com/book/icACEWXcsAY-new-title
http://example.com/book/icACEWXcsAY/new-title
If you don't want uuid or short id in url and want only slug, but do care about seo and user bookmarks, you will need to redirect all request from
http://example.com/sale/phone-1-title
to
http://example.com/sale/phone-1-title-updated
this will add additional complexity of managing slug history, adding fallback to history for all queries where slug is used and redirects if slugs doesn't match
As long as you use a DB system with efficient storage, HDD is cheap these days anyway...
I know GUID's can be a b*tch to work with some times and come with some query overhead however from a security perspective they are a savior.
Thinking security by obscurity they fit well when forming obscure URI's and building normalised DB's with Table, Record and Column defined security you cant go wrong with GUID's, try doing that with integer based id's.