I am trying to connect data in two databases, each created automatically by a different UI application. In one of them, all the keys are in this format: "D8FC23D7-97D6-42F5-A52F-1CE93087B3A4".
Is there any reason this would be done? I have also seen similar-looking keys in a GIS database. I can't tell whether these are supposed to be some computed key, maybe to detect what I am trying to do, or just random values with some other intent.
PS I am using SQL Server. From what I can gather, this is not something that would be auto-generated by SQL Server.
This is a GUID, also called a UUID: a universally unique identifier (see, for example, Wikipedia or RFC 4122). The idea behind a GUID is that applications can generate identifiers that are globally unique without needing a central unit to do any choreography (see the motivation from RFC 4122 quoted below).
Various systems, databases, and programming languages offer functionality for generating UUIDs (e.g. SELECT NEWID() in SQL Server); the benefit is that, with a UUID generator, an application can create globally unique identifiers entirely on its own.
UUIDs can serve as database keys, though in most cases you will find much more lightweight and more appropriate keys.
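As a minimal sketch (the table and column names here are invented for illustration), SQL Server can both generate such values and accept ones generated elsewhere:

SELECT NEWID();  -- generate a single GUID on the server

-- Hypothetical table keyed by a GUID
CREATE TABLE dbo.Asset (
    AssetId uniqueidentifier NOT NULL
        CONSTRAINT PK_Asset PRIMARY KEY
        CONSTRAINT DF_Asset_AssetId DEFAULT NEWID(),
    Name    nvarchar(100)    NOT NULL
);

-- A client application can also supply its own GUID, generated without asking the server
INSERT INTO dbo.Asset (AssetId, Name)
VALUES ('D8FC23D7-97D6-42F5-A52F-1CE93087B3A4', N'example');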
One of the main reasons for using UUIDs is that no centralized authority is required to administer them (although one format uses IEEE 802 node identifiers, others do not). As a result, generation on demand can be completely automated, and used for a variety of purposes. The UUID generation algorithm described here supports very high allocation rates of up to 10 million per second per machine if necessary, so that they could even be used as transaction IDs.
UUIDs are of a fixed size (128 bits) which is reasonably small compared to other alternatives. This lends itself well to sorting, ordering, and hashing of all sorts, storing in databases, simple allocation, and ease of programming in general.
Since UUIDs are unique and persistent, they make excellent Uniform Resource Names. The unique ability to generate a new UUID without a registration process allows for UUIDs to be one of the URNs with the lowest minting cost.
Related
I'm looking to store around 50-100 million documents in a database and be able to do queries at a very fast speed. A document would look something like this:
{
    name: 'example',
    value: '300,201,512'
}
The value column is always unique, name is not.
Now I want to be able to check, using a single query, only whether there exists a document with a specific value. What database would be the best choice, and which design would best achieve the fastest speed for a query like that?
NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is holding simple key-value pairs for short periods of time for caching purposes, or keeping unstructured collections of data that could not easily be dealt with using relational databases and the structured query language (SQL) – they are here to help.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management systems, simply because they can be considered the most basic, backbone implementation of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is no structure and no relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42), which can later be retrieved the same way by supplying the key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic information, after performing, for example, a CPU- and memory-intensive computation. They are extremely performant and efficient, and usually easily scalable.
Note: When it comes to computers, a dictionary usually refers to a special sort of data object: a collection in which each individual key is matched to a value.
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones.
Despite their complicated-to-understand image on the internet, these databases work very simply by creating collections of one or more key / value pairs that match a record.
Unlike the traditionally defined schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information, and each column of each record can be different.
Basically, column-based NoSQL databases are two-dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it, and these management systems allow very large amounts of unstructured data to be kept and used (e.g. a record with tons of information).
These databases are commonly used when simple key / value pairs are not enough, and storing very large numbers of records with very large amounts of information is a must. DBMSs implementing column-based, schema-less models can scale extremely well.
Document Based
Document based NoSQL database management systems can be considered the latest craze, one that managed to take a lot of people by storm. These DBMSs work in a similar fashion to column-based ones; however, they allow much deeper nesting and more complex structures (e.g. a document, within a document, within a document).
Documents overcome the constraint of one or two levels of key / value nesting found in columnar databases. Basically, any arbitrarily complex structure can form a document, which can be stored using these management systems.
Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downsides compared to the others. For example, retrieving the value of a record means fetching the whole record, and the same goes for updates, both of which hurt performance.
Graph Based
Finally, a very interesting flavour of NoSQL database management systems is the graph based one.
Graph based DBMSs represent data in a completely different way from the previous three models. They use graph structures with nodes and edges connected to one another through relations.
As in mathematics, certain operations are much simpler to perform using these models thanks to the way they link and group related pieces of information (e.g. connected people).
These databases are commonly used by applications where clear boundaries for connections need to be established. For example, when you register with a social network of any sort, your friends' connections to you and their friends' friends' relations to you are much easier to work with using a graph-based database management system.
Fastest document based DBs
1) MongoDB
2) DynamoDB
Here is the difference for your reference.
I would give preference to DynamoDB.
Currently we are working on an AWS data lake, with really fast performance:
we store the data in S3 and get it back via Athena.
If you want to import the data into some database, then try MS SQL Server 2008 R2, because it is very user friendly and allows you to do your work more accurately and precisely. If you want to do that without any cost, then MySQL is a better option (a good MySQL editor is SQLYog). I hope this is beneficial for you.
Short Answer:
I think 100 million documents with the structure and conditions you mention is not BIG ENOUGH to require NoSQL. You can handle them with PostgreSQL, MySQL, etc.
Note that Wikipedia used MySQL for a long time (though not any more). See the reference.
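As a rough sketch of that relational approach (the table and column names below are assumptions, not taken from the question), a unique constraint on value gives you an index, and the existence check becomes a single index seek:

-- PostgreSQL-flavoured sketch
CREATE TABLE document (
    id    bigserial PRIMARY KEY,
    name  text NOT NULL,
    value text NOT NULL UNIQUE   -- the unique constraint is backed by a b-tree index
);

-- "Does a document with this value exist?" - answered from the index alone
SELECT EXISTS (SELECT 1 FROM document WHERE value = '300,201,512');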
Consider a toy forum database with tables like User, Post, Votes, etc.
Assume that we have a machine-learning system for classifying users based on some features. The features were engineered by a team of modelers and may include values taken directly from User columns (say, user_age), as well as more complicated queries, which may require aggregations over linked tables (say, vote_count_for_latest_post or number_of_posts_with_positive_votes, etc.).
Assume there may be hundreds of such features, which may be used in different models for various purposes. All the features are recomputed daily into a single, huge offline table for the modelers. The modelers may then select, say, 10-20 features to be used in a production model. The features for the production model would be computed on-line at query time.
What would be a reasonably convenient and efficient way to organize and maintain such features?
To explain what I mean, consider three options:
Maintain the large "offline table" generation script, which contains all the logic for feature computation. Cherry-pick the necessary pieces of this script into a stored procedure which is used in production.
This is good in that both the offline table generation script and the stored proc can be made rather efficient - we join the necessary secondary tables only once and have all the aggregations over them done in one place.
The bad part is that this approach is very hard to maintain, because there is no modularity in how and where the features are defined. Often you cannot simply copy-paste a part of the "offline" script into the "on-line" proc, because the selects and joins may be structured differently in the two.
Maintain each feature as a scalar user-defined function of user_id. This provides maximum modularity and makes both the on-line and off-line scripts very straightforward, but it would probably make the offline table computation rather slow - if 100 features require aggregation over the same table, chances are the server will not be smart enough to avoid 100 extra traversals (I did not test this myself, though).
Maintain groups of related features as table-valued user-defined functions or views. This might provide a good compromise between efficiency and maintainability.
Although my personal preference seems to be the latter option, I wonder whether I am reinventing the wheel here. Perhaps there are established solutions for this problem out there already?
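To make the third option more concrete, here is a rough T-SQL sketch (the schema, table, and column names are invented): a group of related vote features kept in one inline table-valued function that both the offline generation script and the production query can call.

-- A group of related features for one user, defined in a single place
CREATE FUNCTION dbo.fn_vote_features (@user_id int)
RETURNS TABLE
AS
RETURN (
    SELECT
        COUNT(v.vote_id)                                  AS vote_count,
        SUM(CASE WHEN v.vote_value > 0 THEN 1 ELSE 0 END) AS positive_vote_count
    FROM dbo.Post  AS p
    JOIN dbo.Votes AS v ON v.post_id = p.post_id
    WHERE p.user_id = @user_id
);
GO

-- Both the offline table generation and the on-line feature lookup reuse the same definition
SELECT u.user_id, f.vote_count, f.positive_vote_count
FROM dbo.[User] AS u
CROSS APPLY dbo.fn_vote_features(u.user_id) AS f;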
I'm working on a project aimed at analyzing biometric data collected from various terminals. The process is not very performance critical; rather, it is I/O bound. The amount of data is huge (hundreds of millions of records per table). Unfortunately the database is relational, and there are 20 foreign keys. Changing the values of referenced keys is very common while a job is being completed, so there will be lots of UPDATEs and SET NULLs while collecting data.
Currently, the semantics of the database are designed. All the programs are almost complete, and a MySQL prototype of the database has been created. It works fine with sample (small-scale) data.
I have been searching for a suitable DBMS for the project. Googling around "DBMS comparisons", etc., didn't help. People say contradictory things: some say MySQL will perform inserts and updates faster, some say Oracle 9 is better...
I can't find any reliable, benchmark-based comparison between DBMSs. I use MySQL in everyday projects, but this one looks more critical.
What we need:
License and cost of the DBMS are not important, but of course an open source one (GPL or LGPL) is preferred (since the whole project will be published under the LGPL).
Very fast inserts, very fast updates, and support for a lot of foreign keys are needed.
The DBMS should handle 0 - 100 connections at a time.
Terminals are connected to server by a local network (LAN).
What I'm actually looking for is a benchmark of various DBMSs. It might contain charts, or separate comparisons of different operations (insert, update, delete) in various situations (on a table with referenced fields, or on a normal table)...
For this sort of application, I would recommend PostgreSQL, Informix, or Oracle. PostgreSQL is open source (BSDL, GPL compatible, as everyone agrees). The reasons have to do with some aspects of data modelling that may be extremely helpful in your case. In general you have two important questions:
1) How far can I tune my db for what I am doing? How far can I scale it?
and
2) How can I model my data?
On the first, Oracle and PostgreSQL are more complex but more flexible. That flexibility may come in handy. On the second, the flexibility may save you a lot of effort later. Moreover it opens up new doors regarding optimization which are not possible in a straight relational model. First I would recommend looking at this: http://db.cs.berkeley.edu/papers/Informix/www.informix.com/informix/corpinfo/zines/whitpprs/illuswp/wave.htm as it will give you some background as to what I am thinking. Additionally, if you look at what Stonebraker is talking about you will see that straight benchmarks are really an apples to oranges comparison here.
The idea of going with an ORDBMS means a few important things:
You can model data functionally dependent on your data. For example, you can have a function in Java or Python which manipulates your data and returns a result. You can index the output of those functions if you need to, trading insert performance for select performance.
Less data being stored means faster inserts.
An ability to extend your data with custom types and functions, providing higher performance access to your data.
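A minimal PostgreSQL sketch of that function-plus-index idea (the table, column, and function names are invented; the function could equally be written in PL/Python or PL/Java):

-- Hypothetical table of raw readings
CREATE TABLE readings (
    id        bigserial PRIMARY KEY,
    raw_value double precision NOT NULL
);

-- A deterministic function over the stored data
CREATE FUNCTION normalized_value(v double precision) RETURNS double precision
    AS $$ SELECT v / 100.0 $$
    LANGUAGE SQL IMMUTABLE;

-- Index the function's output: inserts pay a little, selects gain a lot
CREATE INDEX readings_normalized_idx ON readings (normalized_value(raw_value));

-- This predicate can now be answered from the index
SELECT id FROM readings WHERE normalized_value(raw_value) = 1.23;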
PostgreSQL 9.2 will support up to approx 14000 writes per second on sufficient hardware, which is nothing to sneeze at. Of course this depends on the width of the write, hardware performance on the server, etc. PostgreSQL is used by Afilias to manage the .org and .info top-level domains (web-scale!) and also by Skype's infrastructure (still, even after Microsoft bought them).
Finally as a part of your information pipeline, if you are processing huge amounts of data and need to do some preprocessing before sending to PostgreSQL, you might look at array-native db's (for a NoSQL approach common in scientific work) or VoltDB (for an in-memory store for high-throughput processing). Despite the fact that they are extremely different systems, VoltDB and Postgres were actually started by the same individual.
Finally regarding benchmark charts, the major db vendors more or less ban publication of such in their license agreements so you won't find them.
It's not really a problem, but it surprises me:
when I use Grails with different DBs, I get different id counter behaviour...:
with the out-of-the-box HSQLDB, every table gets its own counter which is always increased by 1
with an Oracle DB, it seems that all tables use the same global counter
now I am using JavaDB/Derby and the generated ids are huge!
where can I find some more information about this behaviour and which one is the best?
hsql seems to keep the counters small
with oracle, I get a global unique id - also a nice feature
but what about the derby behaviour?
It really depends on the default id generation strategy in the specific dialect. Grails allows you to customize the generation strategy with the mapping closure.
The most 'safe' generation strategy (i.e. supported by every RDBMS) is TABLE, and this is the preferred choice of many JPA implementations. This is probably what you get in HSQLDB. However, Oracle supports sequences, and these objects are generally better optimized for handling key generation -- hence, the dialect for Oracle seems to use one global sequence. I'm not familiar with Derby, but there is probably identity column support there, and what you get is some sort of UUID.
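Roughly, the two behaviours correspond to the following SQL (the names are invented, and each snippet is in its own dialect):

-- Oracle-style: one shared sequence feeds the ids of every table,
-- so values are globally increasing but not dense within any one table
CREATE SEQUENCE hibernate_sequence;
INSERT INTO book   (id, title) VALUES (hibernate_sequence.NEXTVAL, 'A');
INSERT INTO author (id, name)  VALUES (hibernate_sequence.NEXTVAL, 'B');

-- HSQLDB-style: a per-table identity column, counting up by 1 within each table
CREATE TABLE book (
    id    INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    title VARCHAR(100)
);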
We are working on designing an application that is typically OLTP (think: purchasing system). However, this one in particular has the need that some users will be offline, so they need to be able to download the DB to their machine, work on it, and then sync back once they're on the LAN.
I would like to note that I know this has been done before, I just don't have experience with this particular model.
One idea I thought about was using GUIDs as table keys. So for example, a Purchase Order would not have a number (auto-numeric) but a GUID instead, so that every offline client can generate those, and I don't have clashes when I connect back to the DB.
Is this a bad idea for some reason?
Will access to these tables through the GUID key be slow?
Have you had experience with these type of systems? How have you solved this problem?
Thanks!
Daniel
Using Guids as primary keys is acceptable and is considered a fairly standard practice for the same reasons that you are considering them. They can be overused which can make things a bit tedious to debug and manage, so try to keep them out of code tables and other reference data if at all possible.
The thing that you have to concern yourself with is the human readable identifier. Guids cannot be exchanged by people - can you imagine trying to confirm your order number over the phone if it is a guid? So in an offline scenario you may still have to generate something - say, a publisher (workstation/user) id plus some sequence number, so the order number may be 123-5678.
However, this may not satisfy the business requirement of having a sequential number. In fact, regulatory requirements can be an influence - some regulations (SOX maybe) require that invoice numbers are sequential. In such cases it may be necessary to generate a sort of proforma number which is fixed up later, when the systems synchronise. You may end up with tables having OrderId (Guid), OrderNo (int), ProformaOrderNo (varchar) - some complexity may creep in.
At least having guids as primary keys means that you don't have to do a whole lot of cascading updates when the sync does eventually happen - you simply update the human readable number.
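A rough sketch of such a table in SQL Server (only the three columns named above come from the answer; everything else is assumed):

CREATE TABLE dbo.Orders (
    OrderId         uniqueidentifier NOT NULL
        CONSTRAINT PK_Orders PRIMARY KEY
        CONSTRAINT DF_Orders_OrderId DEFAULT NEWID(),
    ProformaOrderNo varchar(20) NOT NULL,  -- e.g. '<workstation>-<sequence>', generated while offline
    OrderNo         int         NULL       -- the real sequential number, fixed up after synchronisation
);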
#SqlMenace
There are other problems with GUIDs: you see, GUIDs are not sequential, so inserts will be scattered all over the place; this causes page splits and index fragmentation.
Not true. Primary key != clustered index.
If the clustered index is another column ("inserted_on" springs to mind) then the inserts will be sequential and no page splits or excessive fragmentation will occur.
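For example (a sketch with invented names), keep the GUID primary key NONCLUSTERED and cluster on an ever-increasing column so that inserts land at the end of the clustered index:

CREATE TABLE dbo.PurchaseOrder (
    OrderId    uniqueidentifier NOT NULL
        CONSTRAINT PK_PurchaseOrder PRIMARY KEY NONCLUSTERED
        CONSTRAINT DF_PurchaseOrder_OrderId DEFAULT NEWID(),
    InsertedOn datetime NOT NULL
        CONSTRAINT DF_PurchaseOrder_InsertedOn DEFAULT GETDATE()
);

CREATE CLUSTERED INDEX CIX_PurchaseOrder_InsertedOn
    ON dbo.PurchaseOrder (InsertedOn);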
This is a perfectly good use of GUIDs. The only drawbacks would be a slight complexity in working with GUIDs over INTs and the slight size difference (16 bytes vs 4 bytes).
I don't think either of those are a big deal.
Will access to these tables through the GUID key be slow?
There are other problems with GUIDs: you see, GUIDs are not sequential, so inserts will be scattered all over the place; this causes page splits and index fragmentation.
In SQL Server 2005, MS introduced NEWSEQUENTIALID() to fix this; the only problem for you might be that you can only use NEWSEQUENTIALID() as a default value in a table.
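For example (illustrative table name), the sequential variant has to be wired in as a column default:

CREATE TABLE dbo.SyncedOrder (
    OrderId   uniqueidentifier NOT NULL
        CONSTRAINT DF_SyncedOrder_OrderId DEFAULT NEWSEQUENTIALID()  -- only allowed as a default
        CONSTRAINT PK_SyncedOrder PRIMARY KEY,
    CreatedOn datetime NOT NULL DEFAULT GETDATE()
);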
You're correct that this is an old problem, and it has two canonical solutions:
Use unique identifiers as the primary key. Note that if you're concerned about readability you can roll your own unique identifier instead of using a GUID. A unique identifier will use information about the date and the machine to generate a unique value.
Use a composite key of 'Actor' + identifier. Every user gets a numeric actor ID, and the keys of newly inserted rows use the actor ID as well as the next available identifier. So if two actors both insert a new row with ID "100", the primary key constraint will not be violated.
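For illustration, a sketch of the second, composite-key option (all names invented):

CREATE TABLE dbo.OfflineOrder (
    ActorId int NOT NULL,  -- assigned once per offline user/workstation
    LocalId int NOT NULL,  -- next available number generated on that actor's machine
    CONSTRAINT PK_OfflineOrder PRIMARY KEY (ActorId, LocalId)
);

-- Two actors can both insert id 100 without violating the key
INSERT INTO dbo.OfflineOrder (ActorId, LocalId) VALUES (1, 100);
INSERT INTO dbo.OfflineOrder (ActorId, LocalId) VALUES (2, 100);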
Personally, I prefer the first approach, as I think composite keys are really tedious as foreign keys. I think the human readability complaint is overstated -- end-users shouldn't have to know anything about your keys, anyways!
Make sure to utilize guid.comb - takes care of the indexing stuff. If you are dealing with performance issues after that then you will be, in short order, an expert on scaling.
Another reason to use GUIDs is to enable database refactoring. Say you decide to apply polymorphism or inheritance or whatever to your Customers entity. You now want Customers and Employees to derive from Person and have them share a table. Having really unique identifiers makes data migration simple. There are no sequences or integer identity fields to fight with.
I'm just going to point you to What are the performance improvement of Sequential Guid over standard Guid?, which covers the GUID talk.
For human readability, consider assigning machine IDs and then using sequential numbers from those machines as a possibility. This will require managing the assignment of machine IDs, though. Could be done in one or two columns.
I'm personally fond of the SGUID answer, though.
Guids will certainly be slower (and use more memory) than standard integer keys, but whether or not that is an issue will depend on the type of load your system will see. Depending on your backend DB there may be issues with indexing guid fields.
Using guids simplifies a whole class of problems, but you pay for it partly with performance and also with debuggability - typing guids into those test queries will get old real fast!
The backend will be SQL Server 2005
Frontend / Application Logic will be .Net
Besides GUIDs, can you think of other ways to resolve the "merge" that happens when the offline computer syncs the new data back into the central database?
I mean, if the keys are INTs, I'll basically have to renumber everything when importing. GUIDs will spare me that.
Using GUIDs saved us a lot of work when we had to merge two databases into one.
If your database is small enough to download to a laptop and work with it offline, you probably don't need to worry too much about the performance differences between ints and Guids. But do not underestimate how useful ints are when developing and troubleshooting a system! You will probably need to come up with some fairly complex import/synch logic regardless of whether or not you are using Guids, so they might not help as much as you think.
#Simon,
You raise very good points. I was already thinking about the "temporary" "human-readable" numbers I'd generate while offline and recreate on sync. But I wanted to avoid doing that with foreign keys, etc.
I would start by looking at SQL Server Compact Edition for this! It helps with all of your issues.
Data Storage Architecture with SQL Server 2005 Compact Edition
It is specifically designed for field force applications (FFAs). FFAs usually share one or more of the following attributes:
They allow the user to perform their job functions while disconnected from the back-end network - on-site at a client location, on the road, in an airport, or from home.
FFAs are usually designed for occasional connectivity, meaning that when users are running the client application, they do not need to have a network connection of any kind. FFAs often involve multiple clients that can concurrently access and use data from the back-end database, both in connected and disconnected mode.
FFAs must be able to replicate data from the back-end database to the client databases for offline support. They also need to be able to replicate modified, added, or deleted data records from the client to the server when the application is able to connect to the network.
First thought that comes to mind: Hasn't MS designed the DataSet and DataAdapter model to support scenarios like this?
I believe I read that MS changed their ADO recordset model to the current DataSet model so it works great offline too. And there's also this Sync Services for ADO.NET
I believe I have seen code that utilizes the DataSet model which also uses foreign keys, and they still sync perfectly when using the DataAdapter. I haven't tried out the Sync Services, though, but I think you might be able to benefit from that too.
Hope this helps.
#Portman By default PK == clustered index: creating a primary key constraint will automatically create a clustered index; you need to specify NONCLUSTERED if you don't want it clustered.