Creating a Non-Int and Non-Guid Unique Identifier - sql-server

I'm looking for a way SQL Server can generate a unique identifier that is not an incrementing int or a GUID.
The unique ID can be a combination of letters and numbers with no other characters, and, as mentioned, it must be unique.
e.g. AS93K239DFAK
If possible it must always start with AS or end with a K.
It would be nice if this unique ID could be generated automatically on insert, the way GUIDs and IsIdentity = Yes do. It can be a random value; it is not predetermined in the app.
Is doing something like this possible, or does it have to be generated application-side?

From comments, it sounds like you would be OK with using an IDENTITY field and padding it with 0s and adding a prefix/suffix. Something like this should work:
1 - Add an IDENTITY field which will be auto-incremented
2 - Add a calculated field in the table with the definition of:
[InvoiceNo] AS ('AS' + RIGHT(('000000000' + CAST(idfield AS varchar(9))), 9) + 'FAK')
This will give you InvoiceNo values in the format:
AS000000001FAK
AS000000002FAK
...
AS000995481FAK
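Sketched out as T-SQL, the whole table might look something like this (the table and extra column names are placeholders, not from the question):

-- Minimal sketch: IDENTITY column plus a computed InvoiceNo column
CREATE TABLE dbo.Invoices
(
    InvoiceId    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- auto-incremented
    CustomerName VARCHAR(100) NOT NULL,
    -- computed column: prefix + zero-padded id + suffix
    InvoiceNo AS ('AS' + RIGHT('000000000' + CAST(InvoiceId AS VARCHAR(9)), 9) + 'FAK')
);

INSERT INTO dbo.Invoices (CustomerName) VALUES ('Acme');
SELECT InvoiceId, InvoiceNo FROM dbo.Invoices;   -- 1, AS000000001FAK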

I've never seen a randomly generated invoice number. Most of them are usually a combination of multiple identifying fields. For example, one segment might be the CompanyID, another might be the InvoiceID, and a third might be a date value.
For example, AS-0001-00005-K or AS-001-00005-021712-K, which would stand for CompanyId 1, Invoice #5, generated on 2/17/12
You said in a comment that you don't want to let the company know a count of how many past invoices there are; this way they won't know the count except for how many invoices they themselves have received, which is a value they should know anyway.
If you're concerned about giving away how many companies there are, use an alpha company code instead, so your end result looks like AS-R07S-00005-K or ASR07S00005K

So you can do it this way, just don't expect it to perform well.
(1) populate a big massive table with some exhaustive set of invoice values - which should be at least double the number of invoices you think you'll ever need. Populate the data in random order in advance.
(2) create a stored procedure that pulls the next invoice off the pile, and then either deletes it or marks it as taken.
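A rough T-SQL sketch of step (2), with hypothetical table and procedure names (the pool table is assumed to have been filled in advance, in random order):

CREATE TABLE dbo.InvoicePool
(
    InvoiceNo VARCHAR(20) NOT NULL PRIMARY KEY,
    Taken     BIT NOT NULL DEFAULT 0
);
GO

CREATE PROCEDURE dbo.TakeNextInvoiceNo
    @InvoiceNo VARCHAR(20) OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    -- take one free number off the pile and mark it as taken in a single atomic statement
    UPDATE TOP (1) dbo.InvoicePool WITH (READPAST, ROWLOCK)
    SET    Taken = 1,
           @InvoiceNo = InvoiceNo
    WHERE  Taken = 0;
END;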
But, be sure that this solution makes sense for your business. In many countries it is actually law for invoice numbers to be sequential. I'm guessing we're not really talking about invoices, but wanted to make sure it's at least considered.

What is so confusing about the random part being unique? If you have a two-digit invoice number there can only be 100 unique values (00-99). A GUID has 2^128 possible values and is statistically unique. If you use 8 characters of a GUID, then even with 1 million invoices you have a betting chance of getting a collision. With 1 million invoices, if you use 12 characters of a GUID you have a very good chance of NOT getting a collision. If you use 16 characters of a GUID you are pretty much statistically unique if you have fewer than 1 billion invoices. I would use 12 characters but check against actual values for uniqueness, and then you only have a lottery-ticket chance of getting a collision.
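For example, one way to take 12 characters of a GUID in T-SQL and retry on the rare collision (the Invoices table and InvoiceNo column are assumptions for illustration):

DECLARE @InvoiceNo VARCHAR(12);
SET @InvoiceNo = LEFT(REPLACE(CONVERT(VARCHAR(36), NEWID()), '-', ''), 12);

-- check against actual values; regenerate in the (lottery-odds) event of a collision
WHILE EXISTS (SELECT 1 FROM dbo.Invoices WHERE InvoiceNo = @InvoiceNo)
    SET @InvoiceNo = LEFT(REPLACE(CONVERT(VARCHAR(36), NEWID()), '-', ''), 12);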

How are you inserting these new invoices to the table? A straight up batch insert or are you doing some business logic/integrity checks in a stored procedure first and 'creating' the invoices one by one?
In the second case, you could easily build a unique ID in the procedure. You could store a seed number in a table, take the number from there, cast it as a varchar, and append the alphanumeric characters; you can then increment the seed. This also gives you the option of creating a gap between unique IDs if you need to import some records into the gap at a later date.
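A sketch of that seed-table approach in T-SQL (all object names here are made up for illustration):

-- hypothetical seed table holding the next number to hand out
CREATE TABLE dbo.InvoiceSeed (NextValue INT NOT NULL);
INSERT INTO dbo.InvoiceSeed (NextValue) VALUES (1);
GO

CREATE PROCEDURE dbo.NewInvoiceId
    @InvoiceId VARCHAR(20) OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @n INT;

    -- read and increment the seed in one atomic statement
    UPDATE dbo.InvoiceSeed
    SET    @n = NextValue,
           NextValue = NextValue + 1;

    -- cast to varchar and append the alphanumeric characters
    SET @InvoiceId = 'AS' + RIGHT('000000000' + CAST(@n AS VARCHAR(9)), 9) + 'K';
END;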

Related

databases and frontend: load balancing for analyzing data

I have a scraper which collects news articles throughout the day from different sources.
I want to display data like 'most common words in the last 30 days (in source X)' on my page.
For now I have saved the articles to my database, storing the timestamp the article was released and a string of the content.
With a few datasets this works fine, but I do not understand how to balance the load so that the front end has the most flexibility but not too much data to count.
I thought you could run a script which takes all the articles from one day and creates a new table containing each word with its count. I came across two points here:
1 - How do I create a table for this? Since every article has a different length and a different set of words, I would need a table with as many fields as the number of words in the longest article. I could say I will only save the first 20, but I don't really like the idea.
2 - If the script takes all the articles from one day and calculates the word counts, I have a minimum resolution of 1 day, so I won't be able to differentiate any further. I chose to run the script per day to reduce the data that I will need to send to the front end on demand.
Don't create a table with a separate column for each of the first 20 words. Please. I beg you. Just don't.
Two possible approaches.
Use a fulltext search feature in your DBMS. You didn't tell us which one you use, so it's hard to be more specific.
Preprocess: Create a table with columns article_id, word_number, and word. This table will have a large number of rows, one for each word in each article. But that's OK. SQL databases are made for handling vast tables of simple rows.
The unique key on the table contains two columns: article_id and word_number. A non-unique key for searching should contain word, article_id, word_number.
When you receive an incoming article, assign it an article_id number. Then break it up into words and insert each word into the table.
When you search for a word do SELECT article_id FROM words WHERE word=?. Fast. And you can use SQL set manipulation to do more complex searches.
When you remove an article from your archive, DELETE the rows with that article_id value.
To get frequencies do SELECT COUNT(*) frequency, word FROM words GROUP BY word ORDER BY 1 DESC LIMIT 50.
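Putting the pieces together, a minimal sketch of that table and the queries (generic SQL; the articles table and its released_at column are assumptions, and the date arithmetic is MySQL-flavoured since you didn't say which DBMS you use):

-- one row per word occurrence; (article_id, word_number) is the unique key
CREATE TABLE words (
    article_id  INT          NOT NULL,
    word_number INT          NOT NULL,
    word        VARCHAR(100) NOT NULL,
    PRIMARY KEY (article_id, word_number)
);
CREATE INDEX words_word ON words (word, article_id, word_number);

-- which articles contain a given word?
SELECT article_id FROM words WHERE word = 'economy';

-- 50 most frequent words over the last 30 days
SELECT COUNT(*) AS frequency, w.word
FROM words w
JOIN articles a ON a.article_id = w.article_id
WHERE a.released_at >= NOW() - INTERVAL 30 DAY
GROUP BY w.word
ORDER BY frequency DESC
LIMIT 50;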

Am I Properly Normalizing this Data

I am completing normalization exercises from the web to test my abilities to normalize data. This particular problem was found at: https://cs.senecac.on.ca/~dbs201/pages/Normalization_Practice.htm (Exercise 1)
The table this problem is based on is as follows:
The unnormalized table that can be created from this table is:
To comply with First Normal form, I have to get rid of repeating fields in the table by moving visitdate, procedure_no, and procedure_name to their own respective tables:
This also complies with 2NF and 3NF which makes me question whether I have performed the process of normalization correctly. Please provide feedback if I did not properly move from UNF to 1NF.
In a first step you could create the following tables (assuming pet_id is unique in the table):
Pets: pet_id, pet_name, pet_type, pet_age, owner
Visits: pet_id, visit_date, procedure
Going further you could split procedure since the description is repeating:
Pets: pet_id, pet_name, pet_type, pet_age, owner
Visits: pet_id, visit_date, procedure_id
Procedures: procedure_id, description
Although there can be multiple procedures on the same visit_date for the same pet_id, I see no reason to split those further: a date could (in theory) be stored in 2 bytes, and splitting that data would create more overhead (plus an extra index).
You would also want to change pet_age to pet_birth_date since the age changes over time.
Since this is the first exercise in your list, the above will probably be more than enough.
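A rough SQL sketch of that split (the column types are assumptions for illustration, not part of the exercise):

CREATE TABLE Pets (
    pet_id         INT PRIMARY KEY,
    pet_name       VARCHAR(50),
    pet_type       VARCHAR(20),
    pet_birth_date DATE,          -- instead of pet_age, which changes over time
    owner          VARCHAR(50)
);

CREATE TABLE Procedures (
    procedure_id INT PRIMARY KEY,
    description  VARCHAR(100)
);

CREATE TABLE Visits (
    pet_id       INT  NOT NULL REFERENCES Pets (pet_id),
    visit_date   DATE NOT NULL,
    procedure_id INT  NOT NULL REFERENCES Procedures (procedure_id),
    PRIMARY KEY (pet_id, visit_date, procedure_id)
);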
Going even further:
An owner can have multiple pets, so another table could be created:
Pet_owners: owner_id, owner_name
and then only use owner_id in the Pets table. In a real system there would be customer_id, name, address, phone, email, etc. - so that should always be in a separate table.
You could even do the same for pet_type and store the id in 1 or 2 bytes, but it all depends on the type of queries you want to do later on the data.
The question is poorly presented. Look at the last two columns. The askers do not mean that each column's types are sets. They mean that pairs of values on the same line make an element of a set. They should have had one column whose values were triplets--date, number & name. That's what they did when they used just one column (the last one) for number & name. Notice that their solution in the pdf linked to by the page you link to has a table that has all three of date, number & name.
But how are you supposed to know that the values should be paired? After all if the date column gave the set of a pet's visit dates & the procedure column gave the set of procedure number & names a pet ever had then we wouldn't be supposed to take a pair of values on the same line as an element of a set. Unfortunately you are just supposed to magically guess correctly. (A hint is that the number of dates & number-name pairs for a pet are always the same.)
The above took the blank areas in the illustration to be there to make room for the vertical display of set-valued attributes; the portrayed table has 4 rows. But maybe they are there because you are supposed to get a relation from this illustration by interpreting a blank subrow as representing the most recent non-blank subrow. Then the table wouldn't have any set-valued columns; the portrayed table has 9 rows. It happens that this interpretation disagrees with the linked answer's UNF & 1NF sections.
If they weren't going to explain the table & were just relying on your guesses, it would have been clearer if they put a visit's procedure date, number & name under one column--just as they put a procedure number & name in one column. But really, they should always tell you how to read the illustration. And really, you should always ask how to read an illustration. If you have any interpretation conventions from a related course/textbook then you should have put them in your question for us to know.
Unfortunately "UNF" tables are almost always similarly poorly given without any description about how they are to be interpreted. Also "1NF" has no standard meaning & there is no standard notion of "normalizing to 1NF".

Need help in developing DB logic

This is a mini-project of mine - an airline reservation system - let's call this airline FlyMi. I have a database (not decided which one; a friend of mine wants to go with MongoDB). Anyhow, this is my requirement:
I have a table which has details of the flight - flight number, schedule, etc. I'm going to use this table to perform various operations - booking, cancellation, modification.
This is where I'm stuck: for the desktop app and the web application, I'm offering an option to select seats. This means I've got to keep track of which seats are booked and which ones are not. And assume I have a UI which shows seats as Red - Booked, Green - Not Booked. And all of this for each and every flight. My question is: what do you think would be the most efficient way to track seat bookings for each flight in that airline?
Current idea: keep a table named passenger - with all the details such as name, address, etc. - which keeps track of all passengers, and maintain a passenger ID such that the first 4 characters are the flight ID and the last 2 characters are the seat number they have chosen, with a random number in between (I say random because I think it is immaterial here). So, for any flight, if I have to find out the number of un-booked seats, I will have to scan through every passenger who has booked, and check who has booked on that flight. I think this is really inefficient. Provide me with the most efficient logic to do this.
Don't use "smart keys".
This is a bad idea called "smart keys" or "encoding information in keys".
See this answer which contains this excerpt:
Despite it now being easy to implement a Smart Key, it is hard to recommend that you create one of your own that isn't a natural key, because they tend to eventually run into trouble, whatever their advantages, because it makes the databases harder to refactor, imposes an order which is difficult to change and may not be optimal for your queries, requires a string comparison if the Smart Key includes non-numeric characters, and is less effective than a composite key in helping range-based aggregations. It also violates the basic relational guideline that every column should store atomic values
Smart Keys also tend to outgrow their original coding constraints
(Notice that seat locations are typically identified by smart keys in that they are a row number and a count across a row. But they are also typically visibly physically permanently bolted into that formation. And imagine if they were labelled and rearranged.)
Educate yourself about database design.
Just describe your business in the most straightforward terms. That is how relational model databases & DBMSs work.
Find enough fill-in-the-[named-]blanks sentence templates to describe your business situations:
"customer [cid] has name [firstname] [lastname]
AND customer [cid] has a phone number [phonenumber] of type [type] ..."
"customer [cid] can use credit card #[card_no]"
"seat [seatid] is at row [row] and column [column]"
"seat [seatid] is booked"
"seat [seatid] is temporarily committed to an unfinished booking"
...
For each such parameterized sentence template (aka predicate) have a base table where the names of the blanks/parameters are column names. Each row in a table states the statement (proposition) got from filling in the blanks per its column values; each row not in a table states NOT the statement from filling in the blanks per its column values.
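As a rough sketch, a couple of the seat predicates above might become base tables like these (flightid and cid are added here as assumptions to make the sketch self-contained, and seat_row/seat_col are used instead of row/column to avoid reserved words):

-- "seat [seatid] is at row [seat_row] and column [seat_col]"
CREATE TABLE seat (
    seatid   VARCHAR(10) NOT NULL PRIMARY KEY,
    seat_row INT NOT NULL,
    seat_col INT NOT NULL
);

-- "seat [seatid] on flight [flightid] is booked by customer [cid]"
CREATE TABLE booking (
    flightid VARCHAR(10) NOT NULL,
    seatid   VARCHAR(10) NOT NULL REFERENCES seat (seatid),
    cid      INT NOT NULL,
    PRIMARY KEY (flightid, seatid)   -- a seat is booked at most once per flight
);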
Then for each table find every functional dependency (FD) that holds. (When a predicate can be expressed in the form "... AND column = F(column1,...)", we say that the column set {column1,...} functionally determines column, ie that the FD {column1,...} → column holds.) Then identify every candidate key (CK). (A superkey is a column set that functionally determines every column, ie that is unique, ie where each subrow of values for those columns appears in only one row of a table. A CK is a superkey that doesn't contain a smaller superkey.) Then find every join dependency (JD). (Some predicates say "... AND ..." for some number of ANDs & "..."s. There is a JD when the table for each predicate "..." would look like what you get from taking only its columns from the original table.) Note that every FD comes with an associated (binary) JD.
Then normalize your tables to fifth normal form (5NF). This means decomposing (ie replacing a table in which a JD "... AND ..." holds by tables whose predicates are the "..."s) until each JD that holds is implied by the CKs (ie must hold when the JDs from the FDs from the CKs hold). (For performance reasons one can also then denormalize by combining into base tables that aren't in 5NF.)
See this answer and this one.
Then we query by describing the rows we want. We do this by connecting base table predicates with logical operators (ie AND, OR, NOT, FOR SOME, FOR ALL etc) and function calls to give the predicates for the tables we want and/or by connecting base table names by relation operators (ie JOIN, UNION, MINUS/EXCEPT, PROJECT/SELECT, RENAME/AS) to give the values of the tables we want and/or both (eg RESTRICT/WHERE).
The JOIN of two tables holds the rows that make a true statement from, ie has as predicate, the AND of their predicates; and the UNION the OR, the MINUS/EXCEPT the AND NOT; and that PROJECT/SELECT columns of a table puts FOR SOME all-other-columns before its predicate; and RESTRICT/WHERE puts AND condition after its predicate; and the RENAME/AS of column renames that parameter in its predicate. So a table expression corresponds to a predicate: A table (base table or query result) value contains the rows that make a true statement from its (base table's or query expression's) predicate.
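For example, with the sketch tables above, "unbooked seats on a flight" is just the seat predicate AND NOT the booking predicate (the flight number here is made up):

-- seats on flight 'FM101' that are not booked
SELECT s.seatid, s.seat_row, s.seat_col
FROM seat s
WHERE NOT EXISTS (
    SELECT 1
    FROM booking b
    WHERE b.seatid = s.seatid
      AND b.flightid = 'FM101'
);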
See this answer.
The same goes for constraints, which are true statements that collectively describe the application situations and database states than can arise given the situations that can arise and the base table predicates.
See this answer.

Dynamic PIVOT with varchar columns

I'm trying to pivot rows into columns. I basically have lots of lines where every N rows make up a row in a table I'd like to list as a result set. I'll give a short example:
I have a table structure like this:
Keep in mind that I removed lots of rows to simplify this example. Every 6 rows means 1 row in the result set, which I would like to be like this:
All columns are varchar types (that's why I couldn't get it done with PIVOT).
The number of columns is dynamic; it equals the number of rows in the source table.
Logically, the number of rows in the result set is equally dynamic.
(Not really an answer, but it's what I've got.)
This is a name/value pair table, right? Your query will require something that identifies which "set" of rows is associated with one another. Without something like this, I don't see how the query can be written. The key factor is that you must never assume that data will be returned from SQL (Server, at least) in any particular order. How the data is stored internally generally, but not always, determines how it is returned when order is not specified.
Another consideration: what if (when?) a row is missing -- say, Product 4 has no Price B row? That would break a simple "every six rows" rule. "Start fresh with every new Code row" would cause problems if a Code is missed or when (not if) data is not returned in the anticipated order.
If you have some means of grouping items, let us know in an updated question, but otherwise I don't think this one is particularly solvable.
I actually did it.
I wrote a SQL WHILE loop based on the number of columns registered for the result set. This way I could write a dynamic SQL clause for N columns based on the values read. In the end I just inserted the result set into a temp table, and voilà.
Thanks anyway!
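For reference, a rough T-SQL sketch of this kind of dynamic approach (using a dynamic PIVOT rather than the WHILE loop described above, and assuming a name/value source shaped like EAV(RowID, ColName, ColValue) where RowID groups the rows that belong to one result row; all names here are assumptions, since the original table wasn't shown):

DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- build the column list dynamically from the distinct names in the source
SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(ColName)
    FROM (SELECT DISTINCT ColName FROM EAV) AS d
    FOR XML PATH('')), 1, 1, '');

SET @sql = N'SELECT RowID, ' + @cols + N'
             FROM EAV
             PIVOT (MAX(ColValue) FOR ColName IN (' + @cols + N')) AS p;';

EXEC sp_executesql @sql;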

Looking for data in a table through columns with categories/state values

Say we have a fruits table that has a high number of reads and also inserts, but almost no updates or deletes.
We have 2 columns that store values with a small number of options. Let's say category [banana, apple, orange or pear] and status [finished, ongoing, spoiled, destroyed or ok].
Finally, we have a column for the owner's last name.
Notes:
I am going to search sometimes by category and other times by status.
In all cases, last name will be used in the search.
I will always perform an exact match on category/status but a starts-with match on last name.
Ex of common queries:
SELECT * FROM fruit_table WHERE category='BANANA' and last_name LIKE 'Cool%'
SELECT * FROM fruit_table WHERE status='Spoiled' and last_name LIKE 'Co%'
SELECT * FROM fruit_table WHERE category='BANANA' and last_name LIKE 'smith%'
How can I prepare it so we have low response time? Will an index help (taking into account that the values in the columns are not dispersed at all)? Might a bitmap index help here? What about partitioning?
Finally, apologies about the title; I did not know how to formulate it properly.
Bitmap indexes help immensely with items that have a limited number of available choices.
A standard b-tree index (or non-clustered index in SQL Server) will work well for the last_name column.
I would do those two first, as they are easy and then see how things work.
It is generally a bad practice to prematurely optimize. However, adding indices is quick way to increase speed without much effort. For more information on indices in Oracle, read this question.
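A sketch of those indexes (Oracle syntax for the bitmap indexes, since that is where bitmap indexes exist; the index names are made up):

-- bitmap indexes suit the low-cardinality category and status columns
CREATE BITMAP INDEX fruit_category_bx ON fruit_table (category);
CREATE BITMAP INDEX fruit_status_bx   ON fruit_table (status);

-- plain b-tree index for the starts-with searches on last_name
CREATE INDEX fruit_last_name_ix ON fruit_table (last_name);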
