I am looking for some assistance in using Soundex() and Difference() in SQL Server with the goal of comparing one observation in a column to another observation in that same column.
Here is some context on how the data looks. All of these street addresses are intended to represent the same place, but human error has changed a small portion of each so they aren't exactly the same:
Street Address                    zip      state
1440 thisisastreetaddress ave     12345    OH
1440 thisismystreetaddress st     12345    OH
1440 thisisthestreetaddress rd    12345    OH
The goal would be to use Soundex() and Difference() to group addresses that are similar to each other. The problem is that the table has over 2 million customers, so the solution can't burn massive amounts of CPU. The end product doesn't have to be pretty, and I don't mind sorting manually (which I've realized may need to happen to an extent), but I would appreciate the script grouping them for me.
Let me know if it's possible with Soundex() and Difference() and, if it's not with those two, any other functions or approaches you might try. It may also be worth mentioning that this can be a multi-tool solution, as the team and I know Python, R, Power BI, etc.
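For reference, here's roughly what I have in mind, sketched in Python (data invented; the soundex() below just follows the classic American Soundex rules, similar in spirit to SQL Server's SOUNDEX(), and the two-character prefix is one possible way of loosening the match):

```python
# Hypothetical sketch: bucket similar addresses by (zip, house number, soundex prefix
# of the street word). Bucketing by a hashable key is a single cheap pass, which
# matters at 2M rows.

from collections import defaultdict

def soundex(word):
    """Classic American Soundex: keep the first letter, encode the rest as digits."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    out, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        if ch not in "hw":          # h/w do not reset the adjacent-duplicate check
            prev = code
    return (word[0].upper() + "".join(out) + "000")[:4]

# Hypothetical rows: (street_address, zip, state)
rows = [
    ("1440 thisisastreetaddress ave", "12345", "OH"),
    ("1440 thisismystreetaddress st", "12345", "OH"),
    ("9 otherplace rd", "99999", "OH"),
]

buckets = defaultdict(list)
for street, zip_code, state in rows:
    parts = street.split()
    house, name = parts[0], parts[1]
    # Using only the first two characters of the code loosens the grouping.
    buckets[(zip_code, house, soundex(name)[:2])].append(street)
```

Within each bucket a DIFFERENCE()-style comparison (or a manual pass) would then decide which rows really belong together.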
Thanks!
Related
I'm trying to query some data in Postgres and I'm wondering how I might use some sort of pattern matching not merely to select rows - e.g. SELECT * FROM schema.tablename WHERE varname ~ 'phrase' - but to select columns in the SELECT statement, specifically doing so based on the common names of those columns.
I have a table with a bunch of estimates of rates over the course of many years - say, of apples picked per year - along with upper and lower 95% confidence intervals for each year. (For reference, each estimate and 95% CI comes from a different source - a written report, to be precise; these sources are my rows, and the table describes various aspects of each source. Based on a critique from below, I think it's important that the reader know that the unit of analysis in this relational database is a written report with different estimates of things picked per year - apples in one table, oranges in another, pears in a third.)
So in this table, each year has three columns / variables:
rate_1994
low_95_1994
high_95_1994
The thing is, the CIs are mostly null - they haven't been filled in. In my query, I'm really only trying to pull out the rates for each year: all the variables that begin with rate_. How can I phrase this in my SELECT statement?
I'm trying to employ regexp_matches to do this, but I keep getting back errors.
I've done some poking around StackOverflow on this, and I'm getting the sense that it may not even be possible, but I'm trying to make sure. If it isn't possible, it's easy to break up the table into two new ones: one with just the rates, and another with the CIs.
(For the record, I've looked at posts such as this one:
Selecting all columns that start with XXX using a wildcard? )
Thanks in advance!
If what you are basically asking is whether columns can be selected dynamically based on an execution-time condition:
No.
You could, however, use PL/pgSQL to build up the query as a string and then run it with EXECUTE (the Postgres counterpart of Oracle's EXECUTE IMMEDIATE).
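A rough illustration of that build-a-string approach, using Python's stdlib sqlite3 in place of Postgres (table and column names invented to match the question; in Postgres the catalog lookup would query information_schema.columns rather than PRAGMA table_info):

```python
# Sketch: read the column list from the catalog, keep the ones matching a pattern,
# and build the SELECT string, exactly as a PL/pgSQL EXECUTE would.

import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE estimates (source TEXT, rate_1994 REAL, "
             "low_95_1994 REAL, high_95_1994 REAL, rate_1995 REAL)")
conn.execute("INSERT INTO estimates VALUES ('report A', 1.5, NULL, NULL, 1.7)")

# Catalog lookup: column names in declaration order.
cols = [row[1] for row in conn.execute("PRAGMA table_info(estimates)")]
rate_cols = [c for c in cols if re.match(r"^rate_", c)]

# Build and run the dynamic query.
query = "SELECT {} FROM estimates".format(", ".join(rate_cols))
rows = conn.execute(query).fetchall()
```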
I'm creating a small game composed of weapons. Weapons have characteristics, like accuracy. When a player crafts a weapon, a value between a minimum and a maximum is generated for each characteristic. For example, the accuracy of a new gun is a number between 2 and 5.
My question is... should I store the minimum and maximum value in the database or should it be hard coded in the code ?
I understand that putting them in the database lets me change these values easily; however, they won't change very often, and doing this means making a database request whenever I need them. It also means having many more tables. Still, is it good practice to store this directly in the code?
In conclusion, I really don't know which solution to choose, as both have advantages and disadvantages.
If you have attributes of an entity, then you should store them in the database.
That is what databases are for, storing data. I can see no advantage to hardcoding such values. Worse, the values might be used in different places in your code. And, when you update them, you might end up with inconsistent values throughout the code.
EDIT:
If these are default values, then I can imagine storing them in the code along with all the other information about the weapon -- name of the weapon, category, and so on. Those values are the source information for the weapons.
I still think it would be better to have a Weapons table or WeaponDefaults table so these are in the database. Right now, you might think the defaults are only used in one place. You would be surprised how software can grow. Also, having them in the database makes the values more maintainable.
I would have to agree with @Gordon_Linoff.
I don't think you will end up with "way more tables", maybe one or two. If you had a table with fields ID, Weapon, Min, Max...
...then you could do a recordset search when needed. As you said, these values might never change, but changing them in a single spot seems much more admin-friendly than scouring code you have left alone for a long time. My two cents' worth.
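As a rough sketch of that table (names like WeaponDefaults, min_val, and max_val are invented), using Python's stdlib sqlite3:

```python
# Minimal sketch of the "defaults in the database" approach. Crafting reads min/max
# from the table and rolls a value in that range, so the ranges live in one place
# instead of being scattered through the code.

import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WeaponDefaults ("
             "weapon TEXT, characteristic TEXT, min_val INTEGER, max_val INTEGER)")
conn.execute("INSERT INTO WeaponDefaults VALUES ('gun', 'accuracy', 2, 5)")

def craft(conn, weapon):
    """Roll a concrete value for each characteristic of the given weapon."""
    rows = conn.execute(
        "SELECT characteristic, min_val, max_val FROM WeaponDefaults WHERE weapon = ?",
        (weapon,))
    return {name: random.randint(lo, hi) for name, lo, hi in rows}

gun = craft(conn, "gun")   # e.g. {'accuracy': 3}
```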
I have 2 tables of UK postal addresses (around 300000 rows each) and need to match one set to another in order to return a unique ID contained in first set for each address.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.), but there are many unmatched records left that are proving difficult to handle. I might end up with as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and at ranking word for word, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
When they first saw the list, everyone thought the job was far too huge to be manageable. Then they started going through it, and found it was much faster than they'd thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up nearly as long as the source list, with as many exceptions as it generates. Don't try to automate it perfectly; automate the easy stuff and put a human in the loop for the uncertain cases. Much easier and safer.
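If it helps, the core of such a matcher can be sketched in a few lines of Python (thresholds and data invented; a real pipeline would add stemming, componentising, and the rest):

```python
# Sketch of the candidate matcher described above: normalise, compute an edit
# distance, auto-accept only very close pairs, and queue the rest for human review.

def levenshtein(a, b):
    """Classic edit distance: 'bar' and 'car' are 1 apart."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalise(addr):
    """Strip spaces and punctuation so formatting noise doesn't count as distance."""
    return "".join(ch for ch in addr.lower() if ch.isalnum())

def classify(a, b, auto_accept=1, review_up_to=4):
    d = levenshtein(normalise(a), normalise(b))
    if d <= auto_accept:
        return "match"
    if d <= review_up_to:
        return "human review"
    return "no match"
```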
As this question is sure to reveal, I'm relatively new to database design. Please pardon my lack of understanding.
I want to store values like the following (example from Google calendar) in a database:
What's the best way to do this? Would this be one database field or several?
If the former, does this disobey normalization rules?
Thanks!
I suggest you create a many-to-many relation; you can achieve that by separating the columns in a more logical way (normalization). For the example above:
You should have a table called "schedules" (or whatever makes sense to you), another something like "repeat_on", and a third table called "days" (here you have Monday-Sunday with their IDs). In the middle table (repeat_on) you create foreign keys to the other two tables (schedule_id and day_id) to do the magic.
This way you can combine whatever you want, for example:
schedule   day
    1       1
    1       3
    1       7
Meaning that you have to do the same thing on Monday, Wednesday, and Sunday.
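Sketched with Python's stdlib sqlite3 (table names follow the answer; the schedule row itself is invented):

```python
# The three-table design: schedules, days, and the junction table repeat_on
# carrying the two foreign keys.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE schedules (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE days (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE repeat_on (
        schedule_id INTEGER REFERENCES schedules(id),
        day_id INTEGER REFERENCES days(id)
    );
    INSERT INTO schedules VALUES (1, 'weekly sync');
    INSERT INTO days VALUES (1,'monday'),(2,'tuesday'),(3,'wednesday'),
                            (4,'thursday'),(5,'friday'),(6,'saturday'),(7,'sunday');
    INSERT INTO repeat_on VALUES (1, 1), (1, 3), (1, 7);
""")

# Which days does schedule 1 repeat on?
days = [name for (name,) in conn.execute(
    "SELECT d.name FROM repeat_on r JOIN days d ON d.id = r.day_id "
    "WHERE r.schedule_id = 1 ORDER BY d.id")]
```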
IMO, normalization is an art. I usually take fields like your example and keep them in one table, as we will never have more than 7 days. However, if there were any chance of growth, I would put them in a separate table.
If the options are mutually exclusive, then you can use one field to store the choice. Ideally set up constraints for the field such that only the allowed values can be stored.
If more than one option can be chosen, you should have one field per option, with values 'y' and 'n' (or 't'/'f' for true/false). Again, you should add a constraint to allow only these values. If your DBMS supports it, use a BIT datatype, which only allows 1 and 0.
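For illustration, the per-option-flag layout with constraints, sketched with Python's stdlib sqlite3 (column names invented; sqlite has no BIT type, so 0/1 integers with a CHECK stand in):

```python
# One flag column per option, each constrained to 0/1 so bad values are rejected
# at the database level.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE recurrence (
        id INTEGER PRIMARY KEY,
        on_mon INTEGER NOT NULL CHECK (on_mon IN (0, 1)),
        on_wed INTEGER NOT NULL CHECK (on_wed IN (0, 1))
    )
""")
conn.execute("INSERT INTO recurrence VALUES (1, 1, 0)")   # Mondays only

try:
    conn.execute("INSERT INTO recurrence VALUES (2, 5, 0)")  # 5 is not a flag
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # the CHECK constraint refused the row
```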
This may be overblown for your example, but you may want to look at the following article:
http://martinfowler.com/apsupp/recurring.pdf
I know this is an indirect answer but couldn't hurt to read.
Something like days of the week is tricky. PachinSV and Dustin Laine both make good points. If you have a list of things to choose from, having a code table to list the things and an intersection table to say which ones are chosen is a good basic design.
The reason days of the week are tricky is that the domain (i.e. list of days) is pretty small and there is no way the domain will ever expand. Also, one of the advantages of the intersection table approach is that you can run a query against everything that happens on a Wednesday (for example). This is great when your code table is something like category tags for blog articles, since asking to see everything with the tag "How-To" is a reasonable question. For the case of days of the week recurrence, does it make any actual business sense to say show me everything that recurs on Wednesdays? I don't think so. For sure you'll query on dates and date ranges, but days of the week, and only in the context of recurrence? I can't think of a practical reason to do that.
Therefore, there is an argument to be made that the days of the week are attributes and not independent entities so having seven bit flags on your table is still 3NF.
I have an idea I have yet to implement, because I have some fear I may be barking up the wrong tree... mainly because Googling on the topic returns so few results.
Basically I have some SQL queries that are slow, in large part because they have subqueries that are time-consuming. For example, they might do things like "give me a count of all bicycles that are red and ridden by boys between the ages of 10-15". This is expensive as it sloshes through all of the bicycles, but the end result is a single number. And, in my case, I don't really need that number to be 100% up to date.
The ultimate solution for problems of this sort seems to be to apply an OLAP-based engine to pre-cache these permutations. However, in my case I'm not really trying to slice and dice the data around a ton of metrics, and I'd love not to have to complicate my architecture with yet another process/datastore running.
So... my idea was basically to memoize these subqueries in the database. I might have a table called "BicycleStatistics" that stores the output of the subquery above as a name-value pair of its inputs and outputs.
Ex name: "c_red_g_male_a_10-15" value: 235
And have a mechanism that memoizes those values to that table as the queries are run.
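A minimal sketch of such a mechanism, with Python's stdlib sqlite3 standing in for the real DBMS (the table name and key format come from the question; the staleness window and helper names are invented):

```python
# Check the memo table first; on a miss (or a stale entry) run the expensive
# subquery and store the result with a timestamp.

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BicycleStatistics "
             "(name TEXT PRIMARY KEY, value INTEGER, computed_at REAL)")

def cached_count(conn, name, compute, max_age_seconds=2 * 86400):
    row = conn.execute("SELECT value, computed_at FROM BicycleStatistics "
                       "WHERE name = ?", (name,)).fetchone()
    if row and time.time() - row[1] <= max_age_seconds:
        return row[0]                    # fresh enough: skip the expensive query
    value = compute()                    # the slow subquery runs here
    conn.execute("INSERT OR REPLACE INTO BicycleStatistics VALUES (?, ?, ?)",
                 (name, value, time.time()))
    return value

calls = []
expensive = lambda: calls.append(1) or 235   # stands in for the slow COUNT(*)

first = cached_count(conn, "c_red_g_male_a_10-15", expensive)
second = cached_count(conn, "c_red_g_male_a_10-15", expensive)  # served from cache
```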
Has anyone been in this situation and tried anything similar? The reason I think a solution like this beats "throw a lot of RAM at your DB and let the database handle it" is that (A) my database is bigger than the amount of RAM I can conveniently throw at it, and (B) the database will always insist on computing the exactly right number for these statistics, while my big win, above, is that I'm OK with the numbers being a day or two out of date.
Thanks for any thoughts/feedback.
Tom
Materialized views are a way of achieving this requirement, if your DBMS supports them.
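If your DBMS lacks them, the pattern is easy to emulate with a summary table plus a refresh routine run on whatever schedule your staleness tolerance allows. A sketch in Python's stdlib sqlite3 (which has no materialized views; all names and data invented):

```python
# A summary table recomputed on demand, playing the role of
# REFRESH MATERIALIZED VIEW.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bicycles (color TEXT, rider_age INTEGER)")
conn.executemany("INSERT INTO bicycles VALUES (?, ?)",
                 [("red", 12), ("red", 14), ("blue", 11)])
conn.execute("CREATE TABLE bicycle_counts (color TEXT, n INTEGER)")

def refresh(conn):
    """Recompute the summary table from the base table."""
    conn.execute("DELETE FROM bicycle_counts")
    conn.execute("INSERT INTO bicycle_counts "
                 "SELECT color, COUNT(*) FROM bicycles GROUP BY color")

refresh(conn)   # run nightly, say; reads then hit the small table
counts = dict(conn.execute("SELECT color, n FROM bicycle_counts"))
```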