Apologies for the fuzzy title...
My problem is this; I have a SQL Server table persons with about 100.000 records. Every person has an address, something like "Nieuwe Prinsengracht 12 - III". The customer now wants to separate the street from the number and addition (so each address becomes two or three fields). The problem is that we can not be sure of the format the current address is in, it could also simply be something like "Velperweg 30".
The only thing we do know about it is that it's a piece of text, followed by a number, possibly followed by some more text (which can contain a number).
A possible solution would be to do this with regexes, but I would much (much, much) rather do this using a query. Is there any way to use regexes in a query? Or do you have any other suggestions how to solve such a problem?
Something like this maybe?
SELECT
substring([address_field], 1, patindex('%[1-9]%', [address_field])-1) as [STREET],
substring([address_field], patindex('%[1-9]%', [address_field]), len([address_field])) as [NUMBER_ADDITON]
FROM
[table]
It relies on the assumption that the [street] field will not contain any numbers, and the [number_addition] field will begin with a number.
SQL Server and T-SQL are rather limited in their processing prowess - if you're really serious about heavy-lifting and regexes etc., you're best bet is probably creating an assembly in C# or VB.NET that does all that tricky Regex business, and then deploying that into SQL-CLR and use the functions in T-SQL.
"Pure" T-SQL cannot really handle much string manipulation beyond SUBSTRING and CHARINDEX - but that's about it.
In answer to your "Is there any way to use regexes in a query?", then yes there is, but it needs a little .NET knowledge. Create a CLR assembly with a user-defined function that does your regex work. Visual Studio 2008 has a template project for this. Deploy it to your SQL server and call it from your query.
Name and Address parsing and standardization is probably one of the most difficult problems we can encounter as programmers for precisely the reasons you've mentioned.
I assume that whoever you work for their main business is not address parsing. My advice is to buy a solution rather than build one of your own.
I am familiar with this company. Your address examples appear to be non US or Canadian so I don't know if their products would be useful, but they may be able to point you to another vendor.
Other than a user of their products I am not affiliated with them in any way.
This sounds like the common "take a piece of complex text that could look like anything and make it look like what we now want it to look like" problem. These tend to be very hard to do using only T-SQL (which does not have native regex functionality). You will probably have to work with complex code outside of the database to solve this problem.
TGnat is correct. Address standardization is complicated.
I've encountered this problem before.
If your customer doesn't want to spring for the custom software, develop a simple GUI that allows a person to take an address and split it manually. You'd delete the address row with the old format and insert the row with the new address format.
It wouldn't take long for typists familiar with your addresses to manually make 100,000 changes. Of course, it's up to the customer if he wants to spend the money on custom software or typists.
But you shouldn't be stuck with the data cleaning bill, either.
I realize that this is an old question, but for future reference, I still decided to add an answer using regex (also so I don't forget it myself). Today, I ran into a similar problem in Excel, in which I had to split the address in street and house number too. In the end, I've copied the column to SublimeText (a shareware text editor), and used a regex to do the job (CTRL-H, enable regex):
FIND: ^('?\d?\d?\d?['-\.a-zA-Z ]*)(\d*).*$
REPLACE FOR THE HOUSE NUMBER: $2
REPLACE FOR THE STREET NAME: $1
Some notes:
Some addresses started with a quote, e.g. 't Hofje, so I needed to add '?
Some addresses contained digits at the start, e.g. 17 Septemberplein or 2e Molendwarsstraat, so I added \d?\d?\d?
Some addresses contained a -, e.g. Willem-Alexanderlaan or a '
Related
I have 2 tables of UK postal addresses (around 300000 rows each) and need to match one set to another in order to return a unique ID contained in first set for each address.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of t-sql scripts to pick off the east matches (exact postcode + house number + street name, etc) but there are many unmatched records left that are proving difficult to handle. I might end up having as many sql scripts as there are exceptions!
I've look at Levenstein function and ranking word for word but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list with as many objections as it'll generate. Don't try to automate it perfectly; automate the easy stuff, put a human in the loop for the uncertain cases. Much easier and safer.
I am trying to create an old-school Text Adventure Game. I'm a bit stuck on creating the World Map and rooms.
Should the room descriptions be part of the source code or should it be separated out? I was thinking of placing all such descriptions and room properties in a MySQL database and then have code to organize the logic of each room; putting each room description in with the actual source code seems a bit untidy.
Is this the preferred method of organising Descriptions in an adventure game? I was also thinking that this might be preferable since I could then query the database to find common properties about the data.
Any comments would be appreciated.
No, don't include level/room description within code, it is not dynamic this way.
Many many development frameworks now tend to go with separating code from data. So, for usual cases, we put game rooms data within files and read those to build the level and maybe enable the user to construct a new level on his own and eventually create a new file to carry the room data.
I work in a company where they build games, and they have the rooms separated from the code, they have it in mysql. Actually also the items that go in each room are in a table, and there is also a table that says which item is at which room at that moment.
Besides if you want to expand your game or do statistics about it is much better doing it with a database.
I will address two issues here. First, you are right to keep the data that defines the game away from the engine that will use it. This makes it so that you dont have to recompile everything in order to fix a typo or the like in the case of a text based game.
Secondly though, I would just question the use of MySQL. If you are making a dos typed game that is to be installed on people's systems you dont want a pre-req to be 'Install MySQL', hehe. There is a little program out there that is written in C that is free for all to use called SQLite that would suit your needs much better. If on the other hand the web is the medium for the release of this text based game, then have at it :)
You could just use a system like ADRIFT, then all you need to worry about are the descriptions and logic.
Should the room descriptions be part of the source code or should it
be separated out?
Separated out.
Try Prolog language.
It has similar database to SQL (actually logical predicates)
With some skill You may be able to check whether after some change is Your adventure finishable.
You may easily create this description by some logical predicates if You don't mind it being very "computer like".
You can see examples of Prolog text adventures in simple Google search.
I suggest using engines that already have a vibrant community around them. That way, your source code is only that; the source code of the game. I'd go with either TADS 3 or Inform 7
I would construct such a game as an interpreter which reads in room data, and based on the room data, allows for a set of valid commands (move, take, drop, change...). For movement you would have a pre-built graph with nodes being rooms and edges being allowed moves.
I would separate the descriptions from the code, having an object Room, that owns an object Description that calls a "database" through some Facade, so that you may use a file or a database or anything you wish. It would also eventually allow you to add some scripting to the room itself, like having objects in your description that have behaviors.
I am designing a database and recently named a column in a table DayOfWeek, completely forgetting that DayOfWeek is a built-in function in SQL Server. Now I am deciding if I should just leave it as is and reference the column with square brackets [DayOfWeek] or change the column name to avoid any conflicts in the future. I am not too far into the project so changing it is not too hard. The debate in my head is that the column name of DayOfWeek just makes so much sense for its purpose, so I really want to use it... but it is a reserved word... and could cause pain in the future (especially if I always have to put square brackets around it when referencing the column).
What does everyone think?
I would change it - i've got a legacy table called user - it is a pain with the square brackets all the time. perhaps call it DayOfWeekName or DayOFWeekId
Josh
Jeff,
If you're not too far down the track to rename the column (relatively) painlessly then I'd recommend you change it. You've identified one probable-future-headache for the maintentance crew, and I guess it would be actually less costly (over time) to clean it up now, especially considering that renaming something isn't the hell-on-wheels since the advent of truly effective search and replace functionality in text-editors and IDE's.
The truly hard part of renaming it is gaining the understanding required to do the job safely. You are unique (being the author) in having that understanding. If you asked me (just for instance) to do the job, then it probably wouldn't be a cost effective business proposition.
So... +1 for fixing the sucker yourself... and +2 for not doing it again ;-)
Cheers. Keith.
I always make sure I never use a keyword as the start of any variable/object/function simply because you're never sure if your target language will like it. It can often generate wacky errors that take a while to track down. Even if syntax checking does pick it up, it means you've wasted more time than if it had been a different name like FurryKitten.
I would avoid DayOfWeek and opt for something which is a completely different, maybe Weekday or DayName. It just saves hassle.
Plus - square brackets just create headaches, and there are a lot of SQL developers out there who don't use the brackets - new developers will end up creating "non-bracketed" code out of habit for some time after they join the team. Uncommon conventions should be avoided if possible.
change it if is easy. But as a good practice I'd use square brackets everywhere. Once you get used to, is no more of a pain than putting space between words.
We have a table here called User. If at all possible, I would change it. Although Josh is correct, you can place square brackets to denote it is a table, that task gets old very quickly. Using reserved words as tables also makes it difficult for other developers. If someone doesn't know that is a reserved word, it can be difficult determining why a query doesn't work.
If you using any DB adapter / abstraction library to get data from database, then don't worry. Class will escape column names for you. If you writing SQL queries yourself, then it can make some troubles. You need to escape column names in your queries.
P.S. I remmember, few times I was in situation like you. But I was named column as "order" :-)
I would avoid also using reserved words, you can get very easly in trouble if you're not paying attention.
When posting example code or filing bug reports based on a real production app, it would be helpful to have some way to change the table and column names to not potentially give away information about the internals of the app. Doing it by hand without breaking things is time consuming. Does anything automatic exist? Ideally it would use real English words so they are more easily referred to than random text strings.
As long as you don't use real data, I don't see what the issue is. Most apps are fairly obvious based on the requirements. ie CRM system = (customer name, address, etc...) or (customer name, addressid, etc.. with some address table with parts of the address, etc...). By knowing your schema I have no idea how you implement your app. Generally without the stored procedures/program code it would be hard to steal any intellectual property. Even if you were the NSA or something (InternetIP, PacketHeadingID, PacketDetailID, TimeStampID). Even with the structure of the tables I still would have no information on how your system to log all the internet traffic actually works. I also wouldn't know anything that is logged.
I don't know of anything off hand to do what you are requesting, but I would think it is fairly easy to write a script to do it on your own. Look at the table columns and datatypes and call text columns "TextColumn1", int columns "IntColumn2", etc. and build a table of substitutions, then perform the substitutions globally in the script file. I would think this is a fairly easy Python/Perl/PowerShell/Ruby/VbScript program.
I agree that there's no real need to do so, but if you feel that way, take a look at anonymizers, usually used to protect the data and not the schemas, but you could easily apply those approaches to schemas as well.
See this paper (which is the description of this framework) especially page 8 an onwards for different anonymization methods, although replacing column names for static strings might probably be good enough anyway.
I have a variety of time-series data stored on a more-or-less georeferenced grid, e.g. one value per 0.2 degrees of latitude and longitude. Currently the data are stored in text files, so at day-of-year 251 you might see:
251
12.76 12.55 12.55 12.34 [etc., 200 more values...]
13.02 12.95 12.70 12.40 [etc., 200 more values...]
[etc., 250 more lines]
252
[etc., etc.]
I'd like to raise the level of abstraction, improve performance, and reduce fragility (for example, the current code can't insert a day between two existing ones!). We'd messed around with BLOB-y RDBMS hacks and even replicating each line of the text file format as a row in a table (one row per timestamp/latitude pair, one column per longitude increment -- yecch!).
We could go to a "real" geodatabase, but the overhead of tagging each individual value with a lat and long seems prohibitive. The size and resolution of the data haven't changed in ten years and are unlikely to do so.
I've been noodling around with putting everything in NetCDF files, but think we need to get past the file mindset entirely -- I hate that all my software has to figure out filenames from dates, deal with multiple files for multiple years, etc.. The alternative, putting all ten years' (and counting) data into a single file, doesn't seem workable either.
Any bright ideas or products?
I've assembled your comments here:
I'd like to do all this "w/o writing my own file I/O code"
I need access from "Java Ruby MATLAB" and "FORTRAN routines"
When you add these up, you definitely don't want a new file format. Stick with the one you've got.
If we can get you to relax your first requirement - ie, if you'd be willing to write your own file I/O code, then there are some interesting options for you. I'd write C++ classes, and I'd use something like SWIG to make your new classes available to the multiple languages you need. (But I'm not sure you'd be able to use SWIG to give you access from Java, Ruby, MATLAB and FORTRAN. You might need something else. Not really sure how to do it, myself.)
You also said, "Actually, if I have to have files, I prefer text because then I can just go in and hand-edit when necessary."
My belief is that this is a misguided statement. If you'd be willing to make your own file I/O routines then there are very clever things you could do... And as an ultimate fallback, you could give yourself a tool that converts from the new file format to the same old text format you're used to... And another tool that converts back. I'll come back to this at the end of my post...
You said something that I want to address:
"leverage 40 yrs of DB optimization"
Databases are meant for relational data, not raster data. You will not leverage anyone's DB optimizations with this kind of data. You might be able to cram your data into a DB, but that's hardly the same thing.
Here's the most useful thing I can tell you, based on everything you've told us. You said this:
"I am more interested in optimizing my time than the CPU's, though exec speed is good!"
This is frankly going to require TOOLS. Stop thinking of it as a text file. Start thinking of the common tasks you do, and write small tools - in WHATEVER LANGAUGE(S) - to make those things TRIVIAL to do.
And if your tools turn out to have lousy performance? Guess what - it's because your flat text file is a cruddy format. But that's just my opinion. :)
I'd definitely change from text to binary but keep each day in a separate file still. You could name them in such a way that insertions in between don't cause any strangeness with indices, such as by including the date and possible time in the filename. You could also consider the file structure if you have several fields per location for example. Is it common to look for a small tile from a large number of timesteps? In that case you might want to store them as tiles containing data from several days. You didn't mention how the data is accessed which plays a big role in how to organise it efficiently.
Clarifications:
I'm surprised you added "database" as one of the tags, and considered it as an option. Why did you do this?
Essentially, you have a 2D, single component floating point image at every time step. Would you agree with this way of viewing your data?
You also mentioned the desire to insert a day between two existing ones - which seems to be a very odd thing to do. Why would you need to do that? Is there a new day between May 4 and May 5 that I don't know about?
Is "compression" one of the things you care about, or are you just sick of flat files?
Would a float or a double be sufficient to store your data, or do you feel you need more arbitrary precision?
Also, what programming language(s) do you want to access this data with?
your answer on how to store the data depends entirely on what you're going to do with the data. for example, if you only ever need to retrieve by specifying the date or a date range, then storing in a database as a BLOB makes some sense. but if you need to find records that have certain values, you'll need to do something different.
please describe how you need to be able to access the data/
Matt, thanks very much, and likewise longneck and jirv.
This post was partly an experiment, testing the quality of stackoverflow discourse. If you guys/gals/alien lifeforms are representative, I'm sold.
And on point, you've clarified my thinking considerably. Mind, I still might not necessarily implement your advice, but know that I will be thinking about it very seriously. >;-)
I may very well leave the file format the same, add to the extant C and/or Ruby routines to tack on the few low-level features I lack (e.g. inserting missing timesteps), and hang an HTTP front end on the whole thing so that the data can be consumed by whatever box needs it, in whatever language is currently hoopy. While it's mostly unchanging legacy software that construct these data, we're always coming up with new consumers for it, so the multi-language/multi-computer requirement (gee, did I forget that one?) applies to the reading side, not the writing side. That also obviates a whole slew of security issues.
Thanks again, folks.