Replacing Unicode characters using the TRANSLATE function - sql-server

A customer asked me to create a custom character-mapping function from specific names to ASCII in their SQL database.
Here is a simplified fragment that works (shortened for brevity):
select TRANSLATE(N'àáâãäåāąæậạả',
N'àáâãäåāąæậạả',
N'aaaaaaaaaaaa');
While analyzing the results on the customer's dataset, I noticed one more unmapped symbol, ă. So I added it to the mapper as follows:
select TRANSLATE(N'àáâãäåāąæậạảă',
N'àáâãäåāąæậạảă',
N'aaaaaaaaaaaaa');
Unexpectedly, it started failing with the message:
The second and third arguments of the TRANSLATE built-in function must contain an equal number of characters.
Obviously, TRANSLATE thinks that ă is special and consists of more than one character. Actually, even Notepad thinks the same (copy ă and try to delete it with the Backspace key - something unusual will happen. The Delete key works normally, though).
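For illustration, the same fact is visible in a quick Python check (a sketch, not part of the SQL in question): ă exists both as the single precomposed code point U+0103 and as a plain a followed by the combining breve U+0306, and the decomposed form really does count as two characters:
import unicodedata

precomposed = "\u0103"    # LATIN SMALL LETTER A WITH BREVE, one code point
decomposed = "a\u0306"    # 'a' plus COMBINING BREVE, two code points

print(len(precomposed))              # 1
print(len(decomposed))               # 2 - the form that counts as two characters
print(precomposed == decomposed)     # False, although both render as ă

# NFC normalization folds the pair back into a single code point:
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True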
Then I thought - if TRANSLATE considers it a two-character symbol, let's add a two-character mapping then:
select TRANSLATE(N'àáâãäåāąæậạảă',
N'àáâãäåāąæậạảă',
N'aaaaaaaaaaaaaa');
No errors this time, yay. But the input string was not processed correctly: ă was not replaced with a.
What is the correct (case-sensitive) way to replace such "double symbols"? Can it be done using TRANSLATE at all? I don't want to add a bunch of REPLACE calls for every such symbol I find.

Related

Remove Duplicate adjacent Sub-String from String in Microsoft SQL Server

I am using SQL Server 2008 and I have a column in a table, which has values like below. It basically shows departure and arrival information.
-->Heathrow/Dublin*Dublin/Heathrow
-->Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick
-->Heathrow/Dublin*Liverpool/Heathrow
(The 3rd example shown above is slightly different: the person did not depart from Dublin, but instead departed from Liverpool.)
This makes the column too lengthy, and I want to remove only the adjacent duplicates, so the information can be shown like below:
-->Heathrow/Dublin/Heathrow
-->Gatwick/Liverpool/Carlisle/Gatwick
-->Heathrow/Dublin***Liverpool/Heathrow
So, this would still show the correct travel route, but omits only the contiguous duplicates. Also, in the 3rd case, since the departure and arrival locations are not the same, I would like to show it as ***.
I found a post here that removes all duplicates (Find and Remove Repeated Substrings) but this is slightly different from the solution that I need.
Could someone share any thoughts please?
The first step is to adapt the process defined in the following link so that it splits based on /:
T-SQL split string
This returns a table, which you would then loop through, checking whether each value contains an *. When it does, get the text values before and after the * and compare them. Use CHARINDEX to get the position of the *, and SUBSTRING to extract the values before and after it. Once you have those, check both values and append to your output string accordingly.
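The same comparison logic, sketched in Python purely as an illustration of the algorithm (the *** convention comes straight from the question; the answer itself describes doing this in T-SQL):
def collapse_route(route):
    # Split the legs on '*', then compare each leg's departure with the
    # previous leg's arrival; duplicates collapse, mismatches get '***'.
    legs = route.replace(" ", "").split("*")
    out = legs[0]
    for leg in legs[1:]:
        prev_arrival = out.rsplit("/", 1)[-1]
        departure, rest = leg.split("/", 1)
        if departure == prev_arrival:
            out += "/" + rest        # adjacent duplicate: keep one copy
        else:
            out += "***" + leg       # departure != arrival: flag with ***
    return out

print(collapse_route("Heathrow/Dublin*Dublin/Heathrow"))
# --> Heathrow/Dublin/Heathrow
print(collapse_route("Heathrow/Dublin*Liverpool/Heathrow"))
# --> Heathrow/Dublin***Liverpool/Heathrow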
So you have a database column that contains this text string? Is your concern to display the data to the user in a new format, or to update the data in your database table with a new value?
Do you have access to the original data from which this text string was built? It would probably be easier to re-create the string in the format you desire than it would be to edit the existing string programmatically.
If you don't have access to this data, it would probably be a lot simpler to update your data (or reformat it for display) if you do the string manipulation in a high-level language such as C# or Java.
If you're reformatting it for display, write the string manipulation code in whatever language is appropriate, right before displaying it. If you're updating your table, you could write a program to process the table, reading each record, building the replacement string, and updating the record before moving on to the next one.
The bottom line is that T-SQL is just not a good language for doing this sort of string examination and manipulation. If you can build a fresh string from the original data, or do your manipulation in a high-level language, you'll have an easier job of it and end up with more maintainable code.
I wrote code for the first example you gave. You'll still need to extend it for the rest...
DECLARE @STR VARCHAR(50) = 'Heathrow/Dublin*Dublin/Heathrow'

-- First leg's arrival: the text between '/' and '*'.
-- Second leg's departure: the same-length text right after '*'.
IF SUBSTRING(@STR, CHARINDEX('/', @STR) + 1, CHARINDEX('*', @STR) - CHARINDEX('/', @STR) - 1) =
   SUBSTRING(@STR, CHARINDEX('*', @STR) + 1, CHARINDEX('*', @STR) - CHARINDEX('/', @STR) - 1)
BEGIN
    -- They match: cut out the '*' together with the duplicated location.
    SELECT STUFF(@STR, CHARINDEX('*', @STR), CHARINDEX('*', @STR) - CHARINDEX('/', @STR), '')
END
ELSE
BEGIN
    -- They differ: replace only the '*' itself with '***' so both locations survive.
    SELECT STUFF(@STR, CHARINDEX('*', @STR), 1, '***')
END

Word wrap issues with SSIS Flat file destination

Background: I need to generate a text file with 5 records, each 1565 characters long. This text file is then used to feed the data into another piece of software.
Hence, there are some required fields and some optional fields. I created a query with all the fields concatenated together to get one single field. I populated the optional fields with blanks.
For example:
Here is the sample input layout for each field:
Field CharLength Required
ID 7 Yes
Name 15 Yes
Address 15 No
DOB 10 Yes
Age 1 No
Information 200 No
IDNumber 13 Yes
Then I generated a query for each unique ID with the above fields combined into a single row, which looks like the following:
SELECT CAST(1 AS CHAR(7)) + CAST('XYZ' AS CHAR(15)) + CAST('' AS CHAR(15)) + CAST('22/12/2014' AS CHAR(10))
     + CAST('' AS CHAR(1)) + CAST('' AS CHAR(200)) + CAST('123456' AS CHAR(13))
UNION
SELECT CAST(2 AS CHAR(7)) + CAST('XYZ' AS CHAR(15)) + CAST('' AS CHAR(15)) + CAST('22/12/2014' AS CHAR(10))
     + CAST('' AS CHAR(1)) + CAST('' AS CHAR(200)) + CAST('123456' AS CHAR(13))
Then, I created an SSIS package to produce the output text file through a delimited Flat File Destination.
Problem:
Even though the flat file is generated at the desired length (1565), the text file looks different depending on whether word wrap is on or off. When word wrap is off, I get each record on a single line. When word wrap is on, the line is broken into multiple lines. The length of the record is the same in either case.
I even tried using VARCHAR plus SPACE() in the query instead of CHAR for each field, but with no success; it still breaks the line at the blank fields.
For example, for the Information field: CAST('' AS VARCHAR(1)) + SPACE(200 - LEN(CAST('' AS VARCHAR(1))))
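For comparison, the same fixed-width padding idea in a Python sketch (field widths taken from the layout table above; the real record would pad out to 1565):
widths = [7, 15, 15, 10, 1, 200, 13]    # from the layout table above
fields = ["1", "XYZ", "", "22/12/2014", "", "", "123456"]

# Left-justify every field to its fixed width and concatenate.
record = "".join(f.ljust(w) for f, w in zip(fields, widths))
print(len(record))    # 261 for this shortened layout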
Question: How do I make it appear as a single line even when word wrap is on?
Since it's my first post, please excuse the formatting of the question.
The purpose of word wrap is to move characters to the next line when they overflow, rather than producing a document that scrolls far to the right.
Word wrap is the additional feature of most text editors, word processors, and web browsers, of breaking lines between words rather than within words, when possible.
Because this is what word wrap is, there's nothing you can do to change its behavior. What does it matter anyway? The document should still be parsed as you would expect. Just don't turn word wrap on.
As far as I'm aware, having word wrap on or off has no impact on the document itself, it's simply a presentation option.
Applications parsing a document parse it as if word wrap were off. Something that could throw off parsing is an actual line break in the data, but that is a completely different thing from word wrap.
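An easy way to convince yourself is to check the record lengths in the file directly rather than in an editor. A Python sketch (the file name is hypothetical):
# Word wrap never changes the bytes on disk, only how an editor shows them,
# so every record should still measure exactly 1565 characters.
with open("output.txt") as f:             # hypothetical output file name
    for line_no, line in enumerate(f, start=1):
        record = line.rstrip("\r\n")
        if len(record) != 1565:
            print("record", line_no, "has length", len(record))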

Parsing text in Julia - Invalid ascii sequence

I'm trying to format a data file so that my other program will properly handle it. I am trying to handle the following data and I am getting a very weird error that I can't seem to put my finger on.
https://snap.stanford.edu/data/wiki-RfA.html
I am trying to format the data as [SRC TGT VOT], so I'd like the first two lines of my output file to be
1 2 1
3 2 1
because user 1 (stored in the dictionary of users first) votes for user 2 with VOT 1, and then user 3 votes for user 2 with VOT 1. My problem is that when I try to run my code below, I always end up getting a very strange "invalid ascii sequence" error - can anyone help me identify the issue or perhaps find a way around this? It'd obviously be best if I could learn what I am doing wrong. Thank you!
Note: I understand this is a fairly specific question, and I appreciate any help - I'm somewhat baffled by this error and don't know how to resolve it at the moment.
f=open("original_vote_data.txt") #this is the file linked above
arr=readlines(f)
i=edge_count=src=tgt=vot=1
dict=Dict{ASCIIString, Int64}()
edges=["" for k=1:198275]
while i<1586200
src_temp=(arr[i])[5:end-2]
if (haskey(dict, src_temp))
new_src= dict[src_temp]
else
dict[src_temp]=src
new_src=src
src=src+1
end
tgt_temp=(arr[i+1])[5:end-2]
if (haskey(dict, tgt_temp))
new_tgt= dict[tgt_temp]
else
dict[tgt_temp]=tgt
new_tgt=tgt
tgt=tgt+1
end
vot_temp=(arr[i+2])[5]
edges[edge_count]=string(new_src)* " " * string(new_tgt)* " " *string(vot_temp)
edge_count=edge_count+1
i=i+8
end
Here we go - I'll write up my comment as an answer since it seems to have solved the question.
My hunch that the error stemmed from the fourth line (dict=Dict{ASCIIString, Int64}) was based on the fact that ASCIIStrings will error if you try to store non-ASCII characters in them. Since this file is coming from an international site, it's not unlikely that there are users with unicode characters in their names (or elsewhere in the data). So the simple fix is to change all instances of ASCIIString to UTF8String.
Just to make this answer a bit more complete, I downloaded the file and tried running the program. The simplest way to debug this is to run the script at top-level in the REPL and then inspect the program state after the error. After the error is thrown, i==3017. Now just try running each line of the while loop incrementally. You'll quickly see that line 3017 contains "SRC:Guðsþegn\n" — unicode, as I suspected. When you try to create a new entry in dict with that as the key, the error should have a backtrace to setindex! in dict.jl, where you'll see that it's trying to convert the key (a UTF8String) to an ASCIIString. So changing the dictionary type to have UTF8String keys solves the problem.
As it turns out, the edges array only contains strings of three integers (or sometimes a hyphen), so the ASCIIString there is ok, but still a little dangerous. I'd probably store that information in a more dedicated array of ints instead of converting it to a space-separated string: you know the first two elements in the string are ints, but the last element is unvalidated text from the file itself… which may be unicode or a space itself (which could mess up processing down the line).

Fix CSV file with new lines

I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't; I'm not sure why (it seems to quote fields that contain more than one new line, but not fields that contain only one - thanks, Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows come out wrong because of the new lines: it thinks one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
The problem with this is that it also matches the last element of one line together with the first element of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}

$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
If you want a programmatic solution in Java, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl command to clean up the CRLFs.
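As one more illustration of the batch route, here is a Python sketch that stitches broken physical lines back into logical rows; it assumes every logical row has exactly as many fields as the header and that the stray newlines never add commas (the file names are illustrative):
import csv

EXPECTED_COLS = 5    # known from the header row of the sample

def logical_rows(path):
    # Buffer short rows and glue the broken field back together across
    # the line break until the row reaches the expected field count.
    pending = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            if pending:
                row = pending[:-1] + [pending[-1] + "\n" + row[0]] + row[1:]
            if len(row) >= EXPECTED_COLS:
                yield row
                pending = []
            else:
                pending = row

# Write a cleanly quoted copy of the repaired rows.
with open("fixed.csv", "w", newline="") as out:
    csv.writer(out, quoting=csv.QUOTE_ALL).writerows(logical_rows("broken.csv"))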

Any python/django function to check whether a string only contains characters included in my database collation?

As expected, I get an error when entering some characters not included in my database collation:
(1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='")
Is there any function I could use to make sure a string only contains characters existing in my database collation?
thanks
You can use a regular expression to allow only certain characters. The following allows only letters, numbers and _ (underscore), but you can change it to include whatever you want:
import re
exp = '^[A-Za-z0-9_]+$'
re.match(exp, my_string)
If a match object is returned, a match was found; if the return value is None, the string is invalid.
I'd look at Python's unicode.translate() and codec.encode() functions. Both of these would allow more elegant handling of non-legal input characters, and IIRC, translate() has been shown to be faster than a regexp for similar use-cases (should be easy to google the findings).
From Python's docs:
"For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example)."
http://docs.python.org/library/stdtypes.html
http://docs.python.org/library/codecs.html
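For example (Python 3 syntax here, while the quoted docs describe Python 2), a translation table maps or deletes offending characters, and a plain encode() attempt is a direct test of whether a string fits a latin1 collation:
# Map a couple of accented letters to ASCII and delete anything unmappable.
table = {ord("é"): "e", ord("ü"): "u", ord("\u2603"): None}   # snowman: delete
print("résümé \u2603".translate(table))    # prints "resume "

# Or simply test encodability against the collation's character set:
def fits_collation(s, encoding="latin-1"):    # latin1_swedish_ci stores latin-1
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(fits_collation("résümé"))     # True: every character exists in latin-1
print(fits_collation("\u2603"))     # False: no snowman in latin-1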
