Searching Persian characters and words in SQL server with different encoding - sql-server

I have a text file that contains Persian words and is saved using ANSI encoding. When I try to read the Persian words from the text file, I get some characters like '?'. To solve the problem, I changed the file encoding to UTF8 and re-wrote the text file. Here's the method for changing file encoding:
public void Convert2UTF8(string filePath)
{
    // First, read the text file with the "ANSI" (system default) encoding
    StreamReader fileStream = new StreamReader(filePath, Encoding.Default);
    string fileContent = fileStream.ReadToEnd();
    fileStream.Close();

    // Now overwrite the same file, re-encoded as UTF-8
    StreamWriter utf8Writer = new StreamWriter(filePath, false, Encoding.UTF8);
    utf8Writer.Write(fileContent);
    utf8Writer.Close();
}
Now the first problem is solved. However, there is another issue: every time I search for a Persian word in the SQL Server database table, the result is empty even though the record does exist in the table.
How can I find the Persian words that exist in the table? The code I currently use is simply the following:
SELECT * FROM [dbo].[WordDirectory]
WHERE Word = N'کلمه'
Word is the field that Persian words are saved in. The type of the field is NVARCHAR. My SQL server version is 2012.
Should I change the collation?

DECLARE @Table TABLE(Field NVARCHAR(4000) COLLATE Frisian_100_CI_AI)
INSERT INTO @Table (Field) VALUES
(N'همهٔ افراد بش'),
(N'می‌آیند و حیثیت '),
(N'ميشه آهسته تر صحبت کنيد؟'),
(N'روح'),
(N' رفتار')
SELECT * FROM @Table
WHERE Field LIKE N'%آهسته%'
Both queries return the same result.
Result set: ميشه آهسته تر صحبت کنيد؟
You have to make sure that when you insert the values you prefix them with N; that tells SQL Server the passed string may contain Unicode characters. The same applies when you search for those strings in a SELECT statement.
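For illustration, here is a minimal sketch of the difference the N prefix makes (the table variable is hypothetical; the exact behaviour of the unprefixed literal depends on the database's default collation):
DECLARE @Words TABLE (Word NVARCHAR(100))
-- Without N the literal is VARCHAR, so Persian characters may be
-- converted to '?' before they ever reach the NVARCHAR column
INSERT INTO @Words (Word) VALUES ('کلمه')
-- With N the literal stays NVARCHAR and the characters survive
INSERT INTO @Words (Word) VALUES (N'کلمه')
SELECT * FROM @Words WHERE Word = N'کلمه'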

Probably you have a problem with the Persian and Arabic versions of 'ی' and 'ک' during the search. Even though these characters look the same, they have different Unicode code points:
select NCHAR(1740), -- Persian ی
       NCHAR(1610), -- Arabic ي
       NCHAR(1705), -- Persian ک
       NCHAR(1603)  -- Arabic ك
More info: http://www.dotnettips.info/post/90
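If the stored rows mix the Arabic and Persian forms, one workaround is to normalize both sides of the comparison before matching; a sketch (the table and column names are taken from the question, and this approach bypasses any index on the column):
SELECT *
FROM [dbo].[WordDirectory]
WHERE REPLACE(REPLACE(Word, NCHAR(1610), NCHAR(1740)), NCHAR(1603), NCHAR(1705))
    = REPLACE(REPLACE(N'کلمه', NCHAR(1610), NCHAR(1740)), NCHAR(1603), NCHAR(1705))
Here the Arabic Yeh (1610) and Kaf (1603) are replaced with their Persian counterparts (1740 and 1705) on both the column and the search term, so either form matches.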

Related

How to use PATINDEX in SQL Server?

In SQL Server 2012, I have a column which has long text data. Somewhere within the text, there is some text of the format
{epa_file_num} = {138410-81}
If it exists, I want to extract out 138410-81 as a column value. In regular JS regex, I would use something like this { *epa_file_num *} *= *{ *\d*-?\d* *} to match the column, and then maybe a capturing group to get the value.
But how can I get it in SQL Server 2012?
Thanks
Not a regex, but this might do what you want:
DECLARE @Input VARCHAR(MAX)='{some name} = {a value} some text {epa_file_num} = {138410-81} other text'
SET @Input=REPLACE(@Input,' ','')
SET @Input=SUBSTRING(@Input,NULLIF(PATINDEX('%{epa_file_num}={%',@Input),0)+LEN('{epa_file_num}={'),LEN(@Input))
SET @Input=SUBSTRING(@Input,1,NULLIF(CHARINDEX('}',@Input),0)-1)
SELECT @Input
First, I remove all the spaces, then I look for {epa_file_num}= and take everything after this string, until the next }.
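Applied to a table column instead of a variable, the same steps can be chained with CROSS APPLY; a sketch assuming a hypothetical table dbo.Documents with an Id column and the long text in a column called LongText:
SELECT d.Id,
       epa_file_num = SUBSTRING(s2.v, 1, NULLIF(CHARINDEX('}', s2.v), 0) - 1)
FROM dbo.Documents AS d  -- hypothetical table and column names
CROSS APPLY (SELECT REPLACE(d.LongText, ' ', '')) AS s1(v)
CROSS APPLY (SELECT SUBSTRING(s1.v,
                 NULLIF(PATINDEX('%{epa_file_num}={%', s1.v), 0) + LEN('{epa_file_num}={'),
                 LEN(s1.v))) AS s2(v)
Rows that do not contain the pattern simply return NULL for epa_file_num, because NULLIF turns the failed PATINDEX/CHARINDEX into NULL.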

TSQL - Split delimited String into columns

The Problem
I have a number of filename strings that I want to parse into columns using a tilde as a delimiter. The strings follow a fixed format:
Filepath (example: C:\My Documents\PDF)
Surname (example: Walker)
First Name (example: Thomas)
Birth Date (example: 19991226)
Document Created Datetime (example: 20180416150322)
Document Extension (example: .pdf)
So a full concatenated example would be something like:
C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf
I want to ignore the file path and extension given in the string and only parse the following values into columns:
Surname, First Name, Birth Date, Document Created Datetime
So something like:
SELECT Surname = --delimitedString[0]
FirstName = --delimitedString[1]
--etc.
What I have tried
I know that I have several tasks to perform in order to split the string: first, I need to trim off the extension and file path so that I am left with a string delimited by tildes (~).
This is problem one for me; problem two is splitting the new delimited string itself, i.e.
Walker~Thomas~19991226~20180416150322
I've had a good read through this very comprehensive question, and it seems (as I'm using SQL Server 2008 R2) the only options are to use either a function with loops, recursive CTEs, or a very messy attempt using SUBSTRING() with CHARINDEX().
I'm aware that if I had access to SQL Server 2016 I could use STRING_SPLIT, but unfortunately I can't upgrade.
I do have access to SSIS, but I'm very new to it, so I decided to attempt the bulk of the work within a SQL statement.
Here is a way without a splitter that shouldn't be too complicated...
declare @var table (filepath varchar(256))
insert into @var values
('C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf')
;with string as(
select
x = right(filepath,charindex('\',reverse(filepath))-1)
from @var
)
select
SurName = substring(x,1,charindex('~',x) - 1)
-- length runs from just after the first '~' up to the second '~'
,FirstName = substring(x,charindex('~',x) + 1,charindex('~',x,charindex('~',x) + 1) - charindex('~',x) - 1)
from string
I know you mentioned wanting to avoid the charindex() option if at all possible, but I worked it out in a hopefully semi-readable way. I find it somewhat easy to read complex functions like this when I space each parameter on a different line and use indent levels. It's not the most proper looking, but it helps with legibility:
with string as (select 'C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf' as filepath)
select
substring(
filepath,
len(filepath)-charindex('\',reverse(filepath))+2, --start location, after last '\'
len(filepath)- --length of path
(len(filepath)-charindex('\',reverse(filepath))+2)- --less characters up to last '\'
(len(filepath)-charindex('.',filepath)) --less file extension
)
from string
Fritz already has a great start; my answer just adds on top of it:
with string as (select 'C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf' as filepath)
, newstr as (
select
REPLACE(substring(
filepath,
len(filepath)-charindex('\',reverse(filepath))+2, --start location, after last '\'
len(filepath)- --length of path
(len(filepath)-charindex('\',reverse(filepath))+2)- --less characters up to last '\'
(len(filepath)-charindex('.',filepath)) --less file extension
) , '~', '.') as new_part
from string
)
SELECT
PARSENAME(new_part,4) as Surname,
PARSENAME(new_part,3) as [First Name],
PARSENAME(new_part,2) as [Birth Date],
PARSENAME(new_part,1) as [Document Created Datetime]
FROM newstr
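With the sample path above, new_part becomes Walker.Thomas.19991226.20180416150322, so the query returns Walker, Thomas, 19991226 and 20180416150322 in the four columns. Note that PARSENAME only handles up to four dot-separated parts, which happens to be exactly what this filename format produces once the path and extension are trimmed off.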

SQL Server 2008 select statement that ignores non alphanumeric characters

I have an interesting SQL Server search requirement.
Say I have a table with Part Numbers as follows:
PARTNO DESCRIPTION
------ -----------
ABC-123 First part
D/12a92 Second Part
How can I create a search that will return results if I search, say, for 'D12A'?
I currently have a full text search set up for the description column, but I am looking to find parts that match the part no even when users don't include the / or - etc.
I'd rather do this in a single SQL statement rather than creating functions if possible as we only have read access to the DB.
You could do something like:
SELECT * FROM PART_TABLE
WHERE REPLACE(REPLACE(PARTNO,'/', ''),'-','') LIKE '%D12A%'
This would work for the two characters you specified and could be extended for more characters like so:
SELECT * FROM PART_TABLE
WHERE REPLACE(REPLACE(REPLACE(PARTNO,'/', ''),'-',''),'<next char>','') LIKE '%D12A%'
Probably not the most elegant of solutions unless your special characters are limited. Otherwise I'd suggest writing a Function to strip out non-alphanumeric characters.
Here is an example of such a function:
CREATE FUNCTION dbo.udf_AlphaNumericChars
(
    @String VARCHAR(MAX)
)
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @RemovingCharIndex INT
    SET @RemovingCharIndex = PATINDEX('%[^0-9A-Za-z]%',@String)
    WHILE @RemovingCharIndex > 0
    BEGIN
        SET @String = STUFF(@String,@RemovingCharIndex,1,'')
        SET @RemovingCharIndex = PATINDEX('%[^0-9A-Za-z]%',@String)
    END
    RETURN @String
END
------- Query Sample (untested)---------
SELECT *
FROM PART_TABLE
WHERE DBO.udf_AlphaNumericChars(PARTNO) LIKE '%D12A%'
Taken From: http://sqlserver20.blogspot.co.uk/2012/06/find-alphanumeric-characters-only-from.html

H2 DB CSVWRITE Duplicate Double Quotes Inside a String

I was trying to export a table in an H2 DB to CSV using the CSVWRITE function and found out that if double quotes are included in a varchar column, they are duplicated.
E.g. 'hello"howareyou' becomes 'hello""howareyou' in the written CSV.
I tried saving this varchar column with escape characters and a few other combinations, but the result is the same.
Below is the table column I created to test this issue and the resulting CSV value I got.
My column          CSV written value
---------------    -----------------
hello"how          hello""how
hello\"how         hello\""how
hello""how         hello""""how
hello\""how        hello\""""how
hello\\"how        hello\\""how
hello\\\\"how      hello\\\\""how
hello["]how        hello[""]how
hello&quote;how    hello&quote;how
Following is my CSVWrite command:
CALL CSVWRITE(
'#DELTA_CSV_DIR#/DELTA.csv',
'SELECT ccc from temptemp',
null, '|', '');
Am I doing this wrong, or is there any option or workaround I can use to avoid this situation?
Thanks in advance.
You are currently using the built-in CSVWRITE function with the following options:
fileName = '#DELTA_CSV_DIR#/DELTA.csv'
query = 'SELECT ccc from temptemp'
characterSet = default (UTF-8)
fieldSeparator = '|'
fieldDelimiter = '' (empty string)
As documented, the default escape character is a double quote, so that double quotes are escaped using a double quote (in the same way as you need to escape a backslash within a Java string with a backslash). The escape character is needed to escape the field separator.
You can disable the escape character as follows:
CALL CSVWRITE(
'#DELTA_CSV_DIR#/DELTA.csv',
'SELECT ccc from temptemp',
'fieldSeparator=| fieldDelimiter= escape=');
This is also using the more readable new format for options.

codeigniter insert : special characters trims my string values

I have a problem executing insert commands that are loaded from a text file. I use the CodeIgniter "file" helper to load an SQL line, then I perform a simple db->query(content of my file). The problem is that when the SQL is loaded from a file, the special character cuts off the rest of the string.
Here is an example that works
INSERT INTO test(test) VALUES("<p>There is <strong>no special character</strong> in this string</p>");
Example that will not work
INSERT INTO test(test) VALUES("<p>this character <em>é</em> is a <strong>special character</strong></p>");
In the second example, only "<p> this character <em>" will be saved. This is weird because if I execute the same line in phpMyAdmin it works fine.
Does anyone know why this happens, or what I am doing wrong?
Thanks
Here are simple steps to reproduce.
Simple table :
CREATE TABLE `test` (
`test` TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
) ENGINE = InnoDB;
A file "application/view/text.txt" that contains :
INSERT INTO test(test) VALUES("<p>this character <em>é</em> is a <strong>special character</strong></p>");
The code I use to perform the insert
$this->load->helper('file');
$loaded_sql = read_file(BASEPATH . "../application/views/test.txt");
$this->db->query($loaded_sql);
My database config
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
CI Config
$config['charset'] = 'UTF-8';
I finally got it. I needed to use utf8_encode() when reading the file to ensure that the special character gets encoded properly. The text file must be encoded in ANSI (the default Notepad encoding). If the file is UTF-8 or Unicode, it won't work!
Code that resolved the problem:
$loaded_sql = utf8_encode( read_file(BASEPATH . "../application/views/test.txt") );
Try using this before inserting the record into the table:
$this->db->db_set_charset('latin1', 'latin1_swedish_ci');
Make sure you have the same setting on the table and the table column.
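If the table was created with a different character set than the connection, one way to bring them in line is to convert the table; a sketch using the table from the question (swap in whichever character set you finally settle on):
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_swedish_ci;
CONVERT TO CHARACTER SET changes both the table default and the existing text columns, so the column and connection settings match afterwards.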
