I am trying to load a table from a txt file which has Chinese characters in it, using the \copy command in PostgreSQL. I have a test table with only one column, Names varchar(25), in it. When I run an insert statement from psql or PgAdmin like
insert into test values ('康狀態概');
it works and inserts the value correctly. But when I put the same value in a text file (say test.txt) and try to use the copy command to load its contents,
\copy test from 'D:/database/test.dat';
it throws the error ERROR: character with byte sequence 0xa6 0x82 in encoding "GBK" has no equivalent in encoding "UTF8", and only on Windows. I changed the locale and keyboard settings of the Windows server to Chinese, but the error occurs with or without the locale change.
The same exercise works fine on Linux without any changes.
Can someone please suggest what needs to be done here?
Thanks
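For reference, the bytes named in the error are just the tail of the UTF-8 encoding of the sample text: the message shows psql sending the file as GBK (the client_encoding picked up from the Windows locale) while the file itself is UTF-8. A small sketch of the mismatch (Python, purely illustrative):
# The last character, '概', encodes in UTF-8 as e6 a6 82 - the a6 82
# sequence is exactly the one named in the error message.
data = '康狀態概'.encode('utf-8')
print(data.hex(' '))   # Python 3.8+ for the separator argument
Setting the client encoding to match the file before the copy (e.g. \encoding UTF8 in psql) should avoid the GBK conversion.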
I am trying to execute a query by reading the content of a SQL script file into a variable and then executing that content. I then get an error saying Could not find stored procedure 'ÿþ'. Please help me understand the issue. Thank you.
Info:
SQL Server 2014
SSMS Version - 12.0.4100.1
ÿþ is one way to interpret the two bytes of the UTF-16 byte order mark, \xFF and \xFE.
You get those two letters when you read a file that has been saved in the UTF-16 encoding with a tool that is unaware of Unicode or, more likely, was not configured to use it.
For example, when you edit a text file with Windows Notepad and select "Unicode" as the file encoding when you save it, Notepad will use UTF-16 to save the file and will mark it with those two bytes at the start.
If whatever tool you use to read the file is unaware that the file is Unicode, it will use your computer's default byte encoding to decode the text file.
Now, if that default encoding happens to be Windows-1252, as in your case, then ÿþ is what you get, because there \xFF is ÿ and \xFE is þ.
Consequently, when presented with ÿþ, SQL Server thinks it must be the name of a stored procedure, because stored procedures are the only statements that you can run by mentioning just their name. And it dutifully reports that it can't find a procedure of that name.
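To make that concrete, here is a small sketch (Python, with a hypothetical script.sql file name) showing both the BOM bytes and how reading with the right encoding makes them disappear:
# The UTF-16 LE byte order mark, decoded as Windows-1252, is exactly 'ÿþ'.
bom = b'\xff\xfe'
print(bom.decode('cp1252'))       # -> ÿþ
# Python's utf-16 codec prepends the same BOM (on a little-endian machine).
print('x'.encode('utf-16')[:2])   # -> b'\xff\xfe'
# Reading the script with an encoding-aware call strips the BOM instead:
# text = open('script.sql', encoding='utf-16').read()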
I am working on a program which takes file/folder names as input. Currently, when I try to run a file which has foreign-language characters in its name, each of those characters is replaced by a ?. I am running my exe from the command prompt, so trying to run that particular file results in an error. When I use DIR in the command prompt, it displays a ? for each character of the file name. Is there any way to display the actual foreign-language characters in the command prompt? I believe this could be what causes my exe not to work with any of those files.
This is the text that I am trying to read - 科普書籍推展教案 which is being replaced by ? on the console.
The command prompt can only display characters in your current ANSI code page (ACP). So, if you have files with names outside the ACP, you're going to see ?. You can use chcp to pick a different code page, but there is no code page for full Unicode in the DOS box.
Inside your code, you need to learn to use the wide-character ('W') APIs to work with full Unicode pathnames. The safest thing is to just #define _UNICODE and use it uniformly.
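Purely for illustration, here is what that looks like in a language that already calls the wide 'W' file APIs on Windows (a Python 3 sketch, not C): the names arrive in your program intact even when the console cannot draw them.
import os
for name in os.listdir('.'):
    # repr() escapes anything the console code page cannot display,
    # so you can verify the real characters reached your program.
    print(repr(name))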
I am trying to do a bulk copy from a file into a table in SQL Server 2008. It is only a part of the file that I want to copy (in the BCP command example below, it is only line number 3 I want to copy).
Sometimes I receive the "Unexpected EOF encountered" error message.
The command is:
BCP tblBulkCopyTest in d:\bc_file.txt -F3 -L3 -c -t, -S(local) -Utest -Ptest
When the file looks like the following, the copy works fine (the line number 3 is inserted into my table):
44261233,207,0,0,2168
44261234,207,0,0,2570
44261235,207,0,0,2572
When the file looks like the following, I receive "Unexpected EOF encountered" error message:
Test
44261233,207,0,0,2168
44261234,207,0,0,2570
44261235,207,0,0,2572
It seems that when the file starts with something not in the correct format, the BCP command fails, even though it is not a line I want to copy (in the BCP command I have specified line number 3).
When the file looks like the following, the copy works fine:
44261233,207,0,0,2168
44261234,207,0,0,2570
44261235,207,0,0,2572
Test
So it is only when the file has some incorrect data "before" the lines I want to copy, that I receive the error.
Any suggestions on how to make the BCP command ignore the lines not in question?
The way I solve this kind of error every day:
1) Create a table with a single column: tblRawBulk(RawRow varchar(1000)).
2) Insert all of the rows there.
3) Remove the unnecessary unformatted rows (e.g. 'Test' in your example) with a WHERE clause. Or even remove all unnecessary rows (except the rows that you need to load) to simplify step 5.
4) Export this table with BCP to some work folder.
5) Load this new file.
It's not exactly what you want, but it may help if you write a corresponding stored procedure that can do all of these things automatically. A pre-filtering alternative is sketched below.
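As an alternative to the staging-table route above, the input file can be pre-filtered so that BCP never sees the malformed rows. A minimal sketch (Python; the source path is taken from the question, the cleaned path is hypothetical):
# Keep only rows that look like the expected five-column numeric CSV data;
# anything else (e.g. the stray 'Test' line) is dropped before BCP runs.
with open(r'd:\bc_file.txt') as src, open(r'd:\bc_file.clean.txt', 'w') as dst:
    for line in src:
        fields = line.strip().split(',')
        if len(fields) == 5 and all(f.isdigit() for f in fields):
            dst.write(line)
Note that the -F and -L switches would then refer to line numbers in the cleaned file.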
I'm working on a project where I have to parse a bunch of .csv files, all of different formats and containing different kinds of data through some C++ functions. After that I extract data from the files and create a .sql file that can be imported in psql to insert the data into a PostgreSQL database at a later stage.
But I am not able to figure out the correct syntax for the .sql file. Here is a sample table and a sample .sql file reproducing the same errors I am getting:
Table Creation Code:
CREATE TABLE "Sample_Table"
(
"Col_ID" integer NOT NULL,
"Col_Message" character varying(50),
CONSTRAINT "Sample_Table_pkey" PRIMARY KEY ("Col_ID" )
)
insertion.sql (after the copy line, fields separated by a single tab character)
copy Sample_Table (Col_ID, Col_Message) from stdin;
1 This is Spaaarta
2 Why So Serious
3 Baazinga
\.
Now if I execute the above sql file, I get the following error:
ERROR: syntax error at or near "1"
LINE 2: 1 This is Spaaarta
^
********** Error **********
If it can help, I'm running a PostgreSQL 9.1 release, and all the above queries were executed through pgAdmin III.
PgAdmin doesn't support executing COPY ... FROM stdin scripts the way psql does (or at least, it didn't the last time I tried it, with version 1.14). Use psql to execute the script, or use INSERT statements instead, e.g. INSERT INTO "Sample_Table" ("Col_ID", "Col_Message") VALUES (1, 'This is Spaaarta');
Three things to check:
Is there actually exactly one tab character between the columns? Spaces are a no-go.
Are there more error messages? I'm missing at least one. (See below.)
When you force case-sensitive table and column names, you have to do so consistently. Therefore you must write this:
copy "Sample_Table" ("Col_ID", "Col_Message") from stdin;
Otherwise you will get these errors:
psql:x.sql:1: ERROR: relation "sample_table" does not exist
psql:x.sql:5: invalid command \.
psql:x.sql:5: ERROR: syntax error at or near "1"
LINE 1: 1 This is Spaaarta
^
With these things in place I can use your example data successfully.
EDIT: The question has changed; the questioner now has
ERROR: invalid input syntax for integer: "1 'This is Spaaarta'"
So something with the 1 is not OK.
My guess is that this is an encoding problem. Windows with its UTF-16 stuff might be the culprit here.
Debugging this kind of problem over the web is not easy, because too many semi-intelligent programs are in the chain, and most of them like to adjust "a few" things.
But first check a few things in psql:
\encoding
show client_encoding;
show server_encoding;
According to the pastebin data these should all be the same, and one of SQL_ASCII, LATIN1, or UTF8.
If they already are, or if adjusting them does not help: Unix/Linux/Cygwin has hexdump; run hexdump -C x.sql and post its output to pastebin. DO NOT USE the hex dump feature of any Windows editor like UltraEdit - they have fooled me several times. When transferring the file to Linux, be sure to use binary transfer.
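If no trustworthy hexdump is at hand, a few lines of Python give the same raw view of the bytes (assuming the x.sql name from above):
# Print the first bytes exactly as stored on disk, with no editor in between.
with open('x.sql', 'rb') as f:
    data = f.read(64)
print(data.hex(' '))   # Python 3.8+ for the separator argument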
I'm not certain when this first occurred.
I have a new drop-shipping affiliate website, and receive an exported copy of the product catalog from the wholesaler. I format and import this into Prestashop 1.4.4.
The front end of the website contains combinations of strange characters inside product text: Ã, Ã, ¢, â‚ etc. They appear in place of common characters like , - : etc.
These characters are present in about 40% of the database tables, not just product specific tables like ps_product_lang.
Another website thread says this same problem occurs when the database connection string uses an incorrect character encoding type.
In /config/setting.inc, there is no character encoding string mentioned, just the MySQL Engine, which is set to InnoDB, which matches what I see in PHPMyAdmin.
I exported ps_product_lang, replaced all instances of these characters with correct characters, saved the CSV file in UTF-8 format, and reimported them using PHPMyAdmin, specifying UTF-8 as the language.
However, after doing a new search in PHPMyAdmin, I now have about 10 times as many instances of these bad characters in ps_product_lang than I started with.
If the problem is as simple as specifying the correct language attribute in the database connection string, where/how do I set this, and what to?
Incidentally, I tried running this command in PHPMyAdmin, mentioned in this thread, but the problem remains:
SET NAMES utf8
UPDATE: PHPMyAdmin says:
MySQL charset: UTF-8 Unicode (utf8)
This is the same character set I used in the last import file, which caused more character corruptions. UTF-8 was specified as the charset of the import file during the import process.
UPDATE2
Here is a sample:
people are truly living untetheredâ€ïâ€Â
Ã‚ï† buying and renting movies online, downloading software, and
sharing and storing files on the web.
UPDATE3
I ran an SQL command in PHPMyAdmin to display the character sets:
character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
So, perhaps my database needs to be converted (or deleted and recreated) to UTF-8. Could this pose a problem if the MySQL server is latin1?
Can MySQL handle the translation of serving content as UTF8 but storing it as latin1? I don't think it can, as UTF8 is a superset of latin1. My web hosting support has not replied in 48 hours. Might be too hard for them.
If the charset of the tables is the same as that of their content, try to use mysql_set_charset('UTF8', $link_identifier). Note that MySQL uses UTF8 to specify the UTF-8 encoding, instead of UTF-8, which is more common.
Check my other answer on a similar question too.
This is surely an encoding problem. You have one encoding in your database and a different one on your website, and this mismatch is the cause of the problem. Also, if you only ran that command, you still have to update the records that are already in your tables to convert those characters to UTF-8.
Update: Based on your last comment, the core of the problem is that you have a database and a data source (the CSV file) which use different encodings. Hence you can either convert your database to UTF-8 or, at least, convert the data you get from the CSV from UTF-8 to latin1.
You can do the conversion by following this article:
Convert latin1 to UTF8
http://wordpress.org/support/topic/convert-latin1-to-utf-8
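In the same spirit, re-encoding a dump file that really is latin1 can be done in a couple of lines (a Python sketch; both file names are hypothetical):
# Read the dump as latin1 and write it back out as UTF-8.
with open('dump.latin1.sql', encoding='latin1') as src, \
     open('dump.utf8.sql', 'w', encoding='utf-8') as dst:
    dst.write(src.read())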
This appears to be a UTF-8 encoding issue that may have been caused by a double-UTF8-encoding of the database file contents.
This situation can arise from factors such as the character set that was (or was not) selected, for instance when a database backup file was created, and the file format and encoding the database file was saved with.
I have seen these strange UTF-8 characters in the following scenario (the description may not be entirely accurate as I no longer have access to the database in question):
As I recall, the database and tables there had a utf8_general_ci collation.
Backup is made of the database.
Backup file is opened on Windows in UNIX file format and with ANSI encoding.
Database is restored on a new MySQL server by copy-pasting the contents from the database backup file into phpMyAdmin.
Looking into the file contents:
Opening the SQL backup file in a text editor shows that it has strange characters such as "sÃ¥". On a side note, you may get different results when opening the same file in another editor. I use TextPad here, but opening the same file in SublimeText showed "så", because SublimeText correctly interpreted the file as UTF-8. This is a bit confusing when you start trying to fix the issue in PHP, because you don't see the right data in SublimeText at first. Anyway, that can be resolved by taking note of which encoding your text editor uses when presenting the file contents.
The strange characters are double-encoded UTF-8 characters, so in my case the first "Ã" part equals "Ã" and "Â¥" equals "¥" (this is my first "encoding"). The "Ã¥" characters equal the UTF-8 encoding of "å" (this is my second encoding).
So, the issue is that "false" (UTF8-encoded twice) utf-8 needs to be converted back into "correct" utf-8 (only UTF8-encoded once).
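A minimal sketch of that double round-trip (Python; cp1252 stands in for the single-byte encoding involved, which may differ on other systems):
# Each faulty round-trip decodes UTF-8 bytes as cp1252; undoing it runs
# the same steps in reverse, once per encoding layer.
once = 'å'.encode('utf-8').decode('cp1252')    # first layer of mojibake
twice = once.encode('utf-8').decode('cp1252')  # the "double encoding"
fixed = twice.encode('cp1252').decode('utf-8').encode('cp1252').decode('utf-8')
print(once, twice, fixed)                      # mojibake twice, then 'å' again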
Trying to fix this in PHP turns out to be a bit challenging:
utf8_decode() is not able to process the characters.
// Fails silently (as in - nothing is output)
$str = "så";
$str = utf8_decode($str);
printf("\n%s", $str);
$str = utf8_decode($str);
printf("\n%s", $str);
iconv() fails with "Notice: iconv(): Detected an illegal character in input string".
echo iconv("UTF-8", "ISO-8859-1", "så");
Another otherwise fine and plausible solution fails silently too in this scenario:
$str = "så";
echo html_entity_decode(htmlentities($str, ENT_QUOTES, 'UTF-8'), ENT_QUOTES , 'ISO-8859-15');
mb_convert_encoding() fails silently as well:
$str = "så";
echo mb_convert_encoding($str, 'ISO-8859-15', 'UTF-8');
// (No output)
Trying to fix the encoding in MySQL by converting the database character set and collation to UTF-8 was unsuccessful:
ALTER DATABASE myDatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
I see a couple of ways to resolve this issue.
The first is to make a backup with correct encoding (the encoding needs to match the actual database and table encoding). You can verify the encoding by simply opening the resulting SQL file in a text editor.
The other is to replace double-UTF8-encoded characters with single-UTF8-encoded characters. This can be done manually in a text editor. To assist in this process, you can manually pick incorrect characters from Try UTF-8 Encoding Debugging Chart (it may be a matter of replacing 5-10 errors).
Finally, a script can assist in the process:
$str = "så";
// The two arrays can also be generated by double-encoding values in the first array and single-encoding values in the second array.
$str = str_replace(["Ã","Â¥"], ["Ã","¥"], $str);
$str = utf8_decode($str);
echo $str;
// Output: "så" (correct)
Today I encountered quite a similar problem: mysqldump dumped my UTF-8 database, encoding each UTF-8 diacritic character as two latin1 characters, although the file itself is regular UTF-8.
For example: "é" was encoded as the two characters "é". These two characters correspond to the two-byte UTF-8 encoding of the letter, but they should be interpreted as a single character.
To solve the problem and correctly import the database on another server, I had to convert the file using the ftfy (stands for "Fixes Text For You") Python library (https://github.com/LuminosoInsight/python-ftfy). The library does exactly what I expect: transform badly encoded UTF-8 into correctly encoded UTF-8.
For example: the latin1 combination "é" is turned into an "é".
ftfy comes with a command-line script, but it transforms the file in a way that cannot be imported back into MySQL.
I wrote a Python 3 script to do the trick:
#!/usr/bin/python3
# coding: utf-8
import ftfy
# Set input_file
input_file = open('mysql.utf8.bad.dump', 'r', encoding="utf-8")
# Set output file
output_file = open('mysql.utf8.good.dump', 'w', encoding="utf-8")
# Create fixed output stream
stream = ftfy.fix_file(
input_file,
encoding=None,
fix_entities='auto',
remove_terminal_escapes=False,
fix_encoding=True,
fix_latin_ligatures=False,
fix_character_width=False,
uncurl_quotes=False,
fix_line_breaks=False,
fix_surrogates=False,
remove_control_chars=False,
remove_bom=False,
normalization='NFC'
)
# Write the fixed stream to the output file
for line in stream:
    output_file.write(line)
input_file.close()
output_file.close()
Apply these two things.
You need to set the character set of your database to be utf8.
You need to call mysql_set_charset('utf8') in the file where you make the connection to the database: right after selecting the database with mysql_select_db, use mysql_set_charset. That will allow you to add and retrieve data properly in whatever language.
The error usually gets introduced during creation of the CSV. Try using Linux for saving the CSV as a Text CSV. LibreOffice on Ubuntu can enforce the encoding to be UTF-8; that worked for me.
I wasted a lot of time trying this on Mac OS. Linux is the key. I've tested on Ubuntu.
Good Luck