BCP escape \n in data - sql-server

I'm using BCP to download data from SQL Server, using queryout option.
However, I notice that if the data content in any columns contain '\n', the content exported from BCP will be treated as newline.
For example, if the data in SQL Server is:
COLUMN_1 COLUMN_2
AAA NAME\nSURNAME
BBB NAMESURNAME
The exported file be like:
AAA NAME
SURNAME
BBB NAMESURNAME
Refer to BCP document, as I understand, the -c should not treat \n as newline.
-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
I'm not sure what I misunderstood.
Here is the command I use:
bcp "select [col_name] from [table_name] where [condition]" queryout test.dat -U[username] -P[password] -S[serverip.port] -c
Thank you.

If your data contains newline or crlf control characters then those characters WILL, naturally, be included in the data that is copied out.
Are the control characters supposed to be there? If so, then leave them and they will be imported into whatever your destination is. Just because your text view shows a "broken" line, does not mean that SQL Server cannot ingest that line again and keep the control character tucked into the data (again, if those control characters belong there... i've seen plenty of cases where they would be).
If the newline character "\n" (or any control character for that matter) is NOT desired, then it's just a matter of doing as you commented in Martin's answer. Just clean the data either before you query it ("update") or during query (as you commented with "select/replace") or after you've copied the data out.
I've used "file cleaner" applications in the past to "clean" a file of unwanted characters (this can be an issue with long-lived data or data that has traversed various platforms or has been touched by humans!!! Yuck!).

I assume that your text includes the actual \n control character rather than simply the characters \ and n next to each other?
Where this exists then your options are to either use native mode or change the row terminator to be something other than \n so that it recognises the correct pattern.
I'd suggest using native mode and test whether that re-imports the data correctly with the \n in place.

Related

BCP Fixed Width Import -> Unexpected EOF encountered in BCP data-file?

I have some sensitive information that I need to import into SQL Server that is proving to be a challenge. I'm not sure what the original database that housed this information was, but I do know it is provided to us in a Unix fixed length text file with LF row terminator. I have two files: a small file that covers a month's worth of data, and a much larger file that covers 5 years worth of data. I have created a BCP format file and command that successfully imports and maps the data to my SQL Server table.
The 5 year data is supposedly in the same format, so I've used the same command and format file on the text file. It starts processing some records, but somewhere in the processing (after several thousand records), it throws Unexpected EOF encountered and I can see in the database some of the rows are mapped correctly according to the fixed lengths, but then something goes horribly wrong and screws up by inserting parts of data in columns they most definitely do not belong in. Is there a character that would cause BCP to mess up and terminate early?
BCP Command: BCP DBTemp.dbo.svc_data_temp in C:\Test\data2.txt -f C:\test\txt2.fmt -T -r "0x0A" -S "stageag,90000" -e log.rtf
Again, format file and command work perfectly for the the smaller data set, but something in the 5 year dataset is screwing up BCP.
Thanks in advance for the replies!
So I found the offending characters in my fixed width file. Somehow whoever pulled the data originally (I don't have access to the source), escaped (or did not escape correctly) the double quotes in some of the text, resulting in some injection of extra spaces breaking the fixed width guidelines we were supposed to be following. After correcting the double quotes by hex editing the file, BCP was able to process all records using the format file without issue. I had used the -F and -L flags to examine certain rows of the data and to narrow it down to where I could visually compare the rows that were ok and the rows where the problems started to arise, which led me to discover the double quotes issue. Hope his helps for somebody else if they have an issue similar to this!

Import CSV files using BCP to Azure SQL Server

My BCP command looks like this:
BCP azuredatabase.dbo.rawdata IN "clientPathToCSVFile" -S servername -U user#servername -P pw -c -t,-r\n
My CSV file is in {cr}{lf} format.
My CSV looks like this
125180896918,20,9,57.28,2020-01-04 23:02:21,282992,1327,4,2850280,49552
125180896919,20,10,57.82,2020-01-04 23:02:21,282992,1298,4,2850280,48881
125180896920,16,11,58.20,2020-01-04 23:02:21,282992,1065,4,2850280,48612
125180896921,20,12,69.10,2020-01-04 23:02:21,282992,515,4,2850280,10032
125180896922,20,13,70.47,2020-01-04 23:02:21,282992,1280,4,2850280,48766
125180896923,1,1,105.04,2020-01-04 23:02:21,,1296,4,2969398,49161
As you can see there are also empty fields.
My output looks like this
Starting copy...
0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 547
So how do I correctly setup my command for BCP?
You stated that your data is in CRLF format (that means \r\n). But your bcp command is told to look for a line terminator of \n (using the -r option).
I would have expected to see the first half of your actual CRLF line terminator of "\r\n" be split in half with the \r being included in your last column and the \n being found as the line terminator, but it look like BCP loaded no rows because it found no \n in your file.
I have not worked with Azure/BCP much, so maybe someone else knows how BCP for Azure would handle this, but the SQL Server version of BCP would still find your \n and then load the \r into your last column.
Either that or your line terminator is now what you think it is. Have you viewed the file with a text editor (not notepad, not wordpad... something that will show hidden characters like line terminators and tabs and such).
Usually, when BCP loads no rows (and there are rows in the file to load), it could be a mixup with line terminators.

Which is the best character to use as a delimiter for ETL?

I recently unloaded a customer table from an Informix DB and several rows were rejected because the customer name column contained non-escaped vertical bars (pipe symbol) characters, which is the default DBDELIMITER in the source db. I found out that the field in their customer form has an input mask allowing any alphanumeric character to be entered, which can include any letters, numbers or symbols. So I persuaded the user to run a blanket update on that column to change the pipe symbol to a semicolon. I also discovered other rows containing asterisks and commas in different columns. I could imagine what would happen if this table were to be unloaded in csv format or what damage the asterisks could do!
What is the best character to define as a delimiter?
If tables are already tainted with pipes, commas, asterisks, tabs, backslashes, etc., what's the best way to clean them up?
I have to deal with large volumes of narrative data at my job. This is always a nightmare because users are apt to put ANY character in there, including unprintable characters. You can run a cleanup operation, but you have to do it every time you load data, and it likely won't work forever. Eventually someone will put in what every character you choose as a separator, which is not a problem if your CSV handling libraries can handle escaping properly, but many can't. If this is a one time load/unload, you're probably fine, but if you have to do it more often....
In the past I've changed the separator to the back-tick '`', the tilde '~', or the caret '^'. All failed in the current effort. The best solution I could come up with is to not use CSV format at all. I switched to XML. Even so there were still XML illegal characters, but these can be translated out with atlassian-xml-cleaner-0.1.jar.
Unload customer table with default pipe; string search for a character that doesn't exist. ie. "~"
unload to file delimiter "~"
select * from customer;
Clean your file (or not)
(vi replace string):g/theoldstring/s//thenewstring/g)
or
(unix prompt) sed 's/old-char/new-char/g' fileold > filenew
(Once clean id personally change back "~" in unload file to "|" or "," as csv standard)
Load to source db.
If you can, use a multi-character delimiter. It can still fail, but it should be much more highly unlikely.
Or, escape the delimiter while writing the export file (Informix docs say "LOAD TABLE" escapes by prefixing delimiter characters with backslash). Proper CSV has quoting and escaping so it shouldn't matter if a comma is in the data, unless your exporter and loader cannot handle proper CSV.

Strange Characters in database text: Ã, Ã, ¢, â‚ €,

I'm not certain when this first occured.
I have a new drop-shipping affiliate website, and receive an exported copy of the product catalog from the wholesaler. I format and import this into Prestashop 1.4.4.
The front end of the website contains combinations of strange characters inside product text: Ã, Ã, ¢, â‚ etc. They appear in place of common characters like , - : etc.
These characters are present in about 40% of the database tables, not just product specific tables like ps_product_lang.
Another website thread says this same problem occurs when the database connection string uses an incorrect character encoding type.
In /config/setting.inc, there is no character encoding string mentioned, just the MySQL Engine, which is set to InnoDB, which matches what I see in PHPMyAdmin.
I exported ps_product_lang, replaced all instances of these characters with correct characters, saved the CSV file in UTF-8 format, and reimported them using PHPMyAdmin, specifying UTF-8 as the language.
However, after doing a new search in PHPMyAdmin, I now have about 10 times as many instances of these bad characters in ps_product_lang than I started with.
If the problem is as simple as specifying the correct language attribute in the database connection string, where/how do I set this, and what to?
Incidently, I tried running this command in PHPMyAdmin mentioned in this thread, but the problem remains:
SET NAMES utf8
UPDATE: PHPMyAdmin says:
MySQL charset: UTF-8 Unicode (utf8)
This is the same character set I used in the last import file, which caused more character corruptions. UTF-8 was specified as the charset of the import file during the import process.
UPDATE2
Here is a sample:
people are truly living untetheredâ€ïâ€Â
Ã‚ï† buying and renting movies online, downloading software, and
sharing and storing files on the web.
UPDATE3
I ran an SQL command in PHPMyAdmin to display the character sets:
character_set_client utf8
character_set_connection utf8
character_set_database latin1
character_set_filesystem binary
character_set_results utf8
character_set_server latin1
character_set_system utf8
So, perhaps my database needs to be converted (or deleted and recreated) to UTF-8. Could this pose a problem if the MySQL server is latin1?
Can MySQL handle the translation of serving content as UTF8 but storing it as latin1? I don't think it can, as UTF8 is a superset of latin1. My web hosting support has not replied in 48 hours. Might be too hard for them.
If the charset of the tables is the same as it's content try to use mysql_set_charset('UTF8', $link_identifier). Note that MySQL uses UTF8 to specify the UTF-8 encoding instead of UTF-8 which is more common.
Check my other answer on a similar question too.
This is surely an encoding problem. You have a different encoding in your database and in your website and this fact is the cause of the problem. Also if you ran that command you have to change the records that are already in your tables to convert those character in UTF-8.
Update: Based on your last comment, the core of the problem is that you have a database and a data source (the CSV file) which use different encoding. Hence you can convert your database in UTF-8 or, at least, when you get the data that are in the CSV, you have to convert them from UTF-8 to latin1.
You can do the convertion following this articles:
Convert latin1 to UTF8
http://wordpress.org/support/topic/convert-latin1-to-utf-8
This appears to be a UTF-8 encoding issue that may have been caused by a double-UTF8-encoding of the database file contents.
This situation could happen due to factors such as the character set that was or was not selected (for instance when a database backup file was created) and the file format and encoding database file was saved with.
I have seen these strange UTF-8 characters in the following scenario (the description may not be entirely accurate as I no longer have access to the database in question):
As I recall, there the database and tables had a "uft8_general_ci" collation.
Backup is made of the database.
Backup file is opened on Windows in UNIX file format and with ANSI encoding.
Database is restored on a new MySQL server by copy-pasting the contents from the database backup file into phpMyAdmin.
Looking into the file contents:
Opening the SQL backup file in a text editor shows that the SQL backup file has strange characters such as "sÃ¥". On a side note, you may get different results if opening the same file in another editor. I use TextPad here but opening the same file in SublimeText said "så" because SublimeText correctly UTF8-encoded the file -- still, this is a bit confusing when you start trying to fix the issue in PHP because you don't see the right data in SublimeText at first. Anyways, that can be resolved by taking note of which encoding your text editor is using when presenting the file contents.
The strange characters are double-encoded UTF-8 characters, so in my case the first "Ã" part equals "Ã" and "Â¥" = "¥" (this is my first "encoding"). THe "Ã¥" characters equals the UTF-8 character for "å" (this is my second encoding).
So, the issue is that "false" (UTF8-encoded twice) utf-8 needs to be converted back into "correct" utf-8 (only UTF8-encoded once).
Trying to fix this in PHP turns out to be a bit challenging:
utf8_decode() is not able to process the characters.
// Fails silently (as in - nothing is output)
$str = "så";
$str = utf8_decode($str);
printf("\n%s", $str);
$str = utf8_decode($str);
printf("\n%s", $str);
iconv() fails with "Notice: iconv(): Detected an illegal character in input string".
echo iconv("UTF-8", "ISO-8859-1", "så");
Another fine and possible solution fails silently too in this scenario
$str = "så";
echo html_entity_decode(htmlentities($str, ENT_QUOTES, 'UTF-8'), ENT_QUOTES , 'ISO-8859-15');
mb_convert_encoding() silently: #
$str = "så";
echo mb_convert_encoding($str, 'ISO-8859-15', 'UTF-8');
// (No output)
Trying to fix the encoding in MySQL by converting the MySQL database characterset and collation to UTF-8 was unsuccessfully:
ALTER DATABASE myDatabase CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE myTable CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
I see a couple of ways to resolve this issue.
The first is to make a backup with correct encoding (the encoding needs to match the actual database and table encoding). You can verify the encoding by simply opening the resulting SQL file in a text editor.
The other is to replace double-UTF8-encoded characters with single-UTF8-encoded characters. This can be done manually in a text editor. To assist in this process, you can manually pick incorrect characters from Try UTF-8 Encoding Debugging Chart (it may be a matter of replacing 5-10 errors).
Finally, a script can assist in the process:
$str = "så";
// The two arrays can also be generated by double-encoding values in the first array and single-encoding values in the second array.
$str = str_replace(["Ã","Â¥"], ["Ã","¥"], $str);
$str = utf8_decode($str);
echo $str;
// Output: "så" (correct)
I encountered today quite a similar problem : mysqldump dumped my utf-8 base encoding utf-8 diacritic characters as two latin1 characters, although the file itself is regular utf8.
For example : "é" was encoded as two characters "é". These two characters correspond to the utf8 two bytes encoding of the letter but it should be interpreted as a single character.
To solve the problem and correctly import the database on another server, I had to convert the file using the ftfy (stands for "Fixes Text For You). (https://github.com/LuminosoInsight/python-ftfy) python library. The library does exactly what I expect : transform bad encoded utf-8 to correctly encoded utf-8.
For example : This latin1 combination "é" is turned into an "é".
ftfy comes with a command line script but it transforms the file so it can not be imported back into mysql.
I wrote a python3 script to do the trick :
#!/usr/bin/python3
# coding: utf-8
import ftfy
# Set input_file
input_file = open('mysql.utf8.bad.dump', 'r', encoding="utf-8")
# Set output file
output_file = open ('mysql.utf8.good.dump', 'w')
# Create fixed output stream
stream = ftfy.fix_file(
input_file,
encoding=None,
fix_entities='auto',
remove_terminal_escapes=False,
fix_encoding=True,
fix_latin_ligatures=False,
fix_character_width=False,
uncurl_quotes=False,
fix_line_breaks=False,
fix_surrogates=False,
remove_control_chars=False,
remove_bom=False,
normalization='NFC'
)
# Save stream to output file
stream_iterator = iter(stream)
while stream_iterator:
try:
line = next(stream_iterator)
output_file.write(line)
except StopIteration:
break
Apply these two things.
You need to set the character set of your database to be utf8.
You need to call the mysql_set_charset('utf8') in the file where you made the connection with the database and right after the selection of database like mysql_select_db use the mysql_set_charset. That will allow you to add and retrieve data properly in whatever the language.
The error usually gets introduced while creation of CSV. Try using Linux for saving the CSV as a TextCSV. Libre Office in Ubuntu can enforce the encoding to be UTF-8, worked for me.
I wasted a lot of time trying this on Mac OS. Linux is the key. I've tested on Ubuntu.
Good Luck

\n characters before EOF in file causing problems

I have written some data to a file manually i.e. not by my application.
My code is reading the data char by char and storing them in different arrays but my program gets stuck when I insert the condition EOF.
After some investigation I found out that in my file before EOF there are three to four \n characters. I have not inserted them. I don't understand why they are in my file.
Want to remove those pesky extra characters? First, see how many of them there are at the end of your file:
od -c <filename> | tail
Then, remove however many characters you don't like. If it's 3:
truncate -s -3 <filename>
But overall, if it were me, I'd change my program to discard undesired newline characters, unless they're truly invalid according to the input file format specification.
It is very easy to add additional newlines to the end of a file in every text editor. You have to push the cursor around to see them. Open your file in your editor and see what happens when you navigate to the end, you'll see the extra newlines.
There is no such thing as an EOF character in general. Windows treats control-Z as EOF in some cases. Perhaps you are talking about the return value from some API that indicates that it has reached the end of file?

Resources