Fix CSV file with new lines

Fix CSV file with new lines - sql-server

I ran a query on a MS SQL database using SQL Server Management Studio, and some the fields contained new lines. I selected to save the result as a csv, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they only contain one new line, thanks Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
Problem with this is it matches the last element of one line and the 1st of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e

A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
'"$0"',
$matches[0]);
}
$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){5}$~m';
$result=preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"*]")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
See an online demo

If you want a java programmatic solution, open the file using the OpenCSV library. If it is a manual operation, then open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a perl command to cleanup the CRLFs.

Related

CSV file not recognized as csv, reason nominal value not declared in header

I am trying to load a dataset in weka, I have tried many solutions such as arff format, comas etc. but it was all a failure. Could any of you give me a working solution or load this dataset according to the format.
Here is a link to dataset

Instead of using Weka's functionality for reading CSV files, you could use ADAMS (developed at the same university; I'm the lead developer) instead.
Download the adams-ml-app snapshot and then use the Weka Investigator to load/save the file:
Load it as ADAMS Spreadsheets (.csv, .csv.gz)
Save it as Arff data files (.arff, .arff.gz) or Simple ARFF data files (.arff, .arff.gz)
The Reviews column contains an erroneous 3.0M, which prevents it from becoming numeric.
If you want to have an introduction to the Weka Investigator, then take a look at my talk from the Weka User Conference 2021: Taking Weka to the next level with ADAMS .

There are too many issues with lines in this file.
In line 23, I eliminated the odd looking brackets.
I removed all single quotes (')
I eliminated all repeated double quotes ("")
In line 10474 the first two fields (before the number) didn't seem to be separated, so I added a comma.
This allowed the file to go through initial screening, but...
The file contains a lot of odd emojis. I started to eliminate them one by one, but there are clearly more of these than I wish to deal with.
Each time I got rid of one, it would read farther into the file, then stop at the next one.
If I just try to read the top of the file, the first 20 lines before we get to any of these problems, it reads fine.
My partial editing can be found here: https://www.dropbox.com/s/ij707mb23dt1jvz/googleplaystore3.csv?dl=0
I think if you clear up the remaining emojis the file should be usable.

Add line numbers to Microsoft Report Builder output

I have a report made by Microsoft Report Builder and I need to add line numbers. When the data is represented as a table, this is easy to accomplish by adding a new column for the line number. However, when the data is a text block inside a text block, there is no way to add numbers to each line of text. I have thought of exporting the report to word, reading the word doc into python, and adding lines there, but at first attempt it seems a little difficult. Has anyone had to do this before?

Ragged Right in SSIS does not work properly

Hello: I have an SSIS package that imports a flat text file: the text file is a simple, fixed-width file that’s also CR/LF delimited. This means that: EACH record has a set of fixed length columns (the columns are defined using fixed lengths), but each record must also end with a CR/LF.
I’ve defined the package as follows:
PROBLEM:
Some records do not have all of the columns defined, and thus they are shorter. However, ALL records end with a CR/LF. First I tried to import it using “fixed width” file and the shorter records were misaligned because obviously it wasn’t fixed length. Now that I am using ragged right, I am still facing the same issue. Basically, for the shorter records, SSIS borrows from the next line to compensate for the thing. THE NEXT line, however, is just fine.
POSSIBLE SOLUTIONS:
1- Ignore the rest of the columns that are not needed (basically ignore it): this works fine but is not elegant. I was hoping for a better solution.
2- USE the record type at the beginning to split BEFORE defining columns. This works fine also but I have over 500 fields, and the point of using the Flat File import is to be able to generate the columns automatically.
3- Use a script component: that seems like a difficult thing to do.

How to retrieve the name of a file and store it in the database using SSIS package?

I'm doing an Excel loop through fifty or more Excel files. The loop goes through each Excel file, grabs all the data and inputs it into the database without error. This is the typical process of setting delay validation to true, and making sure that the expression for the Excel Connection is a string variable called EFile that is set to nothing (in the loop).
What is not working: trying to input the name of the Excel file into the database.
What's been tried (edit; SO changed my 2 to 1 - don't know why):
Add a derived column between the Excel file and database input, and add a column using the EFile expression (so under Expression in the Derived Column it would be #[User::EFile]). and add the empty. However, this inputs nothing a blank (nothing).
One suggestion was to add ANOTHER string variable and set its properties EvaluateAsExpression to True and set the Expression to the EFile variable (#[User::EFile]). The funny thing is that this does the same thing - inputs a blank into the database.
Numerous people on blogs claim they can do this, yet I haven't seen one actually address this (I have a blog and I will definitely be showing people how to do this when I get an answer because, so far, these others have fallen short). How do I grab an Excel file's name and input it in a database during a loop?
Added: Forgot to add, no scripts; the claim is that it can be done without them, so I want to see the solution without them.
Note: I already have the ability to import the data from the Excel files - that's easy (see my GitHub account, as I have two different projects for importing all sorts of txt, csv, xls, xlsx data). I am trying to also get the actual name of the file being imported also into the database. So, if there are fifty Excel files, along with the data in each file, the database will have the fifty file names alongside that data (so if each file has 1000 rows of data, each 1000 rows would also have the name of the file they came from next to them as an additional column). This point seems to cause a lot of confusion, as people assume I'm having trouble importing data in files - NOPE, see my GitHub; again that's easy. It's the FILENAME that needs to also be imported.
Test package: https://github.com/tmmtsmith/SSISLoopWithFileName
Solution: #jaimet pointed out that the Derived Column needed to be the #[User::CurrentFile] (see the test package). When I first ran the package, I still got a blank value in my database. But when we originally set up the connection, we do point it to an actual file (I call this "fooling the package"), then change the expression on the connecting later to the #[User::CurrentFile], which is blank. The Derived Column, using the variable #[User::CurrentFile], showed a string of 0. So, I removed the Derived Column, put the full file path and name in the variable, then added the variable to the Derived Column (which made it think the string was 91 characters long), then went back and set the variable to nothing (English teacher would hate the THENs about right now). When I ran the package, it inputted the full file path. Maybe, like the connection, it needs to initially think that a file exists in order for it to input the full amount of characters?
Appreciate all the help.

The issue is because of blank value in the variable #[User::FileNameInput] and this caused the SSIS package to assume that the value of this variable will always be of zero length in the Derived Column transformation.
Change the expression on the Derived column transformation from #[User::FileNameInput] to (DT_STR, 2000, 1252)#[User::FileNameInput].
Type casting the derived column to 2000 sets the column length to that maximum value. The value 1252 represents the code page. I assumed that you are using ANSI code page. I took the value 2000 from your table definition because the FilePath column had variable VARCHAR(2000). If the column data type had been NVARCHAR(2000), then the expression would be (DT_WSTR, 2000)#[User::FileNameInput]

Tim,
You're using the wrong variable in your Derived Column component. You are storing the filename in #[User::CurrentFile] but the variable that you're using in your Derived Column component is #[User::FileNameInput]
Change your Derived Column component to use #[User::CurrentFile] and you'll be good.
Hope that helps.
JT

If you are using a ForEach loop to process the files in a folder then I have have used the technique described in SSIS Junkie's blog to get the filename in to an SSIS variable: SSIS: Enumerating files in a Foreach loop
You can use the variable later in your flow to write it to the database.

TO all intents and purposes your method #1 should work. That's exactly how I would attempt to do it. I am baffled as to why it is not working. Could you perhaps share your package?
Tony, thanks very much for the link. Much appreciated.
Regards
Jamie

SQL 2005 CSV Import Quote Delimited with inner Quotes and Commas

I have a CSV file with quote text delimiters. Most of the 90000 rows are fine, but I have a few rows that have a text field that contains both a quote and a comma. For example the fields value would be:
AB",AB
When Delimited this becomes
"AB"",AB"
When SQL 2005 attempts to import this I get errors such as...
Messages
Error 0xc0202055: Data Flow Task: The column delimiter for column "Column 4" was not found.
(SQL Server Import and Export Wizard)
This only seems to happen when a quote and comma are in a text value together. Values like
AB"AB which becomes "AB""AB"
or
AB,AB which becomes "AB,AB"
work fine.
Here are some example rows...
"1464885","LEVER WM","","B","MP17"
"1465075",":PLT-BC !!NOTE!!","","B",""
"1465076","BRKT-STR MTR !NOTE!","","B",""
"1465172",":BRKT-SW MTG !NOTE!","","B","MP16"
"1465388","BUSS BAR !NOTE!","","B","MP10"
"1465391","PLT-BLKHD ""NOTE""","","B","MP20"
"1465564","SPROCKET:13TEETH,74MM OD,66MM","ID W/.25"" SETSCR","B","MP6"
"S01266330002","CABLE:224"",E122/261,8 CO","","B","MP11"
The last row is an example of the problem - the "", causes the error.

I've had MAJOR problems with SSIS. Things that Access, Excel and even DTS seemed to do very well, SSIS chokes on. Variable record-length data is another problem but, yes, these embedded qualifiers are a major problem. Especially if you do not have access to the import files because they're on someone else's server that you pay to gain access to and might even be 4 to 5 GB in size! Cant just to a "replace all" on that every import.
You may want to check into this at Microsoft Downloads called "UnDouble" and here is another workaround you might try.
Seems like with SSIS in SQL Server 2008, the bug is still there. I dont know why they havent addressed this in the parser but its like we went back in time with SSIS in basic import functionality.
UPDATE 11-18-2010: This bug still exists in SSIS. Amazing.

How about just:
Search/replace all "", with ''; (fix all the broken fields)
Search/replace all ;''; with ,"", (to "unfix" properly empty fields.)
Search/replace all '';''; with "","", (to "unfix" properly empty fields which follow a correct encapsulation of embedded delimiters.)
That converts your original to:
"1464885","LEVER WM","","B","MP17"
"1465075",":PLT-BC !!NOTE!!","","B",""
"1465076","BRKT-STR MTR !NOTE!","","B",""
"1465172",":BRKT-SW MTG !NOTE!","","B","MP16"
"1465388","BUSS BAR !NOTE!","","B","MP10"
"1465391","PLT-BLKHD ""NOTE""","","B","MP20"
"1465564","SPROCKET:13TEETH,74MM OD,66MM","ID W/.25"" SETSCR","B","MP6"
"S01266330002","CABLE:224'';E122/261,8 CO","","B","MP11"
Which seems to run the gauntlet fine in SSIS. You may have to step 3 recursively to account for 3 empty fields in a row ('';'';'';, etc.) but the bottom line here is that when you have embedded text qualifiers, you have to either escape them or replace them. Let this be a lesson in your CSV creation processes going forward.

Microsoft says doubled double quotes inside double quote delimited fields just don't work. A fix is planned for the end of 2011...
In the mean time we will have to use workarounds like described in the other answers.

I would just do a search/replace for ", and replace it with ,
Do you have access to the original file?

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight