Format text file so I can import it into Excel - arrays

I have a huge list of addresses and details that I need to convert into an Excel spreadsheet. I think the best way would be to read the data and then write a second document that separates the lines so they are tab-delimited, while treating the blank lines (between data entries) as record separators so each address stays intact.
It is in the format:
AddressA1
AddressB1
Postcode1
Name1
PhoneNumber1
AddressA2
AddressB2
Postcode2
Name2
Name2
PhoneNumber2
AddressA3
AddressB3
Postcode3
Name3
PhoneNumber3
The difficulty comes when there are multiple names for a company, but I can hand-format those if necessary (ideally they would share the same address).
The resulting text document should then be tab-delimited as:
Name|AddressA|AddressB|Postcode|Phone Number
I'm thinking this would be easiest to do with a simple .bat script, or should I open the list in Excel and run a script through that?
I'm thinking that if I can run through the data and add each entry to an array ($address, $name, etc.), then I can use those to build a new text file by writing $name[$i] tab $address[$i] etc.
There are hundreds of entries, and putting them in by hand is proving... difficult.
I have some experience in MEL (basically C++), so I understand programming in general, but I'm somewhat at a loss as to how .bat and Excel (VBA?) handle and define empty lines and tabs.

The first step is to bring the data into an Excel file. Once the data has been imported, we can re-package it to meet your spec:
Sub BringFileIn()
    Dim TextLine As String
    Dim I As Long, J As Long
    I = 1
    J = 1
    Open "C:\TestFolder\question.txt" For Input As #1
    Do While Not EOF(1)
        Line Input #1, TextLine
        Cells(I, J) = TextLine   'one file line per worksheet row, all in column A
        I = I + 1
    Loop
    Close #1
End Sub
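If you'd rather do the whole job outside Excel, the blank-line grouping the question describes can also be sketched in Python (a sketch, not the answer's method; the sample strings stand in for the real file contents):

```python
# Split the pasted list on blank lines and emit one tab-delimited
# row per address block.
raw = """AddressA1
AddressB1
Postcode1
Name1
PhoneNumber1

AddressA2
AddressB2
Postcode2
Name2
PhoneNumber2"""

def blocks_to_rows(text):
    rows = []
    for block in text.split("\n\n"):                  # blank line = record separator
        fields = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if fields:
            rows.append("\t".join(fields))
    return rows

rows = blocks_to_rows(raw)
```

To process a real file, read it with open(...).read() and write "\n".join(rows) back out.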

Any text editor that can do regex search and replace across multiple lines can do the job nicely.
I have written a hybrid JScript/batch utility called REPL.BAT that performs a regex search and replace on stdin and writes the result to stdout. It is pure script that works on any modern Windows machine from XP forward; no third-party executable required. Full documentation is embedded within the script.
Assuming REPL.BAT is in your current directory, or better yet, somewhere within your PATH, then:
type file.txt|repl "\r?\n" "\t" mx|repl "\t\t" "\n" x|repl "^(([^\t]*\t){4})([^\t]*)$" "$1\t$3" x >newFile.txt
The above transforms the text in three steps and writes the result to a new file, leaving the original intact:
convert all newlines into tabs
convert consecutive tabs into newlines
insert an empty column (tab) before the last column on any line that contains only 5 columns.
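For comparison only, here is a sketch of the same three regex steps using Python's re module (REPL.BAT itself runs JScript regex, but these particular patterns behave the same way here):

```python
import re

# Two sample records: the second has the optional extra name line.
text = ("AddressA1\nAddressB1\nPostcode1\nName1\nPhoneNumber1\n\n"
        "AddressA2\nAddressB2\nPostcode2\nName2a\nName2b\nPhoneNumber2")

step1 = re.sub(r"\r?\n", "\t", text)                      # newlines -> tabs
step2 = re.sub(r"\t\t", "\n", step1)                      # blank lines -> row breaks
step3 = re.sub(r"^(([^\t]*\t){4})([^\t]*)$", r"\1\t\3",   # pad 5-column rows
               step2, flags=re.M)
```

Rows that already have six columns (two names) are left alone; five-column rows get an empty column inserted before the phone number.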

Here's a method using only Word and Excel. I used the data that you posted. I am assuming that Name2 is the only optional field.
Paste your text into Word.
Replace all paragraph marks with a special character (Ctrl-H, search for ^p, replace with |).
Replace all line breaks with a different special character (Ctrl-H, Special, search for Manual line break, replace with ;).
This is what it looks like in Word:
AddressA1;AddressB1;Postcode1;Name1;PhoneNumber1|AddressA2;AddressB2;Postcode2;Name2;Name2;PhoneNumber2|AddressA3;AddressB3;Postcode3;Name3;PhoneNumber3||
Then convert the text to a table (Insert -> Table -> Convert Text to Table), delimiting by |. This gives 3 rows (plus 2 blank rows) of 1 column.
Then copy the table.
Now in Excel:
Paste the table. (It'll be one record in each row, with all of your fields in column A.)
Convert the text to columns (Data tab, Text to Columns, Delimited, check Semicolon).
Sort by column E. The phone numbers should be grouped together.
Cut those phone numbers from column E and paste them into column F.


Using input_file_name, take substring between underscores without file extension

I want to take a substring from the filename every time a new file comes to us for processing, and load that value into the file. Suppose we are receiving many files from company X for a cleansing process; the first thing we need to do is take a substring from the file name.
For example: the file name is 'RV_NETWORK_AXN TECHNOLOGY_7737463273272635'. From this I want to take 'AXN TECHNOLOGY', create a new column named 'COMPANY NAME' in the same file, and load the 'AXN TECHNOLOGY' value into it. The file names change, but the company name will always be after the second underscore.
In the comments, you said that using df_1 = df_1.withColumn('COMPANY', F.split(F.input_file_name(), '_')[3]) extracts AXN TECHNOLOGY.csv.
I'll suggest two options:
You could apply one more split on \. and use element_at to get the 2nd-to-last element. Splitting on \. works where splitting on . doesn't, because this argument of the split function is not a plain string but a regex pattern, and an unescaped dot . in a regex means "any character".
df = df.withColumn(
    'COMPANY',
    F.element_at(F.split(F.split(F.input_file_name(), '_')[3], '\.'), -2)
)
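The escaping point is easy to verify in plain Python's re (illustrative only; Spark's split takes a Java regex, which treats the dot the same way):

```python
import re

# Escaped dot: splits only at the literal '.'
parts = re.split(r'\.', 'AXN TECHNOLOGY.csv')
# element -2 is then the company name, with the extension dropped

# Unescaped dot: every character is treated as a delimiter,
# so the split yields nothing but empty strings
broken = re.split(r'.', 'AXN TECHNOLOGY.csv')
```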
The following regex pattern would extract only what comes after the 3rd _ (and a potential 4th _), but not the file extension (e.g. .csv).
df = df.withColumn(
    'COMPANY',
    F.regexp_extract(F.input_file_name(), r'^.+?_.*?_.*?_([^_]+)\w*\.\w+$', 1)
)

How can I replace line breaks and paragraph marks in text fields, with spaces, in a flat file?

I have a tab-delimited text file that parses fine in Excel and Gammadyne CSV Editor Pro. However, when I try to import it into SQL Server with the import wizard, it breaks some of the text fields into multiple fields. When I copy the text in those fields from Excel into Word, the incorrect field-breaks turn out to be line breaks within text fields.
So I thought, no problem, I'll just open the text file in Word and do a global replace of line breaks with spaces.
However, when I open the file in Word, the line breaks show up as paragraph marks, and in Notepad++ they show up as [CR][LF]. I can't do a global replace on those, because that's also the end-of-row delimiter.
I realize that a possible solution would be to add an extra column on the right, fill it with something that won't appear in the text (e.g. |-|-|), then in Word replace all the paragraph marks with spaces, then replace all the last-column entries (|-|-|) with paragraph marks (or, more precisely, using Word's conventions, replace "^t|-|-| " with "^p"). Is that the best I can do?
How can I efficiently get rid of these line breaks or paragraph marks within text fields in my flat file?
(Note that word, "efficiently": I've got 15K+ rows and 24 columns, so a one-by-one replacement would not exactly be efficient!)
Any suggestions? Thanks!
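One hedged suggestion, sketched rather than tested: since every real row has exactly 24 columns (23 tabs), a short script can rejoin broken lines by counting tabs, which handles all 15K+ rows in one pass. A Python sketch (it assumes no tabs ever occur inside a field):

```python
def repair(lines, ncols=24):
    """Merge physical lines until each logical row has ncols tab-separated fields."""
    rows, buf = [], ""
    for line in lines:
        buf = line if not buf else buf + " " + line   # in-field break -> space
        if buf.count("\t") >= ncols - 1:              # row is complete
            rows.append(buf)
            buf = ""
    return rows

# Toy example with 3 columns: the second logical row was split mid-field.
fixed = repair(["a\tb\tc", "d\te", "f\tg"], ncols=3)
```

The real file would be read with open(...), the repaired rows written back out joined by newlines.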

converting large text file to database

Background
I am not a programmer or technical person
I have a project where I need to convert a large text file to an Access database.
The text file is not in traditional flat-file format, so I need some help pre-processing it.
The files are large (millions of records), between 100MB and 1GB, and seem to be choking all of the editors I have tried (WordPad, Notepad, Vim, EmEditor).
The following is a sample of the source text file:
product/productId:B000H9LE4U
product/title: Copper 122-H04 Hard Drawn Round Tubing, ASTM B75, 1/2" OD, 0.436" ID, 0.032" Wall, 96" Length
product/price: 22.14
review/userId: ABWHUEYK6JTPP
review/profileName: Robert Campbell
review/helpfulness: 0/0
review/score: 1.0
review/time: 1339113600
review/summary: Either 1 or 5 Stars. Depends on how you look at it.
review/text: Either 1 or 5 Stars. Depends on how you look at it.1 Star because they sent 6 feet of 2" OD copper pipe.0 Star because they won't accept returns on it.5 stars because I figure it's actually worth $12-15/foot and since they won't take a return I figure I can sell it and make $40-50 on this deal
product/productId: B000LDNH8I
product/title: Bacharach 0012-7012 Sling Psychrometer, 25?F to 120?F, red spirit filled
product/price: 84.99
review/userId: A19Y7ZIICAKM48
review/profileName: T Foley "computer guy"
review/helpfulness: 3/3
review/score: 5.0
review/time: 1248307200
review/summary: I recommend this Sling Psychrometer
review/text: Not too much to say. This instrument is well built, accurate (compared) to a known good source. It's easy to use, has great instructions if you haven't used one before and stores compactly.I compared prices before I purchased and this is a good value.
Each line represents a specific attribute of a product, with each record starting at "product/productId:".
What I need
I need to convert this file to a character-delimited file (I think the # symbol works) by stripping out each of the codes (i.e. product/productId:, product/title:, etc.), replacing them with #, and replacing the line feeds.
I want to eliminate the review/text: line.
The output would look like this:
B000H9LE4U#Copper 122-H04 Hard Drawn Round Tubing, ASTM B75, 1/2" OD, 0.436" ID, 0.032" Wall, 96" Length#22.14#ABWHUEYK6JTPP#Robert Campbell#0/0#1.0#1339113600#Either 1 or 5 Stars. Depends on how you look at it.
B000LDNH8I#Bacharach 0012-7012 Sling Psychrometer, 25?F to 120?F, red spirit filled#84.99#A19Y7ZIICAKM48#T Foley "computer guy"#3/3#5.0#1248307200#I recommend this Sling Psychrometer
B000LDNH8I#Bacharach 0012-7012 Sling Psychrometer, 25?F to 120?F, red spirit filled#84.99#A3683PMJPFMAAS#Spencer L. Cullen#1/1#5.0#1335398400#A very useful tool
I would then have a flat file delimited with "#" that I can easily import into Access.
Sorry for the ramble. I am open to suggestions, but I don't understand programming enough to write this myself. Thanks in advance.
This is a method I just put together, and it comes with no guarantee. It reads the sample data you provided and writes it out in the format you need.
Public Sub ReadFileAndSave(filePath As String, breakIdentity As String, Optional sepStr As String = "#")
'******************************************************************************
' Opens a large TXT file, reads the data until EOF on the source,
' then reformats the data to be saved to the destination.
' Arguments:
' ``````````
' 1. The source file path - "C:\Users\SO\FileName.Txt" (or) "D:\Data.txt"
' 2. The element used to identify a new row - e.g. "product/productId:"
' 3. (Optional) Separator - the separator you wish to use. Defaults to '#'.
'******************************************************************************
    Dim newFilePath As String, strIn As String, tmpStr As String, lineCtr As Long
    'The destination file is stored alongside the source, with a suffix on the name.
    newFilePath = Replace(filePath, ".txt", "-ReFormatted.txt")
    'Open the SOURCE file for read.
    Open filePath For Input As #1
    'Open/create the DESTINATION file for write.
    Open newFilePath For Output As #2
    'Loop over the SOURCE till the last line.
    Do While Not EOF(1)
        'Read one line at a time.
        Line Input #1, strIn
        'If it is a blank/empty line, SKIP it.
        If Len(strIn) > 1 Then
            lineCtr = lineCtr + 1
            'When the BREAK IDENTITY marks a new record, flush the previous one.
            If InStr(strIn, breakIdentity) <> 0 And lineCtr > 1 Then
                'Dump the completed record, dropping the trailing separator.
                Print #2, Left(tmpStr, Len(tmpStr) - Len(sepStr))
                'Prepare the NEXT ROW.
                tmpStr = ""
            End If
            'Append this line's value (everything after the colon) to the record.
            tmpStr = tmpStr & Trim(Mid(strIn, InStr(strIn, ":") + 1)) & sepStr
        End If
    Loop
    'Print the last record.
    Print #2, Left(tmpStr, Len(tmpStr) - Len(sepStr))
    'Close the files.
    Close #1
    Close #2
End Sub
Again, this code works on my system, but I have not tested it on bulk data, so it might be slow on yours. Let me know if it works for you.
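If Access/VBA isn't a requirement, the same grouping idea can be sketched in Python. This is only a sketch under the assumptions stated in the question: every record begins with "product/productId:" and the review/text line is dropped.

```python
def reformat(lines, sep="#"):
    """Group 'key: value' lines into one sep-delimited record per product."""
    records, rec = [], []
    for line in lines:
        if not line.strip():
            continue                          # skip blank lines
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "product/productId" and rec:
            records.append(sep.join(rec))     # flush the previous record
            rec = []
        if key != "review/text":              # drop the long free-text field
            rec.append(value.strip())
    if rec:
        records.append(sep.join(rec))
    return records

# Abbreviated sample in the question's layout:
sample = [
    "product/productId: B000H9LE4U",
    "product/title: Copper Tubing",
    "review/text: long review here",
    "product/productId: B000LDNH8I",
    "product/price: 84.99",
]
out = reformat(sample)
```

For the real file, iterate over open(path) line by line so the multi-hundred-MB input never has to fit in an editor.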
I'm not sure I understand how you want to map your text file to database fields.
That's the first thing you need to decide.
Once you've done that, I'd suggest putting your text file into columns corresponding to the database columns. Then you should be able to import it into Access.

Fix CSV file with new lines

I ran a query on an MS SQL database using SQL Server Management Studio, and some of the fields contained new lines. I chose to save the result as a CSV, and apparently MS SQL isn't smart enough to give me a correctly formatted CSV file.
Some of these fields with new lines are wrapped in quotes, but some aren't, and I'm not sure why (it seems to quote fields if they contain more than one new line, but not if they contain only one; thanks, Microsoft, that's useful).
When I try to open this CSV in Excel, some of the rows are wrong because of the new lines, it thinks that one row is two rows.
How can I fix this?
I was thinking I could use a regex. Maybe something like:
/,[^,]*\n[^,]*,/
Problem with this is it matches the last element of one line and the 1st of the next line.
Here is an example csv that demonstrates the issue:
field a,field b,field c,field d,field e
1,2,3,4,5
test,computer,I like
pie,4,8
123,456,"7
8
9",10,11
a,b,c,d,e
A simple regex replacement won't work, but here's a solution based on preg_replace_callback:
function add_quotes($matches) {
    return preg_replace('~(?<=^|,)(?>[^,"\r\n]+\r?\n[^,]*)(?=,|$)~',
                        '"$0"',
                        $matches[0]);
}

$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){5}$~m';
$result = preg_replace_callback($row_regex, 'add_quotes', $source);
The secret to $row_regex is knowing ahead of time how many columns there are. It starts at the beginning of a line (^ in multiline mode) and consumes the next five things that look like fields. It's not as efficient as I'd like, because it always overshoots on the last column, consuming the "real" line separator and the first field of the next row before backtracking to the end of the line. If your documents are very large, that might be a problem.
If you don't know in advance how many columns there are, you can discover that by matching just the first row and counting the matches. Of course, that assumes the row doesn't contain any of the funky fields that caused the problem. If the first row contains column headers you shouldn't have to worry about that, or about legitimate quoted fields either. Here's how I did it:
preg_match_all('~\G,?[^,\r\n]++~', $source, $cols);
$row_regex = '~^(?:(?:(?:"[^"]*")+|[^,]*)(?:,|$)){' . count($cols[0]) . '}$~m';
Your sample data contains only linefeeds (\n), but I've allowed for DOS-style \r\n as well. (Since the file is generated by a Microsoft product, I won't worry about the older-Mac style CR-only separator.)
If you want a programmatic Java solution, open the file using the OpenCSV library. If it is a manual operation, open the file in a text editor such as Vim and run a replace command. If it is a batch operation, you can use a Perl one-liner to clean up the CRLFs.

Using dlmwrite to write cell objects MATLAB

I have a cell array with 7 columns. All these columns contain strings. I want to write this cell array into a text file. To start, I was doing this on only 1 element of the cell and this is my code:
dlmwrite('735.txt',cell{1},'delimiter','%s\t');
cell{1} looks like this:
Columns 1 through 2
[1x30 char] [1x20 char]
Column 3
'Acaryochloris'
Column 4
'Cyanobacteria001'
Columns 5 through 6
'Cyanobacteria00' 'Cyanobacteria'
Column 7
'Bacteria'
It gives me the output without separating the columns. Sample output is:
Acaryochloris_marina_MBIC11017AcaryochlorismarinaAcaryochlorisCyanobacteria001Cyanobacteria00CyanobacteriaBacteria
The correct output should have spaces between all the columns :
Acaryochloris_marina_MBIC11017 Acaryochloris_marina Acaryochloris Cyanobacteria001 Cyanobacteria00 Cyanobacteria Bacteria
Note that for the second column, we need to add the underscore between Acaryochloris and marina. There is originally a space between those two words.
I hope I explained the problem correctly, Would appreciate the help. Thanks!
DLMWRITE is for numerical data. In your case it processes the char data as numbers, one character at a time. You are probably viewing the resulting file in a way that hides the tab delimiters.
You can use XLSWRITE to write a cell array of strings to a file. If you don't want the output to be in Excel format, run DLMWRITE before it to write some number to the file.
dlmwrite(filename,1)
xlswrite(filename, Acell{1})
Don't call your variable cell, which is a built-in function in MATLAB.
As an alternative you can write to a file with lower level function, like FPRINTF.
UPDATE:
If you want to use XLSWRITE in a for-loop and not to overwrite the data you can specify the row to start from:
dlmwrite(filename,1)
for k = 1:10
xlswrite( filename, Acell{k}, 1, sprintf('A%d',k) )
end
UPDATE 2:
Unfortunately, this does not work anymore in the latest MATLAB releases (I believe starting from R2012b): XLSWRITE gives an error about a wrong file type.
Something along the lines of the following should do what you want:
fid = fopen('735.txt', 'w');
fprintf(fid, '%s\t', cell{1}{:});
fclose(fid);
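If MATLAB version quirks get in the way, the same tab-delimited line is easy to produce outside MATLAB as well. A Python sketch using the strings from the question (the space-to-underscore fix is applied to every field, which only affects column 2 here):

```python
row = ["Acaryochloris_marina_MBIC11017", "Acaryochloris marina",
       "Acaryochloris", "Cyanobacteria001", "Cyanobacteria00",
       "Cyanobacteria", "Bacteria"]

# The question wants the space in "Acaryochloris marina" turned into an underscore.
fields = [s.replace(" ", "_") for s in row]
line = "\t".join(fields)          # one tab-delimited output line
```

Writing the full cell array is then one such join per row, written out with open("735.txt", "w").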
