Converting a large text file to a database

Background
I am not a programmer or a technical person.
I have a project where I need to convert a large text file to an Access database.
The text file is not in traditional flat-file format, so I need some help pre-processing it.
The files are large (millions of records, between 100MB and 1GB) and seem to be choking all of the editors I have tried (WordPad, Notepad, Vim, EmEditor).
The following is a sample of the source text file:
product/productId: B000H9LE4U
product/title: Copper 122-H04 Hard Drawn Round Tubing, ASTM B75, 1/2" OD, 0.436" ID, 0.032" Wall, 96" Length
product/price: 22.14
review/userId: ABWHUEYK6JTPP
review/profileName: Robert Campbell
review/helpfulness: 0/0
review/score: 1.0
review/time: 1339113600
review/summary: Either 1 or 5 Stars. Depends on how you look at it.
review/text: Either 1 or 5 Stars. Depends on how you look at it.1 Star because they sent 6 feet of 2" OD copper pipe.0 Star because they won't accept returns on it.5 stars because I figure it's actually worth $12-15/foot and since they won't take a return I figure I can sell it and make $40-50 on this deal
product/productId: B000LDNH8I
product/title: Bacharach 0012-7012 Sling Psychrometer, 25?F to 120?F, red spirit filled
product/price: 84.99
review/userId: A19Y7ZIICAKM48
review/profileName: T Foley "computer guy"
review/helpfulness: 3/3
review/score: 5.0
review/time: 1248307200
review/summary: I recommend this Sling Psychrometer
review/text: Not too much to say. This instrument is well built, accurate (compared) to a known good source. It's easy to use, has great instructions if you haven't used one before and stores compactly.I compared prices before I purchased and this is a good value.
Each line represents a specific attribute of a product, starting at "product/productId:"
What I need
I need to convert this file to a character-delimited format (I think the # symbol would work) by stripping out each of the labels (i.e. product/productId:, product/title:, etc.), replacing them with #, and removing the line feeds.
I want to eliminate the review/text: line.
The output would look like this:
B000H9LE4U#Copper 122-H04 Hard Drawn Round Tubing, ASTM B75, 1/2" OD, 0.436" ID, 0.032" Wall, 96" Length#22.14#ABWHUEYK6JTPP#Robert Campbell#0/0#1.0#1339113600#Either 1 or 5 Stars. Depends on how you look at it.
B000LDNH8I#Bacharach 0012-7012 Sling Psychrometer, 25?F to 120?F, red spirit filled#84.99#A19Y7ZIICAKM48#T Foley "computer guy"#3/3#5.0#1248307200#I recommend this Sling Psychrometer
B000LDNH8I#Bacharach 0012-7012 Sling Psychrometer, 25?F to 120?F, red spirit filled#84.99#A3683PMJPFMAAS#Spencer L. Cullen#1/1#5.0#1335398400#A very useful tool
I now would have a flat file delimited with "#" that I can easily import into Access.
Sorry for the ramble. I am open to suggestions, but I don't understand programming well enough to write this using an editor's scripting language. Thanks in advance.

This is a method I just put together and it comes with no guarantee. It reads the data (the sample you have provided) and writes it out in the format you need.
Public Sub ReadFileAndSave(filePath As String, breakIdentity As String, Optional sepStr As String = "#")
    '******************************************************************************
    ' Opens a large TXT file, reads the data until EOF on the Source,
    ' then reformats the data to be saved in the Destination.
    ' Arguments:
    ' ``````````
    ' 1. The Source File Path - "C:\Users\SO\FileName.Txt" (or) D:\Data.txt
    ' 2. The element used to identify a new row - "new row" (or) "-" (or) "sam"
    ' 3. (Optional) Separator - The separator you wish to use. Defaults to '#'
    '*******************************************************************************
    Dim newFilePath As String, strIn As String, tmpStr As String, lineCtr As Long

    'The Destination file is stored in the same drive with a suffix to the source file name.
    newFilePath = Replace(filePath, ".txt", "-ReFormatted.txt")

    'Open the SOURCE file for Read.
    Open filePath For Input As #1
    'Open/Create the DESTINATION file for Write.
    Open newFilePath For Output As #2

    'Loop through the SOURCE till the last line.
    Do While Not EOF(1)
        'Read one line at a time.
        Line Input #1, strIn
        'If it is a blank/empty line, SKIP.
        If Len(strIn) > 1 Then
            lineCtr = lineCtr + 1
            'Append this line's value (the text after the colon) to the current record.
            tmpStr = tmpStr & Trim(Mid(strIn, InStr(strIn, ":") + 1)) & sepStr
            'If a new row needs to be inserted, the BREAK IDENTITY is analyzed.
            If InStr(strIn, breakIdentity) <> 0 And lineCtr > 1 Then
                'Once the new row is triggered, dump the line in the Destination.
                '(Print # appends its own line break, so no vbCrLf is needed.)
                Print #2, Left(tmpStr, Len(tmpStr) - Len(Mid(strIn, InStr(strIn, ":") + 1)) - 1)
                'Prepare the NEXT ROW.
                tmpStr = Trim(Mid(strIn, InStr(strIn, ":") + 1)) & sepStr
            End If
        End If
    Loop

    'Print the last line.
    Print #2, Left(tmpStr, Len(tmpStr) - 1)

    'Close the files.
    Close #1
    Close #2
End Sub
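For example, a call like the following (the path is just a placeholder) would reformat the sample data, starting a new row at each product/productId line; the output lands next to the source as reviews-ReFormatted.txt:
ReadFileAndSave "C:\Data\reviews.txt", "product/productId"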
Again, this code works on my system, but I have not tested it against bulk data, so it might be slow on yours. Let me know if this works alright for you.

I'm not sure I understand how you want to map your text file to database fields.
That's the first thing you need to decide.
Once you've done that I'd suggest putting your text file into columns corresponding to the database columns. Then you should be able to import it into Access.
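If it helps, once the reformatted file exists, the Access import itself can also be scripted. This is only a minimal sketch: the specification, table and file names below are placeholders, and the import specification (where the # delimiter is defined) would need to be created once through the Access import wizard.
' Assumes an import spec named "ReviewImportSpec" (with # set as the delimiter)
' and a target table named tblReviews already exist - both names are placeholders.
DoCmd.TransferText acImportDelim, "ReviewImportSpec", "tblReviews", _
    "C:\Data\reviews-ReFormatted.txt", False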

Related

How to export many variables of a simulation result to csv?

I'm trying to export my results from Dymola to Excel through a CSV file, but I have many results. How can I write them using an array?
I tried to create a for loop, but I lack the knowledge of how to write the code.
if (time) <= 0 then
  Modelica.Utilities.Files.removeFile("tube_0.02" + ".csv");
  Modelica.Utilities.Streams.print("temps," +
    "delta_fr1," +
    "delta_fr2," +
    "delta_fr3,",
    "tube_0.02" + ".csv");
else
  Modelica.Utilities.Streams.print(String(time) +
    "," + String(my_code_tube1[1].delta_fr) +
    "," + String(my_code_tube1[2].delta_fr) +
    "," + String(my_code_tube1[3].delta_fr),
    "tube_0.02" + ".csv");
end if;
Instead of having to write delta_fr1, delta_fr2... and then my_code_tube1[1].delta_fr, my_code_tube1[2].delta_fr... I need to create a for loop, because I will have almost 1500 variables to export.
Trying to write the results to a CSV file during simulation comes with some troubles in Modelica:
For which time steps do you want to write the result? Using the solver step size is not possible, but we can define a periodic output using the sample() function.
Calling print is quite expensive; you don't want to do that too many times while your simulation runs.
The print function always adds a newline after every call. That requires us to write every line at once (but due to the performance limits mentioned before, we should write as much text as possible at once anyhow).
Unfortunately, the default maximum string size in Dymola is quite small, limited to 500. You have to ensure that the current Dymola instance is using an appropriately large value by setting e.g. Advanced.MaxStringLength = 50000.
With these things in mind, we could come up with code like the one below:
model SO_print_for
  My_code_tube my_code_tube1[1500];
protected
  String line;
  String f="tube_0.02" + ".csv";

  model My_code_tube
    Real delta_fr=1;
  end My_code_tube;

initial algorithm
  line := "temps";
  for i in 1:size(my_code_tube1, 1) loop
    line := line + ", delta_fr" + String(i);
  end for;
  Modelica.Utilities.Files.removeFile(f);
  Modelica.Utilities.Streams.print(line, f);

algorithm
  when sample(0, 0.1) then
    line := String(time);
    for i in 1:size(my_code_tube1, 1) loop
      line := line + ", " + String(my_code_tube1[i].delta_fr);
    end for;
    Modelica.Utilities.Streams.print(line, f);
  end when;
end SO_print_for;
The code works, but it will slow down your simulation, as many time events are generated (from the sample() function).
Instead of writing a csv during the simulation, you should consider one of the following ways to convert the result file after the simulation has finished:
Use Export Result in the Dymola variable browser to export all or only the plotted variables to CSV. This must be done manually after every simulation and cannot be scripted.
Use the function DataFiles.convertMATtoCSV. The following two lines of code will extract the 1500 delta_fr variables from the .mat result file:
vars = {"my_code_tube1["+String(i)+"].delta_fr" for i in 1:1500}
DataFiles.convertMATtoCSV("your-model.mat", vars, "out.csv")
Use Matlab, Octave or FreeMat to open the .mat result file and convert it to a file format which Excel understands. Dymola provides dymload.m to import .mat result files into Matlab; see the User Manual Volume 1 of Dymola for details.
Use Python to convert the results to a CSV file, e.g. by using the package DyMat or SDF-Python, which can both be used to read the result files.

Using VB6 to update info in database

I'm being taught VB6 by a co-worker who gives me assignments every week. I think this time he's overestimated my skills. I'm supposed to find a line in a text file that contains Brand IDs and their respective brand name. Once I find the line, I'm to split it into variables and use that info to create a program that, via an inserted SQL statement, finds the brand, and replaces the "BrandName" in the item description with the "NewBrandname".
Here's what I'm working with
Dim ff as integer
ff = freefile
Open "\\tsclient\c\temp\BrandNames.txt" For Input as #ff
Do Until EOF(ff)
    Dim fileline as string, linefields() as string
    line input #ff, fileline
    linefields = split(fileline, ",")
    brandID = linefields(0)
    BrandName = linefields(1)
    NewBrandName = linefields(2)
I want to use the following line in the text file, since it's the brand I'm working with:
BrandID =CHEFJ, BrandName=Chef Jay's NewBrandName=Chef Jays
That's what 'fileline' is; I just don't know how to select that one line.
As for updating the info, here's what I've got:
dim rs as ADODB.Recordset, newDesc1 as String
rs = hgSelect("select desc1 from thprod where brandID='CHEFJ'")
do while not rs.eof
    if left(rs!desc1,len(BrandName)) = BrandName then
        dim newDesc1 as string
        newDesc1 = NewBrandname & mid(rs!desc1, len(BrandName)+1)
        hgExec "update thprod set desc1=" & adoquote(NewBrandName) & "+right(desc1,len(BrandName))" where brandId=CHEFJ and desc1 like 'BrandName%'"
    end if
    rs.movenext
loop
end while
How do I put this all together?
Just to give you some guidelines:
Firstly you need to read the text file, which you are already doing.
Then, once you get the data, you need to spot the format and SPLIT the data to retrieve only the parts you need.
For example, if the data read from the text file gives you BrandID=CHEFJ, BrandName=Chef Jay's, NewBrandName=Chef Jays, you will see that the fields are delimited by commas, and the property values follow equal signs.
Follow LINK for more info on how to split.
Once you've split the data, you can use the parts to proceed with your database update. To update your db, you will first need to create the connection, then build your update query using the data you've fetched from the text file.
Finally, you need to execute your query using ADODB. This EXAMPLE can help.
Do not forget to dispose of the objects used, including your connection.
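To make those guidelines concrete, here is a rough sketch of the split-then-update idea in VB6. It assumes the comma-delimited line format described above, and it reuses hgSelect, hgExec and adoquote from your snippet; everything else is illustrative only.
' Rough sketch - not production code.  hgSelect/hgExec/adoquote and the
' thprod/desc1 names come from the question; the rest is an illustration.
Dim parts() As String
Dim brandID As String, BrandName As String, NewBrandName As String
Dim rs As ADODB.Recordset, newDesc1 As String

parts = Split(fileline, ",")                         ' "BrandID=CHEFJ", " BrandName=Chef Jay's", ...
brandID = Trim$(Split(parts(0), "=")(1))             ' keep only the text after the equal sign
BrandName = Trim$(Split(parts(1), "=")(1))
NewBrandName = Trim$(Split(parts(2), "=")(1))

If brandID = "CHEFJ" Then                            ' only act on the brand you are working with
    Set rs = hgSelect("select desc1 from thprod where brandID='" & brandID & "'")
    Do While Not rs.EOF
        If Left$(rs!desc1, Len(BrandName)) = BrandName Then
            ' swap the old brand name at the start of the description for the new one
            newDesc1 = NewBrandName & Mid$(rs!desc1, Len(BrandName) + 1)
            hgExec "update thprod set desc1=" & adoquote(newDesc1) & _
                   " where brandID='" & brandID & "' and desc1=" & adoquote(rs!desc1)
        End If
        rs.MoveNext
    Loop
End If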
Hope it helps.

Format text file so I can import it into excel

I have a huge list of addresses and details I need to convert into an Excel spreadsheet. I think the best way would be to read the data and then write a second document that separates the lines so that they are tab-delimited, whilst recognizing blank lines (between data entries) to preserve each separate address.
It is in the format:
AddressA1
AddressB1
Postcode1
Name1
PhoneNumber1
AddressA2
AddressB2
Postcode2
Name2
Name2
PhoneNumber2
AddressA3
AddressB3
Postcode3
Name3
PhoneNumber3
So the difficulty also comes when there are multiple names for a company, but I can hand-format those if necessary (ideally they want to take on the same address as each other).
The resulting text document then needs to be tab-delimited as:
Name|AddressA|AddressB|Postcode|Phone Number
I am thinking this would be easiest to do with a simple .bat command? Or should I open the list in Excel and run a script through that?
I'm thinking if I can run through it, adding each entry to an array ($address, $name, etc.), then I can use that to build a new text file by writing $name[$i] tab $address[$i], etc.
There are hundreds of entries and putting it in by hand is proving... difficult.
I have some experience in MEL (basically C++) so I do understand programming in general, but I'm somewhat at a loss as to how .bat and Excel (VB?) handle and define empty lines and tabs.
The first step would be to bring the data into an Excel file. Once the data has been imported, we can re-package it to meet your specs. Here is the first step:
Sub BringFileIn()
    Dim TextLine As String, CH As String
    Dim s As String
    Dim I As Long, J As Long

    Close #1
    Open "C:\TestFolder\question.txt" For Input As #1

    J = 1
    I = 1
    Do While Not EOF(1)
        Line Input #1, TextLine
        Cells(I, J) = TextLine
        I = I + 1
    Loop
    Close #1
End Sub
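As a possible second step (not part of the original answer, just a sketch): walk down the single column that BringFileIn produced and write one record per row on a new sheet, starting a new record at every blank cell. The columns come out in file order, so you would still rearrange them into Name|AddressA|AddressB|Postcode|Phone Number afterwards.
Sub RepackageToColumns()
    ' Rough follow-up sketch: one record per row, new record at every blank cell.
    Dim src As Worksheet, dst As Worksheet
    Dim i As Long, lastRow As Long, outRow As Long, outCol As Long

    Set src = ActiveSheet               ' the sheet BringFileIn wrote to
    Set dst = Worksheets.Add            ' records go onto a fresh sheet

    lastRow = src.Cells(src.Rows.Count, 1).End(xlUp).Row
    outRow = 1
    outCol = 1
    For i = 1 To lastRow
        If Len(Trim$(CStr(src.Cells(i, 1).Value))) = 0 Then
            outRow = outRow + 1         ' blank line = start of the next record
            outCol = 1
        Else
            dst.Cells(outRow, outCol).Value = src.Cells(i, 1).Value
            outCol = outCol + 1
        End If
    Next i
End Sub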
Any text editor that can do regex search and replace across multiple lines can do the job nicely.
I have written a hybrid JScript/batch utility called REPL.BAT that performs a regex search and replace on stdin and writes the result to stdout. It is pure script that works on any modern Windows machine from XP forward - No 3rd party executable required. Full documentation is embedded within the script.
Assuming REPL.BAT is in your current directory, or better yet, somewhere within your PATH, then:
type file.txt|repl "\r?\n" "\t" mx|repl "\t\t" "\n" x|repl "^(([^\t]*\t){4})([^\t]*)$" "$1\t$3" x >newFile.txt
The above modifies the file in 3 steps and writes the result to a new file, leaving the original intact:
convert all newlines into tabs
convert consecutive tabs into newlines
insert an empty column (tab) before the last column on any line that contains only 5 columns.
Here's a method using only Word and Excel. I used the data that you posted. I am assuming that Name2 is the only optional field.
Paste your text into Word.
Replace all paragraph marks with a special character (Ctrl-H, search for ^p, replace with |).
Replace all line breaks with a different special character (Ctrl-H, Special character, search for Manual line break, replace with ;).
This is what it looks like in Word:
AddressA1;AddressB1;Postcode1;Name1;PhoneNumber1|AddressA2;AddressB2;Postcode2;Name2;Name2;PhoneNumber2|AddressA3;AddressB3;Postcode3;Name3;PhoneNumber3||
Then convert text to table (Insert -> Table -> Convert text to table), delimiting by ;. This gives 3 rows (plus 2 blank rows) of 1 column.
Then copy the table.
Now in Excel:
Paste the table. (It'll be one row in each row, with all of your fields in column A.)
Convert the text to columns (Data tab, Text to columns, Delimited, check semicolon)
Sort by column E. The phone numbers should be grouped together.
Cut the phone numbers that remain in column E and paste them into column F.

How to deal with Weird records in FLAT FILE?

SSIS falls flat on its back with this scenario.
In my flat file, we have normal-looking records like this:
"1","2","STATUSCHANGED","A","02-MAY-12 21:52:34","","Re","Initial review",""
And some like this (the record is spread over several lines):
"1","2","SALESNOTIFICATIONRESPOND","Ac","02-MAY-12 21:55:19","From: W, J
Sent: Wednesday, May 08, 2012 2:00 PM
To: XXXX, A; Acost
Subject: RE: Notification Id 1219 - Qu ID XXXXXX
I got this from earlier today. Our team is reviewing the request.
Thanks,
Hi,
This account belongs to D please approve/deny.
Thanks!
Claud","","","Reassign"
So looking at the file in Notepad++ (which is amazing), I can see that I should take out all of the {CR}{LF} pairs within the field that is spread over several lines.
The row delimiter for this file is LF and the text qualifier is ".
So there are 2 things I need to do on a collection of 200 files:
Remove all the {CR}{LF} in the file?
Remove any embedded " in the actual fields, as " is the text qualifier?
Anyone have any idea how to do this in Windows, DOS or VBA for such a large number of files, so it's automated?
For data such as this, I prefer using a Script Component to perform the parse. I wrote a blog post describing one approach.
Hope this helps,
Andy
PowerShell will do this for you for the {CR}{LF}, but it might take you a while to code if you have never used PowerShell before.
The " qualifier appearing in the middle of fields is a real mess; you may be able to develop rules to clean this up, but there is no guarantee that you will succeed.
If the proper row terminator is just LF and you are certain that every row is properly terminated by LF, then you can remove all {CR}{LF}, but you should not actually need to. As long as the {CR}{LF} is properly inside a pair of text qualifiers, it should just be imported literally.
And yes, you definitely need to remove any text qualifiers (or escape them, as you prefer) from within an actual field when the entire field is surrounded by text qualifiers. That will cause confusion.
Personally, I would approach this by either writing a Python script to preprocess the data before feeding it to SSIS, or just having the script import the entire thing into SQL for me.
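Since the question also mentions VBA, here is a rough sketch of the {CR}{LF} cleanup across a folder of files. The folder path is a placeholder, each file is loaded fully into memory, and the embedded-qualifier problem is not handled here, since that is the messier part described above.
Sub CleanSsisSourceFiles()
    ' Rough sketch, not a tested tool: for every .txt file in a folder, replace
    ' embedded CR+LF pairs with a space.  This relies on the premise that real
    ' rows end with a bare LF, so only breaks inside quoted fields are CR+LF.
    Dim fso As Object, f As Object, ts As Object
    Dim content As String, folderPath As String

    folderPath = "C:\Data\FlatFiles"                      ' placeholder folder
    Set fso = CreateObject("Scripting.FileSystemObject")

    For Each f In fso.GetFolder(folderPath).Files
        If LCase$(fso.GetExtensionName(f.Name)) = "txt" And f.Size > 0 Then
            Set ts = fso.OpenTextFile(f.Path, 1)          ' 1 = ForReading
            content = ts.ReadAll
            ts.Close
            content = Replace(content, vbCrLf, " ")       ' drop the embedded CR+LF breaks
            Set ts = fso.CreateTextFile(f.Path, True)     ' True = overwrite the original
            ts.Write content
            ts.Close
        End If
    Next f
End Sub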
I agree with Andy. I had a similar issue and I took care of it with a script component task.
Your code could look something like this (it doesn't handle the CR/LF issue):
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim strRow As String
        Dim strColSeperator As String
        Dim rowValues As String()

        strRow = Row.Line.ToString()
        If strRow.Contains(",") Then
            strColSeperator = (",")
        ElseIf strRow.Contains(";") Then
            strColSeperator = ";"
        End If

        rowValues = Row.Line.Split(CChar(strColSeperator))
        If (rowValues.Length > 1) Then
            Row.Code = rowValues.GetValue(0).ToString()
            Row.Description = rowValues.GetValue(1).ToString()
            Row.Blank = rowValues.GetValue(2).ToString()
            Row.Weight = rowValues.GetValue(3).ToString()
            Row.Scan = rowValues.GetValue(4).ToString()
        End If
    End Sub
End Class
A step-by-step tutorial is available in Andy Mitchell's post.

How to split the data in this file vb6

I have this file. It stores a name, a project, the week that they are storing the data for, and the hours spent on the project. Here is an example:
"James","Project5","15/05/2010","3"
"Matt","Project1","01/05/2010","5"
"Ellie","Project5","24/04/2010","1"
"Ellie","Project2","10/05/2010","3"
"Matt","Project3","03/05/2010","4"
I need to print it on the form without quotes. It should only show the name once and then just display the projects under the name. I've looked this up and the Split function seems interesting.
Any help would be good.
Create a Dictionary object and then put everything you find for a given name into one dictionary entry.
Then in a second iteration print all that out.
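A minimal sketch of that Dictionary idea in VB6 (the file path is a placeholder, and Debug.Print stands in for drawing on the form):
' Rough sketch: group the rows by name, then print each name once
' with its projects underneath.
Dim dict As Object, parts() As String, key As Variant
Dim filenum As Integer, inputLine As String

Set dict = CreateObject("Scripting.Dictionary")
filenum = FreeFile
Open "U:\test.txt" For Input As #filenum
Do While Not EOF(filenum)
    Line Input #filenum, inputLine
    inputLine = Replace(inputLine, Chr(34), vbNullString)       ' strip the quotes
    If Len(Trim$(inputLine)) > 0 Then
        parts = Split(inputLine, ",")
        If Not dict.Exists(parts(0)) Then dict.Add parts(0), ""
        ' append this row's project details to the entry for that name
        dict(parts(0)) = dict(parts(0)) & vbCrLf & "    " & _
                         parts(1) & "  " & parts(2) & "  " & parts(3)
    End If
Loop
Close #filenum

For Each key In dict.Keys
    Debug.Print key & dict(key)     ' or use the form's Print method to draw it on the form
Next key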
Microsoft has a CSV ADO provider. I think it is installed along with the rest of ADO. This is exactly the format it was designed to read.
See http://www.vb-helper.com/howto_ado_load_csv.html for a VB sample.
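As a rough illustration of that approach, one way to read the file through ADO is the Jet text IISAM; the provider settings below are an assumption (the linked sample shows the details), and a schema.ini file may be needed to control the exact column types:
' Sketch only: connection settings are an assumption, not taken from the link.
Dim cn As ADODB.Connection, rs As ADODB.Recordset
Set cn = New ADODB.Connection
cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
        "Data Source=U:\;" & _
        "Extended Properties=""text;HDR=No;FMT=Delimited"""
Set rs = cn.Execute("SELECT * FROM [test.txt]")
Do While Not rs.EOF
    Debug.Print rs.Fields(0).Value, rs.Fields(1).Value, rs.Fields(2).Value, rs.Fields(3).Value
    rs.MoveNext
Loop
rs.Close
cn.Close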
Do I understand you correctly in that you want to keep track of the names entered and thus re-order the data that way? Why not just read the data into a list of some new type that has the name, project, and other information and then sort that before printing it?
While the Dictionary solution is simpler, this may be a better approach if you are OK with building a class and implementing IComparer, so that you can sort the list and get this done pretty easily.
You could read each line, strip out the quotes, split on the comma, then process the array of data you would be left with:
Dim filenum As Integer
Dim inputLine As String
Dim data() As String

filenum = FreeFile
Open "U:\test.txt" For Input As #filenum
Do While Not EOF(filenum)
    Line Input #filenum, inputLine
    inputLine = Replace(inputLine, Chr(34), vbNullString)
    data = Split(inputLine, ",")
    Debug.Print data(0), data(1), data(2), data(3)
Loop
Close #filenum
Or you could have the Input command strip the quotes, and read the data into variables:
Dim filenum As Integer
Dim name As String, project As String, dat As String, hours As String

filenum = FreeFile
Open "U:\test.txt" For Input As #filenum
Do While Not EOF(filenum)
    Input #filenum, name, project, dat, hours
    Debug.Print name, project, dat, hours
Loop
Close #filenum
