This is for a project I'm working on.
Using Python, I've read in a big CSV file that has about 2,000 rows and turned it into a list.
Below is the script I used to create the list:
data = []  # will put the data in here
with open('output.csv', "r") as file:  # open the file
    for data_row in file:
        # get the data one row at a time: split the row into columns, stripping
        # whitespace from each one, and store it in 'data'
        data.append([x.strip() for x in data_row.split(",")])
My main goal with this project is to create a table directly in SQL Server from a Python script, using pandas for example:
df = pd.DataFrame(mydata, columns=['column1', 'column2', ...])
However, I ran into a problem while splitting: some fields contain people's names in 'Doe, John' format, which creates extra columns, and because of that, when I pass column names to pd.DataFrame it throws 'AssertionError: 39 columns passed, passed data had 44 columns'.
Could someone please help me solve this problem? I'd appreciate it much!
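One common fix, sketched below under the assumption that the comma-containing name fields are quoted in the file (e.g. "Doe, John"), is to let Python's csv module do the splitting, since it respects quotes:

import csv

data = []  # will put the data in here
with open('output.csv', newline='') as file:
    # csv.reader keeps a quoted "Doe, John" as one field instead of splitting it in two
    for row in csv.reader(file, skipinitialspace=True):
        data.append([x.strip() for x in row])

With that, every row should come back with the same number of fields, and the pd.DataFrame call should line up with the 39 column names.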
I have records in a flat file that have what you might consider to be master records, with detail records following the master they relate to until the next master record.
Here is an example:
Order123, Customer1, 1/1/2018
Item1, $1
Item2, $1
Order124, Customer2, 1/1/2018
Item1, $1
Item4, $2
The file has no line numbers or any kind of sequencing built in, nor does it use foreign keys to relate master to detail.
If I were to use SSIS to import the raw TXT data into a flexible table with columns designed to take various datatypes (e.g. nvarchar(255), or similar), I could iterate over the values after the import and relate the values in Line #2 and Line #3 with Order123, and consequently Lines #5 and #6 with Order124.
The table holding the raw data will use a simple RecordID identity column with an integer incrementing by one.
It doesn't really matter, but if you're curious, the actual data I'm referring to is the Retrosheet event files, a collection of play-by-play data for all Major League Baseball games. A real file can be downloaded from a link on this page:
https://www.retrosheet.org/game.htm
I seem to recall that you could not import TXT data into a table and expect the order of the rows to match the order of the TXT lines. When I do small tests of this, however, the records do appear in the same order as the source file. I suspect that my small test results were too good to be true and not a fail-safe predictor of how it will turn out.
In summary:
How do I use SSIS to import data, inserting SQL records in the same order as the original flat file?
The answer is yes, flat files are processed in order, as long as you aren't applying any kind of sorting.
I was able to process the Retrosheet files by creating a table in my DB that had an identity column, and a varchar column long enough to hold each line of the file (I chose 100). I then set up my flat file connection with Ragged Right formatting, defining the row delimiter as {CR}{LF}.
I just typed this out so there might be a few errors in syntax but it should get you close.
You will need to set up 2 different outputs.
Order of load will not matter as you are adding a foreign key to the detail table.
public string orderNo;   // declared on the OUTSIDE of the method (class level) so it carries across rows

public override void CreateNewOutputRows()   // entry point for a Script Component configured as a source
{
    string[] lines = System.IO.File.ReadAllLines([filename]);
    foreach (var line in lines)
    {
        string[] cols = line.Split(',');

        if (cols.Length == 3)    // master (order) record
        {
            orderNo = cols[0];
            Output0Buffer.AddRow();
            Output0Buffer.OrderNo = cols[0];
            Output0Buffer.Customer = cols[1];
            Output0Buffer.OrderDate = DateTime.Parse(cols[2].Trim());
        }
        else                     // detail (item) record
        {
            Output1Buffer.AddRow();
            Output1Buffer.OrderNo = orderNo;   // links the detail back to the last master seen
            Output1Buffer.Item = cols[0];
            Output1Buffer.Amt = cols[1];       // this needs to be parsed later
        }
    }
}
FOLLOW UP:
I just reviewed the site you are trying to download from, and the file is more complicated than your question let on.
Split still seems safe to use, but you will have to trim some quote-wrapped strings (names). It looks like there are no quote-wrapped commas (at least in the examples); if that turns out not to be the case, you will need to use REGEX to split instead.
I would change the logic to a switch/case based on cols[0] being one of the 8 record types.
Save the ID on the outside and write it to each of the 7 other possible datasets this creates, for linkage to the parent. You will have to use the same strategy for other records that need to be tied to a different parent (I think comment is an example).
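To illustrate that dispatch logic, here is a rough sketch in Python rather than the script component's C#; the record type names are an assumption based on the Retrosheet docs, so check the spec before relying on them:

import csv
from collections import defaultdict

outputs = defaultdict(list)   # one list per record type, standing in for the separate output buffers
current_game_id = None        # the "ID on the outside", carried forward to link children to their game

with open("event_file.EVN", newline="") as f:   # file name hypothetical
    for cols in csv.reader(f):
        record_type = cols[0]
        if record_type == "id":                 # a new game starts: remember its id for the rows that follow
            current_game_id = cols[1]
        elif record_type in ("version", "info", "start", "sub", "play", "com", "data"):
            # every other record gets the current game id attached so it can be tied back to its parent
            outputs[record_type].append([current_game_id] + cols[1:])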
GOOD LUCK with all this. Plays do not look easy to interpret!
I need to create an export of data from SQL Server (multiple tables) into a fixed width text file. The text file will have rows that are different based on the record type.
Header Info (Customer, Address)
Line Item Info (Customer, Item, Qty)
Summary Info (Customer, Total Qty)
Any suggestions to accomplish this efficiently?
I'm currently re-casting all columns to char to create the "fixed width", then using SSIS to merge the tables before exporting as a ragged-right text file. However, because not all the widths are the same, I'm having to concatenate the line item info into one column to make the merge work. Also, the header info is being merged AFTER the line item info, not before, so there's a sorting problem there. Not sure if I'm going down the right path?
Hope that made sense... this export is used to import into a COBOL type system.
Thanks,
Using SSIS, create three data flow tasks, each creating a single text file in the fixed-width format:
File 1: Header Info
File 2: Line Item Info
File 3: Summary Info
Then concatenate them together into a fourth file using the approach described in the following link:
How to concatenate 2 files in SSIS (Integration Services)?
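If the final concatenation step turns out to be easier outside SSIS, a rough Python equivalent would be (file names hypothetical):

# Stitch the three fixed-width files together, in the order the target system expects.
parts = ["header_info.txt", "line_item_info.txt", "summary_info.txt"]

with open("combined_export.txt", "w", newline="") as out:
    for part in parts:
        with open(part, "r", newline="") as f:
            out.write(f.read())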
Hope this helps.
For these sorts of problems, I reach for SSIS. It eats this kind of thing for lunch.
I have two CSV files:
Identity(no,name,Age) which has 10 rows
Location(Address,no,City) which has 100 rows
I need to extract rows and match the no column between the Identity and Location CSV files.
Take a single row from the Identity CSV file and compare Identity.no with Location.no across the 100 rows in the Location CSV file.
If it matches, combine name, Age, Address, and City from Identity and Location.
Note: I need to take the 1st row from Identity and compare it with the 100 rows in the Location CSV file, then take the 2nd row and compare it with the 100 rows, and so on up to the 10 rows in the Identity CSV file.
Then convert the overall results into JSON and move the results into SQL Server.
Is this possible in Apache NiFi?
Any help appreciated.
You can do this in NiFi by using the DistributedMapCache feature, which implements a key/value store for lookups. The setup requires a distributed map cache, plus two flows - one to populate the cache with your Address records, and one to look up the address by the no field.
The DistributedMapCache is defined by two controller services, a DistributedMapCacheServer and a DistributedMapCacheClientService. If your data set is small, you can just use "localhost" as the server.
Populating the cache requires reading the Address file, splitting the records, extracting the no key, and putting key/value pairs to the cache. An approximate flow might include GetFile -> SplitText -> ExtractText -> UpdateAttribute -> PutDistributedMapCache.
Looking up your identity records is actually fairly similar to the flow above, in that it requires reading the Identity file, splitting the records, extracting the no key, and then fetching the address record. Processor flow might include GetFile -> SplitText -> ExtractText -> UpdateAttribute -> FetchDistributedMapCache.
You can convert the results, in whole or in part, from CSV to JSON with AttributesToJSON, or maybe ExecuteScript.
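For reference, the lookup the two flows perform can be sketched in Python (file and column names taken from the question); this only shows the shape of the result, not a replacement for the NiFi flow:

import csv, json

# Build a lookup keyed on "no" from the 100-row Location file (the role of PutDistributedMapCache).
with open("Location.csv", newline="") as f:
    locations = {row["no"]: row for row in csv.DictReader(f)}

# Walk the 10-row Identity file and fetch the matching address record (the role of FetchDistributedMapCache).
results = []
with open("Identity.csv", newline="") as f:
    for row in csv.DictReader(f):
        match = locations.get(row["no"])
        if match:
            results.append({"name": row["name"], "Age": row["Age"],
                            "Address": match["Address"], "City": match["City"]})

print(json.dumps(results, indent=2))   # this JSON is what would then be sent on to SQL Server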
I have been given a task to load a simple flat file into another using an SSIS package. The source flat file contains a zip code field; my task is to extract into another flat file only the rows with a correct zip code, meaning a 5-digit zip code, and redirect the invalid rows to a new file.
Since I am new to SSIS, any help or ideas is much appreciated.
You can add a derived column which determines the length of the field. Then you can add a conditional split based on that column. <= 5 goes the good path, > 5 goes the reject path.
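For reference, the check the question asks for (exactly five digits) looks like this in Python; in the package itself the same test would live in the derived column and conditional split expressions:

def is_valid_zip(value: str) -> bool:
    """True only for exactly five digits, e.g. '90210'."""
    value = value.strip()
    return len(value) == 5 and value.isdigit()

# Rows failing this test are the ones that would be redirected to the reject file.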
Summary: Is there a limit to the number of columns which can be imported/loaded from a CSV file? If yes, what is the workaround? Thanks
I am very new to DB2, and I am supposed to import a | (pipe) delimited CSV file, which contains 532 columns, into a DB2 table that also has 532 columns in the exact same positions as the CSV. I also have a smaller file with only 27 columns in both the CSV and the table. I am using the following command:
IMPORT FROM "C:\myfile.csv" OF DEL MODIFIED BY COLDEL| METHOD P (1, 2,....27) MESSAGES "C:\messages.txt" INSERT INTO PRE_SUBS_GPRS2_1010 (col1,col2,....col27);
This works fine.
But the same command for the second file, which is like this:
IMPORT FROM "C:\myfile.csv" OF DEL MODIFIED BY COLDEL| METHOD P (1, 2,....532) MESSAGES "C:\messages.txt" INSERT INTO PRE_SUBS_GPRS_1010 (col1,col2,....col532);
It does not work. It gives me an error that says:
SQL3037N An SQL error "-206" occurred during Import processing.
Explanation:
An SQL error occurred during processing of the Action String (for
example, "REPLACE into ...") parameter.
The command cannot be processed.
User Response:
Look at the SQLCODE (message number) in the message for more
information. Make changes and resubmit the command.
I am using the Control Center to run the query, not command prompt.
The problem was that one of the column names in the column list of the INSERT statement was more than 30 characters long. It was getting truncated and was not recognized.
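A quick way to spot the offending names before re-running the IMPORT, sketched in Python (the column list below is a hypothetical stand-in for the 532 names in your INSERT clause):

columns = ["col1", "col2", "a_column_name_that_runs_well_past_thirty_characters"]  # hypothetical list

for name in columns:
    if len(name) > 30:
        print(f"{name} is {len(name)} characters long and is at risk of being truncated")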
Hope this helps others in future. Please let me know if you need further details.
The specific error code is SQL0206, and the documentation about this error is here:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.messages.sql.doc/doc/msql00206n.html
As for the limits, I think the maximum number of columns in an import should be the maximum permitted for a table. Take a look in the Information Center under:
Database fundamentals > SQL > SQL and XML limits
Maximum number of columns in a table: 1012
Try to import just one row. If you have problems, it is probably due to incompatible types, column order, or rows that duplicate ones already present in the table.