Import XML objects in batches - arrays

I'm working on a PowerShell script that deals with a very large dataset. I have found that it runs well until the available memory is consumed. Because of how large the dataset is, and what the script does, it has two arrays that become very large: the original array is around half a gigabyte, and the final object is easily six or seven gigabytes in memory. My idea is that it should work better if I can release rows as they are finished and run the script in increments.
I am able to split the imported XML using a function I've found and tweaked, but I'm not able to change the data actually contained in the array.
This is the script I'm using to split the array into batches currently: https://gallery.technet.microsoft.com/scriptcenter/Split-an-array-into-parts-4357dcc1
And this is the code used to import and split the results.
# Import object which should have been prepared beforehand by the query
# script. (QueryForCombos.ps1)
$SaveObj = "\\server\share$\me\Global\Scripts\Resultant Sets\LatestQuery.xml"
$result_table_import = Import-Clixml $SaveObj
if ($result_table_import.Count -gt 100000) {
    $result_tables = Split-Array -inArray $result_table_import -size 30000
} else {
    $result_tables = Split-Array -inArray $result_table_import -parts 6
}
And then of course there is the processing script which actually uses the data and converts it as desired.

For large XML files, I don't think you want to read the whole thing into memory, as is required with an XmlDocument or Import-Clixml. You should look at the XmlTextReader as one way to process the XML file a bit at a time.

Related

How does one split large result set from a Group By into multiple flat files?

I'm far from an SSIS expert, and I'm attempting to correct an error (unspecified in the messages) that began once I modified a variable to increase the amount of data accumulated and exported into a flat file. (Note: the variable was a date in the WHERE clause that limited the data returned by the SELECT.)
So in the data flow there's a GROUP BY component, and I'm trying to find the appropriate component to put between that and the flat file destination component to chop up the results. I figured there'd be something to export, say, flatFile1.csv, flatFile2.csv, etc. based on a number of lines (so if I set a limit of 1 million lines and the results returned 3.5 million, I'd get 4 files with the last one containing half a million lines), or perhaps a max file size with similar results.
Which component should I use from the toolbox to guarantee a manageable file size?
Is a script component the only way to handle output of any size? If so, would it sit between the Group By and the Flat File destination components, or would the script completely obviate the need for the Flat File destination?

Can SSIS import TXT records in the exact same order as they appear in the TXT file? If not (by default), then how?

I have records in a flat file that are what you might consider master records, with detail records following the master they relate to until the next master record.
Here is an example:
Order123, Customer1, 1/1/2018
Item1, $1
Item2, $1
Order124, Customer2, 1/1/2018
Item1, $1
Item4, $2
The file has no line numbers or any kind of sequencing built in, nor does it use foreign keys to relate master to detail.
If I were to use SSIS to import the raw TXT data into a flexible table with columns designed to take various datatypes (e.g. nvarchar(255) or similar), I could iterate over the values after the import and relate the values in Line #2 and Line #3 to Order123, and consequently Lines #5 and #6 to Order124.
The table holding the raw data will use a simple RecordID identity column with an integer incrementing by one.
It doesn't really matter, but if you're curious, the actual data I'm referring to is the Retrosheet event files, a collection of all Major League Baseball data. A real file can be downloaded from a link on this page:
https://www.retrosheet.org/game.htm
I seem to recall that you could not import TXT data into a table and expect the order of the rows to match the order of the TXT lines. When I run small tests of this, however, the records do appear in the same order as the source file. I suspect my small test results were too good to be true and are not a fail-safe prediction of how it will turn out.
In summary:
How do I use SSIS to import data, inserting SQL records in the same order as the original flat file?
The answer is yes, flat files are processed in order, as long as you aren't applying any kind of sorting.
I was able to process the Retrosheet files by creating a table in my DB that had an identity column, and a varchar column long enough to hold each line of the file (I chose 100). I then set up my flat file connection with Ragged Right formatting, defining the row delimiter as {CR}{LF}.
I just typed this out so there might be a few errors in syntax but it should get you close.
You will need to set up 2 different outputs.
Order of load will not matter as you are adding a foreign key to the detail table.
public string orderNo;  // declared at class level, on the OUTSIDE of the method

public override void CreateNewOutputRows()
{
    // [filename] is a placeholder for the path to your flat file
    string[] lines = System.IO.File.ReadAllLines([filename]);
    foreach (var line in lines)
    {
        string[] cols = line.Split(',');
        if (cols.Length == 3)
        {
            // master (order) record: remember the order number for the detail rows that follow
            orderNo = cols[0];
            Output0Buffer.AddRow();
            Output0Buffer.OrderNo = cols[0];
            Output0Buffer.Customer = cols[1];
            Output0Buffer.OrderDate = DateTime.Parse(cols[2].Trim());
        }
        else
        {
            // detail (item) record: tie it back to the most recent master record
            Output1Buffer.AddRow();
            Output1Buffer.OrderNo = orderNo;
            Output1Buffer.Item = cols[0];
            Output1Buffer.Amt = cols[1]; // this needs to be parsed later
        }
    }
}
FOLLOW UP:
I just reviewed the site you are trying to download from, and the file is more complicated than your question let on.
Split still seems safe to use, though you will have to trim some quote-wrapped strings (names). It looks like there are no quote-wrapped commas (at least in the examples); if there are, you will need to split with a regex instead.
I would change the logic to a switch/case based on cols[0] being one of the 8 record types.
Save the ID outside the loop and write it to each of the 7 other possible outputs this creates, for linkage back to the parent. You will have to use the same strategy for other records that need to be tied to a different parent (I think comment is an example).
GOOD LUCK with all this. Plays do not look easy to interpret!

Save Pandas data frame from list of dicts as hdf5 table

I have a large pandas data frame that I create from a list of dictionaries where the column names are the dict keys. The columns contain different types of data, but the datatype is consistent in any given column.
Example: one of my columns contains 28x28 numpy arrays and another contains strings, etc. I would like to save this out as an HDF5 file in table format so I can query the data when reading it in later (these files are ~1-2 GB).
This is how I'm trying to save the hdf5 file:
df = pd.DataFrame(list_of_dicts)
df.convert_objects(convert_numeric=True)  # have also tried pd.to_numeric
df.to_hdf(path_to_save, 'df', format='table')
And I get the following error:
TypeError: Cannot serialize the column [image_dims] because
its data contents are [mixed] object dtype
The image_dims column in this case has a numpy array for each entry, and this happens on any column that has an object datatype in pandas; I am not sure how to change or set the dtype. I can save it in fixed format, but I'd really like to use tables to save on loading time, etc., with queries. I've seen some other questions similar to this, but not with regard to creating the data frame from a list of dictionaries, which may be what's causing the problem.
Thanks for any suggestions
You haven't actually changed your DF in memory: df.convert_objects(convert_numeric=True) returns a converted copy, but it WON'T change the original DF in place. Do it this way instead:
df.apply(pd.to_numeric, errors='coerce').to_hdf(path_to_save, 'df', format='table')
or
df = df.apply(pd.to_numeric)
df['image_dims'] = df['image_dims'].astype(str) # in case it can't be converted to numeric
df.to_hdf(path_to_save, 'df', format='table')
PS: you may want to pass the errors='coerce' or errors='ignore' parameter when calling pd.to_numeric(...), depending on what you want.
After trying many things I finally found what seems to be a good solution for saving mixed datatype objects in a way that they can be loaded quickly and queried. I thought I'd post it in case someone else is having trouble with this:
Tried and failed: saving the pandas data frame as HDF5 or JSON files. HDF5 works, but only in fixed format (no querying capability) when data types are mixed. JSON works with mixed types but is no faster to load for files of this size. I also tried converting all numpy arrays, lists, etc. to byte strings before saving: that was more trouble than it was worth given the number of columns and different data types I'm dealing with.
Solution: Use ZODB (https://pypi.python.org/pypi/ZODB). Object database that was very easy to implement, and is easy to integrate with pandas if data frames are your thing.
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction

storage = FileStorage('path_to_store.fs')
db = DB(storage)
connection = db.open()
root = connection.root()

# insert each column of the dataframe into the db
for col in df.columns:
    root[col] = df[col]

# commit changes to the db
transaction.commit()

# works like a dictionary with key:value pairs that correspond to column names
print(root.keys())

# close connection and db
connection.close()
db.close()
You can then read the database back with the same syntax and access the data through the root variable.
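For instance, here is a minimal sketch of the read side (assuming the same 'path_to_store.fs' file and that each stored value is a pandas Series, as in the code above; reassembling everything into a DataFrame is just one option):
import pandas as pd
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB

# open the same file storage (read-only is enough for querying)
storage = FileStorage('path_to_store.fs', read_only=True)
db = DB(storage)
connection = db.open()
root = connection.root()

# each key is a column name, each value is the pandas Series stored earlier
df = pd.DataFrame({col: root[col] for col in root.keys()})

connection.close()
db.close()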

Auto-generating destinations of split files in SSIS

I am working on my first SSIS package. I have a view with data that looks something like:
Loc Data
1 asd
1 qwe
2 zxc
3 jkl
And I need all of the rows to go to different files based on the Loc value. So all of the data rows where Loc = 1 should end up in the file named Loc1.txt, and the same for each other Loc.
It seems like this can be accomplished with a conditional split to flat file, but that would require a destination for each Location. I have a lot of Locations, and they all will be handled the same way other than being split in to different files.
Is there a built-in way to do this without creating a bunch of destination components? Or can I at least use a script component as a way to do this?
You should be able to set an expression using a variable. Define your path up to the directory and then set the variable equal to that column.
You'll need an Execute SQL task to return a Single Row result set, and loop that in a container for every row in your original result set.
I don't have access at the moment to post screenshots, but this link should help outline the steps.
So when your package runs, the expression will look like:
"C:\\Documents\\MyPath\\location" + @[User::LocationColumn] + ".txt"
It should end up populating your directory with one file per location.
Set User::LocationColumn equal to the Location column from your result set, and write your query to group by Location so that all records for a given location are written to a single file.
I spent some time trying to complete this task using the method @Phoenix suggested, but stumbled upon this video along the way.
I ended up going with the method shown in the video. I was hoping I wouldn't have to separate it into multiple SELECT statements for each location, plus an extra one to grab the distinct locations, but I thought the SSIS implementation in the video was much cleaner than the alternative.
Change the connection manager's connection string so that it uses a variable, and vary that variable at run time.
As the variable changes, the destination file changes too.
The connection string expression is:
"C:\\Documents\\ABC\\Files\\" + @[User::data] + ".txt"
Vote for this if it helps you.

Pandas, large file with varying number columns, in memory append

I would like to maintain a large PyTables table in an HDF5 file.
Normally, as new data comes in, I append to the existing table:
store = pd.HDFStore(path_to_dataset, 'a')
store.append("data", newdata)
store.close()
However, if the columns of the previously stored data and those of the incoming newdata only partially overlap, the following error is returned:
Exception: cannot match existing table structure for [col1,col2,col3] on appending data
In these cases, I would like behavior similar to the normal DataFrame append function, which fills non-overlapping entries with NaN:
import pandas as pd
a = {"col1":range(10),"col2":range(10)}
a = pd.DataFrame(a)
b = {"b1":range(10),"b2":range(10)}
b = pd.DataFrame(b)
a.append(b)
Is it possible to have a similar operation "in memory", or do I need to create a completely new file?
HDFStore stores tables row-oriented, so this is currently not possible.
You would need to read it in, append, and write it out. Possibly you could use: http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
However, you could also create the table with all of the possible columns at the beginning (and just leave them NaN).
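As a rough sketch of that second approach (the column list all_cols, the file name dataset.h5, and the helper append_with_schema are made up for illustration, and numeric data is assumed so everything can be cast to float): fix the full column set up front and reindex every incoming frame to it before appending, so the table schema never changes.
import pandas as pd

# every column that could ever appear, decided up front (hypothetical names)
all_cols = ["col1", "col2", "col3", "b1", "b2"]

def append_with_schema(path, key, frame, columns=all_cols):
    # reindex to the fixed column set: missing columns are filled with NaN;
    # casting to float keeps dtypes identical across appends (numeric data assumed)
    aligned = frame.reindex(columns=columns).astype("float64")
    with pd.HDFStore(path, "a") as store:
        store.append(key, aligned, format="table")

# usage: both frames land in the same table, NaN-filled where columns are missing
append_with_schema("dataset.h5", "data", pd.DataFrame({"col1": range(3), "col2": range(3)}))
append_with_schema("dataset.h5", "data", pd.DataFrame({"b1": range(3), "b2": range(3)}))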
