Pandas, large file with varying number of columns, in-memory append

I would like to maintain a large PyTables table in an HDF5 file.
Normally, as new data comes in, I would append to the existing table:
store = pd.HDFStore(path_to_dataset, 'a')
store.append("data", newdata)
store.close()
However, if the columns of the old stored data and those of the incoming newdata only partially overlap, the following error is returned:
Exception: cannot match existing table structure for [col1,col2,col3] on appending data
In these cases, I would like to get behavior similar to the normal DataFrame append function, which fills non-overlapping entries with NaN:
import pandas as pd
a = {"col1":range(10),"col2":range(10)}
a = pd.DataFrame(a)
b = {"b1":range(10),"b2":range(10)}
b = pd.DataFrame(b)
a.append(b)
Is it possible to have a similar operation "in memory", or do I need to create a completely new file?

HDFStore stores tables row-oriented, so this is currently not possible.
You would need to read the data in, append, and write it back out. Possibly you could use: http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
However, you could also create the table with all possible columns at the beginning (and just leave the unused ones as NaN).
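For example, a rough sketch of the read-append-write route, assuming the combined data still fits in memory (path_to_dataset and newdata are the objects from the question):
import pandas as pd

store = pd.HDFStore(path_to_dataset, 'a')
old = store['data']                    # read the existing table into memory
combined = pd.concat([old, newdata])   # non-overlapping columns are filled with NaN
store.remove('data')                   # drop the old table
store.append('data', combined)         # write the combined frame back out
store.close()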

Related

How to find distinct values in a list of columns and print them in a single CSV

I have a large dataset to analyze, and I need to look at the distinct values for multiple features (flags).
I am attempting to run a for loop as follows:
d = {}
name_list = ["ultfi_ind", "status"] # Add names of columns here
for x in name_list:
    d["{0}".format(x)] = test_df.select(x).distinct().collect() # Please change df name
dist_val = pd.DataFrame.from_dict(d)
Here I am specifying the column names in the name_list list, and then in the for loop I am finding the distinct values in each of the columns and saving the output in a dictionary.
Finally, I am attempting to combine it all in a single dataframe, but that isn't possible because the columns aren't all the same length.
I am aware that one way to do it is via padding, but I find that too complex a solution and am wondering if there's a smarter way to go about this.
Note that I am running this in a Spark environment, as my dataset is large.
I imagine the ultimate output to be a single CSV file/dataframe in which each header is one of the column names from name_list (above) and the distinct values are listed underneath it.
What you are talking about is data profiling. The pandas DataFrame has a describe function to start your journey off.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#
If you want something a little more graphical, look at this article on towards data science.
https://towardsdatascience.com/3-tools-for-fast-data-profiling-5bd4e962e482
If you want to roll your own, you can. I am not going to code it all for you, since you will not learn that way. But here is an algorithm that I would use.
1 - Get a list of columns and types in the data set.
2 - If numeric, we can give aggregations such as min, max, avg, etc.
3 - If non-numeric, we can group by field + count() occurrences
All this data can be output as a data frame and saved to your favorite format.
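A minimal sketch of that algorithm in pandas, assuming the data has already been brought down into a pandas frame called pdf (for instance via toPandas(), mentioned below); the output layout is only illustrative:
import pandas as pd
from pandas.api.types import is_numeric_dtype

rows = []
for col in pdf.columns:
    if is_numeric_dtype(pdf[col]):
        # numeric column: basic aggregations
        stats = pdf[col].agg(['min', 'max', 'mean']).to_dict()
    else:
        # non-numeric column: count occurrences per distinct value
        stats = pdf[col].value_counts().to_dict()
    rows.append({'column': col, 'dtype': str(pdf[col].dtype), 'profile': stats})
profile_df = pd.DataFrame(rows)
profile_df.to_csv('profile.csv', index=False)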
Last but not least, there is the pandas API on Spark (pyspark.pandas, which replaces Koalas) and the DataFrame.toPandas method. These let you convert a Spark DataFrame to a pandas DataFrame so you can use some of these prebuilt functions.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html
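As a hedged usage sketch, reusing test_df and name_list from the question (toPandas() collects everything to the driver, so only do this if the selected columns fit in memory):
pdf = test_df.select(name_list).toPandas()
print(pdf.describe(include='all'))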

Can SSIS import TXT records in the exact same order as they are in the TXT file? If not (by default), then how?

I have records in a flat file with what you might consider to be master records, with detail records following the master they relate to, until the next master record.
Here is an example:
Order123, Customer1, 1/1/2018
Item1, $1
Item2, $1
Order124, Customer2, 1/1/2018
Item1, $1
Item4, $2
The file has no line numbers or any kind of sequencing built in, nor does it use foreign keys to relate master to detail.
If I were to use SSIS to import the raw TXT data into a flexible table with columns designed to take various datatypes (e.g. nvarchar(255) or similar), I could iterate over the values after the import and relate the values in Line #2 and Line #3 with Order123, and consequently Lines #5 and #6 with Order124.
The table holding the raw data will use a simple RecordID identity column with an integer incrementing by one.
It doesn't really matter, but if you're curious, the actual data I'm referring to is the Retrosheet data event files. It's a collection of all Major League Baseball data. A real file could be downloaded from a link on this page:
https://www.retrosheet.org/game.htm
I seem to recall that you could not import TXT data into a table and expect the order of the rows to match the order of the TXT lines. When I do small tests of this, however, the records do appear in the same order as the source file. I suspect that my small test results were too good to be true and are not a fail-safe prediction of how it will turn out.
In summary:
How do I use SSIS to import data, inserting SQL records in the same order as the original flat file?
The answer is yes, flat files are processed in order, as long as you aren't applying any kind of sorting.
I was able to process the Retrosheet files by creating a table in my DB that had an identity column, and a varchar column long enough to hold each line of the file (I chose 100). I then set up my flat file connection with Ragged Right formatting, defining the row delimiter as {CR}{LF}.
I just typed this out so there might be a few errors in syntax but it should get you close.
You will need to set up 2 different outputs.
Order of load will not matter as you are adding a foreign key to the detail table.
public string orderNo;  // declared on the OUTSIDE (class level) so it persists across rows

public override void CreateNewOutputRows()
{
    string[] lines = System.IO.File.ReadAllLines([filename]);
    foreach (var line in lines)
    {
        string[] cols = line.Split(',');
        if (cols.Length == 3)
        {
            // master (order) record
            orderNo = cols[0];
            Output0Buffer.AddRow();
            Output0Buffer.OrderNo = cols[0];
            Output0Buffer.Customer = cols[1];
            Output0Buffer.OrderDate = DateTime.Parse(cols[2].Trim());
        }
        else
        {
            // detail (item) record, linked back to the most recent order
            Output1Buffer.AddRow();
            Output1Buffer.OrderNo = orderNo;
            Output1Buffer.Item = cols[0];
            Output1Buffer.Amt = cols[1]; // this still needs to be parsed later
        }
    }
}
FOLLOW UP:
I just reviewed the site you are trying to download from, and the file is more complicated than your question let on.
Split still seems safe to use, though you will have to trim some quote-wrapped strings (names). It looks like there are no quote-wrapped commas (at least in the examples); if there were, you would need to use a regex to split instead.
I would change the logic to use switch and case and base it on cols[0] being one of the 8 types.
Save the ID on the outside and write it to each of the 7 other possible datasets this creates, for linkage to the parent. You will have to use the same strategy for other records that need to be tied to a different parent (I think comment is an example).
GOOD LUCK with all this. Plays do not look easy to interpret!

Save Pandas data frame from list of dicts as hdf5 table

I have a large pandas data frame that I create from a list of dictionaries where the column names are the dict keys. The columns contain different types of data, but the datatype is consistent in any given column.
Example: one of my columns contains 28x28 numpy arrays and another contains strings, etc. I would like to save this out as an HDF5 file in table format so I can query the data when reading it in later (these files are ~1-2 GB).
This is how I'm trying to save the hdf5 file:
df = pd.DataFrame(list_of_dicts)
df.convert_objects(convert_numeric=True)  # have also tried pd.to_numeric
df.to_hdf(path_to_save, 'df', format='table')
And I get the following error:
TypeError: Cannot serialize the column [image_dims] because
its data contents are [mixed] object dtype
The image_dims column in this case has a numpy array for each entry, and this happens on any column that has an object datatype in pandas; I am not sure how to change/set it. I can save it in fixed format, but I'd really like to use tables to save on loading time, etc., with queries. I've seen some other questions similar to this, but not with regard to creating the data frame from a list of dictionaries, which may be what is causing the problem.
Thanks for any suggestions
You haven't actually changed your DF in memory: df.convert_objects(convert_numeric=True) returns a converted copy, but it WON'T change the original DF in place. Do it this way instead:
df.apply(pd.to_numeric, errors='coerce').to_hdf(path_to_save, 'df', format='table')
or
df = df.apply(pd.to_numeric)
df['image_dims'] = df['image_dims'].astype(str) # in case it can't be converted to numeric
df.to_hdf(path_to_save, 'df', format='table')
PS: you may want to use the errors='coerce' or errors='ignore' parameter when calling the pd.to_numeric(...) function, depending on what you want.
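A quick illustration of the difference, with made-up values:
import pandas as pd

s = pd.Series(['1', '2', 'not_a_number'])
pd.to_numeric(s, errors='coerce')   # the unparseable value becomes NaN
pd.to_numeric(s, errors='ignore')   # the Series is returned unchanged because one value fails to parse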
After trying many things I finally found what seems to be a good solution for saving mixed datatype objects in a way that they can be loaded quickly and queried. I thought I'd post it in case someone else is having trouble with this:
Tried and failed: saving the pandas data frame as HDF5 or JSON files. HDF5 works, but only in fixed format (no querying capability) when data types are mixed. JSON works with mixed types but is no faster to load for files of this size. I also tried converting all numpy arrays, lists, etc. to byte strings for saving: this was more trouble than it was worth with the number of columns and different data types I'm dealing with.
Solution: Use ZODB (https://pypi.python.org/pypi/ZODB). Object database that was very easy to implement, and is easy to integrate with pandas if data frames are your thing.
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
storage = FileStorage('path_to_store.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
# insert each column of the dataframe into the db
for col in df.columns:
    root[col] = df[col]
# commit changes to the db
transaction.commit()
# works like a dictionary with key:value pairs that correspond to column names
print(root.keys())
# close db and connection
db.close()
connection.close()
You can then read a database with the same syntax, and access data through the root variable.
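For example, a minimal read-back sketch using the same API (the path and the column loop are illustrative):
import pandas as pd
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB

storage = FileStorage('path_to_store.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
# rebuild a data frame from the stored columns
df = pd.DataFrame({col: root[col] for col in root.keys()})
connection.close()
db.close()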

Import XML objects in batches

I'm working on a PowerShell script that deals with a very large dataset. I have found that it runs very well until the available memory is consumed. Because of how large the dataset is and what the script does, it has two arrays that become very large. The original array is around half a gig, and the final object is easily six or seven gigs in memory. My idea is that it should work better if I'm able to release rows as they are done and run the script in increments.
I am able to split the imported XML using a function I've found and tweaked, but I'm not able to change the data actually contained in the array.
This is the script I'm using to split the array into batches currently: https://gallery.technet.microsoft.com/scriptcenter/Split-an-array-into-parts-4357dcc1
And this is the code used to import and split the results.
# Import object which should have been prepared beforehand by the query
# script. (QueryForCombos.ps1)
$SaveObj = "\\server\share$\me\Global\Scripts\Resultant Sets\LatestQuery.xml"
$result_table_import = Import-Clixml $SaveObj
if ($result_table_import.Count -gt 100000) {
    $result_tables = Split-Array -inArray $result_table_import -size 30000
} else {
    $result_tables = Split-Array -inArray $result_table_import -parts 6
}
And then of course there is the processing script which actually uses the data and converts it as desired.
For large XML files, I don't think you want to read it all into memory, as is required with an XmlDocument or Import-Clixml. You should look at XmlTextReader as one way to process the XML file a bit at a time.

TextFieldParser into DataTable

I have a csv file read into a TextFieldParser.
Before I place the data rows into a DataTable, I want to add a couple of extra fields that are not in the csv.
This line writes all the csv data into the table ok -
tempTable.Rows.Add(parser.ReadFields())
If I do something like this -
tempTable.Rows.Add(parser.ReadFields(), stationID, sMaxSpeedDecimal, sqlFormattedDate)
the Rows.Add seems to treat all the parser data as one field and then append the new columns. Basically, the parser data is lost to the database.
How can I add additional columns so the tempTable.Rows.Add includes all the parser data plus new column data in one write?
You must either pass one array containing all the field values or else pass all the field values individually. Because you are passing multiple arguments, it is assumed to be the latter and the array is treated as one field value. You must either break up the array and pass each field individually, e.g.
Dim fields = parser.ReadFields()
tempTable.Rows.Add(fields(0), fields(1), stationID, sMaxSpeedDecimal, sqlFormattedDate)
or else combine the extra field values with the original array to create a new array, e.g.
Dim fields = parser.ReadFields().Cast(Of Object)().Concat({stationID, sMaxSpeedDecimal, sqlFormattedDate}).ToArray()
tempTable.Rows.Add(fields)
