Merge/concatenate CSV-imported dataframes and delete duplicates - loops

I am following up on my previous question.
I have sorted out a loop to import the CSVs, concatenate the data, and remove duplicates:
files = glob.glob('./A08_csv/A08_B1_T*.csv')
dfs = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True) for fp in files]
df = pd.concat(dfs)
df_purged = df.drop_duplicates(inplace=True)
print df_purged
However, df.drop_duplicates(inplace=True) does not work (surely I am missing something), and the print just shows None. How can I tell it to check for duplicates by index? Passing a column name does not seem to work.
Also, how can I transform this loop into a formula, so I can apply this recursive input to csv with different filenames (i.e something that could work for A08_B1_T*.csv (bedroom) and for A08_KI_T*.csv (kitchen) etc.)?

Do you understand what the inplace=True option does?
If you drop duplicates in place, you modify df itself, so you must not assign the return value (which is None) to df_purged.
You have two options here: either you want to keep the 'unpurged' dataframe and you do:
df_purged = df.drop_duplicates()
or you don't care about keeping it and you do:
df.drop_duplicates(inplace=True)
With the first option your result dataframe is df_purged; with the second it is df itself, which is purged since you performed the operation in place.
That being said, if you want to purge on your index and you don't need to keep it, you can reset_index and then drop_duplicates like this:
df_purged = df.reset_index().drop_duplicates(['index']).drop('index',1)
And if you need to keep the index (modulo the dropped lines):
df_purged = df.reset_index().drop_duplicates(['index']).set_index('index')
df_purged.index.name = None
(Note that clearing the index name is only here for aesthetics.)

Would this help?
df.drop_duplicates(['col_name'])
Here is a solution that adds the index as a dataframe column, drops duplicates on that, then restores the index:
df = df.reset_index().drop_duplicates(subset=['Date', 'Time'], keep='last').set_index(['Date', 'Time'])
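As for the second part of the question (reusing the pipeline for different filename patterns such as A08_B1_T*.csv and A08_KI_T*.csv), here is a minimal sketch that wraps the loop in a function. It drops duplicate index entries via index.duplicated, which needs a reasonably recent pandas; the function name is illustrative:
import glob
import pandas as pd

def load_and_purge(pattern):
    # read every CSV matching the glob pattern, with the first column as a parsed date index
    files = glob.glob(pattern)
    dfs = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True) for fp in files]
    df = pd.concat(dfs)
    # keep only the first occurrence of each index value
    return df[~df.index.duplicated(keep='first')]

bedroom = load_and_purge('./A08_csv/A08_B1_T*.csv')
kitchen = load_and_purge('./A08_csv/A08_KI_T*.csv')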

Related

Filter Array For IDs Existing in Another Array with Ruby on Rails/Mongo

I need to compare the two arrays declared here to return records that exist only in the filtered_apps array. I am using the contents of the previous_apps array to see if an ID in a record exists in the filtered_apps array. I will be outputting the results to a CSV and displaying records that exist in both arrays on the console.
My question is this: how do I get the records that exist only in filtered_apps? Easiest for me would be to put those unique records into a new array to work with for the CSV.
start_date = Date.parse("2022-02-05")
end_date = Date.parse("2022-05-17")
valid_year = start_date.year
dupe_apps = []
uniq_apps = []

# Finding applications that meet my criteria:
filtered_apps = FinancialAssistance::Application.where(
  :is_requesting_info_in_mail => true,
  :aasm_state => "determined",
  :submitted_at => {
    "$exists" => true,
    "$gte" => start_date,
    "$lte" => end_date })

# Finding applications that I want to compare against filtered_apps
previous_apps = FinancialAssistance::Application.where(
  is_requesting_info_in_mail: true,
  :submitted_at => {
    "$exists" => true,
    "$gte" => valid_year })

# I'm using this to pull the ID that I'm using for comparison, just to make the comparison lighter by only storing the family_id
previous_apps.each do |y|
  previous_apps_array << y.family_id
end

# This is where I'm doing my comparison and it is not working.
filtered_apps.each do |app|
  if app.family_id.in?(previous_apps_array) == false
  then #non_dupe_apps << app
  else "No duplicate found for application #{app.hbx_id}"
  end
end
end
So what am I doing wrong in the last code section?
Let's check your original method first (I fixed the indentation to make it clearer). There are quite a few issues with it:
filtered_apps.each do |app|
  if app.family_id.in?(previous_apps_array) == false
  # Where is "#non_dupe_apps" declared? It isn't anywhere in your example...
  # Also, "then" is not necessary unless you want a one-line if-statement
  then #non_dupe_apps << app
  # This doesn't do anything, it's just a string
  # You need to use "p" or "puts" to output something to the console
  # Note that the "else" is also only triggered when duplicates WERE found...
  else "No duplicate found for application #{app.hbx_id}"
  end
end
end # Extra "end" here, this will mess things up
Also, you haven't declared previous_apps_array anywhere in your example, you just start adding to it out of nowhere.
Getting the difference between two arrays is dead easy in Ruby: just use the - operator!
uniq_apps = filtered_apps - previous_apps
You can also do this with ActiveRecord results, since they are just arrays of ActiveRecord objects. However, this doesn't help if you specifically need to compare results using the family_id column.
TIP: Getting the values of only a specific column/columns from your database is probably best done with the pluck or select method if you don't need to store any other data about those objects. With pluck, you only get an array of values in the result, not the full objects. select works a bit differently and returns ActiveRecord objects, but filters out everything but the selected columns. select is usually better in nested queries, since it doesn't trigger a separate query when used as a part of another query, while pluck always triggers one.
# Querying straight from the database
# This is what I would recommend, but it doesn't print the values of duplicates
uniq_apps = filtered_apps.where.not(family_id: previous_apps.select(:family_id))
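For comparison, a sketch of the pluck route mentioned in the tip above (assuming the same models as in the question; note that this pulls the IDs into a plain Ruby array first):
# pluck fetches just the family_id values (one extra query, but no full objects)
previous_family_ids = previous_apps.pluck(:family_id)

# keep only the apps whose family_id never appeared in previous_apps
uniq_apps = filtered_apps.reject { |app| previous_family_ids.include?(app.family_id) }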
I highly recommend getting really familiar with at least filter/select, and map out of the basic array methods. They make things like this way easier. The Ruby docs are a great place to learn about them and others. A very simple example of doing a similar thing to what you explained in your question with filter/select on 2 arrays would be something like this:
arr = [1, 2, 3]
full_arr = [1, 2, 3, 4, 5]

unique_numbers = full_arr.filter do |num|
  if arr.include?(num)
    puts "Duplicates were found for #{num}"
    false
  else
    true
  end
end
# Duplicates were found for 1
# Duplicates were found for 2
# Duplicates were found for 3
# => [4, 5]
NOTE: The OP is working with Ruby 2.5.9, where filter is not yet available as an array method (it was introduced in 2.6). However, filter is just an alias for select, which can be found on earlier versions of Ruby, so they can be used interchangeably. Personally, I prefer using filter because, as seen above, select is already used in other methods, and filter is also the more common term in the other programming languages I usually work with. Of course, when both are available it doesn't really matter which one you use, as long as you keep it consistent.
EDIT: My last answer did, in fact, not work.
Here is the code all nice and working.
It turns out the issue was that when comparing family_id against the set of records, I forgot that the looped record was itself part of the set, so the query would return it too. I added a check that excludes the looped record's ID and Bob's your uncle.
I added the pass and reject arrays so I could check my work instead of downloading a CSV every time. I'm leaving them in, mostly because I'm scared to change anything else.
start_date = Date.parse(date_from)
end_date = Date.parse(date_to)
valid_year = start_date.year
date_range = (start_date)..(end_date)

comparison_apps = FinancialAssistance::Application.by_year(start_date.year).where(
  aasm_state: 'determined',
  is_requesting_voter_registration_application_in_mail: true)

apps = FinancialAssistance::Application.where(
  :is_requesting_voter_registration_application_in_mail => true,
  :submitted_at => date_range).uniq { |n| n.family_id }

#pass_array = []
#reject_array = []

apps.each do |app|
  family = app.family
  app_id = app.id
  previous_apps = comparison_apps.where(family_id: family.id, :id.ne => app.id)
  if previous_apps.count > 0
    #reject_array << app
    puts "\e[32mApplicant hbx id \e[31m#{app.primary_applicant.person_hbx_id}\e[32m in family ID \e[31m#{family.id}\e[32m has registered to vote in a previous application.\e[0m"
  else
    <csv fields here>
    csv << [csv fields here]
  end
end
Basically, I pulled the applications into the apps array, then filtered them by the family_id field on each record.
I had to do this because the issue at the bottom of everything was that there were records present in apps that were themselves duplicates, submitted only a few days apart. Since I went on the assumption that the initial apps array would be all unique, I thought the duplicates that were included were due to the rest of the code not filtering correctly.
I then iterate through apps in apps.each do, looking for matches in comparison_apps; any duplicates found land in the previous_apps result inside the loop. Since this result is rebuilt on each pass, if it ever has more than 0 records in it, the app gets called out as already submitted. Otherwise, it goes to my CSV report.
Thanks for the help on this; it really got my brain thinking in another direction, which I needed. It also helped improve the code, even though the issue was at the very beginning.

Add a column with a value contained in a certain column's value

Really hope to get some help, as I have already racked my brain trying to achieve this.
I have a DataFrame:
   PagePath                  Source
0  /product/123/sometext     (Other)
1  /product/234?someutminfo  (Other)
2  /product/112?whatever     (Other)
I also have another dataframe with short product paths:
   Path          Other stuff
0  /product/123  Foo
1  /product/234  Bar
2  /product/345  Buzz
3  /product/456  Lol
What I need is to create a new column in the first df that will contain the matching short Path from the second df, if there is one.
So far I managed to do the following:
1) Created a series from the second df by subsetting it
2) Sort of iterated through the first df with the list from the second:
df1['newcol'] = df1['PagePath'].str.contains('|'.join(list_from_df2))
This gave me a column of True/False values based on whether a match was found.
I understand that what I need to do is iterate through each row of the first df, iterate through each value of the list, and return the value when a match is found.
If only I could write appropriate code for it. I really hope for your help.
Solved the problem myself:
First we define a function:
import re

def return_match(row):
    try:
        return re.search(r'/product/.+-\d+/', row).group(0)
    except AttributeError:  # re.search returned None, i.e. no match
        return 'Not a product'
Then we apply the function over the necessary column:
df['newcol'] = df['PagePath'].apply(return_match)
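An alternative sketch that matches directly against the short paths from the second dataframe rather than a hand-written pattern (df1, df2, and the column names are taken from the question; the 'Not a product' fallback follows the answer above):
import re

# build one alternation pattern from the short paths, escaping regex metacharacters
pattern = '(' + '|'.join(map(re.escape, df2['Path'])) + ')'

# extract the first matching short path; rows without a match get the fallback value
df1['newcol'] = df1['PagePath'].str.extract(pattern, expand=False).fillna('Not a product')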

Single search box Web2py, union usage

I am trying to create a single search box on my website.
First I split up the search input into multiple strings using split().
Then I loop over the strings produced by split(), and for every string I create a query. These queries are stored in a list.
In the next step I execute all of those queries and store the results (rows) in another list.
The next thing I want to do is union all of these results (rows). The final result should be the output of a query covering all the different keywords used in the search box.
This is my code:
def ajaxlivesearch():
    str = request.vars.values()[0]
    a = str.split()
    items = []
    q = []
    r = []
    for partialstr in a:
        q.append((db.profiel.sport.like('%' + partialstr + '%')) |
                 (db.profiel.speelsterkte.like('%' + partialstr + '%')) |
                 (db.profiel.plaats.like('%' + partialstr + '%')))
    for query in q:
        r.append(db(query).select(groupby=db.profiel.id))
    for results in r:
        for (i, row) in enumerate(results):
            items.append(DIV(A(B(row.id_user.first_name), NBSP(1), B(row.id_user.last_name), BR(),
                               I(row.sport), I(','), NBSP(1), I(row.speelsterkte), I(','), NBSP(1),
                               I(row.plaats), HR(),
                               _id="res%s" % i, _href=row.id_user,
                               _onclick="copyToBox($('#res%s').html())" % i),
                             _id="resultLiveSearch"))
    return TAG[''](*items)
My question is: How do I union the multiple results(rows)?
You can get the union of two Rows objects (removing duplicates) as follows:
rows_union = rows1 | rows2
However, it would be more efficient to get all the records in a single query. To simplify, you can also use the .contains method rather than using .like and wrapping each term in % wildcards.
fields = ['sport', 'speelsterkte', 'plaats']
query_terms = [db.profiel[f].contains(term) for f in fields for term in a]
query = reduce(lambda a, b: a | b, query_terms)
results = db(query).select()
Also, you are not using any aggregation functions, so it is not clear why you have specified the groupby argument (and in any case, each record has a unique id, so grouping would have no effect). Perhaps you instead meant orderby=db.profiel.id.
Finally, it is probably not a good idea to do request.vars.values()[0], as request.vars is a dictionary-like object, and the particular value of interest is not guaranteed to be the first item in .values(). Instead, just refer to the name of the particular variable (e.g., request.vars.keyword), which is also more efficient because you are extracting a single item rather than converting all values to a list.
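Putting those pieces together, a sketch of the reworked controller (keyword is an assumed name for the search-box variable):
def ajaxlivesearch():
    # read the named variable directly instead of request.vars.values()[0]
    terms = (request.vars.keyword or '').split()
    fields = ['sport', 'speelsterkte', 'plaats']
    query_terms = [db.profiel[f].contains(term) for f in fields for term in terms]
    if not query_terms:
        return ''
    query = reduce(lambda a, b: a | b, query_terms)
    return db(query).select()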

SparkR - extracting dataframe's array<int> for an R function

I have thousands of sensors. I need to partition the data (i.e. per sensor, per day), then submit each list of data points to an R algorithm. Using Spark, a simplified sample looks like:
// Spark (Scala)
val rddData = List(
  ("1:3", List(1,1,456,1,1,2,480,0,1,3,425,0)),
  ("1:4", List(1,4,437,1,1,5,490,0)),
  ("1:6", List(1,6,500,0,1,7,515,1,1,8,517,0,1,9,522,0,1,10,525,0)),
  ("1:11", List(1,11,610,1))
)

case class DataPoint(
  key: String,
  value: List[Int]) // 4-value pattern: sensorID:seq#, seq#, value, state
I convert this to a Parquet file and save it.
Loading the Parquet file in SparkR is no problem, and the schema says:
#SparkR
df <- read.df(sqlContext, filespec, "parquet")
schema(df)
StructType
|-name = "key", type = "StringType", nullable = TRUE
|-name = "value", type = "ArrayType(IntegerType,true)", nullable = TRUE
So in SparkR, I have a DataFrame where each record has all of the data I want (df$value). I want to extract that array into something R can consume, then mutate my original dataframe (df) with a new column holding the resultant array. Logically, something like results = function(df$value). Then I need to get results (for all rows) back into a SparkR dataframe for output.
How do I extract an array from the SparkR dataframe and then mutate the dataframe with the results?
Let the Spark data frame be df and the R data frame be df_r.
To convert the SparkR df to an R df, use:
df_r <- collect(df)
With the R data frame df_r, you can do all the computations you want to do in R.
Let's say you have the result in the column df_r$result.
Then, to convert back to a SparkR data frame, use:
# this is a new SparkR data frame, df_1
df_1 <- createDataFrame(sqlContext, df_r)
To add the result back to the SparkR data frame df, use:
# this adds df_1$result to a new column df$result
# note that the number of rows should be the same in df and `df_1`; if not, use a `join` operation
df$result <- df_1$result
Hope this solves your problem.
I had this problem too. The way I got around it was by adding a row index to the Spark DataFrame and then using explode inside a select statement. Make sure to select the index and then the column you want in your select statement. That will get you a "long" DataFrame. If each of the nested lists in the DataFrame column has the same amount of information in it (for example, if you are exploding a list-column of x,y coordinates), you would expect each row index in the long DataFrame to occur twice.
After doing the above, I typically do a groupBy(index) on the exploded DataFrame, filter where the n() of each index is not equal to the expected number of items in the list, and proceed with additional groupBy, merge, join, filter, etc. operations on the Spark DataFrame.
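A rough sketch of that index-then-explode approach (assuming a Spark 2.x SparkR where monotonically_increasing_id is available; df and the value column follow the question, other names are illustrative):
library(SparkR)

# tag each original row so exploded values can be traced back to it
df <- withColumn(df, "row_id", monotonically_increasing_id())

# one output row per array element: this is the "long" DataFrame
df_long <- select(df, df$row_id, alias(explode(df$value), "point"))

# sanity check: the number of rows per index should equal the original array length
counts <- agg(groupBy(df_long, "row_id"), n = count(df_long$point))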
There are some excellent guides on the Urban Institute's GitHub page. Good luck. -nate

create hash value for each row of data in dataframe in R

I am exploring how to compare two dataframes in R more efficiently, and I came up with hashing.
My plan is to create a hash for each row of data in two dataframes with the same columns, using digest from the digest package, and I suppose the hash should be the same for any two identical rows of data.
I tried to give a unique hash to each row of data, using the code below:
library(digest)

for (loop.ssi in 1:nrow(ssi.10q3.v1)) {
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}
But this is very slow.
Is my approach to comparing dataframes correct? If yes, any suggestions for speeding up the code above? Thanks.
UPDATE
I have updated the code as below:
library(plyr)

ssi.10q3.v1[, "uid"] <- 1:nrow(ssi.10q3.v1)
ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
                          c("uid"),
                          function(df) {
                            df[, "uid"] <- NULL
                            hash <- digest(as.character(df))
                            data.frame(hash = hash)
                          },
                          .progress = "text")
I self-generated a uid column for the "unique" purpose.
If I understand what you want correctly, digest will work directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid  = 1:nrow(ssi.10q3.v1),
                               hash = apply(ssi.10q3.v1, 1, digest))
I know this answer doesn't match the title of the question, but if you just want to see when rows are different you can do it directly:
rowSums(df2 == df1) == ncol(df1)
Assuming both data.frames have the same dimensions, that will evaluate to FALSE for every row that is not identical. If you need to test rownames as well, that can be managed separately and combined with the test of the contents, and similarly for colnames (and attributes, and strict tests on column types).
rowSums(df2 == df1) == ncol(df1) & rownames(df2) == rownames(df1)
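A tiny illustration of that row-wise comparison on made-up data:
df1 <- data.frame(a = 1:3, b = c("x", "y", "z"))
df2 <- data.frame(a = c(1, 9, 3), b = c("x", "y", "z"))

# TRUE where the whole row matches, FALSE where any cell differs
rowSums(df2 == df1) == ncol(df1)
# [1]  TRUE FALSE  TRUE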
