Add a column with a value contained in a certain column's value - loops

I really hope to get some help, as I have already racked my brain trying to achieve this.
I have a DataFrame:
   PagePath                  Source
0  /product/123/sometext     (Other)
1  /product/234?someutminfo  (Other)
2  /product/112?whatever     (Other)
I also have another dataframe with short product paths:
   Path          Other stuff
0  /product/123  Foo
1  /product/234  Bar
2  /product/345  Buzz
3  /product/456  Lol
What I need is to create a new column in the first df, matched against the second df, so that it contains the short Path whenever there is one.
So far I managed to do the following:
1) Created a series from the second df by subsetting it
2) Sort of iterated through the first df with a list from the second:
df1['newcol'] = df1['PagePath'].str.contains('|'.join(list_from_df2))
This gave me a column with True/False values based on whether a match was found.
I understand that what I need to do is iterate through each row of the first df, iterate through each value of the list, and return the value when a match is found.
If only I could write appropriate code for it. I really hope for your help.

Solved the problem myself:
First we define a function:
import re

def return_match(row):
    # Return the matched product path, or a fallback when the pattern is absent.
    try:
        return re.search(r'/product/.+-\d+/', row).group(0)
    except AttributeError:  # re.search returned None: no match
        return 'Not a product'
Then we apply the function over the necessary column:
df['newcol'] = df['PagePath'].apply(return_match)
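A vectorized alternative is str.extract, which skips the row-wise apply. This is a minimal sketch, assuming the short paths in the second df should be matched as literal substrings; the frame and column names mirror the question:

import re
import pandas as pd

# Toy frames mirroring the question's sample data.
df1 = pd.DataFrame({'PagePath': ['/product/123/sometext',
                                 '/product/234?someutminfo',
                                 '/product/112?whatever'],
                    'Source': ['(Other)'] * 3})
df2 = pd.DataFrame({'Path': ['/product/123', '/product/234',
                             '/product/345', '/product/456'],
                    'Other stuff': ['Foo', 'Bar', 'Buzz', 'Lol']})

# Build one alternation pattern from the short paths, escaping them as literals,
# and extract the first match per row; unmatched rows become NaN.
pattern = '(' + '|'.join(re.escape(p) for p in df2['Path']) + ')'
df1['newcol'] = df1['PagePath'].str.extract(pattern, expand=False).fillna('Not a product')

Because the pattern is built from df2 itself, the new column only ever contains values that exist in df2['Path'], which makes a later merge against the second df straightforward.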

Related

Matching and replacing a selection of data from two different dataframes

(First time posting, so please bear with me.) I have two different dataframes, one of which contains a column of replacement data for a selection of data within the first dataframe.
# dataframe 1
df <- data.frame(site = rep(1:4, 3), landings = rep("val", 12),
                 harbour = c("a","b","c","d","e","f","g","h","i","j","k","l"))
# dataframe 2
new_site4 <- data.frame(harbour = c("a","b","c","d","e","f","g","h","i","j","k","l"),
                        sub_site = c("x","x","y","x","y","y","y","x","y","x","y","y"))
I want to replace the "site" in dataframe 1 with the "sub_site" in dataframe 2 based on the match of "harbour"; however, I only need to do it for records with site 4.
Is there a neat way to select only site 4 and then replace the site number with the sub_site, ideally without merging or creating a whole new dataframe? My real dataset is large, but the key is small, as it only refers to the small selection of the data that needs the sub_site added.
I tried using match() on my main dataset, but for some reason it only matched some of the required data, not all of it, and this code won't work on my sample data either.
df$site[match(df$harbour, new_site4$harbour)] <- new_site4$sub_site[match(df$harbour, df$harbour)]
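No R answer is recorded here, but for comparison, the same selective replacement reads naturally in pandas. This is a minimal sketch with the question's toy data; the frame and column names merely mirror the R code:

import pandas as pd

# Toy data mirroring the R example.
df = pd.DataFrame({'site': [1, 2, 3, 4] * 3,
                   'landings': ['val'] * 12,
                   'harbour': list('abcdefghijkl')})
new_site4 = pd.DataFrame({'harbour': list('abcdefghijkl'),
                          'sub_site': list('xxyxyyyxyxyy')})

# The replacement values are strings, so widen the column dtype first.
df['site'] = df['site'].astype(object)

# Build a harbour -> sub_site lookup and overwrite only the site-4 rows.
lookup = new_site4.set_index('harbour')['sub_site']
mask = df['site'] == 4
df.loc[mask, 'site'] = df.loc[mask, 'harbour'].map(lookup)

No merge and no new dataframe: only the masked rows are touched, which matches the asker's constraint.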

Array formula is not working for subsequent entries - filling the rest of the entries based on the first entry only

I'm using an array formula in a Google Sheet as a response to a Google Form. The form has an F field with a person's first name (say, thomas), a G field with the last name (say, mathew), and an E field with a custom email domain, say "test.org". The expected result is "thomasm_#test.org".
I'm applying this array formula to one of my header fields, which has the header name "UserID".
ArrayFormula(IFS(ROW(A:A)=1, "UserID", LEN(A:A)=0, IFERROR(1/0), LEN(A:A)>0,LOWER(CONCATENATE(SUBSTITUTE(F2," ",""),LEFT(G2,1),"_",E2))))
When I apply this formula, whatever entry I made in the second row is propagated to all subsequent rows; subsequent entries are not reflected. Please help.
Hi @richtom, welcome to this community!
A few reasons:
The CONCATENATE function will not work inside an ARRAYFORMULA the way you expect, so avoid it.
Another issue is here: SUBSTITUTE(F2," ",""),LEFT(G2,1),"_",E2). You must use the same range size as at the beginning, i.e. A:A... F:F... G:G. A reference like F2 will use F2's value for all rows.
So, I recommend the following:
=arrayformula(if(row(F:F)=row(),"UserID",if(F:F="","",if(G:G="","",if(E:E="","",lower(substitute(F:F & left(G:G,1) & "_#" & E:E," ","")))))))

Merge/concatenate CSV-imported dataframes and delete duplicates

I am following up on my previous question.
I have sorted out a loop to import CSVs, concatenate the data, and remove duplicates.
import glob
import pandas as pd

files = glob.glob('./A08_csv/A08_B1_T*.csv')
dfs = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True) for fp in files]
df = pd.concat(dfs)
df_purged = df.drop_duplicates(inplace=True)
print(df_purged)
However, df.drop_duplicates(inplace=True) does not work (surely I am missing something) and the print shows None. How can I specify that duplicates should be checked by index? Adding the column name does not seem to work.
Also, how can I turn this loop into a function, so I can apply the same import recursively to CSVs with different filenames (i.e. something that would work for A08_B1_T*.csv (bedroom) and for A08_KI_T*.csv (kitchen), etc.)?
Do you understand the inplace=True option?
If you do it in place, it means you modify df itself, so don't assign the result to df_purged.
You have two solutions here: either you want to keep the 'unpurged' dataframe, and you do:
df_purged = df.drop_duplicates()
Or you don't care about keeping it, and you do:
df.drop_duplicates(inplace = True)
With the first option your result dataframe is df_purged; with the second it is df itself, which is purged since you performed the operation in place.
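To see this concretely, here is a minimal sketch with a hypothetical two-row frame; inplace operations return None, which is why the original print showed nothing:

import pandas as pd

df = pd.DataFrame({'a': [1, 1]})
result = df.drop_duplicates(inplace=True)
print(result)  # None: the inplace call mutates df and returns None
print(df)      # the single remaining row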
That being said, if you want to purge on your index and don't need to keep it, you can reset_index and then drop_duplicates like this:
df_purged = df.reset_index().drop_duplicates(['index']).drop('index', axis=1)
And if you need to keep the index (modulo the dropped lines):
df_purged = df.reset_index().drop_duplicates(['index']).set_index('index')
df_purged.index.name = None
(Note that, once again, clearing the index name is only for aesthetics.)
Would this help?
df.drop_duplicates(['col_name'])
Here is a solution that turns the index into dataframe columns, drops duplicates on those, then restores them as the index:
df = df.reset_index().drop_duplicates(subset=['Date', 'Time'], keep='last').set_index(['Date', 'Time'])
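As for the second part of the question (reusing the loop for other filename patterns), a minimal sketch could wrap the glob pattern in a function. The name load_room is illustrative, and Index.duplicated is used here to deduplicate by index without resetting it:

import glob
import pandas as pd

def load_room(pattern):
    # Read all CSVs matching the glob pattern, concatenate them,
    # and drop rows whose index value has already been seen.
    files = glob.glob(pattern)
    dfs = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True)
           for fp in files]
    df = pd.concat(dfs)
    return df[~df.index.duplicated(keep='first')]

bedroom = load_room('./A08_csv/A08_B1_T*.csv')
kitchen = load_room('./A08_csv/A08_KI_T*.csv')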

D3.js unknown number of columns and rows

I'm currently creating a chart (the data comes from an external CSV file), but I don't know the number of columns and rows beforehand. Could you maybe point me in the right direction as to where I could find some help (or some examples) with this issue?
Thank you
d3.csv can help you here:
d3.csv('myCSVFile.csv', function(data) {
    // the 'data' argument will be an array of objects, one object for each row, so...
    var numberOfRows = data.length, // we can easily get the number of rows (excluding the title row)
        columns = Object.keys(data[0]), // taking the first row object and getting an array of its keys
        numberOfColumns = columns.length; // allows us to get the number of columns
});
Note that this method assumes that the first row (and only the first row) of your spreadsheet is column titles.
In addition to Tom P's advice, it's worth noting that version 4 of D3 introduced a columns property, which you can use to create an array of column headers (i.e. the dataset's 'keys').
This is useful because (a) it's simpler code and (b) the headers in the array are in the same order that they appear in the dataset.
So, for the above dataset:
headers = data.columns
... creates the same array as:
headers = Object.keys(data[0])
... but the array of column names is in a predictable order.

Create a hash value for each row of data in a dataframe in R

I am exploring how to compare two dataframes in R more efficiently, and I came up with hashing.
My plan is to create a hash for each row of data in two dataframes with the same columns, using digest from the digest package; the hash should be the same for any two identical rows of data.
I tried to give a unique hash to each row of data, using the code below:
for (loop.ssi in 1:nrow(ssi.10q3.v1)) {
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}
But this is very slow.
Is my approach to comparing dataframes correct? If yes, any suggestions for speeding up the code above? Thanks.
UPDATE
I have updated the code as below:
ssi.10q3.v1[, "uid"] <- 1:nrow(ssi.10q3.v1)
ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
                          c("uid"),
                          function(df) {
                            df[, "uid"] <- NULL
                            hash <- digest(as.character(df))
                            data.frame(hash = hash)
                          },
                          .progress = "text")
I generated the uid column myself for the "unique" purpose.
If I understand what you want properly, digest will work directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
I know this answer doesn't match the title of the question, but if you just want to see where rows differ, you can do it directly:
rowSums(df2 == df1) == ncol(df1)
Assuming both data.frames have the same dimensions, that will evaluate to FALSE for every row that is not identical. If you need to test rownames as well, that can be managed separately and combined with the test of contents, and similarly for colnames (and attributes, and strict tests on column types).
rowSums(df2 == df1) == ncol(df1) & rownames(df2) == rownames(df1)
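For anyone attacking the same row-comparison problem from pandas rather than R, a minimal sketch using the library's built-in row hashing (pd.util.hash_pandas_object; the toy frames here are hypothetical):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [1, 2, 9], 'b': ['x', 'y', 'z']})

# One unsigned 64-bit hash per row; identical rows hash identically.
h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)

print((h1 == h2).tolist())  # [True, True, False]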
