I have a cell array with 5 columns and about 5000 rows with string elements. E.g.:
1997 Charles House Materials Chemicals
The years go from 1997 to 2013, and the same entries repeat as the years pass. What I would like to get is a new cell array with the cases in which the combination of the second and third columns changes. E.g. original cell array:
1997 Charles House Materials Chemicals %initial
1997 Rita Office Financial Bank %initial
1998 Rita Office Financial Bank %no change
1999 Charles House Materials Chemicals %no change
2000 Charles Office Materials Chemicals %change in the 2nd column
2001 Charles Office Materials Chemicals %no change
2003 Rita Star Financial Bank %change in the 2nd column
2005 Charles Castle Materials Chemicals %change in the 2nd column
2010 Rita Moon Financial Bank %change in the 2nd column
I would like for my new array to give me the first/original row and the cases in which you
observe a change. E.g. output:
1997 Charles House Materials Chemicals
1997 Rita Office Financial Bank
2000 Charles Office Materials Chemicals
2003 Rita Star Financial Bank
2005 Charles Castle Materials Chemicals
2010 Rita Moon Financial Bank
My problem is mainly related to the fact that I am dealing with strings. I would appreciate any help. Thanks a lot.
Code
%// Assuming input_cellarr is the input cell array
[~,~,col2id] = unique(input_cellarr(:,2),'stable')
[~,~,col3id] = unique(input_cellarr(:,3),'stable')
[~,unqrows] = unique([col2id col3id],'rows','stable')
out = input_cellarr(unqrows,:) %// Desired output
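For readers who want to cross-check the idea outside MATLAB, here is a minimal Python sketch of the same "first occurrence of each column-2/column-3 combination" logic. The function name `first_occurrences` and the tuple layout are my own assumptions, not part of the answer above; the sample rows are copied from the question.

```python
# Rows are (year, name, place, sector, industry), copied from the question.
rows = [
    ("1997", "Charles", "House", "Materials", "Chemicals"),
    ("1997", "Rita", "Office", "Financial", "Bank"),
    ("1998", "Rita", "Office", "Financial", "Bank"),
    ("1999", "Charles", "House", "Materials", "Chemicals"),
    ("2000", "Charles", "Office", "Materials", "Chemicals"),
    ("2001", "Charles", "Office", "Materials", "Chemicals"),
    ("2003", "Rita", "Star", "Financial", "Bank"),
    ("2005", "Charles", "Castle", "Materials", "Chemicals"),
    ("2010", "Rita", "Moon", "Financial", "Bank"),
]


def first_occurrences(rows, key_cols=(1, 2)):
    """Keep the first row seen for each distinct combination of key_cols.

    key_cols uses 0-based indices, so (1, 2) means the 2nd and 3rd columns.
    """
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


result = first_occurrences(rows)
for row in result:
    print(" ".join(row))
```

Note that, like the `unique(...,'rows','stable')` approach, this only reports a combination the first time it appears, so a person moving back to an earlier place would not be reported again.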
First of all, I filtered out the first column because the year does not matter. Then you have to compare each line with the one before it. This is done using X(1:end-1,2:end) and X(2:end,2:end). Only if the two lines are equal in every compared column has nothing changed.
%a row counts as changed when it is NOT equal to the previous row in every compared column
changes=~all(strcmpi(X(1:end-1,2:end),X(2:end,2:end)),2)
%at this point we are missing the first line
changes=[true;changes]
%finally use logical indexing to select the relevant data.
result=X(changes,:);
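A rough Python sketch of this adjacent-row comparison, keeping the first row plus every row that differs from its predecessor in the non-year columns (case-insensitively, as with strcmpi). The function name `changed_rows` is hypothetical. This sketch assumes the rows for each person are adjacent, so the sample from the question is regrouped by name here; with the interleaved ordering shown in the question, every switch between Charles and Rita would also register as a change.

```python
# Sample rows from the question, regrouped by name (an assumption of this sketch).
rows = [
    ("1997", "Charles", "House", "Materials", "Chemicals"),
    ("1999", "Charles", "House", "Materials", "Chemicals"),
    ("2000", "Charles", "Office", "Materials", "Chemicals"),
    ("2001", "Charles", "Office", "Materials", "Chemicals"),
    ("2005", "Charles", "Castle", "Materials", "Chemicals"),
    ("1997", "Rita", "Office", "Financial", "Bank"),
    ("1998", "Rita", "Office", "Financial", "Bank"),
    ("2003", "Rita", "Star", "Financial", "Bank"),
    ("2010", "Rita", "Moon", "Financial", "Bank"),
]


def changed_rows(rows):
    """Return the first row and every row whose non-year columns differ
    (ignoring case) from the row directly before it."""
    kept = []
    previous_key = None
    for index, row in enumerate(rows):
        key = tuple(value.lower() for value in row[1:])  # skip the year column
        if index == 0 or key != previous_key:
            kept.append(row)
        previous_key = key
    return kept


result = changed_rows(rows)
for row in result:
    print(" ".join(row))
```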
I have a list of guests that come to my events. The list contains 10 years' worth of events and guests. I want to know how many guests are brand new each year, meaning they have never attended my events before. How do I calculate this in a calculated field?
For example, John attended my event in 2011, 2012, and 2013. This means John was my new guest in 2011, but not 2012 or 2013. Therefore, the count of new guests in 2011 increments by one.
It is safe to assume that John uses the same email each time he registers for my event. Therefore, my resulting table should look like this:
Email              First Year of Attendance
johndoe#live.com   2011
0) Summary
Use EITHER #1 OR #2
#1: Use if two fields are required (Email and First_Year)
#2: Use if three fields are required (Email, Year and First_Year)
1) Aggregate Year by MIN
If the goal is only to show the 2 fields in the Table, then simply add the Year field as a Metric and aggregate by MIN:
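To make the MIN aggregation concrete outside the reporting tool, here is a small Python sketch that computes each guest's first year and then counts new guests per year. It illustrates the logic only and is not Data Studio syntax; the second email address and the variable names are hypothetical.

```python
from collections import Counter

# (email, year attended); the second guest is a made-up example.
attendance = [
    ("johndoe#live.com", 2011),
    ("johndoe#live.com", 2012),
    ("johndoe#live.com", 2013),
    ("janedoe#live.com", 2012),  # hypothetical guest
]

# Equivalent of grouping by Email and aggregating Year by MIN.
first_year = {}
for email, year in attendance:
    first_year[email] = min(year, first_year.get(email, year))

# A guest is "new" only in their first year of attendance.
new_guests_per_year = Counter(first_year.values())

print(first_year)
print(dict(new_guests_per_year))
```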
2) Data Blending
To display all three fields in a single Table, one approach is using a Self Data Blend; the below does the trick where Email represents the field with Emails (such as johndoe#live.com) and Year the field with Year values, which can be a Number field (for example 2011) or a Date field (such as 01 Jan 2011):
2.1) Data Blending
Data Source 1
Join Key: Email
Dimension: Year
Data Source 2
Join Key: Email
Metric: Year; Rename: First_Year; Aggregation: MIN
2.2) Table
Data Source: Blended Data Source
Dimension 1: Email
Dimension 2: First_Year
Dimension 3: Year
Editable Google Data Studio Report (Embedded Google Sheets Data Source) and a GIF to elaborate:
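The self blend behaves much like joining the raw rows to a per-email MIN(Year) table on the Email key. A minimal Python sketch of that join, assuming rows are (email, year) pairs; the names `records`, `first_year` and `blended` are illustrative only:

```python
# "Data Source 1": the raw rows, one per attendance.
records = [
    ("johndoe#live.com", 2011),
    ("johndoe#live.com", 2012),
    ("johndoe#live.com", 2013),
]

# "Data Source 2": one value per email, Year aggregated by MIN.
first_year = {}
for email, year in records:
    first_year[email] = min(year, first_year.get(email, year))

# Join on Email: every raw row picks up the first year for its email.
blended = [(email, first_year[email], year) for email, year in records]

for email, first, year in blended:
    print(email, first, year)
```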
Currently I have an Excel table that looks like this
A B C D E F G
ID NAME DATE ITEM 2020 3
1234 Alex 09-20-2020 Carrot 2019 2
1234 Alex 09-20-2020 Onion
1234 Alex 09-20-2019 Carrot
1234 Alex 09-20-2019 Mushroom
1234 Alex 09-20-2020 Pasta
1345 Morgan 09-20-2020 Pasta
1345 Morgan 09-20-2020 Tomato Sauce
1145 Jayson 09-20-2020 Tomato Sauce
1145 Jayson 09-20-2020 Cream Sauce
1345 Morgan 09-20-2019 Pasta
1345 Morgan 09-20-2019 Tomato Sauce
I want to be able to count the unique customers for each year using Excel functions, so that the workbook can be transferred to different computers without setting up custom functions.
The process can currently be done in Excel without functions by: adding a filter to each column, filtering to show only the intended year, using Remove Duplicates to remove duplicates in NAME, and finally counting the rows (giving the results seen in G2 & G3). However, I want to be able to do that through Excel functions. So far, I am able to count unique values through
{=SUM(IF(FREQUENCY(IF(LEN(B2:B12)>0,MATCH(B2:B12,B2:B12,0),""),IF(LEN(B2:B12)>0,MATCH(B2:B12,B2:B12,0),""))>0,1))}
Additionally, I am able to use SUMPRODUCT() to count an array with multiple conditions, so for now I have combined the above formula with
SUMPRODUCT((YEAR(C2:C12)=G1)+0)
My initial idea was to put the first function inside SUMPRODUCT(), since the first function also produces an array that it could count. However, that did not work, as it did not count the unique values corresponding to the year.
My question is whether there is something like a grouping function, so that I can take the unique values within a year without transforming the data (through filters or deletion of duplicates). My current understanding of SUMPRODUCT() is that it will only look for unique values in the entire column, not within the range given for the first array.
You have numeric IDs, which you should make use of. If you have Excel O365, use this in G1:
=COUNT(UNIQUE(FILTER(A$2:A$12,YEAR(C$2:C$12)=F1)))
With older versions, use this CSE-entered formula:
=SUM(--(FREQUENCY(IF(YEAR(C$2:C$12)=F1,A$2:A$12),A$2:A$12)>0))
And drag down.
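As a sanity check on what both formulas are meant to compute, here is a short Python sketch that counts distinct IDs per year over the sample table from the question. It assumes the dates are MM-DD-YYYY text; in Excel they would be real dates, and the variable names are mine.

```python
from collections import defaultdict
from datetime import datetime

# (ID, NAME, DATE, ITEM) rows copied from the question's table.
orders = [
    (1234, "Alex", "09-20-2020", "Carrot"),
    (1234, "Alex", "09-20-2020", "Onion"),
    (1234, "Alex", "09-20-2019", "Carrot"),
    (1234, "Alex", "09-20-2019", "Mushroom"),
    (1234, "Alex", "09-20-2020", "Pasta"),
    (1345, "Morgan", "09-20-2020", "Pasta"),
    (1345, "Morgan", "09-20-2020", "Tomato Sauce"),
    (1145, "Jayson", "09-20-2020", "Tomato Sauce"),
    (1145, "Jayson", "09-20-2020", "Cream Sauce"),
    (1345, "Morgan", "09-20-2019", "Pasta"),
    (1345, "Morgan", "09-20-2019", "Tomato Sauce"),
]

# Collect the set of distinct IDs seen in each year (the FILTER + UNIQUE step).
ids_by_year = defaultdict(set)
for customer_id, _name, date_text, _item in orders:
    year = datetime.strptime(date_text, "%m-%d-%Y").year
    ids_by_year[year].add(customer_id)

# Count the distinct IDs per year (the COUNT step).
unique_customers = {year: len(ids) for year, ids in ids_by_year.items()}
print(unique_customers)
```

For this sample the counts agree with the manually obtained values shown in the question (3 for 2020, 2 for 2019).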
I'm quite new to BI and DB design, and there is a point I do not understand well.
I'm trying to import French census data, where I have the population for each city. For each city, I have the population under different age classifications that cannot really be related to each other.
For instance, let's say that one classification is 00 to 20 years old, 21 to 59, and 60+.
The other is far more precise: 00 to 02, 03 to 05, etc., but its bounds never line up with those of the first classification: I don't have 15 to 20, but 18 to 22, for example.
So those 2 classifications are incompatible. How can I use them in my fact table? Should I use 2 fact tables and 2 cubes? Should I use one fact table and 2 dimensions for 1 cube? But in that case, I will double count facts when I sum to get the total population for a city, won't I?
This is national census data, and national classifications, so changing that or estimating population to mix those classifications is not an option. And to be clear, one row doesn't relate to one person, but to one city. My facts are not individuals but cities' populations.
So this table is like :
Line 1 : One city - one amount of population - one code for dim age (ex. 00 to 19 yo) of this population - code (m/f) for the dim gender of that population - date of the census
Line 2 : Same city - one amount of population - one code for dim age (ex. 20 to 34) of this population - code (m/f) for the dim gender - date of the census
And so it goes for a lot of cities, both gender, and multiple years.
Same
I hope this question is clear enough, as English is not my native language and I'm quite new to DB and BI!
Thanks for helping me with that.
One possible solution using a single fact table and two dimensions for the age ranges:
1 - Categorical range based on the broadest census, for example:
Young 0-20
Adult 21-59
Senior 60+
You could then link the other census to this dimension with approximate values, for example 18-22 could be Young.
2 - Original age range. This dimension could be used for precise age ranges when you report on a single city. It can also help you evaluate the impact of the overlapping bounds (e.g. how many rows are in the young / 18-22 range?)
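One way to picture the approximate link between the two dimensions is a small mapping rule. The Python sketch below assigns an original age range to the broad category that contains its midpoint. The midpoint rule and the function name `broad_category` are purely my assumption for illustration; the answer only says that 18-22 "could be Young".

```python
def broad_category(low, high=None):
    """Map an original age range to Young / Adult / Senior by its midpoint.

    high=None stands for an open-ended range such as 60+.
    The midpoint rule is a hypothetical choice, not a census convention.
    """
    midpoint = low if high is None else (low + high) / 2
    if midpoint <= 20:
        return "Young"
    if midpoint <= 59:
        return "Adult"
    return "Senior"


# Each fact row would then carry both keys: the original range and the broad one.
for low, high in [(0, 2), (3, 5), (18, 22), (21, 59), (60, None)]:
    label = f"{low}+" if high is None else f"{low}-{high}"
    print(label, "->", broad_category(low, high))
```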
You can create one dimension as below:
young 1-20
adult 21-59
senior 60+
Classification is
young city 1 : 1-20
young city 2 : 4-23
id field1 field2 field3 field4 .......
1 1 year young_city_1 other .......
2 2 year young_city_1 other .......
3 3 year young_city_1 other .......
4 4 year young_city_1 young_city_2 .......
Now you can report from any item and with any division
I hope this helps.
I have unbalanced panel data and need to exclude observations (in t) for which the income changed during the year before (t-1), while keeping other observations of these people. Thus, if a change in income happens in year t, then year t should be dropped (for that person).
clear
input year id income
2003 513 1500
2003 517 1600
2003 518 1400
2004 513 1500
2004 517 1600
2004 518 1400
2005 517 1600
2005 513 1700
2005 518 1400
2006 513 1700
2006 517 1800
2006 518 1400
2007 513 1700
2007 517 1600
2007 518 1400
2008 513 1700
2008 517 1600
2008 518 1400
end
xtset id year
xtline income, overlay
To illustrate what's going on, I add an xtline plot, which follows the income per person over the years. ID=518 is the perfect non-changing case (keep all obs). ID=513 has a one-time jump (drop year 2005 for that person). ID=517 has something like a peak, perhaps a one-time measurement error (drop 2006 and 2007).
I think there should be some form of loop. Initialize the first value for each person (because this cannot be compared), say t0. Then compare t1-t0, drop if changed, else compare t2-t1 etc. Because the data is unbalanced, there might be missing year-observations. Thanks for any advice.
Update/Goal: The purpose is to prepare the data for a fixed effects regression analysis. There is another variable, reported for the entire "last year". Income, however, is reported at the interview date (point in time). I need to get close to something like "last year income" to relate it to this variable. The procedure is suggested and followed by several publications. I am trying to replicate and understand it.
Solution:
bysort id (year) : drop if income != income[_n-1] & _n > 1
bysort id (year) : gen byte flag = (income != income[_n-1]) if _n > 1
list, sepby(id)
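For anyone who wants to see the row-by-row logic spelled out, here is a Python sketch of what the `drop` line does: within each id, sorted by year, an observation is dropped when its income differs from the immediately preceding observation in the original data. The function name `drop_changed_years` is hypothetical; the data is the sample from the question.

```python
from itertools import groupby

# (year, id, income) rows copied from the question's input block.
panel = [
    (2003, 513, 1500), (2003, 517, 1600), (2003, 518, 1400),
    (2004, 513, 1500), (2004, 517, 1600), (2004, 518, 1400),
    (2005, 517, 1600), (2005, 513, 1700), (2005, 518, 1400),
    (2006, 513, 1700), (2006, 517, 1800), (2006, 518, 1400),
    (2007, 513, 1700), (2007, 517, 1600), (2007, 518, 1400),
    (2008, 513, 1700), (2008, 517, 1600), (2008, 518, 1400),
]


def drop_changed_years(panel):
    """Keep each person's first observation and every later observation
    whose income equals that of the observation directly before it."""
    ordered = sorted(panel, key=lambda row: (row[1], row[0]))  # by id, then year
    kept = []
    for _person, group in groupby(ordered, key=lambda row: row[1]):
        previous_income = None
        for position, (year, person, income) in enumerate(group):
            if position == 0 or income == previous_income:
                kept.append((year, person, income))
            # Compare against the original predecessor, even if it was dropped.
            previous_income = income
    return kept


kept = drop_changed_years(panel)
dropped = sorted(set(panel) - set(kept))
print(dropped)
```

Because the data is unbalanced, "the observation before" means the previous available year for that person, not necessarily t-1.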
The procedure is VERY IFFY methodologically. There is no need to prepare for the fixed effects analysis other than xtsetting the data; and there rarely is any excuse to create missing data... let alone do so to squeeze the data into the limits of what (other) researchers know about statistics and econometrics. I understand that this is a replication study, but whatever you do with your replication and wherever you present it, you need to point out that the original authors did not have much clue about regression to begin with. Don't try too hard to understand it.
Suppose I have the following data table in Excel
Company Amount Text
Oracle $3,400 330 Richard ERP
Walmart $750 348 Mary ERP
Amazon $6,880 xxxx Loretta ERP
Rexel $865 0000 Mike ERP
Toyota $11,048 330 Richard ERP
I want to go through each item in the "Text" column, search the item against the following range of names:
Mary
Mike
Janine
Susan
Richard
Jerry
Loretta
and return the name in the "Person" column, if found. For example:
Company Amount Text Person
Oracle $3,400 330 Richard ERP Richard
Walmart $750 348 Mary ERP Mary
Amazon $6,880 xxxx Loretta ERP Loretta
Rexel $865 0000 Mike ERP Mike
Toyota $11,048 330 Richard ERP Richard
I've tried the following in Excel which works:
=IF(N2="","",
IF(ISNUMBER(SEARCH(Sheet2!$A$1,N2)),Sheet2!$A$1,
IF(ISNUMBER(SEARCH(Sheet2!$A$2,N2)),Sheet2!$A$2,
IF(ISNUMBER(SEARCH(Sheet2!$A$3,N2)),Sheet2!$A$3,
....
Where $A$1:$A$133 is my range and N2 is the "Text" column values; however, that is a lot of nested code and apparently Excel has a limit on the number of nested IF statements you can have.
Is there a simpler solution (arrays? VBA?)
Thanks!
Use the following formula:
=IFERROR(INDEX(Sheet2!A:A,AGGREGATE(15,6,ROW(Sheet2!$A$1:INDEX(Sheet2!A:A,MATCH("zzz",Sheet2!A:A)))/(ISNUMBER(SEARCH(Sheet2!$A$1:INDEX(Sheet2!A:A,MATCH("zzz",Sheet2!A:A)),N2))),1)),"")
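The formula effectively returns the first name in the list (lowest row number) that occurs anywhere in the text, or an empty string if none does. A minimal Python sketch of that lookup, with a hypothetical function name `find_person` and case-insensitive matching to mirror SEARCH:

```python
# The list of names from the question, in sheet order.
names = ["Mary", "Mike", "Janine", "Susan", "Richard", "Jerry", "Loretta"]


def find_person(text, names):
    """Return the first name in `names` that occurs in `text`
    (case-insensitive), or "" when no name is found."""
    lowered = text.lower()
    for name in names:
        if name and name.lower() in lowered:
            return name
    return ""


for text in ["330 Richard ERP", "348 Mary ERP", "xxxx Loretta ERP", "0000 Mike ERP"]:
    print(text, "->", find_person(text, names))
```

As with the formula, this is a plain substring test, so a short name that happens to appear inside a longer word would also match.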