I have environmental data arranged with a separate row for every parameter, and I would like a single row per sampling event so I can analyze it.
It looks like this (with ~20 parameter types and almost a million rows in reality):
ID Location Parameter Result
a1 x1 DO 7.3
a1 x1 pH 8.1
a1 x1 Salinity 32.7
b2 x2 DO 7.6
b2 x2 pH 8.3
b2 x2 Salinity 31.2
I would like it to look like
ID Location DO pH Salinity
a1 x1 7.3 8.1 32.7
b2 x2 7.6 8.3 31.2
However, certain parameters were also measured at different depths within each site visit. I am still working out how to deal with this aspect of the data conceptually, but obviously it is hard to get a sense of what is important to analyze without being able to visualize it well. There are continuous depth measurements (e.g. 0.112, 0.527, 1.244, 5.891, the depth in meters at which the sample was collected) and a depth code I could sort by (e.g. Surface, Half Meter, Meter, Bottom). I think just accepting the codes will be fine, especially since the bottom depth is recorded as its own parameter row, and that is the only one that should really change much.
I see my options as either 1) accepting that some data will not be in the same row and (I believe) will be unavailable to analyze together in ArcGIS, which is my end-goal program for much of this data once it is cleaned (different parameter types are only measured at certain depths); if I do this I might just add the depth code to the unique ID, which is currently a text string of site plus date; or 2) somehow coding the new columns, perhaps by combining the depth code with the parameter names. So for location a1, sampling event xxxx, I would have a row with results for, say, salinityS, salinityM, salinityB, pHS, pHM, pHB, and DO. Hopefully I'm conceptualizing that clearly, but suggestions are very welcome.
Also, there is a time stamp for each parameter. They are all within a negligible window, so I would like to preserve just the first one for each sampling ID: for example, salinity at 11:37 and pH at 11:38 would give an output row showing 11:37 for that sampling ID.
Any advice would be much appreciated; I have spent too long banging my head against the wall looking for an efficient way to analyze this massive dataset, which is organized in a way far from my preferred format.
With spread from tidyr:
library(tidyr)
spread(df, Parameter, Result)
Returns:
ID Location DO pH Salinity
1 a1 x1 7.3 8.1 32.7
2 b2 x2 7.6 8.3 31.2
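For the depth codes and time stamps described in the question, a minimal sketch along these lines may help (the DepthCode and Time column names are assumptions about your long table, and in current tidyr pivot_wider() supersedes spread()):

library(dplyr)
library(tidyr)

df %>%
  group_by(ID) %>%
  mutate(Time = first(Time)) %>%   # keep one time stamp per sampling event (assumes rows are in time order)
  ungroup() %>%
  unite("Parameter", Parameter, DepthCode) %>%   # e.g. "Salinity_Surface", "pH_Bottom"
  pivot_wider(names_from = Parameter, values_from = Result)

Each parameter/depth combination then becomes its own column, which is option 2 from the question.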
I have a global IGBP land use dataset in which the land cover consists of forest cover (depicted with a '1') and non-forest cover (depicted with a '0'); hence, each land grid cell has either the value 1 or 0.
This dataset has a spatial resolution of approximately 1 km at the equator; however, I am going to regrid it to a spatial resolution of approximately 100 km at the equator. For this new grid resolution I want to calculate the fraction of forest cover (i.e. the fraction of 1's) for each grid cell, but I am not sure how this can be done without GIS. Is there a way to do this with cdo remapping, or perhaps with Python?
Thank you in advance!
If you want to translate to a new grid that is an integer multiple of the original, then you can do
cdo gridboxmean,n,m in.nc out.nc
where n and m are the numbers of points to average over in the lon and lat directions.
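For going from roughly 1 km to roughly 100 km, n and m would both be on the order of 100; assuming about 100 source cells per target cell in each direction (adjust to your actual grid), that would be

cdo gridboxmean,100,100 in.nc out.nc

and the mean of a 0/1 field is exactly the forest fraction you are after.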
Otherwise you can interpolate using conservative remapping, which means that you don't need to worry if the new grid is not a multiple of the old:
cdo remapcon,new_grid_specification in.nc out.nc
Note that in the latter case, however, the result is only first-order accurate. There is also a slightly slower second-order conservative remapping available using the command remapcon2. The paper describing the two implemented conservative remapping methods is Jones (1999). For further info on remapping you can also see my video guide.
Thanks to Robert for also pointing out that you may need to convert to float, which would mean using the option
cdo -b f32
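Putting the pieces together, a minimal sketch for the remapcon route might look like this; the file name and grid numbers below are illustrative assumptions for a global 1-degree (~100 km at the equator) lon-lat target grid, written in CDO's grid description format:

gridtype = lonlat
xsize    = 360
ysize    = 180
xfirst   = -179.5
xinc     = 1.0
yfirst   = -89.5
yinc     = 1.0

Save that as, say, target_grid.txt and run

cdo -b f32 remapcon,target_grid.txt in.nc out.nc

Because the input field is 0/1, the conservatively remapped (area-weighted) value in each coarse cell is the forest fraction.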
I have an sf polygon dataframe with multiple series (T1, T2, T3, all on the same scale: they're observations at different time points). I can plot, say, T1 with
ggplot(map) + geom_sf(aes(fill = T1))
What I'd like to do is plot all three (T1, T2 and T3) as facets (separate maps) on the same drawing. I'm sure there's a way to do this, but I can't find it. Can anyone tell me how? Thanks!
ADDED: Two additional notes on this question.
First, the data structure described above is one that could be plotted using spplot, with the T's being passed to spplot's zcol argument. So in this connection, my question amounts to asking how to convert an spplot structure to be usable by geom_sf.
Second, suppose I use sf to read in a shp file for, say, 20 polygons. I also have a data frame consisting of stacked observations for these same polygons, say for 3 periods, so the dataframe has 60 rows. How do I merge these in order to be usable? Can I just stack 3 copies of the sf structure, and then cbind the dataframe (assuming the rows match up correctly)?
At least in one sense this turns out to be very simple. Given a data structure (ds_sp) that can be plotted with spplot, you can just do the following:
ds_sf <- st_as_sf(ds_sp) # convert to sf form
plot(ds_sf[c("T1","T2")]) # plot the desired series
This isn't quite the same as using facet_wrap with ggplot, but at least it gives you something to work with.
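If you do want real ggplot facets from the wide sf object, a minimal sketch is to pivot the T columns into long form and facet on the former column names (this assumes T1, T2 and T3 are columns of ds_sf; the names ds_long, series and value are just illustrative):

library(sf)
library(tidyr)
library(ggplot2)

# Stack T1, T2, T3 into a single value column; the geometry list-column is carried along.
ds_long <- pivot_longer(ds_sf, c(T1, T2, T3), names_to = "series", values_to = "value")

# Mapping geometry explicitly in aes() works even if the pivot returns a plain data frame.
ggplot(ds_long) +
  geom_sf(aes(fill = value, geometry = geometry)) +
  facet_wrap(~ series)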
ANOTHER LATER ADDITION: As to the longitudinal + facet_wrap issue, the following seems to work:
1. If necessary, create a data frame (df1) with the longitudinal data (longit), an area indicator (fips), a time indicator (date) which will be used for faceting, and anything else you may need.
2. If necessary, create an sf-compatible version of the spatial geometry via st_as_sf, as new_poly. This will be of classes "sf" and "data.frame" and should have a spatial indicator matching fips in df1.
3. Merge the two:
data_new <- dplyr::inner_join(df1, new_poly, by = "fips")
(inner_join does not take merge()'s all.x argument; use left_join if you want to keep unmatched rows of df1.)
4. Now produce the plot
ggplot(data_new) + geom_sf(aes(fill = longit, geometry = geometry)) + facet_wrap(~date)
and make adjustments from there.
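(One note on the join: because the sf object is on the right-hand side, the result is a plain data frame with a geometry list-column, which is why the geometry is mapped explicitly in aes(). Putting the sf object on the left, or wrapping the result in st_as_sf(), keeps the sf class and lets geom_sf() find the geometry automatically.)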
I have one table with two columns
ID Probability
A 1%
B 2%
C 3%
D 4%
I have another table, with some IDs and corresponding weights:
ID Weight
A 50%
D 25%
A 15%
B 5%
B 5%
What I'm looking for is a way, in a single formula, to find the corresponding probabilities for each of the IDs in the second table using the data from the first, multiply each by their respective weights from the second table, then sum the results.
I recognise a simple way to solve it would be to add a proxy column to the second table and list corresponding probabilities using a vlookup and multiplying by the weight, then summing the results, but I feel like there must be a more elegant solution.
I've tried entering the second table IDs as an array in both Vlookup and Index/Match formulas, but while both accept a range as a lookup value, both only execute for the first value of the range instead of cycling through the whole array.
I guess ideally the formula would:
1. set up a 1 x 5 array for the IDs,
2. populate a new 1 x 5 array with the corresponding probabilities from the first table,
3. multiply that new array by the existing 1 x 5 array of weights,
4. sum the result.
[edit] So for the above example, the final result would be (50% x 1%)+(25% x 4%) + (15% x 1%) + (5% x 2%) + (5% x 2%) = 1.85%
The real tables are much, much bigger than the examples I've given, so a simple SUM() over individual VLOOKUPs is out.
I'd love to hear of any clever solutions!
Using the same ranges as given by Trương Ngọc Đăng Khoa:
=SUMPRODUCT(SUMIF(A1:A4,D1:D5,B1:B4),E1:E5)
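Here SUMIF, given the array of criteria D1:D5, returns the array of matching probabilities {1%;4%;1%;2%;2%} (since each ID appears only once in column A, the conditional sum is effectively a lookup), and SUMPRODUCT then multiplies that array element-wise by the weights in E1:E5 and sums: 50%*1% + 25%*4% + 15%*1% + 5%*2% + 5%*2% = 1.85%, matching the worked example in the question.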
Regards
You can use this formula:
{=SUM(LOOKUP(D1:D5;A1:A4;B1:B4)*E1:E5)}
With the table laid out like this:
    A   B    C   D   E
1   A   1%       A   50%
2   B   2%       D   25%
3   C   3%       A   15%
4   D   4%       B   5%
5                B   5%
Great response, thanks guys!
XOR LX, your answer seemed to work in all cases, which is what I was looking for (and seems like it was much simpler than I'd originally thought). I think I misunderstood the way the SUMIF function works.
In case anyone is interested, I also found my own (stupidly complex) solution:
=SUM(IF(A1:A4=TRANSPOSE(D1:D5),1,0)*TRANSPOSE(E1:E5)*B1:B4)
Which basically works by transforming the thing into a 4 x 5 matrix instead. I think I still prefer the XOR LX solution for its simplicity.
Appreciate the help, everyone!
For my master's thesis I have to use SPSS to analyse my data. Actually I thought that I wouldn't have to deal with very difficult statistical issues, which is still true regarding the concepts of my analysis. BUT the problem is that in order to create my dependent variable I need to use the syntax editor / programming in general, and I have no experience in this area at all. I hope you can help me in the process of creating my syntax.
I have in total approximately 900 companies with 6 year observations. For all of these companies I need the predicted values of the following company-specific regression:
Y = β1*X1 + β2*X2 + β3*X3 + error
(I know the βs will very likely not be significant, but this is nothing to worry about in my thesis; it will be mentioned in the limitations, though.)
So far my data are ordered in the following way
COMPANY YEAR X1 X2 X3
1 2002
2 2002
1 2003
2 2003
But I could easily change the order, e.g. to
1
1
2
2 etc.
OK, let's say I have rearranged the data: what I need now is for SPSS to compute the company-specific βs and return the output in one column (the predicted values, i.e. those βs multiplied by the specific X values in each row). So I guess what I need is a loop that runs a multiple linear regression on the 6 rows for each of the 939 companies, am I right?
As I said I have no experience at all, so every hint is valuable for me.
Thank you in advance,
Janina.
Bear in mind that with only six observations per company and three (or 4 if you also have a constant term) coefficients to estimate, the coefficient estimates are likely to be very imprecise. You might want to consider whether companies can be pooled at least in part.
You can use SPLIT FILE to estimate the regressions specific for each company, example below. Note that one would likely want to consider other panel data models, and assess whether there is autocorrelation in the residuals. (This is IMO a useful approach though for exploratory analysis of multi-level models.)
The example declares a new dataset to pipe the regression estimates to (see the OUTFILE subcommand on REGRESSION) and suppresses the other tables (with 900+ tables much of the time is spent rendering the output). If you need other statistics either omit the OMS that suppresses the tables, or tweak it to only show the tables you want. (You can use OMS to pipe other results to other datasets as well.)
************************************************************.
*Making Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #Comp = 1 to 1000.
  COMPUTE #R1 = RV.NORMAL(10,2).
  COMPUTE #R2 = RV.NORMAL(-3,1).
  COMPUTE #R3 = RV.NORMAL(0,5).
  LOOP Year = 2003 to 2008.
    COMPUTE Company = #Comp.
    COMPUTE Rand1 = #R1.
    COMPUTE Rand2 = #R2.
    COMPUTE Rand3 = #R3.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Companies.
COMPUTE x1 = RV.NORMAL(0,1).
COMPUTE x2 = RV.NORMAL(0,1).
COMPUTE x3 = RV.NORMAL(0,1).
COMPUTE y = Rand1*x1 + Rand2*x2 + Rand3*x3 + RV.NORMAL(0,1).
FORMATS Company Year (F4.0).
*Now sorting cases by Company and Year, then using SPLIT file to estimate
*the regression.
SORT CASES BY Company Year.
*Declare new set and have OMS suppress the other results.
DATASET DECLARE CoeffTable.
OMS
/SELECT TABLES
/IF COMMANDS = 'Regression'
/DESTINATION VIEWER = NO.
*Now split file to get the coefficients.
SPLIT FILE BY Company.
REGRESSION
/DEPENDENT y
/METHOD=ENTER x1 x2 x3
/SAVE PRED (CompSpePred)
/OUTFILE = COVB ('CoeffTable').
SPLIT FILE OFF.
OMSEND.
************************************************************.
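The /SAVE PRED(CompSpePred) subcommand is what produces the column the question asks for: the predicted values from each company's own regression, saved back to the active dataset as the new variable CompSpePred. The CoeffTable dataset collects the per-company coefficient estimates (and their covariances) if you want to inspect those separately.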
I have a rather specific spatial search I need to do. Basically, I have an object (let's call it obj1) with two locations; let's call them point A and point B.
I then have a collection of objects (let's call each one obj2), each with its own A and B locations.
I want to return the top 10 objects from the collection sorted by:
(distance from obj1 A to obj2 A) + (distance from obj1 B to obj2 B)
Any ideas?
Thanks,
Nick
Update:
Here's a little more detail on the documents and how I want to compare them.
The domain model:
Listing:
  ListingId int
  Title string
  Price double
  Origin Location
  Destination Location

Location:
  Post / Zipcode string
  Latitude decimal
  Longitude decimal
What I want to do is take a listing object (not in the database) and compare it with the collection of listings in the database. I want the query to return the top 12 (or x) listings sorted by the as-the-crow-flies distance between the origins plus the as-the-crow-flies distance between the destinations.
I don't care about the distance from origin to destination - only about the distance of origin to origin plus destination to destination.
Basically I'm trying to find listings where the starting and ending locations are both close to those of my listing.
Please let me know if I can clarify more.
Thanks!
Here is how one would solve such a problem in MySQL 4.1 and MySQL 5.
The link for MySQL 4.1 seems quite helpful, especially the first example; it's pretty much what you are asking about.
But if this is not quite helpful, I guess you'd have to loop and do queries either on obj1 or obj2 against its counterpart table.
From an algorithmic perspective, I'd find the center of the bounding box, then pick candidates with an increasing radius until I find enough.
Also, I just want to point out that as-the-crow-flies distance over the globe is not the Pythagorean distance; a different formula (the haversine formula) must be used:
private const double earthMeanRadiusMiles = 3958.8; // standard mean Earth radius in miles

private static double DegreesToRadians(double degrees)
{
    return degrees * Math.PI / 180.0;
}

// Haversine formula: great-circle ("as the crow flies") distance in miles.
public static double GetDistance(double lat1, double lng1, double lat2, double lng2)
{
    double deltaLat = DegreesToRadians(lat2 - lat1);
    double deltaLong = DegreesToRadians(lng2 - lng1);
    double a = Math.Pow(Math.Sin(deltaLat / 2), 2) +
               Math.Cos(DegreesToRadians(lat1))
               * Math.Cos(DegreesToRadians(lat2))
               * Math.Pow(Math.Sin(deltaLong / 2), 2);
    return earthMeanRadiusMiles * (2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a)));
}
Sounds like you're building a rideshare website. :)
The bottom line is that in order to sort your query result by surface distance, you'll need spatial indexing built into the database engine. I think your options here are MySQL with OpenGIS extensions (already mentioned) or PostgreSQL with PostGIS. It looks like it's possible in ravenDB too: http://ravendb.net/documentation/indexes/sptial
But if that's not an option, there's a few other ways. Let's simplify the problem and say you just want to sort your database records by their distance to location A, since you're just doing that twice and summing the result.
The simplest solution is to pull every record from the database and calculate the distance to location A one by one, then sort, in code. Trouble is, you end up doing a lot of redundant computations and pulling down the entire table for every query.
Let's once again simplify and pretend we only care about the Chebyshev (maximum) distance. This will work for narrowing our scope within the db before we get more accurate. We can do a "binary search" for nearby records. We must decide an approximate number of closest records to return; let's say 10. Then we query inside of a square area, let's say 1 degree latitude by 1 degree longitude (that's about 60x60 miles) around the location of interest. Let's say our location of interest is lat,lng=43.5,86.5. Then our db query is SELECT COUNT(*) FROM locations WHERE (lat > 43 AND lat < 44) AND (lng > 86 AND lng < 87). If you have indexes on the lat/lng fields, that should be a fast query.
Our goal is to get just above 10 total results inside the box. Here's where the "binary search" comes in. If we only got 5 results, we double the box area and search again. If we got 100 results, we cut the area in half and search again. If we get 3 results immediately after that, we increase the box area by 50% (instead of 100%) and try again, proceeding until we get close enough to our 10 result target.
Finally we take this manageable set of records, calculate their distance from the location of interest (the haversine formula shown above gives the true as-the-crow-flies value), and sort, in code.
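If it helps, here is a rough C# sketch of that adaptive box search; everything in it (the BoxSearch class, the CountInBox delegate standing in for the COUNT(*) query, the resize factors) is an illustrative assumption rather than an existing API:

using System;

public static class BoxSearch
{
    // Hypothetical stand-in for the COUNT(*) query above: how many rows fall inside
    // a square of +/- halfWidth degrees around (lat, lng).
    public delegate long CountInBox(double lat, double lng, double halfWidth);

    // Grow or shrink the box until the count lands just above the target, then
    // return the half-width to use for the real SELECT of candidate rows.
    public static double FindHalfWidth(CountInBox count, double lat, double lng,
                                       long target = 10, double initialHalfWidth = 0.5)
    {
        double halfWidth = initialHalfWidth;  // 0.5 degrees ~ a 1 x 1 degree box
        double areaFactor = 2.0;              // start by doubling/halving the box area
        bool? lastTooFew = null;

        for (int i = 0; i < 20; i++)          // cap the number of probe queries
        {
            long n = count(lat, lng, halfWidth);
            bool tooFew = n < target;

            if (!tooFew && n <= target * 4)   // just above the target: good enough
                return halfWidth;

            if (lastTooFew.HasValue && lastTooFew.Value != tooFew)
                areaFactor = 1.0 + (areaFactor - 1.0) / 2.0; // overshot: resize by less (100% -> 50% -> ...)

            // The factor applies to the box area, so the width scales by its square root.
            halfWidth = tooFew ? halfWidth * Math.Sqrt(areaFactor) : halfWidth / Math.Sqrt(areaFactor);
            lastTooFew = tooFew;
        }
        return halfWidth;                     // best effort once the probe budget is spent
    }
}

The half-width this returns is then plugged into the real SELECT, and only that manageable set of rows gets the exact distance calculation and sort in code.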
Good luck!
I do not think you will find a solution directly out of the box.
It'll be much more efficient if you use a bounding sphere instead of a bounding box to specify your object.
http://en.wikipedia.org/wiki/Bounding_sphere
C = (A + B) / 2 and R = distance(A, B) / 2
You do not say how much data you want to compare, or whether you want to find the closest or the farthest pairs of objects.
In either case, I think you would encode the C coordinate as a path in an octree if you are working in 3D, or a quadtree if you are working in 2D.
http://en.wikipedia.org/wiki/Quadtree
This is a first draft; I can add more information if this is not enough.
If you are not familiar with 3D, start with 2D; it is easier to begin with.
I saw your latest addition; it seems your problem is very similar to a clash detection algorithm.
I think you could change the coordinate system of the "end-point" to polar coordinates relative to the "start-point", round the radial coordinate to your tolerance (x miles), and order the listings by that value.