Merging yearly .gdb geodatabases

I'm working with a database of land parcels and their corresponding crops per year (source)
My end goal is to create a predictive model for crop type, given a land parcel. For example, if in the past 4 years a farmer grew soy, soy, corn and corn, what is the crop he will most likely grow in the 5th year?
There are 8 individual .gdb databases for the years 2009 to 2016, but the schemas are the same; most importantly, each has the land parcel geometry (polygon) and the type of crop.
Ideally I'd want to work with one (time series) database where I have one row per parcel geometry and one column per year (to record the crop type for that year).
What is the best way to achieve this?
A lot of polygons 'repeat' across the years but with very slight variations in their borders, so I guess any two polygons that overlap by more than a given threshold should be given the same unique ID. Samples:
2016 (green) and 2009 (red)
Only 2009 (red)

As you say .gdb, I assume you'd like to use or stick to ArcGIS?
In any case, a very basic approach to get you going (if getting you going is what you need): join the feature classes' tables.
If there is any sort of unique ID that is consistent throughout all GDBs, join them on this attribute.
(if not, there should also be a tool for a spatial join with a threshold to get the same result, as you suggested)
You can choose which columns to include in the joined table, and you can export the joined tables to a new GDB. Correct me if I'm wrong, but I think this might be sufficient for your needs (I'm just brainstorming here).
If not, there are lots and lots of options using GDAL/OGR, Python, R, PostGIS, ... maybe you can list your preferences in software?

Here are the steps I took:
Step 1. Convert the .gdb files to shapefiles; this is because apparently gdb is not an open-source format and tool support for it is limited.
ogr2ogr -f "ESRI Shapefile" 2009.shp 2009.gdb
ogr2ogr -f "ESRI Shapefile" 2010.shp 2010.gdb
.... same for other years ....
File names denote the dataset corresponding to that year.
Step 2. Import shape files into PostGIS
shp2pgsql -I -s 28992 2009.shp 2009 | psql -U postgres -d crops
.... same for other years ....
28992 is the SRID that I found with the help of this website.
Step 3. Add a foreign key column to the 2010 table which stores the polygon ID from 2009; let's call it global_geo_id. If it is set for a polygon in 2010, it means it is the 'same' polygon as in 2009.
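A minimal sketch of this step in SQL, assuming the default gid primary key that shp2pgsql creates (everything else about the schema is as imported):
-- Seed 2009 with its own global IDs from the primary key shp2pgsql created.
ALTER TABLE "2009" ADD COLUMN global_geo_id integer;
UPDATE "2009" SET global_geo_id = gid;
-- 2010 gets the same column; the spatial join in step 4 fills it in.
ALTER TABLE "2010" ADD COLUMN global_geo_id integer;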
Step 4. Do a spatial join between years in PostGIS:
update "2010"
set global_geo_id = "2009".global_geo_id
from "2009" where
ST_Intersects("2009".geom,"2010".geom) and
(st_area(st_intersection("2009".geom,"2010".geom))/("2009".shape_area)) > 0.8
0.8 is a similarity threshold; the higher you set it, the more conservative the matching will be.
Step 5. Repeat steps 3, 4 for other years.
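From there, to get the one-row-per-parcel, one-column-per-year layout described above, a join sketch could look like this (the crop column name is hypothetical; adjust it to whatever the source schema uses):
-- One row per parcel, one crop column per year.
SELECT "2009".global_geo_id,
       "2009".crop AS crop_2009,
       "2010".crop AS crop_2010
       -- ... one LEFT JOIN and one column per remaining year ...
FROM "2009"
LEFT JOIN "2010" ON "2010".global_geo_id = "2009".global_geo_id;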

Related

Google Data Studio: how to obtain a SUM related to a COUNT_DISTINCT?

I have a dataset including 3 columns:
ID transac (The unique ID of the transaction - Dimension)
Source (The source of the transaction - Dimension)
Amount € (The amount of the transaction - Stat)
screenshot of my dataset
To count the number of transactions (for one or more sources), I use the COUNT_DISTINCT function.
I want to sum the transaction amounts (for one or more sources), but I don't want to add together the amounts of transactions that share the same ID!
Is there a way to do this calculation with a Data Studio function?
Thanks for your answers. :-)
EDIT: I saw that this type of calculation can be done via SQL here, and I would like to do it in Data Studio (so that I don't have to pre-calculate the amounts per source).
IMO, your dataset contains wrong data. Each value should relate only to its own line, but this is not the case: if the total is 20, each line should describe that line's contribution to the total. With 4 sources, each line should be 5, or something else that sums to 20.
To solve it in DataStudio, you need something like CALCULATE function in PowerBI, but currently DataStudio doesn't support this feature.
But there are some options to consider to repair your data:
If you're sure there are always 4 sources, just create a new calculated field with the expression Amount/4 and SUM it. It is not an elegant solution, but it works.
If your data source is Google Sheets, you can easily repair the data using formulas, like in this example:
Link to spreadsheet
For this spreadsheet, I used this formula in the adjusted_amount column: =C2/COUNTIF(A:A,A2). With this column in Data Studio, just use the usual SUM aggregation function to summarize it correctly.
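For reference, the SQL route mentioned in the question's edit boils down to deduplicating by transaction ID before summing; a sketch with hypothetical table and column names:
-- Keep one row per transaction ID, then sum per source.
SELECT source, SUM(amount) AS total_amount
FROM (SELECT DISTINCT transaction_id, source, amount FROM transactions) AS t
GROUP BY source;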

MongoDB and Arctic

I intend to analyse multiple data sets on the same time series (daily EOD). I will need to use computed columns, e.g. use columns A + B to create column C (store the net result of the calculation in column C). Is this functionality available using the MongoDB / Arctic database?
I would also intend to search the data, for example: what happens when the advance decline thrust pushes over 70 while the cumulative TICK was below -100,000 in the past 'n' days?
Two data sets: Cumulative TICK and the Advance Decline Thrust (which uses advancers / decliners data). They would be stored in the database, and then I would want the capability to search for the above condition. Is this achievable with the MongoDB / Arctic database structure?
Just looking for some general information before I move to a DB format. Currently everything I have created is in Excel / VBA, and it has already outgrown that!
Any information greatly appreciated.
Note: I will use the same database for weekly, monthly, yearly and 1 minute, 3 minute, 5 minute 60 minute TICK/TIME based bars - not feeding live but updated EOD
Yes, this can be done with Arctic. Arctic can store pandas DataFrames, and an operation like the one you mention is trivial in pandas. Arctic is just a store, so you would read the data out of Arctic (data is stored in symbols in Arctic), perform your transform, and then write the data back. Any of the storage engines (VersionStore, TickStore, or ChunkStore) should work for this.
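A rough sketch of that read/transform/write cycle in Python with pandas and Arctic's VersionStore (the library and symbol names here are made up for illustration):
from arctic import Arctic

store = Arctic('localhost')            # connect to MongoDB
store.initialize_library('eod_bars')   # one-time library setup
lib = store['eod_bars']

item = lib.read('SPX_DAILY')           # returns a VersionedItem
df = item.data                         # the stored pandas DataFrame
df['C'] = df['A'] + df['B']            # the computed column
lib.write('SPX_DAILY', df)             # write the result back as a new version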

Differentiating between two similar entries in SQL Server based on special characters

My table in SQL server has some entries like shown below.
2934046 Kellogg’s Share Your Breakfast 74672 2407522 Kellogg?s Share Your Breakfast ACTIVE 2015-09-01 9999-12-31
2934046 Kellogg?s Share Your Breakfast 74672 2407522 Kellogg?s Share Your Breakfast ACTIVE 2015-09-01 9999-12-31
Another example could be
2939508 UOL Ação Social 81534 1527484 UOL Ac?o Social ACTIVE 2015-09-01 9999-12-31
2939508 UOL Ac?o Social 81534 1527484 UOL Ac?o Social ACTIVE 2015-09-01 9999-12-31
As can be seen, both entries are the same except for the question mark characters in the second one. Even if I do something like
SELECT DISTINCT * from my_table
it is not useful. I have to figure out a way to remove these kinds of duplicate entries based on the special characters. My manager says that the entries with question marks are basically bad data and I should remove them. Does anyone have an idea how to do so?
You can implement the Damerau-Levenshtein algorithm, which evaluates how similar two strings are, in a CLR project and use it from T-SQL.
You can experiment with your data to find the proper threshold value for accepting two strings as duplicates.
A C# example of the algorithm's implementation can be found here:
Damerau - Levenshtein Distance, adding a threshold
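If the rule really is as simple as 'rows containing question marks are bad data', a plainer first pass than CLR might already be enough; a sketch with hypothetical key and column names:
-- Remove a '?' row only when a clean counterpart with the same key exists.
DELETE bad
FROM my_table AS bad
WHERE bad.description LIKE '%?%'
  AND EXISTS (SELECT 1 FROM my_table AS clean
              WHERE clean.id = bad.id
                AND clean.description NOT LIKE '%?%');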
I faced the same problem this year during a data load activity, and my manager gave me the hint to use the SSIS Fuzzy Grouping transformation to find identical records. Create a small SSIS package and add a data flow task. Inside the data flow task add (source + fuzzy grouping + destination).
Visit -
Adding Fuzzy Group Transform to Identify Duplicates
https://msdn.microsoft.com/en-us/library/jj819770(v=sql.120).aspx

Is it possible to create an SQL query that displays results like this?

Background
I have a database that holds records of all assets in an office. Each asset has a condition, a category name and an age.
A ConditionID can be:
In use
Spare
In Circulation
CategoryID can be:
Phone
PC
Laptop
and Age is just a field called AquiredDate which holds values like:
2009-04-24 15:07:51.257
Example
I've created an example of the query's inputs to explain better what I need, if possible.
NB.
Inputs are in Orange in the above example.
I've split the example into two separate queries.
Count would be the output
Question
Is this type of query and result set possible using SQL alone? And if so, where do I start? Or would it be easier to use MS Excel as well?
Yes, it is possible. For your orange fields you can just use e.g.
where CategoryID ='Phone' and ConditionID in ('In use', 'In Circulation')
For the yellow one you could take the DATEDIFF in days from AquiredDate to now, divide it by 365 and floor that value. To get the last one (the 6+ years category) you take the minimum of 5 and the calculated value, so you get 0 for everything between 0 and 1 year old, and so on, up to 5, which holds everything in the oldest category.
When you group by that calculated column and additionally select the count, you get what you desire.
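Putting that together, assuming SQL Server syntax and a hypothetical Assets table, the whole thing might look like:
-- Buckets 0-4 = full years since acquisition, 5 = the oldest category.
SELECT CASE WHEN DATEDIFF(day, AquiredDate, GETDATE()) / 365 >= 5 THEN 5
            ELSE DATEDIFF(day, AquiredDate, GETDATE()) / 365 END AS age_bucket,
       COUNT(*) AS asset_count
FROM Assets
WHERE CategoryID = 'Phone'
  AND ConditionID IN ('In use', 'In Circulation')
GROUP BY CASE WHEN DATEDIFF(day, AquiredDate, GETDATE()) / 365 >= 5 THEN 5
              ELSE DATEDIFF(day, AquiredDate, GETDATE()) / 365 END;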

SPSS :Loop through the values of variable

I have a dataset that has patient data according to the site where they visited our mobile clinic. I have now written up a series of commands such as freqs and crosstabs to produce the analyses I need; however, I would like this to be done for the patients at each site rather than for the dataset as a whole.
If I had only one site, a mere filter command with the variable that specifies a patient's site would suffice, but alas I have 19 sites, so I would like to find a way to loop through my code and produce these outputs for each site. That is to say, for i in 1 to 19:
1. Take the i th site
2. Compute a filter for this i th site
3. Run the tables using this filtered data of patients at ith site
Here is my first attempt using DO REPEAT. I also tried using LOOP earlier.
However, it does not work; I keep getting an error even though these are closed loops.
Is there a way to do this in SPSS syntax? Bear in mind I do not know Python well enough to do this using that plugin.
*LOOP #ind= 1 TO 19 BY 1.
DO REPEAT #ind= 1 TO 20.
********** Select the Site here.
COMPUTE filter_site=(RCDSITE=#ind).
USE ALL.
FILTER BY filter_site.
**********************Step 3: Apply the necessary code for tables
*********Participation in the wellness screening, we actually do not care about those who did FP as we are not reporting it.
COUNT BIO= CheckB (1).
* COUNT FPS=CheckF(1).
* COUNT BnF= CheckB CheckF(1).
VAL LABEL BIO
1 ' Has the Wellness screening'
0 'Does not have the wellness screening'.
*VAL LABEL FPS
1 'Has the First patient survey'.
* VAL LABEL BnF
1 'Has either Wellness or FPS'
2 'Has both surveys done'.
FREQ BIO.
*************************Use simple math to calculate those who only did the Wellness/First Patient survey: FUB = F + B - FnB.
*******************************************************Executive Summary.
***********Blood Pressure.
FREQ BP.
*******************BMI.
FREQ BMI.
******************Waist Circumference.
FREQ OBESITY.
******************Glucose.
FREQ GLUCOSE.
*******************Cholesterol.
FREQ TC.
************************ Haemoglobin.
FREQ HAEMOGLOBIN.
*********************HIV.
FREQ HIV.
******************************************************************************I Lifestyle and General Health.
MISSING VALUES Gender GroupDep B8 to B13 ('').
******************Graphs 3.1
Is this just Frequencies you are producing? Try the SPLIT FILE procedure with the variable RCDSITE. That should be enough.
SPLIT FILES allows you to partition your data by up to eight variables. Then each procedure will automatically iterate over each group.
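A minimal sketch of that, reusing the variable names from the question's syntax:
SORT CASES BY RCDSITE.
SPLIT FILE LAYERED BY RCDSITE.
FREQ BIO BP BMI OBESITY GLUCOSE TC HAEMOGLOBIN HIV.
SPLIT FILE OFF.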
If you need to group the results at a higher level than the procedure, that is, to run a bunch of procedures for each group before moving on to the next one so that all the output for a group is together, you can use the SPSSINC SPLIT DATASET and SPSSINC PROCESS FILES extension commands to do this.
These commands require the Python Essentials. That and the commands can be downloaded from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) if you have at least version 18.
HTH,
Jon Peck
A simple but perhaps not very elegant way is to select from the menu: Data / Select Cases / If condition; there you enter the filter for site 1 and press Paste, not OK.
This will give you the filter you used as syntax code.
So with some copy/paste/replace/repeat you can get the freqs and all the other results based on the different sites.
