Is overdispersion unidentifiable if each survey site is only visited once? - logistic-regression

I am creating a binomial logistic regression model with two variables: average tree height within a survey site and tree density within that same site. For each site, I only have one measurement for each variable. In this case, is it correct that I cannot measure overdispersion because I cannot group the data into subsets?
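For context, the check that "measuring overdispersion" usually refers to is the ratio of the Pearson chi-square (or residual deviance) to the residual degrees of freedom, and it needs grouped binomial counts rather than single binary responses. A minimal sketch, assuming hypothetical grouped data (several trials per site), just to illustrate that computation:

import numpy as np
import statsmodels.api as sm

# Hypothetical grouped data: each site contributes several binomial trials.
rng = np.random.default_rng(0)
n_sites = 40
height = rng.normal(10, 2, n_sites)          # average tree height per site
density = rng.normal(500, 100, n_sites)      # tree density per site
n_trials = rng.integers(5, 20, n_sites)      # assumed number of trials per site
p = 1 / (1 + np.exp(-(-3 + 0.2 * height + 0.002 * density)))
successes = rng.binomial(n_trials, p)

X = sm.add_constant(np.column_stack([height, density]))
fit = sm.GLM(np.column_stack([successes, n_trials - successes]), X,
             family=sm.families.Binomial()).fit()

# Rough dispersion estimate: Pearson chi-square over residual df.
# With a single 0/1 response per site this ratio carries no information
# about overdispersion, which is exactly the identifiability issue asked about.
print(fit.pearson_chi2 / fit.df_resid)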

Related

How can I estimate an overall intercept in a cumulative link mixed model?

I am testing a cumulative link mixed model, and I want to estimate an overall intercept for the model.
The outcome of interest has 4 categories, so the model has 3 logits each with a unique intercept (threshold coefficient).
The model is tested in R with the ordinal package using the clmm function. I included a random intercept, a random slope, and a cross-level interaction.
The model looks like this:
comp.model.fit <- clmm(
  competence_ordinal ~ competence.state.lag1 + tantrum_dur.state.lag1 +
    NEUROw1.c + tantrum_dur.state.lag1:NEUROw1.c +
    (1 + tantrum_dur.state.lag1 | ID_nummer),
  data = data, Hess = TRUE, na.action = na.exclude)
Results of the fitted model showed that the cross-level interaction is significant. So, I would like to find the region of significance of the simple slope, i.e. the specific values of the moderator (NEUROw1.c) at which the slope of the regression of the outcome (competence_ordinal) on the focal predictor (tantrum_dur.state.lag1) transitions from non-significance to significance.
To compute a test of simple slopes I need an estimate of the intercept; however, in this type of model an overall intercept is not identified alongside the threshold coefficients.
Therefore, my question is how can I estimate an overall intercept?
Is there a way to constrain the first threshold to zero to be able to estimate the intercept?

ArangoDB - Graph based recommender system

I am using ArangoDB and I am trying to build a graph-based recommender system with it.
The data model just contains users, items and ratings (edges).
Based on this, I want to calculate the affinity of a user to a movie with the Katz measure.
Eventually I want to do this:
Get all (or a certain number of) paths between a user and an item
For all of these paths do the following:
Multiply each edge's rating with a damping factor (e.g. 0.7)
Sum up all calculated values within a path
Calculate the average of all calculated path values
The result is some kind of affinity between a user and an item, weighted with the intermediary ratings and damped by a defined factor.
I was trying to implement something like that in AQL, but it was either wrong or much too slow. How could an algorithm like this look in AQL?
From a performance point of view there might be better choices for graph-based recommender systems. If someone has a suggestion (e.g. ItemRank or other algorithms), it would also be nice to get some ideas here.
I love this topic, but sometimes I reach my limits.
In the following, #start and #end are parameters representing the two endpoints; for simplicity, I've assumed that:
the maximum admissible path length is 10000
"rates" is the name of the "edges" collection
"rating" is the name of the property giving a weight to an edge
the "damping" factor is as per the requirements
FOR v, e, p IN 0..10000 OUTBOUND #start rates
  OPTIONS { uniqueVertices: "path" }
  /* keep only the paths that end at the target item */
  FILTER v._id == #end
  /* score each path: the average of its edge ratings, damped by 0.7 */
  LET r = AVERAGE(p.edges[*].rating) * 0.7
  /* aggregate the per-path scores into a single affinity value */
  COLLECT AGGREGATE avg = AVERAGE(r)
  RETURN avg
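Independent of the AQL, here is a minimal plain-Python sketch of the scoring described in the question, with hypothetical in-memory paths; it may be useful for sanity-checking the query. Note that it follows the question's steps (sum of the damped ratings within a path), whereas the AQL above averages the ratings within a path.

# Hypothetical in-memory illustration of the path scoring described above.
# Each path is given as the list of edge ratings between the user and the item.
DAMPING = 0.7  # damping factor, as in the requirements

def path_score(edge_ratings, damping=DAMPING):
    # multiply each edge's rating by the damping factor, then sum within the path
    return sum(rating * damping for rating in edge_ratings)

def affinity(paths):
    # average the per-path scores into a single user-item affinity
    scores = [path_score(p) for p in paths]
    return sum(scores) / len(scores) if scores else 0.0

# e.g. two rating paths from a user to an item
print(affinity([[4.0, 3.5], [5.0, 2.0, 4.0]]))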

Motivation for k-medoids

Why would one use the k-medoids algorithm rather than k-means? Is it only the fact that the number of metrics that can be used in k-means is very limited, or is there something more?
Is there an example of data for which it makes much more sense to choose the best representatives of a cluster from the data rather than from R^n?
One problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to explain why it produced a certain output.
Let's take an example.
Suppose there is a food-review dataset with two possibilities: a review is either positive or negative, so we can say k = 2, where k is the number of clusters. In k-means, the third step of the algorithm is the update step, where you recompute the k centroids as the mean of the points that lie in each cluster. Since the example we have chosen is a text problem, you would also apply some text-feature vectorization scheme such as bag-of-words (BOW) or word2vec, so for every review you get a corresponding vector. The centroid c_i produced by k-means is then the mean of the vectors in that cluster, and from such a centroid you cannot interpret much, or rather nothing at all.
For the same problem, apply k-medoids instead, where you choose the k centroids/medoids from the dataset itself. Say you choose the point x_5 from your dataset as the first medoid. Your interpretability increases, because the cluster centre is now an actual review, the medoid. So in k-medoids you choose the centroids from your dataset itself.
This is the foremost motivation for introducing k-medoids.
As for the metrics, you can apply all the metrics that you apply for k-means.
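A minimal sketch of that difference, using hypothetical reviews and stand-in feature vectors (any BOW/word2vec vectors would do): the k-means centre is a mean vector no review corresponds to, while the medoid is an actual review you can read.

import numpy as np

rng = np.random.default_rng(0)
reviews = ["great food", "loved it", "terrible service", "awful meal", "really tasty"]
vectors = rng.random((5, 4))   # stand-in for BOW / word2vec features of the reviews

# k-means style centre: the mean of the vectors in the cluster -- no review maps to it
centroid = vectors.mean(axis=0)

# k-medoids style centre: the data point minimising the total distance to the others
pairwise = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
medoid_idx = pairwise.sum(axis=1).argmin()

print("centroid vector:", centroid)             # not attached to any actual review
print("medoid review:  ", reviews[medoid_idx])  # an actual, readable review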
Hope this helps.
Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?
1. Technical justification
For relatively small data sets (since the complexity of k-medoids is greater): to obtain a clustering that is more robust to noise and outliers.
Example 2D data showing this:
The plot on the left shows the clusters obtained with K-medoids (the sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means, for K=2. Blue crosses are the cluster centers.
The Python code used to generate green points:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=32)
a = rng.random((6, 2)) * 2.35 - 3 * np.ones((6, 2))       # 6 widely scattered points (outlier-like)
b = rng.random((50, 2)) * 0.25 - 2 * np.ones((50, 2))     # 50 points in a tight blob
c = rng.random((100, 2)) * 0.5 - 1.5 * np.ones((100, 2))  # 100 points in a slightly wider blob
d = rng.random((7, 2)) * 0.55                             # 7 points near the origin
points = np.concatenate((a, b, c, d))
plt.plot(points[:, 0], points[:, 1], "g.", markersize=8, alpha=0.3)  # green points
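To reproduce the comparison described in the plots above, a sketch along these lines could be used (it assumes scikit-learn and scikit-learn-extra are installed, and reuses the points array from the snippet above):

from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

kmedoids = KMedoids(n_clusters=2).fit(points)   # default options, as in the text
kmeans = KMeans(n_clusters=2).fit(points)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, model, title in zip(axes, (kmedoids, kmeans), ("K-medoids", "K-means")):
    ax.plot(points[:, 0], points[:, 1], "g.", markersize=8, alpha=0.3)
    ax.plot(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            "bx", markersize=12)                # blue crosses: cluster centers
    ax.set_title(title)
plt.show()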
2. Business case justification
Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.
2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user those items (e.g. movies) that other similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative were not an actual person, we would not have a history of actually watched movies to recommend from; each time we would additionally have to search for, e.g., the closest real person in the cluster. Example data: the classic MovieLens 1M Dataset.
2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug with them. After clustering the patients with K-medoids, the cluster representatives are invited to the drug trial.
The difference is that in k-means the centroids (cluster centres) are calculated as the average of the vectors contained in the cluster, while in k-medoids the medoid (cluster centre) is a record from the dataset that is closest to the centroid. So if you need the cluster centre to be represented by a record of your data, use k-medoids; otherwise use k-means (though the concept behind both algorithms is the same).
The K-Means algorithm uses a distance function such as Euclidean distance or Manhattan distance, which are computed over vector-based instances. The K-Medoids algorithm instead uses a more general (and less constrained) distance function: a pairwise distance (dissimilarity) function.
This distinction works well in contexts like complex data types or relational rows, where the instances have a high number of dimensions.
High dimensionality problem
In standard clustering libraries and the k-means algorithm, the distance-computation phase can spend a lot of time scanning the entire vector of attributes belonging to an instance; for instance, in the context of document clustering with the standard TF-IDF representation, computing the cosine similarity means the distance function scans all the words that appear anywhere in the collection of documents, which in many cases can be millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of the language.
Using K-Medoids, there is no need to represent and store the documents as vectors of word frequencies.
As an alternative representation for the documents, it is possible to use the set of words appearing at least twice in the document, and as a distance measure the Jaccard distance can be used.
By contrast, a full vector representation is as long as the number of words in your dictionary.
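A minimal sketch of this set-based representation and the Jaccard distance, with hypothetical toy documents:

from collections import Counter

def word_set(text):
    # keep only the words appearing at least twice in the document
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count >= 2}

def jaccard_distance(a, b):
    # 1 - |A intersection B| / |A union B|
    union = a | b
    return 1.0 if not union else 1.0 - len(a & b) / len(union)

doc1 = word_set("the cat sat on the mat the cat slept")
doc2 = word_set("the dog sat on the mat the dog barked")
print(jaccard_distance(doc1, doc2))   # ~0.67 for these toy documents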
Heterogeneity and complex data types
There are many domains where it is considerably better to abstract away the implementation of an instance:
Clustering of graph nodes;
Car driving behaviour, represented as GPS routes;
Complex data types allow the design of ad-hoc distance measures that can fit the data domain better.
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
Source: https://github.com/eracle/Gap

House pricing using neural network

I wrote a multilayer perceptron implementation (in Python) which is able to classify the Iris dataset. It was trained with the backpropagation algorithm and uses sigmoid activation functions on the hidden and output layers.
But now I want to change it to be able to approximate house prices.
(I have a dataset of ~300 estates with prices and input parameters like rooms, location, etc.)
Currently the output of my perceptron is in the range [0; 1]. But as far as I understand, if I want the output neuron to produce the resulting house price, I need to change that activation function somehow, right?
Can somebody help me?
I'm new to neural networks
Thanks in advance.
Assuming, for instance, that house prices range between $1 and $1,000,000, you can just map the 0...1 range to the final price range, both for training and for testing. Just note that 300 estates is a fairly small data set.
To be precise, if a house is $500k, then the target training output becomes 0.5. You basically divide by your maximum possible home value to get the target training value; when you get the output value, you multiply by the maximum home value to get the predicted price.
So, view the output of the neural network as a fraction of the maximum possible price.
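A minimal sketch of that mapping, assuming a hypothetical MAX_PRICE as the upper bound of house prices in the dataset:

import numpy as np

MAX_PRICE = 1_000_000.0   # assumed maximum possible home value

def to_target(prices):
    # scale dollar prices into the [0, 1] range used as training targets
    return np.asarray(prices, dtype=float) / MAX_PRICE

def to_price(outputs):
    # map sigmoid outputs back to dollar amounts at prediction time
    return np.asarray(outputs, dtype=float) * MAX_PRICE

print(to_target([500_000.0]))   # -> [0.5]
print(to_price([0.5]))          # -> [500000.]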

Create a consistent topology using pgrouting

I'm developing an application that needs routing information for certain cities. First, I downloaded an OpenStreetMap data file (*.osm) and then I imported it into a PostgreSQL database using the osm2pgrouting tool (http://workshop.pgrouting.org/chapters/installation.html).
After this, I have the following tables:
nodes: contains simple location points
ways: contains ways together with the nodes involved
vertices_tmp: stores the nodes that may be used by pgRouting functions like Dijkstra, A*, etc.
Can I use nodes that aren't in the "vertices_tmp" table to calculate distances between nodes, or should I only do that with the nodes stored in "vertices_tmp"?
In the ways table there is a field named "the_geom" that encapsulates the different location points (nodes) of a way. For example:
"MULTILINESTRING((1.5897786 42.5600441,1.5898376 42.5601455,1.589992 42.5605438,1.590095 42.5606795,1.5901782 42.5608026,1.5902238 42.561018,1.5902912 42.5616808,1.5903685 42.561899,1.5904008 42.5620563,1.5903836 42.5624117,1.5904265 42.5627151,1.5904947 42.5628368,1.5905981 42.5629553,1.5906926 42.5631007,1.590802 42.5633238,1.5908604 42.5634883,1.5909501 42.5637139,1.5910869 42.5638755,1.5913053 42.5639639,1.5914994 42.5640237,1.591648 42.5640261,1.5919232 42.5640145,1.5921124 42.5640363,1.5923292 42.5640953,1.592804 42.5643306))"
Can I route with intermediate nodes, or only with source/target nodes?
My goal is to be able to route between different nodes or POIs, depending on their amenity tags, and not only by driving distance but by walking distance too. Furthermore, I need to calculate the shortest path between source and target nodes.
Any ideas on how to do this?
You can't use the elements of the nodes table.
If you want to plan a route from one POI to another, you first have to find the nearest vertex or edge, depending on the selected algorithm (Shooting Star requires edges, the others use vertices).
After this you can do the routing; just pick an algorithm from THIS SITE.
There you will find a good tutorial about the different routing solutions and some help with their detailed usage (including how to determine the closest way).
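As a rough sketch of those two steps (snap each POI to its nearest routing vertex, then run Dijkstra between the two vertices), something like the following could work from Python with psycopg2. The table and column names (vertices_tmp.id, the_geom, ways.gid/source/target/length), the SRID, and the connection settings are assumptions based on a typical osm2pgrouting schema, so they may need adjusting for your setup:

import psycopg2

# Assumed connection settings -- adjust to your database.
conn = psycopg2.connect("dbname=routing user=postgres")
cur = conn.cursor()

def nearest_vertex(lon, lat):
    # order the routing vertices by distance to the POI and take the closest one
    cur.execute(
        """
        SELECT id
        FROM vertices_tmp
        ORDER BY the_geom <-> ST_SetSRID(ST_MakePoint(%s, %s), 4326)
        LIMIT 1
        """,
        (lon, lat),
    )
    return cur.fetchone()[0]

source = nearest_vertex(1.5898, 42.5600)   # hypothetical POI coordinates
target = nearest_vertex(1.5928, 42.5643)

# Shortest path over the ways table, using length as the edge cost.
cur.execute(
    """
    SELECT * FROM pgr_dijkstra(
        'SELECT gid AS id, source, target, length AS cost FROM ways',
        %s, %s, false)
    """,
    (source, target),
)
for row in cur.fetchall():
    print(row)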
