How can I improve the speed of a query? - postgis

I have this query, which takes around 110 ms to execute in Symfony.
The first problem I see is that the index is a btree and not a GiST index. How can I tell Doctrine to use a GiST index?
#[ORM\Index(name: 'index_geog', columns: ['geog'])]
[...]
#[ORM\Column(type: PostGISType::GEOGRAPHY, nullable: true)]
private string $geog;
Here is the query and the EXPLAIN analysis:
SELECT
p0_.id AS id_0,
p0_.titre AS titre_1,
p0_.latitude AS latitude_2,
p0_.longitude AS longitude_3,
ST_AsEWKT(p0_.geog) AS geog_4,
ST_Distance(
p0_.geog,
ST_Point(?, ?)
) AS sclr_13
FROM
fiche p0_
ORDER BY
p0_.geog <-> ST_Point(?, ?) ASC
LIMIT
9
Limit  (cost=4408.87..4647.49 rows=9 width=1992)
  ->  Result  (cost=4408.87..230109.79 rows=8513 width=1992)
        ->  Sort  (cost=4408.87..4430.16 rows=8513 width=1984)
              Sort Key: ((geog <-> '0101000020E61000000000000000001C400000000000804840'::geography))
              ->  Seq Scan on fiche p0_  (cost=0.00..4231.38 rows=8513 width=1984)
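As for the index: the <-> KNN ordering can only be served by an index scan (instead of the Seq Scan + Sort above) when the geography column has a GiST index. If the Doctrine attribute cannot express that, one workaround is to create the index with raw SQL, for example from a Doctrine migration via addSql(). A sketch, using the index, table and column names from the question:

-- Sketch only: replace the btree index with a spatial (GiST) index so the
-- KNN operator <-> in the ORDER BY can be served by an index scan.
DROP INDEX IF EXISTS index_geog;
CREATE INDEX index_geog ON fiche USING GIST (geog);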

Related

How to Improve the ADO Lookup Speed?

I am writing a C++ application with Visual Studio 2008 + ADO (not ADO.NET). It will do the following tasks one by one:
Create a table in a SQL Server database, as follows:
CREATE TABLE MyTable
(
[S] bigint,
[L] bigint,
[T] tinyint,
[I1] int,
[I2] smallint,
[P] bigint,
[PP] bigint,
[NP] bigint,
[D] bit,
[U] bit
);
Insert 5,030,242 records via BULK INSERT
Create an index on the table:
CREATE Index [MyIndex] ON MyTable ([P]);
Start a function that performs 65,000,000 lookups. Each lookup uses the following query:
SELECT [S], [L]
FROM MyTable
WHERE [P] = ?
Each time the query either returns nothing or returns one row. If it returns a row with [S] and [L], I convert [S] to a file pointer and then read data from the offset specified by [L].
Step 4 takes a lot of time, so I profiled it and found that the lookup query takes most of the time. Each lookup takes about 0.01458 seconds.
I have tried to improve the performance in the following ways:
Use a parameterized ADO query. See step 4.
Select only the required columns. Originally I used "Select *" for step 4; now I use Select [S], [L] instead. This improves performance by about 1.5%.
Tried both a clustered and a non-clustered index on [P]. It seems that a non-clustered index is a little better.
Are there any other ways to improve the lookup performance?
Note: [P] is unique in the table.
Thank you very much.
You need to batch the work and perform one query that returns many rows, instead of many queries each returning only one row (and incurring a separate round-trip to the database).
The way to do it in SQL Server is to rewrite the query to use a table-valued parameter (TVP), and pass all the search criteria (denoted as ? in your question) together in one go.
First we need to declare the type that the TVP will use:
CREATE TYPE MyTableSearch AS TABLE (
P bigint NOT NULL
);
And then the new query will be pretty simple:
SELECT
S,
L
FROM
@input I
JOIN MyTable
ON I.P = MyTable.P;
The main complication is on the client side, in how to bind the TVP to the query. Unfortunately, I'm not familiar with ADO - for what it's worth, this is how it would be done under ADO.NET and C#:
static IEnumerable<(long S, long L)> Find(
SqlConnection conn,
SqlTransaction tran,
IEnumerable<long> input
) {
const string sql = @"
SELECT
S,
L
FROM
@input I
JOIN MyTable
ON I.P = MyTable.P
";
using (var cmd = new SqlCommand(sql, conn, tran)) {
var record = new SqlDataRecord(new SqlMetaData("P", SqlDbType.BigInt));
var param = new SqlParameter("input", SqlDbType.Structured) {
Direction = ParameterDirection.Input,
TypeName = "MyTableSearch",
Value = input.Select(
p => {
record.SetValue(0, p);
return record;
}
)
};
cmd.Parameters.Add(param);
using (var reader = cmd.ExecuteReader())
while (reader.Read())
yield return (reader.GetInt64(0), reader.GetInt64(1));
}
}
Note that we reuse the same SqlDataRecord for all input rows, which minimizes allocations. This is documented behavior, and it works because ADO.NET streams TVPs.
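As a hypothetical usage sketch (allKeys and the batch size are assumptions; Enumerable.Chunk needs .NET 6+, otherwise batch by hand), the 65,000,000 single-row lookups then become a few thousand set-based queries:

// Hypothetical caller: batch the keys and resolve each batch in one round-trip.
foreach (var batch in allKeys.Chunk(10_000))
{
    foreach (var (s, l) in Find(conn, tran, batch))
    {
        // convert s to a file pointer and read at offset l, as in step 4
    }
}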
Note: [P] is unique in the table.
Then you should make the index on P unique too - for correctness and to avoid wasting space on the uniquifier.
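For example, a sketch that recreates the index from step 3 as unique:

-- Sketch: drop the existing index and recreate it as UNIQUE.
DROP INDEX [MyIndex] ON MyTable;
CREATE UNIQUE INDEX [MyIndex] ON MyTable ([P]);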

The Relation between structure of compound Primary key and performance of the queries in Cassandra

I have an application similar to a weather-recording application, represented as follows:
(cityID, sensorID, StartReadingTime, EndReadingTime, AverageValue)
Each city (cityID) has many sensors (sensorID) which read temperature values.
I have a composite key on (cityID, sensorID, StartReadingTime).
My application has three main queries:
1- Basic selection (Key lookup)
e.g : SELECT * FROM weather WHERE cityID = ? AND sensorID= ? AND StartReadingTime = ? ;
2- Range search
e.g: SELECT * FROM weather WHERE AverageValue > ? AND AverageValue < ?
3- Aggregation with range search
e.g: SELECT count(*) FROM weather WHERE AverageValue > ? AND AverageValue < ?
I created the table using the Cassandra CQL shell with the following primary key:
PRIMARY KEY ((cityID, sensorID, StartReadingTime), AverageValue)
This is the only combination of primary key I have found that lets me run all of my queries without errors.
With this structure, Cassandra partitions the data by the first element of the PRIMARY KEY, which in this case is (cityID, sensorID, StartReadingTime), and clusters the data within each partition by AverageValue. Note that I added AverageValue to the primary key so I can use the greater-than and less-than operators.
My problem with this structure is that the range-search and aggregation queries are very slow compared with MySQL. As I understand it, that is because this primary-key structure forces Cassandra to do a full scan across all partitions to get the results. I also tried creating a secondary index on the AverageValue column, but saw no performance improvement.
My question: what is wrong with this combination of primary, partition and clustering keys? Are there any suggestions for getting benefits from secondary indexing?
The table definition:
CREATE TABLE quote.weather (
cityid int,
sensorid int,
startreadingtime double,
averagevalue double,
endreadingtime double,
PRIMARY KEY ((cityid, sensorid, startreadingtime), averagevalue)
) WITH CLUSTERING ORDER BY (averagevalue ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX av_index ON quote.weather (averagevalue);
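Because the range-search and aggregation queries filter only on AverageValue, nothing in the key above tells Cassandra which partitions to read, so every partition gets scanned. Cassandra modelling is query-driven, so one common workaround is a second, query-specific table whose partition key groups readings into coarse value buckets; a rough sketch (the bucket scheme and table name are purely illustrative):

-- Sketch only: a denormalized table keyed for the range search, written by
-- the application alongside quote.weather. value_bucket is an application-
-- chosen bucket, e.g. floor(averagevalue / 10).
CREATE TABLE quote.weather_by_value (
    value_bucket int,
    averagevalue double,
    cityid int,
    sensorid int,
    startreadingtime double,
    endreadingtime double,
    PRIMARY KEY ((value_bucket), averagevalue, cityid, sensorid, startreadingtime)
);
-- A range search then only touches the buckets overlapping the range, e.g.:
-- SELECT * FROM quote.weather_by_value
-- WHERE value_bucket = 2 AND averagevalue > 20 AND averagevalue < 25;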

Modeling a non-primary Key relationship

I am trying to model the following relationship with the intent of designing classes for EF code first.
Program table:
ProgramID - PK
ProgramName
ClusterCode
Sample data
ProgramID ProgramName ClusterCode
--------------------------------------
1 Spring A
2 Fall A
3 Winter B
4 Summer B
Cluster table:
ID
ClusterCode
ClusterDetails
Sample data:
ID ClusterCode ClusterDetails
---------------------------------
1 A 10
2 A 20
3 A 30
4 B 20
5 B 40
I need to join the Program table to the Cluster table so I can get the list of cluster details for each program.
The SQL would be
Select *
from Programs P
Join Cluster C On P.ClusterCode = C.ClusterCode
Where P.ProgramID = 'xxx'
Note that for the Program table, ClusterCode is not unique.
For the Cluster table, neither ClusterCode nor ClusterDetails is unique.
How would I model this so I can take advantage of navigation properties and code-first?
Assuming you have mapped the two tables above and you are using C#, you can use a simple LINQ join:
// context is assumed to be your DbContext, with Programs and Clusters DbSets.
var clusterDets = new List<string>();
var q =
    from p in context.Programs
    join c in context.Clusters on p.ClusterCode equals c.ClusterCode
    where p.ProgramID == programId
    select new { c.ClusterDetails };
foreach (var v in q)
{
    clusterDets.Add(v.ClusterDetails);
}
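For completeness, here is a minimal sketch of the entity classes that join assumes (property names are taken from the question; the ClusterDetails type is an assumption). Because ClusterCode is not unique on either side, EF cannot treat it as the principal key of a relationship, which is why rows are matched with an explicit join rather than a mapped navigation property:

// Illustrative code-first POCOs only; no association is mapped between them.
public class Program
{
    public int ProgramID { get; set; }         // PK
    public string ProgramName { get; set; }
    public string ClusterCode { get; set; }    // not unique
}

public class Cluster
{
    public int ID { get; set; }                // PK
    public string ClusterCode { get; set; }    // not unique
    public string ClusterDetails { get; set; } // type assumed from the sample data
}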

Hive query, better option to self join

So I am working with a Hive table that is set up like so:
id (Int), mapper (String), mapperId (Int)
Basically, a single ID can have multiple mapperIds, one per mapper, as in the example below:
ID (1) mapper(MAP1) mapperId(123)
ID (1) mapper(MAP2) mapperId(1234)
ID (1) mapper(MAP3) mapperId(12345)
ID (2) mapper(MAP2) mapperId(10)
ID (2) mapper(MAP3) mapperId(12)
I want to return the list of mapperIds associated with each unique ID, as a single row per ID. So for the above example I would want the following returned:
1, 123, 1234, 12345
2, null, 10, 12
The mapper Strings are known, so I was thinking of doing a self join for every mapper string I am interested in, but I was wondering if there was a more optimal solution?
If the assumption that the mapper column is distinct for a given ID is correct, you could collect the mapper and mapperid columns into a Map using the Brickhouse collect UDAF. You can clone the Brickhouse repo and build the jar with Maven.
Query:
add jar /complete/path/to/jar/brickhouse-0.7.0-SNAPSHOT.jar;
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select id
,id_map['MAP1'] as mapper1
,id_map['MAP2'] as mapper2
,id_map['MAP3'] as mapper3
from (
select id
,collect(mapper, mapperid) as id_map
from some_table
group by id
) x
Output:
| id | mapper1 | mapper2 | mapper3 |
------------------------------------
   1       123      1234      12345
   2      NULL        10         12
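If pulling in an external jar is not an option, a similar pivot can be written with Hive's built-in conditional aggregation. A sketch, assuming the mapper values are exactly MAP1/MAP2/MAP3 and at most one mapperId per (id, mapper) pair:

-- Sketch: pivot with built-in conditional aggregation instead of the
-- Brickhouse collect UDAF; max() picks the single non-null value per mapper.
select id
    ,max(case when mapper = 'MAP1' then mapperId end) as mapper1
    ,max(case when mapper = 'MAP2' then mapperId end) as mapper2
    ,max(case when mapper = 'MAP3' then mapperId end) as mapper3
from some_table
group by id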

SQL Server 2008 Stored Proc Performance where Column = NULL

When I execute a certain stored procedure (which selects from a non-indexed view) with a non-null parameter, it's lightning fast at about 10ms. When I execute it with a NULL parameter (resulting in a FKColumn = NULL query) it's much slower at about 1200ms.
I've executed it with the actual execution plan and it appears the most costly portion of the query is a clustered index scan with the predicate IS NULL on the fk column in question - 59%! The index covering this column is (AFAIK) good.
So what can I do to improve the performance here? Change the fk column to NOT NULL and fill the nulls with a default value?
SELECT top 20 dbo.vwStreamItems.ItemId
,dbo.vwStreamItems.ItemType
,dbo.vwStreamItems.AuthorId
,dbo.vwStreamItems.AuthorPreviewImageURL
,dbo.vwStreamItems.AuthorThumbImageURL
,dbo.vwStreamItems.AuthorName
,dbo.vwStreamItems.AuthorLocation
,dbo.vwStreamItems.ItemText
,dbo.vwStreamItems.ItemLat
,dbo.vwStreamItems.ItemLng
,dbo.vwStreamItems.CommentCount
,dbo.vwStreamItems.PhotoCount
,dbo.vwStreamItems.VideoCount
,dbo.vwStreamItems.CreateDate
,dbo.vwStreamItems.Language
,dbo.vwStreamItems.ProfileIsFriendsOnly
,dbo.vwStreamItems.IsActive
,dbo.vwStreamItems.LocationIsFriendsOnly
,dbo.vwStreamItems.IsFriendsOnly
,dbo.vwStreamItems.IsDeleted
,dbo.vwStreamItems.StreamId
,dbo.vwStreamItems.StreamName
,dbo.vwStreamItems.StreamOwnerId
,dbo.vwStreamItems.StreamIsDeleted
,dbo.vwStreamItems.RecipientId
,dbo.vwStreamItems.RecipientName
,dbo.vwStreamItems.StreamIsPrivate
,dbo.GetUserIsFriend(@RequestingUserId, vwStreamItems.AuthorId) as IsFriend
,dbo.GetObjectIsBookmarked(@RequestingUserId, vwStreamItems.ItemId) as IsBookmarked
from dbo.vwStreamItems WITH (NOLOCK)
where 1 = 1
and vwStreamItems.IsActive = 1
and vwStreamItems.IsDeleted = 0
and vwStreamItems.StreamIsDeleted = 0
and (
StreamId is NULL
or
ItemType = 'Stream'
)
order by CreateDate desc
When it's not null, do you have
and vwStreamItems.StreamIsDeleted = 0
and (
StreamId = 'xxx'
or
ItemType = 'Stream'
)
or
and vwStreamItems.StreamIsDeleted = 0
and (
StreamId = 'xxx'
)
You have an OR clause there which is most likely the problem, not the IS NULL as such.
The plans will show why: the OR still forces a scan, but with StreamId = 'xxx' it stays selective enough to be manageable; with IS NULL you lose that selectivity.
I'd suggest changing your index to make StreamId the right-most column.
However, a view is simply a macro that gets expanded, so the underlying query against the base tables could be complex and not easy to optimise...
The biggest performance gain would come from losing the GetUserIsFriend and GetObjectIsBookmarked functions and using JOINs for the same functionality. Calling a scalar function inside a query is basically the same as using a FOR loop: the function is invoked row by row to compute each value. If you joined the tables instead, the values for all rows would be computed together in one pass.
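As a rough illustration of that advice (the dbo.Friends table and its columns are assumptions here, since the function bodies are not shown), the IsFriend flag could be derived with an outer join instead of a scalar function:

-- Sketch only: replace dbo.GetUserIsFriend with a set-based outer join.
-- dbo.Friends(UserId, FriendId) is a hypothetical table standing in for
-- whatever the scalar function looks up internally.
SELECT top 20 si.ItemId
    ,si.CreateDate
    ,CASE WHEN f.FriendId IS NOT NULL THEN 1 ELSE 0 END as IsFriend
from dbo.vwStreamItems si
left join dbo.Friends f
    on f.UserId = @RequestingUserId
    and f.FriendId = si.AuthorId
where si.IsActive = 1
    and si.IsDeleted = 0
    and si.StreamIsDeleted = 0
    and (si.StreamId is NULL or si.ItemType = 'Stream')
order by si.CreateDate desc

The same idea applies to GetObjectIsBookmarked; with both scalar functions replaced by joins, the whole query can be evaluated as one set-based plan.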
