Finding the location of millions of records from IPAddress - sql-server

I have a table that has 2.5 million IP addresses mostly in the U.S. I want to use this data to find their time zone. So far I have downloaded the GeoLite city tables from Maxmind and imported them to my server.
http://dev.maxmind.com/geoip/legacy/geolite/
The first Maxmind table (Blocks) has a starting IP integer column, an ending IP integer column, and a LocID that applies to every integer within that range. The table starts at the integer 16 million and goes up to about 1.5 billion. The second table has geographical information corresponding to the LocID in the first table.
In a CTE, I used the code below to convert the IP addresses in my table to integer format. The code seems to output the correct value. I also included the primary key ID column and the regular IP address. The CTE is called CTEIPInteger.
(TRY_CONVERT(bigint, PARSENAME(IpAddress,1)) +
TRY_CONVERT(bigint, PARSENAME(IpAddress,2)) * 256 +
TRY_CONVERT(bigint, PARSENAME(IpAddress,3)) * 65536 +
TRY_CONVERT(bigint, PARSENAME(IpAddress,4)) * 16777216 ) as IPInteger
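Put together, the CTE looks roughly like this (a sketch only; the source table name IPTable and its ID column are placeholders for my actual names):

WITH CTEIPInteger AS
(
    SELECT ID,
           IpAddress,
           TRY_CONVERT(bigint, PARSENAME(IpAddress,1)) +
           TRY_CONVERT(bigint, PARSENAME(IpAddress,2)) * 256 +
           TRY_CONVERT(bigint, PARSENAME(IpAddress,3)) * 65536 +
           TRY_CONVERT(bigint, PARSENAME(IpAddress,4)) * 16777216 AS IPInteger
    FROM IPTable -- placeholder name for the table holding the 2.5 million addresses
)
SELECT ID, IpAddress, IPInteger
FROM CTEIPInteger;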
I then created a non clustered index on both the starting and ending IP integer columns.
I tried using a join as follows.
select IPAddress,IPInteger,LocID
from CTEIPInteger join Blocks
on IPInteger>= StartIpNum and IPInteger<=EndIpNum
The first 1000 records load pretty quickly, but after that the query runs forever without outputting anything.
For the Blocks table, I have also tried an index on just StartIPNum, and an index on only the LocID.
How should I obtain the time zones? Am I using the right database? If I have to I might be willing to pay for Geolocation service.

Related

Postgres ordering table by element in large data set

I have a tricky problem trying to find an efficient way of ordering a set of objects (~1000 rows) that contain a large (~5 million) number of indexed data points. In my case I need a query that allows me to order the table by a specific datapoint. Each datapoint is a 16-bit unsigned integer.
I am currently solving this problem by using a large array:
Object Table:
id serial NOT NULL,
category_id integer,
description text,
name character varying(255),
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
data integer[],
GIST index:
CREATE INDEX object_rdtree_idx
ON object
USING gist
(data gist__intbig_ops)
This index is not currently being used when I do a select query, and I am not certain it would help anyway.
Each day the array field is updated with a new set of ~5 million values
I have a webserver that needs to list all objects ordered by the value of a particular data point:
Example Query:
SELECT name, data[3916863] as weight FROM object ORDER BY weight DESC
Currently, it takes about 2.5 seconds to perform this query.
Question:
Is there a better approach? I am happy for the insertion side to be slow as it happens in the background, but I need the select query to be as fast as possible. In saying this, there is a limit to how long the insertion can take.
I have considered creating a lookup table where every value has its own row - but I'm not sure how the insertion/lookup time would be affected by this approach, and I suspect entering 1000+ records with ~5 million data points each as individual rows would be too slow.
Currently inserting a row takes ~30 seconds which is acceptable for now.
Ultimately I am still on the hunt for a scalable solution to the base problem, but for now I need this solution to work, so this solution doesn't need to scale up any further.
Update:
I was wrong to dismiss having a giant table instead of an array: while insertion time massively increased, query time is reduced to just a few milliseconds.
I am now altering my generation algorithm to only save a datum if it is non-zero and has changed since the previous update. This has reduced insertions to just a few hundred thousand values, which only takes a few seconds.
New Table:
CREATE TABLE data
(
object_id integer,
data_index integer,
value integer
)
CREATE INDEX index_data_on_data_index
ON data
USING btree
("data_index");
New Query:
SELECT name, coalesce(value,0) as weight FROM objects LEFT OUTER JOIN data on data.object_id = objects.id AND data_index = 7731363 ORDER BY weight DESC
Insertion Time: 15,000 records/second
Query Time: 17ms
First of all, do you really need a relational database for this? You do not seem to be relating some data to some other data. You might be much better off with a flat-file format.
Secondly, your index on data is useless for the query you showed. You are querying for a datum (a position in your array) while the index is built on the values in the array. Dropping the index will make the inserts considerably faster.
If you have to stay with PostgreSQL for other reasons (bigger data model, MVCC, security) then I suggest you change your data model: store data as a bytea column with STORAGE EXTERNAL. Since the data column is about 4 x 5 million = 20MB it will be stored out-of-line anyway, but if you set it explicitly, then you know exactly what you have.
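In SQL, the storage part of that change might look like this (a sketch; it assumes the column has already been converted to bytea by the application or a separate migration step):

ALTER TABLE object ALTER COLUMN data SET STORAGE EXTERNAL;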
Then create a custom function in C that fetches your data value "directly" using the PG_GETARG_BYTEA_P_SLICE() macro and that would look somewhat like this (I am not a very accomplished PG C programmer so forgive me any errors, but this should help you on your way):
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * get_data_value() -- Get a 4-byte value from a bytea
 *   Arg 0: bytea*  The data
 *   Arg 1: int32   The position of the element in the data, 1-based
 */
PG_FUNCTION_INFO_V1(get_data_value);

Datum
get_data_value(PG_FUNCTION_ARGS)
{
    int32  element = PG_GETARG_INT32(1) - 1;      /* second argument, make 0-based */
    bytea *data = PG_GETARG_BYTEA_P_SLICE(0,      /* first argument */
                      element * sizeof(int32),    /* offset into the data */
                      sizeof(int32));             /* fetch just the required 4 bytes */

    /* the slice holds exactly one int32; return it */
    PG_RETURN_INT32(*((int32 *) VARDATA(data)));
}
The PG_GETARG_BYTEA_P_SLICE() macro retrieves only a slice of data from the disk and is therefore very efficient.
There are some samples of creating custom C functions in the docs.
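For example, once the function above has been compiled into a shared library, it might be registered roughly like this (the library name get_data_value is an assumption):

CREATE FUNCTION get_data_value(bytea, integer) RETURNS integer
    AS 'get_data_value', 'get_data_value'
    LANGUAGE C STRICT;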
Your query now becomes:
SELECT name, get_data_value(data, 3916863) AS weight FROM object ORDER BY weight DESC;

Adding datalength condition makes query slow

I have a table mytable with some columns including the column datekey (which is a date and has an index), a column contents which is a varbinary(max), and a column stringhash which is a varchar(100). The stringhash and the datekey together form the primary key of the table. Everything is running on my local machine.
Running
SELECT TOP 1 * FROM mytable where datekey='2012-12-05'
returns 0 rows and takes 0 seconds.
But if I add a datalength condition:
SELECT TOP 1 * FROM mytable where datekey='2012-12-05' and datalength(contents)=0
it runs for a very long time and does not return anything before I give up waiting.
My question:
Why? How do I find out why this takes such a long time?
Here is what I checked so far:
When I click "Display estimated execution plan" it also takes a very long time and does not return anything before I give up waiting.
If I do
SELECT TOP 1000 datalength(contents) FROM mytable order by datalength(contents) desc
it takes 7 seconds and returns a list 4228081, 4218689 etc.
exec sp_spaceused 'mytable'
returns
rows      reserved       data           index_size   unused
564019    50755752 KB    50705672 KB    42928 KB     7152 KB
So the table is quite large at 50 GB.
Running
SELECT TOP 1000 * FROM mytable
takes 26 seconds.
The sqlservr.exe process uses around 6 GB, which is the memory limit I have set for SQL Server.
It takes a long time because your query needs DATALENGTH to be evaluated for every row, and then the results sorted, before it can return the first record.
If the DATALENGTH of the field (or whether it contains any value) is something you're likely to query repeatedly, I would suggest an additional indexed field (perhaps a persisted computed field) holding the result, and searching on that.
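A minimal sketch of that suggestion, assuming the column and index names shown here (they are placeholders, not objects from the question):

ALTER TABLE mytable ADD contents_length AS DATALENGTH(contents) PERSISTED;
CREATE INDEX IX_mytable_contents_length ON mytable (contents_length);

-- The search can then use the small indexed value instead of touching the LOB data:
SELECT TOP 1 * FROM mytable WHERE datekey = '2012-12-05' AND contents_length = 0;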
This old MSDN blog post seems to agree with MartW's answer that DATALENGTH is evaluated for every row. But it's good to understand what "evaluated" really means here, and what the real root of the performance degradation is.
As mentioned in the question, the size of each value in the contents column may be large. That means every value bigger than ~8 KB is stored in special LOB storage. So, taking into account the size of the other columns, it's clear that most of the space occupied by the table is taken up by this LOB storage, i.e. around 50 GB.
Even if the length of the contents column has already been evaluated for every row, as the post linked above demonstrates, that length is still stored in the LOB, so the engine still needs to read parts of the LOB storage to execute the query.
If the LOB storage isn't in RAM at the time of query execution, it has to be read from disk, which is of course much slower than reading from RAM. The reads of the LOB parts are also likely to be random rather than sequential, which is slower still, since it tends to increase the total number of blocks that must be read from disk.
At the moment the query probably won't use the primary key, because the stringhash column comes before the datekey column in it. Try adding an additional index that contains just the datekey column. Once that index is created, if the query is still slow you could also try a query hint such as:
SELECT TOP 1 * FROM mytable WITH (INDEX(IX_datekey)) WHERE datekey='2012-12-05' AND DATALENGTH(contents)=0
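For reference, the additional index on datekey mentioned above could be created like this (IX_datekey is simply the name used in the hint, not an existing index):

CREATE INDEX IX_datekey ON mytable (datekey);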
You could also create a separate length column that's updated either in your application or in an insert/update trigger.

Choosing appropriate DataType in SQL Server table

I have a large transaction table in SQL Server which is used to store about 400-500 records each day. What data type should I use for my PK column? The PK column stores numeric values, for which integer seems suitable, but I'm afraid it will exceed the maximum value for integer since I have so many records every day.
I am currently using integer data type for my PK column.
With a type INT, starting at 1, you get over 2 billion possible rows - that should be more than sufficient for the vast majority of cases. With BIGINT, you get roughly 9.2 quintillion (9'223'372'036'854'775'807, a 19-digit number) - enough for you??
If you use an INT IDENTITY starting at 1, and you insert a row every second, you need 66.5 years before you hit the 2 billion limit .... so with 400-500 rows per day it will take centuries before you run out of possible values... even at 1'000 rows per day you would be fine for 5883 years - good enough?
If you use a BIGINT IDENTITY starting at 1, and you insert one thousand rows every second, you need a mind-boggling 292 million years before you hit the BIGINT limit ....
Read more about it (with all the options there are) in the MSDN Books Online.
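As a minimal sketch (the table and column names are placeholders, not taken from the question):

-- INT IDENTITY: good for roughly 2.1 billion rows
CREATE TABLE dbo.Transactions
(
    TransactionID INT IDENTITY(1,1) NOT NULL PRIMARY KEY
);

-- BIGINT IDENTITY: good for roughly 9.2 quintillion rows
-- TransactionID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY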
I may be wrong here as maths has never been my strong point, but if you use bigint this has a max size of 2^63-1 (9,223,372,036,854,775,807)
so if you divide that by say 500 to get roughly the number of days-worth of records you get 18446744073709600 days-worth of 500 new records.
divide again by 365, gives you 50539024859478.2 years-worth of 500 records a day
so (((2^63-1) / 500) / 365)
if that's not me being stupid then that's a lot of days :-)

Efficient way to compute number of visits in an IP range and time range

Assume there is a popular web server. The number of visits to this web server can be tens of thousands in an hour, and in order to analyse the statistical properties of these visits, we want to know the number of requests in a specific time range and IP range.
For example, we have 10^12 requests in the following format:
(IP address , visiting time)
Suppose we want to know how many visits came from the IP range [10.12.72.0, 10.12.72.255] between 2 p.m. and 4 p.m.
The only candidate ideas I can think of are:
(1) Use a B-tree to index this large data set on one dimension, for instance build a B-tree on the IP address. Using this B-tree we can quickly get the number of requests coming from any specific IP range, but how can we know how many of those visits happened between 2 p.m. and 4 p.m.?
(2) Use a bitmap. But, similar to the B-tree, due to the space requirement the bitmap can only be built on one dimension, for instance the IP address, so we don't know how many of those requests were issued between 2 p.m. and 4 p.m.
Is there an efficient algorithm for this? The number of queries can be quite large.
Your first step is to figure out the precision you need...
TIME:
Do you need timestamps to the millisecond, or is to-the-hour precision good enough?
Number of hours since 1970 fits in under a million: 3 bytes (~integer)
Number of milliseconds: 8 bytes (~long)
IP:
Are all your IPs v4 (4 bytes) or v6 (16 bytes)?
Will you ever search by specific IP or will you only use IP ranges?
If the latter you could just use the class C for each IP 123.123.123.X (3 bytes)
Assuming:
1 hour time precision is good enough
3 byte IP class C is good enough
Re-organizing your data (two possible structures, pick one):
Database:
You can use a relational database (a concrete sketch of this table follows the flat-file option below)
Table: Hits
IPClassC INT NON-CLUSTERED INDEX
TimeHrsUnix INT NON-CLUSTERED INDEX
Count BIGINT DEFAULT VALUE (1)
Flat Files:
You can use multiple flat files
Have 1 flat file for each class C IP that appears in your logs (max 2^24)
Each file is 8 B (bigint) * ~1M (hours from 1970 to 2070) = 8 MB in size
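For the database option above, a concrete version of the Hits table might look like this (a sketch only; the composite primary key and the extra index are assumptions, not part of the original outline):

CREATE TABLE Hits
(
    IPClassC    INT    NOT NULL,
    TimeHrsUnix INT    NOT NULL,
    Count       BIGINT NOT NULL DEFAULT (1),
    PRIMARY KEY (IPClassC, TimeHrsUnix)
);

CREATE NONCLUSTERED INDEX IX_Hits_TimeHrsUnix ON Hits (TimeHrsUnix);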
How to load your new data structure:
Database:
Parse your logs (read in memory one line at a time)
Convert record to 3 byte IP and 3 byte Time
Convert your IP class C to an integer and your Time hrs to an integer
IF EXISTS(SELECT * FROM Hits WHERE IPClassC = @IP AND TimeHrsUnix = @Time)
    UPDATE Hits SET Count = Count + 1 WHERE IPClassC = @IP AND TimeHrsUnix = @Time
ELSE
    INSERT INTO Hits (IPClassC, TimeHrsUnix) VALUES(@IP, @Time)
Flat Files:
Parse your logs (read in memory one line at a time)
Convert record to 3 byte IP and 3 byte Time
Convert your IP to a string and your time to an integer
if File.Exist(IP) = False
File.Create(IP)
File.SetSize(IP, 8 * 1000000)
CountBytes = File.Read(IP, 8 * Time, 8)
NewCount = Convert.ToLong(CountBytes) + 1
CountBytes = Convert.ToBytes(NewCount)
File.Write(IP, CountBytes, 8 * Time, 8)
Querying your new data structures:
Database:
SELECT SUM(Count) FROM Hits WHERE IPClassC BETWEEN @IPFrom AND @IPTo AND TimeHrsUnix BETWEEN @TimeFrom AND @TimeTo
Flat File:
Total = 0
Offset = 8 * TimeFrom
Len = (8 * TimeTo) - Offset
For IP = IPFrom To IPTo
If File.Exist(IP.ToString())
CountBytes = File.Read(IP.ToString(), Offset, Len)
LongArray = Convert.ToLongArray(CountBytes)
Total = Total + Math.Sum(LongArray)
Next IP
Some extra tips:
If you go the database route, you're likely going to have to use multiple partitions for the database file
If you go the flat file route, you may want to break your query into threads (assuming your SAS will handle the bandwidth). Each thread would handle a subset of the IPs/files in the range. Once all threads have completed, the totals from each would be summed.
You want a data structure that supports orthogonal range counting.
10^12 is a large number (TERA) - certainly too large for in-memory processing.
I would store this in a relational database with a star schema, use a time dimension, and pre-aggregate by time of day (e.g. hour bands), IP subnets, and other criteria that you are interested in.

100k Rows Returned in a random order, without a SQL time out please

Ok,
I've been doing a lot of reading on returning a random row set over the last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows. But when we are getting >10-20k rows we are getting SQL timeouts. The execution plan tells me that 76% of my query cost comes from this line, and removing this line increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more details.
We have a table with 2.6 million 4 digit alpha-numeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5000 capacity, a random set of 5000 of these will be drawn from the table, issued to each customer as a bar-code, and the bar-code scanning app at the door will have the same list of 5000. The reason for using a 4 digit alpha-numeric code (and not a stupidly long number like a GUID) is that it is easy for people to write the number down (or SMS it to a friend) and just bring the number and have it entered manually, so we don't want a large number of characters. Customers love the last bit btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
Thanks,
Jo
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
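The filtering pattern from that article is roughly of this form (a sketch, not a verbatim quote; the table/column names and the 5% sampling rate are illustrative):

SELECT code
FROM codes
WHERE (ABS(CAST(BINARY_CHECKSUM(code, NEWID()) AS bigint)) % 100) < 5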
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,615 (36^4 = 1,679,616 is the number of unique four-digit alphanumeric codes, ignoring case, so your 2.6 million rows must contain duplicates) and convert them to your four-digit codes.
You don't have to sort.
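For the conversion step, something along these lines turns a number into a four-character base-36 code (a sketch; the character set and variable names are just for illustration):

DECLARE @n int, @chars char(36)
SET @n = 123456
SET @chars = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
SELECT SUBSTRING(@chars, (@n / 46656) % 36 + 1, 1)   -- 46656 = 36 * 36 * 36
     + SUBSTRING(@chars, (@n / 1296)  % 36 + 1, 1)   -- 1296  = 36 * 36
     + SUBSTRING(@chars, (@n / 36)    % 36 + 1, 1)
     + SUBSTRING(@chars, @n % 36 + 1, 1) AS code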
DECLARE @RandomNumber int
DECLARE @Threshold float
SELECT @RandomNumber = COUNT(*) FROM customers
SELECT @Threshold = 50000.0 / @RandomNumber
SELECT TOP 50000 * FROM customers WHERE rand() > @Threshold ORDER BY newid()
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())
One thought is to break the process down into steps. Add a column to the table for a GUID, then do an update statement on the table to populate the GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column to receive the results the same way.
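A sketch of that approach (table, column, and index names are placeholders):

ALTER TABLE codes ADD RandomOrder uniqueidentifier NULL
UPDATE codes SET RandomOrder = NEWID()
CREATE INDEX IX_codes_RandomOrder ON codes (RandomOrder)

SELECT TOP 100000 * FROM codes ORDER BY RandomOrder

Note that this returns the same "random" order on every query until the column is re-shuffled with another UPDATE.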
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode,2,1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode,3,1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode,4,1)) as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code takes all of your alpha-numeric venue codes, converts them to an integer, and then splits the entire table into 500,000 buckets, of which you take the top 50000 that fall between 0 and 50000. You can play with the number after the % sign (500,000) and you can play with the BETWEEN range. This should randomize it for you. Not sure if the where clause will bite you on performance, but it's worth a shot. Also, without an order by, there is no guarantee of the order (if you have multiple CPUs and threading).
