Export from BigQuery using tabledata.list is slow

I have written a simple Java application to export tables from Google BigQuery using the tabledata.list method (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list), with pageToken for paging. No matter what I set the maxResults parameter to, I can only retrieve about 5,000 rows per request (depending on row size). As requests take several seconds, this way I can only download about 100 MB per minute on average.
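For reference, the paging loop being described looks roughly like the following Python sketch (the question uses Java; the project, dataset and table IDs are placeholders):

from googleapiclient.discovery import build

# Sketch of paging through tabledata.list; credentials setup is omitted
# (assumes application default credentials are available).
bigquery = build("bigquery", "v2")
page_token = None
while True:
    response = bigquery.tabledata().list(
        projectId="myproject", datasetId="mydataset", tableId="mytable",
        maxResults=10000, pageToken=page_token).execute()
    for row in response.get("rows", []):
        pass  # each row is {"f": [{"v": ...}, ...]}
    # tabledata.list returns the continuation token as "pageToken"
    page_token = response.get("pageToken")
    if not page_token:
        break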
The ways I have found to speed this up so far:
Batch export to Google Cloud Storage (not good in my case)
Parallelising requests:
using startIndex
using dynamic table partitions
It seems the most performant way for my use case is the last option, combined with the snapshot decorator to get a stable result in case the table changes:
myproject:mydataset.mytable#timestamp$0-of-3
myproject:mydataset.mytable#timestamp$1-of-3
myproject:mydataset.mytable#timestamp$2-of-3
So my questions are:
Is there a better (= faster) approach?
Do tabledata.list requests count against the limit of 50 concurrent requests?

You can first export the BigQuery table to Google Cloud Storage using the configuration.extract property of jobs.insert.
Then you can download the file to a location of your choice.
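A minimal Python sketch of such an extract job, assuming application default credentials and placeholder project, dataset, table and bucket names:

from googleapiclient.discovery import build

# Sketch of a BigQuery extract job (configuration.extract) to Google Cloud Storage.
bigquery = build("bigquery", "v2")
job = {
    "configuration": {
        "extract": {
            "sourceTable": {
                "projectId": "myproject",
                "datasetId": "mydataset",
                "tableId": "mytable",
            },
            "destinationUris": ["gs://mybucket/mytable-*.csv.gz"],
            "destinationFormat": "CSV",
            "compression": "GZIP",
        }
    }
}
bigquery.jobs().insert(projectId="myproject", body=job).execute()
# Poll jobs.get until status.state == "DONE", then download the exported files from GCS.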

Related

How to efficiently store and retrieve aggregated data in DynamoDB?

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and articles can be tagged.
user
{
  id : 1235,
  name : "John",
  ...
}
article
{
  id : 789,
  title : "dynamodb use cases",
  author : 12345, //userid
  tags : ["dynamodb","aws","nosql","document database"]
}
In the user interface we want to show, for the current user, the tags and their respective counts.
How can we achieve the following aggregation?
{
  userid : 12,
  tag_stats : {
    "dynamodb" : 3,
    "nosql" : 8
  }
}
We will provide this data through a REST API and it will be called frequently, since this information is shown on the app's main page.
I can think of extracting all documents and doing the aggregation at the application level, but I feel my read capacity units would be exhausted.
We could use tools like EMR, Redshift, BigQuery or AWS Lambda, but I think these are meant for data warehousing.
I would like to know of other and better ways of achieving the same.
How are people achieving simple dynamic queries like these, having chosen DynamoDB as their primary data store, considering cost and response time?
Long story short: DynamoDB does not support this. It's not built for this use case. It's intended for quick data access with low latency. It simply does not support any aggregation functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on stale data. The benefit of this approach is that it consumes RCUs just once, but you will be working with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case the queries access the data in DynamoDB directly. The downside is that they consume read capacity on every query you run.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map with tags and counts as an attribute. On every update to your original data, DynamoDB Streams will execute a Lambda function or some code on your hosts to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
Of course you can extract data at the application level and aggregate it there, but I would not recommend doing it. Unless you have a small table, you will need to think about throttling, about using just part of your provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and about how to distribute your work among multiple workers.
Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Also, both Redshift and Hive can use a predefined percentage of your RCU throughput.
DynamoDB is a pure key/value store and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation we need an extra table user-stats holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function user-stats-aggregate which is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation over the articles table.
The Lambda will perform the following logic (a minimal sketch follows after these steps):
If there is no old image, get the current tags and increase the count of each of them by 1 in the database for this user. (Keep in mind there may be no initial record in user-stats for this user.)
If there is an old image, work out which tags were added or removed and apply a +1 or -1 change for each affected tag of the received user.
Stand up an API service that retrieves these user stats.
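A minimal Python sketch of that Lambda, assuming the stream is configured with NEW_AND_OLD_IMAGES and that the aggregate table keeps one item per (userId, tag) pair instead of a nested map; the table and attribute names here are illustrative, not prescribed by the answer:

import boto3

dynamodb = boto3.resource("dynamodb")
stats_table = dynamodb.Table("user-stats")  # assumed keys: userId (hash), tag (range)

def _tags(image):
    # Stream images carry DynamoDB-typed values, e.g. {"L": [{"S": "nosql"}, ...]}
    return {t["S"] for t in image.get("tags", {}).get("L", [])}

def handler(event, context):
    for record in event["Records"]:
        old = record["dynamodb"].get("OldImage", {})
        new = record["dynamodb"].get("NewImage", {})
        user_id = (new or old)["author"]["N"]  # author is stored as a number in the article item
        added = _tags(new) - _tags(old)
        removed = _tags(old) - _tags(new)
        for tag, delta in [(t, 1) for t in added] + [(t, -1) for t in removed]:
            stats_table.update_item(
                Key={"userId": user_id, "tag": tag},
                UpdateExpression="ADD #c :d",  # creates the counter if it does not exist yet
                ExpressionAttributeNames={"#c": "count"},  # "count" is a reserved word
                ExpressionAttributeValues={":d": delta},
            )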
Usually, aggregation in DynamoDB can be done using DynamoDB Streams, Lambdas for doing the aggregation, and extra tables keeping the aggregated results at different granularities (minutes, hours, days, years, ...).
This gives near-real-time aggregation without the need to compute it on the fly for every request; you query the aggregated data instead.
Basic aggregation can also be done using scan() and query() in a Lambda.
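For completeness, a scan-based aggregation could look roughly like the sketch below (table and attribute names are assumptions); keep in mind it reads the whole table and therefore consumes RCUs on every call:

from collections import Counter
import boto3

table = boto3.resource("dynamodb").Table("articles")  # hypothetical table name

def tag_stats_for(user_id):
    # Count tags across all articles written by user_id, paging through the scan.
    counts = Counter()
    kwargs = {}
    while True:
        page = table.scan(**kwargs)  # consumes read capacity on every call
        for item in page["Items"]:
            if item.get("author") == user_id:
                counts.update(item.get("tags", []))
        if "LastEvaluatedKey" not in page:
            return dict(counts)
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]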

Google Search API Wildcard

I have a Python project running on Google App Engine. I have a set of data currently stored in the datastore. On the user side, I fetch it from my API and show it to the user in a Google Visualization table with client-side search. Because of the limitations, I can only fetch 1000 records per query. I want my users to be able to search across all the records I have. I could fetch them with multiple queries before showing them, but fetching 1000 records already takes 5-6 seconds, so the whole process could exceed the 30-second timeout, and I don't think putting around 20,000 records in a table is a good idea.
So I decided to put my records into the Google Search API. I wrote a script to sync the important data between the datastore and the Search API index. When performing a search, I couldn't find anything like a wildcard character. For example, let's say the user field stores a string containing the value "Ilhan". When a user searches for "Ilha", that record does not show up. I want to show records that include the value "Ilhan" even if it is only partially typed. So basically the SQL equivalent of my search would be something like "select * from users where user like '%ilh%'".
I wonder if there is a way to do that, or is this not how the Search API works?
I set up similar functionality purely within the datastore. I have a repeated computed property that contains all the search substrings that can be formed for a given object.
class User(ndb.Model):
    # ... other fields
    search_strings = ndb.ComputedProperty(
        lambda self: [i.lower() for i in all_substrings(strings=[
            self.email,
            self.first_name,
            self.last_name,
        ])],
        repeated=True)
Your search query would then look like this:
User.query(User.search_strings == search_text.strip().lower()).fetch_page(20)
If you don't need the other features of the Google Search API, and if the number of substrings per entity won't put you at risk of hitting the 900 properties limit, then I'd recommend doing this instead, as it's pretty simple and straightforward.
As for taking 5-6 seconds to fetch 1000 records: do you need to fetch that many? Why not fetch only 100 or even 20, and use the query cursor so the user pulls the next page only if they need it.
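The all_substrings helper is referenced but not shown in the answer; a minimal sketch of what it might look like, with length bounds to keep the number of repeated values in check, is:

def all_substrings(strings, min_length=3, max_length=20):
    # Hypothetical helper: every substring of every input string,
    # bounded in length so the repeated property stays reasonably small.
    result = set()
    for s in strings:
        if not s:
            continue
        s = s.lower()
        for start in range(len(s)):
            for end in range(start + min_length, min(len(s), start + max_length) + 1):
                result.add(s[start:end])
    return sorted(result)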

Fetch large JSON from Datastore

I've created an API on Google Cloud Endpoints that gets all the data of a single entity kind in Datastore. The NoSQL request (a really simple one: SELECT * FROM Entity) is performed with Objectify.
This kind is populated with 200 rows (entities), and each row (entity) has a list of child entities of the same kind:
MEAL:
String title
int preparationTime
List< Ingredient > listOfIngredients (child entities...)
...
So when I call the API, a JSON response is returned. Its size is about 641 KB and it has about 17K lines.
The API explorer tells me that the request takes 4 seconds to execute.
I would like to decrease that time, because it is really high. I have already:
Increased the GAE instance class to F2
Enabled Memcache
It helps a little, but I don't think this is the most efficient way...
Should I use BigQuery to generate the JSON file faster? Or maybe there is another solution?
Do you need all the entities in a single request?
If not, then you can batch-fetch entities using cursor queries and display them as you need, e.g. fetch 20 or 30 entities at a time depending on your needs (a cursor sketch follows after this answer).
If yes:
Does your meal entity change often?
If not, you can generate a JSON file and store it in GCS, and whenever the entity changes you can update the JSON file. Fetching on the client side will then be a lot faster, and with the ETag header new content can be pulled easily.
If yes, then I think batch fetching is the only effective way to pull that many entities.
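The question uses Objectify, but as an illustration of the cursor idea, here is the same pattern sketched in Python NDB (the kind, fields and page size are placeholders); in Objectify the equivalent is roughly a query with limit() plus startAt(cursor):

from google.appengine.ext import ndb

class Meal(ndb.Model):  # stand-in for the MEAL entity in the question
    title = ndb.StringProperty()
    preparationTime = ndb.IntegerProperty()

def fetch_meals_page(websafe_cursor=None, page_size=30):
    # Fetch one page and hand back a cursor the client can send for the next page.
    cursor = ndb.Cursor(urlsafe=websafe_cursor) if websafe_cursor else None
    meals, next_cursor, more = Meal.query().fetch_page(page_size, start_cursor=cursor)
    return meals, (next_cursor.urlsafe() if more and next_cursor else None)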

How to implement an efficient map tile engine on Google App Engine?

I am trying to implement a map tile engine on Google App Engine.
The map data is stored in the database, the datastore (Bigtable). The problem is that 20 requests might come in at approximately the same time to draw 20 tiles based on the same set of rows in the database.
So 20 requests come in; if I write the code to read from the database for each request, then I will be doing 20 identical reads from the database, one read for each tile image output. Since each read is the same query, it doesn't make sense to run the same query 20 times. In fact, this is very inefficient.
Can anyone suggest a better way to do this?
If I use memcache, I need to put the data into memcache, but if 20 requests come in at the same time for the data and I do a naive implementation, then 20 processes will be writing to memcache, since they are all running at the same time in parallel.
I am programming in Google Go version 1 beta on Google App Engine; I refer to the Python docs here since they are more complete.
References:
Google datastore: http://code.google.com/appengine/docs/python/datastore/overview.html
Leaflet JS (which I am using for showing map tiles): http://leaflet.cloudmade.com/
To clarify: I generate the tile images from data in the database. That is, I query the database for the data (this is not the tile image), then I draw the data into an image and render the image as a JPEG. GAE is efficient at drawing images on the server side: http://blog.golang.org/2011/12/from-zero-to-go-launching-on-google.html
I don't know how Google App Engine does it, but MySQL has a query cache, so that if the same query gets asked twice in a row it uses the results from the first to answer the second. Google is smart about things, so hopefully they do that as well. (You might be able to figure out whether they do by timing it.)
One thing you might need to make sure of is that the queries are exactly the same, not just returning the same results. For example, you don't want query1 to be
SELECT lat, lng FROM mytable WHERE tileX=1 AND tileY=1
and query2 to be
SELECT lat, lng FROM mytable WHERE tileX=1 AND tileY=2
I make tiles with gazillions of polygons, and when I did timing and optimization, I found to my surprise that it was faster to return ALL values and weed out the ones I didn't want in PHP than it was to add a WHERE clause to the SQL. I think that was partly because the WHERE clause was different for every tile, so the MySQL server couldn't cache effectively.
1. Organize tile entities so that you can find them via key instead of querying for them, i.e. using get() instead of query(). If you identify a tile based on several criteria, then create a natural ID by combining the criteria. E.g. if you find a tile based on its vertical and horizontal position inside an image, then you'd do: naturalID = imageID + verticalID + horizontalID (you can also add separators for better readability).
2. Once you have your own unique IDs, you can use them to save the tile in Memcache.
3. If your tiles are immutable (= once created, their content does not change), then you can also cache them inside the instance in a global map.
Edit: removed Objectify reference as I just realized you use python.
Edit2: added point 3.
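A sketch of the memcache lookup around such a natural ID, in Python (the render_tile function and key format are made up for illustration); memcache.add could additionally be used as a cheap lock so that only one of the concurrent requests does the drawing:

from google.appengine.api import memcache

def get_tile(image_id, x, y):
    # Cache a rendered tile under its natural ID; only render on a miss.
    key = "tile:%s:%s:%s" % (image_id, x, y)
    tile = memcache.get(key)
    if tile is None:
        tile = render_tile(image_id, x, y)  # hypothetical: query the datastore and draw the JPEG
        memcache.set(key, tile, time=3600)
    return tile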
A few things come to mind:
How do you query for the tiles? You should be able to get the tiles using Key.get(), which is far more efficient than a query.
Try to reduce the number of requests; using zoom levels should reduce the number to around 4 requests to retrieve the map.
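For illustration, fetching tiles by key instead of querying might look like this in Python NDB (the Tile kind and ID scheme are assumptions; a batch get fetches several tiles in one RPC):

from google.appengine.ext import ndb

def get_tiles(natural_ids):
    # Batch-get tiles by key name instead of running a query per tile.
    keys = [ndb.Key("Tile", nid) for nid in natural_ids]
    return ndb.get_multi(keys)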

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query [1]), but that still seems strange to me because I will have to fetch the x entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about the problem
These entities are simply domains. The first string is the SLD and the second the TLD (no subdomains). The application can be used to perform a request like this: http://[...]/available/stackoverflow.com . The application will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of ToS and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is that these are pretty big quantities and I have to figure out a way to keep costs down while adding/removing 2*~100000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
[1]: 'A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
    pass
Then you will populate your datastore like:
UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:
free_domains = ['stackoverflow.com', 'monkey.com']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching up the deletes into something like 200 per RPC, if your free_domains list is really big.
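A sketch of that batching (the chunk size of 200 follows the suggestion above; the helper name is made up):

from google.appengine.ext import db

def delete_free_domains(free_domains, batch_size=200):
    # Delete keys in chunks of ~200 per RPC instead of one giant call.
    keys = [db.Key.from_path('UnavailableDomain', name) for name in free_domains]
    for i in range(0, len(keys), batch_size):
        db.delete(keys[i:i + batch_size])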
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could use both to:
Create a pipeline for the overall task that you run via cron every 24 hours.
The 'overall' pipeline would start a mapper that filters your entities and yields the delete operations.
After the delete mapper completes, the 'overall' pipeline could call an 'import' pipeline to start running your entity-creation part.
The pipeline API can then send you an email to report on its status.
