I have been collecting data in BigQuery for analysis purposes. However, the size of the data is growing and I only need 2 weeks of recent data. I wanted to erase data that is not used. I did some research and I found out that there is an expiration option for partitioned data.
Current setup:
My table is a partitioned table
I use a Lambda function with code similar to the following to load data into the table. (I tried adding the timePartitioning option, but it didn't work, which is why I am asking here on Stack Overflow.)
await bq
  .dataset("dataset name")
  .table('tablename' + '$' + partitionTime)
  .load(filename, {
    sourceFormat: 'CSV',
    schema,
    skipLeadingRows: 1,
    timePartitioning: {
      expirationMs: "300000"
    }
  });
Where partitionTime is in the format YYYYMMDD (this places the data inserted into that partition)
Thank you for all your comments and taking time to read my trouble :)
Have a nice day.
As you can see here, the function load accepts three parameters:
source (needed)
metadata (optional)
callback (optional)
The options that you need can be set in the metadata parameter. In the link provided above you can also notice that the BigQuery SDK uses API calls to perform the given operations.
This link and the screenshot below show how to build a correct API call for BigQuery.
In the timePartitioning field you can set DAY as your type of partitioning and give your expiration time in milliseconds.
In the end, your code would have a small change:
await bq
  .dataset("dataset name")
  .table('tablename')
  .load(filename, {
    sourceFormat: 'CSV',
    schema,
    skipLeadingRows: 1,
    timePartitioning: {
      type: "DAY",
      // 300000 ms is only 5 minutes; 14 days would be "1209600000"
      expirationMs: "300000"
    }
  });
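Since expirationMs is given in milliseconds, it may help to compute the value rather than type it by hand. Below is a minimal Python sketch of that arithmetic, together with the YYYYMMDD partition suffix from the question. The helper names expiration_ms and partition_suffix are made up for illustration and are not part of any BigQuery client.

```python
from datetime import date, timedelta


def expiration_ms(days: int) -> str:
    """Return a retention period in milliseconds, as a string like the one used above."""
    return str(int(timedelta(days=days).total_seconds() * 1000))


def partition_suffix(day: date) -> str:
    """Format a date as the YYYYMMDD partition suffix described in the question."""
    return day.strftime("%Y%m%d")


two_weeks = expiration_ms(14)
print(two_weeks)
print(partition_suffix(date(2020, 1, 31)))
```

The string returned by expiration_ms(14) is the value you would put in expirationMs to keep two weeks of partitions.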
I hope it helps
I currently have an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, such as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast json string for graphing, only a few other fields stored in the entity. The other approach I tried to run was to only fetch a couple of entities out at a time and split the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\\n' + ','.join(d)

# Children Entity - log of a weather at parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...
    # Function for querying data below a given ancestor between two optional
    # times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be with Python's garbage collector not deleting the already used information that is re-referenced, but even when I include gc.collect() I see no improvement there.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than build a separate projection for each custom query, I simply run the same projection each time: all properties of the entity except the JSON string. This is not ideal, as it still pulls unneeded properties from the database on every request, but a dedicated projection per query does not scale because the number of necessary indexes grows exponentially. For this application each additional property costs negligible memory (aside from that JSON string), so it works!
You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time) and cursors. Check out examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally you could try to create and store just portions of the graphs and just stitch them together in the end (only if it comes down to that, I'm not sure how exactly it would be done, I imagine it wouldn't be trivial).
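To make the first two suggestions concrete, here is a rough, self-contained sketch of the loop shape: page through results with a cursor, keep only the fields you need, and write rows into a CSV buffer instead of growing a string. The fetch_page function below is a hypothetical stand-in that only mimics the (results, next_cursor, more) tuple returned by the fetch_page call in the question. It does not talk to the datastore, and the RECORDS list is invented test data.

```python
import csv
import io

# Invented stand-in data: small fields plus one large blob we do not need.
RECORDS = [
    {"time": "2020-01-01 00:%02d" % m, "temp": 20 + m, "forecast_json": "x" * 1000}
    for m in range(0, 60, 5)
]


def fetch_page(page_size, start_cursor=None, fields=("time", "temp")):
    """Hypothetical paged query returning (results, next_cursor, more).

    Only the requested fields are copied, which is the idea behind a projection.
    """
    start = start_cursor or 0
    chunk = RECORDS[start:start + page_size]
    results = [{f: rec[f] for f in fields} for rec in chunk]
    next_cursor = start + len(chunk)
    return results, next_cursor, next_cursor < len(RECORDS)


def build_csv(page_size=5):
    """Stream pages into a CSV buffer rather than concatenating strings."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Time", "temp"])
    cursor, more = None, True
    while more:
        page, cursor, more = fetch_page(page_size, start_cursor=cursor)
        for row in page:
            writer.writerow([row["time"], row["temp"]])
    return out.getvalue()


csv_text = build_csv()
print(csv_text.splitlines()[:3])
```

In a real handler the stand-in would be replaced by the actual query, and the page size would be much larger (the answer suggests 500).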
I need to export all data from a Silverpop automated message using the Silverpop API.
Apparently there is not much information on the net apart from the official guide, "XML API Developer’s Guide ENGAGE".
I need to know how to:
retrieve a list of automated messages
extract / download the data of a selected report (for all days, not a single one)
finally (again not documented in the official guide), programmatically export the final report with MOVE_TO_FTP=true set
(the guide says: Use the MOVE_TO_FTP parameter to retrieve the output file programmatically)
Thank you very much in advance for any help with this.
You can use the RawRecipientDataExport XML API export to get the following:
• One or more mailings
• One or more Mailing/Report ID combinations (for Autoresponders)
• A specific Database (optional: include related queries)
• A specific Group of Automated Messages
• An Event Date Range
• A Mailing Date Range
For automated messages, you can use the following XML tags <CAMPAIGN_ACTIVE/>, <CAMPAIGN_COMPLETED/>, and <CAMPAIGN_CANCELLED/> to retrieve active Groups of Automated Messages, retrieve completed Groups of Automated Messages, and retrieve canceled Groups of Automated Messages, respectively.
To get data for all days and not just one, you can set a date range for send dates and event dates, by putting your desired date ranges within the <SEND_DATE_START> and <SEND_DATE_END> tags and <EVENT_DATE_START> and <EVENT_DATE_END> tags. The date formats are like this: 12/02/2011 23:59:00
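As a rough illustration, a request body with the tags mentioned above could be assembled like this in Python. The element names come from this answer and the question. The Envelope/Body wrapper, the use of MOVE_TO_FTP as an empty flag element, and the month-first reading of the date format are my assumptions, so check them against the XML API guide before relying on them.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed to be month/day/year, matching the sample "12/02/2011 23:59:00".
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def build_export_request(event_start, event_end):
    """Build a RawRecipientDataExport request body as an XML string."""
    envelope = ET.Element("Envelope")        # assumed wrapper element
    body = ET.SubElement(envelope, "Body")   # assumed wrapper element
    export = ET.SubElement(body, "RawRecipientDataExport")
    ET.SubElement(export, "CAMPAIGN_ACTIVE")  # active Groups of Automated Messages
    ET.SubElement(export, "EVENT_DATE_START").text = event_start.strftime(DATE_FORMAT)
    ET.SubElement(export, "EVENT_DATE_END").text = event_end.strftime(DATE_FORMAT)
    ET.SubElement(export, "MOVE_TO_FTP")      # assumed to be a flag element
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_export_request(
    datetime(2011, 1, 1, 0, 0, 0),
    datetime(2011, 12, 2, 23, 59, 0),
)
print(request_xml)
```

The sketch only builds the XML; sending it and authenticating are left out because they depend on your account setup.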
Hope this helps.
I am looking for a query that is very similar to this one. As an extension, though, I do not want to count all objects, only the fairly recent ones.
In my case there are two models; call one Source and the other Data. As a result I'd like to get a list of all Sources ordered by the number of Data records that have been collected during the last week.
I am not interested in how many data records have been collected in total, only in whether the source has been active recently.
Using the following code snippet from the link above, I cannot work out how to restrict the Data table first.
from django.db.models import Count
activity_per_source = Source.objects.annotate(count_data_records=Count('Data')) \
.order_by('-count_data_records')
The only ways I came up with, would be to write native SQL or to process this in a loop and individual queries. Is there a Django-Query version?
(I use a MySQL database and Django 1.5.4)
Check out the docs on the order of annotate and filter: https://docs.djangoproject.com/en/1.5/topics/db/aggregation/#order-of-annotate-and-filter-clauses
Try something along the lines of:
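import datetime
from django.db.models import Count

one_week_ago = datetime.date.today() - datetime.timedelta(days=7)
activity_per_source = Source.objects.\
    filter(data__date__gte=one_week_ago).\
    annotate(count_data_records=Count('data')).\
    order_by('-count_data_records').distinct()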
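If it helps to see what filtering before annotating amounts to, here is a small sqlite3 sketch of roughly the equivalent SQL: join, restrict to recent rows, then count per source. It is not the exact SQL Django generates. The table names app_source and app_data are placeholders and the rows and dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_source (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE app_data (id INTEGER PRIMARY KEY, source_id INTEGER, date TEXT);
    INSERT INTO app_source VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO app_data (source_id, date) VALUES
        (1, '2013-10-01'), (1, '2013-10-20'),
        (2, '2013-10-19'), (2, '2013-10-20'), (2, '2013-10-21'),
        (3, '2013-09-01');
""")

one_week_ago = "2013-10-15"  # fixed date so the example is reproducible
rows = conn.execute(
    """
    SELECT s.name, COUNT(d.id) AS count_data_records
    FROM app_source AS s
    JOIN app_data AS d ON d.source_id = s.id
    WHERE d.date >= ?
    GROUP BY s.id
    ORDER BY count_data_records DESC
    """,
    (one_week_ago,),
).fetchall()
print(rows)
```

Note that a source with no recent rows (source c here) drops out of the result entirely, because the filter is applied on the join before counting.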
There is a way of doing that mixing Django queries with SQL via extra:
start_date = datetime.date.today() - datetime.timedelta(days=7)
activity_per_source = (
    Source.objects
    .extra(where=["(select max(date) from app_data where source_id=app_source.id) >= '%s'"
                  % start_date.strftime('%Y-%m-%d')])
    .annotate(count_data_records=Count('data'))
    .order_by('-count_data_records'))
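The where clause filters the Sources by the date of their most recent Data record. Note that the count then covers all Data records of those sources, not only the recent ones.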
Note: replace table and field names with actual ones.
When I first started developing this project, there was no requirement for generating large files, however it is now a deliverable.
Long story short, GAE just doesn't play nice with any large scale data manipulation or content generation. The lack of file storage aside, even something as simple as generating a pdf with ReportLab with 1500 records seems to hit a DeadlineExceededError. This is just a simple pdf comprised of a table.
I am using the following code:
self.response.headers['Content-Type'] = 'application/pdf'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.pdf'
doc = SimpleDocTemplate(self.response.out, pagesize=landscape(letter))
elements = []
dataset = Voter.all().order('addr_str')
data = [['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE']]
i = 0
r = 1
s = 100
while ( i < 1500 ):
    voters = dataset.fetch(s, offset=i)
    for voter in voters:
        data.append([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname ])
        r = r + 1
    i = i + s
t=Table(data, '', r*[0.4*inch], repeatRows=1 )
t.setStyle(TableStyle([('ALIGN',(0,0),(-1,-1),'CENTER'),
                       ('INNERGRID', (0,0), (-1,-1), 0.15, colors.black),
                       ('BOX', (0,0), (-1,-1), .15, colors.black),
                       ('FONTSIZE', (0,0), (-1,-1), 8)
                       ]))
elements.append(t)
doc.build(elements)
Nothing particularly fancy, but it chokes. Is there a better way to do this? If I could write to some kind of file system and generate the file in bits, and then rejoin them that might work, but I think the system precludes this.
I need to do the same thing for a CSV file, however the limit is obviously a bit higher since it's just raw output.
self.response.headers['Content-Type'] = 'application/csv'
self.response.headers['Content-Disposition'] = 'attachment; filename=output.csv'
dataset = Voter.all().order('addr_str')
writer = csv.writer(self.response.out,dialect='excel')
writer.writerow(['#', 'STREET', 'UNIT', 'PROFILE', 'PHONE', 'NAME', 'REPLY', 'YS', 'VOL', 'NOTES', 'MAIN ISSUE'])
i = 0
s = 100
while ( i < 2000 ):
    last_cursor = memcache.get('db_cursor')
    if last_cursor:
        dataset.with_cursor(last_cursor)
    voters = dataset.fetch(s)
    for voter in voters:
        writer.writerow([voter.addr_num, voter.addr_str, voter.addr_unit_num, '', voter.phone, voter.firstname+' '+voter.middlename+' '+voter.lastname])
    memcache.set('db_cursor', dataset.cursor())
    i = i + s
memcache.delete('db_cursor')
Any suggestions would be very much appreciated.
Edit:
Below I have documented three possible solutions based on my research, suggestions, etc.
They aren't necessarily mutually exclusive, and the final approach could be a slight variation or combination of any of the three, but the gist of the solutions is there. Let me know which one you think makes the most sense, and might perform the best.
Solution A: Using mapreduce (or tasks), serialize each record, and create a memcache entry for each individual record keyed with the keyname. Then process these items individually into the pdf/xls file. (use get_multi and set_multi)
Solution B: Using tasks, serialize groups of records, and load them into the db as a blob. Then trigger a task once all records are processed that will load each blob, deserialize them and then load the data into the final file.
Solution C: Using mapreduce, retrieve the keynames and store them as a list, or serialized blob. Then load the records by key, which would be faster than the current loading method. If I were to do this, which would be better, storing them as a list (and what would the limitations be...I presume a list of 100,000 would be beyond the capabilities of the datastore) or as a serialized blob (or small chunks which I then concatenate or process)
Thanks in advance for any advice.
Here is one quick thought, assuming it is crapping out fetching from the datastore. You could use tasks and cursors to fetch the data in smaller chunks, then do the generation at the end.
Start a task which does the initial query and fetches 300 (arbitrary number) records, then enqueues a named (important!) task that you pass the cursor to. That one in turn queries [your arbitrary number] records, and then passes the cursor to a new named task as well. Continue that until you have enough records.
Within each task process the entities, then store the serialized result in a text or blob property on a 'processing' model. I would make the model's key_name the same as the task that created it. Keep in mind the serialized data will need to be under the API call size limit.
To serialize your table pretty fast you could use:
serialized_data = "\x1e".join("\x1f".join(voter) for voter in data)
Have the last task (when you get enough records) kick off the PDF or CSV generation. If you use key_names for your models, you should be able to grab all of the entities with encoded data by key. Fetches by key are pretty fast, and you'll know the models' keys since you know the last task name. Again, you'll want to be mindful of the size of your fetches from the datastore!
To deserialize:
list(voter.split('\x1f') for voter in serialized_data.split('\x1e'))
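Here is a small round-trip sketch of that idea, wrapped in two helper functions (the names serialize and deserialize are mine, and the sample rows are invented). One difference from the one-liner above: it converts every field with str() first, since join only accepts strings, so numbers come back as text after deserializing.

```python
RECORD_SEP = "\x1e"  # ASCII record separator, placed between rows
UNIT_SEP = "\x1f"    # ASCII unit separator, placed between fields


def serialize(rows):
    """Join rows of values into one string; every field is converted to text."""
    return RECORD_SEP.join(UNIT_SEP.join(str(field) for field in row) for row in rows)


def deserialize(blob):
    """Split the string back into a list of rows (all fields come back as str)."""
    return [row.split(UNIT_SEP) for row in blob.split(RECORD_SEP)]


data = [[12, "Main St", "", "555-0100"], [7, "Oak Ave", "2B", ""]]
restored = deserialize(serialize(data))
print(restored)
```

This only works as long as the field values themselves never contain the two separator characters, which is usually a safe bet for names and addresses.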
Now run your PDF / CSV generation on the data. If splitting up the datastore fetches alone does not help you'll have to look into doing more of the processing in each task.
Don't forget in the 'build' task you'll want to raise an exception if any of the interim models are not yet present. Your final task will automatically retry.
Some time ago I faced the same problem with GAE. After many attempts I just moved to another web hosting since I could do it. Nevertheless, before moving I had 2 ideas how to resolve it. I haven't implemented them, but you may try to.
First idea is to use SOA/RESTful service on another server, if it is possible. You can even create another application on GAE in Java, do all the work there (I guess with Java's PDFBox it will take much less time to generate PDF), and return result to Python. But this option needs you to know Java and also to divide your app to several parts with terrible modularity.
So, there's another approach: you can play a "ping-pong" game with the user's browser. The idea is that if you cannot do everything in a single request, you force the browser to send you several. During the first request, do only the part of the work that fits in the 30-second limit, then save the state and generate a 'ticket', a unique identifier of the 'job'. Finally, send the user a response that is a simple page with a redirect back to your app, parametrized by the job ticket. When you get that request, just restore the state and proceed with the next part of the job.
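I never implemented this, but the bookkeeping could look roughly like the sketch below. Everything in it is hypothetical: the JOBS dict stands in for whatever persistent storage you would use, a fixed chunk size stands in for "as much as fits in the time limit", and the loop at the end plays the role of the browser following redirects.

```python
import uuid

JOBS = {}  # hypothetical in-memory store; a real app would persist this state


def start_job(total_records):
    """Create a job and return its ticket (a unique identifier)."""
    ticket = uuid.uuid4().hex
    JOBS[ticket] = {"next": 0, "total": total_records, "rows": []}
    return ticket


def continue_job(ticket, chunk_size=500):
    """Do one request's worth of work; return True when the job is finished."""
    job = JOBS[ticket]
    end = min(job["next"] + chunk_size, job["total"])
    job["rows"].extend("record %d" % n for n in range(job["next"], end))
    job["next"] = end
    return job["next"] >= job["total"]


ticket = start_job(1500)
requests_needed = 1
while not continue_job(ticket):
    requests_needed += 1  # each iteration stands for one browser redirect
print(requests_needed, len(JOBS[ticket]["rows"]))
```

Once continue_job reports that the job is finished, the last request would generate the file from the collected rows and send it instead of another redirect.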
I would like to create a logger using CouchDB. Basically, every time someone accesses the file, I would like to write the username and the access time to the database. If this were MySQL, I would just add a row for every access, linked to the user. I am not sure what to do in CouchDB. Would I need to store each access in an array? If so, what do I do during an update: is there a way to append to the document? Would each user have their own document?
I couldn't find any documentation on how to append to an existing document or array without retrieving and updating the entire document. So for every event you log, you'll have to retrieve the entire document, update it and save it to the database. So you'll want to keep the documents small for two reasons:
Log files/documents tend to grow big. You don't want to send large documents across the wire for each new log entry you add.
Log files/documents tend to get updated a lot. If all log entries are stored in a single document and you're trying to write a lot of concurrent log entries, you're likely to run into mismatching document revisions on updates.
Your suggestion of user-based documents sounds like a good solution, as it will keep the documents small. Also, a single user is unlikely to generate concurrent log entries, minimizing any race conditions.
Another option would be to store a new document for each log entry. Then you'll never have to update an existing document, eliminating any race conditions and the need to send large documents between your application and the database.
Niels' answer is going down the right path with transactions. As he said, you will want to create a different document for each access - think of them as actions. Here's what one of those documents might look like:
{
  "_id": "32 char hash",
  "_rev": "32 char hash",
  "at": Unix time stamp,
  "by": "some unique identifier"
}
If you were tracking multiple files, then you'd want to add a "file" field and include a unique identifier.
Now the power of Map/Reduce begins to really shine, as it's extremely good at aggregating multiple pieces of data. Here's how to get the total number of views:
Map:
function(doc)
{
  emit(doc.at, 1);
}
Reduce:
function(keys, values, rereduce)
{
  return sum(values);
}
The reason I threw the time stamp (doc.at) into the key is that it allows us to get total views for a range of time. For example, /dbName/_design/designDocName/_view/viewName?startkey=1000&endkey=2000 gives us the total number of views between those two time stamps; adding group=true returns one count per distinct time stamp instead.
Cheers.
Although Sam's answer is an ok pattern to follow I wanted to point out that there is, indeed, a nice way to append to a Couch document. It just isn't very well documented yet.
By defining an update function in your design document and using that to append to an array inside a couch document you may be able to save considerable disk space. Plus, you end up with a 1:1 correlation between the file you're logging accesses on and the couch doc that represents that file. This is how I imagine a doc might look:
{
  "_id": "some/file/path/name.txt",
  "_rev": "32 char hash",
  "accesses": [
    {"at": 1282839291, "by": "ben"},
    {"at": 1282839305, "by": "kate"},
    {"at": 1282839367, "by": "ozone"}
  ]
}
One caveat: You will need to encode the "/" as %2F when you request it from CouchDB or you'll get an error. Using slashes in document ids is totally ok.
And here is a pair of map/reduce functions:
function(doc)
{
  if (doc.accesses) {
    // declare loop variables with var so they do not leak as globals
    for (var i = 0; i < doc.accesses.length; i++) {
      var event = doc.accesses[i];
      emit([doc._id, event.by, event.at], 1);
    }
  }
}

function(keys, values, rereduce)
{
  return sum(values);
}
And now we can see another benefit of storing all accesses for a given file in one JSON document: to get a list of all accesses on a document just make a get request for the corresponding document. In this case:
GET http://127.0.0.1:5984/dbname/some%2Ffile%2Fpath%2Fname.txt
If you wanted to count the number of times each file was accessed by each user you'll query the view like so:
GET http://127.0.0.1:5984/test/_design/touch/_view/log?group_level=2
Use group_level=1 if you just want to count total accesses per file.
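If group_level is new to you, this little Python simulation shows what it does with the composite keys emitted above: it groups on the first N elements of each key and sums the values. It is only a model of the grouping, not CouchDB itself, and the other.txt entry is an invented extra row to make the grouping visible.

```python
from collections import Counter

# Keys as emitted by the map function: [doc._id, event.by, event.at]
emitted_keys = [
    ("some/file/path/name.txt", "ben", 1282839291),
    ("some/file/path/name.txt", "kate", 1282839305),
    ("some/file/path/name.txt", "ozone", 1282839367),
    ("other.txt", "ben", 1282839400),  # invented extra access
]


def reduce_at_level(keys, group_level):
    """Count one per emitted key, grouping on the first group_level key elements."""
    return Counter(key[:group_level] for key in keys)


per_file_and_user = reduce_at_level(emitted_keys, 2)
per_file = reduce_at_level(emitted_keys, 1)
print(per_file)
print(per_file_and_user)
```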
Finally, here is the update function you can use to append onto that doc.accesses array:
function(doc, req) {
  var whom = req.query.by;
  var when = Math.round(new Date().getTime() / 1000);
  if (!doc.accesses) doc.accesses = [];
  var event = {"at": when, "by": whom};
  doc.accesses.push(event);
  var message = 'Logged ' + event.by + ' accessing ' + doc._id + ' at ' + event.at;
  return [doc, message];
}
Now whenever you need to log an access to a file, issue a PUT request like the following (depending on how you name your design document and update function):
PUT http://127.0.0.1:5984/my_database/_design/my_designdoc/_update/update_function_name/some%2Ffile%2Fpath%2Fname.txt?by=username
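If you build that URL in code, remember the %2F caveat from earlier. A minimal Python sketch, using only the standard library, might look like this. The update_url helper is hypothetical and the names are the placeholders from the URL above.

```python
from urllib.parse import quote, urlencode


def update_url(base, database, design_doc, update_fn, doc_id, user):
    """Build the update-handler URL, percent-encoding slashes in the doc id."""
    encoded_id = quote(doc_id, safe="")  # "/" becomes "%2F"
    return "%s/%s/_design/%s/_update/%s/%s?%s" % (
        base, database, design_doc, update_fn, encoded_id, urlencode({"by": user})
    )


url = update_url(
    "http://127.0.0.1:5984", "my_database", "my_designdoc",
    "update_function_name", "some/file/path/name.txt", "username",
)
print(url)
```

Passing safe="" matters here, because quote leaves "/" unescaped by default.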
A comment on the last two answers: they refer to Couchbase, not Apache CouchDB.
It is, however, possible to define update handlers in CouchDB, but I have not used them.
http://wiki.apache.org/couchdb/Document_Update_Handlers