App Engine backup never finishes only clue is failure in map reduce worker_callback - google-app-engine

Over the last few weeks we have repeatedly failed on doing a complete backup of the data store using the datastore admin tool. We thought the issues had to do with quota errors we were running into so we switched our application from a free to a paid app and we still have problems.
Each time we are attempting to back up to the blobstore and what occurs is that the process never finishes. We see the backup in our Pending Backups list but it never actually completes. We only have a total of 43MB of data right now so we don't see it as a data transfer problem. Looking at our default Task Queues it shows that we have two pending tasks one is a call to /_ah/mapreduce/controller_callback and another is a call to /_ah/mapreduce/worker_callback
The worker_callback racks up its retry count and the only error clue we have is on the Previous Run tab it shows the last http response code to be 500. There is no error message, nothing shows up in our error logs, it just keeps trying over and over again.
We've been able to narrow the backup problems to a specific entity kind for a particular namespace but we can't figure out why that entity kind is failing whereas the others are not. The major difference is the entity kind has a large number of embedded entities, but if the app engine is able to read / put those entities we can't understand why it seems to be having problems backing it up. The particular namespace that the error occurs in has the largest data stored for that entity kind compared to the other namespaces we have setup.
We think if we can see what error is occurring in the worker_callback we may be able to figure out why the backup is failing, or what is wrong with our data that's preventing the backup. Is there something we need to setup / enable through settings / configuration files to give us more detailed information on the backup? Or is there some other avenue we should explore to figure out how to investigate/fix this problem?
I should mention we are using the Java SDK as well as Objectify V3 to work with the data store. We are also backing up data to the Blobstore.
Thank you.

Well with the app engine team's help we figured what the problem was and we worked around the issue. I want to give details in case anyone else runs into this problem.
From issue 8363 the app engine team indicated that from their logs they could see that the map reduce failed because of the large number of properties that our entity kind had. The specific entity kind that was causing the failure had a large number of variable properties that was generating errors when map reduce tried to write out a schema. They indicated that the solution on their end was to ignore entities that were like this in the backup to make it so the backup worked successfully.
What we did to work around the issue and make the backup work was change how we told objectify to store out data. The large number of properties were being created due to our use of the #embedded keyword on a HashMap() class member field. Since the embedded keyword breaks down classes into individual components it was generating a large number of properties. We switched the member field to be #serialized and then ran a conversion process to make it use the new serialized property. This made the backup / restore work again.
You can read more about the differences between embedded and serialized on objectify's website

snielson, would you mind opening an issue on our Public issue tracker here. Remember to add your Application ID so we can further debug this specific scenario.
Thanks!

Related

How can I dump the .parquet data that is in Azure DataLakeStorage to a Microsoft SQL Server database using Nifi?

I've been looking for information for a long time and I can't get it. I'm starting to think it can't be done if the .parquet are in Azure DataLake Storage.
I have a folder with subfolders in Azure DataLake Storage. In these subfolders there are many .parquet. I manage to get them out using ListAzureDataLakeStorage + FetchAzureDataLakeStorage combination. Then I try to pass them through a PutDatabaseRecord (which I think is the correct processor for the dump in the DB).
I think I have the PutDatabaseRecord well configured. But when executing it gives me an error: "Failed to process session due to Failed to process StandardFlowFileRecord due to java.lang.NullPointerException: Name is null".
I'm not sure I'm using the PutDatabaseRecord right. I thought that PutDatabaseRecord read the flowfiles that came to it interpreting their content as .parquet (it is supposed to use a ParquetReader as a RecordReader), being able to understand the data as records. But it surprises me that it is not necessary to indicate how to interpret the .parquet, nor how to map its columns with those of the DB table. It still doesn't work as I think and it needs the flowfile content to already arrive as records?
The truth is that I can't explain myself better either because I don't really understand what is considered a record in Nifi or how a record is related to a reading of a .parquet.
Either I am missing a processor or something I am configuring wrong. But the only thing I find is the FetchParquet, which seems to be able to read a .parquet and put it into the FlowFile as records. However, it can only be used with ListHDFS or ListFile, which do not allow me to fetch data from Azure Data Lake Storage
After several tests (using the ConvertRecord and QueryRecord processors), I have come to the conclusion that the problem is in the reading that the ParquetReader does of the content of the FlowFiles that arrive. Well, every processor that needs a ParquetReader gives the same error. Downloading the content of the FlowFile that enters the processor that the ParquetReader uses (whatever it is) and using a .parquet viewer I have verified that this content is fine.
Without knowing what to do, I have attached a screenshot of the specific error. I still don't know what "Name" the error refers to.
Error Name is null
Note: I also posted my problem on Cloudera, perhaps better explained. I leave the link in case someone wants to look at it. (https://community.cloudera.com/t5/Support-Questions/How-can-I-dump-the-parquet-data-that-is-in-Azure/td-p/316020)
In the end, the closest thing to the error I was getting was found here (https://issues.apache.org/jira/browse/NIFI-7817). It seems that it is an error related to the creation of the ParquetReader. This makes sense because it would hit any processor that used a ParquetReader. In addition, the FlowFiles did not even enter the processor that used it.
I was using Nifi version 1.12.1. I have downloaded version 1.13.2 and it no longer gives the Name error. In addition, it is seen that the Flow Files already enter the processor. On the download page of the new version (https://nifi.apache.org/download.html) you can access the Release Notes and the Migration Guidance to know what has been fixed with respect to previous versions and with which processors you have to be careful when migrating.
However, even though the data goes into the processor, it still gives me an error, but it is different and I will open it in another post.

Is my GAE Search corrupt?

I've got a single index in a GAE Search application.
When I call index.put I get the OverQuotaError: The API call search.IndexDocument() required more quota than is available.
When I got to the GAE Console and look under Search then my index articleIndex contains no documents but its Amount Used is 78.2KB. I've also tried retrieving documents but none are returned.
I've tried using a new index but I can the same error message in my application's logs.
I have a copy of my application and that continues to work fine - this uses the same code and data but in a separate application space.
I created a new app with the code from my "corrupted" installation and the new installation indexes fine.
Has anyone had a GAE Search index that, although empty, is listed as taking up space?
I've tried running my GAE routines right after my daily quota is renewed ().
The quota representing storage usage is reconciled nightly, so if you have recently removed documents from the index that fact will not immediately be reflected. However, you say that using a new index still produces the same problem?
Note that daily quota renewal (not to be confused with the reconciliation mentioned above) does not affect the storage limit.
If you are still having troubles, you can file a report on the external issue tracker, mention the app ID and index name, and we can help investigate the current status of the index.

NDB query().iter() of 1000<n<1500 entities is wigging out

I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
print 'Downloading {}'.format(model.__name__)
new_items_iter = model.query().iter()
new_items = [i.to_dict() for i in new_items_iter]
print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.

Will Jira complain if I set the Resolution date to a date before the creation via direct DB write?

Some colleagues were using an Excel file to keep track of some issues, and they have decided to switch to a better system, asking me to setup a Jira project for them and to import all the tickets. A way or the other I have done everything, but the resolution date is now wrong, because it's the one of when I ran the script to import them into Jira. They would like to have the original one, so that they can know when an issue was really fixed. Unfortunately there's no way to change it from Jira's interface, so I have to access the DB directly. The command, for the record, is like:
update jiraissue
set RESOLUTIONDATE = "2015-02-16 14:48:40"
where pkey = "OV001-1";
Now, low-level writes to a database in general are dangerous, and I am wondering whether there can be any risks. Our test server is not available right now, so I'd have to work directly on the production one. One thing I had seen on our test server is that this seemed to work, except that JQL queries such as
resolved < 2015-03-20
are wrong because they still use the old Resolution date. Clearly, I have to reindex; but I'm wondering whether it is safe. Does Jira perform some consistency checks? Like, verifying that a ticket is solved after it is created. In my case, since I have modified the resolution date but not the creation, it is a clear inconsistency. Will Jira complain about this? Is there the risk to corrupt the DB? And if I also modify the creation date, do I have to watch out for other things?
We are using Jira 5.2.11.
I have access to the test server again, and I have tried it. I have modified all the RESOLUTIONDATE fields I had to fix, and when I reloaded the page the new date was there. Jira didn't complain about anything. I reindexed the server, so that queries yield correct results, and I saw no issues. Then I even ran the integrity checks (Administration -> System -> Integrity Checker), and no error was found.
Finally I did the same on the production server, and again everything is running fine.
I can therefore conclude that the operation is not dangerous at all, and it can be done safely.

Need ideas on retrieving data from a website

I'm stumped and need some ideas on how to do this or even whether it can be done at all.
I have a client who would like to build a website tailored to English-speaking travelers in a specific country (Thailand, in this case). The different modes of transportation (bus & train) have good web sites for providing their respective information. And both are very static in terms of the data they present (the schedules rarely change). Here's one of the sites I would need to get info from: train schedules The client wants to provide users the ability to search for a beginning and end location and determine, using the external website's information, how they can best get there, being provided a route with schedule times for the different modes of chosen transport.
Now, in my limited experience, I would think the way to do that would be to retrieve the original schedule info from the external site's server (via API or some other means) and retain the info in a database, which can be queried as needed. Our first thought was to contact the respective authorities to determine how/if this can be done, but this has proven to be problematic due to the language barrier, mainly.
My client suggested what is basically "screen scraping", but that sounds like it would be complicated at best, downloading the web page(s) and filtering through the HTML for relevant/necessary data to put into the database. My worry is that the info on these mainly static sites is so static, that the data isn't even kept in a database to build the page and the web page itself is updated (hard-coded) when something changes.
I could really use some help and suggestions here. Thanks!
Screen scraping is always problematic IMO as you are at the mercy of the person who wrote the page. If the content is static, then I think it would be easier to copy the data manually to your database. If you wanted to keep up to date with changes, you could then snapshot the page when you transcribe the info and run a job to periodically check whether the page has changed from the snapshot. When it does, it sends an email for you to update it.
The above method could also be used in conjunction with some sort of screen scaper which could fall back to a manual process if the page changes too drastically.
Ultimately, it is a case of how much effort (cost) is your client willing to bear for accuracy
I have done this for the following site: http://www.buscatchers.com/ so it's definitely more than doable! A key feature of a web scraping solution for travel sites is that it must send you emails if anything went wrong during the scraping process. On the site, I use a two day window so that I have two days to fix the code if the design changes. Only once or twice have I had to change my code, and it's very easy to do.
As for some examples. There is some simplified source code here: http://www.buscatchers.com/about/guide. The full source code for the project is here: https://github.com/nicodjimenez/bus_catchers. This should give you some ideas on how to get started.
I can tell that the data is dynamic, it's to well structured. It's not hard for someone who is familiar with xpath to scrape this site.

Resources