Preventing duplicates with MapReduce to BigQuery pipeline - google-app-engine

I was reading the answer by Michael to this post here, which suggests using a pipeline to move data from Datastore to Cloud Storage to BigQuery.
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I have to have some way of knowing whether the entities have already been processed, so they don't get repeatedly submitted to BigQuery during MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options. I can put a flag on the entities, update it when each entity is processed, and filter it out on subsequent runs - or - I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.

Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the kind of incremental situation you've described, because they avoid the overhead of marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
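For what it's worth, here is a minimal sketch of the cursor-based approach in Python ndb, assuming some activity kind and a checkpoint entity to hold the cursor between runs; the kind names and the export_to_bigquery() call are illustrative, not part of any existing pipeline:

# Sketch: resume an incremental export with a datastore query cursor.
# ActivityEntry, ExportCheckpoint and export_to_bigquery are illustrative names.
from google.appengine.ext import ndb

class ActivityEntry(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)
    payload = ndb.JsonProperty()

class ExportCheckpoint(ndb.Model):
    cursor = ndb.StringProperty(indexed=False)  # serialized query cursor

def export_batch(batch_size=500):
    checkpoint = ExportCheckpoint.get_or_insert('bigquery-export')
    start = ndb.Cursor(urlsafe=checkpoint.cursor) if checkpoint.cursor else None

    query = ActivityEntry.query().order(ActivityEntry.created)
    entities, next_cursor, more = query.fetch_page(batch_size, start_cursor=start)

    if entities:
        export_to_bigquery(entities)  # hypothetical append step

    # Persist the cursor so the next run picks up where this one left off,
    # with no need to flag individual entities as processed.
    if next_cursor:
        checkpoint.cursor = next_cursor.urlsafe()
    checkpoint.put()
    return more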

Related

can (should) I use Flink like an in-memory database?

I've used batch Beam but am new to the streaming interface. I'm wondering about the appropriateness of using Apache Flink / Beam kind of like an in-memory database -- I'd like to constantly recompute and materialize one specific view of my data based on edge triggered updates.
More details: I have a few tables in a (normal) database, ranging from thousands to millions of rows, and each one has a many-to-many (M2M) relationship with other ones. Picture to explain:
Hosts <-M2M #1-> Table 1 <-M2M #2-> Table 2 <-M2M #3-> Table 3
Table 1 is a set of objects that the hosts need to know about, and each host needs to know about all downstream rows referenced directly or indirectly by the objects in Table 1 that it's related to. When changes happen anywhere other than the first many-to-many relationship M2M #1, it's not obvious which hosts need to be updated without traversing "left" to find the hosts and then traversing "right" to get all the necessary configuration. The objects and relationships at most levels change frequently, and I need sub-second latency to go from "a record or relationship changed" to recalculating any flattened config files with changes in them so that I can push updates to the hosts very quickly.
Is this an appropriate use case for streaming Flink / Beam? I have worked with Beam in a different system, but only in batch mode, and I think it would be a great tool to use here if I could edge-trigger it. The part I'm getting stuck on is that in batch mode, the PCollections are all "complete", in the sense that I can always join all records in one table with all records in another table. But with streaming, once I process a record, it gets removed from its PCollection and can't be joined against future updates that arrive later and relate to it, right? IIUC, it's only available within a window, but I effectively want an infinitely long window where only outdated versions of items in a PCollection (e.g. versions that have been overwritten by a newer version that came in over the stream) would be freed up.
(Also, to bootstrap this system, I would need to scan the whole database to prefill all the state before I could start reading from a stream of edge-triggered updates. Is that a common pattern?)
I don't know enough about Beam to answer that part of the question, but you can certainly use Flink in the way you've described. The simplest way to accomplish this with Flink is with a streaming join, using the SQL/Table API. The runtime will materialize both tables into managed Flink state, and produce new and/or updated results as new and updated records are processed from the input tables. This is a commonly used pattern.
As for initially bootstrapping the state, before continuing to ingest the updates, I suggest using a CDC-based approach. You might start by looking at https://github.com/ververica/flink-cdc-connectors.
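To make the streaming-join idea concrete, here is a rough PyFlink Table API sketch; the layout loosely follows the Hosts/Table 1 picture above, but every name and connector option below (the datagen/print sources and sinks, the field names) is invented for illustration, and a real setup would use CDC sources instead:

# Sketch: a continuously maintained join over two changing tables in Flink.
# The datagen/print connectors are stand-ins; real inputs would be CDC tables.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE host_to_obj (host_id INT, obj_id INT) WITH (
        'connector' = 'datagen', 'rows-per-second' = '5',
        'fields.obj_id.min' = '1', 'fields.obj_id.max' = '10')
""")
t_env.execute_sql("""
    CREATE TABLE obj_config (obj_id INT, config STRING) WITH (
        'connector' = 'datagen', 'rows-per-second' = '5',
        'fields.obj_id.min' = '1', 'fields.obj_id.max' = '10')
""")
t_env.execute_sql("""
    CREATE TABLE flattened (host_id INT, obj_id INT, config STRING)
    WITH ('connector' = 'print')
""")

# A regular (unwindowed) join: Flink keeps both sides in managed state and
# emits new results as rows arrive; with CDC inputs it also retracts/updates
# previously emitted rows when the underlying records change.
t_env.execute_sql("""
    INSERT INTO flattened
    SELECT m.host_id, m.obj_id, c.config
    FROM host_to_obj AS m
    JOIN obj_config AS c ON m.obj_id = c.obj_id
""").wait()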

Is Google Datastore recommended for storing logs?

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't turn the process into an IT nightmare: load balancing, sharding, servers, user authentication - all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right one for storing logs. Each log entry should be saved as a single document; each client uploads its document on a daily basis, and it can consist of 100K log entries per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions - how many log entries per second will I be able to insert? If 100K won't fit into the 60-second window, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - is a transaction considered a single insert?
Post analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write our own server which fetches data from ES.
Thanks!
It's a bad idea to use Datastore for this, and even worse if you use entity groups with parent/child keys, as a comment mentions when comparing performance.
Those numbers don't really apply anyway; Datastore is simply not designed for what you want.
BigQuery is what you want: it's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
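As a rough illustration of the SQL-like analysis this answer has in mind (not code from the question), here is what querying logs in BigQuery from Python might look like; the project, dataset and column names are made up:

# Sketch: analyzing logs in BigQuery with the Python client.
# `my_project.logs.entries` and its columns are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT client_id, COUNT(*) AS errors
    FROM `my_project.logs.entries`
    WHERE severity = 'ERROR'
      AND ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    GROUP BY client_id
    ORDER BY errors DESC
"""
for row in client.query(query).result():
    print(row.client_id, row.errors)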
I do not agree. Datastore is a fully managed NoSQL document database: you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless; in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB-style log-analysis use case in Google Cloud.
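For contrast, a minimal sketch of the schemaless write path this answer describes, using the Python Cloud Datastore client; the LogEntry kind and its properties are illustrative:

# Sketch: storing and querying schemaless log entries in Cloud Datastore.
from datetime import datetime, timezone
from google.cloud import datastore

client = datastore.Client()

entry = datastore.Entity(client.key('LogEntry'),
                         exclude_from_indexes=('message',))  # long text: skip indexing
entry.update({
    'client_id': 'client-42',
    'severity': 'ERROR',
    'message': 'disk quota exceeded',
    'ts': datetime.now(timezone.utc),
})
client.put(entry)

# Query recent errors for one client.
query = client.query(kind='LogEntry')
query.add_filter('client_id', '=', 'client-42')
query.add_filter('severity', '=', 'ERROR')
for e in query.fetch(limit=20):
    print(e['ts'], e['message'])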

AppEngine & BigQuery - Where would you put stat/monitoring data?

I have an AppEngine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the sanity/performance of the application, both now and in the future, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert the data, and why? Would you add a task and then have some consumers insert the data, or would you do a direct insert [regardless of the solution chosen above]? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale better if you have a lot of data to store. Also, in the Datastore, if you want to update a stats entity transactionally, you will need to shard, or you will be limited to 5 updates per second.
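To show what that sharding workaround looks like in practice, here is a rough ndb sketch modeled on the classic sharded-counter pattern; the StatShard model and shard count are illustrative:

# Sketch: spread updates for one logical stat across N shard entities so a
# single entity is not a write bottleneck. Names and NUM_SHARDS are illustrative.
import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class StatShard(ndb.Model):
    name = ndb.StringProperty(required=True)   # e.g. 'files_processed'
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(stat_name, delta=1):
    shard_id = '%s-%d' % (stat_name, random.randint(0, NUM_SHARDS - 1))
    shard = StatShard.get_by_id(shard_id)
    if shard is None:
        shard = StatShard(id=shard_id, name=stat_name)
    shard.count += delta
    shard.put()

def total(stat_name):
    # Reading sums all shards; eventual consistency is usually fine for stats.
    return sum(s.count for s in StatShard.query(StatShard.name == stat_name))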
If the data to insert is small (one row to insert in BQ, or one entity in the datastore), then you can do it with a direct call, but you must accept that the call may fail. If you want to retry in case of failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks you must be cautious, because they can be run more than once.
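Since tasks can run more than once, here is a sketch of making the BigQuery insert itself tolerant of retries by passing a deterministic insert ID; the table ID and row shape are made up, and BigQuery's streaming dedup is best-effort only:

# Sketch: a streaming insert that stays safe if the task running it is retried.
import hashlib
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = 'my_project.monitoring.app_stats'   # hypothetical table

def insert_stat(row):
    # Derive a stable row_id from the row content so a re-run of the same task
    # does not produce duplicate rows (within BigQuery's dedup window).
    row_id = hashlib.sha1(repr(sorted(row.items())).encode('utf-8')).hexdigest()
    errors = client.insert_rows_json(TABLE_ID, [row], row_ids=[row_id])
    if errors:
        raise RuntimeError('BigQuery insert failed: %s' % errors)

insert_stat({'file': 'batch-0042.csv', 'rows_loaded': 1234, 'duration_ms': 870})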

Handling large number of ids in Solr

I need to perform an online-user search in Solr, i.e. a user needs to find the list of users who are online and match particular criteria.
How I am handling this: we store the IDs of online users in a table, and I send all the online user IDs in the Solr request, like
&fq=-id:(id1 id2 id3 ............id5000)
The problem with this approach is that when the number of IDs gets large, Solr takes too much time to resolve the query, and we need to transfer a large request over the network.
One solution could be to use a join in Solr, but the online data changes regularly and I can't reindex every time (say every 5-10 minutes; it should be at least an hour).
Another solution I thought of is firing this query internally from Solr based on a certain parameter in the URL. I don't know much about Solr internals, so I don't know how to proceed.
With Solr 4's soft commits, committing has become cheap enough that it might be feasible to actually store the "online" flag directly in the user record, and just have &fq=online:true on your query. That reduces the overhead involved in sending 5000 IDs over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, set their status and set commitWithin on the update. It's worth a shot, anyway.
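A small sketch of what that looks like from application code, assuming a 'users' core with an 'online' field; the core name, field name and the 1-second commitWithin are illustrative:

# Sketch: flip the 'online' flag with an atomic update and let Solr
# soft-commit it shortly afterwards.
import requests

SOLR_UPDATE_URL = 'http://localhost:8983/solr/users/update'  # hypothetical core

def set_online(user_id, online):
    doc = {'id': user_id, 'online': {'set': online}}   # atomic update syntax
    resp = requests.post(SOLR_UPDATE_URL,
                         params={'commitWithin': 1000},  # milliseconds
                         json=[doc])
    resp.raise_for_status()

set_online('user-123', True)
# Queries can then add fq=online:true instead of a 5000-ID filter.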
We worked around this issue by implementing Sharding of the data.
Basically, without going heavily into code detail:
Write your own indexing code
use consistent hashing to decide which ID goes to which Solr server
index each user's data to the relevant shard (it can be several machines)
make sure you have redundancy
Query Solr shards
Do sharded queries in Solr using the shards parameter
Start an EmbeddedSolr and use it to do a sharded query
Solr will query all the shards and merge the results; it also provides timeouts if you need to limit the query time for each shard
Even with all of what I said above, I do not believe Solr is a good fit for this. Solr is not really well suited for searches on indexes that are constantly changing, and if you mainly search by IDs then a search engine is not needed.
For our project we basically implemented all the index building, load balancing and query engine ourselves, and use Solr mostly as storage. We started using Solr back when its sharding was flaky and not performant; I am not sure what the state of it is today.
One last note: if I were building this system today from scratch, without all the work we did over the past 4 years, I would advise using a cache to store all the users that are currently online (say memcached or Redis), and at request time I would simply iterate over all of them and filter according to the criteria. The filtering by criteria can be cached independently and updated incrementally, and iterating over 5000 records is not necessarily very time consuming if the matching logic is simple.
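If it helps, here is a bare-bones sketch of that cache-based approach with Redis; the key name and the matches() predicate are placeholders for whatever the real criteria logic is:

# Sketch: keep the set of online user IDs in Redis and filter in application code.
import redis

r = redis.Redis()

def mark_online(user_id):
    r.sadd('online_users', user_id)

def mark_offline(user_id):
    r.srem('online_users', user_id)

def online_users_matching(criteria, matches):
    # Iterating over a few thousand IDs is cheap if the matching logic is simple.
    online_ids = (uid.decode() for uid in r.smembers('online_users'))
    return [uid for uid in online_ids if matches(uid, criteria)]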
Any robust solution will involve bringing your data close to Solr (in batch) and using it internally, NOT sending a very large request at search time, which needs to be a low-latency path.
You should develop your own filter; the filter would cache the online-users data every so often (say, every minute). If the data changes VERY frequently, consider implementing a PostFilter.
You can find a good example of filter implementation here:
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
"One solution could be to use a join in Solr, but the online data changes regularly and I can't reindex every time (say every 5-10 minutes; it should be at least an hour)."
I think you could very well use Solr joins, but after a little bit of improvisation.
The solution I propose is as follows:
You can have 2 Indexes (Solr Cores)
1. Primary Index (The one you have now)
2. Secondary Index with only two fields, "ID" and "IS_ONLINE"
You could now update the secondary index frequently (on the order of seconds) and keep it in sync with the table you have for storing online users.
NOTE: This secondary index, even if updated frequently, would not degrade performance, provided we make the necessary tweaks, like using appropriate queries during delta-import, etc.
You could now perform a Solr join on the ID field across these two indexes to achieve what you want. Here is the link on how to perform Solr joins between indexes/Solr cores.
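For reference, a sketch of what such a cross-core join query might look like from client code; the core names ('users', 'online_status') and the criteria in q are illustrative, while the ID/IS_ONLINE fields follow the two-index layout above:

# Sketch: query the primary core with a cross-core join against the small online core.
import requests

PRIMARY_SELECT = 'http://localhost:8983/solr/users/select'  # hypothetical core

params = {
    'q': 'city:london AND age:[25 TO 35]',   # whatever the search criteria are
    # Keep only documents whose id appears in the online core with IS_ONLINE:true.
    'fq': '{!join from=ID to=id fromIndex=online_status}IS_ONLINE:true',
    'wt': 'json',
}
resp = requests.get(PRIMARY_SELECT, params=params)
print(resp.json()['response']['numFound'])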

Writing then reading an entity does not fetch the entity from the datastore

I am having the following problem. I am now using the low-level Google Datastore API rather than JDO, so that I am in a better position to see exactly what is happening in my code. I am writing an entity to the datastore and shortly thereafter reading it back, using Jetty and Eclipse. Sometimes the written entity is not being read. This would be a real problem if it were to happen in production code. I am using the 2.0 RC2 API.
I have tried this several times; sometimes the entity is retrieved from the datastore and sometimes it is not. I am doing a simple query on the datastore just after committing a write transaction. (If I run the code through the debugger, things run slowly enough that the entity has a chance of being read back on the second pass.)
Any help with this issue would be greatly appreciated.
Regards,
The development server has the same consistency guarantees as the High Replication datastore on the live server. A "global" query uses an index that is only guaranteed to be eventually consistent with writes. To perform a query with strongly consistent guarantees, the query must be limited to an entity group, using an "ancestor" key.
A typical technique is to group data specific to a single user in a group, so the user can see changes to queries limited to the user's group with strong consistency guarantees. Another technique is to use fancier client logic to update the client's local view as soon as the change is submitted, so the user sees the change in the UI immediately while the update to the global index is in progress.
See the docs on queries and transactions.
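The original code is Java against the low-level API, but the idea is the same in any runtime; here is a small Python ndb sketch of the ancestor-query technique, with illustrative kind names:

# Sketch: write and query within one entity group so the read is strongly consistent.
from google.appengine.ext import ndb

class Note(ndb.Model):
    text = ndb.StringProperty()

parent_key = ndb.Key('Account', 'alice')   # illustrative entity group root

# Write inside the entity group.
Note(parent=parent_key, text='hello').put()

# A global query (Note.query()) might not see this write immediately, but an
# ancestor query is strongly consistent and will.
notes = Note.query(ancestor=parent_key).fetch()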
