Cloudant dashDB sync issue

We have created a warehouse with its source database in Cloudant.
We initially ran the schema discovery process on roughly 40,000 records; our Cloudant database contains around 2 million records.
The issue we are now facing is that many records end up in the _OVERFLOW table in dashDB (meaning they were rejected) with errors like "[column does not exist in the discovered schema. Document has not been imported.]"
It seems to me the issue is that our Cloudant database is actually the result of a dbcopy, so the docs contain partials. Those partials are created internally by Cloudant with keys we only know once they exist (for example "40000000-5fffffff" in the doc), so they were not picked up by the schema discovery process, and now every doc with an undiscovered partial is rejected by the Cloudant-dashDB sync.
Does anyone have any idea how to resolve this?

The best option to resolve this is a simple trick: feed the schema discovery algorithm exactly one document with the structure you want to create in your dashDB target.
If you can build such a "template" document ahead of time, have the algorithm discover that one and load it into dashDB. With the continuous replication from Cloudant to dashDB in place, you can then have dbcopy load your actual documents into the database that serves as the source for your Cloudant-dashDB sync.
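For illustration, a minimal sketch of what such a template document could look like, written as a Python dict you would insert into the discovery database. Every field name here is hypothetical and would need to be replaced with the full set of fields (including the dbcopy-generated range keys) your real documents can carry:

# Hypothetical "template" document: one document carrying every field that
# should become a column in the dashDB target, including the dbcopy-generated
# range key mentioned in the question. Feed only this document to the
# schema discovery run.
template_doc = {
    "_id": "schema-template-001",          # hypothetical document ID
    "customer_id": "",                     # hypothetical business fields
    "order_total": 0.0,
    "created_at": "1970-01-01T00:00:00Z",
    "40000000-5fffffff": 0,                # example dbcopy partial key from the question
}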

We initially ran the schema discovery process on roughly 40,000 records.
Our database contains around 2 million records.
Do all of these 2 million documents share the same schema? I believe not.
"[column does not exist in the discovered schema. Document has not been imported.]"
This means that during the initial scan of 40,000 records the application never saw a document containing that field.
Let's say the sequence of documents in your Cloudant db is:
500,000 docs that match schema A
800,000 docs that match schema B
700,000 docs that match schema C
If your discovery process checked just the first 40,000, it never got to schemas B and C.
I would recommend re-running the discovery process over all 2 million records. It will take time, but it guarantees that all fields are discovered.

Related

BigQuery to Snowflake using external tables

I've created a pipeline to export data from BigQuery into Snowflake using Google Cloud services. I'm seeing some issues because the original data source is Firebase. Firebase exports analytics data into a dataset called analytics_{GA_project_id}, and the events are sent to a partitioned table called events_{YYYYMMDD}. The pipeline runs smoothly, but after a few days I noticed during gap analysis that there is more data in BigQuery (approx. 3% at times) than in Snowflake. I raised a ticket with Firebase and Google Analytics for Firebase, and they confirmed there are events that are delayed by up to 3 days. I was looking at what other solutions I could explore, and Snowflake can also connect to an external table hosted in Google Cloud Storage (GCS).
Is there a way to create a replicated table that automatically syncs with BigQuery to push the data into an external table Snowflake can connect to?
The process I've set up relies on a scheduled query to unload any new daily table to GCS, which is cost-efficient... I could change the query to add an additional check for new data, but then every run would have to scan the entire 2 months' worth of data (or assume only a 3-day delay, which is not good practice), so this would add a lot more consumption.
I hope there is a more elegant solution.
I think you found a limitation of BigQuery's INFORMATION_SCHEMA.TABLES: It won't tell you when a table was updated last.
For example
SELECT *
FROM `bigquery-public-data`.stackoverflow.INFORMATION_SCHEMA.TABLES
shows that all these tables were created in 2016, but there's no way to see that they were updated 3 months ago.
I'm looking at the "Create the BigQuery Scheduled Script" you shared at https://github.com/budaesandrei/analytics-pipelines-snowflake/blob/main/firebase-bigquery-gcs-snowflake/Analytics%20Pipeline%20DevOps%20-%20Firebase%20to%20Snowflake.ipynb.
My recommendation is, instead of relying on a BigQuery scheduled function, to start doing this outside BigQuery so you have access to the whole API. Exports will be cheaper this way too.
For example, the command line bq show bigquery-public-data:stackoverflow.posts_answers will show you the Last modified date for that table.
Let's skip the command line, and let's look at the API directly.
tables.list can help you find all tables created after a certain date: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list
then iterating with tables.get will get you the last-modified time for each: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table
then a jobs.insert can extract that table to GCS for Snowpipe to import automatically: https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationExtract
make sure to give each export of the same table a new name, so Snowpipe picks it up as a new file
then you can use METADATA$FILENAME to deduplicate the imports of updated versions of the same table in Snowflake: https://docs.snowflake.com/en/user-guide/querying-metadata.html
All of this is available in the Python BigQuery SDK. You'll just need a different scheduler and runtime, but the best news is that you'll save a lot on BigQuery exports, as they are mostly free when done this way instead of from inside BigQuery.
https://cloud.google.com/bigquery/pricing#data_extraction_pricing
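To make that concrete, here is a minimal sketch using the Python BigQuery SDK. The project, dataset, and bucket names are assumptions, and you would still need to persist the last-sync timestamp between runs:

from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client(project="my-project")                          # hypothetical project
dataset = bigquery.DatasetReference("my-project", "analytics_123456")   # hypothetical Firebase dataset
last_sync = datetime(2021, 1, 1, tzinfo=timezone.utc)                   # persisted from the previous run

for item in client.list_tables(dataset):              # tables.list
    table = client.get_table(item.reference)          # tables.get exposes .modified
    if table.modified is None or table.modified <= last_sync:
        continue
    # jobs.insert with an extract configuration; a fresh path per export so
    # Snowpipe sees each run as a new file (deduplicate later via METADATA$FILENAME).
    destination = (f"gs://my-export-bucket/{table.table_id}/"
                   f"{table.modified:%Y%m%d%H%M%S}-*.parquet")          # hypothetical bucket
    extract_job = client.extract_table(
        table.reference,
        destination,
        job_config=bigquery.ExtractJobConfig(destination_format="PARQUET"),
    )
    extract_job.result()                               # wait for the export to finish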
I think the issue is just the nature of the GA-to-BigQuery load process. Google says that same-day data is loaded at least 3 times a day (no further detail given) and, if there is an error, it is rolled back.
Looking at our own data, there are usually 2 intraday tables on any given day, so the last fully loaded date is 2 days ago. If the final intraday load (which changes it from an intraday table to a final table) has an error, the last full day may be from 3 days ago.
I'm not familiar with Firebase myself, but one way to avoid this in BQ directly is to query only the final tables (those in the format ga_sessions_YYYYMMDD).
I've managed to find a solution by adding an additional column, gcs_export_timestamp, populated by a Scheduled Script. On every run it checks whether there is a new daily table and exports it, and then loops through the already-exported tables where gcs_export_timestamp is null (meaning new rows). I've described the process here.
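For reference, a rough sketch of that logic for a single already-exported table, expressed as two BigQuery statements run from Python. The project, dataset, table, and bucket names are made up, and the export/update pair is not atomic, so treat it as an illustration only:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")                  # hypothetical project
table = "my-project.analytics_123456.events_20240101"           # hypothetical daily table

# Export only the rows that have not been shipped to GCS yet ...
export_sql = f"""
EXPORT DATA OPTIONS (
  uri = 'gs://my-export-bucket/events_20240101/run1-*.parquet',  -- hypothetical bucket
  format = 'PARQUET'
) AS
SELECT * EXCEPT (gcs_export_timestamp)
FROM `{table}`
WHERE gcs_export_timestamp IS NULL
"""
client.query(export_sql).result()

# ... then stamp them so the next run only picks up late-arriving rows.
client.query(f"""
UPDATE `{table}`
SET gcs_export_timestamp = CURRENT_TIMESTAMP()
WHERE gcs_export_timestamp IS NULL
""").result()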

efficient way to get deleted documents

I'm searching for an efficient way to get the list of documents deleted in a Cloudant database.
Background: I have a Cloudant database containing 4 million records. The business logic also allows documents to be deleted. Data from this database is loaded daily into a SQL data warehouse, where deleted documents also need to be marked as deleted.
A full reload is not an option since it takes too long. Querying the _changes stream also doesn't seem to scale well when the Cloudant database contains this many documents.
I would use the _changes feed and apply a server-side filter function (http://guide.couchdb.org/draft/notifications.html) to eliminate all documents that don't have the _deleted property set. Your change-feed listener would therefore only be notified whenever a DELETE operation is reported, and network traffic is kept to a minimum.
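A minimal sketch of that approach over Cloudant's HTTP API from Python. The account, database name, and credentials are placeholders; the filter body is the usual CouchDB JavaScript:

import requests

CLOUDANT = "https://ACCOUNT.cloudant.com"       # placeholder account URL
DB = "mydb"                                     # placeholder database name
AUTH = ("apikey", "password")                   # placeholder credentials

# 1. Create a design document whose filter only lets deletions through.
ddoc = {
    "filters": {
        "deleted_only": "function(doc, req) { return doc._deleted === true; }"
    }
}
requests.put(f"{CLOUDANT}/{DB}/_design/sync", json=ddoc, auth=AUTH)

# 2. Read the changes feed through that filter; only DELETEs are returned.
resp = requests.get(
    f"{CLOUDANT}/{DB}/_changes",
    params={"filter": "sync/deleted_only", "since": "0"},
    auth=AUTH,
)
for change in resp.json()["results"]:
    print(change["id"], "was deleted")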

How to find delta between two SOLR collections

We are using LucidWorks Solr version 4.6.
Our source system stores data into two destination systems (one through a real-time route and another through batch mode). Data is ingested into Solr through the real-time route.
We need to periodically sync the data ingested into Solr with the data ingested into the batch system.
The design we are currently evaluating is to import the data from the batch system into another Solr collection, but we are really not sure how to sync both collections (i.e. the one with real-time data and the one populated through batch import).
I read through data import handlers, but those would override the existing data in Solr. Is there any way to identify the delta between the two collections and ingest only that?
There is no good way; there are a couple of things you can do:
When data comes into the real-time system, store an import timestamp, then run a range query to pull in only the new documents (see the sketch after this list). I think newer versions of Solr already have a field for this.
Log the IDs of documents going into the first Solr and then index those.
Use a separate queue for the other collection.
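A minimal sketch of the first option, assuming the schema has a date field (called timestamp here) stamped with default="NOW" at index time. The core name and last-sync value are placeholders:

import requests

SOLR = "http://localhost:8983/solr/realtime_collection"   # placeholder core name
last_sync = "2014-01-01T00:00:00Z"                         # persisted from the previous run

# Pull only document IDs indexed since the last sync.
params = {
    "q": "*:*",
    "fq": f"timestamp:[{last_sync} TO NOW]",
    "fl": "id",
    "wt": "json",
    "rows": 1000,
}
resp = requests.get(f"{SOLR}/select", params=params)
new_ids = [doc["id"] for doc in resp.json()["response"]["docs"]]
print(new_ids)   # feed these into the batch collection's ingest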

Is google Datastore recommended for storing logs?

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't turn the process into an IT nightmare: load balancing, sharding, servers, user authentication, all in one place with almost zero configuration.
However, I wonder whether the Datastore model is the right one for storing logs. Each log entry would be saved as a single document; each client uploads its documents on a daily basis and can produce 100K log entries per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions: how many log entries per second will I be able to insert? If 100K won't fit into the 60-second window, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second: is a transaction considered a single insert?
Post-analysis: text search, finding similar log entries across clients. How flexible and efficient is Datastore for these queries?
Real-time data fetch: getting all the most recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching the data from ES.
Thanks!
It's a bad idea to use Datastore, and even worse if you use entity groups with parent/child relations, as a comment mentions when comparing performance.
Those numbers do not apply here anyway; Datastore is simply not designed for what you want.
BigQuery is what you want. It's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
I do not agree. Datastore is a fully managed NoSQL document-store database; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as a MongoDB-style log-analysis use case on Google Cloud.
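For illustration, writing a log entry to Datastore needs no upfront schema; a minimal sketch with the Python client, where the project, kind, and property names are made up:

from datetime import datetime, timezone
from google.cloud import datastore

client = datastore.Client(project="my-project")        # hypothetical project

# No schema declared anywhere: just build an entity with whatever properties
# this log line happens to have and put it.
entity = datastore.Entity(key=client.key("LogEntry"))  # hypothetical kind
entity.update({
    "client_id": "client-42",
    "level": "ERROR",
    "message": "disk quota exceeded",
    "logged_at": datetime.now(timezone.utc),
})
client.put(entity)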

Indexing SQLServer data with SOLR

What is the best way of syncing database changes with Solr incremental indexing? What is the best way of getting MS SQL Server data indexed by Solr?
Thanks so much in advance.
Solr works with plugins. You will need to create your own data importer plugin that is called periodically (based on notifications, an elapsed time period, etc.). You point your Solr configuration at the class that will be called upon update.
Regarding your second question: I used a text file that holds a date/time value. Each time Solr started, it looked at that file and retrieved from the DB the data that had changed from that point on (the file is updated whenever the index is updated).
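A minimal sketch of that pattern in Python, where the connection string, table, column, and core names are all assumptions:

from datetime import datetime, timezone
import pyodbc
import requests

STATE_FILE = "last_indexed.txt"                                          # holds the last-indexed timestamp
SOLR_UPDATE = "http://localhost:8983/solr/products/update?commit=true"   # placeholder core
CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=host;DATABASE=db;UID=user;PWD=pass"

# 1. Read the timestamp written by the previous run.
with open(STATE_FILE) as fh:
    last_run = fh.read().strip()

# 2. Fetch rows changed since then (assumes a last_modified column on the table).
cursor = pyodbc.connect(CONN_STR).cursor()
cursor.execute(
    "SELECT id, name, description FROM dbo.Products WHERE last_modified > ?",
    last_run,
)
docs = [{"id": str(r.id), "name": r.name, "description": r.description} for r in cursor.fetchall()]

# 3. Post the changed rows to Solr, then persist the new high-water mark.
if docs:
    requests.post(SOLR_UPDATE, json=docs)
with open(STATE_FILE, "w") as fh:
    fh.write(datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S"))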
I would suggest reading a good Solr/Lucene book or guide, such as lucidworks-solr-refguide-1.4, before getting started, so you can be sure your architectural solution is correct.