How to keep BigQuery schema in sync with code

I am developing various services which talk directly to BigQuery by streaming rows into the database. Right now I am updating the schema directly from the Google Cloud UI, which has been causing issues, as you can imagine, due to forgetfulness!
I would like to understand how best to keep code & schemas aligned for what are still fast evolving services and schemas.
My current ideas are:
use something like Terraform, but I am unsure how this works with live tables which need updating or migrating
add code to the service to check / set the schema, which would at least throw errors if not automate the process
Thanks in advance!
EDIT:
To give more clarity, as requested in the comments: we are using Cloud Run microservices to stream rows into BigQuery; the services are written in Python/Node. Their primary goal is to do some light transformation on the data and store it in BQ.
Not really sure what more to add; my ideal scenario is that we have something in the code which also defines, or at least checks, the schema, to keep the code and DB in sync.

As with any database, you have to follow some rules and best practices to avoid errors and conflicts. For example, avoid updating the schema manually; choose to always do it with code (which is better because you can follow schema changes through your Git history).
Then you can have a side process that checks the schema and applies the newer changes, or do this at startup (only if you aren't running serverless and app start-up duration isn't a concern for you).
Terraform is perfect for deploying infrastructure, but much more limited when it comes to updating/patching existing components. I don't recommend it for this.
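To illustrate the "check / set the schema from code" idea, here is a minimal sketch using the google-cloud-bigquery Python client. The table ID and desired schema are hypothetical placeholders, and note that BigQuery only accepts additive changes this way (new NULLABLE or REPEATED columns); renames and type changes still need a proper migration.

```python
from google.cloud import bigquery

# Hypothetical table and the schema the service expects.
TABLE_ID = "my-project.my_dataset.events"
DESIRED_SCHEMA = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]


def ensure_schema(client: bigquery.Client) -> None:
    """Add any columns the code expects but the live table is missing."""
    table = client.get_table(TABLE_ID)
    existing = {field.name for field in table.schema}
    missing = [f for f in DESIRED_SCHEMA if f.name not in existing]
    if missing:
        table.schema = list(table.schema) + missing
        client.update_table(table, ["schema"])  # patches only the schema


if __name__ == "__main__":
    ensure_schema(bigquery.Client())
```

Running this at service start-up (or from CI) gives you the Git-tracked schema definition described above.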

Related

Best approach to interact with the same database table from more than one microservice

I have a situation where I need to add/update/retrieve records in the same database table from more than one microservice. I can think of the three approaches below; please help me pick the most suitable one.
Having a dedicated microservice, say database-data-manager, which interacts with the database to add/update/retrieve data; all the other microservices call the endpoints of database-data-manager when they need to add/update/retrieve data.
Having a Maven library called database-data-manager that all the other microservices use for their DB interactions.
Having the same code (copy-paste) in all the applications to take care of DB interactions.
Approach 1 seems expensive, as we need to host a dedicated application for basic functionality.
Approach 2 would reduce boilerplate code but makes library versions difficult to manage.
Approach 3 would cause a lot of boilerplate code and maintenance effort to keep similar code in all the microservices.
Please suggest. Thanks in advance.
A strict definition of "microservice" would include the fact that it's essentially self-contained... and that would include any data storage it might need. So you really have a collection of services talking to a common database. Semantics aside...
Option 1 sounds like it's on the right track: you need something sitting between the microservices and the database. This could be a cache or a dedicated proxy service. Say you have an old legacy system which is really fragile; controlling data in/out through a more capable service acting as a proxy is a well-proven pattern.
Such a proxy might do a bulk read of the database, hold the data in memory to service high volumes of reads, and handle updates.
Updating is non-trivial and there are various options:
The service's cached data becomes the pseudo-master: updates are applied to the cached data first, then go into a queue to be applied to the underlying database.
The service's data is used only for reads; updates are applied to the database first, and only if the update is successful is it then applied to the cached data.
Option 1 is great for performance, on the assumption that the proxy service is really good at managing the data and satisfying service requests. But, depending on how you implement it, it might be vulnerable to outages, in which case you might lose any data that has made it into the cache but not yet into the pipeline that gets it into the database.
Option 2 is good for ensuring a solid master set of data, but there's the risk that consuming services might read cached data that is out of date because it has just been updated in the database.
In terms of implementation, a queue of some sort to handle getting updates to the database might be something you want to consider, as it would give you a place to control how updates (and which updates) get to the database.
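A rough sketch of that proxy-plus-queue shape, in Python for brevity; the db object with load_all()/put() methods is a hypothetical stand-in for your actual data layer, not a real library API:

```python
import queue
import threading


class DataProxy:
    """Sketch of a proxy that serves reads from memory and queues writes."""

    def __init__(self, db):
        self._db = db                      # hypothetical object with load_all()/put()
        self._cache = dict(db.load_all())  # bulk read on startup
        self._writes = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def read(self, key):
        return self._cache.get(key)

    def write(self, key, value):
        self._writes.put((key, value))     # callers don't block on the database

    def _drain(self):
        while True:
            key, value = self._writes.get()
            self._db.put(key, value)       # apply to the master store first...
            self._cache[key] = value       # ...then refresh the cached copy
```

This follows the "database first" option above; for the "cache as pseudo-master" variant, write() would update the cache immediately and only queue the database write, with the durability trade-off already described.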

How to set up deployments in Azure so that they use different databases depending on the environment?

You can easily swap two deployments between the staging and production environments in the Azure Management Portal by swapping their VIP. When working on a staging version of the services, we want to use a staging database as well, so we don't risk clobbering actual customer data. However, after swapping the staging and production services, the now-production (and formerly staging) deployment should obviously work on the production database.
So essentially the database to use would depend on whether the instance runs in the Staging or Production environment. Is there a good way of achieving that? Relying on the VIP and hard-coding the database switching based on that is probably not the best idea, I guess.
My recommendation would be to stop using the "staging slot" of a service for the purpose a traditional "staging environment" serves. When I'm speaking to folks about Windows Azure, I strongly recommend they use the staging slots only to smoke test a new deployment before it goes live. If they want a more protracted sort of testing, the kind many of us are used to having on-premises, then use a separate service and possibly even a separate subscription (the latter is great if you want cost transparency).
All this said, your only real options are to have a second service configuration, specific to production, that you update to before you execute the VIP swap, or to write some code that allows the service to detect which slot it's in and pull the appropriate one of two configuration settings.
However, as I outlined in the first paragraph, I think there's a better way to do things. :)
In a recent release of Azure Websites, the story here has changed. You may now specify that any app setting or connection string is a "slot setting", pinning it to the particular slot. To solve your issue, you would simply set the connection string(s) in each slot and take care to check 'Slot Setting'.
I'm less clear on whether this is an advisable approach now. Database schema migration and rollback aren't baked in, and I'm unsure how to handle that correctly. Also, only app settings and connection strings work this way, so, for example, system.net.mail settings cannot be pinned to a slot. For that, you'd need to change code to get the mail server info, etc. from app settings, or else use some other approach.
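If you do use slot settings, the application code simply reads one named setting and each slot supplies its own value. A minimal sketch, assuming the setting reaches the process as an environment variable (as App Service app settings do); the setting name is hypothetical:

```python
import os

# Each slot configures its own value for this name and marks it as a
# "slot setting" so it does not travel with a swap.
conn_str = os.environ.get("DATABASE_CONNECTION_STRING")
if conn_str is None:
    raise RuntimeError("DATABASE_CONNECTION_STRING is not configured for this slot")
```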
Re: "When working on a staging version of the services we want to use a staging database as well so we don't risk clobbering actual customer data." There is not a built-in way to do this.
If you wish to test without risk to production data, consider doing this testing in another Azure account - one that doesn't even have access to the production database. Then, when you think the system is tested and ready to go live, only then bring it up into the staging slot next to your production instance for a final smoke test.
I can imagine scenarios where you'd also want to run through a few scenarios on the staging instance before doing a VIP swap, but don't want to pollute production data. For this, many companies use special accounts: data associated with these accounts is known (or marked somehow) to be not from real customers, so it can be skipped in reporting, billing and such.
Re: "Relying on the VIP and hard-coding the database switching based on that is probably not the best idea, I guess." If by hard-coding, you mean reading it from a config file, that is probably not a bad idea, if you use an approach as mentioned above. I have heard of some folks going with a "figure out if we are in a staging slot and do something different in the code" approach, but I rather recommend what I described above.

Interacting with external DB via Django

I'm working on a Django app that interacts with an existing database (think ERP/transaction-type data) to perform analysis. There will be minimal/no updating of the existing database; it is mainly reading data in. It's just a simple, small setup, so there are no replication etc. issues to think about regarding updates.
The analysis would result in new records created within the Django Model.
Currently the existing DB runs on PostgreSQL.
I am aware of Alex Gaynor's GSoC multi-db code which, from what I gather, is ticket #1142, which has no patch against trunk yet.
So from what I gather there are three options I can see:
1) Point Django's DB at the same DB as the ERP and let it create the tables it needs within it (all the ERP tables have a prefix, so there would be no collision); however, this strikes me as hacky and a recipe for disaster.
2) Create a new DB for Django and automatically copy over the required tables. Better, but I can't update it, though I can probably live with that.
3) Try out the multidb patch.
Are there other better ideas out there? I'm leaning towards at least trying out the multidb patch but I'm a little worried about stability and forwards compatibility.
How about not using Django's ORM layer at all for that DB? If the interaction is minimal, you might do it faster by just using direct SQL with the appropriate PostgreSQL Python library.
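For example, a read-only helper along these lines; this is just a sketch, the connection details, table and column names are hypothetical, and any DB-API PostgreSQL driver (psycopg2 here) would do:

```python
import psycopg2

# Hypothetical connection details for the existing ERP database.
conn = psycopg2.connect(
    host="erp-db.example.com", dbname="erp", user="readonly", password="secret"
)


def fetch_transactions(since):
    """Read-only query against the ERP schema, bypassing Django's ORM."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, amount, created_at FROM transactions WHERE created_at >= %s",
            (since,),
        )
        return cur.fetchall()
```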

Using a common database for collaborative development

Some of the people in my project seem to think that using a common development database, with everyone connecting to it, is the best thing. I think that it isn't, and that each developer having his own database (with periodically updated data dumps) is best. Am I right or wrong? Have you encountered any problems with either of these approaches?
Disk space and CPU should be cheap enough that every developer can run their own instance of the database, with an automated build under version control. This is needed to allow developers to be bold in hacking on the database, in isolation from any other developer's concurrent hacking.
The caveat being, of course, that any changes they make to their private instance are useless to anyone else unless it can be automatically applied during the build process. So there needs to be a firm policy that application code can't depend on any database state unless that state is represented by version-controlled, unit-tested changes to the DDL.
For an excellent guide on the theory and practice of treating the database definition as another part of the project code, and coordinating changes and refactorings, see Refactoring Databases: Evolutionary Database Design by Scott W. Ambler and Pramod Sadalage.
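As a concrete, if simplified, sketch of that policy: keep every schema change as a numbered SQL file in version control, and build each developer's private database by replaying them in order. The db/migrations layout is hypothetical, and SQLite stands in for whatever RDBMS you actually use:

```python
import pathlib
import sqlite3

# Each schema change lives in the repo, e.g. db/migrations/001_create_users.sql.
MIGRATIONS = pathlib.Path("db/migrations")


def build_database(path="dev.sqlite3"):
    """Create or update a private dev database by replaying migrations in order."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS applied_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM applied_migrations")}
    for script in sorted(MIGRATIONS.glob("*.sql")):
        if script.name not in applied:
            conn.executescript(script.read_text())
            conn.execute("INSERT INTO applied_migrations VALUES (?)", (script.name,))
    conn.commit()
    return conn
```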
I like having my own copy of the database for development, because it gives you the flexibility to rapidly change things without worrying how it will impact others.
However, if all the developers are hacking away on their own copy of the database, it becomes more and more difficult to merge everyone's work together in the end.
I think you can get the best of both worlds by letting developers work on a local copy during day-to-day development, but each developer should probably merge their work into a common copy on a pretty regular basis. Writing a lot of unit tests helps too.
We share a single database amongst all our developers (20-odd), but we've got it structured so that everyone has their own tables.
You don't need a separate database per developer if you structure the application right. It should be configurable which database or table-prefix it uses anyway so you can easily move it between instances (unit test, system test, acceptance test, production, disaster recovery and so on).
The advantage to using a single database is that the cost of maintenance is amortized. You don't have your DBAs trying to handle a lot of databases (or, if you're a small-DB shop, you don't have every developer trying to maintain their own database when they're better utilized in developing).
Having a single point of failure is not a good thing, is it?
I prefer a single, shared database. But it's very dependent on the situation and the applications being developed.
What works for me may not work for you. Go with your gut.
If you are working with Hibernate or any Hibernate-based platform, you can configure your database to be created when you start your server (the create-drop option). This is very useful when you are adding new attributes to your classes. If this is the case, each developer must have his own copy of the DB.
If you are not changing the DB structure at all, then you can use a single shared DB.
Even in the second case it is not a must. I prefer to have my own DB where I can do whatever I want. On the other hand, remember that some queries can take a lot of time, and this will affect your whole team if you are sharing a DB.

Django database scalability

We have a new Django-powered project which is expected to have heavy-traffic characteristics (meaning heavy DB interaction), so we need to consider database scalability in advance. After some research, the following questions are still not clear to us:
coarse-grained: how to map one DB table (a Django model) to a specific DB (maybe on another server)?
fine-grained: how to map a group of table rows to a specific DB (so-called sharding, possibly also on another DB server)?
how to direct writes and reads to different DBs? (This will be helpful for future MySQL master/slave replication.)
We are looking for a solution that is:
transparent to the application code (meaning we don't need additional code in views.py)
at the ORM level (meaning it only needs to be specified in models.py)
compatible with the current (or future) Django release (to keep changes minimal when upgrading Django)
I'm still doing the research and will share in this thread later if it bears fruit.
Hope anyone with experience can answer. Thanks.
Don't forget about caching either. Using memcached to relieve your DB of load is key to building a high performance site.
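A minimal sketch of that pattern with Django's low-level cache API, assuming a memcached backend is configured under CACHES in settings (the key and loader here are placeholders):

```python
from django.core.cache import cache


def cached(key, loader, timeout=300):
    """Return a cached value, falling back to the DB loader on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()               # e.g. lambda: SomeModel.objects.get(pk=42)
        cache.set(key, value, timeout)
    return value
```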
As Alex said, Django core doesn't support your specific requests for those features, though they are definitely on the to-do list.
If you don't do this in the application layer, you're basically asking for performance trouble. There aren't any really good open source automation layers for this sort of task, since it tends to break SQL axioms. If you're really concerned about it, you should be coding the entire application for it, not simply hoping that your ORM will take care of it.
There is the GSoC project by Alex Gaynor that will, in future, allow using multiple databases in one Django project. But for now there is no working cross-RDBMS solution.
There is no solution for that right now either.
And again, there is no cross-RDBMS solution. But if you are using MySQL you can try an excellent third-party Django application called mysql_replicated, which makes it easy to set up a master/slave replication scenario.
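The GSoC multi-database work mentioned above later landed in Django (1.2+) as database routers, which handle the read/write split at the ORM level. A minimal sketch; the "replica" alias and router module path are hypothetical:

```python
# settings.py would define DATABASES entries for "default" (the master) and
# "replica" (the slave), plus:
#   DATABASE_ROUTERS = ["myproject.routers.ReadWriteRouter"]


class ReadWriteRouter:
    """Send ORM reads to the replica and writes to the master."""

    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases hold the same data, so cross-alias relations are fine.
        return True
```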
Here, for some reason, we are using Django with SQLAlchemy. Maybe a combination of Django and SQLAlchemy also works for your needs.
