Trying to dive into a task involving dimensional models and data analysis, but I'm stuck on the fundamental approach.
Context: we have ~100GB of data generated daily, currently stored on an AWS server that was created by another team for their own analysis purposes. Our team needs to do a different type of analysis on the data, but we aren't allowed to modify the existing schema. Our plan is to clone the data to our own server and run a daily pull to keep it updated. I'm trying to figure out the proper way to approach this.
My initial intuition is to use boto3 to download the data, then process it with PySpark, parsing it directly into a clone of the dimensional model and then tweaking it as needed for our analysis purposes.
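Roughly, what I have in mind is a sketch like the following (the bucket, paths, file format, and column names are just placeholders for illustration):

    # Sketch of the boto3 + PySpark-in-local-mode idea.
    # Bucket name, prefix, and columns are hypothetical placeholders.
    import boto3
    from pyspark.sql import SparkSession

    s3 = boto3.client("s3")
    s3.download_file("other-teams-bucket", "exports/2024-01-01/facts.parquet",
                     "/data/staging/facts.parquet")

    # local[*] uses every core on the single server; no cluster required.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("daily-pull")
             .getOrCreate())

    facts = spark.read.parquet("/data/staging/facts.parquet")

    # Reshape into our own copy of the dimensional model, then persist it
    # wherever our analysis store lives (Parquet on local disk here).
    facts.select("order_id", "customer_id", "order_ts", "amount") \
         .write.mode("append").parquet("/data/warehouse/fact_orders")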
One team member brought up the point that PySpark (or more specifically Spark) is optimized for distributed processing, i.e. clusters of hardware sharing the load. So how much efficiency would we even retain by running PySpark on a single server? Other people on the team are used to simply writing scripts in Java to manually parse through the files as text, but that seems extremely inefficient.
So my question is: what tool should we use? PySpark, accepting that it's fine even without a cluster? Some other tool that is optimized for running a dimensional schema on a single server? Or something else entirely?
Just hoping to be pointed in the proper direction, thanks!
I am working on migrating my old indexing tool to Solr (version 7), but I am not sure how to index my files into Solr.
The data in our system is located in an Oracle DB, MySQL, and Cassandra. Updates in these DBs are not very frequent (2-3 times in 24 hours), and these will be the sources for my Solr index.
In one of the collections I will have around 300k-400k records and in another somewhere around 5k.
I could come up with two methods:
1. Create an ETL pipeline from the different data sources using Apache Storm.
2. Use Kafka Connect sources and sinks.
Which of the two is good for a system like ours? Or are both methods overkill for a system like ours?
The size of the data is small enough that you should just do whatever you're comfortable with - either use an existing tool or write a small indexer in a language you have experience with. There is no need to overthink this at that stage.
And outside of that - it's usually impossible to make a recommendation without having in-depth knowledge of your situation, except for very specific questions.
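For example, a small indexer along those lines could be as simple as the following sketch in Python (assuming the pysolr and mysql-connector libraries; the connection details, query, and collection name are all made up):

    # Minimal sketch of a "small indexer": pull rows from one source DB
    # and push them into Solr. Everything below is a placeholder.
    import pysolr
    import mysql.connector

    solr = pysolr.Solr("http://localhost:8983/solr/products", always_commit=True)

    db = mysql.connector.connect(host="db-host", user="reader",
                                 password="secret", database="catalog")
    cursor = db.cursor(dictionary=True)
    cursor.execute("SELECT id, name, description FROM products")

    docs = [{"id": str(row["id"]),
             "name": row["name"],
             "description": row["description"]} for row in cursor]

    # A few hundred thousand documents can also be sent in batches.
    solr.add(docs)

Running something like that from cron 2-3 times a day would match your update frequency without any extra infrastructure.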
Does anyone know how to confirm, via a SQL query, whether a branch was merged? In the long run, I want to create an on-demand SSRS report so this can be reviewed after a series of releases has been deployed. I know that there are specific Command bit values taken from tbl_Version (I used this to identify a renamed branch), but I haven't been able to identify the bit values that indicate a branch was merged.
Any ideas?
These tables are not supported for reporting. Any solution you build should use the TFS Object Model to load the data into the warehouse, or use the client object model directly to retrieve the data.
What you're trying to do is really hard since the data isn't stored in the warehouse. You can query merge data by calling the VersionControlServer.QueryMerges method or by going through each Changeset individually after calling VersionControlServer.QueryHistory.
Building a DatawarehouseAdapter is harder still, since there is little documentation available and since you need a deep knowledge of the TFS Object Model, the warehouse structure, and Analysis Services in general. There is a Rangers project underway to provide additional guidance on this subject, but until that's done you will mostly find a scattered few blog posts and a very poor example.
You might be able to find good pointers towards building something outside of the scope of Reporting Server in open source projects, such as the TfsChangeLog project or the Community TFS Build Manager.
I've seen some code read large amounts of data from .mat files instead of doing queries on a database. What are the benefits of doing this as opposed to using a database? Is it possible to easily move the .mat file contents into a database and vice versa?
Reading data from a .mat file is also a kind of "database", in the sense that you read your data from a file.
Eventually, you will have to implement queries yourself and take care of many other issues.
Also, it is not a scalable solution, which means that for a large amount of data it won't work well.
Of course, if you have a small amount of data and only basic queries, the fuss of setting up a database and using SQL isn't worth it.
Regarding your second question, it really depends on the data you have there.
I agree with Andrey. It depends on the data and what you want to do with it. I created a small program in MATLAB that queries a relatively small .mat database, but as the database and the number of users grew, performance went down.
In light of this we decided to use a MySQL database. I created a small Java application that talks to the database and imported it into MATLAB to move data between MATLAB and MySQL, but I had to create specific queries for my data. If someone can show me a better solution I would be grateful.
Perhaps it wouldn't be such a bad idea to write a general script that moves .mat data between MATLAB and a SQL database: store the data in a structure and use that to create the tables.
If you want to discuss something like this further via email I would be happy to. Maybe we can learn a thing or two from each other.
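As a rough illustration of that general-script idea, here is a sketch in Python rather than MATLAB (assuming scipy and the built-in sqlite3; the variable name, table layout, and file names are made up):

    # Sketch: copy a numeric matrix from a .mat file into a SQL table and back.
    # 'measurements' (an N x 3 float matrix) and the table are hypothetical.
    import sqlite3
    import numpy as np
    from scipy.io import loadmat, savemat

    mat = loadmat("experiment.mat")
    data = mat["measurements"]          # N x 3 numeric array

    conn = sqlite3.connect("experiment.db")
    conn.execute("CREATE TABLE IF NOT EXISTS measurements (x REAL, y REAL, z REAL)")
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)",
                     [tuple(map(float, row)) for row in data])
    conn.commit()

    # ...and back out again into a new .mat file.
    rows = conn.execute("SELECT x, y, z FROM measurements").fetchall()
    savemat("from_db.mat", {"measurements": np.array(rows)})

The same shape of script works against MySQL with a different driver; the awkward part is mapping arbitrary MATLAB structures to tables, which is why a fully general mover is harder than it looks.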
We are overhauling our product by moving completely from the Microsoft/.NET family to open source (the reasons include cost cutting and an exponential increase in data).
We plan to move our data model completely from SQL Server (relational data) to Hadoop (the famous key-value ecosystem).
In the beginning, we want to support both versions (say v1.0 and the new v2.0). In order to maintain data consistency, we plan to sync the data between both systems, which is a fairly challenging and error-prone task, but we don't have any other option.
A bit confused about where to start, I am looking to the community of experts.
Any strategy/existing literature or any other kind of guidance in this direction would be greatly helpful.
I am not entirely sure how your code is structured, but if you currently have a data or persistence layer, or at least a database access class that all your SQL is executed through, you could override the save functions to write changes to both databases. If you do not have a data layer, you may want to consider writing one before starting the transition.
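As a sketch of what that dual-write data layer could look like (shown in Python purely to keep it short; the repository class and the two store objects are hypothetical stand-ins for your own SQL Server and Hadoop access code):

    # Sketch of the dual-write pattern: one repository facade that writes
    # every change to both the old and the new store.
    import logging

    class CustomerRepository:
        def __init__(self, sql_store, hadoop_store):
            self.sql_store = sql_store        # existing relational access layer
            self.hadoop_store = hadoop_store  # new key-value / HDFS access layer

        def save(self, customer):
            # Write to the system of record first; only mirror on success.
            self.sql_store.save(customer)
            try:
                self.hadoop_store.put(customer["id"], customer)
            except Exception:
                # Don't fail the v1.0 write path; log and reconcile later.
                logging.exception("sync to Hadoop failed for %s", customer["id"])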
Otherwise, you could add triggers in MSSQL to update Hadoop; I'm not sure what you can do on the Hadoop side to keep MSSQL in sync.
Or, you could have a process that runs every x minutes and manually syncs the two databases.
Personally, I would try to avoid maintaining two databases of record. Moving changes from a new, experimental database to your stable database seems risky; you stand the chance of corrupting your stable system. Instead, I would write a converter to move data from your relational DB to Hadoop. Then every night or so, copy your data into Hadoop and use it for the development and testing of your new system. I think test users would understand if you said your beta version is just a test playground and won't affect your live product. If you plan on making major changes to your UI and fear some users will not want to transition to 2.0, then you might be trying to tackle too much at once.
Those are the solutions I came up with... Good luck!
Consider using a queuing tool like Flume (http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/) to split your input between both systems.
I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want anything complicated. So far, I'm leaning towards SQLite: it's small, runs within the client process, and is claimed to be fast. Besides, I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion on whether it is a good choice given my requirements. Also, is there any equivalent alternative that I should try before making a decision?
Edit:
I failed to mention earlier that the CAD objects I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors, etc.) and save them back to the library. I also expect users to add their own newly created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism, called serialization, for writing data structures to a file and reading them back in again. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program, and use one of these serialization libraries to populate instances of that class or data structure.
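For instance, a minimal sketch of that with pickle (the Shape class, its fields, and the file name are made up; in C# the built-in serialization support plays the same role):

    # Sketch: serialize a list of simple 3D objects to a file and read it back.
    import pickle
    from dataclasses import dataclass

    @dataclass
    class Shape:
        name: str
        color: str
        vertices: list   # list of (x, y, z) tuples

    library = [Shape("unit_cube", "grey",
                     [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
                      (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)])]

    with open("library.pkl", "wb") as f:
        pickle.dump(library, f)

    with open("library.pkl", "rb") as f:
        loaded = pickle.load(f)
    print(loaded[0].name, len(loaded[0].vertices))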
edit: The requirement that the structures be mutable doesn't really change my answer - I still think that serialization and deserialization is the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshal SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
I suggest you consider using H2; it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these objects and their coordinates in an XML file, i.e. use a distinct XML file for each object? You could place these XML files in a directory. This would be a simple structure.
I would not use a SQL database. You can easily describe every 3D object with an XML file. Put these files in a directory and pack (zip) them all. If you need easy access to the metadata of the objects, you can generate an index file (containing only the names or descriptions) so that not all objects have to be parsed and loaded into memory (nice if you have something like a library manager).
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse and write, human readable, and doesn't need much space when zipped.
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like SQLite more for applications where you need good search tools over a huge number of data records.
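A rough sketch of that XML-plus-zip layout (in Python for brevity; the element names and the example object are invented):

    # Sketch: write one XML file per object, plus a small index, into a zip.
    import zipfile
    import xml.etree.ElementTree as ET

    def object_to_xml(name, vertices):
        root = ET.Element("object", name=name)
        for x, y, z in vertices:
            ET.SubElement(root, "vertex", x=str(x), y=str(y), z=str(z))
        return ET.tostring(root, encoding="unicode")

    objects = {"unit_triangle": [(0, 0, 0), (1, 0, 0), (0, 1, 0)]}

    with zipfile.ZipFile("library.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for name, vertices in objects.items():
            zf.writestr(f"objects/{name}.xml", object_to_xml(name, vertices))
        # Index file: just the names, so a library manager can list objects
        # without parsing every XML file.
        zf.writestr("index.txt", "\n".join(objects))

    # Reading a single object back later:
    with zipfile.ZipFile("library.zip") as zf:
        root = ET.fromstring(zf.read("objects/unit_triangle.xml"))
        coords = [(v.get("x"), v.get("y"), v.get("z")) for v in root.iter("vertex")]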
For the specific requirement, i.e. providing a library of objects shipped with the application, a database system is probably not the right answer.
The first thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
The second thing is that the data you're shipping is immutable - for this purpose, therefore, you don't need the capabilities of a relational DB, just the ability to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely, as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET), etc.
Obviously, if SQLite stores its data in a single file per database and you have other reasons to need the capabilities of a DB in your storage system, then yes - but I'd want to think about the utility of the DB to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server-compatible.