When I use Power BI Import (Connect to SQL Server and select Import) to get data to Power BI Desktop, is there a data volume limitation?
I know for DirectQuery there is no data volume limitation. What about Import? Is it 1GB?
The Import Data method has a size limitation of 1 GB per model, so without Power BI Premium this method is not very scalable.
A common first assumption after reading the explanation of Import Data above is that if you have a 100 GB database and import it into Power BI, you will end up with a 100 GB file in Power BI. This is not true. Power BI leverages the xVelocity compression engine and uses column-store in-memory technology, which compresses data and stores it in a compressed format. Sometimes you might have a 1 GB Excel file, and when you import it into Power BI, your Power BI file ends up at only 10 MB. This is mainly because of Power BI's compression engine. However, the compression rate is not always that high; it depends on many things: the number of unique values in a column, sometimes the data types, and many other factors. I will write a post later that explains the compression engine in detail.
The short version: Power BI stores compressed data, so the size of the data in Power BI is usually much smaller than its size in the data source.
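As a rough illustration of why compression depends on the number of unique values, here is a small Python sketch of dictionary encoding, the core idea behind column-store compression. This is an illustrative model only, not Power BI's actual xVelocity engine, and the 64-bit dictionary-entry cost is an assumption:

```python
# Rough sketch of dictionary encoding, the core idea behind column-store
# compression. Illustrative only -- NOT Power BI's actual xVelocity engine.
import math

def dictionary_encoded_bits(values):
    """Bits needed to store a column as a dictionary plus per-row indexes."""
    unique = set(values)
    # Each row stores an index into the dictionary of unique values.
    bits_per_index = max(1, math.ceil(math.log2(len(unique))))
    # Assume each dictionary entry costs 64 bits (an illustrative figure).
    dictionary_bits = len(unique) * 64
    return dictionary_bits + len(values) * bits_per_index

rows = 999_999
low_cardinality = ["red", "green", "blue"] * (rows // 3)  # 3 unique values
high_cardinality = list(range(rows))                      # all values unique

low = dictionary_encoded_bits(low_cardinality)
high = dictionary_encoded_bits(high_cardinality)
print(low < high)  # True: the low-cardinality column compresses far better
```

Under this model the three-color column needs about 2 bits per row, while the all-unique column needs a 20-bit index per row plus a dictionary as large as the data itself, which mirrors why high-cardinality columns compress poorly.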
http://radacad.com/directquery-live-connection-or-import-data-tough-decision
Import Mode Limitations in Power BI
Depending on the imported data size, a lot of memory and disk space is consumed:
on your machine (during development),
on the online/on-premises server (once published).
The Power BI file size can’t be bigger than 1 GB.
You will get an error if the file size exceeds 1 GB; in that case, you need Power BI Premium, which allows file sizes up to 50 GB.
No recent data without a refresh.
For more details, please, check Power BI: Switch from Import to DirectQuery Mode
I have a 16GB CSV that I have imported into Power BI desktop. The workstation I am using is an Azure VM running Windows Server 2016 (64GB Memory). The import of the file takes a few seconds, however, when I try to filter the data set in query editor to a specific date range, it takes a fairly long time (it is still running and has been around 30 minutes so far). The source file (16GB CSV) is being read from a RAM disk that has been created on the VM.
What is the best approach/practice when working with data sets of this size? Would I get better performance importing the CSV into SQL Server and then using DirectQuery when filtering the data set to a date range? I would have thought it would run fairly quickly with my current setup, as I have 64 GB of memory available on that VM.
When the data size is significant, you also need appropriate computing power to process it. When you import these rows into Power BI, Power BI itself needs this computing power. If you import the data into SQL Server (or Analysis Services, or elsewhere) and use Direct Query or Live Connection, you can delegate computations to the database engine. With Live Connection all your modeling is done on the database engine, while in Direct Query modeling is also done in Power BI and you can add computed columns and measures. So if you use Direct Query, you still must be careful about what is computed where.
You ask for "the best", which is always a bit vague. You must decide for yourself depending on many other factors. Power BI is Analysis Services by itself (when you run Power BI Desktop you can see the Microsoft SQL Server Analysis Services child process running), so importing the data in Power BI should give you similar performance as if it was imported in SSAS. To improve the performance in this case, you need to tune your model. If you import the data in SQL Server, you need to tune the database (proper indexing and modeling).
So to reach a final decision you must test these solutions, consider pricing and hardware requirements and depending on that, decide what is the best for your case.
Recently, Microsoft made a demo with 1 trillion rows of data. You may want to take a look at it. I would also recommend taking a look at aggregations, which could help you improve the performance of your model.
I am considering options apart from Azure Import Export.
The data is more than 2 TB, but divided into multiple files: I would have 8-10 files ranging from 30 GB to 500 GB. We can use certificates for any web service that can be utilized for this transfer.
AzCopy is usually recommended for transfers of data up to about 100 GB.
Azure Data Factory with the Data Management Gateway: creating a pipeline and gateway and transferring the data.
Please do suggest if there is any other way to do this transfer.
Based on my experience, AzCopy will give you the best performance and is recommended even when the data size is greater than 20 TB. Read the following KB article for more information.
One thing you should determine is which datacenter provides the lowest blob storage latency from your location. Use this URL for that purpose. The lower the values you get on that URL, the better.
I have a project that requires using SQL Server Analysis Services, but we've also started looking at PowerBI.
I'm not entirely clear on how PowerBI functions, and where the computations/data storage takes place. If we use PowerBI for generating the analytics, is there still a benefit to having an Analysis Services layer?
To Use Analysis Services or Not?
It depends. If you already have an Analysis Services (SSAS) model, as Caio mentions, then I wouldn't get rid of it. Power BI works very well with Analysis Services and Analysis Services is going to offer a lot of enterprise-grade options that Power BI isn't going to improve upon (such as the ability to handle millions of new rows each day).
However, if you don't have an Analysis Services model already, SSAS isn't a prerequisite for using Power BI. As Mike mentions, Power BI is fully featured by itself and can easily handle most needs (importing data, modeling the data, and then visualizing the data).
To answer your question about computation and storage, Power BI has a number of layers:
An ETL layer (M). This is how data is brought into your model.
A modeling layer (DAX). This is where the data is stored, and where calculations run.
A visualization layer.
When you use Power BI with Analysis Services in Direct Query mode, then the ETL & data modeling side of things are handled by SSAS. All computation & data storage happens in Analysis Services and Power BI becomes a visualization layer only, sending queries to Analysis Services as needed for your reports.
When you don't have Analysis Services (and are using Data Import mode), then the data is stored in Power BI and all the computations run inside Power BI too.
Pros & Cons of Each Option
The advantage of using Power BI without SSAS is speed of delivery. Everything is handled in one file by one person. If you need to change your data model to make a report work, you can do that within Power BI. When you have a SSAS model, making changes to your data model can be cumbersome (partly because you have to use another tool and partly because any changes will affect all users).
The advantage of using Power BI with SSAS is scalability. Configured correctly, a single Analysis Services model can grow to handle hundreds of gigabytes, hundreds of reports, and hundreds of users with no issue. Analysis Services offers a level of enterprise robustness that goes beyond what you'd want a Power BI file to handle.
That said, introducing Analysis Services brings a number of disadvantages: most importantly, licensing & maintaining a SQL Server & keeping that server up-to-date. Power BI Desktop is updated monthly and is a quick download to get the latest & greatest DAX features. Using SSAS means you have to wait for new releases of SQL Server that include the same DAX features, then test & install them.
Conclusion
If you're not dealing with vast amounts of data (e.g. millions of new rows each month), one way to know whether you need the enterprise-grade features of Analysis Services is to think about the reports needed at the end of the project. If there are a dozen or fewer reports and you plan to build them all yourself, then Power BI alone offers a lot of advantages. If, on the other hand, there's a whole department of report writers waiting for you to build a data model, then Analysis Services is the way to go.
Sidenote
What's more important than Analysis Services vs. Power BI for ETL/modeling is getting your data model right. A poor data model will be slow using either tool. A well-designed data model will be fast using either option. Make sure to spend plenty of time understanding best practices when it comes to modeling your data. "Analyzing Data with Power BI and Power Pivot for Excel" by Alberto Ferrari & Marco Russo is well worth picking up if you're new to data modeling & BI in general. (Not saying you are.)
Yes, you absolutely need to keep your Analysis Services layer (and other data sources you might have). Power BI is a reporting tool and should receive data pre-aggregated as much as possible, enough to be able to plot charts, display tables, apply filters, etc. The heavy lifting is done at the data source level.
There are a number of limitations in Power BI, and you should plan for that.
For instance:
There is a 1 million row limit for returning data when using DirectQuery. This does not affect aggregations or calculations used to create the dataset returned using DirectQuery, only the rows returned. For example, you can aggregate 10 million rows with your query that runs on the data source, and accurately return the results of that aggregation to Power BI using DirectQuery as long as the data returned to Power BI is less than 1 million rows. If more than 1 million rows would be returned from DirectQuery, Power BI returns an error.
https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-use-directquery/
You probably don't need a separate Analysis Services instance; that's only for very large models. In the default Import mode you are only limited by a model size of 1 GB for Free or Pro accounts. Thanks to effective data compression, this can be many millions of rows. A rough basis for estimation would be 50 million rows in 1 GB. Performance is excellent.
Power BI actually spins up an internal Analysis Services instance when a model is in use, which handles all the analytic/calculation requirements. When using Power BI Desktop this runs on your PC (you can watch it in Task Manager). When using the web service it runs in the cloud. With Power BI Report Server it runs on an on-premises server. You can connect to any of those using Excel PivotTables etc., just as you would with a regular Analysis Services instance.
We are trying to build an application which will have to store billions of records: 1 trillion+.
A single record will contain text data and metadata about the text document.
Please help me understand the storage limitations. Can a database such as SQL Server or Oracle support this much data, or do I have to look at some other filesystem-based solution? What are my options?
Since the central server has to handle incoming load from many clients, how will parallel insertions and searches scale? How do I distribute data over multiple databases or tables? I am a little green on database specifics for such a scaled environment.
Initially, to fill the database, the insert load will be high; later, as the database grows, search load will increase and inserts will reduce.
The total size of the data will cross 1000 TB.
Thanks.
1 trillion+
a single record will contain text data and meta data about the text document. pl help me understand about the storage limitations
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents, at 1,024 bytes of total storage per document (VERY unlikely to be realistic when you say text), is a size of about 930 terabytes of data. Storage limitations mean you are talking high-end SAN here. Using a non-redundant setup of 2 TB discs, that is roughly 470 discs. Do the math. Add redundancy/RAID on top of that and you are talking a major hardware investment. And this assumes only 1 KB per document. If you average 16 KB of data per document, that is roughly 7,500 2 TB discs.
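The disc-count arithmetic above can be checked in a few lines of Python (sizes in binary terabytes, redundancy and RAID overhead deliberately ignored, matching the answer's assumptions):

```python
# Reproduce the back-of-envelope storage math: documents * bytes-per-doc,
# then how many 2 TB discs that raw, non-redundant capacity needs.
import math

DOCS = 10**12   # one trillion documents
TIB = 2**40     # one binary terabyte in bytes
DISC_TIB = 2    # 2 TB discs, no RAID overhead counted

def discs_needed(bytes_per_doc):
    total_tib = DOCS * bytes_per_doc / TIB
    return math.ceil(total_tib / DISC_TIB)

print(discs_needed(1024))       # ~1 KB per doc: about 470 discs (~930 TiB)
print(discs_needed(16 * 1024))  # 16 KB per doc: about 7,500 discs
```

Even at the optimistic 1 KB per document the raw capacity is near a petabyte, which is what drives the answer's point about SAN-class hardware.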
That is a hardware problem to start with. SQL Server does not scale that high, and you cannot do this in a single system anyway. The normal approach for a document store like this would be a clustered storage system (a clustered or somehow distributed file system) plus a central database for the keywords/tagging, possibly with replication of the database for distributed search, depending on load/inserts.
Whatever it is going to be, the storage/backup requirements are enormous. Large project here, large budget.
IO load is going to be another issue, hardware-wise. You will need a large machine and a TON of IO bandwidth into it. I have seen 8 Gb links overloaded on a SQL Server (fed by an HP EVA with 190 discs), and I can imagine you will run into something similar. You will want hardware with as much RAM as technically possible, regardless of the price, unless you store the blobs outside the database.
SQL row compression may come in VERY handy. Full text search will be a problem.
the total size of data will cross 1000 TB.
No. Seriously. It will be bigger, I think. 1000 TB would assume the documents are small, like the XML form of a travel ticket.
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to databases it's quite tricky, and there can be multiple components involved to get performance, like a Redis cache, sharding, read replicas, etc.
The post below describes simplified DB scalability options.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html
Is SQL Server 2008 a good option to use as an image store for an e-commerce website? It would be used to store product images of various sizes and angles. A web server would output those images, reading the table by a clustered ID. The total image size would be around 10 GB, but will need to scale. I see a lot of benefits over using the file system, but I am worried that SQL Server, not having an O(1) lookup, is not the best solution, given that the site has a lot of traffic. Would that even be a bottleneck? What are some thoughts, or perhaps other options?
10 GB is not a huge amount of data, so you can probably use the database to store it without big issues, but of course performance-wise it's best to use the filesystem, while safety- and management-wise it's better to use the DB (backups and consistency).
Happily, Sql Server 2008 allows you to have your cake and eat it too, with:
The FILESTREAM Attribute
In SQL Server 2008, you can apply the FILESTREAM attribute to a varbinary column, and SQL Server then stores the data for that column on the local NTFS file system. Storing the data on the file system brings two key benefits:
Performance matches the streaming performance of the file system.
BLOB size is limited only by the file system volume size.
However, the column can be managed just like any other BLOB column in SQL Server, so administrators can use the manageability and security capabilities of SQL Server to integrate BLOB data management with the rest of the data in the relational database—without needing to manage the file system data separately.
Defining the data as a FILESTREAM column in SQL Server also ensures data-level consistency between the relational data in the database and the unstructured data that is physically stored on the file system. A FILESTREAM column behaves exactly the same as a BLOB column, which means full integration of maintenance operations such as backup and restore, complete integration with the SQL Server security model, and full-transaction support.
Application developers can work with FILESTREAM data through one of two programming models; they can use Transact-SQL to access and manipulate the data just like standard BLOB columns, or they can use the Win32 streaming APIs with Transact-SQL transactional semantics to ensure consistency, which means that they can use standard Win32 read/write calls to FILESTREAM BLOBs as they would if interacting with files on the file system.
In SQL Server 2008, FILESTREAM columns can only store data on local disk volumes, and some features such as transparent encryption and table-valued parameters are not supported for FILESTREAM columns. Additionally, you cannot use tables that contain FILESTREAM columns in database snapshots or database mirroring sessions, although log shipping is supported.
Check out this white paper from MS Research (http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-45)
They detail exactly what you're looking for. The short version is that any file size over 1 MB starts to degrade performance compared to saving the data on the file system.
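That 1 MB finding translates directly into a simple storage-routing rule. A Python sketch (the threshold comes from the paper; the helper itself is hypothetical):

```python
# Routing rule based on the MS Research "To BLOB or Not To BLOB" finding:
# objects under ~1 MB tend to perform better stored in the database,
# larger ones on the file system. The helper function is hypothetical.
ONE_MB = 1024 * 1024

def store_in_database(image_bytes, threshold=ONE_MB):
    """True if the object is small enough to live in a DB BLOB column."""
    return len(image_bytes) < threshold

print(store_in_database(b"x" * 200_000))    # True: thumbnail-sized image
print(store_in_database(b"x" * 5_000_000))  # False: large image, file system
```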
I doubt that O(log n) for lookups would be a problem. You say you have 10GB of images. Assuming an average image size of say 50KB, that's 200,000 images. Doing an indexed lookup in a table for 200K rows is not a problem. It would be small compared to the time needed to actually read the image from disk and transfer it through your app and to the client.
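To put that in numbers, here is the image-count arithmetic plus a rough B-tree depth estimate in Python (the ~500-keys-per-page fanout is an illustrative assumption, not a measured SQL Server figure):

```python
# The 10 GB / 50 KB arithmetic from the answer, plus a rough estimate of
# B-tree index depth. The fanout of ~500 keys per page is an assumption.
import math

def btree_depth(rows, fanout=500):
    """Approximate number of index levels to reach any of `rows` keys."""
    return max(1, math.ceil(math.log(rows, fanout)))

images = (10 * 1024**3) // (50 * 1024)  # 10 GB at ~50 KB per image
print(images)                           # ~200K rows, as estimated above
print(btree_depth(images))              # 2 levels: a couple of page reads
```

Two index levels means an indexed lookup is a handful of page reads, which is negligible next to reading and streaming the image bytes themselves.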
It's still worth considering the usual pros and cons of storing images in a database versus storing paths in the database to files on the filesystem. For example:
Images in the database obey transaction isolation, automatically delete when the row is deleted, etc.
Database with 10GB of images is of course larger than a database storing only pathnames to image files. Backup speed and other factors are relevant.
You need to set MIME headers on the response when you serve an image from a database, through an application.
The images on a filesystem are more easily cached by the web server (e.g. Apache mod_mmap), or could be served by leaner web server like lighttpd. This is actually a pretty big benefit.
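On the MIME-header point in the list above: when an image comes out of the database as raw bytes, there is no file extension on disk for the web server to infer a type from, so the application must set the headers itself. A minimal framework-agnostic Python sketch (the helper and its cache policy are hypothetical):

```python
# When serving an image BLOB from a database, the server cannot infer the
# Content-Type from a file on disk, so the application sets it explicitly.
# Hypothetical helper, not tied to any particular web framework.
import mimetypes

def image_response_headers(filename, blob):
    """Build HTTP headers for an image fetched from a DB BLOB column."""
    content_type, _ = mimetypes.guess_type(filename)
    return {
        "Content-Type": content_type or "application/octet-stream",
        "Content-Length": str(len(blob)),
        # Long cache lifetime: product images rarely change in place.
        "Cache-Control": "public, max-age=86400",
    }

headers = image_response_headers("product_123.jpg", b"\xff\xd8\xff")
print(headers["Content-Type"])  # image/jpeg
```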
For something like an e-commerce web site, I would be more likely to go with storing the image in a blob column in the database. While you don't want to engage in premature optimization, just the benefit of having my images easily organized alongside my data, as well as very portable, is one automatic advantage for something like e-commerce.
If the images are indexed then lookup won't be a big problem. I'm not sure, but I don't think filesystem lookup is O(1); it's more like O(n) (I don't think files are indexed by the file system).
What worries me in this setup is the size of the database, but if managed correctly that won't be a big problem, and a big advantage is that you have only one thing to backup (the database) and not worry about files on disk.
Normally a good solution is to store the images themselves on the filesystem, and the metadata (file name, dimensions, last updated time, anything else you need) in the database.
Having said that, there's no "correct" solution to this.