This blog post covers the details of physical cube processing and storage, including an explanation of the what, when, and why to use any of the three cube storage methods: MOLAP, HOLAP, or ROLAP. These storage-type acronyms stand for Multidimensional, Hybrid, or Relational OLAP.
To view structural changes to your cube, you must process the cube in BIDS/SSDT and then deploy the changes to Analysis Services. To understand what is happening during this process, you must first be aware of what you are creating when you design and build BI objects, such as DSVs, cubes, and dimensions using BIDS/SSDT. Although you are working in a GUI environment, commands in the XMLA SSAS metadata language are being generated behind the scenes.
XML for Analysis
XML for Analysis (XMLA) is an open-standard XML dialect that is used for client-server communications between OLAP servers. XMLA describes its messages in a SOAP-style format designed to access multidimensional databases that contain OLAP cubes. Visit http://www.xmla.org/ to learn more about XMLA. Simple Object Access Protocol (SOAP) is a type of XML dialect. XMLA will sometimes also contain Multidimensional Expressions (MDX) queries, and it is the language of cube metadata. This language includes many methods that allow you to view or modify cube information; two examples are the Discover and Execute methods.
I’m opening this discussion of cube data storage with the XMLA because you need to understand what Analysis Services considers to be data and what it considers to be metadata as you begin to plan your cube storage options. The simplest way to think of this is as follows: Data is the content (rows) in the fact table. Metadata is everything else. By everything else, I mean cube definitions, dimension names, attribute names, hierarchy names, and, most important, dimension attribute values. The following list provides some examples from the samples available with SSAS to help explain this concept more fully:
- Data: All fact table rows from Fact Internet Sales, Fact Internet Sales Reason, and so on. Data does not include rows from the dimension tables; those rows are metadata and treated differently from data during dimension and cube processing.
- Metadata: Examples include names and all attributes for the Customer dimension called Customer, the Customer dimensional hierarchy Customer Geography, the Customer dimension attribute Marital Status, Customer attribute values Married, Single, or Unknown.
When you design storage for your cube, your primary concern is how to store the data (or the fact table rows, not the dimension table rows). This is because the largest amount of data in any cube is generally contained in the facts (or fact table rows). There can be exceptions to this in the case of huge dimensions. We will discuss this case in more detail later in this post.
For data storage, you have three choices at the topmost level:
- MOLAP: Multidimensional OLAP stores a copy of the facts in your SSAS cube. This is not a one-for-one storage option. Owing to efficient storage mechanisms used for cubes, the resultant storage is approximately 10–20 percent the size of the original data; that is, if you have 1GB in your fact table, plan for around 200MB storage on SSAS. However efficient your system, when you chose MOLAP, be aware that you are choosing to make a copy of all source data.
- HOLAP: Hybrid OLAP does not make a copy of the facts in SSAS. It reads this information from the star schema source.
- ROLAP: Relational OLAP does not make a copy of the facts on SSAS. It reads this information from the star schema source.
The other major consideration is the quantity of and the storage of aggregations. An aggregation is a calculated, stored intersection of fact values. An aggregation is calculated at a level higher than the granularity of your fact table. In other words, if the grain of your fact table holds sales amounts for each product sold on each day by each employee, one possible aggregation would be to sum, and store, the daily facts at the week level. In this scenario, SSAS will use calculated aggregations to quickly return sales amounts for each product, for each employee, by year. If there are aggregations at the week level, SSAS will use these as a starting point to calculate the needed yearly results and will not read each value from the day level. An important difference between OLAP aggregations and materialized indexes on calculated columns in a relational database such as SQL Server is that the SSAS can use aggregations from any level in response to a query. The result is that full cube aggregation is never needed to optimize queries. In fact, overaggregation of OLAP cubes is a common design mistake. Baseline values for aggregation are defined a bit later in this section. The storage type you select impacts where aggregations are stored. MOLAP, by default, creates no aggregations. Should you choose to add aggregations, they will be stored in the native SSAS format in your SSAS cube with the fact data. For HOLAP, only aggregations—not fact data—will be stored in your SSAS cube. For ROLAP, aggregations will be written back to the relational database. Note that the order of the column names reflect the positions in the hierarchy that are being aggregated and the type of aggregation performed.
Balance Query Performance with Data Latency
In this section, we will look further into SSAS storage. These storage settings help you to balance query performance with data latency. In earlier sections, you learned that using MOLAP will provide the best query performance. Now, we will look into the other two storage settings, HOLAP and ROLAP. For both storage types, source data is not copied to your cube. While HOLAP aggregations are stored in SSAS, ROLAP aggregations are built and stored in your database.
MOLAP Cube Storage
So, which type of storage is best? Why is MOLAP, with 0 percent, aggregations the default mechanism? In our experience, MOLAP is most often chosen because it provides efficient storage and the fastest query results. Although the default level of aggregations is 0 percent, most of our customers choose to add at least some aggregations to their production cubes. The reason for doing this is that SSAS is optimized to quickly and easily add aggregations during cube and dimension processing, and the improvement in query results usually offsets the overhead in processing time and cube storage file sizes. MOLAP is the storage default because the SSAS query engine is highly optimized for calculating aggregates on large quantities of data; the SSAS engine is more like Excel’s calculation engine than that of SQL Server. Relational database engines, such as SQL Server, are designed to fetch subsets of data efficiently, not necessarily to perform the complex aggregations required for MOLAP. For many customers, MOLAP with 0 percent aggregations, or some small amount such as 20 percent, will produce queries that run quickly enough for all of their end users.
[info]SSAS Standard Edition supports only MOLAP storage with the default number of partitions per cube.[/info]
ROLAP Cube Storage
ROLAP is often used for rapidly changing dimensions (RCDs), and huge dimensions. As an example, let’s say that you are modeling a dimension that contains employee information for a fast-food restaurant chain. The chain has very high employee turnover, as is typical in the fast-food industry. However, a business requirement is to be able to retrieve the most current employee name from the employee dimension at all times with no latency. This type of requirement may lead you to choose a ROLAP dimension. Another example is if you have a huge number of members in a dimension. A business example is that you are modeling the customer dimension for an international shipping company. It is a business requirement to have the name of every customer for all time included in your cube. This may mean that you must include millions, or eventually even billions, of customer names in the dimension. The storage limits for SQL Server tables (maximum number of rows) is still much larger than those in SSAS dimensions.
[plain]Despite the fact that you may have business situations that warrant the consideration of ROLAP dimensions, you should test to make sure that your infrastructure will provide adequate performance given the anticipated load. If you are considering ROLAP dimensions, be sure to test with production level of data before you deploy this configuration into a production environment.[/plain]
HOLAP Cube Storage
In Hybrid OLAP (HOLAP), all aggregations are stored in SSAS, while the source data remains in you RDMBS. HOLAP takes advantage of MOLAP query performance, lowers cube processing times, and reduces cube size. HOLAP storage has lower latency than MOLAP. However, queries needing to drill through to source data will perform poorer than MOLAP queries, since this detail data is not in your cube. Also, HOLAP storage requires the largest percentage of aggregations. If a good deal of your analysis is based on aggregations only, HOLAP may just be worth a try.
Like so many of the advanced storage features, ROLAP dimensions require the Enterprise Edition of SQL Server Analysis Services. You would use ROLAP typically only for dimensions with millions of members; an example might be the customer dimension. This means that the dimensional attribute values will not be copied to and stored on SSAS; rather, they will be retrieved directly from the relational source table or tables. To set a dimension as a ROLAP dimension, go to the Properties window for that dimension, and change the StorageMode property from the default MOLAP to ROLAP.
If you are working with the Standard Edition of SSAS, the only consideration for you is whether or not to add aggregations to your cube because non-MOLAP storage and cube partitions are enterprise-only features. In our experience, for cubes of 100GB or less, it is often prudent to add aggregations to the 20 to 30 percent query-improvement level.
If you are working with the Enterprise Edition of SSAS, you have several choices in terms of cube storage and aggregation:
- Should I use partitions? If your cubes are 250GB or larger, you’ll probably want to partition them by either months, quarters, or years (assuming your time granularity is by the day and not the hour or minute).
- Should I add aggregations? If you are considering partitioning your cubes because they are large, you should also consider customizing the amount of aggregations based on the size of partition and frequency of query.
- Should I use MOLAP, ROLAP, or HOLAP? Generally, you’ll use something other than MOLAP only if you choose to use partitions. Typically, current data (partitioned on some time value, for example, months) is stored as MOLAP with a relatively high level of aggregation (25–30 percent), with older data partitions stored as HOLAP or ROLAP with a lower level of aggregation.