How to Reduce Your Power BI Model Size by 90%

what makes Power Bi so quick and highly effective with regards to efficiency? So highly effective, that it performs complicated calculations over thousands and thousands of rows within the blink of a watch.

On this article, we’ll dig deep to find what’s “beneath the hood” of Energy BI, how your knowledge is being saved, compressed, queried, and at last, introduced again to your report. When you end studying, I hope that you’ll get a greater understanding of the arduous work occurring within the background and recognize the significance of making an optimum knowledge mannequin to get most efficiency from the Energy BI engine.

First look beneath the hood — Components Engine and Storage Engine

First, I need you to fulfill the VertiPaq engine, “mind & muscle tissue” of the system behind not solely Energy BI, but in addition Evaluation Companies Tabular and Excel Energy Pivot. Fact to be mentioned, VertiPaq represents just one a part of the storage engine inside the Tabular mannequin, apart from DirectQuery, which we’ll talk about individually in one of many subsequent articles.

Once you ship the question to get knowledge in your Energy BI report, here’s what occurs:

Components Engine (FE) accepts the request, processes it, generates the question plan, and at last executes it
Storage Engine (SE) pulls the info out of the Tabular mannequin to fulfill the request issued inside the question generated by the Components Engine

Storage Engine works in two other ways to retrieve requested knowledge: VertiPaq retains a snapshot of the info in reminiscence. This snapshot might be refreshed every so often from the unique knowledge supply.

Quite the opposite, DirectQuery doesn’t retailer any knowledge. It simply forwards the question straight to the info supply for each single request.

Photo by RKTW extend on Unsplash

Information within the Tabular mannequin is normally saved both as an in-memory snapshot (VertiPaq) or in DirectQuery mode. Nonetheless, there may be additionally a risk of implementing a hybrid Composite mannequin, which depends on each architectures in parallel.

Components Engine — “Mind” of Energy BI

As I already confused, Components Engine accepts the question, and because it’s in a position to “perceive” DAX (and MDX additionally, however it’s out of the scope of this collection), it “interprets” DAX into a selected question plan, consisting of bodily operations that should be executed to get outcomes again.

These bodily operations might be joins between a number of tables, filtering, or aggregations. It’s essential to know that Components Engine works in a single-threaded approach, which signifies that requests to Storage Engine are all the time being despatched sequentially.

Storage Engine — “Muscular tissues” of Energy BI

As soon as the question has been generated and executed by the Components Engine, the Storage Engine comes into the scene. It bodily goes by way of the info saved inside the Tabular mannequin (VertiPaq) or goes on to a special knowledge supply (SQL Server, for instance, if DirectQuery storage mode is in place).

Relating to specifying the storage engine for the desk, there are three potential choices to select from:

Import mode — primarily based on VertiPaq. Desk knowledge is being saved in reminiscence as a snapshot. Information might be refreshed periodically
DirectQuery mode — knowledge is being retrieved from the info supply at question time. Information resides in its unique supply earlier than, throughout, and after the question execution
Twin mode — a mixture of the primary two choices. Information from the desk is being loaded into reminiscence, however at question time it can be retrieved straight from the supply

Versus Components Engine, which doesn’t assist parallelism, the Storage Engine can work asynchronously.

Meet VertiPaq Storage Engine

As we now have drawn an enormous image beforehand, let me clarify in additional element what VertiPaq does within the background to spice up the efficiency of our Energy BI studies.

After we select Import mode for our Energy BI tables, VertiPaq performs the next actions:

Reads the info supply, transforms knowledge right into a columnar construction, encodes, and compresses knowledge inside every of the columns
Establishes a dictionary and index for every of the columns
Prepares and establishes relationships
Computes all calculated columns and calculated tables, and compresses them

The 2 principal traits of VertiPaq are:

VertiPaq is a columnar database
VertiPaq is an in-memory database

As you possibly can see within the illustration above, columnar databases retailer and compress knowledge differently from conventional row-store databases. Columnar databases are optimized for vertical knowledge scanning, which signifies that each column is structured in its personal approach and bodily separated from different columns!

With out going into deep evaluation about benefits and downsides between row-store vs column-store databases, since it might require a separate collection of articles, let me simply pinpoint a couple of key differentials by way of efficiency.

With columnar databases, single-column entry is quick and efficient. As soon as the computation begins to contain a number of columns, issues develop into extra complicated, because the middleman steps’ outcomes should be quickly saved not directly.

Merely mentioned, columnar databases are extra CPU-intensive, whereas row-store databases enhance I/O, due to many scans of ineffective knowledge.

Thus far, we painted an enormous image of the structure that permits Energy BI to totally shine as an final BI software. Now, we’re able to dive deeper into particular architectural options and consequently leverage this data to benefit from our Energy BI studies, by tuning our knowledge mannequin to extract the utmost from the underlying engine.

Inside VertiPaq in Energy BI — Compress for achievement!

As you may recall from the earlier a part of this text, we scratched the floor of VertiPaq, a robust storage engine, which is “accountable” for the blazing-fast efficiency of most of your Energy BI studies (each time you’re utilizing Import mode or Composite mannequin).

3, 2, 1…Fasten your seatbelts!

One of many key traits of the VertiPaq is that it’s a columnar database. We realized that columnar databases retailer knowledge optimized for vertical scanning, which signifies that each column has its personal construction and is bodily separated from different columns.

That truth permits VertiPaq to use several types of compression to every of the columns independently, selecting the optimum compression algorithm primarily based on the values in that particular column.

Compression is being achieved by encoding the values inside the column. However, earlier than we dive deeper into an in depth overview of encoding strategies, simply remember the fact that this structure just isn’t solely associated to Energy BI — within the background is a Tabular mannequin, which can be “beneath the hood” of Evaluation Companies Tabular and Excel Energy Pivot.

Worth Encoding

That is essentially the most fascinating worth encoding kind since it really works solely with integers and, due to this fact, requires much less reminiscence than, for instance, when working with textual content values.

How does this look in actuality? Let’s say we now have a column containing quite a lot of cellphone calls per day, and the worth on this column varies from 4.000 to five.000. What the VertiPaq would do, is to search out the minimal worth on this vary (which is 4.000) as a place to begin, then calculate the distinction between this worth and all the opposite values within the column, storing this distinction as a brand new worth.

At first look, 3 bits per worth may not seem like a major saving, however multiply this by thousands and thousands and even billions of rows and you’ll recognize the quantity of reminiscence saved.

As I already confused, Worth Encoding is being utilized solely to integer knowledge kind columns (foreign money knowledge kind can be saved as an integer).

Hash Encoding (Dictionary Encoding)

That is most likely essentially the most incessantly used compression kind by a VertiPaq. Utilizing Hash encoding, VertiPaq creates a dictionary of the distinct values inside one column and afterward replaces “actual” values with index values from the dictionary.

Right here is an instance to make issues clearer:

As chances are you’ll discover, VertiPaq recognized distinct values inside the Topics column, constructed a dictionary by assigning indexes to these values, and at last saved index values as tips to “actual” values. I assume you’re conscious that integer values require approach much less reminiscence area than textual content, in order that’s the logic behind such a knowledge compression.

Moreover, by with the ability to construct a dictionary for any knowledge kind, VertiPaq is virtually knowledge kind unbiased!

This brings us to a different key takeover: irrespective of in case your column is of textual content, bigint or float knowledge kind — from VertiPaq perspective it’s the identical — it must create a dictionary for every of these columns, which means that each one these columns will present the identical efficiency, each by way of velocity and reminiscence area allotted! In fact, by assuming that there are not any important variations in dictionary sizes between these columns.

So, it’s a fable that the info kind of the column impacts its dimension inside the knowledge mannequin. Quite the opposite, the variety of distinct values inside the column, which is named cardinality, principally influences column reminiscence consumption.

RLE (Run-Size-Encoding)

The third algorithm (RLE) creates a sort of mapping desk, containing ranges of repeating values, avoiding storing each single (repeated) worth individually.

Once more, looking at an instance will assist to raised perceive this idea:

In actual life, VertiPaq doesn’t retailer Begin values, as a result of it will probably shortly calculate the place the following node begins by summing earlier Depend values.

As highly effective as it’d have a look at first look, the RLE algorithm is extremely depending on the ordering inside the column. If the info is saved the way in which you see within the instance above, RLE will carry out nice. Nonetheless, in case your knowledge buckets are smaller and rotate extra incessantly, then RLE wouldn’t be an optimum resolution.

Yet one more factor to bear in mind relating to RLE: In actuality, VertiPaq doesn’t retailer knowledge the way in which it’s proven within the illustration above. First, it performs Hash encoding and creates a dictionary of the topics, after which applies the RLE algorithm, so the ultimate logic, in its most simplified approach, can be one thing like this:

So, RLE happens after Worth or Hash Encoding, in these eventualities when VertiPaq “thinks” that it is sensible to compress knowledge moreover (when knowledge is ordered in that approach that RLE would obtain higher compression).

Re-Encoding issues

Irrespective of how “good” VertiPaq is, it will probably additionally make some unhealthy choices, primarily based on incorrect assumptions. Earlier than I clarify how re-encoding works, let me simply briefly iterate by way of the method of information compression for a selected column:

VertiPaq scans a pattern of rows from the column
If the column knowledge kind just isn’t an integer, it is going to look no additional and use Hash encoding
If the column is of integer knowledge kind, some extra parameters are evaluated: if the numbers within the pattern linearly enhance, VertiPaq assumes that it’s most likely a main key and chooses Worth encoding
If the numbers within the column are moderately shut to one another (the quantity vary just isn’t very huge, like in our instance above with 4.000–5.000 cellphone calls per day), VertiPaq will use Worth encoding. Quite the opposite, when values fluctuate considerably inside the vary (for instance between 1.000 and 1.000.000), then Worth encoding doesn’t make sense, and VertiPaq will apply the Hash algorithm

Nonetheless, it will probably occur generally that VertiPaq comes to a decision about which algorithm to make use of primarily based on the pattern knowledge, however then some outlier pops up and it must re-encode the column from scratch.

Let’s use our earlier instance for the variety of cellphone calls: VertiPaq scans the pattern and chooses to use Worth encoding. Then, after processing 10 million rows, rapidly it discovered a 500.000 worth (it may be an error, or no matter). Now, VertiPaq re-evaluates the selection, and it will probably determine to re-encode the column utilizing the Hash algorithm as a substitute. Absolutely, that may affect the entire course of by way of the time wanted for reprocessing.

Lastly, right here is the record of parameters (so as of significance) that VertiPaq considers when selecting which algorithm to make use of:

Variety of distinct values within the column (Cardinality)
Information distribution within the column — column with many repeating values might be higher compressed than one containing incessantly altering values (RLE might be utilized)
Variety of rows within the desk
Column knowledge kind — impacts solely the dictionary dimension

Lowering the info mannequin dimension by 90% — actual story!

After we laid the theoretical floor for understanding the structure behind the VertiPaq storage engine, and which varieties of compression it makes use of to optimize your Energy BI knowledge mannequin, it’s the proper second to get our palms soiled and apply our information in a real-life case!

Start line = 776 MB

Our knowledge mannequin is sort of easy, but memory-intensive. We have now a truth desk (factChat), which comprises knowledge about dwell assist chats and one dimension desk (dimProduct), which pertains to a truth desk. Our truth desk has round 9 million rows, which shouldn’t be an enormous deal for Energy BI, however the desk was imported as it’s, with none extra optimization or transformation.

Now, this pbix file consumes a whopping 777 MB!!! You may’t imagine it? Simply have a look:

Simply bear in mind this image! In fact, I don’t have to let you know how a lot time this report must load or refresh, and the way our calculations are sluggish due to the file dimension.

…and it’s even worse!

Moreover, it’s not simply 776 MBs that take our reminiscence, since reminiscence consumption is being calculated taking into consideration the next components:

PBIX file
Dictionary (you’ve realized concerning the dictionary to start with sections of this text)
Column hierarchies
Person-defined hierarchies
Relationships

Now, if I open Process Supervisor, go to the Particulars tab, and discover the msmdsrv.exe course of, I’ll see that it burns greater than 1 GB of reminiscence!

Oh, man, that actually hurts! And we haven’t even interacted with the report! So, let’s see what we are able to do to optimize our mannequin…

Rule #1 — Import solely these columns you actually need

The primary and an important rule is: maintain in your knowledge mannequin solely these columns you actually need for the report!

That being mentioned, do I actually need each the chatID column, which is a surrogate key, and the sourceID column, which is a main key from the supply system? Each of those values are distinctive, so even when I have to rely the whole variety of chats, I’d nonetheless be effective with solely certainly one of them.

So, I’ll take away the sourceID column and examine how the file seems to be now:

By eradicating only one pointless column, we saved greater than 100 MB!!! Let’s study additional what might be eliminated with out taking a deeper look (and we’ll come to this later, I promise).

Do we actually want each the unique begin time of the chat and UTC time, one saved as a Date/Time/Timezone kind, the opposite as Date/Time, and each going to a second degree of precision??!!

Let me eliminate the unique begin time column and maintain solely UTC values.

One other 100 MB of wasted area gone! By eradicating simply two columns we don’t want, we diminished the scale of our file by 30%!

Now, that was with out even trying into the main points of the reminiscence consumption. Let’s now activate DAX Studio, my favourite software for troubleshooting Energy BI studies. As I already confused a couple of occasions, this software is a MUST in case you plan to work significantly with Energy BI — and it’s utterly free!

One of many options in DAX Studio is a VertiPaq Analyzer, a really great tool constructed by Marco Russo and Alberto Ferrari from sqlbi.com. After I connect with my pbix file with DAX Studio, listed below are the numbers associated to my knowledge mannequin dimension:

I can see right here what the costliest columns are in my knowledge mannequin and determine if I can discard a few of them, or if I have to maintain all of them.

At first look, I’ve few candidates for removing — sessionReferrer and referrer columns have excessive cardinality and due to this fact can’t be optimally compressed. Furthermore, as these are textual content columns and should be encoded utilizing a Hash algorithm, you possibly can see that their dictionary dimension is extraordinarily excessive! If you happen to take a more in-depth look, you possibly can discover that these two columns take virtually 40% of my desk dimension!

After checking with my report customers in the event that they want any of those columns, or possibly solely certainly one of them, I’ve obtained a affirmation that they don’t carry out any evaluation on these columns. So, why on Earth ought to we bloat our knowledge mannequin with them??!!

One other sturdy candidate for removing is the LastEditDate column. This column simply exhibits the date and time when the report was final edited within the knowledge warehouse. Once more, I checked with the report customers, they usually didn’t even know that this column exists!

I eliminated these three columns, and the result’s:

Oh, God, we halved the scale of our knowledge mannequin by simply eradicating a couple of pointless columns.

Fact be informed, there are a couple of extra columns that could possibly be dismissed from the info mannequin, however let’s now deal with different strategies for knowledge mannequin optimization.

Rule #2 — Scale back the column cardinality!

As chances are you’ll recall from the earlier a part of the article, the rule of thumb is: the upper the cardinality of a column, the tougher for VertiPaq to optimally compress the info. Particularly if we’re not working with integer values.

Let’s take a deeper have a look at VertiPaq Analyzer outcomes:

As you see, even when the chatID column has larger cardinality than the datetmStartUTC column, it takes virtually 7 occasions much less reminiscence! Since it’s a surrogate key integer worth, VertiPaq applies Worth encoding, and the scale of a dictionary is irrelevant. However, Hash encoding is being utilized for the column of date/time knowledge kind with excessive cardinality, so the dictionary dimension is enormously larger.

There are a number of strategies for decreasing the column cardinality, reminiscent of splitting columns. Listed below are a couple of examples of utilizing this method.

For Integer columns, you possibly can break up them into two even columns utilizing division and modulo operations. In our case, it might be:

SELECT chatID/1000 AS chatID_div
,chatID % 1000 AS chatID_mod……….

This optimization approach have to be carried out on the supply aspect (on this case, by writing a T-SQL assertion). If we use the calculated columns, there is no such thing as a profit in any respect, for the reason that unique column must be saved within the knowledge mannequin first.

An identical approach can convey important financial savings when you might have decimal values within the column. You may merely break up values earlier than and after the decimal as defined in this article.

Since we don’t have any decimal values, let’s deal with our downside — optimizing the datetmStartUTC column. There are a number of legitimate choices to optimize this column. The primary is to examine in case your customers want granularity larger than the day degree (in different phrases, are you able to take away hours, minutes, and seconds out of your knowledge).

Let’s examine what financial savings this resolution would convey:

The very first thing we discover is that our file is now 271 MB, so 1/3 of what we began with. VertiPaq Analyzer’s outcomes present that this column is now virtually completely optimized, going from taking on 62% of our knowledge mannequin to simply barely over 2.5%! That’s huuuuge!

Nonetheless, it appeared that the day-level grain was not effective sufficient, and my customers wanted to investigate figures on the hour degree. OK, so we are able to at the least eliminate minutes and seconds, and that may additionally lower the cardinality of the column.

So, I’ve imported values rounded per hour:

SELECT chatID
                ,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
                ,customerID
                ,userID
                ,ipAddressID
                ,productID
                ,countryID
                ,userStatus
                ,isUnansweredChat
                ,totalMsgsOp
                ,totalMsgsUser
                ,userTimezone
                ,waitTimeSec
                ,waitTimeoutSec
                ,chatDurationSec
                ,sourceSystem
                ,topic
                ,usaccept
                ,transferUserID
                ,languageID
                ,waitFirstClick
            FROM factChat

It appeared that my customers additionally didn’t want a chatVariables column for evaluation, so I’ve additionally eliminated it from the info mannequin.

Lastly, after disabling Auto Date/Time in Choices for Information Load, my knowledge mannequin dimension was round 220 MB! Nonetheless, one factor nonetheless bothered me: the chatID column was nonetheless occupying virtually 1/3 of my desk. And that is only a surrogate key, which isn’t utilized in any of the relationships inside my knowledge mannequin.

So, right here I used to be inspecting two totally different options: the primary was to easily take away this column and mixture the variety of chats, counting them utilizing the GROUP BY clause. Nonetheless, there can be no profit by retaining the chatID column in any respect, because it’s not getting used anyplace in our knowledge mannequin. As soon as I’ve eliminated it from the mannequin, one final time, let’s examine the pbix file dimension:

Please recall the quantity we began at: 776 MB! So, I’ve managed to cut back my knowledge mannequin dimension by virtually 90%, making use of some easy strategies which enabled the VertiPaq storage engine to carry out extra optimum compression of the info.

And this was an actual use case, which I confronted over the last 12 months!

Basic guidelines for decreasing knowledge mannequin dimension

To conclude, right here is the record of basic guidelines it’s best to bear in mind when attempting to cut back the info mannequin dimension:

Preserve solely these columns your customers want within the report! Simply sticking with this one single rule will prevent an unbelievable quantity of area, I guarantee you…
Attempt to optimize column cardinality each time potential. The golden rule right here is: take a look at, take a look at, take a look at…and if there’s a important profit from, for instance, splitting one column into two, or to substitute a decimal column with two complete quantity columns, then do it! However, additionally remember the fact that your measures should be rewritten to deal with these structural adjustments, to be able to show anticipated outcomes. So, in case your desk just isn’t huge, or if it’s important to rewrite tons of of measures, possibly it’s not price splitting the column. As I mentioned, it is dependent upon your particular situation, and it’s best to rigorously consider which resolution makes extra sense
Identical as for columns, maintain solely these rows you want: for instance, possibly you don’t have to import knowledge from the final 10 years, however solely 5! That can even cut back your knowledge mannequin dimension. Discuss to your customers, ask them what they actually need, earlier than blindly placing all the things inside your knowledge mannequin
Mixture your knowledge each time potential! Meaning — fewer rows, decrease cardinality, so all good issues you’re aiming to attain! If you happen to don’t want hours, minutes, or seconds degree of granularity, don’t import them! Aggregations in Energy BI (and Tabular mannequin generally) are an important and huge matter, which is out of the scope of this collection, however I strongly suggest you examine Phil Seamark’s blog and his collection of posts on inventive aggregations utilization
Keep away from utilizing DAX calculated columns each time potential, since they don’t seem to be being optimally compressed. As a substitute, attempt to push all calculations to a knowledge supply (SQL database, for instance) or carry out them utilizing the Energy Question editor
Use correct knowledge sorts (for instance, in case your knowledge granularity is on a day degree, there is no such thing as a want to make use of Date/Time knowledge kind. Date knowledge kind will suffice)
Disable Auto Date/Time choice for knowledge loading (this can take away a bunch of robotically created date tables within the background)

Conclusion

After you realized the fundamentals of the VertiPaq storage engine and totally different strategies it makes use of for knowledge compression, I wished to wrap up this text by exhibiting you a real-life instance of how we are able to “assist” VertiPaq (and Energy BI consequently) to get the very best out of report efficiency and optimum useful resource consumption.

Thanks for studying, hope that you just loved the article!

Source link

LLM Optimization: LoRA and QLoRA | Towards Data Science

The Secret Power of Data Science in Customer Support

Agentic RAG Applications: Company Knowledge Slack Agents

DeepSeek-R1 İnceleme. Geçtiğimiz haftalarda OpenAI’nın o1–127… | by Ümit | Feb, 2025

The AI & ML Revolution in APAC: Transforming Business Industries and Elevating Service Quality | by Vertisystem | Mar, 2025

A Guide to Cloud Migration for Legacy Applications

Text-To-Image using Diffusion model with AWS Sagemaker Distributed Training | by Aniketp | Mar, 2025

AI learns how vision and sound are connected, without human intervention | MIT News

Most Popular

The real impact of AI on your organization

How Giordano’s CEO Is Leading the Iconic Pizza Brand Forward

Time Series Analysis: Reading the Rhythms Hidden in Data | by Everton Gomede, PhD | Apr, 2025

Our Picks

How Much Do Nvidia Employees Make? Median Salary Revealed

Empowering AI with Precision: Wisepl’s Expert Animal Dataset Annotation Service | by Wisepl | May, 2025

Smart Caching for Fast LLM Tools — ColdStarts & HotContext, Part 1 | by Zeneil Ambekar | May, 2025