Tag Archives: Power Query

Changes to the Power BI Data Management Gateway – Nov 2013

Several of the Power BI preview components were updated last week, most notably, Power Query and the administration app in BI sites. Without much fanfare, the Data Management Gateway was updated as well. There is a fair bit of documentation on what’s new in Power Query, and I’ve added my own thoughts here. There’s also some documentation outlining the changes to the the admin application here. However, although I looked, I haven’t found anything for the DMG.

I therefore decided to poke around a bit myself, and discovered a few things. This should obviously not be taken as a comprehensive list, and if I find anything else, I’ll update it. What follows are the changes that I’ve observed.

Performance Improvements With Azure data sources

In my article on working with the data management gateway, I observed that refreshing a relatively large model (1MM rows, 20 columns) required upwards of 10 minutes. After updating the my version of DMG, refresh required only about 1.5 minutes. I am told that performance was a focus for the team, and that focus appears to have paid off.

No more Gateway Limits

If you set up an on premises data source, you can expose it as an OData feed, and consume it with Power Query. In early testing, users would bump into a limit with large data sets. As soon as the data stream hit 100 MB, an error would result. This was due to a built in limit in the gateway. In current testing, it appears that this limit has been completely removed in this version.

Oracle support

Until now, the only data refresh support has been for SQL server. In this version of the gateway, Oracle support has been added. I haven’t tested it, as I don’t have any Oracle handy, but it’s there.

image

Credential storage and use

When creating a data connection, it is necessary to input credentials. These credentials are used to connect to the original data source when the model is refreshed. Due to security constraints, Microsoft cannot “see” the credentials, and they are stored locally. With this version of the gateway, there is now an option to save the (encrypted) credentials in the cloud. This helps with restoring gateways. 

System Health OData Feed

In the admin center, it’s possible to get quick visibility into the performance of all gateway machines.

image

With the November 2013, you can also get this performance information via an OData feed, and perform your own analysis, and use tools like Power View to visualize it.

image

This is what I’ve uncovered so far. I’ll update this post if I learn of any more.

Changes to Data Loading in the November 2013 Power Query Update

Last week, Microsoft released a new version of the Power Query preview for Power BI. It’s a significant update and contains some welcome improvements. I’m not going to list them all out here, as the Power BI team has done so here on their blog. I do however want to focus in on the changes to the basic data load procedure, and how queries are built. The process is easier, and significantly faster, but I still have some concerns, which I will get to.

Most of the changes concern data navigation. One of the major changes is the ability to directly use T-SQL when importing data from SQL server. This makes the tools significantly more useful when dealing with complex queries, or if you just happen to be handy with T-SQL. The opportunity to use your query occurs immediately after you select SQL Server as a data source.

image

Once you click OK, you are presented with the new Query editor window. Previously, the navigator and the query were contained in this window, but the “meta” query information, such as the query name and load properties were displayed along with the data in the worksheet. This lead to a fair bit of confusion, and this new change is welcome. You can see this new unified dialog below.

image

It’s worth pointing out a few things about this window. Firstly, at the top, we see the new editor ribbon. All of the functions in the ribbon were previously available in context menus, but the ribbon makes them far more discoverable (and therefore easier to use). In the command line, you can see that the native T-SQL query has been transformed into the internal Power Query language. It can be edited further at this point. On the right of the window you can see that in addition to the Applied Steps list, you can edit the name of the query, and change the load settings. Previously, the load had to be initiated before the settings could be changed, and this is therefore quite welcome. However, I still have some concerns with how this works, and I’d like to expand on it further.

To start with, I believe that the defaults are backwards. Previously (and with an older version of Power Query) I discussed the different ways that you can bypass Excel and load data directly into the data model. This is important not simply because of the row limit in Excel (1,048,576 rows) but because of the limits imposed on Excel Services by Office 365 (10 MB file size). Power BI allows for data models up to 250 MB, but only if the data is contained in the model – the 10 MB Excel limit still applies. I discuss this at length in this article.  With this default, users will load data into the worksheet without thinking about it, upload the file to Power BI, and run straight into the file size limit. I can see this as being very frustrating, and limiting adoption.

Another option might be to warn the user – “your file size is approaching 10 MB – would you like to load data into the model exclusively?”, or something of that nature.

My other concern is that now you don’t necessarily need to see the Query editor at all when importing data. On the surface, this is a good thing, and if the data load defaults were as I suggest, I would cheer it, but we run directly into the same problem right now. Let’s say that I just want to import a lot of data from a single SQL table. From Excel, I will just select the Power Query tab, and select “From SQL Server Database”. Now if I don’t enter a custom query as above, and click OK, I won’t see the editor window. Instead, I see the data navigator in the worksheet.

image 

One big feature to note here is the checkbox on the top of the navigator “Select Multiple Items”. With this version of Power Query, you can now select multiple tables at once – something that was previously unavailable. While you can bring up the editor window by selecting the Edit Query button at the bottom, you can load the data simply by clicking on the Load button. This is simple, but doing so will load all of the data directly into Excel. At best, you wind up with a workbook that is likely too big for Power BI (obviously depending on the amount of data), but at worst you bump into the row limit.

image

Unfortunately, when this happens, it isn’t immediately obvious how to fix the problem – the edit query button disappears. You can edit the query by double clicking on the query name, and changing the load options. However, users may just assume that they can’t use Power BI to do what they need, which is dead wrong. Again, even a few warning dialogs here would help significantly.

I like this new Power Query. It’s easier, has more features, and is significantly faster than its predecessors. I would also like to see it succeed, and I’d hate misunderstandings, or minor UI issues to get in the way of that. 

Analyzing Data with Power BI from Beginning to End

I’ve been writing a fair bit lately about specific aspects of Power BI, and how they all fit together. I decided to put together a “real world” demonstration of as many aspects of the product as possible, and this article is the result. I don’t dive too deep on any of the specific areas, but hopefully this walkthrough will help give a sense of the power and simplicity of working with this (these?) tool (s).

In this example, we will explore some publicly available data published by the US National Oceanic and Atmospheric Administration (NOAA). To do so, we will first use Power Query to acquire and transform the data. Next, we will utilize Power Pivot to build our model, and then create a Power Pivot and a Power Map to visualize it. Finally, we will publish the data to a Power BI enabled site where it can be consumed by anyone with a browser or a mobile client.

Data Acquisition and Transformation with Power Query

The NOAA keeps a record from about 800 weather stations around the world. The data collected includes daily maximum and minimum temperatures, rainfall, and a host of other measures. All of this data is available through their web/ftp site here. Well it should be. As I write this, the site is offline, as are most other US government websites. Luckily, I have a local copy of all the files. Feel free to contact me for a copy if they’re not back soon. Update – For the interim, I’ve made the files available here.

The way that the data is provided is via a series of text files. Each text file in a folder represents a single weather station, and each row contains all the data for a given measure for a month. The data are not delimited, they are distinguished by their position in the row, with the data for each day in the month being presented as a separate measure. Normally, a tool like SSIS would be required to transform the data into something that could be analyzed, but as you’ll shortly see Power Query is up to the task.

Unfortunately, Power Query cannot work with FTP sources, but it’s a rather simple matter to copy the files from the FTP site to a local drive, and there are tools that can keep the content synchronized for refresh purposes. Note, if anyone from Microsoft is reading this, I believe that FTP and WebDAV folders would be a great supplement to the Folder data source, particularly when it comes to automatic refresh. Once we have the files locally, we can begin the import process. First, we open a new Excel workbook, select the Power Query tab, and then in the “From File” dropdown, we select the new feature “From Folder”.

image

Next, we select the folder that contains all of the data files. Power Query returns the data for the files in the folder.

image

In our case, the file name does correspond to the ID of the station, but the rest of the file metadata is not that useful to us. However, what is useful is the fact that using Power Query, we can drill into the files. Depending on the requirement, this can be done one of two ways. By clicking on the word “Binary” in the Content column, you will be able to drill into that file, and subsequent operations will be performed on that file. However, if you click on the little drilldown icon at the right of the Content column header, you will drill into all files in the folder. Subsequent operations will be performed on all files in the folder, and refresh operations will pick up any new files.

It should be noted that for this to work, all files in the folder must have the same schema. (In addition to data like this, this feature is particularly useful for parsing and iterating through server log files.) Since we want all station data, this is the operation that we will perform next. What we get back is a column of seemingly randomly structured data.

image

Luckily for us, there is a key to the data structure (found in the file readme.txt if you’re following along). The basic key pattern is:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
YEAR         12-15   Integer
MONTH        16-17   Integer
ELEMENT      18-21   Character
VALUE1       22-26   Integer
MFLAG1       27-27   Character
QFLAG1       28-28   Character
SFLAG1       29-29   Character
VALUE2       30-34   Integer
MFLAG2       35-35   Character
QFLAG2       36-36   Character
SFLAG2       37-37   Character
  .           .          .
  .           .          .
  .           .          .
VALUE31    262-266   Integer
MFLAG31    267-267   Character
QFLAG31    268-268   Character
SFLAG31    269-269   Character
------------------------------

In plain English, this means that the first 11 characters are the ID (the station ID actually, from which the country can be extracted), the next 4 characters are the year, the next 2 the month, the next 4 the “element” (the measure name). The next 4 “variables” are the value (5 characters), and 3 different flags (one character each, which we are not interested in) for the first day of the month. This pattern of 4 variables is then repeated 30 more times, once for each day of the month. A value of –9999 represents a null value.

We have some parsing to do.

Luckily, Power Query makes this (relatively) easy. Right click on the column header, select split column, and then select number of characters.

image

Next, for our first column, we set the number of characters to 11, and we tell it to split once, as far left as possible. We’ll use this option for our remaining splits.

image

Click on OK, and we will now have 2 columns. We then right click on the leftmost column, and rename it to ID.

image

Next, we right click on our rightmost column header and repeat the process (using 4 characters this time), this time renaming it to Year. We then do the same for the month (2 characters) and element (4 characters) columns. Once this is done, we can click the dropdown on the elements column. Since we are only interested in the maximum and minimum daily temperatures, we deselect all, and reselect these two elements. This will limit the amount of data that we will pull into our model.

image

At this point, it is also worth noting the steps window on the right hand side of the query editor. This shows all of the transformation steps that have occurred to get us where we “are”. Any of these steps can be deleted, which is very useful if a mistake is made, and all subsequent steps will be affected by it. Clicking on a previous step takes you back to that point in the development of the transformation as well.

Once we have filtered our data, we continue on with the parsing. We then split off the next 5 columns, and rename the column to “1”. The next three columns we don’t need, so we first split them off (3 characters) and then delete the column by right clicking and selecting “Remove”. We then repeat this, renaming the value column to 2. Finally, we continue repeating this process until the value for 31 is complete. This is a tremendously tedious process, but it works rather well. I’ll make another feature request here for anyone in a position to do something about it. In the Power Query editor, it would be nice to be able to select a number of steps and either copy them forward, or be able to repeat them x times. This would save a great deal of time in these situations.

Its also a good idea to give the query an understandable name, so we right click on “Query 1” and rename it to “Temperature Data” .

image

As mentioned above, the figure –9999 represents a null value. Well, we know this, but the model won’t. At this point we need to convert these to proper nulls. The way that this is done is to select all of the numbered column headers, right click on them and then choose “Replace Value”. We replace the value –9999 with the word null.

image

Now that we have our nulls, our data is good. However, from an analysis perspective, the data is hardly in the best shape. What we need is to is to have each day’s data represented by a row, not a column. Once again, Power Query provides a method for us to accomplish this through the unpivot function. Simply select all of the numbered columns, right click, and choose “Unpivot Columns”.

image

Instantly, the numbered rows are transformed into rows, with the column titles in a column named “Attribute”. Simply rename it to “Day”, and we’re ready to import our data. To do so, click “Done” and Power Query will begin the extract, transform, and load process. In this case, it takes quite some time to iterate through all of the files (about 15 minutes) so be patient.

While the data is loading, you will need to make some adjustments to your Query Settings (to the right of your Excel window). Specifically, you’ll want to turn off the “Load to worksheet” option. As I’ve discussed previously this is important to keep the size of the actual worksheet small. in this case, it’s also important as the sheer volume of data exceeds Excel’s ability to store it. When complete, your worksheet should appear as follows.

image

You can see that there’s a fair amount of data here, about 31.5 million rows. Did I mention that we’re working with Excel here? We’ve disallowed loading to the worksheet, so at this point, we need to load into the embedded data model. Do that by clicking on the link at the bottom of the Query Settings screen.

There is also some more supporting data that we want to bring into our model. The following files need to be imported:

  • ghcnd-stations.txt – contains metadata for all of the different weather stations
  • ghcnd-countries.txt – contains metadata for all participating countries
  • ghcnd-states – contains metadata for US states and Canadian provinces

The keys to these files are also in the readme.txt file, and we bring in their data using the same procedure that we used above for the actual weather data, with the exception that we use the “From file” data source. Finally, we give each query a logical name.

The beauty of all this is that anytime we wish, we can refresh the data. Any new or changed data will be picked up and added to the model following our transformation.

Model Editing In Power Pivot

Once this is complete, we are ready to edit our model. We do so by selecting the Power Pivot tab, and clicking “Manage” in the ribbon. First, we work on the actual temperature data tab. The temperature values reported are in tenths of a degree Celsius. This doesn’t mean much to most people, we should at least report the data in degrees C. To do this, Click on Add Column, and add the formula [Value]/10. Rename the column to Temperature.

We always want to be working with temperature averages. To do so, add a calculated measure named Average Temperature (C) (in the calculation area at the bottom). Its formula should be Average Temperature (C):=AVERAGE([Temperature]). We no longer need the value column or the temperature column for analysis, so we need to hide them from client tools. To do so, select both the Value header and the Temperature header, right click and select “Hide from Client Tools”. Finally, we set the formatting of our calculated measure to be Decimal Number.

image

Next, we want to establish a relationship between the data and the country table. We don’t have a column for country, but according to the data key, the first two characters of the country code correspond to the country code. We can therefore add another calculated column with the formula LEFT([ID],2). We then rename the column to Country Code. We only need this column for the relationship, so we can hide this from client tools as well.

image

To establish the relationship, change to the diagram view in the ribbon. Drag the Country Code field from the Temperature Data table onto the CountryCode field from the Countries table. While we’re in this view, we can drag the ID field from Temperature Data onto the ID field of the stations data. this will allow us to take advantage of the location of each of the stations. Finally, drag the State column from the Stations table onto the State Code column of the States table. At this point, we can go ahead and hide any unnecessary columns from client tools.

image

Our model is ready. At this point, we can move forward with analysis.

Create a Simple Power View Report

We want to create a simple Power View report to display temperature trends over time, Move back to the Excel Workbook and click the Insert tab in the ribbon. Then select the Power View button – this opens up a new Power View design surface. We won’t be needing the Filters window, so we can close that.

Right away, if you look at the Power View Fields window, you will notice that every table is duplicated – one with a “filled in” title bar, and one without. This window is displaying both defined Power Query queries, and the table contained within the model (Power Pivot). The “filled in” one is the one that we want to be working with. Using the other will cause duplicate tables to be created in the model – something that we obviously don’t want. This is due to an issue with the preview for Power Query and should be fixed in the near future.

We’ll be working with “Average Temperatures”  as a measure so from the start, we’ll pick it and “Year” from the Temperature Data table. Year is a number, so Power View will want to total it by default. We don’t want this, so in the Fields selector, select the dropdown for year and select “Do Not Summarize”.

image

Next, resize the table on the design surface so that it fills the width of the report. Then, with the table selected, from the Design tab in the ribbon, select “Other Chart” and “Line”

image

What we get is a graph of overall average of all recorded temperatures starting around 1840. It might be more interesting to see the maximum and minimum temperatures, so to do that, just select the “Element” field, and Power View will add it to the Legend, showing both values, maximum temperature in blue, minimum in red. It would be nice if we could switch those colours, but unfortunately, all you can do at the moment is to select a different theme, none of which start with red.

Next, it would be nice to be able to see this data by individual reporting country. We can do this by first displaying a chart showing the overall average temperature per country. We click on the design surface outside of the temperature chart, and select the ”Country” field from the Country table, and the “Average Temperature” field from the Temperature Data table. Again,we resize the table to the width of the report, and then select “Column Chart”, “Clustered Column from the Design tab. We then get a scrollable list of countries with their overall average temperatures.

We could sort this by temperature, but since it will service as a picker, alphabetical makes more sense. Being from Canada, I’m naturally interested in our weather patterns, so I click on the Canada data bar (a negative value…shiver). Immediately the report above is filtered by Canadian data.

image

It seems that this whole climate change thing isn’t working out as well as I’d hoped….

Finally, we want to give the report a good title, and clean up some formatting. After we do this, we can move on to the next step, creating a Power Map report.

Create a Simple Power Map Report

Creating a Power Map is similar to a Power View. From the Insert tab on the ribbon, select Map, and then “Launch Power Map” from the dropdown.

image

At this point a new window will be launched where you can build your Power Map report. What we want to do is to show the relative temperatures for each country using a colour scale from low to high, and to show the average maximums and minimums for each recording station. To start with, for our first layer, we are going to use Country from the Countries table for our Geography data, so we select it. Every country with data will get a small point on the map.

image

Clicking next, we then get to select our data for the first layer. The data that we want to map is Average Temperature, So we select it. The resulting column visualization makes the earth look like a pincushion. What we want is to fill in the country boundaries with relative colour values. To do this, we select the  region type for the visualization. This is better, but it fills the regions with shades of blue from light blue to dark blue for the warmest averages. It’s far more typical for red to represent heat, and we can do this by first selecting the settings icon.

image

We can then select a colour palette for our data, and red is a more logical color. Now you can spin the globe around to check out your next vacation hotspot. However, we want to get more specific than country, we also want to plot details for each reporting weather station. First, we need to add a layer. To do so, click on the Layers icon and select the add layer icon.

image

We then repeat the same steps that we did for the first layer, only instead of selecting country, we select Latitude and Longitude for the geography fields. After clicking Next, we leave the report as a Column type, and select Average Temperature as the height. Next, we select Element as the category option. We want the report to use clustered columns, so we now select that option to the right of the category title.

image

Finally, we want TMAX to be red, and TMIN to be blue, so we again click on the settings icon for this layer, and set the colours appropriately.

image

From this, it’s easy to see why Canada’s numbers are so low – there are a disproportionate number of stations in the far north! We’re now ready to share our report with Power BI.

Sharing Reports With BI Sites

Sharing your report is almost as simple as saving to a BI Sites enabled Office 365 library. We can either upload it directly, or save it to a SkyDrive Pro enabled folder, which is what I do. At this point, the file is 100 MB. On a side note…. this Excel file, that contains over 30 million rows of data is only 100 MB in size. That’s impressive. After the file appears in the O365 library, we can attempt to open it. However if we do, we’ll get a message saying that the file is to big to open in the browser.

That would be true in a regular Office 365 library, but it isn’t too large for Power BI. The problem is that first, the file needs to be enabled. More information of file size limits and enablement can be found here. To enable the workbook for Power BI, first navigate the the Power BI app, locate your file (it will be showing a small caution symbol), click the ellipsis, and select “Enable”.

Enabling the file will perform a few operations:

  • The model will be extracted from the file and stored elsewhere, drastically reducing the file size
  • A thumbnail is extracted from the file and used in the tile

It should be noted that if the file is uploaded via the Power BI app, the saving and enablement steps will be performed automatically.

Clicking on the file, either here, of within its actual library will now open the file, and we can also interact with the model (in this case by clicking on the data for Bermuda).

image

At the top, you’ll notice a warning message. It is letting you know that although this workbook can be viewed in a browser, and interacted with, it cannot be edited in a browser. This is because it is beyond the maximum edit size limit of 30 MB. You may also notice that we are using an HTML5 version of our Power View report. By default, Silverlight is used, but if you click on the new icon on the bottom left of the report, HTML5 will be used. It will also be used if you have a system that does not support Silverlight. This is true anywhere in Office 365, not just in Power BI enabled sites.

Finally, you may also notice that our Power Map report is missing from the equation. This is because for the moment, Power Map is not supported in the browser. A workaround for the time being is to export the Power Map report to an MP4 video file, and store it separately. MP4 files can be rendered in a browser, and on mobile devices.

Viewing In the Power BI Mobile Apps

Speaking of mobile devices the Power BI app for Windows 8 is currently available in the Windows store, with a version for iOS coming shortly. Once installed and run you’ll be shows a couple of sample reports from a demo site. To navigate to your own site, right click anywhere, or swipe up from the bottom, and select “Browse”.

image

On the next screen, we can select from our registered sites. If our site has not yet been registered, we can do so by again right-clicking, or swiping up. Then, we can enter the URL of your site. The URL to be entered here is that of the parent site of the library in which your report is stored. Once we load from this site, it will be automatically registered.

The next few screens allow us to navigate down to the file in question. Once the file appears in the list, selecting it (not opening it) gives us the option to save it as a favourite. If we mark a report as a favourite, it will appear on the home screen whenever the app is started.

image

Finally, opening it will open the first names object in the file. You can navigate to other named objects by right clicking, or swiping down from the top.

image

As you will notice, all names objects in the workbook appear by default. We can control what appears through the Excel publishing options, as outlined here.

As you can see, we have just taken a very large amount of publicly available data, transformed it and analyzed it within Excel, and published it out for everyone in the organization no matter what device they are using. I haven’t even gotten into the Q&A piece – primarily because it’s not available yet for custom data sets. I plan on updating this article when it is.

Power BI is powerful indeed.

Software Boundaries and Limits for Power BI

The Office team have published a  comprehensive resource outlining all of the current boundaries and limits for SharePoint Online generally. Power BI offers a host of new capability, and in the absence of anything official from Microsoft, I thought it might be a good idea to itemize my findings below. I’m clearly leaving things out, and if you have something to add, I’d love to here from you. This list is over and above the inherent boundaries of Office 365.

This is obviously not official, and I’ll try to modify it as these limits change, but for now, here are the current limits for Power BI, in its preview form.



Feature Description
Maximum workbook size (not enabled) 10 MB
Maximum size of non-model workbook data (enabled) 10 MB
Maximum size of data model 250 MB
Maximum workbook size (enabled) 260 MB (250 MB model + 10 MB workbook data)
Maximum size of workbook for editing 30 MB (This includes interactive refresh. Larger workbooks still support interaction)
Number of tables/views in a database visible to Power Query Navigator 2,000
Maximum number of thumbnail reports in Mobile application 6
Maximum OData query size with the Data Management Gateway 100 MB

 

Updated 2013-11-17 – Navigator limit changed to 2,000 and OData query limit removed with the Nov 2013 version of the gateway.

Power BI – Working With the Data Management Gateway

 

Update August 2014 – Version 1.2 of the DMG offers significant changes, which I’ve written about here.

While it isn’t the flashiest, the Data Management Gateway is arguably the most important part of Power BI. Most of the other features of Power BI have either been available already, or could be achieved through alternate means. What was next to impossible was to have a workbook stored in Office 365 automatically update its data from an on premises data source in a manner similar to Power Pivot for SharePoint. The Data Management Gateway does this, and a fair bit more. This article will attempt to explore some of its intricacies, tricks, and things to look out for. This article is being written while using the invitation only preview of Power BI, and some of the specifics could will change by public preview, and certainly by General Availability.

The Data Management Gateway is a Windows service, and can run on any Windows PC. Well, not just any PC, it needs Windows 7 or above, but it’s a relatively light service. In my case for the preview, I’m running it on my main workstation. It’s also relatively small, and you can run multiple gateways within your organization. If you want to have a look at it, you can download it here whether or not you have a Power BI enabled Office 365 tenant, but if you don’t, you won’t be able to do much with it.

Installing the gateway is pretty straightforward, and is well documented here, so I won’t go into details on that. Once it is installed, it establishes a communication channel with your Power BI service in the cloud (no firewall holes required), and essentially waits for requests. Once it gets one, it acts as a broker between your cloud service and your on premises data.

Most configuration is done in the cloud, in the Power BI Admin Center. If you’re interested in monitoring its activity on the gateway machine, You can do so using Resource Monitor or Task Manager. The process that you’re looking for is diawp.exe or diahost.exe. I have no idea there those names come from, but I’m going to guess Data Integration Agent Worker Process and host.

image

Once installed, the gateway performs two major semi related functions. The first is the aforementioned on premises data refresh capability. In addition to this, the gateway also provides the ability to publish your on-premises data sources as an OData feed that is will be accessible in the cloud. In its current version, the gateway only supports a limited set of data sources – all of them SQL Server. The official list can be found at the bottom of this document. Although its not listed specifically, SQL Azure databases are also supported.

I’m going to drill down a little on these two main functions, and share some of my experiences. We’ll start with the OData feed.

Publishing an OData Feed

An OData feed is published by creating a data source. This is done from within the Admin Center by navigating to the data sources tab, and selecting “new data source”.

image

This starts a data source “wizard”. The first question to be answered is what will this data source be used for.

image

It isn’t necessary to create an OData feed in order to refresh on-premises data, but it is necessary to to create a data connection (more on this later). The “enable cloud access” option essentially tells the gateway that it’s OK to allow on demand and automatic data refreshes from the cloud. The “Enable OData Feed” option is pretty self explanatory – if you don’t enable it, the only thing that the connection can be used for is data refresh. These two can be selected independently of each other. After selecting next, you are presented with the connection info screen.

image

To start with, you’ll begin by giving the connection a name. Since the connection will typically be to a specific database, the name should reflect that. You may also wish to add something about the database’s location if you deal with similar data sets. The next selection is that of the gateway. You MUST have a gateway in order to create a data connection. You can have multiple gateways registered in your tenant, and the selection of the gateway will dictate the connection provider choices. It’s an easy thing to forget to choose your gateway, and then think that the page is broken because there are no Provider choices.

You can choose one of two methods to create your connection – connection string, or (if you’re lazy like me) you can choose Connection Properties, and let the tool build your string for you. When it comes to troubleshooting data refresh, you may find that the connection string method is more helpful, but either way should be equivalent. Once you have completed the bulk of this form, you need to enter credentials. First, select the method of authentication – your choices are Windows Authentication or SQL Authentication. Then select the credentials button which launches the Data Source Manager.

image

The Data Source Manager is a one click application that communicates directly with the gateway and is used to register the connection’s credentials directly with it – they are not stored in the Office 365 tenant or the BI Sites app. Because of this direct connection, you need to be on a network that is local to the gateway. This will not work from a remote location. If, like me, you are not joined to a domain, you will also need to be on the gateway machine itself.

On launch, it will go through a series of checks (this can take a while), verifying the gateway and the tenant. When it’s ready, it will show “Data Source Manager initialized successfully”, and you can go ahead and enter the credentials. Once you do, be sure to use the “test connection” button to verify that everything is working. When ready, click OK to register the credentials, then click save to save the data source. You will then be taken to the data settings page.

image

From here you can choose to not only index the metadata in the data catalogue, but also the data itself, and you can choose how frequently it is updated. The indexing option is currently disabled in the preview, so I have little to say about it at this point. Its purpose is to improve discoverability of the data via Power Query. The second section is the selection of tables to expose to the OData feed. You can choose from any tables or views, but as you can see from the example above, if your table/view contains any unsupported types (in this case geography fields), the entire object will be unavailable for publishing.

Clicking OK brings you to the users and groups page. From here, you can select those that will be able to use these data sources in Power Query, or manually refresh workbooks in the browser. As with all things SharePoint, it’s a good idea to use a group here.

Once done, your data connection is ready, and your OData feed is available. To use it, you’ll need to discover its address. You can do this by clicking on the ellipsis beside its name in the list.

image

image

As you can observe from the URL, this is a cloud service. You should be able to connect to the service from anywhere, and it will connect through the gateway to serve up the data. While this is great from a mobility standpoint, if you happen to be on premises, this would be quite inefficient, as the data would first need to be transferred to the endpoint in the cloud, and then back to the source network.

The good news is that the gateway is able to detect when you are accessing the feed locally, and it will redirect you to the source without sending the data up to the cloud and back. The bad news for us preview users is that this is the only thing working at the moment. Therefore, for the preview period at least, in order to access the OData feed, you must be on a local network. Specifically, you must be able to resolve the server name defied in the connection string.

If you meet these conditions, you can test the feed using Power Query. In Excel, go to the Power Query tab, and select “From OData Feed” in the “From Other Data Sources” dropdown.

image

You will then be prompted to log in. You need to use an Organizational Account (Office 365) to do this. This is an account associated with the Power BI license. However on this screen it’s referred to as a “Microsoft Online Services ID”.

image

I personally feel this could easily be confused with a Microsoft Account (Live ID). What’s worse is that Microsoft Accounts can be used with Office 365, so it’s certainly less than clear as to which credentials should be used here.

Clicking on the Sign In button takes you to a standard authentication dialog. In the preview, there is a small bug that requires you to enter your account completely in lower case. Failing to do so will cache the wrong credentials, and you’ll be denied access moving forward. If you encounter this problem, the solution is to close Excel, clear the browser cache and to restart.

Once you have authenticated successfully, you can save the connection (Power Query will save it in its cache), and work with it as with any other data source.

Thus far, I’ve only been able to use Power Query to connect to the OData feed. I have tried using Power Pivot directly, but although it is supposed to support OAuth, I can’t seem to save my Office 365 credentials. I’ve only tried via these two mechanisms – if anyone has tried others, I would love to know about it.

Refreshing Data in a Power BI Enabled Workbook

To start with, at the time of this writing (Power BI preview) quite a number of features that will be available in GA have not been enabled and/or are not supported. Data refresh scheduling is as yet unavailable, and the data sources that can be refreshed are restricted to direct SQL connections. Models created with Power Query cannot yet be refreshed. As these features become available, I will update this article with any relevant findings.

Given the fact that the Gateway is required to support both an OData feed and for data refresh, you might think that you must use the OData feed in your data models in order for them to be refreshable. This is not the case. When a refresh is requested, the model is interrogated for its data connections. The data catalogue is then interrogated for a data source with a matching connection string, and if found, is used. The Gateway is then called to retrieve the data if it is on premises. If the data source is SQL Azure, the Gateway is still used, but the data is loaded directly from SQL Azure – it does not need to be sent to the Gateway first.

As mentioned above, Power Query queries cannot yet be refreshed by the gateway. The only type of connection that I’ve been able to successfully refresh thus far is one created directly in Power Pivot. When creating a connection in Power Pivot, pay close attention to the connection string. You may need it later if you have refresh issues.

In addition to the data sources supported by the gateway, two other data sources can be refreshed. Project Online has a number of OData feeds that can be refreshed directly, and public OData feeds (not requiring authentication) can also be refreshed. I don’t currently have an instance of Project Online, so I don’t have much to add apart from the fact that it is supposed to work. I have tested refresh with public feeds and they do in fact work well. The interesting thing I noticed was that while it worked in my Power BI tenant, it also worked in my regular Office 365 tenant. Apparently this feature has been there for some time.

As I mentioned, scheduled refresh is not yet available. When it is, this will be done from the BI Sites app, the same way that you “enable” a workbook. For now, it must be done manually. This is done the same way that it is with an on premises workbook. First open the workbook in the browser, and then, from the data tab, select “Refresh All Connections”.

image

The workbook will go dim, and you’ll see “Working on it…” or “Still working on it…” for a bit – it does take some time to refresh, depending on the data set. Using the methods that I mentioned at the beginning of this article, you can monitor the progress of the refresh, and the impact on the gateway machine. Also, for the moment at least, if your data source is SQL Azure, prepare for a long wait – refresh time takes an exceptionally long time (in my case, about 10 min for a 1 million row x 20 column set of simple data types). The Azure refresh time should be addressed by GA.

Monitoring The Gateway

There are already a number of tools in the preview to help with troubleshooting Power BI, chiefly aimed at the Data Management Gateway. They can be found on the “system health” tab in the Admin Center.

image

The default screen shows the CPU utilization of your gateway machines (in this case, I have all of 1…) at the top, and the availability of your gateways at the bottom. At first glance, you would think that while my gateway is good from an availability standpoint, it’s taking a lot of my CPU. The reality is that the top chart shows total CPU utilization on the machine. If you want to see the utilization of the Gateway itself, you need to choose the specific machine from the Display dropdown.

image

Here you can see that while the machine utilization hovers around 40% (blue), the gateway utilization is barely noticeable (red).

Finally, what is likely the most useful part of monitoring, the logs, is well hidden. The logs are useful for troubleshooting data refresh issues, and can be accessed by clicking on “logs” which is at the top of this screen beside “dashboards”.

image

In the browser, you can see basic event information and basic error information if you have errors. However, if you download the logs as a CSV, you will see much more detailed information. If you are having problems with data refresh, particularly in the preview, I strongly recommend downloading it. One of the important pieces of information that it contains is the connection string that is being used:

image

You can compare that to the connection string that is being used by Power Pivot in the workbook. You find the string in Power Pivot by first clicking on “Manage” in the Power Pivot tab, and then in the Power Pivot window, choose “Existing Connections” in the ribbon. You should see your connection under “PowerPivot Data Connections”. Select it, and click the Edit button.

image

In the next screen, click the “Advanced” button. You should then be presented with the data connection property dialog.

image

Here, at the bottom, you will find the connection string being used by Power Pivot, and this is what the Gateway uses to look for one of its registered data connections. If you find any discrepancies, the chances are that they are at the source of your refresh problems, and that they should be addressed.

This is an early, first glance walkthrough of some experiences using the Data Management Gateway. Hopefully it can be of some help for the early adopters. I will try to keep this updated as Power BI moves from preview to General Availability.