As a follow up of the Web Scraping post, here, In today’s post, we will use the Add Column by Example feature to extract links from web pages. (Note: You may have already read this part here, but I am sure that many readers have missed it),
Stay tuned for Part 3, where we will go step by step and create the initial queries of Power BI Custom Visuals report.
In the first blog post on Web Scraping, here, we learned how to scrap tables from the Web, by navigating through the HTML Document Children tags. But sometimes, you also want to extract the links for each element in the table. Unfortunately, you cannot extract the links by using the method I’ve shared. But let’s find out how we can extract links using the new Add Column by Example.
We will demonstrate the scraping of links from Microsoft MVP official website. Our goal is to extract the links of the MVP images and profile pages, and combine them with the names, country and award category which were extracted in the last post.
Start with the Power BI Report that you created in the last post. Click Get Data and select Other and Web. Then click Connect.
In From Web dialog paste the URL below and click OK.
In the Navigator dialog select Document, and click Edit.
Now is a tricky part… We would like to remove the Web.Page( … ) formula and keep the Web.Contents inner part. If you try to do it directly from the formula bar, the Query Editor will undo the change.
Click Advanced Editor, and remove “Web.Page(“ and its closing parenthesis “)” from the following formula:
Here is the modified formula:
Now, in the Query Editor preview, you will see the icon for mvp.microsoft.com. Don’t double click on it.
Right click on the icon, and select Text.
The Query Editor will handle the HTML file as a single column table, each line in the HTML file will be placed in a cell in the new table:
Now, let’s filter the lines we need. By viewing the HTML source in the browser, we can see that each MVP contains a link to his/her image. The links contain the string “/PublicProfile/”.
Click the filter control, select Text Filters and click Contains…
In Filter Rows, paste the text /PublicProfile/ and click OK.
Now, the Query Editor shows the img tags with our MVP photos.
Finally, we are going to start using the new feature.
In Add Column tab, click Column From Examples.
Enter the number that is highlighted below (7914) in the first cell of the right column, and press Enter. The Query Editor will magically extract the correct MVP ID from each row (I must say – This is a real fun).
In the black notification bar above, you can review the M expression, that was automatically generated for us. If you know M, this is where you can validate that a correct formula was generated, or learn new stuff. In our example: you can see that a new function Text.BetweenDelimiters was used (learn more about the new function here). As a new function in Power BI Desktop, if you try the same query in Excel you will get the following error:
We will need to wait for one of the next Office updates to include the new functionality including the Text.BetweenDelimiters function (Note: I tested the compatibility with Excel on Power Query version is 2.42.4611.241. The Power BI Desktop version that supports the new functionality is 2.45.4704.442).
So, make sure your queries are not required in Excel. They will not be refreshed.
Let’s continue the scraping. It’s fun. We can rename the new column as MVP ID, and create a new Column From Examples. This time, we will enter en-us in the first cell of Column2.
You can now see that Text.BetweenDelimiters is used again, to extract the locale from the URL. We can now click OK in the bar, and rename the new column as Locale.
Let’s create a third column from Examples. In the first cell we will type:
As a result, the Query Editor detects that we would like to concatenate the string “https://mvp.microsoft.com/” with Locale, the fixed string “/PublicProfile/Photo/” and MVP ID. We can click OK in the bar, to get the links for the MVPs’ photos.
What’s next? A quick reverse engineering on the MVP website, reveals that the links to the MVPs’ profile is “https://mvp.microsoft.com/en-us/PublicProfile/” followed by the MVP ID. It’s easy to create a new column with the relevant example.
We can delete the columns MVP ID and Locale, and rename the last two new columns as Image URL and Profile URL. Let’s also rename the new query as URLs.
Finally, to combine between the table of URLs and the MVPs query we have created in the last blog post, we can add Index column to the two queries.
Note: The MVPs query was created in the last blog post. We used the web connector to navigate the HTML tags and extract the MVPs‘ name, award category and country.
Don’t forget to add an Index to the MVPs query.
Back to the MVPs query, we can click Merge Queries in Home tab.
In Merge dialog, select the Index column of MVPs, select URLs as the second query to merge, and select the Index column of URLs. Then, click OK.
Finally, expand NewColumn, select Image URL and Profile URL, and click OK.
That’s it. We can now load the combined table of MVPs with their name, award category, country, profile link and photo link to Power BI.
The Columns from Examples feature (or Add Column by Example) is simply amazing. We used it to scrap URLs from web pages.
Coming Next: Step by step tutorial on Web Scraping of Office Store to extract Power BI Custom Visuals. You can buy the full report (including many hidden gems that will not be published elsewhere), learn more here.