Getting into data visualization for finding new opportunities and accelerating analysis? Me too! But there’s a few best practices I’ve learned along the way when connecting datasets by URLs.
When might you want to connect by URL?
Here’s a few outputs I find I’m generally connecting on the URL level where I need to really clean the URLs to make sure I’m getting a big picture view of my data sets:
- Google Analytics
- Google Search Console
- Screaming Frog
When you’re handling BIG data it isn’t worth taking a shortcut and potentially missing connections from messy URLs. Do I even need to keep going?
☑️ Hop into the Power BI query editor and let’s get going!
New to Power BI? 📽📽 Power BI for Digital Marketers (Watch Lesson 1 with Wil Reynolds)
I prefer to play it safe and make a duplicate column of my original URL column and then transform the duplicate.
Why? So that if I make a mistake or change my mind about the way I’m transforming my data I can just make another duplicate and start again instead of deleting my applied steps on the original and potentially affecting another column or chart.
Are you working with mixed-case URLs? I hope not, but either way, your first step should be to transform your URL column into lowercase just in case.
- The Clean Query returns the original text value with non-printable characters removed. Right click your URL column and select Transform > clean from the dropdown.
- The Trim Query removes whitespace values from the beginning and end of the text value.Right click your URL column and select Transform > trim from the dropdown.
I like to separate protocols by splitting the column with a delimiter – that way I “split” the column into the protocol and the rest of the URL so I can reference http vs https later with a slicer or a pie chart if I want to know if there’s a major discrepancy between the two.
- You can also click Split Column in the top navigation under the Home tab.
Step 2: Enter the custom delimiter :// and split at Each occurrence of the delimiter (technically any option should work here).
Your URL column should now be split into 2 columns — one containing the protocol and one containing the URL.
You’re not alone.
Ahhh, the dreaded “what is this thing at the end of my URLs?”. Obviously you can’t fully identify when one data set is adding something weird on the end of the URLs, or when some URLs feature a string at the end and some don’t.
- Right click on your column and select Replace Values to replace the appending value with a blank.
- Use Split Column by delimiter to split it into another column, using the value as the delimiter.
Step 1: Use the Split by Delimiter function and split using a “?” to create a column holding your parameters and a canonical version of the URL.
Trimming trailing slashes is a must but you can’t use split by delimiter here – if there isn’t a trailing slash on the URL you’ll just lop off the last path.
To remove trailing slashes, we’ll use a formula that will trim any URLs with a trailing slash and leave any URLs without a trailing slash alone.
- This will automatically create a new column named “Trim”. To enter our formula, select the gear to the right of Inserted Trimmed Text in your Applied Steps to edit the formula.
- We should now have our original URL column, the protocol column, the “Clean” URL column, the parameter column, and the Trimmed URL column.
🎬 Catch up with us on YouTube for more Power BI goodness!
Now that you’ve cleaned your URLs into canonical versions you may have duplicates in your Trimmed column which could prevent you from creating a relationship through this column. To learn how to properly clean a list and de-duplicate — subscribe below to read our upcoming blog post on cleaning & de-duping your data!