Skip to content

Latest commit

 

History

History
47 lines (40 loc) · 2.15 KB

regex-parse-url.md

File metadata and controls

47 lines (40 loc) · 2.15 KB

Regex: Parse URL String

Explore the query with some demo data here.

Description

URL strings are a very valuable source of information, but trying to get your regex exactly right is a pain. This re-usable snippet can be applied to any URL to extract the exact part you're after.

SELECT 
   REGEXP_EXTRACT(<url>, r'(?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?') as host,
   REGEXP_EXTRACT(<url>, r'(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)') as path,
   REGEXP_EXTRACT(<url>, r'\?(.*)') as query,
   REGEXP_EXTRACT(<url>, r'#(.*)') as ref,
   REGEXP_EXTRACT(<url>, r'^([a-zA-Z]+)://') as protocol
FROM
   '<project.schema.table>'

where:

  • <url> is your URL string (e.g. "https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#Ref1"
  • host is the domain name (e.g. yoursite.com)
  • path is the page path on the website (e.g. pricing/details)
  • query is the string of query parameters in the url (e.g. myparam1=123)
  • ref is the reference or where the user came from before visiting the url (e.g. #newsfeed)
  • protocol is how the data is transferred between the host and server (e.g. https)

Usage

To use, just copy out the regex for whatever part of the URL you need. Or capture them all as in the example below:

WITH my_url as (select 'https://www.yoursite.com/pricing/details?myparam1=123&myparam2=abc#newsfeed' as url)
SELECT
  REGEXP_EXTRACT(url, r'(?:[a-zA-Z]+://)?([a-zA-Z0-9-.]+)/?') as host,
  REGEXP_EXTRACT(url, r'(?:[a-zA-Z]+://)?(?:[a-zA-Z0-9-.]+)/{1}([a-zA-Z0-9-./]+)') as path,
  REGEXP_EXTRACT(url, r'\?(.*)') as query,
  REGEXP_EXTRACT(url, r'#(.*)') as ref,
  REGEXP_EXTRACT(url, r'^([a-zA-Z]+)://') as protocol
FROM my_url
host path query ref protocol
www.yoursite.com pricing/details myparam1=123&myparam2=abc#newsfeed newsfeed https

References Helpful Links