BigQuery

How to find emails in your BigQuery data?

black laptop computer

If you are a GA360 customer and export your data from Google Analytics to BigQuery, you might have from time to time the problem, that somehow email addresses sneaked into your data. This is a violation of privacy laws (e.G. GDPR in the EU) as well as the Google license agreement. For several clients, I check from time to time or regularly monitor their incoming data to prevent these data leaks. In this article, I will share an easy SQL query on how to find email addresses in BigQuery.

Did you find eMails in your BigQuery data? Read here what to do NOW.

Based on the BigQuery Export Schema I checked in which fields an email address possibly could wrongfully end up. The most common fields are the page URL, events and esp. in custom dimensions.

I wanted the results as an easy list with just 4 columns, thus the query ended up being quite long due to the UNION ALL’s. The 4 columns of the result are:

  • fullVisitorId
  • date
  • key of the field
  • value of the field

With these results, it is easy to find and remove the email addresses from the BigQuery tables.

How to execute the query?

There are 2 changes you need to make:

  1. Change the start and end date in lines 1+2. Start with a small date range, since the queried data volume might be big, and thus expensive.
  2. Change the project and dataset in each of the subqueries. Sadly, BigQuery does not allow to use of variables for table names, thus you need to change this in each subquery. It’s the part “<project>.<dataset>”.

And here it goes:

DECLARE startdate STRING DEFAULT "20220101";
DECLARE enddate STRING DEFAULT "20220101";

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "userId" as key, userId as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(userId, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.adContent" as key, trafficSource.adContent as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.adContent, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.campaign" as key, trafficSource.campaign as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.campaign, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.campaignCode" as key, trafficSource.campaignCode as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.campaignCode, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.keyword" as key, trafficSource.keyword as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.keyword, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.medium" as key, trafficSource.medium as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.medium, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.referralPath" as key, trafficSource.referralPath as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.referralPath, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "trafficSource.source" as key, trafficSource.source as value
FROM `<project>.<dataset>.ga_sessions_*` as t1
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(trafficSource.source, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.transaction.transactionCoupon" as key, hits.transaction.transactionCoupon as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.transaction.transactionCoupon, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.referer" as key, hits.referer as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.referer, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.page.pagePath" as key, hits.page.pagePath as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.page.pagePath, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.page.pageTitle" as key, hits.page.pageTitle as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.page.pageTitle, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.page.searchKeyword" as key, hits.page.searchKeyword as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.page.searchKeyword, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.eventInfo.eventCategory" as key, hits.eventInfo.eventCategory as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.eventInfo.eventCategory, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.eventInfo.eventAction" as key, hits.eventInfo.eventAction as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.eventInfo.eventAction, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, "hits.eventInfo.eventLabel" as key, hits.eventInfo.eventLabel as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(hits.eventInfo.eventLabel, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

UNION ALL

SELECT
fullVisitorId, extract(date from timestamp_seconds(t1.visitStartTime)) AS date, CAST(cdh.index as string) as key, cdh.value as value
FROM `<project>.<dataset>.ga_sessions_*` as t1, UNNEST(hits) as hits, UNNEST(hits.customDimensions) as cdh
WHERE _TABLE_SUFFIX BETWEEN startdate AND enddate
AND REGEXP_CONTAINS(cdh.value, r"([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]*)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.")
GROUP BY 1,2,3,4

Leave a Reply