Google Tag Manager

How To Fix Your Web Analytics Setup If You Track PII Data


It happened again. About a year ago, I found thousands of email addresses in clear text in the BigQuery data of one of my clients. I wrote another story about how to find them. This one covers the immediate measures you can take to respond to such data leaks.

Kinds of PII data

Whether or not you are in the EU and must comply with the GDPR, you should care about your customers’ data and never send personally identifiable information (PII) to Google Analytics.

The IP address is also considered PII, but it is technically necessary. The bigger concern is other types of PII: names, addresses, passwords, credit card numbers and email addresses. You don’t want these in your GA data, and Google’s Terms and Conditions forbid sending them. Google could therefore close your account if you don’t follow its T&Cs.

Names could be anything and are therefore very hard to identify in your dataset. Credit card numbers are easier to find, since they are usually 16 digits written with only a few variations in spacing. Luckily, I have never found any in my clients’ datasets.

Email addresses, on the other hand, are very easy to find since they follow a strict pattern:

something@something.something

where something can’t contain a space.

In particular, the @ sign (%40 when URL-encoded) is a clear indicator of an email address. See this article about How To Find Emails in BigQuery.
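
As a minimal sketch, the pattern described above could be checked in JavaScript like this. The helper name containsEmail is hypothetical and not part of GA or GTM, and the character classes are my assumption of what "something" should cover:

```javascript
// Hypothetical helper, not part of any GA or GTM API.
// Matches "local@domain." as well as the URL-encoded form "local%40domain.".
var EMAIL_PATTERN = /[A-Za-zÀ-ÖØ-öø-ÿ0-9_.+-]+(@|%40)[A-Za-zÀ-ÖØ-öø-ÿ0-9-]*\./i;

function containsEmail(text) {
  // No "g" flag on the pattern, so test() keeps no state between calls.
  return EMAIL_PATTERN.test(String(text));
}

containsEmail('john.doe@example.com');                  // true
containsEmail('/thanks?email=john.doe%40example.com');  // true
containsEmail('/products/shoes?size=42');               // false
```

A check like this is deliberately loose: it is meant to flag suspicious values, not to validate email addresses.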

What kind of leaks are common?

Email addresses are the most common data leak in web tracking data. But how and where does it happen?

In a web analytics setup such as GA with GTM, or with other tools like Adobe and Tealium, there are two main types of leaks:

  1. Event or payload leak
  2. URL leak

1. Event or payload leak

In this case, the email address somehow ends up in the payload sent to the tracking software. This can happen if automatic form tracking is enabled or if the email is added as a variable. The data is leaked only to Google Analytics (and possibly BigQuery). This is bad but can be corrected easily.

2. URL leak

In this case, the PII ends up in the URL. This is the worst case: not only does GA receive the data from the page path of the current URL, but many other third-party tools receive it too, because they also read the page path.

What can be done as an immediate action?

1. Event or payload leak

To remove the email address from the GA hit payload only, a Custom Task is the perfect fit. Read more about Custom Tasks on Simo Ahava’s blog.

When removing the data, it is important to keep an indicator that tells you how often this actually happened, so that you can monitor the problem. The script replaces the email with the string “(email_removed)”, which makes it easy to create a report looking for this particular string.

The following script replaces all email addresses with “(email_removed)”.

The RegEx used to find email addresses is:

([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]+)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\.

And this is the script:

function () {
	return function (model) {
		// Keep a reference to the original sendHitTask so the hit can still be sent
		var originalSendHitTask = model.get("sendHitTask");

		model.set("sendHitTask", function (sendModel) {
			var i, parts, val;
			// The payload is a query string: key=value pairs joined by '&'
			var hitPayload = sendModel.get("hitPayload").split('&');

			// remove email from payload
			for (i = 0; i < hitPayload.length; i++) {
				parts = hitPayload[i].split('=');
				// Double-decode, to account for web server encode + analytics.js encode
				try {
					val = decodeURIComponent(decodeURIComponent(parts[1]));
				} catch (e) {
					val = decodeURIComponent(parts[1]);
				}
				val = val.replace(/([A-Za-zÀ-ÖØ-öø-ÿ0-9-_\.+]+)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\./gi, '(email_removed)');
				parts[1] = encodeURIComponent(val);
				hitPayload[i] = parts.join('=');
			}
			hitPayload = hitPayload.join('&');

			// finally, send the cleaned hit with the original task
			sendModel.set("hitPayload", hitPayload, true);
			originalSendHitTask(sendModel);
		});
	};
}
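
To see what the replacement does to a payload, here is a minimal standalone sketch of the same loop that runs outside GTM. The function name redactPayload and the sample hit are assumptions for illustration only, and the character class is written in a slightly tidier but equivalent form:

```javascript
// Standalone sketch of the payload cleaning step (hypothetical helper name).
var EMAIL_REGEX = /([A-Za-zÀ-ÖØ-öø-ÿ0-9_.+-]+)(@|%40)([A-Za-zÀ-ÖØ-öø-ÿ0-9-]*)\./gi;

function redactPayload(hitPayload) {
  return hitPayload.split('&').map(function (pair) {
    var idx = pair.indexOf('=');
    if (idx === -1) return pair; // no value to clean
    var key = pair.slice(0, idx);
    var raw = pair.slice(idx + 1);
    var val;
    // Try a double decode first, then a single decode, then the raw value.
    try {
      val = decodeURIComponent(decodeURIComponent(raw));
    } catch (e) {
      try { val = decodeURIComponent(raw); } catch (e2) { val = raw; }
    }
    return key + '=' + encodeURIComponent(val.replace(EMAIL_REGEX, '(email_removed)'));
  }).join('&');
}

// Made-up sample hit with an email in the document path (dp)
var sampleHit = 'v=1&t=pageview&dp=%2Fthanks%3Femail%3Djohn.doe%40example.com';
var cleanedHit = redactPayload(sampleHit);
// cleanedHit === 'v=1&t=pageview&dp=%2Fthanks%3Femail%3D(email_removed)com'
```

Note that the pattern stops at the first dot after the @ sign, so the top-level domain ("com" in this sample) stays in the payload. The identifying part of the address is gone, and the marker string is still easy to search for.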

2. URL leak

If you have a URL leak, you should turn off all tracking pixels on that page to prevent the email from being leaked to third-party providers. This can be done with an exception trigger in GTM.

This trigger rule uses the same RegEx as above and searches the Page URL.
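
If you prefer to do the check in code, for example in a Custom JavaScript variable that the exception trigger then evaluates, a minimal sketch could look like this. The function name urlLeaksEmail is hypothetical, and decoding the URL up to twice is my assumption to also catch double-encoded addresses:

```javascript
// Hypothetical check for an email address in a page URL.
function urlLeaksEmail(url) {
  var decoded = String(url);
  // Decode up to twice so a double-encoded "@" ("%2540") is caught as well.
  for (var i = 0; i < 2; i++) {
    try {
      decoded = decodeURIComponent(decoded);
    } catch (e) {
      break; // malformed escape sequence: keep what we have so far
    }
  }
  return /[A-Za-zÀ-ÖØ-öø-ÿ0-9_.+-]+(@|%40)[A-Za-zÀ-ÖØ-öø-ÿ0-9-]*\./i.test(decoded);
}

urlLeaksEmail('https://example.com/thanks?email=jane%40example.com');   // true
urlLeaksEmail('https://example.com/thanks?email=jane%2540example.com'); // true
urlLeaksEmail('https://example.com/pricing?plan=pro');                  // false
```

Inside GTM, the body would sit in an anonymous function that reads the built-in Page URL variable instead of taking an argument.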

Every tag must then be adjusted with an exception using the newly created trigger.

If you are using the Zones feature of GTM 360, you can even use RegEx to stop further Zones from executing. This can come in quite handy, depending on your actual setup.

Conclusion

Email address leaks are bad not only for your clients but also for you as a consultant or in-house employee. I recommend using Custom Tasks to monitor RegEx patterns that are associated with PII data.

The scripts above are for demonstration purposes only and might not be suited for production. Check carefully that they fit your setup before publishing to production.
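
A monitoring check along these lines could be sketched as follows. The pattern list, the pattern names and the helper findPiiPatterns are all assumptions for illustration; the card number pattern in particular is only a rough heuristic for 16 digits with optional spaces or dashes:

```javascript
// Hypothetical pattern list; names and patterns are illustrative assumptions.
var PII_PATTERNS = [
  { name: 'email', regex: /[A-Za-zÀ-ÖØ-öø-ÿ0-9_.+-]+(@|%40)[A-Za-zÀ-ÖØ-öø-ÿ0-9-]*\./i },
  // 16 digits, each optionally followed by a single space or dash
  { name: 'card_number', regex: /\b(?:\d[ -]?){15}\d\b/ }
];

// Returns the names of all patterns that match the given value.
function findPiiPatterns(value) {
  var text = String(value);
  var hits = [];
  for (var i = 0; i < PII_PATTERNS.length; i++) {
    if (PII_PATTERNS[i].regex.test(text)) {
      hits.push(PII_PATTERNS[i].name);
    }
  }
  return hits;
}

findPiiPatterns('/thanks?email=jane%40example.com'); // ['email']
findPiiPatterns('card 4111 1111 1111 1111');         // ['card_number']
findPiiPatterns('/pricing');                         // []
```

The returned names could be sent along as an indicator, similar to the “(email_removed)” marker, so that a report shows which kind of PII was caught and how often.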
