Pseudonymization and anonymization: safeguarding personal data

The European Union's General Data Protection Regulation requires that organizations safeguard personal data during collection, storage, and processing.

One way organizations can work toward their GDPR obligations is to convert personal data into formats that outsiders cannot read or link to individuals. Pseudonymization and anonymization are the two most common ways to achieve this. This article defines both methods and explains how they relate to GDPR compliance.

Key takeaways

  • Pseudonymization replaces original data with pseudonyms. This allows reversible de-identification and contributes to data protection.
  • Pseudonymization reduces the risk of data exposure and makes it easier and safer to process and analyze personal data.
  • Pseudonymized data qualifies as personal data under the GDPR. This is because anyone holding the additional information can reconnect it to a data subject.
  • Pseudonymization is well-suited to business operations. Anonymization is suitable for non-production environments, including many research studies.
  • Automating pseudonymization processes and choosing the right techniques are critical parts of effective data management.

What is pseudonymization?

Pseudonymization replaces identifiers, making it harder to identify individuals. Artists often use pseudonyms to conceal their origins, while law enforcement agencies regularly use them to refer to suspects. However, the practice also applies when processing personal data.

Under GDPR, organizations use pseudonymization to safeguard personally identifiable information. Companies routinely convert email addresses, names, phone numbers, and even financial data into alternative formats.

Pseudonymization is a recommended security control under EU regulations. It minimizes the risk of harm during data breaches, making it very difficult for attackers to decode the identities of data subjects.

However, pseudonymization is not the same as anonymization. Pseudonymization can be reversed by providing extra information, while anonymization cannot. This means that pseudonymized data counts as personal data under GDPR, while fully anonymized data does not.

Pseudonymization definition

According to Article 4(5) of GDPR, pseudonymization is the:

"Processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person."

If we unpack this definition, pseudonymization has three core components:

  1. Pseudonymization involves replacing personally identifiable information about a data subject with alternative values or words.
  2. Data handlers can only restore personally identifiable information by using additional information. This could include encryption keys or token mapping tables.
  3. Controllers must store additional information to restore personal data securely. This information must also be separated from pseudonymized data. For example, token libraries should be encrypted and stored on a separate server.
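
As a rough illustration of these three components, the Python sketch below derives pseudonyms with a keyed hash (HMAC-SHA256) and records the link back to the original value in a separate lookup table. The names `make_pseudonym` and `LookupTable` are hypothetical, and the in-memory dictionary only stands in for a separately secured store. Treat it as a sketch under those assumptions, not a compliant implementation.

```python
import hashlib
import hmac
import secrets


def make_pseudonym(value: str, secret_key: bytes) -> str:
    """Component 1: replace an identifier with a stable alternative value."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "p_" + digest[:16]


class LookupTable:
    """Component 2: the 'additional information' needed to restore identities.

    Hypothetical stand-in: a real deployment would keep this in a separate,
    access-controlled and encrypted store (component 3), not in memory.
    """

    def __init__(self):
        self._pseudonym_to_value = {}

    def remember(self, pseudonym: str, value: str) -> None:
        self._pseudonym_to_value[pseudonym] = value

    def reidentify(self, pseudonym: str) -> str:
        return self._pseudonym_to_value[pseudonym]


# The key is also additional information and must be stored apart from the data.
secret_key = secrets.token_bytes(32)
lookup = LookupTable()

records = [{"name": "Alice Example", "diagnosis": "asthma"}]
pseudonymized = []
for record in records:
    pseudonym = make_pseudonym(record["name"], secret_key)
    lookup.remember(pseudonym, record["name"])
    pseudonymized.append({"subject": pseudonym, "diagnosis": record["diagnosis"]})
```

The `pseudonymized` list can be handed to analysts on its own. Restoring a name requires `lookup`, which is why that table (and the key) would live on separate infrastructure.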

How is pseudonymization used in data protection?

Pseudonymization secures personally identifiable information (PII) and protects data against malicious actors. It also makes personal data usable for controllers and processors. Companies can convert personal data into a form that complies with data protection laws and is relevant to their business needs.

For example, a healthcare company may scramble patient names when processing personal data, creating a database of pseudonyms. The rearranged names mean nothing to outsiders, but with the right decryption tools, users can convert them back into names and identify patients.

Legal companies may use pseudonymization to hide clients' identities when managing projects or emailing colleagues. This limits the risk of disclosing sensitive business data to competitors or leaking personal data to external attackers.

Understanding the scope of pseudonymization

Pseudonymization does not just apply to obvious personal identifiers such as names, passport numbers, or images. It extends to a wide range of indirect identifiers. If malicious actors can use identifiers in combination to identify an individual, controllers must conceal them from outsiders.

For example, companies should protect IP addresses, postal addresses, home and workplace telephone numbers, and even the identity of an individual's physician or partner.

Benefits of using pseudonymization

1. Reducing data exposure

Protecting data is the most significant benefit of pseudonymization. Pseudonymized data is not understandable to outsiders. It should be very hard to connect data to data subjects. When employees disclose customer files or leave databases unprotected, unauthorized readers should struggle to exploit stolen data for criminal purposes.

2. Managing regulatory risk

Under GDPR, organizations must take appropriate technical and organizational measures to protect user privacy and safeguard personal data. Leaving names or contact numbers in an unprocessed format is extremely risky. Data thieves can immediately use this data to cause harm to data subjects. Organizations that suffer a breach without safeguards such as pseudonymization in place face a higher risk of GDPR fines.

3. Enabling compliant data processing

Pseudonymizing personal data strikes a balance between data protection and efficiency. Organizations can create data sets for classification and analysis. This is critically important in a world where Big Data is supreme.

Companies or universities can carry out cutting-edge research. The identities of data subjects remain concealed, enabling organizations to meet their GDPR obligations.

4. Allowing users to exercise their rights

European data protection law requires that organizations facilitate data portability and subject access requests. Pseudonymization allows companies to move data securely and efficiently when meeting these requests.

5. Flexible data protection

Pseudonymization tools offer varying degrees of flexibility. Users decide which data points to scramble and tailor data protection strategies to different risk assessments. Organizations can apply weaker settings for low-priority data sets. They can also add robust controls for the most sensitive data.

6. Cutting the notification burden

GDPR requires that organizations notify data protection authorities when a personal data breach occurs. However, notification requirements only apply when there is a risk of harm to data subjects. Pseudonymizing data can cut the risk of harm. This may make notification unnecessary, although a thorough risk assessment is always vital.

Implementing pseudonymization: best practices and techniques

Data controllers must implement pseudonymization with care. There is no point in masking personal data if outsiders can combine pseudonymized data with other information to identify the data subject.

Organizations need to assess which forms of data require protection. They also need to determine whether it remains possible to discover a data subject's identity despite the use of masking techniques. Here are some best practices to help you achieve these aims.

Pseudonymization best practices

1. Understand the different forms of pseudonymization

There are many different ways to pseudonymize raw data, and organizations must choose the right method for their situation. Popular techniques include:

  • Data masking. Data masking substitutes alternative words or symbols for personal data. This is the simplest form of pseudonymization. Replacement terms are usually randomly assigned by automated masking tools. Terms within a single data set tend to have a standardized format. This enables smooth data handling.
  • Encryption. Encryption scrambles personal data by applying a cryptographic algorithm and a key. Unlike hashing, which is one-way, encryption is reversible: users can decode replacement terms, but only with the correct decryption key. Companies must secure and rotate encryption keys to prevent unauthorized access. Provided the keys stay protected, encrypted data should remain secure.
  • Tokenization. Tokenization replaces personal data items with tokens. Tokens are unique identifiers generated by algorithms. Users can only read data in its original form by presenting the relevant token. Without the token, data remains inaccessible and unavailable to unauthorized actors. As with encryption, companies must manage tokens to secure their usage. Best practices recommend using dynamic tokens that change frequently, minimizing the risk of exposure.

2. Pick the right pseudonymization method for your data set

Each of the above techniques has its strengths and weaknesses. And every data-handling operation is different. Choosing the right style is essential when balancing privacy and data availability.

Data masking is the easiest and most efficient way to create pseudonymized personal data. Organizations can retain the structure of original data sets. This makes it easier to carry out searches and analyze data.

However, masking data is inherently less secure than encryption or tokenization. There is a higher risk of re-identification. If attackers gain access to the mapping settings used to create pseudonyms, they can easily decode entire databases.

Use case: A retailer shares employee data with a third-party training partner. This information includes data about educational qualifications, internal appraisals, and pay. But masking conceals names, addresses, and other identifiable data.
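
A minimal Python sketch of this use case might look as follows. The function name `mask_rows`, the field names, and the sample records are all invented for illustration. Sequential placeholders are used to keep the example short, although randomly assigned ones would leak less about record order.

```python
def mask_rows(rows, fields_to_mask):
    """Replace identifying fields with standardized placeholders.

    Returns the masked rows plus the mapping (original -> placeholder).
    Anyone who obtains that mapping can undo the masking, so it must not
    travel with the shared data.
    """
    placeholder_for = {}
    masked_rows = []
    for row in rows:
        masked = dict(row)  # copy so the source data is left untouched
        for field in fields_to_mask:
            key = (field, row[field])
            if key not in placeholder_for:
                number = len(placeholder_for) + 1
                placeholder_for[key] = f"{field.upper()}-{number:04d}"
            masked[field] = placeholder_for[key]
        masked_rows.append(masked)
    return masked_rows, placeholder_for


employees = [
    {"name": "Alice Example", "address": "1 Main St", "qualification": "BSc", "pay": 42000},
    {"name": "Bob Example", "address": "2 Side Rd", "qualification": "MSc", "pay": 48000},
]

masked, mapping = mask_rows(employees, fields_to_mask=("name", "address"))
# masked[0] is now:
# {"name": "NAME-0001", "address": "ADDRESS-0002", "qualification": "BSc", "pay": 42000}
```

Only `masked` would go to the training partner. The structure of each row is unchanged, which is what keeps masked data easy to search and analyze.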

Tokenization and encryption are more secure, and both methods offer scope to analyze data. However, using tokens or encryption keys is complex. Processing large quantities of data can be challenging while maintaining data security.

Securing token maps and encryption keys poses challenges for many organizations. Generating tokens can impose additional resource overheads, increasing compliance costs. So masking data may be preferable when balancing costs and security.

Use case: A data controller processes customer payments but needs to remain GDPR compliant. Finance teams use tokenization to secure all credit card details. Sensitive data resides in a secure database. The original numbers are only accessible with a unique token.
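
The tokenization flow in this use case can be sketched in Python as below. `TokenVault` is a hypothetical class, the dictionaries stand in for the secure database, and the card number is a widely used test value rather than real data.

```python
import secrets


class TokenVault:
    """Hypothetical vault: maps random tokens to the card numbers they replace."""

    def __init__(self):
        self._token_to_card = {}
        self._card_to_token = {}

    def tokenize(self, card_number: str) -> str:
        # Reuse the existing token so the same card always maps to one token.
        if card_number in self._card_to_token:
            return self._card_to_token[card_number]
        # The token is random, so it reveals nothing about the card number.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_card[token] = card_number
        self._card_to_token[card_number] = token
        return token

    def detokenize(self, token: str) -> str:
        # In practice this call would sit behind strict access controls.
        return self._token_to_card[token]


vault = TokenVault()
token = vault.tokenize("4111111111111111")
order = {"order_id": 1001, "card": token}  # systems outside the vault see only the token
```

Because the token is generated randomly rather than derived from the card number, stealing the `order` record alone does not help an attacker. The vault is the asset that needs the strongest protection.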

3. Use automation to remove human error and streamline data protection

As a best practice, organizations should use automation wherever possible to put in place security measures. Automated masking or tokenization reduces the risk of human error. And it limits the exposure of human employees to pseudonymization methods.

Automation tools detect the existence of sensitive information in corporate documents and databases. They automatically apply algorithms to hide personal data without manual input from data protection officers. Automation ensures that all relevant data passes through security filters. Automation also operates consistently, eliminating data entry or formatting errors.
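
To make the idea concrete, here is a small Python sketch that scans free text for email addresses and replaces each one with a consistent placeholder. It is an assumption-laden illustration: the regular expression is deliberately simple and will miss unusual address formats, and `pseudonymize_emails` is a hypothetical helper, not part of any specific product.

```python
import hashlib
import hmac
import re

# Simplified pattern for illustration; real detection tools use broader rules.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_emails(text: str, secret_key: bytes) -> str:
    """Replace every detected email address with a keyed placeholder."""

    def replace(match):
        # Lower-case first so the same mailbox always gets the same placeholder.
        address = match.group(0).lower().encode("utf-8")
        digest = hmac.new(secret_key, address, hashlib.sha256).hexdigest()
        return f"<email:{digest[:10]}>"

    return EMAIL_PATTERN.sub(replace, text)


demo_key = b"demo-key-not-for-production"
note = "Contact alice@example.com or ALICE@example.com, not bob@example.org."
cleaned = pseudonymize_emails(note, demo_key)
```

In `cleaned`, both spellings of the first address receive the same placeholder, so analysts can still count distinct contacts without seeing the addresses themselves.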

4. Take a systematic approach to pseudonymization

Organizations must approach pseudonymization carefully to secure personal data and avoid compliance penalties. There are four broad areas to consider in your data processing plans:

  • Goals. What do you intend to achieve by pseudonymizing data?
  • Risk analysis. Does pseudonymizing data create new compliance risks? Are your encryption keys or tokens properly secured against external attacks?
  • Responsibility. Who decides what data is pseudonymized and what techniques to use? Do controllers and data processors need to agree on shared security measures to protect data?
  • Documentation. Have you recorded the decisions taken and the outcomes of risk assessments? Can you provide regulators with evidence that you have secured all relevant personally identifiable data?

5. Guard against re-identification attacks

All organizations that use pseudonymization need a plan to mitigate re-identification attacks. These attacks target pseudonymized information and seek to restore its original form. There are two main types: linkage and inference attacks.

  • Linkage attacks leverage access to large amounts of information about internet users. Sources could be publicly accessible data or illegally sold stolen data. Attackers can use this information to guess the identity of pseudonymized subjects.
  • Inference attacks use AI and Machine Learning to guess the contents of pseudonymized databases.
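
The following Python sketch shows, with invented data, how a linkage attack can work in principle. The attacker never touches the pseudonymization secret. They simply match indirect identifiers (here a postcode and birth year, both chosen for illustration) against an outside data source.

```python
# Pseudonymized data set: names removed, indirect identifiers left in place.
pseudonymized_records = [
    {"subject": "p_01", "postcode": "1000", "birth_year": 1980, "diagnosis": "asthma"},
    {"subject": "p_02", "postcode": "1000", "birth_year": 1991, "diagnosis": "diabetes"},
    {"subject": "p_03", "postcode": "2000", "birth_year": 1980, "diagnosis": "migraine"},
]

# Outside data the attacker holds, for example a public or stolen register.
outside_records = [
    {"name": "Alice Example", "postcode": "1000", "birth_year": 1980},
    {"name": "Bob Example", "postcode": "2000", "birth_year": 1980},
    {"name": "Carol Example", "postcode": "2000", "birth_year": 1980},
]


def link(pseudonymized, outside, quasi_identifiers):
    """Return {subject: name} where the indirect identifiers match one person only."""
    matches = {}
    for record in pseudonymized:
        key = tuple(record[q] for q in quasi_identifiers)
        candidates = [
            person["name"]
            for person in outside
            if tuple(person[q] for q in quasi_identifiers) == key
        ]
        if len(candidates) == 1:  # a unique combination gives the identity away
            matches[record["subject"]] = candidates[0]
    return matches


reidentified = link(pseudonymized_records, outside_records, ("postcode", "birth_year"))
# reidentified == {"p_01": "Alice Example"}
```

Subject `p_03` escapes re-identification here only because two people share the same postcode and birth year. Removing or coarsening indirect identifiers makes such shared combinations more common.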

Mitigation actions include limiting the amount of personal data companies process and applying strong cybersecurity controls. Using dynamic tokens or robust encryption is also advisable.

Pseudonymization can never be entirely secure. Compliant organizations must show that they are aware of re-identification risks and have taken action to deal with them.

Pseudonymization vs. anonymization

Understanding the difference between anonymization and pseudonymization is a critical part of GDPR compliance. Companies that apply the wrong data protection techniques can suffer severe fines and reputational damage.

The most important thing to remember is that anonymization refers to data that is not identifiable. If someone gains access to anonymized data, there should be no way to recover the original identifier.

Anonymization is inflexible. Once personal data is anonymized, restoring the identity of data subjects is impossible. This makes anonymization essential when processing the most sensitive data. For example, any studies involving genetic information should anonymize their records.

By contrast, pseudonymized data is reversible, making it possible to recover the identifier. A form of intermediary stands between personal data and its replacement. This intermediary could be an encryption key or a data masking map.
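
The contrast can be sketched in Python with invented patient data. In this hypothetical example, `pseudonymize` keeps a mapping that allows recovery of names, while `aggregate` discards identifiers and keeps only group counts. Note that aggregation is used here purely as an illustration: whether a given output is truly anonymous depends on a proper assessment, since very small groups can still point to individuals.

```python
from collections import Counter

patients = [
    {"name": "Alice Example", "age": 34, "diagnosis": "asthma"},
    {"name": "Bob Example", "age": 37, "diagnosis": "asthma"},
    {"name": "Carol Example", "age": 52, "diagnosis": "diabetes"},
]


def pseudonymize(rows):
    """Reversible: the returned mapping is the intermediary that restores names."""
    mapping = {}
    output = []
    for index, row in enumerate(rows, start=1):
        pseudonym = f"subject-{index}"
        mapping[pseudonym] = row["name"]
        output.append({"subject": pseudonym, "age": row["age"], "diagnosis": row["diagnosis"]})
    return output, mapping


def aggregate(rows):
    """Not reversible: no per-person rows and no mapping are kept."""
    counts = Counter()
    for row in rows:
        band_start = row["age"] // 10 * 10  # generalize exact age to a ten-year band
        counts[(f"{band_start}-{band_start + 9}", row["diagnosis"])] += 1
    return dict(counts)


pseudonymized, mapping = pseudonymize(patients)
summary = aggregate(patients)
# summary == {("30-39", "asthma"): 2, ("50-59", "diabetes"): 1}
```

With `mapping`, `subject-1` can be turned back into a name. Nothing comparable exists for `summary`, although the group of size one shows why aggregated output still needs a re-identification check.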

Pseudonymizing data provides scope for flexibility. It is a more sophisticated data protection tool than anonymization, which makes it less secure when improperly applied. However, it also enables researchers or analysts to use personal data without breaching GDPR.

Pseudonymization in practice

Organizations use complex solutions when handling personal data, making it difficult to ensure total coverage. So, understanding the practical applications of pseudonymization is critically important.

Here are some ways that GDPR-compliant organizations apply pseudonymization in the real world:

Temporary data storage. Organizations often store personal data in pseudonymized form before proceeding to full anonymization. For example, a company may store customer information while it has a legal justification to hold it. But when that justification expires, it must delete the data or move from pseudonymized to fully anonymized data.

Medical research. Medical researchers rely on large data sets of confidential healthcare information. Healthcare providers can make this information available to researchers in pseudonymized form. This makes it possible to assemble data sets and create data visualizations while shielding the identity of the original data subjects.

eCommerce customer analysis. An eCommerce company might want to enlist a third-party partner to analyze the buying patterns of its customers. In this case, the data controller would need to shield personal data from the third party, while enabling access to relevant customer data. This could involve creating tokens for relevant data points while keeping confidential data off limits.

ENISA recommendations and guidelines for pseudonymization

The European Union Agency for Cybersecurity (ENISA) publishes guidance for data controllers that use pseudonymization. The organization's guidelines are an invaluable resource when applying data protection by design and securing personal data.

"Pseudonymisation techniques and best practices" is the best starting point. This 2019 report includes a series of recommendations and best practices that every DPO should build into their GDPR strategy. Examples include:

  • The most robust options are encryption or keyed techniques such as Message Authentication Codes (MACs). Alternative techniques are vulnerable to brute-force or dictionary search attacks.
  • Organizations must balance utility against security. Using more than one pseudonymization approach may be appropriate to allow data processing while securing confidential data.
  • Storing mapping tables is essential to enable notification following data breaches. However, restoration systems require robust security to prevent unauthorized access.
  • Robust access controls are a must. Only authorized employees should have access to the pseudonymization secret.
  • Encrypting pseudonymization secrets is recommended. Secure key management is also essential.
  • Companies should adopt a risk-based approach to pseudonymization. They should never rely on pseudonymization alone as a security measure.

The report features an in-depth assessment of different scenarios. This includes masking of personal data, as well as sections on IP address and email pseudonymization. It relates these methods to GDPR requirements, allowing readers to make a balanced analysis of their data protection needs.
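
As one hedged illustration of IP address pseudonymization, the Python sketch below maps each address to a pseudonym that still looks like an IP address, using a keyed hash. This is an assumption-based example, not code from the ENISA report, and `pseudonymize_ip` is a hypothetical helper. The secret key matters because the IPv4 address space is small enough to search exhaustively if an unkeyed hash were used.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_ip(address: str, secret_key: bytes) -> str:
    """Map an IP address to a stable, IP-shaped pseudonym with HMAC-SHA256."""
    ip = ipaddress.ip_address(address)  # raises ValueError on malformed input
    digest = hmac.new(secret_key, ip.packed, hashlib.sha256).digest()
    if ip.version == 4:
        # Truncating to 32 bits keeps the IPv4 format; rare collisions are possible.
        return str(ipaddress.IPv4Address(digest[:4]))
    return str(ipaddress.IPv6Address(digest[:16]))


demo_key = b"demo-key-not-for-production"
log_line = {"ip": "192.0.2.10", "path": "/checkout"}
safe_line = {"ip": pseudonymize_ip(log_line["ip"], demo_key), "path": log_line["path"]}
```

Keeping the output in IP format lets existing log tooling keep working, which reflects the trade-off between utility and security. The pseudonym may coincide with a real address somewhere, so it should never be treated as routable.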

Use ENISA guidance to build compliant pseudonymization systems that balance the need to secure personal data with efficiency.

Remember that GDPR seeks to protect privacy while enabling compliant data processing. By following data protection best practices, you can benefit from data collection and avoid compliance penalties.

FAQs

Is pseudonymized data still personal data according to the GDPR?

Under GDPR, personal data is information that can identify a data subject, either on its own or in combination with other information. Pseudonymized data does not directly identify the data subject. But it still qualifies as personal data.

The reason is that pseudonymization can be reversed. If malicious actors gain possession of the correct token or encryption key, they could easily convert pseudonymized data into its original form.

European regulators therefore treat pseudonymized data as personal data, just like data in its original form. Controllers and processors must implement security controls and prevent unauthorized access, whether or not the personal data has been pseudonymized.

Is anonymized data still considered personal data?

Anonymized data is not generally considered personal data under GDPR. This is because anonymization removes the link between data and the original data subject. If data is fully anonymized, there is no way to establish who it belongs to. And if there is no data subject, GDPR does not apply.

Disclaimer: This article is for informational purposes only and not legal advice. Use it at your own risk and consider consulting a licensed professional for legal matters. Content may not be up-to-date or applicable to your jurisdiction and is subject to change without notice.