Before The Breach - PII


Do you know what personal data your company is actually collecting? If you can’t answer that question confidently, you’re not alone. However, in Singapore, that uncertainty is expensive. SingHealth and its data intermediary IHiS paid a $1 million fine in 2019 for losing dispensed medication records of 159,000 patients. Six years later, Marina Bay Sands paid $315,000 for exposing information on 665,496 loyalty programme members.

In this article we discuss what it means to collect Personally Identifiable Information (PII), what Personal Data Protection (PDP) involves, and how to understand and protect this data within an organization.

The Plan #

  • Understand your customer
  • Understand what information you collect from the customer
  • Build a list

First, understand who your customer is. If you’re a B2C business, your customers are individual users; if you’re a B2B business, you deal with other businesses. Perhaps you’re a bit of both. Start here and figure out who pays you for a product or service. For example, let’s say you’re a retail business that sells jewelry both online and in stores. Your customers are the people buying your jewelry.

Next, figure out what information you collect from your customers, either when they buy from you or otherwise. This could be their name, phone number, email address or physical address. Perhaps your marketing team collects their date of birth to target birthday promotions. Look through your sales process, online or in-store, and also your marketing efforts. Do you have a loyalty programme? What kind of campaigns do you run?

Then make a list of the data you collected so that we can move on to the next step. In the case of our example we could end up with something like this:

| Data Field | Collection Point | Business Purpose | Required? |
|---|---|---|---|
| Full Name | Checkout, Loyalty Signup | Order fulfillment, personalization | Yes |
| Email Address | Checkout, Newsletter | Order confirmations, marketing campaigns | Yes |
| Phone Number | Checkout | Delivery updates | Review |
| Physical Address | Checkout | Delivery fulfillment | Yes |
| Date of Birth (Full) | Loyalty Signup | Birthday campaigns | Review |
| Gender | Account Creation | Product segmentation | Review |
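
If you prefer to keep this list somewhere a script can read it, the same table can be expressed as plain records. The following is a minimal sketch, not a prescribed format: the `DataField` class, the `INVENTORY` list and the `fields_to_review` helper are hypothetical names introduced here for illustration, and the rows are copied from the example table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataField:
    """One row of the data collection list (hypothetical structure)."""
    name: str
    collection_points: Tuple[str, ...]
    purpose: str
    required: str  # "Yes" or "Review", as in the example table


# Rows copied from the jewelry store example.
INVENTORY = [
    DataField("Full Name", ("Checkout", "Loyalty Signup"),
              "Order fulfillment, personalization", "Yes"),
    DataField("Email Address", ("Checkout", "Newsletter"),
              "Order confirmations, marketing campaigns", "Yes"),
    DataField("Phone Number", ("Checkout",), "Delivery updates", "Review"),
    DataField("Physical Address", ("Checkout",), "Delivery fulfillment", "Yes"),
    DataField("Date of Birth (Full)", ("Loyalty Signup",),
              "Birthday campaigns", "Review"),
    DataField("Gender", ("Account Creation",), "Product segmentation", "Review"),
]


def fields_to_review(inventory: List[DataField]) -> List[str]:
    """Return names of fields that are not clearly required or lack a purpose."""
    return [
        f.name
        for f in inventory
        if f.required != "Yes" or not f.purpose.strip()
    ]


if __name__ == "__main__":
    print(fields_to_review(INVENTORY))
```

Anything this helper returns is a candidate for the questions in the next section.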

What Data Should I Collect? #

  • Analyze whether the data you collect is in line with the requirements
  • Identify “Bad Ideas”

From the list of data you have gathered earlier, sit down and critically think through the reasons why your business collects this information. Here are some examples:

Date Of Birth #

  • Why do we collect our customers’ date of birth?
  • Are we planning to run birthday campaigns?
    • Do we need their whole date of birth, or just the Day/Month, or even just the Month?
  • Are we selling a product that requires proof of age?
    • Is there another way we can find this information out?

Gender #

  • Why do we collect our customers’ gender?
  • Do we sell products for Men and Women?
    • Do we segment our marketing emails so that we target products accordingly?

Through this process, ask questions to uncover whether collecting each piece of data is a bad idea. If no one in the business can explain why specific data is being collected, or the answer is “We collect it just in case”, then that’s a bad idea. For example, do you need the birth year, or even the day? If you only run email campaigns, why collect their mobile phone number?

At this point, you will have a list of data that you need to collect as a business and a set of data that you probably shouldn’t collect. With the list of data that you don’t need, build out a separate plan to stop collecting and storing that data. Then move on.

Here’s an example of a completed data collection audit for our jewelry store:

[Image: Example of a completed data collection audit]

Here is an Excel sheet that you can download and use for this exercise.

Who Should I Speak To? #

  • Speak to CPO, CMO, COO and similar roles.
  • Make a list of the data they say that their teams collect.

The three key people to speak with are your Product person, Marketing person and Operations person. These could be CPO, CMO, COO or Head of Product, Head of Marketing or Head of Operations depending on how your business hands out titles. Some non-tech native businesses may have a Digital transformation role. Have them take you through the flows for:

  • A Sale
  • Customer Service interaction
  • Loyalty Signup
  • Marketing Signup
  • Marketing Campaign

At each stage, note down the information that the customer has to part with (during a sale or self service) and what information is used (for marketing).

How Do I Trust This Information? #

  • Verify the accuracy of the data collected by sampling Database servers
  • Consider promoting shadow satellite systems to fully supported ones in the org.

You are going to have to cross-reference what the stakeholders told you about the data with what is actually in the systems you run. This means looking at where the data is stored. Typically this will be a database like MySQL, Postgres, or MongoDB. The database will also usually be attached to a dashboard or a product of some sort (possibly a third-party one). Get someone on the engineering side to give you a list of all databases and the tables in each database server.

Here are some examples of how you can explore a database (note that these examples are MySQL specific). This first one lists all the tables in a specific database (in this case jewelry_store), roughly how many rows each has, and when it was created. Note that for InnoDB tables TABLE_ROWS is an estimate, not an exact count.

SELECT
    TABLE_NAME,
    TABLE_TYPE,
    ENGINE,
    TABLE_ROWS,
    CREATE_TIME
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'jewelry_store';

+------------+------------+--------+------------+---------------------+
| TABLE_NAME | TABLE_TYPE | ENGINE | TABLE_ROWS | CREATE_TIME         |
+------------+------------+--------+------------+---------------------+
| customers  | BASE TABLE | InnoDB |         10 | 2026-01-28 09:58:22 |
| orders     | BASE TABLE | InnoDB |         15 | 2026-01-28 09:58:22 |
+------------+------------+--------+------------+---------------------+

As you can see, this specific database has two tables named customers and orders.

Here is how you would list the fields in the customers table:

DESCRIBE customers;

+----------------+----------------------------------+------+-----+---------------------+----------------+
| Field          | Type                             | Null | Key | Default             | Extra          |
+----------------+----------------------------------+------+-----+---------------------+----------------+
| customer_id    | int(11)                          | NO   | PRI | NULL                | auto_increment |
| first_name     | varchar(50)                      | NO   |     | NULL                |                |
| last_name      | varchar(50)                      | NO   |     | NULL                |                |
| email          | varchar(100)                     | YES  |     | NULL                |                |
| phone          | varchar(20)                      | YES  |     | NULL                |                |
| address        | varchar(200)                     | YES  |     | NULL                |                |
| city           | varchar(50)                      | YES  |     | NULL                |                |
| state          | varchar(50)                      | YES  |     | NULL                |                |
| zip_code       | varchar(10)                      | YES  |     | NULL                |                |
| customer_type  | enum('online','in-store','both') | NO   |     | NULL                |                |
| loyalty_points | int(11)                          | YES  |     | 0                   |                |
| created_at     | timestamp                        | YES  |     | current_timestamp() |                |
+----------------+----------------------------------+------+-----+---------------------+----------------+

To take a look at the data in each table, you can query a small number of rows:

SELECT * FROM orders LIMIT 3;

+----------+-------------+---------------------+------------+------------------------+---------------+-----------------+----------+----------+------------+-------------+----------------+-----------+
| order_id | customer_id | order_date          | order_type | item_name              | item_category | metal_type      | gemstone | quantity | unit_price | total_price | payment_method | status    |
+----------+-------------+---------------------+------------+------------------------+---------------+-----------------+----------+----------+------------+-------------+----------------+-----------+
|        1 |           1 | 2024-12-15 10:30:00 | online     | Diamond Solitaire Ring | rings         | 14k White Gold  | Diamond  |        1 |    2499.99 |     2499.99 | credit_card    | delivered |
|        2 |           2 | 2024-12-18 14:45:00 | in-store   | Pearl Strand Necklace  | necklaces     | Sterling Silver | Pearl    |        1 |     899.00 |      899.00 | cash           | completed |
|        3 |           3 | 2024-12-20 09:15:00 | online     | Sapphire Stud Earrings | earrings      | 18k Yellow Gold | Sapphire |        1 |    1250.00 |     1250.00 | paypal         | shipped   |
+----------+-------------+---------------------+------------+------------------------+---------------+-----------------+----------+----------+------------+-------------+----------------+-----------+

SELECT * FROM customers LIMIT 3;

+-------------+------------+-----------+--------------------------+----------+----------------+-----------+-------+----------+---------------+----------------+---------------------+
| customer_id | first_name | last_name | email                    | phone    | address        | city      | state | zip_code | customer_type | loyalty_points | created_at          |
+-------------+------------+-----------+--------------------------+----------+----------------+-----------+-------+----------+---------------+----------------+---------------------+
|           1 | Sarah      | Mitchell  | sarah.mitchell@email.com | 555-0101 | 123 Oak Street | Boston    | MA    | 02101    | online        |            450 | 2026-01-28 09:58:43 |
|           2 | James      | Rodriguez | j.rodriguez@email.com    | 555-0102 | 456 Maple Ave  | Cambridge | MA    | 02139    | in-store      |           1200 | 2026-01-28 09:58:43 |
|           3 | Emily      | Chen      | emily.chen@email.com     | 555-0103 | 789 Pine Road  | Brookline | MA    | 02445    | both          |            875 | 2026-01-28 09:58:43 |
+-------------+------------+-----------+--------------------------+----------+----------------+-----------+-------+----------+---------------+----------------+---------------------+

Here’s a quick summary of the commands used and what they do:

| SQL Statement | Description |
|---|---|
| SELECT TABLE_NAME, TABLE_TYPE, ENGINE, TABLE_ROWS, CREATE_TIME FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'jewelry_store'; | Lists all tables in a database (MySQL) |
| DESCRIBE customers; | Shows the structure (fields, types, keys and defaults) of the customers table |
| SELECT * FROM orders LIMIT 3; | Retrieves a sample of 3 rows from a table to inspect actual data. NEVER run a SELECT * on a production database without an accompanying LIMIT x |
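
Once you have the column names, the cross-referencing itself can be as simple as comparing two sets. The sketch below is illustrative only: the `DECLARED_COLUMNS` mapping, the `date_of_birth` and `gender` column names, and the `NON_PERSONAL` classification are assumptions made for this example, while the `ACTUAL_COLUMNS` list is copied from the DESCRIBE customers output above.

```python
# Hypothetical mapping from audit-list entries to the columns expected to store them.
DECLARED_COLUMNS = {
    "Full Name": {"first_name", "last_name"},
    "Email Address": {"email"},
    "Phone Number": {"phone"},
    "Physical Address": {"address", "city", "state", "zip_code"},
    "Date of Birth (Full)": {"date_of_birth"},  # assumed column name
    "Gender": {"gender"},                       # assumed column name
}

# Column names copied from the DESCRIBE customers output.
ACTUAL_COLUMNS = [
    "customer_id", "first_name", "last_name", "email", "phone", "address",
    "city", "state", "zip_code", "customer_type", "loyalty_points", "created_at",
]

# Columns treated as operational rather than personal data (an assumption).
NON_PERSONAL = {"customer_id", "customer_type", "loyalty_points", "created_at"}


def cross_reference(declared, actual, non_personal):
    """Compare what stakeholders declared with what the table actually holds.

    Returns (undeclared, missing):
      undeclared: columns in the table that nobody mentioned
      missing: declared fields with no matching column in this table
    """
    declared_cols = set().union(*declared.values())
    actual_cols = set(actual)
    undeclared = sorted(actual_cols - declared_cols - non_personal)
    missing = sorted(
        field for field, cols in declared.items() if not (cols & actual_cols)
    )
    return undeclared, missing


if __name__ == "__main__":
    undeclared, missing = cross_reference(
        DECLARED_COLUMNS, ACTUAL_COLUMNS, NON_PERSONAL
    )
    print("Undeclared columns:", undeclared)
    print("Declared but not found here:", missing)
```

In this example the date of birth and gender fields are declared by stakeholders but do not appear in the customers table, which suggests they are stored somewhere else that you have not looked at yet.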

As you implement a robust PII monitoring system, you will also need to think about shadow systems and data. I have seen many cases where a team focused on one specific area of the product cobbles together little satellite systems to serve its needs. These shadow systems are not sanctioned and are usually built quickly, which means very little attention has been paid to security and reliability. They can become a risk to the business, so you have to decide how to handle them. In most cases, I’ve found that these systems provide a lot of utility.

One thing you can consider is to begin an audit process to formalize these systems rather than letting them live in the shadow realm. Once you’ve identified these systems, consider adding them to your risk-assessment backlog so that they get a proper audit and once they pass, there is no reason why they can’t be brought into the group of official organizational systems.

What Do I Do with the Bad Ideas? #

  • Build and evangelize a data collection policy org-wide.
  • Stop collecting data considered a “bad idea”
  • Phase out existing “bad idea” data slowly while watching for systemic failure

From a security standpoint, you will have to decide what data you do not collect. How you arrive at that decision matters less than drawing up a good organization-wide policy that explains why certain types of data are not collected. This policy will form the basis for your decisions to stop collecting data and to delete already collected data that you shouldn’t have. Proactively deleting data is not a topic you often read about, so if you arrive at this point, first off, congratulations on taking your users’ data privacy more seriously. Next, plan the process the way you would plan a feature deployment. The logical flow will look something like this:

  • Alter the product source code to stop writing to and reading from those specific fields in your database. Monitor this for some time.
  • Redact the data in your database. For example, if the field is a date and you need to remove the year and day but keep the month, overwrite the day and year with fixed values. If the date was 04/07/1988 (DD/MM/YYYY), you can set it to 01/07/1970, using the day and year of the Unix epoch (01/01/1970) as the replacement. If the field is a string, consider covering half of it with asterisks. If you collected an ID number, redact most of its characters: an ID number of S1844933F can be redacted as S18******. Again, monitor this for some time.
  • Drop the field entirely once you are sure that none of your systems will break when they can no longer read that data.
  • Remember that the data should be redacted at the database level. The masking should not be just a UI thing. Further, keep in mind that even redacted data can eventually be linked to a specific user through other pieces of data.
  • Lastly, you have to address the data stored in backups as well. While I don’t recommend going into your backups and redacting the data, add an extra step in your backup/restore plan to redact the data once restored.
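
The redaction rules described above can be sketched as small helper functions. This is a minimal illustration rather than a production implementation: the function names are hypothetical, and in practice you would apply the same logic to the rows in your database (for example through a migration script) rather than only in application code.

```python
from datetime import date


def redact_birth_date(d: date) -> date:
    """Keep only the month; replace day and year with the Unix epoch values."""
    return date(1970, d.month, 1)


def redact_id(id_number: str, visible: int = 3) -> str:
    """Keep the first `visible` characters and mask the rest with asterisks."""
    masked_length = max(len(id_number) - visible, 0)
    return id_number[:visible] + "*" * masked_length


def mask_half(text: str) -> str:
    """Cover the second half of a string with asterisks."""
    keep = len(text) // 2
    return text[:keep] + "*" * (len(text) - keep)


if __name__ == "__main__":
    print(redact_birth_date(date(1988, 7, 4)))  # 4 July 1988 keeps only July
    print(redact_id("S1844933F"))
    print(mask_half("Mitchell"))
```

Keep in mind the caveat from the list above: a birth month or a partially masked ID can still help identify someone when combined with other fields, so treat redacted values as personal data until the field is dropped.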

Saying no and walking away doesn’t solve anything. The goal is to work with teams to find solutions that meet business needs without unnecessary data collection. A good example is when a marketing team wants to collect the full date of birth. Your position should not be to say “No, you can’t collect that.” Instead, steer the conversation along these lines:

You: “Why do you want to collect the date of birth of the customer?”
Marketing: “We want to run a birthday campaign.”
You: “How does that work?”
Marketing: “We will send them an email with a discount code on their birthday month.”
You: “Ok, can you consider only collecting their birth month then? Since you won’t need day or year, I’d advise against collecting it.”

Keeping the List Current #

  • Build in triggers on which this entire exercise will repeat
  • If no triggers, consider a time-based review every 3 or 6 months.

Technology never stays the same. Business needs change, and so do products, tech and even people, so the process you build requires revisiting. You have to know which triggers kick off another exercise and how to make the process easily repeatable. Triggers can include launching a major new feature, starting work with a new vendor or third-party provider, or simply a set date every 3 or 6 months.
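
One way to make the triggers explicit is to write them down as a small check that anyone can run. The sketch below assumes a 90-day default interval (roughly 3 months) and two hypothetical event names, `major_feature_launch` and `new_vendor`; adjust both to match how your organization tracks changes.

```python
from datetime import date, timedelta

# Hypothetical event names that should kick off a fresh data collection review.
REVIEW_TRIGGERS = {"major_feature_launch", "new_vendor"}


def review_due(last_review: date, today: date, events=(),
               interval_days: int = 90) -> bool:
    """Return True if a trigger event occurred or the review interval has elapsed."""
    if any(event in REVIEW_TRIGGERS for event in events):
        return True
    return today - last_review >= timedelta(days=interval_days)


if __name__ == "__main__":
    last = date(2026, 1, 1)
    print(review_due(last, date(2026, 2, 1)))                 # only a month has passed
    print(review_due(last, date(2026, 2, 1), ["new_vendor"])) # trigger event
    print(review_due(last, date(2026, 4, 15)))                # interval elapsed
```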

Closing Thoughts #

This is a process that you build, run, and maintain. Your list of data will likely never be complete, but that’s what you signed up for. Simply having a process like this can go a long way towards reducing the fines you pay to bodies like the Singapore PDPC. There is no definitive guide on how fines are quantified, but one thing that the PDPC will recognize is proactiveness in protecting user data. The same holds for other regulatory bodies: demonstrating that you take your users’ data seriously by building an audit practice like this can go a long way. You’re not infallible and you will make mistakes, but minimizing the blast radius is what you should aim for.