
My Number and Privacy: Identifiers and Excess Risk

Last time, in "My Number and Privacy: Requirements for Identifiers," I wrote the following:

Next, as a hypothetical example, let's consider a common number for taxes and social security (including pensions). This would need to meet the requirements of both the basic pension number and the taxpayer number. In other words, it would combine the "long term" required by pension management with the "wide range" required by taxes, so an identifier that is "stable over a wide range for a long period of time" would be required. This does not meet requirement 3, so it is a bad combination.

This time, I would like to use this as a basis to think about the characteristics of the risks posed by identifiers.

Identifier types

Now, before we get into the main topic, let's review the definition of an identifier and look at its types.

Identifier: A combination of attributes that can uniquely distinguish (identify) an individual (or thing) from others in a group.

In other words, identifiers include not only what we normally think of as an "individual number," but also combinations such as "name, gender, date of birth, and address" (the four basic pieces of information).

In classifying these, we will use three categories: "scope of use," "period of use," and "reusability." "Scope of use" refers to how many people, companies, organizations, etc. use the identifier, "period of use" refers to how long the identifier will be used, and "reusability" refers to whether the identifier can be reused for other people.

Classification by scope of use
  • Omnidirectional identifier: An identifier that is used regardless of the other party.
  • Unidirectional identifier (directional identifier, sectoral identifier): An identifier used only within the relationship with a particular party.
Classification by period of use
  • Persistent identifier: An identifier that remains the same for a long period of time, usually as long as the entity exists.
  • Ephemeral (short-term) identifier: An identifier that changes after a short period of time.
Classification by reusability
  • Reassignable (reusable) identifier: An identifier that can be reused for other entities.
  • Non-reassignable (non-reusable) identifier: An identifier that cannot be reused.
Every identifier can be classified along each of these three axes. For example, a name can be used anywhere, so it is an "omnidirectional identifier" in terms of scope of use; it is used for a long period of time, so it is a "persistent identifier"; and several people can have the same name, so it is a "reassignable identifier." In other words, a name is an "omnidirectional, persistent, reassignable identifier."

In this article, we will focus on the first two axes and, out of the many possible risks, consider linkage risk (nayose: the matching of records that belong to the same person). There are also risks that arise from reassignable identifiers, but we will leave those for another time.
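The three axes above can be written down as a small data structure. This is only an illustrative sketch: the class and field names (`Scope`, `Period`, `Reuse`, `IdentifierType`) are hypothetical and not part of any My Number specification. The one classified example is the article's own, a personal name.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    OMNIDIRECTIONAL = "omnidirectional"  # used regardless of the other party
    UNIDIRECTIONAL = "unidirectional"    # used only within one relationship or sector


class Period(Enum):
    PERSISTENT = "persistent"  # stays the same for a long time
    EPHEMERAL = "ephemeral"    # changes after a short time


class Reuse(Enum):
    REASSIGNABLE = "reassignable"          # may later denote another entity
    NON_REASSIGNABLE = "non-reassignable"  # never reused for another entity


@dataclass(frozen=True)
class IdentifierType:
    """One identifier, classified along the three axes (hypothetical type)."""
    name: str
    scope: Scope
    period: Period
    reuse: Reuse

    def describe(self) -> str:
        return (f"{self.name}: {self.scope.value}, "
                f"{self.period.value}, {self.reuse.value} identifier")


# The article's example: a name is usable anywhere, long-lived, and shared by many people.
personal_name = IdentifierType(
    "name", Scope.OMNIDIRECTIONAL, Period.PERSISTENT, Reuse.REASSIGNABLE
)
print(personal_name.describe())
```

Keeping the axes as separate enums makes it explicit that they are independent of each other: any scope can be combined with any period and any reusability.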

Identifier linkage risk captured by period × range

The risk of privacy violation can arise from a variety of causes. The "information leaks" that are often reported in newspapers are a typical example. Here, however, we will focus mainly on linkage risk (nayose: the matching of records that belong to the same person). This is because, in reality, it is not the "information leak" itself that causes actual privacy damage. The damage occurs when the leaked information is "linked" with other information (for example, prior knowledge held by the victim's acquaintances), so that an "image of the person" is formed that the person does not want. Even if information is leaked, there is almost no risk if it sinks into the ocean without being seen by anyone.

One aspect of the linkage risk posed by identifiers can be captured as period × range, as shown in the following figure.

Figure 1 Risks of identifier distribution

The horizontal axis is the period during which the identifier is in circulation, and the vertical axis is the range over which it circulates. The period refers to how long the identifier will be used, and the range refers to how far its use will spread. For example, web cookies are only used on the site that set them, so their range of circulation is extremely narrow. So-called sectoral identifiers are also narrow, as they are only used within their sector. On the other hand, identifiers such as the US SSN are used everywhere, so their range of circulation is wide.

Figure 1 shows an identifier that has a long circulation period but a narrow circulation range; pension numbers are an example. Figure 2 shows the opposite: an identifier that has a short circulation period but a wide circulation range, such as a tax number that changes every year.

Figure 2 Identifier risk – short term

The risk of using these identifiers together is the sum of the risks of each.

Figure 3: Risks of combining identifiers: In the case of federation

On the other hand, if the identifiers are integrated, as shown in the following figure, the risks will increase compared to when they are used in combination without integration. If the purpose of use is limited to the respective uses that they had when they were separate, the benefits will not change, so this increase can be said to be an increase in risk without benefit. We will call this excess risk.

Figure 4: Risks of combining identifiers: In the case of integration

 

The reason we said that "a common number for taxes and pensions is a bad idea" is precisely this excess risk.
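One way to make the comparison concrete is to treat the risk of each identifier as the area of its rectangle (period × range). The sketch below rests on assumptions of mine rather than anything the figures state: risk is proportional to that area, an integrated identifier must cover the longer period and the wider range, and the numeric values are purely hypothetical units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circulation:
    period: float  # how long the identifier circulates (arbitrary units)
    scope: float   # how widely it circulates (arbitrary units)

    def risk(self) -> float:
        # Assumption: linkage risk is proportional to the rectangle's area.
        return self.period * self.scope


def federated_risk(a: Circulation, b: Circulation) -> float:
    # Separate identifiers used side by side: the risks simply add up.
    return a.risk() + b.risk()


def integrated_risk(a: Circulation, b: Circulation) -> float:
    # Assumption: one shared identifier must live as long as the longer-lived
    # use and circulate as widely as the wider use.
    return max(a.period, b.period) * max(a.scope, b.scope)


def excess_risk(a: Circulation, b: Circulation) -> float:
    # Risk added by integration without any added benefit.
    return integrated_risk(a, b) - federated_risk(a, b)


pension = Circulation(period=40.0, scope=1.0)  # long-lived, narrow (hypothetical)
tax = Circulation(period=1.0, scope=10.0)      # short-lived, wide (hypothetical)

print(federated_risk(pension, tax))   # 50.0
print(integrated_risk(pension, tax))  # 400.0
print(excess_risk(pension, tax))      # 350.0
```

Under these assumptions, the more the two identifiers differ in shape (one long and narrow, the other short and wide), the larger the excess becomes. Two identifiers with the same period and range would not produce this kind of excess at all.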

Cost of linkage

So why do people promote a system that generates such excess risk? I think it is probably because "linkage costs money." There are two types of costs: operational costs and system costs.

First, regarding operational costs: will these increase if we link separate identifiers?

Operations were originally carried out using the individual numbers/identifiers. If so, I don't think that linking them would increase costs. All you need to do is set the "scope" appropriately. (This is why it is said that the definition of the sector is important in a sectoral system.)

Next, let's look at system costs. Certainly, once the identifiers are integrated, there is no need to convert between different identifiers, so the cost will be lower. But how much lower? Let's calculate the conversion cost.

First, let's assume that the individual and institution identifiers are both 4-byte integers. They are unsigned ints, ranging from 0 to 4,294,967,295. There are only about 130 million Japanese people, but this gives about 4.3 billion values, so it should be fine for the time being. We concatenate the two identifiers into 8 bytes. We could encrypt this directly with a key, but that seems somewhat vulnerable to plaintext attacks, so we introduce 8 bytes of random numbers, take the XOR, concatenate the result with the random number to make 16 bytes, and encrypt that with AES. In reality, there will be costs for random number generation and packing, but these are small amounts of computation and are ignored here.[1] For the test, I ran $ openssl speed aes on my two-year-old Core i3 3.06GHz iMac (OS X 10.8.1), which cost about 110,000 yen. The result was:
Doing aes-128 cbc for 3s on 16 size blocks: 21715758 aes-128 cbc's in 3.00s
So, it looks like we can achieve about 7 million conversions/second/CPU. This is with a Core i3 3.06GHz, so with a Core i7 3.90GHz we could probably achieve about 10 million conversions/second/CPU. With two 4-CPU machines, we get 80 million conversions/second. There probably aren't many cases where we need to process all 130 million people at once, but even if there were, it would take less than 2 seconds.
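The arithmetic can be checked in a few lines. The block count and duration come from the openssl output above. The other figures are assumptions based on my reading of the article's estimates: about 10 million conversions per second per faster CPU, eight CPUs in total, and a population of about 130 million.

```python
# Figures taken from the openssl output quoted above.
blocks_encrypted = 21_715_758  # 16-byte AES-128-CBC blocks
elapsed_seconds = 3.00

per_cpu_measured = blocks_encrypted / elapsed_seconds
print(f"measured: {per_cpu_measured / 1e6:.2f} million blocks/s/CPU")

# Assumed figures (my reading of the article's estimates).
per_cpu_faster = 10_000_000   # estimate for a faster CPU
cpu_count = 2 * 4             # two machines with four CPUs each
population = 130_000_000      # approximate population of Japan

total_rate = per_cpu_faster * cpu_count
seconds_for_everyone = population / total_rate
print(f"total: {total_rate / 1e6:.0f} million conversions/s")
print(f"whole population: {seconds_for_everyone:.3f} s")
```

Even if the real per-CPU rate were several times lower than assumed here, converting identifiers for the whole population would still take seconds, not hours.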
In short, it's not a big cost. It's a small price to pay if it can eliminate excess risk.[2] There's no reason not to do it.
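The packing step of the conversion described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, and the sizes (8 random bytes, a 16-byte result matching one AES block) are my reading of the description. The AES step itself is left out because the Python standard library has no AES; in practice the 16-byte block would be passed to an AES implementation from a cryptographic library.

```python
import secrets
import struct
from typing import Tuple


def build_block(person_id: int, institution_id: int) -> bytes:
    """Build the 16-byte plaintext block that would then be encrypted with AES."""
    # Two unsigned 4-byte integers, concatenated into 8 bytes (big-endian).
    # struct.pack raises struct.error if a value is outside 0..4,294,967,295.
    ids = struct.pack(">II", person_id, institution_id)
    # 8 fresh random bytes, so the same ID pair never yields the same block.
    nonce = secrets.token_bytes(8)
    masked = bytes(a ^ b for a, b in zip(ids, nonce))
    # XOR result followed by the random bytes: 16 bytes, one AES block.
    return masked + nonce


def parse_block(block: bytes) -> Tuple[int, int]:
    """Reverse build_block after the block has been decrypted."""
    if len(block) != 16:
        raise ValueError("expected a 16-byte block")
    masked, nonce = block[:8], block[8:]
    ids = bytes(a ^ b for a, b in zip(masked, nonce))
    person_id, institution_id = struct.unpack(">II", ids)
    return person_id, institution_id


block = build_block(123_456_789, 42)
print(len(block))          # 16
print(parse_block(block))  # (123456789, 42)
```

Because the random bytes are carried inside the block, whoever holds the decryption key can recover both identifiers without any lookup table, while two conversions of the same pair look unrelated to everyone else.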

Conclusion

To summarize:
  • Integrating identifiers with different characteristics creates significant excess linkage risk.
  • The costs of linkage are so low that they do not justify the excess risk.
  • Therefore, a common tax and pension number would not make sense.
In the next issue, I would like to discuss the risks associated with reassignable identifiers, the quality of information associated with identifiers, and other issues.[3]

(footnote)

  1. Actually, you should use an AEAD mode such as GCM instead of CBC. CBC needs a separate integrity check, and that cost is not insignificant. With hmac(md5), you can only do about 2.4 million operations per second per CPU.
  2. In fact, it would be much cheaper than implementing various security measures to control excess risk.
  3. Well, when it comes to the number system, we need to discuss what we want to achieve, that is, the "purpose," before considering the risks. If the purpose is not decided, it is not clear whether the risk is worth taking. For example, I think we need a clear "purpose," such as reducing administrative procedure costs by half.

5 thoughts on "My Number and Privacy: Identifiers and Excess Risk"

  1. I found it an interesting read.
    I think it would be good to mention that the operational costs of starting linkage include the cost of creating a mapping table between identifiers. Furthermore, since numbers/identifiers are personal information, linking them would seem to require the individual's consent to that use, and I feel that this operational cost is very, very large.
    That said, in the former case, as long as we have the standardized name, date of birth, and street address, a common number does not seem to significantly reduce the burden of creating the mapping table. The latter is not really about a common number in the first place. So, on reflection, the conclusion seems to be essentially the same.

    1. The cost of mapping between identifiers is roughly the same whether it is an integrated identifier or a linked identifier. Therefore, it is not included in the discussion here. For details on mapping, please see a separate entry (Regarding the storage and linking of common numbers/My Numbers, http://www.sakimura.org/2012/03/1558/).

      As for the point that linking numbers/identifiers requires the individual's consent because they are personal information, and that this operational cost would be very large: this does not affect the argument. The consent required is exactly the same whether it is an integrated identifier or a linked identifier. In fact, the need for consent is even stronger for an integrated identifier, because it involves "sharing." And the My Number system is trying to solve this through "law = social consent."

      So, essentially nothing changes; neither the conclusion nor the process of the discussion.
      Of course, I have taken into consideration what you have said when writing this.

      Instead, I would like to ask: why did you think consent was not required when using an integrated identifier?

      1. The reason why I thought that consent was not required when using the integrated identifier was because, in my mind, the integrated identifier and the My Number Bill were perceived as one and the same.
        ① Integrated identifier, with legislation removing the need for consent
        ② Integrated identifier (no legislation exempting consent)
        ③ Linked identifiers, with legislation removing the need for consent
        ④ Linked identifiers (no legislation exempting consent)
        Of these, only ① and ④ came to mind. ② and ③ were not even in my head.
        That's how I started writing my comment, but while I was writing, I was vaguely thinking about the above (though I hadn't organized my thoughts that well), and I ended up writing with the conclusion that "essentially the conclusion doesn't change." This may sound like I'm defending myself, but I think many people are just as unaware of option ③.

        Regarding the mapping cost between identifiers, the mechanism of the linkage platform was not immediately apparent to me from the above context, and I assumed the only possible options for the "number" were the integrated identifier and the linked identifier.
        ① Linking by code, with an integrated identifier as the number
        ② Linking directly by the integrated identifier (number)
        ③ Linking by mapping between linked identifiers (numbers), without using codes
        ④ Linking by code, with linked identifiers as the numbers
        I thought of it as a comparison between ② and ③ among the options above. If ② is naturally excluded to reduce privacy risks, the rest all require some kind of mapping, so I understand that there is not much difference in terms of workload.
        Unlike the premise of the entry linked above, I am assuming that, rather than relying on the four basic pieces of information, once the number is known (whether an integrated identifier or a linked identifier), the code can easily be obtained from the linkage platform.
        Having considered the above, I am quite unsure as to whether my understanding of the statement "the cost of mapping between identifiers is roughly the same whether they are integrated identifiers or linked identifiers" is correct.

          1. Understood.
            Thank you very much.
            I would be happy if you could write an explanatory article on the issues I have been wondering about at some point.
            I think most people mistakenly believe that linking would be easy if we simply introduced an integrated identifier (common number) and skipped complicated steps like converting codes on the linkage platform. Your argument is far removed from the intuition of ordinary people, so I feel that unless it is explained very well, people won't be convinced.
            It may or may not be a high priority, and I'm sure you're busy, so please just ignore it.
