At its heart, security risk assessment is a very simple business:

  1. You list all of the risks you can think of
  2. For each risk:
    1. You assign a probability to that risk
    2. You assign an impact, usually a financial impact, to the risk happening
    3. You multiply the impact by the probability

You then order the risks by weighted impact and deal with them in that order (a minimal code sketch of this weighting step follows the list below). Using this technique, three types of risk bubble to the top:

  1. Risks that are very likely and also have at least a moderate impact if they do happen.
  2. Risks that are very unlikely but have a huge impact if they happen.
  3. Risks that are moderately likely and have moderate impact if they do happen.
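
As a minimal sketch of the weighting step just described (the risk names, probabilities and impact figures are invented for the example), the calculation amounts to little more than multiplying and sorting:

#include <stdio.h>
#include <stdlib.h>

/* Purely illustrative: each risk has an estimated probability of occurring and an
   estimated financial impact; the weighted impact is simply their product. */
typedef struct {
    const char *name;
    double probability;
    double impact;
} risk;

static int by_weighted_impact_desc(const void *a, const void *b)
{
    double wa = ((const risk *)a)->probability * ((const risk *)a)->impact;
    double wb = ((const risk *)b)->probability * ((const risk *)b)->impact;
    return (wb > wa) - (wb < wa);   /* sort in descending order of weighted impact */
}

int main(void)
{
    risk risks[] = {
        { "Phishing of staff credentials", 0.60,   50000.0 },
        { "Data centre fire",              0.01, 2000000.0 },
        { "Laptop left on a train",        0.30,   20000.0 },
    };
    size_t n = sizeof(risks) / sizeof(risks[0]);

    qsort(risks, n, sizeof(risk), by_weighted_impact_desc);
    for (size_t i = 0; i < n; i++)
        printf("%-32s weighted impact: %8.0f\n",
               risks[i].name, risks[i].probability * risks[i].impact);
    return 0;
}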

This information is then used by risk managers, who can treat the weighted impact figure as an indication of the benefit of addressing a particular risk – one part of a cost-benefit analysis, the other part being the cost of implementing a risk reduction strategy. Residual risk (the risk that remains after the risk reduction strategy has been implemented) can also be factored into these calculations. Of course, the devil is in the detail, and the practicalities of performing these calculations are often much more difficult than this simple summary might suggest. For example:

  • You can only account for risks that you can anticipate. What about the “unknown unknowns” that are not taken into account?
  • It can be extremely difficult to come up with an accurate assessment of the probability of a particular risk occurring. Often this is a very subjective process because even when accurate historical data is available, translating that data to the probability of a risk impacting a specific organisation is problematic.
  • The calculation of the impact of a risk can be very subjective. In particular, should the indirect impact of the risk be taken into account (e.g. the time required by staff members to clean up after an incident)? If so, it is not always clear where the line should be drawn between costs that are business-as-usual and costs that result from the incident.

In any case, this is the most frequently used model for security risk assessment and it occurred to me that there might be some benefit in considering how a similar model could be applied in the case of a data privacy impact assessment. The purpose of this article is to describe some of my preliminary thoughts on construction of a model for quantitative data privacy impact assessment. I intend, in due course, to apply this model to the possible solutions to the Carrier-Grade NAT information gap.

In general, the aim of the model is to consider a scenario where there are multiple possible courses of action, each of which has a potential data privacy impact. How can the various courses of action be assigned either an absolute or relative quantitative privacy impact “score”? In the first instance, the model is broadly the same as the risk assessment model described above wherein all of the possible courses of action are listed, and figures analogous to probability and impact are assigned to each one. The combination of these two factors can then be used to compare the courses of action.

Calculating “probability”

Consider first the probability component of the risk assessment model. The probability of interest here is the probability that the course of action under consideration will lead to a privacy breach. Many of the same difficulties that arise when calculating probability in risk assessments are likely to occur here. However, there are a number of interesting factors that can come into play, particularly in cases where relative risk assessments are being carried out:

  • A risk that involves a large chain of control failures is, generally speaking, less likely than a risk that involves a single control failure. This observation gives rise to the security principle of “defence in depth”.
  • Two risks that involve the same chain of control failures, with one risk requiring a single extra control failure, are differentiated only by the probability of that extra failure (a short numerical sketch follows this list). For example:
    • Risk A requires failure of controls a, b and c.
    • Risk B requires failure of controls a, b, c and d.
    • Unless control d fails with 100% certainty whenever the other three controls fail, it is possible to observe that:
      • Risk B is less likely to occur than Risk A, since the probability of Risk B is the probability of Risk A multiplied by the probability that control d also fails.
      • The only thing that differentiates the two risks is the probability of failure of control d.
  • The deployment of novel technologies is inherently more risky than the deployment of long-established technologies, because “unknown unknowns” are much less likely to arise with long-established technologies. Of course, it is possible for “unknown unknowns” to arise in any technology, otherwise there would be no such thing as a zero-day vulnerability.
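
As a quick numerical illustration of the chained-failure point (the failure probabilities are invented for the example and the failures are assumed to be independent):

#include <stdio.h>

int main(void)
{
    /* Invented, illustrative failure probabilities for controls a, b, c and d. */
    double p_a = 0.10, p_b = 0.20, p_c = 0.05, p_d = 0.50;

    double p_risk_a = p_a * p_b * p_c;    /* Risk A: controls a, b and c must all fail */
    double p_risk_b = p_risk_a * p_d;     /* Risk B: control d must additionally fail  */

    printf("P(Risk A) = %f\n", p_risk_a); /* 0.001000 */
    printf("P(Risk B) = %f\n", p_risk_b); /* 0.000500 */
    return 0;
}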

In the context of privacy impact assessment, it is fortunate that the assessment is usually performed to compare the privacy impact of several candidate courses of action, and therefore the probability and impact only need to be calculated for a relatively small number of scenarios. In such cases it may be possible to assign a meaningful relative probability even where a calculation of absolute probability is not possible.

Calculating a counterpart for “Impact”

Let us move on now to the counterpart of impact in the risk assessment model.

Different courses of action may have different impact characteristics if a breach takes place. Factors that could influence the impact include:

  • The type of data compromised in the breach – A course of action where a breach could expose sensitive personal data would have a higher impact than a course of action that would only expose non-sensitive personal data.
  • The volume of data compromised in the breach - A course of action where a breach could expose the data of a larger number of individual data subjects would have a higher impact than a course of action that would expose the same data of a lesser number of data subjects.

Quantifying the impact may be trickier than it appears at first glance. In simple cases, the number of data subjects whose data is affected by the breach could serve as the measure. However, this will not always be adequate, as the following example scenarios show (a rough, purely illustrative scoring sketch follows the list):

  • One breach leads to the exposure of non-sensitive categories of personal data for a large number of data subjects. Another breach leads to the exposure of sensitive personal data of a small number of data subjects. How is the impact of these two scenarios to be compared? Obviously the number of data subjects alone is not adequate; it is necessary to assign a relative quantitative measure to the sensitivity of the data being lost.
  • One breach leads to the exposure of a small amount of data about a large number of data subjects. Another breach leads to the exposure of a large amount of data about a small number of data subjects. Again, the likely impact on the privacy of the individual data subjects needs to be taken into account.
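
One possible way of combining these factors is to assign each category of data a relative sensitivity weight and to score the impact as that weight multiplied by the number of data subjects affected. The categories and weights below are assumptions invented for the example, not part of any standard methodology:

#include <stdio.h>

/* Illustrative relative sensitivity weights; real values would need to be agreed
   as part of the assessment methodology. */
enum data_category { NON_SENSITIVE_PERSONAL = 0, SENSITIVE_PERSONAL = 1 };

static const double sensitivity_weight[] = {
    [NON_SENSITIVE_PERSONAL] = 1.0,
    [SENSITIVE_PERSONAL]     = 10.0
};

/* Impact score = sensitivity of the exposed data category x number of data subjects. */
static double impact_score(enum data_category category, unsigned long subjects)
{
    return sensitivity_weight[category] * (double)subjects;
}

int main(void)
{
    /* Breach 1: non-sensitive data of 100,000 subjects.
       Breach 2: sensitive data of 5,000 subjects.       */
    printf("Breach 1 score: %.0f\n", impact_score(NON_SENSITIVE_PERSONAL, 100000));
    printf("Breach 2 score: %.0f\n", impact_score(SENSITIVE_PERSONAL, 5000));
    return 0;
}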

One final scenario that should be considered is one where a breach leads to the exposure of a small amount of data about data subjects towards whom organisation A (the organisation carrying out the privacy impact assessment) has obligations, but a large amount of data about data subjects towards whom organisation A has no obligations. Can this be taken into account somehow? Should it be? Generally speaking, when conducting a risk assessment a line needs to be drawn somewhere, otherwise it would never be possible to enumerate all possible risks. In this case it would be reasonable for organisation A to consider only its own obligations, with the data privacy implications of its actions factoring into the risk assessments of the other organisation(s) whose data is likely to be impacted by those actions.

Summary

In summary, it seems, in principle at least, that a model for relative quantitative privacy impact assessment is achievable. Such a model will carry many of the same caveats and challenges that apply to current security risk assessment models.

Very often when I try to start a discussion about law enforcement access to data, the conversation immediately leaps to wiretapping as if this is a slam-dunk argument against any form of law enforcement access to data. For me, this indicates a narrowness of thinking that is prevalent in those who advocate for privacy rights above all others. The purpose of this article is to broaden the discussion by presenting a commonly used data categorisation and discussing the reasons why law enforcement agencies need access to data.

One point on the scope of this article before I begin: the discussion below relates to electronic data and does not address the collection of other types of evidence such as statements, physical evidence, fingerprints, DNA, etc.

Wiretapping is understandably an emotive topic about which people feel visceral suspicion – wondering who could be listening to, or monitoring, their communications without their knowledge – particularly in light of the Snowden revelations of government mass surveillance. However, the situation is not that simple. There are different types of data that can be accessed by law enforcement, and these are frequently categorised as follows:

  1. Subscriber data – who owns, or was controlling, a particular identity (account, IP address, etc.) at a particular point in time.
  2. Traffic data – this is, briefly, metadata about communication that has taken place between two parties (but not the content).
  3. Content data – the actual content of communication collected through mechanisms such as wiretapping.

By way of a concrete example, consider the traditional phone system:

  • Information about the individual who owns or controls a particular phone number would be subscriber data;
  • Information about whether person A called person B on the phone would be traffic data;
  • The content of the call between person A and person B would be content data.

Actions that lead to the collection of data from each of these categories are considered progressively more intrusive, with subscriber data being the least intrusive and content data being the most intrusive. Increasing levels of intrusiveness come with increasing levels of judicial oversight: an investigator requesting measures that provide access to content data would be required to demonstrate a significantly greater level of suspicion before being granted an order than an investigator asking for access to basic subscriber information.

Considered in this context, wiretapping should be thought of as a technical means to an end – it is one mechanism used by law enforcement agencies to collect a certain type of data, specifically content data.

In a more general sense, the aim of law enforcement agencies is to enforce the law (as the name suggests!). What this means, amongst other things, is the identification, and investigation, of breaches of criminal legislation. Every country has its own legislation and there is clearly disagreement amongst different countries about what constitutes a crime – this is part of the problem with discussions about law enforcement access to data, particularly on the Internet. Investigation of criminal activity must be done with the expectation that all of the law enforcement activity will be scrutinised in a court in due course. Therefore the findings of the investigation must be supported by appropriately collected and managed evidence. For a given jurisdiction, the rules for admissibility and appropriate management of evidence are commonly laid out in the criminal procedure code or equivalent.

Of course, when it comes to something as difficult and nuanced as law enforcement access to data, there is nothing simple, particularly when talking about electronic evidence and even more particularly when talking about electronic evidence collected from another jurisdiction. Below, I have provided a (far from exhaustive) list of examples of the challenges presented by these issues. I plan to describe some of these challenges in more detail in later articles.

  • There is no universal agreement on what constitutes a crime. The canonical examples here are countries that do not have laws that restrict free speech, or that do not protect individual rights to privacy.
  • There are different levels of judicial independence around the world.
  • There are different levels to which the principle of rule of law applies around the world.
  • There are different types of law enforcement agencies with a variety of different powers in different jurisdictions.
  • The time taken to gain access to evidence that is located in a different jurisdiction can be a major impediment to investigations. In fact, identifying the jurisdiction that the evidence is located in can, in itself, sometimes present an insurmountable challenge.
  • Different jurisdictions have different rules about what constitutes evidence and, in particular, where and how electronic evidence fits in their criminal procedure code. This can lead to some significant practical difficulties.
  • Fundamental technological challenges, such as those I have already covered (Carrier-Grade NAT and IPv6 Stateless Address Autoconfiguration), can prevent identification of criminals online. This is a separate problem from criminals’ use of obfuscation technologies such as Tor.

Conclusion

The individual right to privacy is critically important but it is not an absolute right. Law enforcement agencies need to gather evidence during criminal investigations, and this requirement represents an important societal need: the right of victims of crime to expect that the crimes committed against them can and will be effectively investigated by law enforcement agencies in their jurisdiction.

Wiretapping is not the only form of law enforcement access to data. The issue of law enforcement access to data is far more complex than is suggested by any simplistic dismissal of the entire topic because of an objection to wiretapping. As I concluded in my previous article, a more level-headed discussion is required to find a sensible balance between privacy and law enforcement access to data.

The terms “Data Protection” and “Privacy” are often used together, and sometimes interchangeably, but it’s important to remember that they are two different things. This article provides definitions of both and describes some of the interesting current challenges in the areas of both data protection and privacy.

Data Protection

If you decide to give some personal information to an organisation, that organisation has a legal obligation to look after your data, and that responsibility is codified in data protection law.

First there is the relationship between you - the data subject - and the organisation that you provide your personal data to - the data controller. Under European legislation, when you provide your data to an organisation, you provide it for a particular purpose and the organisation must only use it for that purpose. As well as this, the organisation must look after the security of your data and delete it when it is no longer required for the purpose for which you provided it in the first place. There are other obligations too: for example, if you ask, an organisation must provide you with a copy of all of the information it holds about you. The point is that the obligations of an organisation that takes your personal data are described in the law.

It is also possible that an organisation may collect data about you that you do not directly provide. For example, an organisation might collect information about your usage of its services and associate that with your identity. You have exactly the same rights when it comes to information collected about you in this way.

Privacy

Privacy, and the right to privacy in particular, is legally defined in different ways in different jurisdictions, often with a substantial body of case law supporting it. Generally speaking, the right to privacy is defined as a way of preventing government interference in a person’s family life, home life and correspondence. Examples include Article 8 of the European Convention on Human Rights and the Due Process Clause of the 14th Amendment to the US Constitution.

The meaning of the term “private life”, for example, is expanded upon in the case law of the European Court of Human Rights, where it has been determined that private life is a broad concept incapable of exhaustive definition (e.g. Niemietz v. Germany).

The right to privacy is not absolute, unlike, say, the prohibition of slavery (Article 4 of the European Convention on Human Rights). As described in paragraph 2 of Article 8 of the Convention, the right to privacy needs to be balanced, in accordance with the law, against the interests of a democratic society in national security, public safety, the economic well-being of the country, the prevention of disorder or crime, the protection of health or morals and the protection of the rights and freedoms of others.

Discussion

Data protection raises many very interesting practical questions, particularly in light of the new General Data Protection Regulation (GDPR). For example:

  • How can a data controller provide a subject’s information in a useful machine-readable format to meet the right to data portability if there is no such format available within a specific usage domain?
  • To what extent can a person restrict the processing of their information and still avail of a free online service?

The meaning of the right to privacy in the modern world also raises very significant legal and societal questions, but importantly these are not the same questions that are addressed by data protection legislation. For example:

  • If your right to privacy is protected by law in your own country, what about your information that is stored in another country? What right do you have to privacy from surveillance by a foreign government?
  • Where is the right balance between individual right to privacy and other societal requirements such as the rights of victims of crime?
  • Considering, for example, the recent Cambridge Analytica scandal, what impact are online services having on democratic structures and processes?

There is also a category of questions that relate to the overlap between areas of privacy and data protection. These fall into two broad categories.

  • Firstly, the questions arising from exchange of personal data for free services online. For example:
    • To what extent are people willing to provide personal data to private companies in exchange for free services?
    • Should people who are receiving a free service in exchange for access to their personal data have an expectation of privacy that goes beyond what they are entitled to under data protection legislation?
    • Should there be limits on what can be done with personal data that is provided in exchange for free services? In other words, limits on the range of business models that can support free services online?
  • Secondly, questions that relate to law enforcement access to data held by private corporations. For example:
    • How can the rule of law be enforced online without effective mechanisms for collecting evidence online?
    • How can evidence effectively and efficiently be collected across multiple jurisdictions?
    • What obligations should be placed on corporations that hold data on foreign citizens to cooperate with law enforcement agencies in the countries in which those citizens reside, or third jurisdictions where foreign citizens may be under suspicion of committing crimes?
    • To what extent should corporations that hold data cooperate with interception or monitoring orders of foreign jurisdictions?
    • Is a more effective mechanism than Mutual Legal Assistance possible to address some of these issues?

Conclusion

The recent coming into force of the GDPR raises many interesting practical challenges for organisations that process personal data. However, the interactions between individual right to privacy, online business models and the balance between privacy and other societal needs (such as law enforcement access to data) are much more fundamental and far-reaching.

Meaningful, level-headed conversations need to take place between stakeholders on all sides of the debate so that effective balances can be found.

The need for individual right to privacy and the need for law enforcement to be able to effectively investigate crime are sometimes portrayed as being irreconcilably in direct conflict with each other. Both needs are legitimate and ignoring the challenges presented by areas of conflict will not make the problem go away.

My recently published Internet Draft presents a conceptual model that allows for both sets of requirements to be met simultaneously. The reason for this publication is to show that, with some creative thinking, it is possible to identify win-win solutions that simultaneously achieve both privacy and law enforcement goals. This post contains a summary of the main ideas presented in that paper.

Current regulatory regimes typically oblige ISPs to keep records that facilitate identification of subscribers if necessary for a criminal investigation; in the case of IPv6 this will mean recording the prefix(es) that have been assigned to each customer. IPv6 addresses are assigned to organisations in blocks that are much larger than the blocks in which IPv4 addresses are assigned, with common IPv6 prefix sizes being /48, /56 and /64.

From the perspective of crime attribution, therefore, when a specific IP address is suspected to be associated with criminal activity, records will most likely be available from an ISP to identify the organisation to which the prefix has been assigned. The question then arises as to how an organisation approached by law enforcement authorities, particularly a large organisation, would be able to ascertain which host/endpoint within its network was using a particular IP address at a particular time.

This is not a new problem, with many difficulties of crime attribution already present in the IPv4 Internet.

IPv6 Stateless Address Autoconfiguration (SLAAC) describes the process used by a host to autoconfigure its interfaces in IPv6. This includes generating a link-local address, generating global addresses via stateless address autoconfiguration and then using duplicate address detection to verify the uniqueness of the addresses on the link. SLAAC requires no manual configuration of hosts, minimal (if any) configuration of routers, and no additional servers.

Originally, various standards specified that the interface identifier should be generated from the link-layer address of the interface (for example RFC2467, RFC2470, RFC2491, RFC2492, RFC2497, RFC2590, RFC4338, RFC4391, RFC5072, RFC5121). RFC7217 (A Method for Generating Semantically Opaque Interface Identifiers with IPv6 Stateless Address Autoconfiguration (SLAAC)) describes the currently recommended method, whereby an IPv6 address configured using the method is stable within each subnet, but the corresponding interface identifier changes when the host moves from one network to another.

In general terms, the approach is to pass the following values to a cryptographic hash function (such as SHA1 or SHA256):

  • The network prefix
  • The network interface id
  • The network id (subnet, SSID or similar) – optional parameter
  • A duplicate address detection counter – incremented in case of a duplicate address being generated
  • A secret key (128 bits long at least)

The interface identifier is generated by taking as many bits of the resulting hash value as required, starting from the least significant bit. The result is an opaque bit string that can be used as the interface identifier.
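
A minimal sketch of this style of identifier generation is given below. It uses SHA-256 from OpenSSL, simply concatenates the inputs into a single buffer before hashing, and takes 64 bits of the digest as the identifier; the serialisation, the field sizes and the function name are assumptions made for the example rather than a restatement of RFC7217.

#include <openssl/sha.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: derive a 64-bit opaque interface identifier from the inputs
   listed above (prefix, interface id, optional network id, DAD counter, secret key). */
uint64_t generate_interface_identifier(const unsigned char prefix[8],
                                       const unsigned char *net_iface, size_t net_iface_len,
                                       const unsigned char *network_id, size_t network_id_len,
                                       unsigned char dad_counter,
                                       const unsigned char secret_key[16])
{
    unsigned char input[256];
    unsigned char digest[SHA256_DIGEST_LENGTH];
    size_t off = 0;
    uint64_t iid = 0;

    if (net_iface_len + network_id_len > sizeof(input) - 8 - 1 - 16)
        return 0;   /* inputs too large for this simplified sketch */

    /* Concatenate the inputs into a single buffer (simplified serialisation). */
    memcpy(input + off, prefix, 8);                      off += 8;
    memcpy(input + off, net_iface, net_iface_len);       off += net_iface_len;
    if (network_id != NULL && network_id_len > 0) {
        memcpy(input + off, network_id, network_id_len); off += network_id_len;
    }
    input[off++] = dad_counter;
    memcpy(input + off, secret_key, 16);                 off += 16;

    SHA256(input, off, digest);

    /* Use 64 bits of the digest as the opaque interface identifier. */
    memcpy(&iid, digest, sizeof(iid));
    return iid;
}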

On the other hand, RFC4941 (Privacy Extensions for Stateless Address Autoconfiguration in IPv6) describes a system by which interface identifiers generated from an IEEE identifier (EUI-64) can be changed over time, even in cases where the interface contains an embedded IEEE identifier. These are referred to as temporary addresses. The reason behind development of this technique is that the use of a globally unique, non-changing, interface identifier means that the activity of a specific interface can be tracked even if the network prefix changes. The use of a fixed identifier in multiple contexts allows correlation of seemingly unrelated activity using the identifier.  Contrast this with IPv4 addresses, where if a person changes to a different network their entire IP address will change.

To prevent the generation of predictable values, the algorithm must contain a cryptographic component. The algorithm assumes that each interface maintains an associated randomised interface identifier. When temporary addresses are generated, the current value of this identifier is used.

From the crime attribution perspective, both the recommended stable and temporary address generation algorithms pseudo-randomly select addresses from the space of available addresses. When SLAAC is being used, the hosts auto-configure the IP addresses of their interfaces, meaning there is no organisational record of the IP addresses that have been selected by particular hosts at particular points in time.

My Internet Draft presents a record-retention model whereby it is possible for an organisation, if required to do so as part of a criminal investigation, to answer the question “Who was using IP address A at a particular point in time?” without being able to answer any more broadly scoped questions, such as “What were all of the IP addresses used by a particular person?”

The model described assumes that the endpoint/interface for which the IPv6 address is being generated has a meaningful, unique identifying characteristic. Whether that is the layer two address of the interface or some other organisational characteristic is unimportant for the purpose of the model.

The host generates an IPv6 address using any of the techniques described above, but most likely the technique described in RFC4941. Having completed the duplicate address detection phase of SLAAC but before beginning to use the IP address for communication, the host creates a structure of the following form:

 

typedef struct {
   char          LOG_ENTRY_TAG[17];                 /* always the fixed value "__LOG_ENTRY_TAG__" (17 bytes, no NUL) */
   unsigned char ip_address[16];                    /* the 16-byte IPv6 address                                      */
   unsigned int  identifying_characteristic_length; /* byte length of identifying_characteristic                     */
   unsigned char *identifying_characteristic;       /* organisationally interpreted byte string                      */
   unsigned int  client_generation_time;            /* seconds since the Unix epoch                                  */
   unsigned int  client_preferred_time;             /* preferred lifetime in seconds                                 */
   unsigned int  client_valid_time;                 /* valid lifetime in seconds                                     */
} log_entry;

The fields are all mandatory, and populated as follows:

  • LOG_ENTRY_TAG has the fixed, constant value “__LOG_ENTRY_TAG__”
  • ip_address contains the 16 byte IPv6 address.
  • identifying_characteristic_length contains the byte length of the identifying_characteristic field.
  • identifying_characteristic is a variable length byte string, organisationally interpreted, to represent the identifying characteristic of the host generating the IPv6 address.
  • client_generation_time contains the time, in seconds since the unix epoch, as recorded by the client creating the IPv6 address, at which the address was generated.
  • client_preferred_time contains the period, in seconds, starting at client_generation_time for which the client will use this IPv6 address as its preferred address.
  • client_valid_time contains the period, in seconds, starting at client_generation_time, for which the client will consider this IPv6 address to be valid.
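
A minimal sketch of how a host might populate this structure is shown below; the helper name, the way the identifying characteristic is supplied and the use of time() are assumptions made for the example.

#include <string.h>
#include <time.h>

/* Illustrative population of a log_entry immediately after address generation.
   new_addr is the freshly generated 16-byte IPv6 address; id and id_len carry the
   organisation's identifying characteristic for this host (e.g. a MAC address). */
void populate_log_entry(log_entry *e,
                        const unsigned char new_addr[16],
                        unsigned char *id, unsigned int id_len,
                        unsigned int preferred_secs, unsigned int valid_secs)
{
    memcpy(e->LOG_ENTRY_TAG, "__LOG_ENTRY_TAG__", 17);
    memcpy(e->ip_address, new_addr, 16);
    e->identifying_characteristic_length = id_len;
    e->identifying_characteristic        = id;
    e->client_generation_time            = (unsigned int)time(NULL);
    e->client_preferred_time             = preferred_secs;
    e->client_valid_time                 = valid_secs;
}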

When the structure has been populated, the host encrypts the structure using AES-128 in CBC mode with the selected IPv6 address being used as the encryption key. The host then submits the record above to a specified multicast address and port but, when sending the record, sends it using the unspecified IPv6 address (i.e. “::”) as the source IP address. When records are received by the logging server, listening to the specified multicast address, the logging server creates a new log entry consisting of:

  • The time the record was received, ideally calibrated to a global standard time (e.g. NTP) with the granularity of a second.
  • The encrypted record received as a binary blob.
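
A minimal sketch of the encryption step described above, using the OpenSSL EVP interface, is given below. The all-zero IV, the assumption that the structure has already been serialised into a contiguous byte buffer, and the function name are choices made for the example, not details taken from the draft; the output buffer must allow for up to one extra 16-byte block of padding.

#include <openssl/evp.h>

/* Illustrative sketch: encrypt a serialised log entry with AES-128 in CBC mode,
   using the host's own 16-byte IPv6 address as the key, as described above. */
int encrypt_log_entry(const unsigned char ipv6_addr[16],
                      const unsigned char *plain, int plain_len,
                      unsigned char *cipher, int *cipher_len)
{
    unsigned char iv[16] = {0};   /* assumption: fixed, all-zero IV for illustration */
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len = 0, total = 0;

    if (ctx == NULL)
        return -1;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, ipv6_addr, iv) != 1 ||
        EVP_EncryptUpdate(ctx, cipher, &len, plain, plain_len) != 1) {
        EVP_CIPHER_CTX_free(ctx);
        return -1;
    }
    total = len;
    if (EVP_EncryptFinal_ex(ctx, cipher + total, &len) != 1) {
        EVP_CIPHER_CTX_free(ctx);
        return -1;
    }
    total += len;
    *cipher_len = total;
    EVP_CIPHER_CTX_free(ctx);
    return 0;
}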

If and when it becomes necessary to query the recorded entries, the following (representative) process can be followed (a short code sketch of the lookup loop appears after the list):

  1. Taking the IP address for which the attribution information is required, iterate through all recorded log entries and use the IP address as a decryption key and attempt to decrypt the record.
  2. Examine the decrypted data and check whether the first 17 bytes have the value “__LOG_ENTRY_TAG__”.
    • If so:
      1. This indicates that the log entry has been successfully decrypted.
      2. The IP address contained in the log entry can be verified against the IP address that was used as a key to confirm that the log entry contains the correct value.
      3. The identifying characteristic can then be read from the log entry, along with the time at which the host generated the IP address.
      4. The time in the record can be correlated with the time in the log entry recorded by the server so that any time differential can be compensated for.
    • If not:
      1. This indicates that the log entry has not been successfully decrypted and that the current log entry pertains to a different IP address.
      2. Move on to the next log entry and try again.

It would be computationally feasible to use this process on a large number of log entries but, if necessary, the number of log entries can be reduced by selecting a range of log entries based on the time recorded by the server.

In order to decrypt a specific log entry without knowing the target IP address, a brute force approach must be adopted. Even presuming a known 64-bit address prefix, there is a space of 2^64 possible addresses to search for each individual log entry.

The privacy of the records comes from the pseudo-random nature of the IPv6 address generation mechanism, the very feature that is desirable from a privacy perspective.

The model presented here strikes a balance between the need for individual privacy at the network layer and the need for a mechanism to record data that would be required in a criminal investigation. The balance proposed is at the point where it is possible to identify, using this technique, who was using a specific IP address at a specific point in time, without being able to extract any broader information, such as all of the people who were using a particular IP address or all of the IP addresses that were used by a particular endpoint.
