This is the second part of my analysis of the re-identification implications of Carrier-Grade NAT source port logging. Before reading this, it is important that you have read the first part of this series. In this second post I analyse the re-identification characteristics of the various port selection methodologies described in the first part.

Picking up from where the previous article left off, I mentioned that there are two categories of port assignment methodology used in Carrier-Grade NATs[1]:

  • Dynamic assignment: whereby port allocations are made per-session or per-customer as required. This maximises port utilisation but generates substantial volumes of logs. To reduce the log volume, it is possible to allocate a port range to each subscriber, rather than an individual port per session.
  • Static assignment: whereby ports or port ranges are reserved for each internal address before subscriber connections are initiated. Port ranges can either be contiguous or non-contiguous.

IP Address Selection

As a slight aside, it is also possible that a Carrier-Grade NAT appliance may have more than one assigned external IP address. In this case there is the additional complexity of deciding which external IP address will be used in a mapping initiated by an internal IP address. There are two possibilities:

  • Arbitrary: where an IP address is selected at random from the pool of external IP addresses.
  • Paired: where all sessions associated with the same internal IP address are mapped to the same external IP address.

The problem with the “Arbitrary” method is that certain protocols may break if the IP associated with an upper-layer session changes while the session is underway. Examples could include some types of network games and streaming content. Therefore it is recommended that NAT devices use the “Paired” approach for selecting an IP address from the external pool.

What this means is that if there are N internal IP addresses and M external IP addresses, approximately N/M internal IP addresses will be paired with each external IP address (for example, 10,000 internal addresses behind 10 external addresses gives roughly 1,000 internal addresses per external address). This ratio will be the same regardless of how source ports are assigned (or logged). Because this analysis examines the differential re-identification power of logging source port versus not logging source port, the use of Paired IP addressing makes no difference and IP address selection is therefore not considered further.

Dynamic Port Assignment Analysis

The first category of port assignment is dynamic port assignment whereby port allocations are made per-session or per-customer as required. As I mentioned in the previous article, the port selection mechanisms used are:

  • Port preservation: The NAT attempts to preserve the port number that was used internally when assigning a mapping to an external IP address and port. In the case of a port collision - where two internal IP addresses attempt to use the same port number - some NATs will override the previous mapping to preserve the same port; others will assign a different IP address from the pool of external IP addresses (presuming other addresses are available); failing that, the NAT will pick a different port.
  • Parity preservation: Some NATs will preserve the parity of the internal port when selecting an external port. In other words an even numbered internal port will be mapped to an even numbered external port, and similarly for odd numbered ports.
  • Port randomisation: In cases where port preservation is not being performed, the NAT should obfuscate selection of the external port. Algorithms are available to preserve port parity if necessary while still obfuscating the external port.

If port preservation is in use, the source port number on the external side of the NAT will be the port number that was selected by the originating host on the internal side of the NAT. In this case the re-identification characteristics of logging source port are the same as if the NAT were not present at all, and are predominantly determined by the algorithm the originating host's operating system uses to select source ports. In some specific scenarios it may be possible to determine that multiple hosts are sharing an IP address: for example, if the operating systems select sequential port numbers for sequential connections, it may be possible to identify multiple interleaved series of sequential port numbers coming from the same IP address. This will only be possible if (a) a very small number of internal IP addresses are using the NAT and (b) the volume of log entries in the log file being analysed supports this type of analysis.
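
To make this concrete, here is a minimal sketch of the kind of sequential-series analysis described above. The log format, the types and the greedy chaining heuristic are all illustrative assumptions of my own, not a description of any real tool:

/* Sketch: grouping log entries from one external IP address into chains
 * of sequential source ports. Hypothetical log format and heuristic. */
#include <stdio.h>

#define MAX_CHAINS 64

typedef struct {
    unsigned short port;   /* source port from the log entry */
    long timestamp;        /* entries assumed sorted by time */
} entry;

/* Greedily attach each entry to a chain whose last port is one lower;
 * each resulting chain is a candidate "single internal host" series. */
static int count_chains(const entry *log, int n)
{
    unsigned short last[MAX_CHAINS];
    int chains = 0;

    for (int i = 0; i < n; i++) {
        int placed = 0;
        for (int c = 0; c < chains && !placed; c++) {
            if ((unsigned short)(last[c] + 1) == log[i].port) {
                last[c] = log[i].port;     /* extend an existing series */
                placed = 1;
            }
        }
        if (!placed && chains < MAX_CHAINS)
            last[chains++] = log[i].port;  /* start a new candidate series */
    }
    return chains;
}

int main(void)
{
    /* Two interleaved sequential series, as two hosts might produce. */
    entry log[] = { {50000, 1}, {61001, 2}, {50001, 3},
                    {61002, 4}, {50002, 5} };
    printf("candidate internal hosts: %d\n",
           count_chains(log, (int)(sizeof log / sizeof *log)));
    return 0;
}

Run against the five example entries, the sketch reports two candidate hosts, illustrating how interleaved sequential series could betray the presence of multiple machines behind one address.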

If port randomisation is in use then the source port on the external side of the NAT will be pseudo-randomly selected from the pool of available ports. This will lead to a roughly pseudo-randomly distributed range of source ports from the perspective of an external IP address, meaning it will be extremely difficult to disentangle the log using the source port information to re-identify the activity of any individual internal IP addresses.

The use of parity preservation does not really change the fundamentals of the analysis above because port preservation or port randomisation can both be done in a way that will preserve port parity.

In summary, except for one or two special cases, there is very limited scope for enhanced re-identification using source port where dynamic port assignment is in use.

Static Port Assignment Analysis

I turn now to static port assignment, whereby ports or port ranges are reserved for each internal address before subscriber connections are initiated. Port ranges can either be contiguous or non-contiguous, but in either case there is a statically configured relationship between one or more ports and a specific internal IP address.

In this situation the re-identification power of source port logging will depend on the number of ports/size of the port range allocated to a specific internal IP address. It is generally recommended behaviour that ports are mapped within the same one of the three port ranges (0-1023, 1024-49151 and 49152-65535). Considering just the latter two port ranges (since ports 0-1023 are not usually used as source ports), there is an available range of 64,512 ports.

If there are N internal IP addresses and the ports are statically distributed evenly between them, there will be 64,512/N ports assigned to each IP address (for example, 1,000 internal IP addresses sharing the full range would each receive roughly 64 ports). It is possible that not all ports are assigned to the internal IP addresses, with some being retained for future use, so in general there will be M/N ports assigned to each IP address, where M is the total number of ports assigned across all internal IP addresses.

The M/N ports that are assigned to a specific internal IP address may or may not be contiguous, so the question that next arises is how the source port could be used to correlate multiple log entries and so identify the activity of an individual internal IP address.

One approach would be to make a number of educated guesses and analyse the logs to determine whether a particular hypothesis matches the data. For example, one could hypothesise that the ports assigned to a given internal IP address are contiguous. The logs could then be searched for entries from a particular IP address and sorted by port number, looking for clusters of ports among the log entries. This might work if the activity of only a small number of internal IP addresses is represented in the log: in such a case it would be possible to identify clusters of port numbers and conclude, with a reasonable degree of confidence, that each cluster represented the activity of a different internal IP address. If, on the other hand, the activity of a significant number of internal IP addresses is represented, this type of analysis becomes much more difficult: the port ranges assigned to internal IP addresses begin to merge into one contiguous range, and it becomes increasingly difficult to separate the activity of different internal IP addresses. Regardless of the number of internal IP addresses represented in the logs, such an analysis would also depend on the availability of a sufficient number of records to identify clusters of port numbers with adequate statistical significance.
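
Here is a minimal sketch of such a hypothesis test, assuming contiguous per-subscriber ranges. The gap threshold is an arbitrary illustrative value of my own, not something derived from any standard:

/* Sketch: clustering sorted source ports by gap size to test the
 * "contiguous per-subscriber range" hypothesis. */
#include <stdio.h>
#include <stdlib.h>

#define GAP_THRESHOLD 512  /* assumed minimum gap between subscriber ranges */

static int cmp_ushort(const void *a, const void *b)
{
    return (int)(*(const unsigned short *)a) - (int)(*(const unsigned short *)b);
}

/* Returns the number of port clusters; each cluster is a candidate
 * per-subscriber port range. */
static int count_clusters(unsigned short *ports, size_t n)
{
    if (n == 0) return 0;
    qsort(ports, n, sizeof *ports, cmp_ushort);
    int clusters = 1;
    for (size_t i = 1; i < n; i++)
        if (ports[i] - ports[i - 1] > GAP_THRESHOLD)
            clusters++;  /* a large gap suggests a different subscriber */
    return clusters;
}

int main(void)
{
    unsigned short ports[] = { 20000, 20003, 20007, 41000, 41002, 41010 };
    printf("candidate subscribers: %d\n",
           count_clusters(ports, sizeof ports / sizeof *ports));
    return 0;
}

With the six example ports the sketch reports two clusters, consistent with two hypothesised subscribers; with many subscribers the gaps shrink and the clusters merge, which is exactly why the analysis degrades as the number of internal IP addresses grows.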

Another possibility would be that two or more contiguous sub-ranges of ports are assigned to each subscriber. In this case it is much more difficult to see how logging source port helps to re-identify internal subscribers because (a) the port use will appear much more fragmented and (b) it will not be apparent that the two (or more) contiguous sub-ranges of ports relate to the same internal subscriber.

In summary, the parameters that will influence the ability to differentially re-identify subscribers based on logs containing IP address only versus IP address and source port, when static port assignment is in use, are:

  1. Number of internal IP addresses.
  2. Number of ports that have been allocated in total from the available range.
  3. Methodology used for statically assigning ports to internal IP addresses.
  4. Number of internal IP addresses represented in the logs being examined.

Additionally, the volume of records available for analysis in the log being examined strongly influences the confidence of any results found. An attacker who has gained access to the logs knows none of these parameters.

It is therefore reasonable to conclude that the re-identification risk of storing source port along with IP address is about the same as storing IP address alone. This holds except in a small number of very specific cases, such as a log with a large number of entries containing the activity of a very small number of internal IP addresses that have passed through a NAT which statically assigns a single contiguous port range to each internal IP address.

Identification of the Internal IP Address or Subscriber Identity

What the above analysis has considered is the correlation of activity based on IP address and source port. None of the above will enhance the ability of an attacker that is in possession of logs from an Internet-facing server to identify the internal IP address that is behind the NAT or, ultimately, the identity of the subscriber.

There might be application-layer information in the logs that would allow the identification of the internal IP address or subscriber, but that information would be present regardless of whether source port is logged and is therefore out of scope of this analysis. It does not have any influence on the differential re-identification power of logs containing IP address versus logs containing IP address and source port.

Conclusion

It might be possible in some remote cases to correlate unrelated sessions with increased resolution if source port is logged versus logging only IP address, but such an analysis would depend on many parameters that could not be known to an attacker in possession of compromised logs.

This analysis examines the differential re-identification power of Internet-facing server logs where only IP address is retained versus the case where IP address and source port are retained, when those logs are analysed without access to the ISP’s records. In other words, it is an assessment of the difference in privacy risk between a data breach involving the leak of logs with just IP address and one involving logs with IP address and source port.

In that regard, what this analysis shows is that even if analysis of the leaked logs allows multiple sessions to be correlated based on source port (which is possible only in a small set of specific circumstances), the inability to access ISP records means that it will not be possible to identify the internal IP address or subscriber identity with any increased resolution.

In conclusion, the analysis would seem to indicate that the logging of source port does not substantially increase the re-identification risk arising from the loss of logs of Internet-facing servers.


[1] https://tools.ietf.org/html/draft-chen-sunset4-cgn-port-allocation-05


I am of the opinion that the best available approach to addressing the Carrier-Grade NAT information gap is source port logging at Internet-facing servers. During a recent discussion on this topic it was asserted to me that the storage of source port in Internet-facing server logs would necessarily increase the ability of a nefarious actor with access to the logs to re-identify someone from their activity at that Internet-facing server.

Notwithstanding the point that source port logging is the “least worst” of the available options from a privacy point of view for addressing the Carrier-Grade NAT information gap, I am not willing to accept the assertion that source port logging increases the re-identification power of Internet-facing server logs without seeing some evidence to support the claim. No evidence was provided at the time of the discussion, and although it is not my claim and the burden of proof should really be on the person making it, I am willing to investigate this matter and see where the available evidence leads me.

I originally intended to write the entire analysis in a single article but it was getting far too long, so I have broken it up into several pieces. In this first post I describe some of the complexities of how NATs select external ports when they are performing mappings.

Defining the problem

The assertion is that the storage of source port in Internet-facing server logs increases the resolving power of logs to re-identify people. Crucially, this must be done in the absence of access to any other source of information (e.g. ISP records) and also cannot take into account anything that would be present in the logs anyway (e.g. resolving based on URLs requested or other application layer information – because this information will be present in the logs whether source port is logged or not).

Therefore what is required is to measure the differential resolving power of Internet-facing server logs that contain source IP address versus those that contain source IP address and source port but are in all other ways identical.

How could this be measured?

Suppose there are X people sharing a particular IP address, A, behind a Carrier-Grade NAT. If only IP address is recorded then it is not possible, using only the IP address, to tell which of those X people was responsible for a given log entry. To what extent would the logging of the source port enable identification of one or more of the X people using the IP address?

A second question that could be considered is whether, with access to a substantial volume of logs, it is possible to correlate entries in a way that associates multiple log entries with a specific individual, where such a correlation would not be possible if source ports were not logged.

How NATs create internal-external mappings

RFC6888 defines the common requirements for Carrier-Grade NATs. The first requirement is that “If a CGN forwards packets containing a given transport protocol, then it must fulfil that transport protocol’s behavioural requirements.”

For the purposes of this article, TCP and UDP behavioural requirements are considered.

RFC4787 (NAT behavioural requirements for unicast UDP) defines the following port mapping behaviours (illustrated, together with the TCP addition below, in the sketch that follows):

  • Endpoint-independent mapping: The NAT reuses the port mapping for subsequent packets sent from the same internal IP address and port to any external IP address and port. In other words, an internal IP address and port always map to the same external IP address and port. RFC7857 (Updates to NAT behavioural requirements) clarifies that this behaviour may be extended to connections originating from different internal source IP addresses and ports as long as their destinations are different.
  • Address-dependent mapping: The NAT reuses the port mapping for subsequent packets sent from the same internal IP address and port to the same external IP address, regardless of the external port. In other words, an internal IP address and port number always map to the same port number from the perspective of a specific IP address on the Internet.
  • Address and port-dependent mapping: The NAT reuses the port mapping for subsequent packets sent from the same internal IP and port to the same external IP and port while the mapping is still active. In other words, separate external port numbers are used for each mapping through the NAT.

RFC5382 (NAT behavioural requirements for TCP) adds an additional port mapping behaviour:

  • Connection-dependent mapping: The NAT never reuses a port mapping. In other words, for each connection a new mapping is allocated.
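
As noted above, the practical difference between these behaviours comes down to the key a NAT consults to decide whether an existing mapping can be reused. Here is a minimal sketch, with simplified types of my own invention rather than structures from any real implementation:

/* Sketch: the lookup key a NAT might consult for each mapping
 * behaviour. Simplified, illustrative types only. */
#include <stdint.h>

typedef struct {
    uint8_t  addr[16];   /* IP address (IPv4-mapped or IPv6) */
    uint16_t port;
} endpoint;

/* Endpoint-independent: reuse is keyed only on the internal endpoint. */
typedef struct { endpoint internal; } eim_key;

/* Address-dependent: reuse is also keyed on the external host address. */
typedef struct { endpoint internal; uint8_t ext_addr[16]; } adm_key;

/* Address and port-dependent: reuse is keyed on the full external
 * endpoint, so each destination gets its own mapping. */
typedef struct { endpoint internal; endpoint external; } apdm_key;

/* Connection-dependent (TCP): no reuse at all - no key is consulted
 * and every new connection allocates a fresh mapping. */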

How NATs select external ports for mappings

RFC4787 (NAT behavioural requirements for unicast UDP) describes the following port selection behaviours (a sketch of the randomisation approach follows the list):

  • Port preservation: The NAT attempts to preserve the port number that was used internally when assigning a mapping to an external IP address and port. In the case of a port collision - where two internal IP addresses attempt to use the same port number - some NATs will override the previous mapping to preserve the same port; others will assign a different IP address from the pool of external IP addresses (presuming other addresses are available); failing that, the NAT will pick a different port.
  • Port overloading: Some NATs always use port preservation even in cases of collision. This is known as port overloading. RFC4787 recommends that port overloading is not done and RFC5382 (NAT behavioural requirements for TCP) requires that NATs must not perform port overloading.
  • Parity preservation: Some NATs will preserve the parity of the internal port when selecting an external port. In other words an even numbered internal port will be mapped to an even numbered external port, and similarly for odd numbered ports.
  • Port randomisation: In cases where port preservation is not being performed, RFC6056 (Port randomisation recommendations) recommends that the NAT should obfuscate selection of the external port. If port preservation is being used, RFC6056 further recommends that the NAT should obfuscate the selection of the external port if the port needs to be changed. The algorithms described in RFC6056 can also be adapted to preserve port parity if necessary while still obfuscating the external port. RFC7857 (Updates to NAT behavioural requirements) also recommends that NATs should follow these port randomisation recommendations.
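
As promised above, here is a minimal sketch of the flavour of keyed-hash port selection that RFC6056 describes, with optional parity preservation. The mix function is a simple stand-in of my own, not one of the RFC's recommended algorithms, and a real NAT would additionally check the candidate against ports already in use:

/* Sketch: obfuscated external port selection in the spirit of RFC6056,
 * with optional parity preservation. */
#include <stdint.h>

#define PORT_MIN 1024
#define PORT_MAX 65535

/* Illustrative keyed mix; a real NAT would use a proper keyed hash of
 * the session tuple and a secret key. */
static uint32_t mix(uint32_t a, uint32_t b, uint32_t key)
{
    uint32_t h = a * 2654435761u ^ b ^ key;
    h ^= h >> 16;
    return h * 2246822519u;
}

static uint16_t select_external_port(uint32_t internal_ip,
                                     uint16_t internal_port,
                                     uint32_t secret_key,
                                     int preserve_parity)
{
    uint32_t range = PORT_MAX - PORT_MIN + 1;
    uint16_t port = (uint16_t)(PORT_MIN + mix(internal_ip, internal_port,
                                              secret_key) % range);
    if (preserve_parity && ((port ^ internal_port) & 1u))
        port ^= 1u;  /* flip the low bit so parity matches the inside */
    return port;     /* caller must still check for ports in use */
}

Flipping the low bit to fix parity keeps the result within the 1024-65535 range, because that range contains both odd and even neighbours of every port in it.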

In the case of Carrier-Grade NAT, RFC6888 (common requirements for Carrier-Grade NATs) touches on the topic of port selection only by noting that there are three competing requirements, without commenting on or recommending an appropriate balance:

  • Carrier-Grade NAT port allocation scheme should maximise port utilisation.
  • Carrier-Grade NAT port allocation scheme should minimise log volume.
  • Carrier-Grade NAT port allocation scheme should make it hard for attackers to guess port numbers.

Detailed discussion of the issues arising from attempts to balance these three requirements in practice has already been carried out in draft-chen-sunset4-cgn-port-allocation[1]. Briefly, that document describes two categories of port assignment methodology:

  • Dynamic assignment: whereby port allocations are made per-session or per-customer as required. This maximises port utilisation but generates substantial volumes of logs. To reduce the log volume, it is possible to allocate a port range to each subscriber, rather than an individual port per session.
  • Static assignment: whereby ports or port ranges are reserved for each internal address before subscriber connections are initiated. Port ranges can either be contiguous or non-contiguous.

The security considerations of the document also touch on port randomisation. Port randomisation can be performed within blocks of ports assigned to a specific subscriber.

Summary

This article describes some of the complexity of port assignment to NAT mappings. In the next article I will consider the implication for re-identification resolving power.


In a previous article I described a situation I often encounter where discussions about law enforcement access to data immediately leap to wiretapping. Obviously wiretapping is worthy of careful consideration because it is a very intrusive measure, and the oversight that is in place to prevent misuse of this measure must, of course, balance the needs of the investigation against the individual right to privacy of the suspect. I maintain, however, that discussions about law enforcement access to data that focus exclusively on wiretapping grossly oversimplify the situation. In this article I want to present some research that I have done to try to establish how much wiretapping is actually going on.

The results below indicate that the overwhelming majority of law enforcement requests received by Facebook, Google and Microsoft in 2017 (almost 96%) related to subscriber identity information/non-content data. Only 4.3% of all requests received resulted in disclosure of content in any form, and not all of these took the form of wiretapping. Therefore, 4.3% is a substantial overestimate of the proportion of wiretap requests being issued, with the vast majority of requests relating to subscriber information/non-content data.

In summary, to frame the law enforcement requirement for access to data in the context of wiretapping demonstrably misrepresents the requirements of law enforcement in the vast majority of criminal investigations.

Sources of Data

The only readily accessible sources of information that I have been able to find are the transparency reports published periodically by the various multinational service providers. For the purpose of this analysis, I have looked at the Google, Facebook and Microsoft transparency reports. The Amazon transparency report was also reviewed but the data it contained did not allow for any meaningful interpretation.

Analysis of Data

Because these corporations are all headquartered in the United States, all requests for content data must (apparently) be forwarded through the United States authorities. Therefore, reporting on requests received for content data is only present in the figures for the United States. It is not possible to tell whether these orders originated with United States or international authorities.

Generally speaking, the transparency reports include data in various categories and report on:

  • The number of requests of that type received
  • The percentage of requests of that type for which some data was produced
  • The number of users/accounts specified in all requests of that type

It is not possible using these three figures to definitively determine the percentage of accounts/users for which data was produced. For the purpose of generating some sort of estimate, it has been assumed that applying the percentage of requests for which data was produced to the number of accounts specified yields a meaningful approximation (for example, if 60% of requests produced data and 1,000 accounts were specified, the estimate is 600 accounts). Of course, this estimation methodology will not be accurate if a small number of requests represent a large percentage of the requested users, and more information would be required to discriminate between these possibilities.

Most organisations produce bi-annual transparency reports, so the figures that have been calculated here cover the entire year of 2017.

No explanatory notes are available for any of the source data so some educated guesses have been made. The following sections describe the interpretation of the data that has been used in each case.

Google

The Google data is available in a long, difficult-to-analyse spreadsheet in which it is possible to find the number of requests of different types received from different countries. The categories of requests presented are (not all categories are present for all countries):

  • Emergency disclosure requests
  • Other court orders
  • Pen register orders
  • Preservation requests
  • Search warrants
  • Subpoenas
  • Wiretap orders

For all countries apart from the United States, the most applicable figure for “all” requests for data received from that country appears to be the “Other Legal Requests” entry. For the United States, the figures that appear to best represent “all” requests for data received are the sum of the “Search Warrants” and “Subpoenas” entries.

For the wiretapping calculation, only the United States figures contain wiretapping data, and in this case the sum of the “Pen Register Orders” and the “Wiretap Orders” entries have been used.

Facebook

Facebook produce two sets of transparency figures per year, one for the first half of the year and another separate set of data for the second half of the year. The data is presented in a tabular form with a single row of information for each country from which requests have been received.

No explanatory notes are available, so it has further been assumed that the columns titled

  • Total Data Requests,
  • Total users/accounts requested, and
  • Percent requests where some data produced

encompass all requests that have been received by Facebook, and the columns that relate to

  • Pen Register/Trap and Trace
  • Court Order (18 USC 2703(d))
  • Title III

encompass all requests that have been received by Facebook for wiretapping or other content disclosure. Brief searching online indicates that Title 18, Section 2703 of the US Code relates to the disclosure of the content of wire or electronic communication and Title III relates to electronic surveillance. Therefore, in the interests of over-representing rather than under-representing the wiretapping figures, all three of these sets of figures have been aggregated in the results below.

Microsoft

Microsoft produce two sets of transparency figures per year, one for the first half of the year and another for the second half. As with the other transparency reports, the figures do not allow determination of the percentage of accounts/users for which data was actually produced; only the percentage of requests for which data was produced is reported. The same estimation approach described above has therefore been applied: the percentage of requests for which data was produced is applied to the number of accounts specified, with the same caveat that this will not be accurate if a small number of requests represent a large percentage of the requested users.

No data is provided about how many user accounts were the subject of requests resulting in disclosure of content. The only figure provided is the total number of user accounts specified in all law enforcement requests.

Summary of Findings

|  | Requests for subscriber data (all jurisdictions) for which some data was produced | Total user accounts | Total content requests for which some data was produced | Total user accounts in content requests | Percentage of produced-data requests that were content requests | Percentage of user accounts for which content was requested |
| --- | --- | --- | --- | --- | --- | --- |
| Google | 58,935[1] | 100,774[2] | 553[3] | 1,328[4] | 0.94% | 1.32% |
| Facebook | 120,249[5] | 181,775[6] | 6,541[7] | 9,374[8] | 5.44% | 5.16% |
| Microsoft | 32,046[9] | 56,400[10] | 2,002[11] | No data | 6.25% | No data |
| Totals | 211,230 | 338,949 | 9,096 | - | 4.3% | - |

[1] The sum of the value of the total requests entry from the “Other Legal Requests” row for each country, multiplied by the corresponding percentage for H1 2017 and H2 2017 added to the value of the total requests entry of the “Search Warrants” and “Subpoena” rows for the United States for H1 2017 and H2 2017.

[2] The sum of the value of the user/accounts specified entry from the “Other Legal Requests” row for each country, multiplied by the corresponding percentage for H1 2017 and H2 2017, added to the value of the user/accounts specified entry of the “Search Warrants” and “Subpoena” rows for the United States for H1 2017 and H2 2017.

[3] Value of the total entry from the “Wiretap orders” row multiplied by the percentage specified in the corresponding percentage column plus the corresponding value calculated for the “Pen Register Orders” row for H1 of 2017 added to the same value calculated for H2 of 2017.

[4] Value of the user/accounts specified entry from the “Wiretap orders” row multiplied by the percentage specified in the corresponding percentage column plus the corresponding value calculated for the “Pen Register Orders” row for H1 of 2017 added to the same value calculated for H2 of 2017.

[5] Value of the “Total data requests” column multiplied by the percentage specified in the “Percent requests where some data produced” column for H1 of 2017 added to the same value calculated for H2 of 2017.

[6] Value of the “Total users/accounts requests” column multiplied by the percentage specified in the “Percent requests where some data produced” column for H1 of 2017 added to the same value calculated for H2 of 2017.

[7] Value of the “Court Order (18 USC 2703(d))” column multiplied by the percentage specified in the “Court Order (18 USC 2703(d) Percentage” column, added to the equivalent value for the “Pen Register/Trap and Trace” columns, added to the equivalent value for the “Title III” columns for H1 of 2017 added to the same values calculated for H2 of 2017.

[8] Value of the “Court Order (18 USC 2703(d)) Accounts” column multiplied by the percentage specified in the “Court Order (18 USC 2703(d) Percentage” column, added to the equivalent value for the “Pen Register/Trap and Trace” columns, added to the equivalent value for the “Title III” columns for H1 of 2017 added to the same values calculated for H2 of 2017.

[9] The sum of the “law enforcement requests resulting in disclosure of content” plus “law enforcement requests resulting in disclosure of only subscriber/transactional (non-content) data” for H1 2017 added to the same figure for H2 2017.

[10] Estimated as follows: The sum of the “law enforcement requests resulting in disclosure of content” plus “law enforcement requests resulting in disclosure of only subscriber/transactional (non-content) data” divided by total number of law enforcement requests received. This proportion multiplied by the “accounts/users specified in requests” figure. The result of calculation of this ratio for H1 2017 and H2 2017 are added together.

[11] The sum of “Law enforcement requests resulting in disclosure of content” for H1 2017 and H2 2017.


There is hardly a criminal case these days that does not involve a component of electronic evidence – almost everybody has a smartphone, for instance, which is basically a small computer in our pockets. Electronic evidence raises some very interesting legal and practical challenges, not least the skills and knowledge required by investigators to handle this type of evidence appropriately.

The topic being covered in this article is whether and how electronic evidence is defined in the law as a category of evidence. It is generally accepted that, where possible, electronic evidence should be defined specifically in the law and I will describe below some of the challenges that can arise if this isn’t done.

What's so special about electronic evidence?

Some of the characteristics of electronic evidence make it difficult to collect and manage properly.

First of all, it is invisible to the naked eye. It is also difficult to authenticate the validity of electronic evidence and, in many cases, it will require specialist skills to access and interpret. Electronic evidence is also extremely volatile. It can easily be deleted, changed, manipulated or damaged. It may also be ephemeral, in the sense that it might only be available for collection for a short period of time.

These difficulties are not unique to electronic evidence. Several different forms of trace evidence (DNA, fingerprints, etc.) share these features. Electronic evidence also has some other unique characteristics that can present challenges. Here are a few examples:

  • The evidence may not be located in the jurisdiction where the crime is being investigated. It may, in fact, be very difficult to determine which jurisdiction the evidence is actually located in.
  • The volume of data gathered may be extremely large. Often a huge volume of data (measured in terabytes) needs to be analysed in order to identify the substantive evidence (the size of which might be measured in kilobytes). This means that the time taken to identify and analyse a specific piece of evidence is both (a) significant and (b) unpredictable. This, of course, presents huge resourcing and time management challenges for investigative units.
  • There may be inadmissible information, such as privileged communication, mixed in with the evidence.

It is the combination of all of these factors, particularly when more than one difficulty arises at the same time, which makes the collection and management of electronic evidence so interesting.

How has evidence traditionally been defined?

In most criminal procedure codes there is a definition of the categories of evidence that are admissible in court proceedings. This will often include things like witness testimony, forensic examination, results of a search of property, documentation, and so on. Some countries have updated their criminal procedure codes to include provisions specifically allowing for the admissibility of electronic evidence, whereas other countries interpret the existing provisions as allowing for the admissibility of electronic evidence.

For example, some might say that a document includes a document in any form (including a document in electronic form). The term “document” is then interpreted broadly to include any file stored on a computer. Another approach used is to say that any evidence that is the result of a properly conducted search is admissible and a search could include a search of any computers found, so any evidence collected through the subsequent search of the computer is thereby admissible.

Why is it important to have a specific definition of electronic evidence?

The problem with these approaches arises because electronic evidence covers such a broad range of sources of evidence, collected in such a broad variety of ways. The contortions required to shoehorn electronic evidence into a traditional evidence category become more and more difficult as different types of electronic evidence are encountered. Here are a few examples:

  1. If a broad interpretation of “document” is being used to encompass electronic evidence:
    1. Usually these provisions require production of an “original” version of the document. What does “original” mean in the context of electronic evidence?
    2. What about when it comes to seizing categories of electronic evidence that could in no way be interpreted as a document? Examples that spring to mind are the content of RAM or data captured directly from a network.
    3. It is usually a requirement that copies of seized documents be provided to the defendant. Will this include copies of seized electronic evidence? If so, it may be the case that the investigators are handing back control of valuable assets (e.g. bitcoin wallets) to the suspect.
  2. If electronic evidence is admissible under “search” provisions:
    1. Which aspects of the electronic evidence analysis are covered by a search order? Does the scope of the search order also include forensic acquisition? What about the subsequent analysis of the acquired data? What about live data imaging?
    2. What if an investigating officer arrives at a scene and finds a computer connected to (for example) Dropbox? Is it acceptable for the investigator to browse around the Dropbox folder, even though that data is, in all probability, stored in a different jurisdiction?

Conclusion

In this article I have deliberately side-stepped all of the practical issues of collection and management of electronic evidence, focussing instead on what electronic evidence actually means and considering some of the aspects of its admissibility in court. It is by no means a simple matter to draw up an all-encompassing definition of electronic evidence but it is important to consider these matters and make sure that the legal framework supports electronic evidence.


Very often when I try to start a discussion about law enforcement access to data, the conversation immediately leaps to wiretapping as if this is a slam-dunk argument against any form of law enforcement access to data. For me, this indicates a narrowness of thinking that is prevalent in those who advocate for privacy rights above all others. The purpose of this article is to broaden the discussion by presenting a commonly used data categorisation and discussing the reasons why law enforcement agencies need access to data.

One point on the scope of this article before I begin: the discussion below relates to electronic data and does not address the collection of other types of evidence such as statements, physical evidence, fingerprints, DNA, etc.

Wiretapping is understandably an emotive topic about which people feel visceral suspicion – wondering who could be listening to, or monitoring, their communication without their knowledge – particularly in light of the Snowden revelations of government mass surveillance. However, the situation is not that simple. There are different types of data that can be accessed by law enforcement, and these are frequently categorised as follows:

  1. Subscriber data – who owns, or was controlling a particular identity (account, IP address, etc.) at a particular point in time.
  2. Traffic data – this is, briefly, metadata about communication that has taken place between two parties (but not the content).
  3. Content data – the actual content of communication collected through mechanisms such as wiretapping.

By way of a concrete example, consider the traditional phone system:

  • Information about the individual who owns or controls a particular phone number would be subscriber data;
  • Information about whether person A called person B on the phone would be traffic data;
  • The content of the call between person A and person B would be content data.

Actions that lead to the collection of data from each of these categories are considered to be progressively more intrusive, with subscriber data being the least intrusive and content data being the most intrusive. Increasing levels of intrusiveness come with increasing levels of judicial oversight. An investigator that was requesting measures that provide access to content data would be required to demonstrate a significantly greater level of suspicion before being granted an order than an investigator that was asking for access to basic subscriber information.

Considered in this context, wiretapping should be thought of as a technical means to an end – it is one mechanism used by law enforcement agencies to collect a certain type of data, specifically content data.

In a more general sense, the aim of law enforcement agencies is to enforce the law (as the name suggests!). What this means, amongst other things, is the identification, and investigation, of breaches of criminal legislation. Every country has its own legislation and there is clearly disagreement amongst different countries about what constitutes a crime – this is part of the problem with discussions about law enforcement access to data, particularly on the Internet. Investigation of criminal activity must be done with the expectation that all of the law enforcement activity will be scrutinised in a court in due course. Therefore the findings of the investigation must be supported by appropriately collected and managed evidence. For a given jurisdiction, the rules for admissibility and appropriate management of evidence are commonly laid out in the criminal procedure code or equivalent.

Of course, when it comes to something as difficult and nuanced as law enforcement access to data, nothing is simple, particularly when talking about electronic evidence, and even more particularly when talking about electronic evidence collected from another jurisdiction. Below I have provided a (far from exhaustive) list of examples of the challenges presented by these issues. I plan to describe some of these challenges in more detail in later articles.

  • There is no universal agreement on what constitutes a crime. The canonical examples here are countries that do not have laws restricting free speech, or that do not protect individual rights to privacy.
  • There are different levels of judicial independence around the world.
  • There are different levels to which the principle of rule of law applies around the world.
  • There are different types of law enforcement agencies with a variety of different powers in different jurisdictions.
  • The time taken to gain access to evidence that is located in a different jurisdiction can be a major impediment to investigations. In fact, identifying the jurisdiction that the evidence is located in can, in itself, sometimes present an insurmountable challenge.
  • Different jurisdictions have different rules about what constitutes evidence and, in particular, where and how electronic evidence fits in their criminal procedure code. This can lead to some significant practical difficulties.
  • Fundamental technological challenges that can prevent identification of criminals online, such as topics I have already covered like Carrier-Grade NAT and IPv6 Stateless Address Autoconfiguration. This is a separate problem from the use by criminals of obfuscation technologies such as Tor.

Conclusion

The individual right to privacy is critically important but it is not an absolute right. Law enforcement agencies need to gather evidence during criminal investigations, and this requirement represents an important societal need: the right of victims of crime to expect that crimes against them can and will be effectively investigated by law enforcement agencies in their jurisdiction.

Wiretapping is not the only form of law enforcement access to data. The issue of law enforcement access to data is far more complex than is suggested by any simplistic dismissal of the entire topic because of an objection to wiretapping. As I concluded in my previous article, a more level-headed discussion is required to find a sensible balance between privacy and law enforcement access to data.

The need for individual right to privacy and the need for law enforcement to be able to effectively investigate crime are sometimes portrayed as being irreconcilably in direct conflict with each other. Both needs are legitimate and ignoring the challenges presented by areas of conflict will not make the problem go away.

My recently published Internet Draft presents a conceptual model that allows for both sets of requirements to be met simultaneously. The reason for this publication is to show that, with some creative thinking, it is possible to identify win-win solutions that simultaneously achieve both privacy and law enforcement goals. This post contains a summary of the main ideas presented in that paper.

Current regulatory regimes typically oblige ISPs to keep records to facilitate identification of subscribers if necessary for a criminal investigation; in the case of IPv6 this means recording which prefix(es) have been assigned to each customer. IPv6 addresses are assigned to organisations in blocks that are much larger than those in which IPv4 addresses are assigned, with common IPv6 prefix sizes being /48, /56 and /64.

From the perspective of crime attribution, therefore, when a specific IP address is suspected of being associated with criminal activity, records will most likely be available from an ISP to identify the organisation to which the prefix has been assigned. The question then arises of how an organisation approached by law enforcement authorities, particularly a large organisation, would be able to ascertain which host/endpoint within its network was using a particular IP address at a particular time.

This is not a new problem, with many difficulties of crime attribution already present in the IPv4 Internet.

IPv6 Stateless Address Autoconfiguration (SLAAC) describes the process used by a host to autoconfigure its interfaces in IPv6. This includes generating a link-local address, generating global addresses via stateless address autoconfiguration, and using duplicate address detection to verify the uniqueness of the addresses on the link. SLAAC requires no manual configuration of hosts, minimal (if any) configuration of routers, and no additional servers.

Originally, various standards specified that the interface identifier should be generated from the link-layer address of the interface (for example RFC2467, RFC2470, RFC2491, RFC2492, RFC2497, RFC2590, RFC4338, RFC4391, RFC5072, RFC5121). RFC7217 (A method for generating semantically opaque interface identifiers with IPv6 stateless address autoconfiguration (SLAAC)) describes the currently recommended method, whereby an address configured this way is stable within each subnet but the interface identifier changes when the host moves from one network to another.

In general terms, the approach is to pass the following values to a cryptographic hash function (such as SHA-1 or SHA-256):

  • The network prefix
  • The network interface id
  • The network id (subnet, SSID or similar) – optional parameter
  • A duplicate address detection counter – incremented in case of a duplicate address being generated
  • A secret key (128 bits long at least)

The interface identifier is generated by taking as many bits of the hash output, starting at the least significant, as are required. The result is an opaque bit stream that can be used as the interface identifier.
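
As a concrete illustration, here is a simplified sketch of this style of interface identifier generation using OpenSSL's SHA-256. The field sizes, concatenation order and 64-bit extraction are my own simplifications rather than the exact RFC7217 procedure (compile with -lcrypto):

/* Sketch: generating a semantically opaque interface identifier in the
 * spirit of RFC7217. Field layout is illustrative only. */
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

uint64_t generate_iid(const uint8_t prefix[8],            /* network prefix */
                      const uint8_t *if_id, size_t if_id_len,
                      const uint8_t *net_id, size_t net_id_len, /* optional */
                      uint8_t dad_counter,  /* bumped on DAD collision */
                      const uint8_t key[16])              /* secret key */
{
    uint8_t buf[256], digest[SHA256_DIGEST_LENGTH];
    size_t off = 0;

    if (8 + if_id_len + net_id_len + 1 + 16 > sizeof buf)
        return 0;                        /* inputs too large for this sketch */

    memcpy(buf + off, prefix, 8);               off += 8;
    memcpy(buf + off, if_id, if_id_len);        off += if_id_len;
    if (net_id && net_id_len) {
        memcpy(buf + off, net_id, net_id_len);  off += net_id_len;
    }
    buf[off++] = dad_counter;
    memcpy(buf + off, key, 16);                 off += 16;

    SHA256(buf, off, digest);

    /* Take as many of the least significant bits as required: here the
     * final 64 bits of the digest become the interface identifier. */
    uint64_t iid;
    memcpy(&iid, digest + SHA256_DIGEST_LENGTH - sizeof iid, sizeof iid);
    return iid;
}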

On the other hand, RFC4941 (Privacy Extensions for Stateless Address Autoconfiguration in IPv6) describes a system by which interface identifiers generated from an IEEE identifier (EUI-64) can be changed over time, even in cases where the interface contains an embedded IEEE identifier. These are referred to as temporary addresses. The reason behind the development of this technique is that the use of a globally unique, non-changing interface identifier means that the activity of a specific interface can be tracked even if the network prefix changes, and the use of a fixed identifier in multiple contexts allows correlation of seemingly unrelated activity. Contrast this with IPv4, where if a person changes to a different network their entire IP address changes.

To prevent the generation of predictable values, the algorithm must contain a cryptographic component. The algorithm assumes that each interface maintains an associated randomised interface identifier; when temporary addresses are generated, the current value of this identifier is used.

From the crime attribution perspective, both the recommended stable and temporary address generation algorithms pseudo-randomly select addresses from the space of available addresses. When SLAAC is being used, the hosts auto-configure the IP addresses of their interfaces, meaning there is no organisational record of the IP addresses that have been selected by particular hosts at particular points in time.

My Internet Draft presents a record-retention model whereby it is possible for an organisation, if required to do so as part of a criminal investigation, to answer the question “Who was using IP address A at a particular point in time?” without being able to answer any more broadly scoped questions, such as “What were all of the IP addresses used by a particular person?”

The model described  assumes that the endpoint/interface for which the IPv6 address is being generated has a meaningful, unique identifying characteristic. Whether that is the layer two address of the interface or some other organisational characteristic is unimportant for the purpose of the model.

The host generates an IPv6 address using any of the techniques described above, but most likely the technique described in RFC4941. Having completed the duplicate address detection phase of SLAAC but before beginning to use the IP address for communication, the host creates a structure of the following form:


typedef struct {
   unsigned char log_entry_tag[17];            /* always "__LOG_ENTRY_TAG__" */
   unsigned char ip_address[16];               /* the generated IPv6 address */
   unsigned int identifying_characteristic_length;
   unsigned char *identifying_characteristic;  /* variable length */
   unsigned int client_generation_time;
   unsigned int client_preferred_time;
   unsigned int client_valid_time;
} log_entry;

The fields are all mandatory, and populated as follows:

  • log_entry_tag has the fixed, constant value “__LOG_ENTRY_TAG__” (17 bytes, stored without a terminating NUL)
  • ip_address contains the 16 byte IPv6 address.
  • identifying_characteristic_length contains the byte length of the identifying_characteristic field.
  • identifying_characteristic is a variable length byte string, organisationally interpreted, to represent the identifying characteristic of the host generating the IPv6 address.
  • client_generation_time contains the time, in seconds since the unix epoch, as recorded by the client creating the IPv6 address, at which the address was generated.
  • client_preferred_time contains the period, in seconds, starting at client_generation_time for which the client will use this IPv6 address as its preferred address.
  • client_valid_time contains the period, in seconds, starting at client_generation_time, for which the client will consider this IPv6 address to be valid.

When the structure has been populated, the host encrypts it using AES-128 in CBC mode, with the selected IPv6 address used as the encryption key (a sketch of this step follows the list below). The host then submits the encrypted record to a specified multicast address and port but, when sending the record, uses the unspecified IPv6 address (i.e. “::”) as the source IP address. When records are received by the logging server, listening on the specified multicast address, the logging server creates a new log entry consisting of:

  • The time the record was received, ideally calibrated to a global standard time (e.g. NTP) with the granularity of a second.
  • The encrypted record received as a binary blob.
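
As referenced above, here is a sketch of the encryption step using OpenSSL's EVP interface. The all-zero IV and the flat serialisation of the structure into a byte buffer are illustrative assumptions on my part; the draft governs the real choices (compile with -lcrypto):

/* Sketch: encrypting a serialised log_entry with AES-128-CBC, keyed by
 * the 16-byte IPv6 address. Zero IV is an assumption for illustration. */
#include <string.h>
#include <openssl/evp.h>

/* Returns ciphertext length, or -1 on error. out must have room for
 * in_len plus one cipher block of padding. */
int encrypt_log_entry(const unsigned char ipv6_addr[16],
                      const unsigned char *in, int in_len,
                      unsigned char *out)
{
    unsigned char iv[16] = {0};           /* assumed IV, see note above */
    int len = 0, total = 0;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    if (!ctx) return -1;
    if (EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, ipv6_addr, iv) != 1 ||
        EVP_EncryptUpdate(ctx, out, &len, in, in_len) != 1) {
        EVP_CIPHER_CTX_free(ctx);
        return -1;
    }
    total = len;
    if (EVP_EncryptFinal_ex(ctx, out + total, &len) != 1) {
        EVP_CIPHER_CTX_free(ctx);
        return -1;
    }
    total += len;
    EVP_CIPHER_CTX_free(ctx);
    return total;
}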

If and when it becomes necessary to query the recorded entries, the following (representative) process can be followed:

  1. Taking the IP address for which the attribution information is required, iterate through all recorded log entries and use the IP address as a decryption key and attempt to decrypt the record.
  2. Examine the decrypted data and check whether the first 17 bytes have the value “__LOG_ENTRY_TAG__”.
    • If so:
      1. This indicates that the log entry has been successfully decrypted.
      2. The IP address contained in the log entry can be verified against the IP address that was used as a key to confirm that the log entry contains the correct value.
      3. The identifying characteristic can then be read from the log entry, along with the time at which the host generated the IP address.
      4. The time in the record can be correlated with the time in the log entry recorded by the server so that any time differential can be compensated for.
    • If not:
      1. This indicates that the log entry has not been successfully decrypted and that the current log entry pertains to a different IP address.
      2. Move on to the next log entry and try again.

It would be computationally feasible to use this process on a large number of log entries but, if necessary, the number of log entries can be reduced by selecting a range of log entries based on the time recorded by the server.
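
For concreteness, here is a sketch of the per-record decrypt-and-check step, mirroring the encryption sketch above. Again this is illustrative rather than the draft's normative procedure, and the IV assumption must match the encrypting side:

/* Sketch: attempting to decrypt one stored record with a candidate IPv6
 * address and checking for the 17-byte tag. */
#include <string.h>
#include <openssl/evp.h>

#define LOG_ENTRY_TAG     "__LOG_ENTRY_TAG__"
#define LOG_ENTRY_TAG_LEN 17

/* Returns 1 if the record decrypts to a log entry for this address. */
int try_decrypt(const unsigned char ipv6_addr[16],
                const unsigned char *blob, int blob_len,
                unsigned char *out, int *out_len)
{
    unsigned char iv[16] = {0};           /* must match the encrypt side */
    int len = 0, total = 0, ok = 0;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    if (!ctx) return 0;
    if (EVP_DecryptInit_ex(ctx, EVP_aes_128_cbc(), NULL, ipv6_addr, iv) == 1 &&
        EVP_DecryptUpdate(ctx, out, &len, blob, blob_len) == 1) {
        total = len;
        if (EVP_DecryptFinal_ex(ctx, out + total, &len) == 1) {
            total += len;
            /* A wrong key almost always fails CBC padding validation;
             * the fixed tag confirms any genuine match. */
            ok = total >= LOG_ENTRY_TAG_LEN &&
                 memcmp(out, LOG_ENTRY_TAG, LOG_ENTRY_TAG_LEN) == 0;
        }
    }
    EVP_CIPHER_CTX_free(ctx);
    *out_len = total;
    return ok;
}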

In order to decrypt a specific log entry without knowing the target IP address, a brute-force approach must be adopted. Presuming a known 64-bit network prefix, there is a space of 2^64 possible addresses to search for each individual log entry.

The privacy of the records comes from the pseudo-random nature of the IPv6 address generation mechanism, the very feature that is desirable from a privacy perspective.

The model presented here provides a balance between the need for individual privacy at the network layer and the need to record data that would be required in a criminal investigation. The balance proposed is at the point where it is possible to identify, using this technique, who was using a specific IP address at a specific point in time, without being able to extract any broader information such as all of the people who were using a particular IP address or all of the IP addresses that were used by a particular endpoint.
