CGN Source Port Logging - Re-Identification Part 2

Image by Highways Agency on flickr [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

This is the second part of my analysis of Carrier-grade NAT source port re-identification implications. Before reading this, it is important that you have read the first part of this series. In this second post I analyse the re-identification characteristics of the various port selection methodologies described in the first part.

Picking up from where the previous article left off, there are two categories of port assignment methodology used in Carrier-Grade NATs[1]:

  • Dynamic assignment: whereby port allocations are made per-session or per-customer as required. This maximises port utilisation but generates substantial volumes of logs. To reduce the log volume, it is possible to allocate a port range to each subscriber, rather than an individual port per session.
  • Static assignment: whereby ports or port ranges are reserved for each internal address before subscriber connections are initiated. Port ranges can either be contiguous or non-contiguous.

IP Address Selection

As a slight aside, it is also possible that a Carrier-Grade NAT appliance has more than one assigned external IP address. In this case there is the additional question of which external IP address will be used in a mapping initiated by an internal IP address. There are two possibilities:

  • Arbitrary: where an IP address is selected at random from the pool of external IP addresses.
  • Paired: where all sessions associated with the same internal IP address are mapped to the same external IP address.

The problem with the “Arbitrary” method is that certain protocols may break if the IP associated with an upper-layer session changes while the session is underway. Examples could include some types of network games and streaming content. Therefore it is recommended that NAT devices use the “Paired” approach for selecting an IP address from the external pool.

What this means is that if there are N internal IP addresses and M external IP addresses, approximately N/M internal IP addresses will be paired with each of the external IP addresses. This ratio will be the same regardless of how source ports are assigned (or logged). Because this analysis is examining the differential re-identification power of logging source port versus not logging source port, the use of Paired IP addressing will make no difference and IP address selection is therefore not considered further.
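The N/M pairing can be illustrated with a minimal sketch. It assumes, purely for illustration, that the NAT picks the paired external address by hashing the internal address; real appliances may use a different rule, and the function name and the address pools below are hypothetical.

```python
import hashlib
import ipaddress
from collections import Counter


def paired_external_ip(internal_ip: str, external_pool: list[str]) -> str:
    """Deterministically map one internal address to one external address.

    Hash-based selection is an assumption made for this sketch; the only
    property that matters is that the same internal address always gets
    the same external address ("Paired" behaviour).
    """
    digest = hashlib.sha256(ipaddress.ip_address(internal_ip).packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(external_pool)
    return external_pool[index]


# Hypothetical pools: M = 4 external addresses, N = 1000 internal addresses.
external_pool = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]
internal_ips = [str(ipaddress.ip_address("100.64.0.0") + i) for i in range(1000)]

# Count how many internal addresses land on each external address.
load = Counter(paired_external_ip(ip, external_pool) for ip in internal_ips)
for external_ip in external_pool:
    print(external_ip, load[external_ip])
```

Each external address should end up carrying roughly N/M internal addresses, and that count is unaffected by how source ports are subsequently assigned or logged.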

Dynamic Port Assignment Analysis

The first category of port assignment is dynamic port assignment whereby port allocations are made per-session or per-customer as required. As I mentioned in the previous article, the port selection mechanisms used are:

  • Port preservation: The NAT attempts to preserve the port number that was used internally when assigning a mapping to an external IP address and port. In cases of a port collision - where two internal IP addresses attempt to use the same port number - behaviour varies: some NATs will override the previous mapping to preserve the same port, others will assign a different IP address from the pool of external IP addresses (presuming other addresses are available), and otherwise the NAT will pick a different port.
  • Parity preservation: Some NATs will preserve the parity of the internal port when selecting an external port. In other words an even numbered internal port will be mapped to an even numbered external port, and similarly for odd numbered ports.
  • Port randomisation: In cases where port preservation is not being performed, the NAT should obfuscate selection of the external port. Algorithms are available to preserve port parity if necessary while still obfuscating the external port.
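The three mechanisms above can be combined in a rough sketch of external port selection for a single external IP address. This is an illustration only, not how any particular CGN implements it: the function name, the flags and the choice of 1024-65535 as the usable range are assumptions, and on a collision the sketch simply falls back to a random free port rather than overriding a mapping or switching external address.

```python
import secrets

FIRST_PORT = 1024   # assumed lower bound of the usable external port range
LAST_PORT = 65535


def select_external_port(
    internal_port: int,
    in_use: set[int],
    preserve_port: bool = True,
    preserve_parity: bool = False,
) -> int:
    """Pick an external port for a new mapping and record it in in_use."""
    if preserve_port and internal_port >= FIRST_PORT and internal_port not in in_use:
        # Port preservation: reuse the port chosen by the internal host.
        chosen = internal_port
    else:
        # Port randomisation: choose unpredictably among the free ports,
        # optionally restricted to ports with the same parity.
        candidates = [
            port
            for port in range(FIRST_PORT, LAST_PORT + 1)
            if port not in in_use
            and (not preserve_parity or port % 2 == internal_port % 2)
        ]
        if not candidates:
            raise RuntimeError("no free external port available")
        chosen = secrets.choice(candidates)
    in_use.add(chosen)
    return chosen


in_use: set[int] = set()
first = select_external_port(40000, in_use)                         # preserved
second = select_external_port(40000, in_use, preserve_parity=True)  # collision
print(first, second)
```

In this sketch the second call cannot reuse port 40000, so it receives a randomly chosen free even-numbered port instead.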

If port preservation is in use, then the source port number on the external side of the NAT will be the port number that was selected by the originating host on the internal side of the NAT. In this case the re-identification characteristics of logging source port are the same as if the NAT were not present at all, and they will be determined predominantly by the algorithm that the operating system of the originating host uses to select source ports. In some specific scenarios it may be possible to determine that multiple hosts are sharing an IP address. For example, if the operating systems select sequential port numbers for sequential connections, it may be possible to identify multiple series of sequential port numbers from the same IP address. This will only be possible if (a) a very small number of internal IP addresses are using the NAT and (b) the volume of log entries contained in the log file being analysed supports this type of analysis.

If port randomisation is in use then the source port on the external side of the NAT will be pseudo-randomly selected from the pool of available ports. This will lead to an effectively random distribution of source ports from the perspective of an external IP address, meaning it will be extremely difficult to use the source port information to disentangle the log and re-identify the activity of any individual internal IP address.

The use of parity preservation does not really change the fundamentals of the analysis above because port preservation or port randomisation can both be done in a way that will preserve port parity.

In summary, except for one or two special cases, there is very limited scope for enhanced re-identification using source port where dynamic port assignment is in use.

Static Port Assignment Analysis

Turning now to static port assignment, ports or port ranges are reserved for each internal address before subscriber connections are initiated. Port ranges can be either contiguous or non-contiguous, but in either case there is a statically configured relationship between one or more ports and a specific internal IP address.

In this situation the re-identification power of source port logging will depend on the number of ports (or the size of the port range) allocated to a specific internal IP address. It is generally the recommended behaviour that ports are mapped within the three port ranges (0-1023, 1024-49151 and 49152-65535). Considering just the latter two port ranges (since ports from 0-1023 are not usually used as source ports), there is an available range of 64,512 ports.

If there are N internal IP addresses and the ports are statically distributed evenly between them, there will be 64,512/N ports assigned to each IP address. It is possible that not all ports are assigned to the internal IP addresses, with some being retained for future use, so in general there will be M/N ports assigned to each IP address, where M is the number of ports assigned to all internal IP addresses.

The M/N ports that are assigned to a specific internal IP address may or may not be contiguous, so the question that next arises is how it would be possible to use the source port to correlate multiple log entries to identify the activity of an individual internal IP address.
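To put numbers on the even split described above: dividing a pool of ports evenly among the internal addresses gives each address the pool size divided by the number of addresses. The sketch below works this through for a few subscriber counts and, for the contiguous case, derives each subscriber's range. The function names and the subscriber counts are chosen purely for illustration, and any remainder left over by the integer division is simply treated as unassigned.

```python
USABLE_PORTS = 65535 - 1024 + 1  # ports 1024-65535 inclusive: 64,512
FIRST_PORT = 1024


def ports_per_address(num_internal: int, allocated_ports: int = USABLE_PORTS) -> int:
    """Whole number of ports each internal address gets from an even split."""
    if num_internal <= 0:
        raise ValueError("num_internal must be positive")
    return allocated_ports // num_internal


def contiguous_range(index: int, num_internal: int) -> tuple[int, int]:
    """First and last port of the contiguous block for subscriber `index` (0-based)."""
    size = ports_per_address(num_internal)
    start = FIRST_PORT + index * size
    return start, start + size - 1


for n in (32, 64, 128, 1000):
    print(f"N={n}: {ports_per_address(n)} ports per internal address")

print(contiguous_range(0, 64))   # first subscriber of 64
print(contiguous_range(63, 64))  # last subscriber of 64
```

With 64 internal addresses, for example, each address receives 1,008 ports, so the first subscriber holds 1024-2031 and the last holds 64528-65535.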

One approach would be to make a number of educated guesses and analyse the logs to determine whether a particular hypothesis matches the data. For example, one could hypothesise that the ports assigned to a given internal IP address are contiguous. The logs could then be searched for entries from a particular IP address and sorted by port number, looking for ranges of ports in the log entries. This might work if the activity of only a small number of internal IP addresses is represented in the log. In such a case, it would be possible to identify clusters of port numbers and conclude with a reasonable degree of confidence that each cluster represents the activity of a different internal IP address. If, on the other hand, the activity of a significant number of internal IP addresses is represented in the log, this type of analysis would become much more difficult: the port ranges assigned to internal IP addresses would begin to merge into one contiguous range, and as the number of internal IP addresses increases it would become increasingly difficult to separate the activity of different internal IP addresses. Regardless of the number of internal IP addresses represented in the logs, such an analysis would also depend on the availability of enough records to identify clusters of port numbers with adequate statistical significance.
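The contiguous-range hypothesis can be tested with something as simple as gap-based clustering of the ports seen from one external IP address. The following is a sketch only: the `max_gap` threshold and the sample port list are invented for illustration, and an attacker would have no principled way of choosing the threshold without knowing the NAT's configuration.

```python
def cluster_ports(ports: list[int], max_gap: int) -> list[tuple[int, int, int]]:
    """Group observed ports into clusters, splitting wherever the gap between
    consecutive sorted ports exceeds max_gap.

    Returns (lowest_port, highest_port, distinct_port_count) per cluster.
    """
    ordered = sorted(set(ports))
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for port in ordered[1:]:
        if port - clusters[-1][-1] > max_gap:
            clusters.append([port])    # large gap: start a new cluster
        else:
            clusters[-1].append(port)  # small gap: same candidate range
    return [(c[0], c[-1], len(c)) for c in clusters]


# Hypothetical source ports logged for a single external IP address.
observed = [2050, 2101, 2075, 2333, 30010, 30250, 30100, 30400, 51000, 51020]

for low, high, count in cluster_ports(observed, max_gap=300):
    print(f"{low}-{high}: {count} distinct ports")
```

With only a handful of active internal addresses the clusters are well separated, as in this made-up sample. As more internal addresses appear in the log, neighbouring ranges fill in and the gaps that the heuristic relies on disappear.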

Another possibility would be that two or more contiguous sub-ranges of ports are assigned to each subscriber. In this case it is much more difficult to see how logging source port would help to re-identify internal subscribers because (a) the port use will appear much more fragmented and (b) it would not be apparent that the two (or more) contiguous sub-ranges of ports relate to the same internal subscriber.

In summary, the parameters that will influence the ability to re-identify subscribers based on logs containing IP address and source port, as compared with logs containing IP address only, when static port assignment is in use are:

  1. Number of internal IP addresses.
  2. Number of ports that have been allocated in total from the available range.
  3. Methodology used for statically assigning ports to internal IP addresses.
  4. Number of internal IP addresses represented in the logs being examined.

Additionally, the volume of records available for analysis in the log being examined also strongly influences the confidence of any results found. The attacker who has gained access to the logs knows none of these parameters.

It is therefore reasonable to conclude that the re-identification risk of storing source port along with IP address is about the same as that of storing IP address alone. This will be valid except in a small number of very specific cases, such as a log with a large number of entries that contains the activity of a very small number of internal IP addresses which have passed through a NAT that has statically assigned a single contiguous port range to each internal IP address.

Identification of the Internal IP Address or Subscriber Identity

What the above analysis has considered is the correlation of activity based on IP address and source port. None of the above will enhance the ability of an attacker that is in possession of logs from an Internet-facing server to identify the internal IP address that is behind the NAT or, ultimately, the identity of the subscriber.

There might be application-layer information in the logs that would allow the identification of the internal IP address or subscriber, but that information would be present regardless of whether source port is logged and is therefore out of scope of this analysis. It does not have any influence on the differential re-identification power of logs containing IP address versus logs containing IP address and source port.

Conclusion

It might be possible in some rare cases to correlate unrelated sessions with increased resolution if source port is logged rather than IP address alone, but such an analysis would depend on a number of parameters that could not be known to an attacker in possession of compromised logs.

This analysis examines the differential re-identification power of Internet-facing server logs where only IP address is retained versus the case where IP address and source port are retained, when those logs are analysed in the absence of access to the ISP’s records. In other words, it is an assessment of the differential privacy risk of a data breach involving the leak of logs with just IP address versus logs with IP address and source port.

In that regard, what this analysis shows is that even if analysis of the leaked logs allows the correlation of multiple sessions together based on source port (which is possible only in a small set of specific circumstances), the inability to access ISP records will mean that it will not be possible to identify with any increased resolution the internal IP address or subscriber identity.

In conclusion, the analysis would seem to indicate that the logging of source port does not substantially increase the re-identification risk arising from the loss of logs of Internet-facing servers.

 

[1] https://tools.ietf.org/html/draft-chen-sunset4-cgn-port-allocation-05

 

 

 
