The Calculus of Threat Modeling
I have been designing secure systems and security products for 20 years. I always thought of this as “architecture,” and it took me a long time to realize that a major part of what I was doing was threat modeling. There are many established approaches to threat modeling, but because I backed into the field, I had rolled my own. This post explicitly describes what I have been doing.
Theory of Threat Modeling
The core of my approach is to play “spot the security principal.” This is getting personal with software architecture, modeling a system as a bunch of competing and cooperating characters interacting with each other. Ideally all the characters act in good faith, but what if they don’t?
So what is a “security principal”? A principal is any active entity in a system with access privileges that are in any way distinct from those of some other component it talks to. Corollary: a principal is defined by its domain of access (the set of things it has access to). Domains of access can, and often do, overlap, but it is the difference between them that makes security principals distinct. Peers at the same privilege level but with distinct identities are still distinct security principals, such as separate apps on iOS, Android, and the Windows App Store. Conversely, processes on the Windows desktop all run as the same identity and have no protection from one another, and thus are all a single security principal.
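As a minimal sketch of that definition (all the names here are hypothetical, invented purely for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    """An active entity, defined by its domain of access."""
    name: str
    domain: frozenset  # the set of things this entity has access to

def same_principal(a: Principal, b: Principal) -> bool:
    # Two entities are the same security principal only if their
    # domains of access are identical; mere overlap is not enough.
    return a.domain == b.domain

# Hypothetical examples: two sandboxed mobile apps have overlapping
# but distinct domains; two Windows desktop processes share one.
app_a = Principal("app_a", frozenset({"app_a_data", "network"}))
app_b = Principal("app_b", frozenset({"app_b_data", "network"}))
word  = Principal("word.exe",  frozenset({"user_files", "network"}))
excel = Principal("excel.exe", frozenset({"user_files", "network"}))

assert not same_principal(app_a, app_b)  # distinct security principals
assert same_principal(word, excel)       # one principal: no mutual protection
```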
We care about finding all the distinct security principals, because all attack surfaces occur between distinct security principals. That the security principals are distinct is what makes it an attack surface; a malicious principal on one side attacks the other side to obtain some access that the attacker wants but does not have.
Now we get to the “threat” in “threat modeling.” All threats occur on attack surfaces, by definition. If some threat exists that is not against an attack surface, then you likely have missed an attack surface, and probably a security principal as well.
We now have a list of all threats, but not all threats are created equal. The severity of a threat is influenced by two factors:
Difference in privilege: This is the aggregate importance of all the resources that the victim principal has access to, but the attacking principal does not.
Complexity of the interface: The more complex and bespoke an interface is, the easier it is to hack.
Thus, the severity of a threat across an attack surface ~= difference in privilege * interface complexity. The total set of attack surfaces against a specific security principal is called the principal’s trust boundary. For that principal to have any security at all, it is vital that all of its attack surfaces are at least nominally defended. If not, then the principal is said to be dominated by the other principal that has unmitigated access to it.
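That calculation can be made concrete in a few lines. This is a sketch under the simplifying assumption that every resource weighs equally; in practice you would weight each resource by its importance:

```python
def threat_severity(victim_domain: set, attacker_domain: set,
                    interface_complexity: int) -> int:
    """Severity of a threat across an attack surface: the aggregate
    importance of the resources the victim can reach but the attacker
    cannot, multiplied by the complexity of the interface."""
    # Illustrative assumption: every resource counts as 1.
    privilege_difference = len(victim_domain - attacker_domain)
    return privilege_difference * interface_complexity

# A complex interface in front of a high-privilege victim scores far
# higher than a simple interface with little privilege behind it.
print(threat_severity({"posts", "drafts", "accounts"}, {"posts"}, 5))  # 10
print(threat_severity({"posts"}, set(), 1))                            # 1
```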
Practice of Threat Modeling
There are many threat modeling tools available, but they are really just substitutes for threat modeling best practice, which is for a threat modeling expert to meet with engineers who are experts on the system being modeled in a conference room with a white board. The goal of this meeting is the complete and accurate enumeration of security principals that I mentioned in the beginning.
To achieve that, I use an approximation of Adam Shostack’s 4-question frame:
1. What are you doing?
2. What can go wrong?
3. What did you do about it?
4. Did you do a good job?
This is my variant:
1. What are you doing?
a. Get the product team to describe the system with boxes and arrows on the white board.
2. What can go wrong?
a. Box by box, identify the security principal that owns each box.
b. Ask the team what resources that box has access to, and in particular, how important those assets are.
c. Ask the team if we really have all the connections to every given box.
d. Talk about baseball or something fun and unrelated to the product.
e. Ask again if we have all the connections to each box. The baseball chat and repeating the question is fishing for the “oh yeah!” realization that the team has left something out.
3. What did you do about it?
a. Inspect the complexity of the interface for each attack surface.
b. Ask the product team if they added any mitigations to that interface.
c. Be sure to take a picture of the white board after the team is done describing the system.
d. Go home and create a threat model diagram.
4. Did you do a good job?
a. Penetration test the system.
Here is a representative white-board system diagram of the non-existent Leviathan Blogging Service.
1. Begin with the system diagram from the meeting.
2. Collapse together any nodes that are actually executing as the same security principal. Looking at the components in the diagram, we have:
a. Reader: unauthenticated Internet user reading blog posts.
b. Blogger: authenticated user of the blog service, posting blogs.
c. Edit Server: server that bloggers interact with to create and edit their posts.
d. Content Server: server that hosts the content for people to read.
e. Beast: colloquial name for the ‘everything’ server that ingests updates from bloggers, and serves up content to readers.
f. DB: database to store the content.
g. Backup: a backup database to store the content so that it is recoverable.
h. Advertising Syndicator: external company that provides display ads for the blogging service.
Most of these nodes are distinct security principals, with a few special cases:
Content Server and Beast are running as the same security principal.
DB and Backup are slaves of Beast, i.e., dominated by Beast.
Thus, a collapsed principal diagram looks like this:
3. Note the principals in the diagram that are completely beyond the team’s control, typically “the Internet” and “customers” or “users”. These are the hypothetical attackers.
4. Calculate the severity of the threat for each connection from an attacker principal to one of the system’s principals.
5. Assume that the attacker gets to control each of the nodes considered above, and apply steps 3 and 4 recursively across all nodes until all have been evaluated (the code sketch after this list walks through steps 2 through 6).
6. Color the security principals with respect to the level of threat they are subject to. I use a simple red/yellow/clear three-level ranking, because the severity calculation is very approximate, so it is not worth getting more granular. I also color out-of-scope/hypothetical attackers blue.
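Here is a sketch of steps 2 through 6 applied to the Leviathan example. The privilege values, complexity scores, and color thresholds are invented assumptions for illustration; in a real exercise those judgments come out of the white-board meeting:

```python
# A sketch of steps 2 through 6 on the Leviathan example. Privilege
# values, complexity scores, and color thresholds are invented for
# illustration; only the shape of the calculation is the point.
from collections import deque

# Step 2 already applied: Content Server and Beast collapse into one
# principal; DB and Backup remain, but are dominated by Beast.
privilege = {               # aggregate value of each principal's resources
    "Reader": 0, "Blogger": 1, "Advertising Syndicator": 1,
    "Edit Server": 5, "Content Server + Beast": 8,
    "DB": 8, "Backup": 8,
}
connections = {             # (attacking side, victim side): complexity
    ("Reader", "Content Server + Beast"): 2,
    ("Blogger", "Edit Server"): 4,       # editing is a complex interface
    ("Advertising Syndicator", "Content Server + Beast"): 2,
    ("Edit Server", "Content Server + Beast"): 3,
    ("Content Server + Beast", "DB"): 1,
    ("Content Server + Beast", "Backup"): 1,
}
attackers = {"Reader", "Blogger", "Advertising Syndicator"}  # step 3

def color(score: int) -> str:   # step 6: coarse three-level ranking
    return "red" if score >= 10 else "yellow" if score >= 5 else "clear"

# Steps 4 and 5: score each connection from an attacker, then treat
# every node the attacker could control as an attacker in turn.
threat = {p: 0 for p in privilege}
frontier, seen = deque(attackers), set()
while frontier:
    src = frontier.popleft()
    if src in seen:
        continue
    seen.add(src)
    for (a, b), complexity in connections.items():
        if a == src:
            diff = max(privilege[b] - privilege[a], 0)  # privilege difference
            threat[b] = max(threat[b], diff * complexity)
            frontier.append(b)  # a compromised node attacks onward

for p, score in threat.items():
    print(f"{p:24} {'blue (attacker)' if p in attackers else color(score)}")
```

With these made-up inputs, the sketch reproduces the coloring discussed below: the Edit Server and the combined Content Server/Beast principal come out red, while DB and Backup come out clear.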
This diagram is the result of the analysis with colors. It highlights the nodes that are most exposed to attack:
Edit Server is red because editing is a complex interface, so there is considerable risk that a malicious blogger could compromise the Edit Server.
Content Server is red because it is exposed to anyone on the Internet, and so any flaws in it can be exploited by anyone.
Beast is red because it is running as the same principal as the Content Server, so an attacker need only compromise the Content Server to gain control of Beast.
DB and Backup are clear because they are dominated by Beast. In other words, an attacker who has control of Beast has no need to hack DB or Backup, because she already has total control of DB and Backup.
Content Server and Beast running as the same principal is a design error. If we change Beast to run as a distinct principal, then we would get this result:
Edit Server and Content Server remain red because they have the same exposure as before.
Beast is downgraded to yellow because it is running as a distinct principal from the Content Server. It is not downgraded all the way to clear, because it is exposed to attack from two red principals.
DB and Backup are unchanged.
Whether design errors are found or not, the threat modeling highlights the nodes that genuine attackers are most likely to attack. They are the nodes that penetration testers should focus on, and they are the nodes where the product team should concentrate their defensive efforts.
Postscript for “Security Boundary”
The term “security boundary” is commonly used with respect to Microsoft systems. A security boundary is defined by Mark Russinovich as a special form of trust boundary that is “a wall through which code and data can’t pass without the authorization of a security policy.” What Russinovich meant by that is that it is a complete, closed wall, one that does not have end-runs around the back side. It is a trust boundary for which there are no unmitigated attack surfaces.
In practice, “security boundary” for Microsoft has come to mean a trust boundary that Microsoft has robustly defended over a long period of time. In private communications, Russinovich confirmed that this is consistent with his notion of a security boundary. It is a useful notion, because it consistently tells the team that if they breach a security boundary, it is a top priority to fix it. Conversely, breaching some other trust boundary that is not a security boundary is bad, but it is not fatal.
This has led a lot of people to believe that if their particular attack surface is not part of a security boundary, they don’t have to bother defending it. That is a toxic notion in a large organization, because it leads to a death spiral: no one defends their attack surface so long as it is not the weakest one on the block, so all of them settle toward the lowest point, and vulnerability increases. So, if you use the notion of a security boundary, be sure to use it as a goal, rather than an excuse.