Through Correlation and Context, Metadata Can Reveal Much More Than Many are Led to Believe...
I was eating chipotle burgers, topped with fresh guacamole, and talking with a friend last weekend. He found himself a little annoyed that the NSA is spying on people, but expressed a sentiment that it was not so bad since the NSA was only looking at metadata and not reading emails or listening in on phone calls. I almost got the impression that the NSA's claim that they only look at metadata makes it “better”.
What is Metadata?
My best answer is that metadata is data about data, not that such a definition really helps unless you think about it. Metadata adds context to other data. When I think of metadata, one of the things I first think of is photography. You obviously have a photograph that you can look at, and see the image, the colors, the shadows, and the composition. But the photograph includes context that you cannot necessarily see, like how large the digital picture is, when it was created, and what the image resolution is. It can also include data that is not even photographic in nature, like the author, the date and time, or geo-location information. Sometimes the meaning of the photograph changes depending on that additional information.
Consider this photograph (left) of part of a white-furred animal. Now name that animal. The list of potential candidates is very long and you simply cannot identify the animal with any level of confidence. But even a little data about the photograph can vastly improve your chances of success. If I tell you the photograph was taken in Colorado, you can probably limit your options to “mountain goat” or “wolf”. If you know the photograph was taken near Hudson Bay, you can probably easily identify the animal as a polar bear. But if I don’t tell you the photograph was taken in Peru, would you ever guess “alpaca”?
Consider if you will, the photograph on the right. I expect very few people will immediately recognize it, though when I asked a few people, I got answers like a black and white photo of a beach, satellite topography, piece of shale, and many other answers to try to identify what look to be striations or layers. I could give you f-stop and focal length, but it would likely not help unless you have more context. If I tell you there is no trick here, but only macro photography with a single reflected white halogen light source, and that I took that photograph in my basement office Monday morning, does that help? Probably not much, because you still have a relative lack of data about the picture; a lack of metadata.
So we can see value in metadata, but is there enough value to have a long-term project dedicated to the identification, gathering, dissemination and analysis of this data?
Assume the following map shows where Joe Green travels during days when he is not writing music. The red marks are some of the cell towers to which Joe connects, along with the timeframe of some of those connections. The location information relating to Joe’s calls is clearly metadata. No one is listening to his calls, just observing where Joe travels during a day. What can we tell from this metadata?
First of all, we know Joe lives just off Sleepy Hollow Rd. in Annandale, VA. That time stamp from midnight to 7:30 am most likely shows Joe at home, in bed, asleep, and then getting ready for work. I might guess that the stop at 7:50 near Pentagon City is a coffee shop (verified by checking Starbucks locations). At 8:15 Joe arrives at work. We can surmise that because his location is more or less steady from 8:15 to about 5:17. But Joe did not go right home. He made another stop at 5:43. Where? Given that the duration of that stop was about an hour, it could have been a quick supper, or groceries, or a drink, or a gym; any number of activities. So just by looking at the metadata about where he travelled, we know a lot about Joe: where he lives, where he gets coffee, where he works, and that he takes Columbia Pike to the office, despite all those stupid traffic lights and the chaos that is defined by Bailey’s Crossroads. We know about how much gas he uses in a normal week (18 miles round trip to work), and more.
Every Saturday at 8:00 am he shows up at the Army Navy Country Club (ANCC) for a round of golf. Since he plays at the ANCC we may wonder if Joe is a current or former officer in the military or a senior government employee in the National Security community. And, since he seems to keep a regular 8:00 am tee-time, we can probably assume he has some pull at ANCC. He takes just over 3 hours to play his round and then leaves the club, so he is probably not playing a full 18 holes. He then appears to drive to Lubber Run Park, where he stays for about five minutes. Maybe he only needs five minutes to walk his dog, Giuseppe Verdi (what else would his dog be named?), but so be it. As metadata, other than the fact he plays at ANCC, none of that is especially interesting.
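To make the weekday guesswork above a little more concrete, here is a minimal Python sketch of how a program might make the same guesses I just made by eye. It is only an illustration under my own assumptions: the tower labels are invented, the end times for the coffee stop (08:00) and the evening stop (18:45) are made up to match "a quick stop" and "about an hour", and the rule "longest overnight stay is home, longest daytime stay is work" is my simplification, not anyone's actual method.

```python
from datetime import datetime

# Hypothetical tower sightings for one weekday: (tower label, first seen, last seen).
# The start times follow the walk-through; labels and some end times are invented.
SIGHTINGS = [
    ("tower_sleepy_hollow", "00:00", "07:30"),
    ("tower_pentagon_city", "07:50", "08:00"),
    ("tower_office", "08:15", "17:17"),
    ("tower_evening_stop", "17:43", "18:45"),
]


def minutes(start, end):
    """Length of a same-day stay in whole minutes, given zero-padded HH:MM strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def guess_home_and_work(sightings, night_ends="06:00"):
    """Guess home as the longest stay that starts before night_ends,
    and work as the longest stay that starts at or after it."""
    # Zero-padded HH:MM strings sort the same way the clock times do.
    night = [(minutes(s, e), tower) for tower, s, e in sightings if s < night_ends]
    day = [(minutes(s, e), tower) for tower, s, e in sightings if s >= night_ends]
    home = max(night)[1] if night else None
    work = max(day)[1] if day else None
    return home, work


home, work = guess_home_and_work(SIGHTINGS)
print("probable home:", home)
print("probable work:", work)
```

With the made-up sightings above, the overnight stay near Sleepy Hollow Rd. comes out as the probable home and the nine-hour stay as the probable workplace. Nobody had to listen to a single call to get there.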
Now let’s consider Joe’s phone record (some meaningless numbers deleted). He calls 555-555-5555 every day at about 8:30 after he arrives at work and every day at the end of the day, which appears to be while he is driving. First bets are parents, spouse, girlfriend/boyfriend, or child. 111-111-1111 is likely a conference call number. Not sure who could get away with a 54-minute call with 222-222-2222 that started at 9:17 in the evening besides a close family member, so brother, sister, or parents. My bet is that 444-444-4444 belongs to the friend with whom he plays golf Saturday morning. So while we can start penciling in relationships, all in all, this is not very interesting information.
And metadata about his text messages is so boring I am not even including it. He rarely sends text messages, and they mostly go to the same “suspected family” numbers as above. Except that every Saturday, about the time he leaves Lubber Run Park, he sends a single text to 666-666-6666. This is slightly more interesting, since he never has any call or text communication with that number other than at about 11:48 every Saturday.
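That once-a-week text is exactly the kind of thing a program can pick out without anyone reading a word of it. Here is a rough Python sketch, again under my own assumptions: the log entries are invented (only the phone numbers and approximate times come from the description above), and the rule "at least three contacts, always on the same weekday, always within ten minutes of each other" is just one simple way to define a clockwork contact.

```python
from collections import defaultdict

# Hypothetical log entries: (number, weekday, minutes after midnight, kind).
# 510 is 8:30 am, 1277 is 9:17 pm, 708 is 11:48 am.
LOG = [
    ("555-555-5555", "Mon", 510, "call"),
    ("555-555-5555", "Mon", 1040, "call"),
    ("555-555-5555", "Tue", 512, "call"),
    ("222-222-2222", "Wed", 1277, "call"),
    ("666-666-6666", "Sat", 708, "text"),
    ("666-666-6666", "Sat", 707, "text"),
    ("666-666-6666", "Sat", 709, "text"),
]


def clockwork_contacts(log, min_events=3, max_spread=10):
    """Return numbers that are only ever contacted on one weekday,
    and always within max_spread minutes of the same time of day."""
    by_number = defaultdict(list)
    for number, weekday, minute, _kind in log:
        by_number[number].append((weekday, minute))

    flagged = []
    for number, events in by_number.items():
        weekdays = {day for day, _ in events}
        times = [minute for _, minute in events]
        same_day = len(weekdays) == 1
        tight_window = max(times) - min(times) <= max_spread
        if len(events) >= min_events and same_day and tight_window:
            flagged.append(number)
    return sorted(flagged)


print(clockwork_contacts(LOG))
```

On this toy log only 666-666-6666 is flagged. The daily calls to 555-555-5555 are spread across weekdays, and a single long call to 222-222-2222 is not a pattern at all.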
So that is part of Joe’s life in the nutshell of metadata. It is just some random information that may or may not show anything of interest. And, in Joe’s case, it is rather unremarkable. But the power of metadata does not come from the data itself but from the ability of that data to be processed and correlated in an automated fashion. If the FBI is listening to a wiretap, computers can only do so much. In the end, the best intelligence about what is heard on the wiretap comes from a law enforcement official actually listening to the tape and identifying interesting things. The audio has to be processed. But metadata is raw data that can be processed by computer instead of by a person. Computers can take Joe’s raw metadata and compare it with everyone else’s data: with my data, with your data, with my daughter’s data, with the data from my mailman, and with the data of people whom the NSA or FBI (or whoever is watching) may find of particular interest.
So yes, they can include Natasha’s data, because they know Natasha is up to no good. The available metadata shows the same type of information about Natasha that it shows about Joe. “They” know Natasha lives in a condo near Georgetown. They know what route she takes to work as a cultural attaché. They know where she goes to lunch every day. They know what numbers she dials and whom she texts. They know she works out at a gym in Rosslyn. They know she walks her Bichon Frisé, Boris (really, what else would she call him?), in Lubber Run Park every Saturday afternoon. They know where she shops for groceries. If Natasha is a “person of interest”, they probably know what kind of toothpaste she uses. And all the information they have about Natasha is input into the computer systems which are processing this metadata from everyone else, always looking for correlation. Looking for patterns and similarities…
Wait a minute. She walks Boris in Lubber Run Park Saturday after lunch. Do we know anyone else who is regularly in Lubber Run Park?
So, because some computer system full of metadata identifies a correlation, Joe Green suddenly becomes a person of interest instead of just a pile of metadata. Law enforcement and intelligence agencies assess the potential threat; who is Joe Green and to what does he have access? If Joe works in the Crystal City Comic Book shop, and plays golf at the ANCC with his USAF Ret. Col. brother, then this might quickly end up as temporary noise. If Joe holds a clearance and is currently employed in classified projects, law enforcement would likely subpoena additional phone records and information to try to build more context; is this a coincidence or is this true correlation? At some point, if they felt they had enough information that they thought actual email text, text message text, or phone call audio would be valuable, they would establish probable cause and pursue a warrant.
So the value in the metadata isn’t about the data itself. The value in the metadata is in the ability of computer systems to process large amounts of metadata, looking for correlations that may not otherwise be found. But to do that, you really need a lot of metadata, because you want the ability to add as much context as you can as rapidly as you can.
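The kind of correlation I am describing can be sketched in a few lines of Python. Treat this as a toy under stated assumptions: the visit data is invented from the story above, the watchlist is hypothetical, and matching on "same place, same weekday" is far coarser than whatever a real system would do with actual timestamps.

```python
from collections import defaultdict

# Hypothetical recurring visits: person -> set of (place, weekday) pairs.
VISITS = {
    "Joe": {("Annandale", "Mon"), ("ANCC", "Sat"), ("Lubber Run Park", "Sat")},
    "Natasha": {("Georgetown", "Mon"), ("Rosslyn gym", "Tue"), ("Lubber Run Park", "Sat")},
    "Mailman": {("Annandale", "Mon"), ("Annandale", "Tue")},
}

# Hypothetical list of people who are already of interest.
WATCHLIST = {"Natasha"}


def overlaps_with_watchlist(visits, watchlist):
    """Map each person not on the watchlist to the (place, weekday)
    pairs they share with someone who is on it."""
    # Invert the data: (place, weekday) -> everyone regularly seen there.
    seen_at = defaultdict(set)
    for person, spots in visits.items():
        for spot in spots:
            seen_at[spot].add(person)

    hits = defaultdict(set)
    for spot, people in seen_at.items():
        if people & watchlist:
            for person in people - watchlist:
                hits[person].add(spot)
    return {person: sorted(spots) for person, spots in hits.items()}


print(overlaps_with_watchlist(VISITS, WATCHLIST))
```

On this toy data, Joe is the only one who comes back, and only because of Lubber Run Park on Saturdays. The mailman shares a neighborhood with Joe, but not with anyone on the watchlist, so he stays a pile of metadata. Notice that the sketch gets more useful, and more invasive, with every person you add to `VISITS`.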
Maybe you have heard of a little NSA project to help maximize the quantity and quality of metadata for exactly this purpose…
As for the second picture above, it is about an inch and a half section of the hamon from my wakizashi, starting at the shinogi and running out to the sharp edge. The part of the blade from the shinogi to the mune is hidden off the top of the picture. The lines on the genuine hamon come from the folded steel, which results in 1024 layers. That detail is all “data”, and not metadata. You don’t get that level of information unless you actually listen in on the phone calls.