The NSA was founded as an information aggregator, collecting communications from foreign nations. Since its inception in 1952, it has been one of the biggest consumers of high-powered computer hardware. Over the years they've bought really big boxen from IBM and Cray, middle-class boxes like VAX and later Suns, and untold numbers of Intel-based processors. It is truly not an Enterprise-class shop; rather, it's an Empire-class shop. The investment in technology and the recruitment of the brightest brains in computer science is truly staggering.
I've seen a number of people here pooh-pooh the 'wiretap' issue because the NSA doesn't have the bandwidth or capacity to read all of the email, phone conversations, etc. The implication being: don't worry, there's too much stuff for them to deal with.
I could whine on about how it wasn't really a wiretap issue (which is why they didn't go the FISA route), but I'd rather point out that that argument answers the wrong question. The appropriate question is: what portion of all that stuff do they have to deal with to obtain useful results? In other words, it's a matter of efficiency.
I have considerable professional experience in the Information Retrieval field. What I lay out here is not what the NSA has, but rather what I would set up, given the premise that the system must analyze data streams on the order of a terabyte per hour.
The first thing to be captured is the information about the communication itself, known in the industry as meta-data: data about data. This is in itself a useful byproduct of the preparatory, or grooming, phase of intake processing. However valuable this data is on its own, a major reason it is needed is for noise reduction of the monitored communications stream; think spam filtering. Unlike us, trying to individually filter spam, the view of the traffic afforded by the meta-data would allow rejection of broadcast-type messages by analyzing the traffic originating at each source. This traffic would still be cataloged and archived, but probably not indexed.
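A toy sketch of what I mean, in Python. The field names and the fan-out threshold are mine, purely for illustration; nothing here is anything the NSA is known to run. The idea is simply that the meta-data alone, seen across the whole stream, tells you which sources are broadcasting:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class MetaData:
    """Data about the communication, not its content."""
    sender: str
    recipient: str
    size: int

class BroadcastFilter:
    """Reject broadcast-style traffic by watching fan-out per source."""

    def __init__(self, fanout_threshold: int = 50):
        self.fanout_threshold = fanout_threshold
        self.recipients_by_sender = defaultdict(set)

    def observe(self, md: MetaData) -> bool:
        """Record the message; return True once the source looks like a broadcaster,
        i.e. it has fanned out to more distinct recipients than the threshold."""
        self.recipients_by_sender[md.sender].add(md.recipient)
        return len(self.recipients_by_sender[md.sender]) > self.fanout_threshold
```

A source that trips the filter gets its traffic cataloged and archived but skipped by the indexer; no message bodies are ever inspected to make that call.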
The remaining traffic would be prioritized into indexing queues. Sources with high priorities would be indexed within minutes of capture. Indexing yields another set of meta-data, which is further analyzed; this set describes the content of the messages.
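The prioritized queue is nothing exotic; here's a minimal sketch using Python's standard heap. The priority values are hypothetical, assigned by whatever scored the source upstream:

```python
import heapq
import itertools

class IndexingQueue:
    """Priority queue: messages from high-priority sources reach the indexer first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-break within a priority level

    def enqueue(self, priority: int, message_id: str) -> None:
        # heapq is a min-heap, so negate: larger priority values pop first
        heapq.heappush(self._heap, (-priority, next(self._counter), message_id))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]
```

The point is that the low-priority bulk can sit for hours without delaying the traffic someone actually cares about.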
This new data is then inserted into the index collections and passed through filters for automated routing. Those filters are basically saved searches that fire whenever a new hit arrives. They are not simple keyword searches; they are quite sophisticated.
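The shape of a standing query is simple even if real ones aren't; here is the skeleton, with predicates and document fields invented for illustration:

```python
from typing import Callable, Dict, List

class StandingQuery:
    """A saved search evaluated against every newly indexed document."""

    def __init__(self, name: str, predicate: Callable[[Dict], bool]):
        self.name = name
        self.predicate = predicate

def route(doc_meta: Dict, queries: List[StandingQuery]) -> List[str]:
    """Return the names of the saved searches the new document hits,
    so it can be routed to whoever registered them."""
    return [q.name for q in queries if q.predicate(doc_meta)]
```

Each time the indexer emits a document's content meta-data, `route` decides which inboxes it lands in; nobody has to re-run the search by hand.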
Concepts expressed within the stream would be identified, and parties would be marked as subject-matter experts as a concept repeats in their communications. Social and professional circles would be mapped with alacrity. A concept of interest thus becomes a person of interest, which broadens into a circle of interesting parties.
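That concept-to-person-to-circle escalation can be sketched as a pair of maps; again, the threshold and names are mine, not any real system's:

```python
from collections import defaultdict

class InterestGraph:
    """Map concepts to the people who repeat them, and people to each other."""

    def __init__(self, expert_threshold: int = 3):
        # concept -> party -> how many times that party has used the concept
        self.mentions = defaultdict(lambda: defaultdict(int))
        # party -> set of parties they have communicated with
        self.links = defaultdict(set)
        self.expert_threshold = expert_threshold

    def observe(self, sender: str, recipient: str, concepts) -> None:
        self.links[sender].add(recipient)
        self.links[recipient].add(sender)
        for concept in concepts:
            self.mentions[concept][sender] += 1

    def experts(self, concept: str) -> set:
        """Parties who repeat a concept often enough to be marked subject-matter experts."""
        return {party for party, n in self.mentions[concept].items()
                if n >= self.expert_threshold}

    def circle(self, party: str) -> set:
        """The circle of interesting parties around one person of interest."""
        return self.links[party]
```

Once a party crosses the threshold for a concept of interest, their `circle` is one lookup away, and every member of it inherits some of that interest.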
The software to do this can be had off the shelf, today, by anyone with enough money.
So, if you're not concerned because of the magnitude of the data set, I ask: "How efficient does this process need to be before you become concerned?" 1%? 10%? 50%?
Even this post is an answer to the wrong question because whatever they did do was done unconstitutionally. I think that even archiving spam violates the law.
-Hoot