In the previous section, Adding a New Telemetry Data, we walked through how to add a new Squid data source to Apache Metron. The inevitable next question is: how can I enrich the telemetry events in real time as they flow through the platform? Enrichment is critical when identifying threats or, as we like to call it, "finding the needle in the haystack." The customer's requirements are the following:
In this section, we will walk you through how to meet requirement 3.
Whois data is expensive, so we will not be providing it. Instead, we wrote a basic whois scraper (out of scope for this exercise) that produces whois data in CSV format. To incorporate this data, complete the following steps:
google.com, "Google Inc.", "US", "Dns Admin",874306800000
work.net, "", "US", "PERFECT PRIVACY, LLC",788706000000
capitalone.com, "Capital One Services, Inc.", "US", "Domain Manager",795081600000
cisco.com, "Cisco Technology Inc.", "US", "Info Sec",547988400000
cnn.com, "Turner Broadcasting System, Inc.", "US", "Domain Name Manager",748695600000
news.com, "CBS Interactive Inc.", "US", "Domain Admin",833353200000
nba.com, "NBA Media Ventures, LLC", "US", "C/O Domain Administrator",786027600000
espn.com, "ESPN, Inc.", "US", "ESPN, Inc.",781268400000
pravda.com, "Internet Invest, Ltd. dba Imena.ua", "UA", "Whois privacy protection service",806583600000
hortonworks.com, "Hortonworks, Inc.", "US", "Domain Administrator",1303427404000
microsoft.com, "Microsoft Corporation", "US", "Domain Administrator",673156800000
yahoo.com, "Yahoo! Inc.", "US", "Domain Administrator",790416000000
rackspace.com, "Rackspace US, Inc.", "US", "Domain Admin",903092400000
1and1.co.uk, "1 & 1 Internet Ltd","UK", "Domain Admin",943315200000
The schema of this enrichment source is domain|owner|registeredCountry|registrar|registeredTimestamp. Make sure the CSV file does not end with an empty line, as that will result in a null pointer exception.
We will use the whois_ref.csv file in step 5.
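The sample rows above each carry five comma-separated fields, and an empty line in the file is said to cause a null pointer exception. A small pre-flight check can catch both problems before step 5. The following Python sketch is one way to do it; the function name `find_csv_problems` is hypothetical, and treating any blank line as an error is an assumption about what triggers the exception.

```python
import csv


def find_csv_problems(text, expected_fields=5):
    """Return a list of human-readable problems found in whois CSV text.

    Assumptions (not from Metron itself): every row should have
    `expected_fields` fields, and any blank line counts as an error.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {lineno}: blank line")
            continue
        # skipinitialspace lets the parser accept `, "Quoted, value"`,
        # which is how the sample rows are written.
        row = next(csv.reader([line], skipinitialspace=True))
        if len(row) != expected_fields:
            problems.append(
                f"line {lineno}: expected {expected_fields} fields, found {len(row)}"
            )
    return problems
```

To check the real file, read `whois_ref.csv` as text and pass its contents to `find_csv_problems`; an empty list means neither problem was found.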
{
"config" : {
"columns" : {
"domain" : 0
,"owner" : 1
,"home_country" : 2
,"registrar": 3
,"domain_created_timestamp": 4
}
,"indicator_column" : "domain"
,"type" : "whois"
,"separator" : ","
}
,"extractor" : "CSV"
}
Because copying and pasting from this blog can introduce invisible non-ASCII characters, run the following command to strip them out:
iconv -c -f utf-8 -t ascii extractor_config_temp.json -o extractor_config.json
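As a rough illustration of what the extractor configuration describes, the Python sketch below applies it to a single line of the whois CSV. This is not Metron's implementation: `extract_record` is a hypothetical helper, and removing the indicator column from the value map is an assumption made for the illustration.

```python
import csv
import json

# The extractor configuration from this section, parsed as JSON.
EXTRACTOR_CONFIG = json.loads("""
{
  "config": {
    "columns": {
      "domain": 0,
      "owner": 1,
      "home_country": 2,
      "registrar": 3,
      "domain_created_timestamp": 4
    },
    "indicator_column": "domain",
    "type": "whois",
    "separator": ","
  },
  "extractor": "CSV"
}
""")


def extract_record(line, extractor_config):
    """Split one CSV line and name its fields according to the config.

    Returns (enrichment_type, indicator, values). Hypothetical helper
    that only illustrates the column mapping.
    """
    cfg = extractor_config["config"]
    row = next(csv.reader([line], delimiter=cfg["separator"], skipinitialspace=True))
    fields = {name: row[index] for name, index in cfg["columns"].items()}
    # Assumption: the indicator acts as the lookup key, not as a value.
    indicator = fields.pop(cfg["indicator_column"])
    return cfg["type"], indicator, fields
```

For the cnn.com row, this yields the type `whois`, the indicator `cnn.com`, and a map holding `owner`, `home_country`, `registrar`, and `domain_created_timestamp`.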
We now need to configure which element of a tuple should be enriched with which enrichment type. This configuration will be stored in ZooKeeper.
{
"zkQuorum" : "$ZOOKEEPER_HOST:2181"
,"sensorToFieldList" : {
"squid" : {
"type" : "ENRICHMENT"
,"fieldToEnrichmentTypes" : {
"domain_without_subdomains" : [ "whois" ]
}
}
}
}
Because copying and pasting from this blog can introduce invisible non-ASCII characters, run the following command to strip them out:
iconv -c -f utf-8 -t ascii enrichment_config_temp.json -o enrichment_config.json
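Where `iconv` is not available, the same cleanup can be approximated in Python. The sketch below drops every non-ASCII character, much as `iconv -c` does, and then parses the result so a malformed config is caught before it is loaded. The function name `clean_config_text` is hypothetical.

```python
import json


def clean_config_text(text):
    """Drop non-ASCII characters, then confirm the result is valid JSON.

    Raises json.JSONDecodeError if the cleaned text does not parse.
    """
    cleaned = text.encode("ascii", errors="ignore").decode("ascii")
    json.loads(cleaned)  # validation only; the parsed value is discarded
    return cleaned
```

To use it on the files from this section, read `enrichment_config_temp.json` (or `extractor_config_temp.json`) as UTF-8 text, pass the contents through `clean_config_text`, and write the result to the corresponding file without the `_temp` suffix.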
Step 5: Run the Enrichment Loader
Now that we have the enrichment source and enrichment config defined, we can run the loader to move the data from the enrichment source to the Metron enrichment store and to store the enrichment config in ZooKeeper.
/usr/metron/$METRON_RELEASE/bin/flatfile_loader.sh -n enrichment_config.json -i whois_ref.csv -t enrichment -c t -e extractor_config.json
To check that the whois data was loaded into the HBase enrichment table, run the following:
hbase shell
scan 'enrichment'
To check whether the ZooKeeper enrichment tag was properly populated, run the following:
/usr/metron/$METRON_RELEASE/bin/zk_load_configs.sh -m DUMP -z $ZOOKEEPER_HOST:2181
Generate some data by using the Squid client to execute HTTP requests. (Do this about 20 times.)
squidclient http://www.cnn.com
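Rather than typing the command about 20 times by hand, the requests can be scripted. The Python sketch below builds the command list and runs it; both function names are hypothetical, and it assumes `squidclient` is on the PATH of the machine where the script runs.

```python
import itertools
import subprocess


def squidclient_commands(urls, total=20):
    """Build `total` squidclient invocations, cycling through `urls`."""
    if not urls:
        raise ValueError("at least one URL is required")
    cycle = itertools.cycle(urls)
    return [["squidclient", next(cycle)] for _ in range(total)]


def run_commands(commands, runner=subprocess.run):
    """Run each command in order; `runner` can be swapped out for testing."""
    for command in commands:
        runner(command, check=False)
```

Calling `run_commands(squidclient_commands(["http://www.cnn.com"]))` issues the same request 20 times.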
Notice the enrichments: whois.owner, whois.domain_created_timestamp, whois.registrar, whois.home_country.
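The whois fields listed above come from combining the enrichment configuration with the loaded whois data. The Python sketch below illustrates that lookup under stated assumptions: an in-memory dictionary stands in for the HBase enrichment table, `enrich` is a hypothetical function rather than Metron's code, and the `type.field` naming simply follows the field names listed above.

```python
# The enrichment configuration from this section, as a Python dict.
ENRICHMENT_CONFIG = {
    "zkQuorum": "$ZOOKEEPER_HOST:2181",
    "sensorToFieldList": {
        "squid": {
            "type": "ENRICHMENT",
            "fieldToEnrichmentTypes": {"domain_without_subdomains": ["whois"]},
        }
    },
}

# Stand-in for the HBase table, keyed by (enrichment type, indicator).
ENRICHMENT_STORE = {
    ("whois", "cnn.com"): {
        "owner": "Turner Broadcasting System, Inc.",
        "home_country": "US",
        "registrar": "Domain Name Manager",
        "domain_created_timestamp": "748695600000",
    }
}


def enrich(sensor, message, config, store):
    """Return a copy of `message` with matching enrichment fields added."""
    enriched = dict(message)
    sensor_config = config["sensorToFieldList"].get(sensor, {})
    field_map = sensor_config.get("fieldToEnrichmentTypes", {})
    for field, enrichment_types in field_map.items():
        indicator = message.get(field)
        if indicator is None:
            continue  # the message has no value to look up for this field
        for enrichment_type in enrichment_types:
            values = store.get((enrichment_type, indicator), {})
            for key, value in values.items():
                enriched[f"{enrichment_type}.{key}"] = value
    return enriched
```

A squid message whose `domain_without_subdomains` is `cnn.com` gains the four `whois.*` fields, while a message for a domain missing from the store is returned unchanged.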