...
Analytics & Processing Examples
These examples are here for drawing out higher-level goals for Distill's functionality. This section can be removed once the goals have been solidified.
Here is a model data pipeline for SENSSOFT: RAW DATA>QUERY>FILTER/Q&A>TRANSFORMATION>PRIMITIVE FEATURE EXTRACTION>TRAINED MODELING>DERIVED FEATURE EXTRACTION
There are a few different classes of libraries that Distill might include in support of this pipeline; they have different consequences for workflows within larger analytic pipelines.
- QUERY: We may want to be able to recreate previous queries used for other analyses, not necessarily "save" queries.
- FILTERING/Q&A: Elimination of data from the query return when that data cannot be eliminated by the query alone, because the pattern to be filtered is fully nested within some query index.
- EX: Filter out specific save events from osquery object access data that do not coincide with click/keyboard activity with KM Logger.
- EX: Random resampling of a KM Logger event time series, e.g., drawing a random sample from every 1-minute interval.
- TRANSFORMATION: The native format of Lucene-like DataStores is a list of records, returned as JSON through queries.
- EX: Query data from one or more data sources (bunch of JSON), impose structure on JSON so that we represent as list object of logs ordered by timestamp (TS)
- PRIMITIVE FEATURE EXTRACTION
- EX: Query for UserALE.js data with type=="click", group by userId, then aggregate across some time interval (e.g., count) for each unique userId (within-user), and return the result.
- EX: Query for UserALE.js data with type=="click", group by userId, then aggregate over logs across unique userIds (e.g., count, mean, median, mode, variance, range) (between-user).
- EX: Using the count data from the first example, bin by "path" and compute the probability of clicking on path X.
- TRAINED MODELING
- EX: Call or recreate PRIMITIVE FEATURES, then feed features to Graph Methods, NN or HMM, etc. (see this paper), return model params to Python Env.
- DERIVED FEATURE EXTRACTION
- EX: Call or recreate the trained model (stage 5, TRAINED MODELING), extract model features as in stage 4 (PRIMITIVE FEATURE EXTRACTION), return to Python Env.
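As a rough illustration of the TRANSFORMATION and PRIMITIVE FEATURE EXTRACTION examples above, here is a minimal sketch using only the Python standard library. It is not Distill's API: the function names are hypothetical, and the record fields (`timestamp`, `userId`, `type`, and a string-valued `path`) are assumptions made for the example, not a confirmed log schema.

```python
from collections import Counter
from itertools import chain


def merge_by_timestamp(*sources, ts_key="timestamp"):
    """Flatten several lists of JSON-like records into one list ordered by timestamp."""
    return sorted(chain.from_iterable(sources), key=lambda log: log[ts_key])


def click_counts_by_user(logs):
    """Count logs with type == "click" for each unique userId."""
    return Counter(log["userId"] for log in logs if log.get("type") == "click")


def click_path_probabilities(logs):
    """Bin click logs by path and normalise the counts into probabilities."""
    counts = Counter(log["path"] for log in logs if log.get("type") == "click")
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()} if total else {}


# Hypothetical query results from two data sources (field names are assumed).
source_a = [
    {"timestamp": 3, "userId": "u1", "type": "click", "path": "button#save"},
    {"timestamp": 1, "userId": "u1", "type": "click", "path": "a#home"},
]
source_b = [
    {"timestamp": 2, "userId": "u2", "type": "click", "path": "button#save"},
    {"timestamp": 4, "userId": "u2", "type": "keydown", "path": "input#name"},
]

logs = merge_by_timestamp(source_a, source_b)
counts = click_counts_by_user(logs)
probs = click_path_probabilities(logs)
```

Keeping each step as a small function over plain lists of dicts means the output of one stage can be fed directly into the next, which mirrors the staged pipeline described above.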
- Build intervals from matching sequences of raw events
- Filter out unwanted events
- Noisy/irrelevant events
- May be conditional on neighboring events
- "dangling" events (e.g. a stop event with no corresponding start)
- Collapse duplicate events into a single event (when is this preferable to creating an interval?)
- Create "sandwiches" (a set of events bookended by, e.g., a related start and stop event)
- Replace some logs/data with other logs/data
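A few of the operations listed above (building intervals from matching events, detecting "dangling" events, and collapsing duplicates) could be sketched as follows. This is only one possible reading of those bullets, not an existing Distill implementation: the function names are hypothetical, and the event fields `timestamp`, `type`, and `target` are assumed for illustration.

```python
def build_intervals(events, start="start", stop="stop"):
    """Pair each start event with the next stop event for the same target.

    Returns (intervals, dangling). dangling holds stop events with no open
    start, plus start events that were never closed. Events whose type is
    neither start nor stop are ignored.
    """
    open_starts = {}
    intervals, dangling = [], []
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        key = ev["target"]
        if ev["type"] == start:
            if key in open_starts:
                # An earlier start for this target was never closed.
                dangling.append(open_starts[key])
            open_starts[key] = ev
        elif ev["type"] == stop:
            begin = open_starts.pop(key, None)
            if begin is None:
                dangling.append(ev)
            else:
                intervals.append(
                    {"target": key, "begin": begin["timestamp"], "end": ev["timestamp"]}
                )
    dangling.extend(open_starts.values())
    return intervals, dangling


def collapse_duplicates(events):
    """Collapse runs of consecutive events with the same type and target, keeping the first."""
    collapsed = []
    for ev in events:
        prev = collapsed[-1] if collapsed else None
        if prev and (prev["type"], prev["target"]) == (ev["type"], ev["target"]):
            continue
        collapsed.append(ev)
    return collapsed


# Hypothetical raw events (field names are assumed).
events = [
    {"timestamp": 1, "type": "stop", "target": "video"},   # dangling stop
    {"timestamp": 2, "type": "start", "target": "video"},
    {"timestamp": 3, "type": "click", "target": "menu"},
    {"timestamp": 4, "type": "click", "target": "menu"},   # duplicate click
    {"timestamp": 5, "type": "stop", "target": "video"},
]

intervals, dangling = build_intervals(events)
deduped = collapse_duplicates(events)
```

Returning the dangling events separately, rather than silently dropping them, leaves the decision to filter or keep them to the caller.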
...